Re: fe-utils - share query cancellation code

2019-11-28 Thread Fabien COELHO


Bonjour Michaël,


The query cancellation added to pgbench is different from the actual
refactoring, and it is a result of the refactoring, so I would rather
split that into two different commits for clarity.  The split is easy
enough, so it's fine not to send two different patches.


Yep, different file set.


Compilation of the patch fails for me on Windows for psql:
unresolved external symbol sigint_interrupt_jmp
Please note that Mr Robot complains as well at build time:
http://commitfest.cputube.org/fabien-coelho.html

The problem here is evidently that sigint_interrupt_jmp is declared in
common.h, but you have moved it to a non-WIN32 section of the code in
psql/common.c.  And note that copy.c and mainloop.c make use of it...


Indeed.


I would not worry much about SIGTERM as you mentioned in the comments;
query cancellations are associated with SIGINT now.  There could be an
argument later for allowing a custom signal to be passed down, though.


Ok.


Attached is an updated patch with a couple of edits I have done,
including the removal of some noise diffs and the previous edits.


Thanks!

I am switching the patch as waiting on author, bumping it to next CF at 
the same time.  Could you please fix the Windows issue?


I do not have a Windows host, so I can only do blind tests. I just moved
the declaration out of !WIN32 scope in the attached v7, which might solve
the symbol resolution error. All other issues pointed out above seem fixed
in the v6 you sent.
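
For readers without the attachment handy, the arrangement v7 uses in
src/bin/psql/common.c is roughly the following (an outline only, blind-coded
like the patch itself; the full diff is below):

    #include "postgres_fe.h"

    /* Definitions exported through common.h and used by copy.c and
     * mainloop.c, so they must stay outside any #ifndef WIN32 block;
     * otherwise MSVC reports "unresolved external symbol". */
    volatile bool sigint_interrupt_enabled = false;
    sigjmp_buf  sigint_interrupt_jmp;

    #ifndef WIN32
    /* Only the SIGINT callback handed to fe_utils/cancel.c is
     * platform-specific. */
    static void
    psql_sigint_callback(void)
    {
        /* if we are waiting for input, longjmp out of it */
        if (sigint_interrupt_enabled)
        {
            sigint_interrupt_enabled = false;
            siglongjmp(sigint_interrupt_jmp, 1);
        }
        /* otherwise the shared code sets CancelRequested and sends the
         * cancel request */
    }
    #endif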


--
Fabien.

diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 4a7ac1f821..5129aea516 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -59,6 +59,7 @@
 
 #include "common/int.h"
 #include "common/logging.h"
+#include "fe_utils/cancel.h"
 #include "fe_utils/conditional.h"
 #include "getopt_long.h"
 #include "libpq-fe.h"
@@ -3894,6 +3895,9 @@ initGenerateDataClientSide(PGconn *con)
 			exit(1);
 		}
 
+		if (CancelRequested)
+			break;
+
 		/*
 		 * If we want to stick with the original logging, print a message each
 		 * 100k inserted rows.
@@ -4109,6 +4113,9 @@ runInitSteps(const char *initialize_steps)
 	if ((con = doConnect()) == NULL)
 		exit(1);
 
+	setup_cancel_handler(NULL);
+	SetCancelConn(con);
+
 	for (step = initialize_steps; *step != '\0'; step++)
 	{
 		instr_time	start;
@@ -4176,6 +4183,7 @@ runInitSteps(const char *initialize_steps)
 	}
 
 	fprintf(stderr, "done in %.2f s (%s).\n", run_time, stats.data);
+	ResetCancelConn();
 	PQfinish(con);
 	termPQExpBuffer(&stats);
 }
diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 0a2597046d..48b6279403 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -29,6 +29,7 @@
 #include "copy.h"
 #include "crosstabview.h"
 #include "describe.h"
+#include "fe_utils/cancel.h"
 #include "fe_utils/print.h"
 #include "fe_utils/string_utils.h"
 #include "help.h"
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 53a1ea2bdb..442ff2fe5d 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -23,6 +23,7 @@
 #include "common/logging.h"
 #include "copy.h"
 #include "crosstabview.h"
+#include "fe_utils/cancel.h"
 #include "fe_utils/mbprint.h"
 #include "fe_utils/string_utils.h"
 #include "portability/instr_time.h"
@@ -241,7 +242,7 @@ NoticeProcessor(void *arg, const char *message)
  *
  * SIGINT is supposed to abort all long-running psql operations, not only
  * database queries.  In most places, this is accomplished by checking
- * cancel_pressed during long-running loops.  However, that won't work when
+ * CancelRequested during long-running loops.  However, that won't work when
  * blocked on user input (in readline() or fgets()).  In those places, we
  * set sigint_interrupt_enabled true while blocked, instructing the signal
  * catcher to longjmp through sigint_interrupt_jmp.  We assume readline and
@@ -252,34 +253,11 @@ volatile bool sigint_interrupt_enabled = false;
 
 sigjmp_buf	sigint_interrupt_jmp;
 
-static PGcancel *volatile cancelConn = NULL;
-
-#ifdef WIN32
-static CRITICAL_SECTION cancelConnLock;
-#endif
-
-/*
- * Write a simple string to stderr --- must be safe in a signal handler.
- * We ignore the write() result since there's not much we could do about it.
- * Certain compilers make that harder than it ought to be.
- */
-#define write_stderr(str) \
-	do { \
-		const char *str_ = (str); \
-		int		rc_; \
-		rc_ = write(fileno(stderr), str_, strlen(str_)); \
-		(void) rc_; \
-	} while (0)
-
-
 #ifndef WIN32
 
 static void
-handle_sigint(SIGNAL_ARGS)
+psql_sigint_callback(void)
 {
-	int			save_errno = errno;
-	char		errbuf[256];
-
 	/* if we are waiting for input, longjmp out of it */
 	if (sigint_interrupt_enabled)
 	{
@@ -288,74 +266,20 @@ handle_sigint(SIGNAL_ARGS)
 	}
 
 	/* else, set cancel flag to stop any long-running loops */
-	cancel_pressed = true;
-
-	/* and send QueryCancel if we are processing a database query */
-	if (cancelConn != NULL)
-	{
-		if (PQcancel(cancelConn, errbuf, 

Update minimum SSL version

2019-11-28 Thread Peter Eisentraut
I propose to change the default of ssl_min_protocol_version to TLSv1.2 
(from TLSv1, which means 1.0).  Older versions would still be supported, 
just not by default.


The reason is that TLS 1.0 and 1.1 are either already discouraged or 
deprecated or will be by the time PostgreSQL 13 comes out.  So this move 
would be in the direction of "secure by default".  Specifically, PCI DSS 
disallows the use of TLS 1.0 and discourages 1.1 [0], and browser 
vendors are set to disable 1.0 and 1.1 in their products sometime soon [1].


Using TLS 1.2 requires OpenSSL 1.0.1, released in 2012.  I find this to 
be satisfied in CentOS 6 and Debian jessie (oldoldstable), for example.


More details also in my recent blog post [2].


[0]: 
https://blog.pcisecuritystandards.org/are-you-ready-for-30-june-2018-sayin-goodbye-to-ssl-early-tls
[1]: 
https://arstechnica.com/gadgets/2018/10/browser-vendors-unite-to-end-support-for-20-year-old-tls-1-0/
[2]: 
https://www.2ndquadrant.com/en/blog/setting-ssl-tls-protocol-versions-with-postgresql-12/
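
For installations that still have to accept TLS 1.0/1.1 clients after this
change, relaxing the default again is a one-liner; a minimal sketch
(assuming superuser access, and that the parameter is reloadable like the
other ssl_* settings):

    -- Sketch: override the proposed TLSv1.2 default for older clients.
    ALTER SYSTEM SET ssl_min_protocol_version = 'TLSv1.1';
    SELECT pg_reload_conf();
    SHOW ssl_min_protocol_version;
    -- Equivalently, set ssl_min_protocol_version = 'TLSv1.1' in postgresql.conf.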


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From e5d71a3048a8eb17d68ef34caac05ce0bc0b156e Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Fri, 29 Nov 2019 08:08:03 +0100
Subject: [PATCH] Update minimum SSL version

Change default of ssl_min_protocol_version to TLSv1.2 (from TLSv1,
which means 1.0).  Older versions are still supported, just not by
default.

TLS 1.0 is widely deprecated, and TLS 1.1 only slightly less so.  All
OpenSSL versions that support TLS 1.1 also support TLS 1.2, so there
would be very little reason to, say, set the default to TLS 1.1
instead on grounds of better compatibility.
---
 doc/src/sgml/config.sgml  | 6 ++
 src/backend/utils/misc/guc.c  | 2 +-
 src/backend/utils/misc/postgresql.conf.sample | 2 +-
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe45cc..901744958c 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1365,10 +1365,8 @@ SSL

 

-The default is TLSv1, mainly to support older
-versions of the OpenSSL library.  You might
-want to set this to a higher value if all software components can
-support the newer protocol versions.
+The default is TLSv1.2, which satisfies industry
+best practices as of this writing.

   
  
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ba4edde71a..adb277d8f2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -4573,7 +4573,7 @@ static struct config_enum ConfigureNamesEnum[] =
GUC_SUPERUSER_ONLY
},
&ssl_min_protocol_version,
-   PG_TLS1_VERSION,
+   PG_TLS1_2_VERSION,
ssl_protocol_versions_info + 1, /* don't allow PG_TLS_ANY */
NULL, NULL, NULL
},
diff --git a/src/backend/utils/misc/postgresql.conf.sample 
b/src/backend/utils/misc/postgresql.conf.sample
index 46a06ffacd..9541879c1f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -105,7 +105,7 @@
 #ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL' # allowed SSL ciphers
 #ssl_prefer_server_ciphers = on
 #ssl_ecdh_curve = 'prime256v1'
-#ssl_min_protocol_version = 'TLSv1'
+#ssl_min_protocol_version = 'TLSv1.2'
 #ssl_max_protocol_version = ''
 #ssl_dh_params_file = ''
 #ssl_passphrase_command = ''
-- 
2.24.0



Re: Implementing Incremental View Maintenance

2019-11-28 Thread Takuma Hoshiai
On Fri, 29 Nov 2019 15:45:13 +0900
Yugo Nagata  wrote:

> The following review on our patch was posted on another thread,
> so I quote here. The tab completion is Hoshiai-san's work, so
> he will handle this issue.
> 
> Regards,
> Yugo Nagata.
> 
> On Thu, 28 Nov 2019 13:00:05 +0900
> nuko yokohama  wrote:
> 
> > Hi.
> > 
> > I'm using the "Incremental Materialized View Maintenance" patch and have
> > reported the following issues.
> > (https://commitfest.postgresql.org/25/2138/)
> > 
> > psql's tab completion suggests "DROP INCREMENTAL MATERIALIZED VIEW", but
> > running it results in a syntax error.
> > (The "DROP MATERIALIZED VIEW" command can drop an incremental materialized
> > view normally.)

Thank you for your review. This psql suggestion is a mistake; the
"INCREMENTAL MATERIALIZED" phrase is only used in the CREATE statement.

I will fix it as follows:

diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index 2051bc3..8c4b211 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -1001,7 +1001,7 @@ static const pgsql_thing_t words_after_create[] = {
{"FOREIGN TABLE", NULL, NULL, NULL},
{"FUNCTION", NULL, NULL, Query_for_list_of_functions},
{"GROUP", Query_for_list_of_roles},
-   {"INCREMENTAL MATERIALIZED VIEW", NULL, NULL, 
_for_list_of_matviews},
+   {"INCREMENTAL MATERIALIZED VIEW", NULL, NULL, 
_for_list_of_matviews, THING_NO_DROP | THING_NO_ALTER},
{"INDEX", NULL, NULL, _for_list_of_indexes},
{"LANGUAGE", Query_for_list_of_languages},
{"LARGE OBJECT", NULL, NULL, NULL, THING_NO_CREATE | THING_NO_DROP},


Best Regards,
Takuma Hoshiai

> > ramendb=# CREATE INCREMENTAL MATERIALIZED VIEW pref_count AS SELECT pref,
> > COUNT(pref) FROM shops GROUP BY pref;
> > SELECT 48
> > ramendb=# \d pref_count
> >   Materialized view "public.pref_count"
> > Column |  Type  | Collation | Nullable | Default
> > ---++---+--+-
> >  pref  | text   |   |  |
> >  count | bigint |   |  |
> >  __ivm_count__ | bigint |   |  |
> > 
> > ramendb=# DROP IN
> > INCREMENTAL MATERIALIZED VIEW  INDEX
> > ramendb=# DROP INCREMENTAL MATERIALIZED VIEW pref_count;
> > 2019-11-27 11:51:03.916 UTC [9759] ERROR:  syntax error at or near
> > "INCREMENTAL" at character 6
> > 2019-11-27 11:51:03.916 UTC [9759] STATEMENT:  DROP INCREMENTAL
> > MATERIALIZED VIEW pref_count;
> > ERROR:  syntax error at or near "INCREMENTAL"
> > LINE 1: DROP INCREMENTAL MATERIALIZED VIEW pref_count;
> >  ^
> > ramendb=# DROP MATERIALIZED VIEW pref_count ;
> > DROP MATERIALIZED VIEW
> > ramendb=#
> > 
> > 
> > Regard.
> 
> 
> -- 
> Yugo Nagata 
> 
> 
> -- 
> Yugo Nagata 
> 


-- 
Takuma Hoshiai 





Increase footprint of %m and reduce strerror()

2019-11-28 Thread Michael Paquier
Hi all,

Since commit d6c55de1, we support %m in the in-core port for printf
and such.  And it seems to me that we could do better for the frontend
code by reducing the dependency on strerror().
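
As a sketch of the kind of call-site conversion this would mean (the helper
and file are made up for illustration; only the message style matters):

    #include "postgres_fe.h"
    #include <errno.h>
    #include "common/logging.h"

    /* hypothetical helper, for illustration only */
    static void
    report_open_failure(const char *path)
    {
        /* current frontend style, spelling out strerror() */
        pg_log_error("could not open file \"%s\": %s", path, strerror(errno));

        /* with %m, supported by the in-core printf since d6c55de1, the
         * message can be written the same way as backend ereport() calls */
        pg_log_error("could not open file \"%s\": %m", path);
    }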

One advantage of doing a switch, or at least reducing the use of
strerror(), would be to ease the work of translators, with more error
messages unified between the frontend and the backend.  A possible
drawback is that this could be a cause of minor conflicts when
back-patching.  That is always easy enough to fix, but it can still be
annoying.

Thoughts?
--
Michael




Re: To Suggest a "DROP INCREMENTAL MATERIALIZED VIEW" in psql, but the syntax error when you run.

2019-11-28 Thread Yugo Nagata
Hello nuko-san, 

Thank you for your review!

As Michael commented, we would like to discuss this on the thread
of the patch, so I quote your review in the following post.

https://www.postgresql.org/message-id/20191129154513.943f4ef05896d7b9d3fed69f%40sraoss.co.jp

Regards,
Yugo Nagata

On Thu, 28 Nov 2019 13:05:33 +0900
Michael Paquier  wrote:

> On Thu, Nov 28, 2019 at 01:00:05PM +0900, nuko yokohama wrote:
> > psql's tab completion suggests "DROP INCREMENTAL MATERIALIZED VIEW", but
> > running it results in a syntax error.
> > (The "DROP MATERIALIZED VIEW" command can drop an incremental materialized
> > view normally.)
> 
> It seems to me that this is just an issue with the tab completion the
> patch is adding.  When reviewing the patch, could you just report such
> issues directly on the thread of the patch?  Thanks!
> --
> Michael


-- 
Yugo Nagata 




Re: Implementing Incremental View Maintenance

2019-11-28 Thread Yugo Nagata
The following review on our patch was posted on another thread,
so I quote here. The tab completion is Hoshiai-san's work, so
he will handle this issue.

Regards,
Yugo Nagata.

On Thu, 28 Nov 2019 13:00:05 +0900
nuko yokohama  wrote:

> Hi.
> 
> I'm using the "Incremental Materialized View Maintenance" patch and have
> reported the following issues.
> (https://commitfest.postgresql.org/25/2138/)
> 
> psql's tab completion suggests "DROP INCREMENTAL MATERIALIZED VIEW", but
> running it results in a syntax error.
> (The "DROP MATERIALIZED VIEW" command can drop an incremental materialized
> view normally.)
> 
> 
> ramendb=# CREATE INCREMENTAL MATERIALIZED VIEW pref_count AS SELECT pref,
> COUNT(pref) FROM shops GROUP BY pref;
> SELECT 48
> ramendb=# \d pref_count
>   Materialized view "public.pref_count"
> Column |  Type  | Collation | Nullable | Default
> ---++---+--+-
>  pref  | text   |   |  |
>  count | bigint |   |  |
>  __ivm_count__ | bigint |   |  |
> 
> ramendb=# DROP IN
> INCREMENTAL MATERIALIZED VIEW  INDEX
> ramendb=# DROP INCREMENTAL MATERIALIZED VIEW pref_count;
> 2019-11-27 11:51:03.916 UTC [9759] ERROR:  syntax error at or near
> "INCREMENTAL" at character 6
> 2019-11-27 11:51:03.916 UTC [9759] STATEMENT:  DROP INCREMENTAL
> MATERIALIZED VIEW pref_count;
> ERROR:  syntax error at or near "INCREMENTAL"
> LINE 1: DROP INCREMENTAL MATERIALIZED VIEW pref_count;
>  ^
> ramendb=# DROP MATERIALIZED VIEW pref_count ;
> DROP MATERIALIZED VIEW
> ramendb=#
> 
> 
> Regard.


-- 
Yugo Nagata 


-- 
Yugo Nagata 




Re: Implementing Incremental View Maintenance

2019-11-28 Thread Amit Langote
Hello,

Thanks a lot for working on this.  It's a great (and big!) feature and
I can see that a lot of work has been put into writing this patch.  I
started looking at the patch (v8), but as it's quite big:

 34 files changed, 5444 insertions(+), 69 deletions(-)

I'm having a bit of trouble reading through it, and I suspect others
may be too.  Perhaps it is easier for you, as authors, to know
everything that's being changed (added, removed, existing code
rewritten), but it certainly isn't for a reviewer, so I think it would
be a good idea to think about dividing this into parts.  I still don't
have my head wrapped around the topic of materialized view
maintenance, but roughly it looks to me like there are really *two*
features being added:

1. Add a new method to refresh an MV incrementally; IIUC, there's
already one method that's used by REFRESH MATERIALIZED VIEW
CONCURRENTLY, correct?

2. Make the refresh automatic (using triggers on the component tables)

Maybe, there are even:

0. Infrastructure additions

As you can tell, having the patch broken down like this would allow us
to focus on the finer aspects of each of the problem being solved and
solution being adopted, for example:

* It would be easier for someone with an expert opinion on how to
implement incremental refresh to only have to look at the patch for
(1).  The new method handles more query types than the current one,
which obviously means more code is needed, which in turn entails the
possibility of bugs despite the best efforts.  It would be better to
get more eyeballs on this portion of the patch, and having it isolated
seems like a good way to attract them.

* Someone well versed in trigger infrastructure can help fine tune the
patch for (2)

and so on.

So, please consider giving some thought to this.

Thanks,
Amit




Re: dropdb --force

2019-11-28 Thread Michael Paquier
On Thu, Nov 28, 2019 at 08:53:56AM +0530, Amit Kapila wrote:
> I have pushed the refactoring patch.  On the second patch, I have a
> few more comments.  I am not completely sure it is a good idea to
> write a new test file 060_dropdb_force.pl when we already have an
> existing file 050_dropdb.pl for dropdb tests, but if we want to do
> that, then let's move the existing test for dropdb '-f' from
> 050_dropdb.pl to the new file, and it might be better to name the new
> file 051_dropdb_force.pl.  I see that in some other cases, like
> vacuumdb and clusterdb, we have separate test files to cover different
> kinds of scenarios, so it should be okay to have a new file for dropdb
> tests.

Amit, as most of the patch set has been committed, would it make sense
to mark this entry as committed in the CF app?
--
Michael




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-11-28 Thread Michael Paquier
On Fri, Nov 29, 2019 at 03:19:49PM +0900, Michael Paquier wrote:
> On Wed, Nov 13, 2019 at 12:53:09PM +0100, Julien Rouhaud wrote:
>> I'd really like to have the queryid function available through SQL,
>> but I think that this specific case wouldn't work very well for
>> pg_stat_statements' approach as it's working with oid.  The query
>> string in pg_stat_activity is the user provided one rather than a
>> fully-qualified version, so in order to get that query's queryid, you
>> need to know the exact search_path in use in that backend, and that's
>> not something available.
> 
> Yeah..  So, we have a patch marked as ready for committer here, and it
> seems to me that we have a couple of issues to discuss first,
> particularly this query ID of 0.  Again, do others have any more
> input to offer?

And while on it, the latest patch does not apply, so a rebase is
needed here. 
--
Michael




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-11-28 Thread Michael Paquier
On Wed, Nov 13, 2019 at 12:53:09PM +0100, Julien Rouhaud wrote:
> I'd really like to have the queryid function available through SQL,
> but I think that this specific case wouldn't work very well for
> pg_stat_statements' approach as it's working with oid.  The query
> string in pg_stat_activity is the user provided one rather than a
> fully-qualified version, so in order to get that query's queryid, you
> need to know the exact search_path in use in that backend, and that's
> not something available.

Yeah..  So, we have a patch marked as ready for committer here, and it
seems to me that we have a couple of issues to discuss first,
particularly this query ID of 0.  Again, do others have any more
input to offer?
--
Michael




Re: Improve search for missing parent downlinks in amcheck

2019-11-28 Thread Michael Paquier
On Mon, Aug 19, 2019 at 01:15:19AM +0300, Alexander Korotkov wrote:
> The revised patch seems to fix all of above.

The latest patch is failing to apply.  Please provide a rebase.
--
Michael




Re: [PATCH] Incremental sort (was: PoC: Partial sort)

2019-11-28 Thread Michael Paquier
On Sun, Sep 29, 2019 at 01:00:49AM +0200, Tomas Vondra wrote:
> OK. I'll try extending the set of synthetic queries in [1] to also do
> soemthing like this and generate similar plans.

Any progress on that?

Please note that the latest patch does not apply anymore, so a rebase
is needed.  I am switching the patch as waiting on author for now.
--
Michael




Re: pgbench - add \aset to store results of a combined query

2019-11-28 Thread Michael Paquier
On Thu, Aug 15, 2019 at 06:30:13PM +, Ibrar Ahmed wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   tested, passed
> Documentation:not tested
> 
> The patch passed my review, I have not reviewed the documentation changes.
> 
> The new status of this patch is: Ready for Committer

@@ -524,6 +526,7 @@ typedef struct Command
int argc;
char   *argv[MAX_ARGS];
char   *varprefix;
+   boolaset;

It seems to me that there is no point in having the variable aset in
Command, because this structure already includes a MetaCommand, so the
information is duplicated.  And I would suggest changing
readCommandResponse() to take a MetaCommand as argument.  Perhaps I am
missing something?
--
Michael




A rather hackish POC for alternative implementation of WITH TIES

2019-11-28 Thread Andrew Gierth
This patch is a rather hacky implementation of the basic idea for
implementing FETCH ... WITH TIES, and potentially also PERCENT, by using
a window function expression to compute a stopping point.

Large chunks of this (the parser/ruleutils changes, docs, tests) are
taken from Surafel Temesgen's patch. The difference is that the executor
change in my version is minimal: Limit allows a boolean column in the
input to signal the point at which to stop. The planner inserts a
WindowAgg node to compute the necessary condition using the rank()
function.
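
For intuition, the stopping condition the WindowAgg provides is essentially
the rank() rewrite sketched below (semantics only, not the actual plan shape;
films/rating are invented names):

    -- These two queries return the same rows:
    SELECT title, rating
    FROM films
    ORDER BY rating DESC
    FETCH FIRST 10 ROWS WITH TIES;

    SELECT title, rating
    FROM (SELECT title, rating,
                 rank() OVER (ORDER BY rating DESC) AS rnk
          FROM films) AS ranked
    WHERE rnk <= 10
    ORDER BY rating DESC;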

The way this is done in the planner isn't (IMO) the best and should
probably be improved; in particular it currently misses some possible
optimizations (most notably constant-folding of the offset+limit
subexpression). I also haven't tested it properly to see whether I broke
anything, though it does pass regression.

-- 
Andrew (irc:RhodiumToad)

diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 691e402803..2b11c38087 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -44,7 +44,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionexpression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
 [ LIMIT { count | ALL } ]
 [ OFFSET start [ ROW | ROWS ] ]
-[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
+[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES } ]
 [ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ]
 
 where from_item can be one of:
@@ -1438,7 +1438,7 @@ OFFSET start
 which PostgreSQL also supports.  It is:
 
 OFFSET start { ROW | ROWS }
-FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY
+FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES }
 
 In this syntax, the start
 or count value is required by
@@ -1448,10 +1448,13 @@ FETCH { FIRST | NEXT } [ count ] {
 ambiguity.
 If count is
 omitted in a FETCH clause, it defaults to 1.
-ROW
-and ROWS as well as FIRST
-and NEXT are noise words that don't influence
-the effects of these clauses.
+The WITH TIES option is used to return any additional
+rows that tie for the last place in the result set according to
+ORDER BY clause; ORDER BY
+is mandatory in this case.
+ROW and ROWS as well as
+FIRST and NEXT are noise
+words that don't influence the effects of these clauses.
 According to the standard, the OFFSET clause must come
 before the FETCH clause if both are present; but
 PostgreSQL is laxer and allows either order.
diff --git a/src/backend/catalog/sql_features.txt b/src/backend/catalog/sql_features.txt
index ab3e381cff..cd720eb19a 100644
--- a/src/backend/catalog/sql_features.txt
+++ b/src/backend/catalog/sql_features.txt
@@ -345,7 +345,7 @@ F863	Nested  in 			YES
 F864	Top-level  in views			YES	
 F865	 in 			YES	
 F866	FETCH FIRST clause: PERCENT option			NO	
-F867	FETCH FIRST clause: WITH TIES option			NO	
+F867	FETCH FIRST clause: WITH TIES option			YES	
 R010	Row pattern recognition: FROM clause			NO	
 R020	Row pattern recognition: WINDOW clause			NO	
 R030	Row pattern recognition: full aggregate support			NO	
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index 5e4d02ce4a..a2a8864b28 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -108,8 +108,20 @@ ExecLimit(PlanState *pstate)
 			}
 
 			/*
-			 * Okay, we have the first tuple of the window.
+			 * Okay, we have the first tuple of the window. However, if we're
+			 * doing LIMIT WHEN, our stop condition might be true already; so
+			 * check that.
 			 */
+			if (node->whenColno > 0)
+			{
+bool isnull = false;
+Datum res = slot_getattr(slot, node->whenColno, &isnull);
+if (!isnull && DatumGetBool(res))
+{
+	node->lstate = LIMIT_EMPTY;
+	return NULL;
+}
+			}
 			node->lstate = LIMIT_INWINDOW;
 			break;
 
@@ -152,6 +164,23 @@ ExecLimit(PlanState *pstate)
 	node->lstate = LIMIT_SUBPLANEOF;
 	return NULL;
 }
+/*
+ * Check whether our termination condition is reached, and
+ * pretend the subplan ran out if so. The subplan remains
+ * pointing at the terminating row, we must be careful not
+ * to advance it further as that will mess up backward scan.
+ */
+if (node->whenColno > 0)
+{
+	bool isnull = false;
+	Datum res = slot_getattr(slot, node->whenColno, &isnull);
+	if (!isnull && DatumGetBool(res))
+	{
+		node->lstate = LIMIT_SUBPLANEOF;
+		return NULL;
+	}
+}
+
 node->subSlot = slot;
 node->position++;
 			}
@@ -372,6 +401,7 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
 		   (PlanState *) limitstate);
 	limitstate->limitCount = ExecInitExpr((Expr *) node->limitCount,
 		  (PlanState *) limitstate);
+	limitstate->whenColno = node->whenColno;
 
 	/*
 	 * Initialize result type.
diff --git 

RE: [patch]socket_timeout in interfaces/libpq

2019-11-28 Thread nagaura.ryo...@fujitsu.com
Hi, Michael-san.

Sorry, I have missed your e-mail...

> From: Michael Paquier 
> On Wed, Jun 26, 2019 at 11:56:28AM +, nagaura.ryo...@fujitsu.com wrote:
> > It seems that you did not think so at that time.
> > # Please refer to [1]
> >
> > I don't think all the reviewers are completely negative.
> 
> I recall having a negative impression on the patch when first looking at it, 
> and still
> have the same impression when looking at the last version.  Just with a quick
> look, assuming that you can bypass all cleanup operations normally taken by
> pqDropConnection() through a hijacking of pqWait() is not fine as this routine
> explicitly assumes to *never* have a timeout for its wait.
I couldn't understand what you meant.
Are you saying that we shouldn't change pqWait()'s behavior?
Or should I modify my patch to use pqDropConnection()?

Best regards,
-
Ryohei Nagaura






Re: FETCH FIRST clause WITH TIES option

2019-11-28 Thread Andrew Gierth
> "Alvaro" == Alvaro Herrera  writes:

 Alvaro> First, I noticed that there's a significant unanswered issue
 Alvaro> from Andrew Gierth about this using a completely different
 Alvaro> mechanism, namely an implicit window function. Robert was
 Alvaro> concerned about the performance of Andrew's proposed approach,
 Alvaro> but I don't see any reply from Surafel on the whole issue.
 
 Alvaro> Andrew: Would you rather not have this feature in this form at
 Alvaro> all?

I have done a proof-of-concept hack for my alternative approach, which I
will post in another thread so as not to confuse the cfbot.

The main advantage, as I see it, of a window function approach is that
it can also work for PERCENT with only a change to the generated
expression; the executor part of the code can handle any stopping
condition. It can also be generalized (though the POC does not do this)
to allow an SQL-level extension, something like "LIMIT WHEN condition"
to indicate that it should stop just before the first row for which the
condition is true. Whether this is a desirable feature or not is another
question.
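
To make the hypothetical generalization concrete (this syntax is not
implemented by the POC or anything else; table and column are invented):

    -- Hypothetical "LIMIT WHEN": stop just before the first row for which
    -- the condition becomes true.
    SELECT id, priority
    FROM task_queue
    ORDER BY priority DESC
    LIMIT WHEN priority < 5;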

-- 
Andrew.




Re: pglz performance

2019-11-28 Thread Andrey Borodin



> On 29 Nov 2019, at 3:43, Tomas Vondra wrote:
> 
> OK, pushed, with some minor cosmetic tweaks on the comments (essentially
> using the formatting trick pointed out by Alvaro), and removing one
> unnecessary change in pglz_maximum_compressed_size.

Cool, thanks!

> pglz_maximum_compressed_size
It was an artifact of pgindent.

Best regards, Andrey Borodin.



Re: Implementing Incremental View Maintenance

2019-11-28 Thread Tatsuo Ishii
>> But if we introduce IMV, IVM would be used in many fewer places in the
>> doc and source code, so less confusion would arise, I guess.
> 
> Makes sense. However, we came to think that "Incrementally Maintainable
> Materialized Views" (IMMs) would be good. So, how about using this for now?
> If better suggestions are raised later, let's discuss it again.

Sounds good to me.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: Implementing Incremental View Maintenance

2019-11-28 Thread Yugo Nagata
On Fri, 29 Nov 2019 07:19:44 +0900 (JST)
Tatsuo Ishii  wrote:

> >> As promised, I have created the doc (CREATE MATERIALIZED VIEW manual)
> >> patch.
> > 
> > -  because the triggers will be invoked.
> > +  because the triggers will be invoked. We call this form of 
> > materialized
> > +  view as "Incremantal materialized View Maintenance" (IVM).
> > 
> > This part seems incorrect to me.  Incremental (materialized) View
> > Maintenance (IVM) is a way to maintain materialized views and is not a
> > term referring to the views being maintained.
> > 
> > However, it would be useful if there were a term referring to views which
> > can be maintained using IVM. Off the top of my head, we could call these
> > Incrementally Maintainable Views (= IMVs), but this might be confusable
> > with IVM, so I'll think about that a little more.
> 
> But if we introduce IMV, IVM would be used in many fewer places in the
> doc and source code, so less confusion would arise, I guess.

Makes sense. However, we came to think that "Incrementally Maintainable
Materialized Views" (IMMs) would be good. So, how about using this for now?
If better suggestions are raised later, let's discuss it again.

Regards,
Yugo Nagata

> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp


-- 
Yugo Nagata 




Re: [HACKERS] advanced partition matching algorithm for partition-wise join

2019-11-28 Thread Etsuro Fujita
On Fri, Nov 22, 2019 at 10:08 PM Etsuro Fujita  wrote:
> Done.  I modified compare_range_partitions() as well to match its the
> variable names with others.  Attached is a new version of the patch.
>
> I reviewed the rest of the partbounds.c changes.  Here are my review comments:
>
> * I don't think this analysis is correct:
>
> +/*
> + * merge_null_partitions
> + *
> + * Merge NULL partitions, i.e. a partition that can hold NULL values for a list
> + * partitioned table, if any. Find the index of merged partition to which the
> + * NULL values would belong in the join result. If one joining relation has a
> + * NULL partition but not the other, try matching it with the default partition
> + * from the other relation since the default partition may have rows with NULL
> + * partition key. We can eliminate a NULL partition when it appears only on the
> + * inner side of the join and the outer side doesn't have a default partition.
> + *
> + * When the equality operator used for join is strict, two NULL values will not
> + * be considered as equal, and thus a NULL partition can be eliminated for an
> + * inner join. But we don't check the strictness operator here.
> + */
>
> First of all, I think we can assume here that the equality operator is
> *strict*, because 1) have_partkey_equi_join(), which is executed
> before calling merge_null_partitions(), requires that the
> corresponding clause is mergejoinable, and 2) currently, we assume
> that mergejoinable operators are strict (see MJEvalOuterValues() and
> MJEvalInnerValues()).  So I don't think it's needed to try merging a
> NULL partition on one side with the default partition on the other
> side as above.  (merge_null_partitions() tries merging NULL partitions
> as well, but in some cases I don't think we need to do that, either.)
> So I rewrote merge_null_partitions() completely.  Another change is
> that I eliminated the NULL partition of the joinrel in more cases.
> Attached is a patch (v28-0002-modify-merge_null_partitions.patch) for
> that created on top of the main patch.  I might be missing something,
> though.
>
> Other changes in that patch:
>
> * I fixed memory leaks in partition_list_bounds_merge() and
> partition_range_bounds_merge().
>
> * I noticed this in merge_default_partitions():
>
> +   Assert(outer_has_default && inner_has_default);
> +
> +   *default_index = map_and_merge_partitions(outer_map,
> + inner_map,
> + outer_default,
> + inner_default,
> + next_index);
> +   if (*default_index == -1)
> +   return false;
>
> I think the merging should be done successfully, because of 1) this in
> process_outer_partition():
>
> +   if (inner_has_default)
> +   {
> +   Assert(inner_default >= 0);
> +
> +   /*
> +* If the outer side has the default partition as well, we need to
> +* merge the default partitions (see merge_default_partitions()); give
> +* up on it.
> +*/
> +   if (outer_has_default)
> +   return false;
>
> and 2) this in process_inner_partition():
>
> +   if (outer_has_default)
> +   {
> +   Assert(outer_default >= 0);
> +
> +   /*
> +* If the inner side has the default partition as well, we need to
> +* merge the default partitions (see merge_default_partitions()); give
> +* up on it.
> +*/
> +   if (inner_has_default)
> +   return false;
>
> So I removed the above if test (ie, *default_index == -1).  I also
> modified that function a bit further, including comments.
>
> * I simplified process_outer_partition() and process_inner_partition()
> a bit, changing the APIs to match that of map_and_merge_partitions().
> Also, I think this in these functions is a bit ugly;
>
> +   /* Don't add this index to the list of merged indexes. */
> +   *merged_index = -1;
>
> so I removed it, and modified the condition on whether or not we add
> merged bounds to the lists in partition_list_bounds_merge() and
> partition_range_bounds_merge(), instead.

Moved to the next CF.

Best regards,
Etsuro Fujita




Re: A problem about partitionwise join

2019-11-28 Thread Etsuro Fujita
On Fri, Nov 29, 2019 at 12:08 PM Richard Guo  wrote:
> On Fri, Nov 29, 2019 at 11:03 AM Michael Paquier  wrote:
>> On Tue, Nov 26, 2019 at 08:35:33PM +0900, Etsuro Fujita wrote:
>> > I've just started reviewing this patch.  One comment I have for now
>> > is: this is categorized into Bug Fixes, but we have a workaround at
>> > least to the regression test case in the patch (ie, just reorder join
>> > clauses), so this seems to me more like an improvement than a bug fix.
>>
>> Hmm.  Agreed.  Changed the category and moved to next CF.

> thanks Michael for the change.

+1

Best regards,
Etsuro Fujita




Re: libpq sslpassword parameter and callback function

2019-11-28 Thread Greg Nancarrow
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, failed

Hi Andrew,

I've reviewed your "libpq sslpassword parameter and callback function" patch 
(0001-libpq-sslpassword-der-support.patch), and only found a few minor things 
(otherwise it looked good to me):

1) There's a few trailing white-space warnings on patch application (from where 
you modified to skip 2 of the tests):
git apply 0001-libpq-sslpassword-der-support.patch
0001-libpq-sslpassword-der-support.patch:649: trailing whitespace.
# so they don't hang. For now they are not performed. 
0001-libpq-sslpassword-der-support.patch:659: trailing whitespace.

warning: 2 lines add whitespace errors.


2) src/interfaces/libpq/libpq-fe.h
The following portion of the comment should be removed.

+ * 2ndQPostgres extension. If you need to be compatible with unpatched libpq
+ * you must dlsym() these.

3) Documentation for the "PQsslpassword" function should be added to the libpq 
"33.2 Connection Status Functions" section.


I made the following notes about how/what I reviewed and tested:

- Applied patch and built Postgres (--with-openssl --enable-tap-tests), checked 
build output
- Checked patch code modifications (format, logic, memory usage, efficiency, 
corner cases etc.)
- Built documentation and checked updated portions (format, grammar, details, 
completeness etc.)
- Checked test updates 
- Ran updated contrib/dblink tests - confirmed all passed
- Ran updated SSL (TAP) tests - confirmed all passed
- Ran "make installcheck-world", as per review requirements
- Wrote small libpq-based app to test:
  - new APIs (PQsslpassword, PQsetSSLKeyPassHook, PQgetSSLKeyPassHook, 
PQdefaultSSLKeyPassHook)
  - passphrase-protected key with/without patch
  - patch with/without new key password callback
  - patch and certificate with/without pass phrase protection on key
  - default callback, callback delegation
  - PEM/DER keys


Regards,
Greg

Re: A problem about partitionwise join

2019-11-28 Thread Richard Guo
On Fri, Nov 29, 2019 at 11:03 AM Michael Paquier 
wrote:

> On Tue, Nov 26, 2019 at 08:35:33PM +0900, Etsuro Fujita wrote:
> > I've just started reviewing this patch.  One comment I have for now
> > is: this is categorized into Bug Fixes, but we have a workaround at
> > least to the regression test case in the patch (ie, just reorder join
> > clauses), so this seems to me more like an improvement than a bug fix.
>
> Hmm.  Agreed.  Changed the category and moved to next CF.
>

Thanks Etsuro for the comment and thanks Michael for the change.

Thanks
Richard


Re: A problem about partitionwise join

2019-11-28 Thread Michael Paquier
On Tue, Nov 26, 2019 at 08:35:33PM +0900, Etsuro Fujita wrote:
> I've just started reviewing this patch.  One comment I have for now
> is: this is categorized into Bug Fixes, but we have a workaround at
> least to the regression test case in the patch (ie, just reorder join
> clauses), so this seems to me more like an improvement than a bug fix.

Hmm.  Agreed.  Changed the category and moved to next CF.
--
Michael




Re: PATCH: standby crashed when replay block which truncated in standby but failed to truncate in master node

2019-11-28 Thread Michael Paquier
On Thu, Oct 03, 2019 at 05:54:40PM +0900, Fujii Masao wrote:
> On Thu, Oct 3, 2019 at 1:57 PM Michael Paquier  wrote:
> >
> > On Thu, Oct 03, 2019 at 01:49:34PM +0900, Fujii Masao wrote:
> > > But this can cause subsequent recovery to always fail with an invalid-pages
> > > error and the server not to start up. This is bad. So, to alleviate the
> > > situation, I'm thinking it would be worth adding something like an
> > > ignore_invalid_pages developer parameter. When this parameter is set to
> > > true, the startup process always ignores invalid-pages errors. Thoughts?
> >
> > That could be helpful.
> 
> So attached patch adds new developer GUC "ignore_invalid_pages".
> Setting ignore_invalid_pages to true causes the system
> to ignore the failure (but still report a warning), and continue recovery.
> 
> I will add this to next CommitFest.

No actual objections against this patch from me as a dev option.

+Detection of WAL records having references to invalid pages during
+recovery causes PostgreSQL to report
+an error, aborting the recovery. Setting
Well, that's not really an error.  This triggers a PANIC, aka crashes
the server.  And in this case the actual problem is that you may not
be able to move on with recovery when restarting the server again,
unless luck is on your side, because you would continuously face
it.

+recovery. This behavior may cause crashes, data loss,
+ propagate or hide corruption, or other serious problems.
Nit: indentation on the second line here.

+However, it may allow you to get past the error, finish the recovery,
+and cause the server to start up.
For consistency here I would suggest the second part of the sentence
to be "TO finish recovery, and TO cause the server to start up".

+The default setting is off, and it can only be set at server start.
Nit^2: Missing a  markup for "off"?
--
Michael




Re: Write visibility map during CLUSTER/VACUUM FULL

2019-11-28 Thread Michael Paquier
On Fri, Sep 20, 2019 at 01:05:06PM -0700, Andres Freund wrote:
> I don't think we should have yet another copy of logic determining
> visibility. It has repeatedly proven hard to get right in the past, and
> it'll not get better by having yet another copy.  Especially not because
> we've basically already done a HTSV (via
> heapam_relation_copy_for_cluster), and we're now redoing a poor man's
> version of it.

These comments have been unanswered for more than 2 months, so I am
marking this entry as returned with feedback.  Please feel free to
change it if you think that's not appropriate.
--
Michael




Re: [PATCH] ltree, lquery, and ltxtquery binary protocol support

2019-11-28 Thread Michael Paquier
On Mon, Nov 11, 2019 at 03:44:54PM +0100, Nino Floris wrote:
> Alright, as usual life got in the way. I've attached a new version of
> the patch with pgindent changes.
> 
> > What do you think about writing patch for ALTER TYPE?
> I'd rather not :$
> 
> > 1) Write migration script, which directly updates pg_type.
> This sounds like the best option right now, if we don't want people to
> do manual migrations to ltree 1.2.
> How would I best go at this?
> 
> > Travis has a small complaint:
> Should be fixed!

The latest patch provided fails to apply for me on HEAD.  Please
provide a rebase.  For now I am bumping this patch to next CF with
"waiting on author" as status.
--
Michael




Re: Proposal: Add more compile-time asserts to expose inconsistencies.

2019-11-28 Thread Michael Paquier
On Wed, Nov 27, 2019 at 12:23:33PM +, Smith, Peter wrote:
> * That is beyond the scope for what I wanted my patch to achieve; my
> * use-cases are C code only.

Well, FWIW, I do have some extensions using __cplusplus and I am
pretty sure that I am not the only one.  The thing is that with your
patch folks would not get any compilation failures *now*, because all
the declarations of StaticAssertDecl() are added within the .c files,
but once a patch that includes a declaration in a header (something
very likely to happen) is merged, we would suddenly break the
compilation of those modules.  And that's not nice.  That's also a
point raised by Andres upthread.

> I am happy if somebody else with more ability to test C++ properly
> wants to add the __cplusplus variant of the new macro.

In short, attached is an updated version of your patch which attempts
to solve that.  I have tested this with some cplusplus stuff, and GCC
for both versions (static_assert is available in GCC >= 6, but a
manual change of c.h does the trick).

I have edited the patch a bit while on it: your assertions did not use
project-style grammar, the use of parentheses was inconsistent (see
relpath.c for example), and pgindent complained a bit.

Also, I am bumping the patch to the next CF for now.  Do others have
thoughts to share about this version?  I would actually be fine with
committing it, even if the message generated by the fallback versions
is a bit crappy, with a complaint about a negative array size, but
that's not new to this patch as we use the same trick with
StaticAssertStmt().
--
Michael
diff --git a/src/include/c.h b/src/include/c.h
index 00e41ac546..91d6d50e76 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -845,11 +845,16 @@ extern void ExceptionalCondition(const char *conditionName,
 	do { _Static_assert(condition, errmessage); } while(0)
 #define StaticAssertExpr(condition, errmessage) \
 	((void) ({ StaticAssertStmt(condition, errmessage); true; }))
+/* StaticAssertDecl is suitable for use at file scope. */
+#define StaticAssertDecl(condition, errmessage) \
+	_Static_assert(condition, errmessage)
 #else			/* !HAVE__STATIC_ASSERT */
 #define StaticAssertStmt(condition, errmessage) \
 	((void) sizeof(struct { int static_assert_failure : (condition) ? 1 : -1; }))
 #define StaticAssertExpr(condition, errmessage) \
 	StaticAssertStmt(condition, errmessage)
+#define StaticAssertDecl(condition, errmessage) \
+	extern void static_assert_func(int static_assert_failure[(condition) ? 1 : -1])
 #endif			/* HAVE__STATIC_ASSERT */
 #else			/* C++ */
 #if defined(__cpp_static_assert) && __cpp_static_assert >= 200410
@@ -857,12 +862,16 @@ extern void ExceptionalCondition(const char *conditionName,
 	static_assert(condition, errmessage)
 #define StaticAssertExpr(condition, errmessage) \
 	({ static_assert(condition, errmessage); })
-#else
+#define StaticAssertDecl(condition, errmessage) \
+	static_assert(condition, errmessage)
+#else			/* !__cpp_static_assert */
 #define StaticAssertStmt(condition, errmessage) \
 	do { struct static_assert_struct { int static_assert_failure : (condition) ? 1 : -1; }; } while(0)
 #define StaticAssertExpr(condition, errmessage) \
 	((void) ({ StaticAssertStmt(condition, errmessage); }))
-#endif
+#define StaticAssertDecl(condition, errmessage) \
+	extern void static_assert_func(int static_assert_failure[(condition) ? 1 : -1])
+#endif			/* __cpp_static_assert */
 #endif			/* C++ */
 
 
diff --git a/src/backend/utils/adt/lockfuncs.c b/src/backend/utils/adt/lockfuncs.c
index bc62c6eef7..e7e992eb85 100644
--- a/src/backend/utils/adt/lockfuncs.c
+++ b/src/backend/utils/adt/lockfuncs.c
@@ -36,6 +36,9 @@ const char *const LockTagTypeNames[] = {
 	"advisory"
 };
 
+StaticAssertDecl(lengthof(LockTagTypeNames) == (LOCKTAG_ADVISORY + 1),
+ "LockTagTypeNames array inconsistency");
+
 /* This must match enum PredicateLockTargetType (predicate_internals.h) */
 static const char *const PredicateLockTagTypeNames[] = {
 	"relation",
@@ -43,6 +46,9 @@ static const char *const PredicateLockTagTypeNames[] = {
 	"tuple"
 };
 
+StaticAssertDecl(lengthof(PredicateLockTagTypeNames) == (PREDLOCKTAG_TUPLE + 1),
+ "PredicateLockTagTypeNames array inconsistency");
+
 /* Working status for pg_lock_status */
 typedef struct
 {
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ba4edde71a..3dfffac88f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -236,6 +236,9 @@ static const struct config_enum_entry bytea_output_options[] = {
 	{NULL, 0, false}
 };
 
+StaticAssertDecl(lengthof(bytea_output_options) == (BYTEA_OUTPUT_HEX + 2),
+ "bytea_output_options array inconsistency");
+
 /*
  * We have different sets for client and server message level options because
  * they sort slightly different (see "log" level), and because "fatal"/"panic"
@@ -281,6 +284,9 @@ static const struct config_enum_entry intervalstyle_options[] 

Re: progress report for ANALYZE

2019-11-28 Thread Tatsuro Yamada

Hi Michael,

On 2019/11/27 13:25, Michael Paquier wrote:

On Wed, Nov 27, 2019 at 12:45:41PM +0900, Tatsuro Yamada wrote:

Fixed.


Patch was waiting on input from author, so I have switched it back to
"Needs review", and moved it to next CF while on it as you are working
on it.


Thanks for your CF manager work.
I will do my best to get it committed at the next CF, because
the progress reporting feature is useful for DBAs, as you know. :)


Thanks,
Tatsuro Yamada





Re: Implementing Incremental View Maintenance

2019-11-28 Thread Yugo Nagata
On Fri, 29 Nov 2019 09:50:49 +0900 (JST)
Tatsuo Ishii  wrote:

> > Hi,
> > 
> > Attached is the latest patch (v8) to add support for Incremental View
> > Maintenance (IVM).  This patch adds OUTER join support in addition
> > to the patch (v7) submitted last week in the following post.
> 
> There's a compiler warning:
> 
> matview.c: In function ‘getRteListCell’:
> matview.c:2685:9: warning: ‘rte_lc’ may be used uninitialized in this 
> function [-Wmaybe-uninitialized]
>   return rte_lc;
>  ^~

Thanks! I'll fix this.

> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp


-- 
Yugo Nagata 




Re: progress report for ANALYZE

2019-11-28 Thread Tatsuro Yamada

Hi Alvaro!


Hmmm... I understand your opinion but I'd like to get more opinions too. :)
Do you prefer these column names? See below:


Here's my take on it:

  
   pid
   datid
   datname
   relid
   phase
   sample_blks_total
   sample_blks_scanned
   ext_stats_total
   ext_stats_computed
   child_tables_total
   child_tables_done
   current_child_table_relid

It seems to make sense to keep using the "child table" terminology in
that last column; but since the column carries an OID then as Robert
said it should have "relid" in the name.  For the other two "child
tables" columns, not using "relid" is appropriate because what they have
is not relids.

I think there should be an obvious correspondence in columns that are
closely related, which there isn't if you use "sample" in one and "heap"
in the other, so I'd go for "sample" in both.



Thanks for the comment.
Oops, you are right; I overlooked that they are not relids.
I agree with you and Amit, so I'll send a revised patch in the next
mail. :-)

The next patch will include:
 - The new column definitions for the view (as above)
 - The renamed phase: s/acquiring inh sample rows/acquiring inherited 
sample rows/
 - Documentation updates


Thanks,
Tatsuro Yamada









Re: Implementing Incremental View Maintenance

2019-11-28 Thread Tatsuo Ishii
> Hi,
> 
> Attached is the latest patch (v8) to add support for Incremental View
> Maintenance (IVM).  This patch adds OUTER join support in addition
> to the patch (v7) submitted last week in the following post.

There's a compiler warning:

matview.c: In function ‘getRteListCell’:
matview.c:2685:9: warning: ‘rte_lc’ may be used uninitialized in this function 
[-Wmaybe-uninitialized]
  return rte_lc;
 ^~

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


Re: checkpointer: PANIC: could not fsync file: No such file or directory

2019-11-28 Thread Thomas Munro
On Fri, Nov 29, 2019 at 11:14 AM Justin Pryzby  wrote:
> On Fri, Nov 29, 2019 at 10:50:36AM +1300, Thomas Munro wrote:
> > On Fri, Nov 29, 2019 at 3:13 AM Thomas Munro  wrote:
> > > On Wed, Nov 27, 2019 at 7:53 PM Justin Pryzby  
> > > wrote:
> > > >  2019-11-26 23:41:50.009-05 | could not fsync file 
> > > > "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or 
> > > > directory
> > >
> > > I managed to reproduce this (see below).  I think I know what the
> >
> > Here is a patch that fixes the problem by sending all the
> > SYNC_FORGET_REQUEST messages up front.
>
> I managed to reproduce it too, but my recipe is crummy enough that I'm not 
> even
> going to send it..
>
> I confirmed that patch also seems to work for my worse recipe.

Thanks.

One thing I'm wondering about is what happens if you encounter EPERM,
EIO etc while probing the existing files, so that we give up early and
don't deal with some higher numbered files.

(1)  We'll still have unlinked the lower numbered files, and we'll
leave the higher numbered files where they are, which might confuse
matters if the relfilenode number is later recycled so the zombie
segments appear to belong to the new relfilenode.  That is
longstanding PostgreSQL behaviour not changed by commit 3eb77eba or
this patch IIUC (despite moving a few things around), and I guess it's
unlikely to bite you considering all the coincidences required, but if
there's a transient EPERM (think Windows virus checker opening files
without the shared read flag), it's not inconceivable.  One solution
to that would be not to queue the unlink request for segment 0 if
anything goes wrong while unlinking the other segments (in other
words: if you can't unlink all the segments, deliberately leak segment
0 and thus the *whole relfilenode*, not just the problem file(s)).

(2)  Even with the fix I just proposed, if you give up early due to
EPERM, EIO etc, there might still be sync requests queued for high
numbered segments, so you could reach the PANIC case.  It's likely
that, once you've hit such an error, the checkpointer is going to be
panicking anyway when it starts seeing similar errors, but still, it'd
be possible to do better here (especially if the error was transient
and short lived).  If we don't accept that behaviour, we could switch
(back) to a single cancel message that can whack every request
relating to the relation (essentially as it was in 11, though it
requires a bit more work for the new design), or stop using
_mdfd_getseg() for this so that you can remove segments independently
without worrying about sync requests for other segments (it was
actually like that in an earlier version of the patch for commit
3eb77eba, but someone complained that it didn't benefit from fd
caching).




Re: pglz performance

2019-11-28 Thread Tomas Vondra

On Wed, Nov 27, 2019 at 11:27:49PM +0500, Andrey Borodin wrote:




On 27 Nov 2019, at 20:28, Tomas Vondra wrote:



6) I'm pretty sure the comment in the 'while (off < len)' branch will be
badly mangled by pgindent.


I think I can just write it without line limit and then run pgindent.
Will try to do it this evening. Also, I will try to write more about
memmove.

Well, yes, I could not make pgindent format some parts of that comment, gave up 
and left only simple text.




7) The last change moving "copy" to the next line seems unnecessary.


Oh, looks like I had been rewording this comment, and eventually came
to the same text..Yes, this change is absolutely unnecessary.

Thanks!



Good. I'll wait for an updated version of the patch and then try to get
it pushed by the end of the CF.




OK, pushed, with some minor cosmetic tweaks on the comments (essentially
using the formatting trick pointed out by Alvaro), and removing one
unnecessary change in pglz_maximum_compressed_size.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 





Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-28 Thread Noah Misch
On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> I measured the performance with the latest patch set.
> 
> > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> >minute when done via syncs.
> > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > 3. Wait 10s.
> > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

If you have the raw data requested in (5), please share them here so folks
have the option to reproduce your graphs and calculations.

> I did the following benchmarking.
> 
> 1. Initialize bench database
> 
>   $ pgbench -i -s 20
> 
> 2. Start server with wal_level = replica (all other variables are not
> changed) then run the attached ./bench.sh

The bench.sh attachment was missing; please attach it.  Please give the output
of this command:

  select name, setting from pg_settings where setting <> boot_val;

> 3. Restart server with wal_level = replica then run the bench.sh
> twice.

I assume this is wal_level=minimal, not wal_level=replica.




Re: Implementing Incremental View Maintenance

2019-11-28 Thread Tatsuo Ishii
>> As promised, I have created the doc (CREATE MATERIALIZED VIEW manual)
>> patch.
> 
> -  because the triggers will be invoked.
> +  because the triggers will be invoked. We call this form of materialized
> +  view as "Incremantal materialized View Maintenance" (IVM).
> 
> This part seems incorrect to me.  Incremental (materialized) View
> Maintenance (IVM) is a way to maintain materialized views and is not a
> term referring to the views being maintained.
> 
> However, it would be useful if there were a term referring to views which
> can be maintained using IVM. Off the top of my head, we could call these
> Incrementally Maintainable Views (= IMVs), but this might be confusable
> with IVM, so I'll think about that a little more.

But if we introduce IMV, IVM would be used in many fewer places in the
doc and source code, so less confusion would arise, I guess.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp




Re: checkpointer: PANIC: could not fsync file: No such file or directory

2019-11-28 Thread Justin Pryzby
On Fri, Nov 29, 2019 at 10:50:36AM +1300, Thomas Munro wrote:
> On Fri, Nov 29, 2019 at 3:13 AM Thomas Munro  wrote:
> > On Wed, Nov 27, 2019 at 7:53 PM Justin Pryzby  wrote:
> > >  2019-11-26 23:41:50.009-05 | could not fsync file 
> > > "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or 
> > > directory
> >
> > I managed to reproduce this (see below).  I think I know what the
> 
> Here is a patch that fixes the problem by sending all the
> SYNC_FORGET_REQUEST messages up front.

I managed to reproduce it too, but my recipe is crummy enough that I'm not even
going to send it.

I confirmed that the patch also seems to work for my worse recipe.

Thanks,
Justin




Re: checkpointer: PANIC: could not fsync file: No such file or directory

2019-11-28 Thread Thomas Munro
On Fri, Nov 29, 2019 at 3:13 AM Thomas Munro  wrote:
> On Wed, Nov 27, 2019 at 7:53 PM Justin Pryzby  wrote:
> >  2019-11-26 23:41:50.009-05 | could not fsync file 
> > "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or 
> > directory
>
> I managed to reproduce this (see below).  I think I know what the
> problem is: mdsyncfiletag() uses _mdfd_getseg() to open the segment to
> be fsync'd, but that function opens all segments up to the one you
> requested, so if a lower-numbered segment has already been unlinked,
> it can fail.  Usually that's unlikely because it's hard to get the
> request queue to fill up and therefore hard to split up the cancel
> requests for all the segments for a relation, but your workload and
> the repro below do it.  In fact, the path it shows in the error
> message is not even the problem file, that's the one it really wanted,
> but first it was trying to open lower-numbered ones.  I can see a
> couple of solutions to the problem (unlink in reverse order, send all
> the forget messages first before unlinking anything, or go back to
> using a single atomic "forget everything for this rel" message instead
> of per-segment messages), but I'll have to think more about that
> tomorrow.

Here is a patch that fixes the problem by sending all the
SYNC_FORGET_REQUEST messages up front.
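
For illustration, a minimal standalone sketch of that ordering, using
hypothetical helper names (forget_sync_request / unlink_segment) rather than
the real md.c routines -- the point is only that every forget message is
delivered before any file disappears:

#include <stdio.h>

static void
forget_sync_request(int segno)
{
	/* stand-in for queueing a SYNC_FORGET_REQUEST to the checkpointer */
	printf("forget fsync requests for segment %d\n", segno);
}

static void
unlink_segment(int segno)
{
	/* stand-in for unlinking the segment file */
	printf("unlink segment %d\n", segno);
}

static void
unlink_all_segments(int nsegments)
{
	int			segno;

	/* Pass 1: cancel every pending fsync request first. */
	for (segno = 0; segno < nsegments; segno++)
		forget_sync_request(segno);

	/*
	 * Pass 2: only now remove the files, so no later fsync has to open a
	 * segment whose lower-numbered siblings are already gone.
	 */
	for (segno = 0; segno < nsegments; segno++)
		unlink_segment(segno);
}

int
main(void)
{
	unlink_all_segments(3);
	return 0;
}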


0001-Fix-ordering-bug-in-mdunlinkfork.patch
Description: Binary data


Re: Do XID sequences need to be contiguous?

2019-11-28 Thread Tom Lane
Mark Dilger  writes:
> While working on the problem of XID wraparound within the LISTEN/NOTIFY
> system, I tried to increment XIDs by more than one per transaction. 
> This leads to a number of test failures, many of which look like:

IIRC, the XID-creation logic is designed to initialize the next clog
page whenever it allocates an exact-multiple-of-BLCKSZ*4 transaction
number.  Skipping over such numbers would create trouble.
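
As a small standalone illustration (assuming BLCKSZ = 8192, i.e. 32768 xids
per clog page; this is not the actual transam code): a new clog page is
zeroed only when the first xid of that page is allocated, and an increment
of 2 starting from the odd FirstNormalTransactionId never allocates such an
xid.

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192
#define CLOG_XACTS_PER_PAGE (BLCKSZ * 4)	/* 2 status bits per xact */

static bool
hits_page_boundary(uint32_t start, uint32_t step, uint32_t count)
{
	uint32_t	xid = start;

	for (uint32_t i = 0; i < count; i++, xid += step)
	{
		/* roughly the condition under which a new clog page gets zeroed */
		if (xid % CLOG_XACTS_PER_PAGE == 0)
			return true;
	}
	return false;
}

int
main(void)
{
	/* FirstNormalTransactionId is 3 */
	printf("step 1: %s\n",
		   hits_page_boundary(3, 1, 100000) ? "hits a page boundary" : "never does");
	printf("step 2: %s\n",
		   hits_page_boundary(3, 2, 100000) ? "hits a page boundary" : "never does");
	return 0;
}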

> First, I'd like a good method of burning through transaction ids in
> tests designed to check for problems in XID wrap-around.

Don't "burn through them".  Stop the cluster and use pg_resetwal to
set the XID counter wherever you want it.  (You might need to set it
just before a page or segment boundary; I'm not sure if pg_resetwal
has any logic of its own to initialize a new CLOG page/file when you
move the counter this way.  Perhaps it's worth improving that.)

> Second, I'd like to add Asserts where appropriate regarding this
> assumption.

I'm not excited about that, and it's *certainly* not a problem that
justifies additional configure infrastructure.

regards, tom lane




Re: pg_upgrade fails to preserve old versions of the predefined collations

2019-11-28 Thread Alexander Lakhin
28.11.2019 23:25, Thomas Munro пишет:
> On Fri, Nov 29, 2019 at 9:08 AM Alexander Lakhin  wrote:
>> So for now it seems dangerous to use predefined collations as their old
>> versions are not preserved by pg_upgrade and the user doesn't know which
>> indexes are affected by the actual ICU collation changes.
> Yeah, we noticed this while working on a proposal for new
> per-database-object version dependency tracking, and Peter E has
> written a patch to address it:
>
> https://commitfest.postgresql.org/25/2328/
Thank you! This patch is working for me. After pg_upgrade with the
applied patch I'm getting:
postgres=# SELECT oid, collname, collnamespace, collprovider,
collversion FROM pg_collation WHERE collname like 'ru%';
  oid  |  collname   | collnamespace | collprovider | collversion
-------+-------------+---------------+--------------+-------------
 17561 | ru-BY-x-icu |            11 | i            | 58.0.0.50
 17562 | ru-KG-x-icu |            11 | i            | 58.0.0.50
 17563 | ru-KZ-x-icu |            11 | i            | 58.0.0.50
 17564 | ru-MD-x-icu |            11 | i            | 58.0.0.50
 17565 | ru-RU-x-icu |            11 | i            | 58.0.0.50
 17566 | ru-UA-x-icu |            11 | i            | 58.0.0.50
 17567 | ru-x-icu    |            11 | i            | 58.0.0.50
 17568 | ru_RU       |            11 | c            | 
 17569 | ru_RU.utf8  |            11 | c            | 
 17696 | russian     |          2200 | i            | 58.0.0.50
(10 rows)

Best regards,
Alexander




Re: pg_upgrade fails to preserve old versions of the predefined collations

2019-11-28 Thread Thomas Munro
On Fri, Nov 29, 2019 at 9:08 AM Alexander Lakhin  wrote:
> So for now it seems dangerous to use predefined collations as their old
> versions are not preserved by pg_upgrade and the user doesn't know which
> indexes are affected by the actual ICU collation changes.

Yeah, we noticed this while working on a proposal for new
per-database-object version dependency tracking, and Peter E has
written a patch to address it:

https://commitfest.postgresql.org/25/2328/




pg_upgrade fails to preserve old versions of the predefined collations

2019-11-28 Thread Alexander Lakhin
Hello hackers,

When dealing with an OS upgrade, an anomaly related to collations was
found.
Suppose, we have Debian 8 with postgresql 12 installed.
Then we create a custom collation:
CREATE COLLATION russian (provider=icu, locale='ru_RU');
and
SELECT oid, collname, collnamespace, collprovider, collversion FROM
pg_collation WHERE collname like 'ru%';
returns
 12847 | ru-RU-x-icu |            11 | i            | 58.0.0.50
...
 16384 | russian     |          2200 | i            | 58.0.0.50
Then let's create two tables with text columns and indexes and fill them
with some data:
CREATE TABLE test_icu_ru (f1 varchar COLLATE "ru-RU-x-icu", i int);
INSERT INTO test_icu_ru SELECT chr(x), x FROM generate_series(1, 2000)
as y(x); CREATE INDEX ON test_icu_ru (f1);

CREATE TABLE test_icu_russian (f1 varchar COLLATE "russian", i int);
INSERT INTO test_icu_russian SELECT chr(x), x FROM generate_series(1,
2000) as y(x); CREATE INDEX ON test_icu_russian (f1);

Perform two test queries:
postgres=# select * from test_icu_ru where f1=chr(821);
 f1 |  i  
----+-----
  ̵ | 821
(1 row)

postgres=# select * from test_icu_russian where f1=chr(821);
 f1 |  i  
----+-----
  ̵ | 821
(1 row)

postgres=# EXPLAIN select * from test_icu_ru where f1=chr(821);
                                     QUERY PLAN
--------------------------------------------------------------------------------------
 Index Scan using test_icu_ru_f1_idx on test_icu_ru  (cost=0.28..8.29 rows=1 width=6)
   Index Cond: ((f1)::text = '̵'::text)
(2 rows)

postgres=# EXPLAIN select * from test_icu_russian where f1=chr(821);
                                        QUERY PLAN
------------------------------------------------------------------------------------------------
 Index Scan using test_icu_russian_f1_idx on test_icu_russian  (cost=0.28..8.29 rows=1 width=6)
   Index Cond: ((f1)::text = '̵'::text)
(2 rows)
(The indexes are indeed used by the above queries.)

Now suppose that the OS is upgraded to Debian 9 (or the pgdata just
moved to Debian 9 with the postgresql 12).
The same queries return:
postgres=# select * from test_icu_ru where f1=chr(821);
WARNING:  collation "ru-RU-x-icu" has version mismatch
DETAIL:  The collation in the database was created using version
58.0.0.50, but the operating system provides version 153.64.29.
HINT:  Rebuild all objects affected by this collation and run ALTER
COLLATION pg_catalog."ru-RU-x-icu" REFRESH VERSION, or build PostgreSQL
with the right library version.
 f1 | i 
----+---
(0 rows)

postgres=# select * from test_icu_russian where f1=chr(821);
WARNING:  collation "russian" has version mismatch
DETAIL:  The collation in the database was created using version
58.0.0.50, but the operating system provides version 153.64.29.
HINT:  Rebuild all objects affected by this collation and run ALTER
COLLATION public.russian REFRESH VERSION, or build PostgreSQL with the
right library version.
 f1 | i 
----+---
(0 rows)

We get no data due to the real collation/sort order change but the
warning says what to do.
The query presented at
https://www.postgresql.org/docs/12/sql-altercollation.html returns:
        Collation        |                Object
-------------------------+--------------------------------------
 collation "ru-RU-x-icu" | column f1 of table test_icu_ru
 collation "ru-RU-x-icu" | index test_icu_ru_f1_idx
 collation russian       | column f1 of table test_icu_russian
 collation russian       | index test_icu_russian_f1_idx
So the documented behavior is observed.

But after pg_upgrade:
pg_createcluster 12 new
/usr/lib/postgresql/12/bin/pg_upgrade -b /usr/lib/postgresql/12/bin -B
/usr/lib/postgresql/12/bin -d /etc/postgresql/12/main -D
/etc/postgresql/12/new
In the new cluster the same queries return:
postgres=# select * from test_icu_russian where f1=chr(821);
WARNING:  collation "russian" has version mismatch
DETAIL:  The collation in the database was created using version
58.0.0.50, but the operating system provides version 153.64.29.
HINT:  Rebuild all objects affected by this collation and run ALTER
COLLATION public.russian REFRESH VERSION, or build PostgreSQL with the
right library version.
 f1 | i 
----+---
(0 rows)

postgres=# select * from test_icu_ru where f1=chr(821);
 f1 | i 
----+---
(0 rows)
(There is no warning for the predefined collation now.)

The query presented at
https://www.postgresql.org/docs/12/sql-altercollation.html returns:
     Collation     |                Object
-------------------+--------------------------------------
 collation russian | column f1 of table test_icu_russian
 collation russian | index test_icu_russian_f1_idx
(2 rows)

and
SELECT oid, collname, collnamespace, collprovider, collversion FROM
pg_collation WHERE collname like 'ru%';
returns
  oid  |  collname   | collnamespace | collprovider | collversion

Do XID sequences need to be contiguous?

2019-11-28 Thread Mark Dilger

Hackers,

While working on the problem of XID wraparound within the LISTEN/NOTIFY
system, I tried to increment XIDs by more than one per transaction. 
This leads to a number of test failures, many of which look like:


+ERROR:  could not access status of transaction 7485
+DETAIL:  Could not read from file "pg_subtrans/" at offset 24576: 
read too few bytes.


I might not have read the right documentation, but I do not see anything
in src/backend/access/transam/README nor elsewhere documenting a design
decision or assumption that transaction IDs must be assigned contiguously.
I suppose this is such a fundamental assumption that it is completely
implicit and nobody thought to document it, but I'd like to check for two
reasons:

First, I'd like a good method of burning through transaction ids in
tests designed to check for problems in XID wrap-around.

Second, I'd like to add Asserts where appropriate regarding this
assumption.  It seems strange to me that I should have gotten as far
as a failing read() without having tripped an Assert somewhere along the
way.

To duplicate the errors I hit, you can either apply this simple change:


diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 33fd052156..360b7335bb 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -83,7 +83,7 @@ FullTransactionIdFromEpochAndXid(uint32 epoch, 
TransactionId xid)

 static inline void
 FullTransactionIdAdvance(FullTransactionId *dest)
 {
-   dest->value++;
+   dest->value += 2;
while (XidFromFullTransactionId(*dest) < FirstNormalTransactionId)
dest->value++;
 }


or apply the much larger WIP patch, attached, and then be sure to
provide the --enable-xidcheck flag to configure before building.

--
Mark Dilger
diff --git a/configure b/configure
index 1d88983b34..909907127a 100755
--- a/configure
+++ b/configure
@@ -841,6 +841,7 @@ with_CC
 with_llvm
 enable_depend
 enable_cassert
+enable_xidcheck
 enable_thread_safety
 with_icu
 with_tcl
@@ -1521,6 +1522,7 @@ Optional Features:
   --enable-tap-tests  enable TAP tests (requires Perl and IPC::Run)
   --enable-depend turn on automatic dependency tracking
   --enable-cassert        enable assertion checks (for debugging)
+  --enable-xidcheck   enable transactionid checks (for debugging)
   --disable-thread-safety disable thread-safety in client libraries
   --disable-largefile omit support for large files
 
@@ -7197,6 +7199,36 @@ fi
 
 
 
+#
+# Enable transactionid checks
+#
+
+
+# Check whether --enable-xidcheck was given.
+if test "${enable_xidcheck+set}" = set; then :
+  enableval=$enable_xidcheck;
+  case $enableval in
+yes)
+
+$as_echo "#define USE_XID_CHECKING 1" >>confdefs.h
+
+  ;;
+no)
+  :
+  ;;
+*)
+  as_fn_error $? "no argument expected for --enable-xidcheck option" 
"$LINENO" 5
+  ;;
+  esac
+
+else
+  enable_xidcheck=no
+
+fi
+
+
+
+
 #
 # Include directories
 #
diff --git a/configure.in b/configure.in
index a2cb20b5e3..5aac6c24f1 100644
--- a/configure.in
+++ b/configure.in
@@ -680,6 +680,14 @@ PGAC_ARG_BOOL(enable, cassert, no, [enable assertion 
checks (for debugging)],
  [Define to 1 to build with assertion checks. 
(--enable-cassert)])])
 
 
+#
+# Enable transactionid checks
+#
+PGAC_ARG_BOOL(enable, xidcheck, no, [enable transactionid checks (for 
debugging)],
+  [AC_DEFINE([USE_XID_CHECKING], 1,
+ [Define to 1 to build with transactionid checks. 
(--enable-xidcheck)])])
+
+
 #
 # Include directories
 #
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe45cc..a3dbe27640 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9246,6 +9246,26 @@ dynamic_library_path = 
'C:\tools\postgresql;H:\my_project\lib;$libdir'
   
  
 
+ 
+  debug_xidcheck (boolean)
+  
+   debug_xidcheck configuration 
parameter
+  
+  
+  
+   
+Reports whether PostgreSQL has been built
+with transaction id checking enabled. That is the case if the
+macro USE_XID_CHECKING is defined
+when PostgreSQL is built (accomplished
+e.g. by the configure option
+--enable-xidcheck). By
+default PostgreSQL is built without
+transaction id checking.
+   
+  
+ 
+
  
   integer_datetimes (boolean)
   
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 9c10a897f1..eedfd021a0 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1524,6 +1524,19 @@ build-postgresql:

   
 
+  
+   --enable-xidcheck
+   
+
+ Enables transaction ID checks in the server, 
which test
+ the 64-bit transaction ID management.  This may prove valuable for
+ code development purposes, but enabling this feature will consume a 
small
+ amount of 

Re: missing estimation for coalesce function

2019-11-28 Thread Pavel Stehule
On Thu, 28 Nov 2019 at 15:51, Laurenz Albe wrote:

> On Wed, 2019-11-27 at 08:47 +0100, Pavel Stehule wrote:
> > The most significant issue was missing correct estimation for coalesce
> function.
> > He had to rewrite coalesce(var, X) = X to "var IS NULL or var = X".
> > Then the result was very satisfactory.
> >
> > postgres=# explain analyze select * from xxx where coalesce(a, 0) = 0;
> >                                       QUERY PLAN
> > ------------------------------------------------------------------------------------------------
> >  Seq Scan on xxx  (cost=0.00..194.00 rows=60 width=4) (actual time=0.041..4.276 rows=11000 loops=1)
>
> I think that this is asking for a planner support function:
> https://www.postgresql.org/docs/current/xfunc-optimization.html


It probably needs more work - currently this support exists only for
set-returning functions and for boolean functions.

On the other hand, COALESCE is not a function - it's an expression node.
Originally I thought that a selectivity function would be enough; now I
think it is not. It is similar to the IS DISTINCT FROM operator.

So a plan could look like:

1. introduce an isnull_or_eq operator
2. this operator could be used for index scans too
3. implement a selectivity function for this operator (and maybe for coalesce)
4. translate COALESCE(var, const) = const --> var isnull_or_eq const

I am not sure whether step 4 is possible, or whether more complex
transformations like COALESCE(var1, var2) = var2 are possible.

But from what I have read, MSSQL and Oracle do perform this optimization.
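
As a rough sketch of the arithmetic only (not tied to any actual planner
code): since COALESCE(var, c) = c is equivalent to "var IS NULL OR var = c"
and the two disjuncts are mutually exclusive, the selectivity is simply the
null fraction plus the ordinary equality selectivity, both of which are
already available from the statistics.

/*
 * Sketch: selectivity of COALESCE(col, c) = c, given the column's null
 * fraction and the selectivity of "col = c" (both from pg_statistic).
 * The two conditions cannot hold at the same time, so they just add up.
 */
static double
coalesce_eq_const_selectivity(double nullfrac, double eq_sel)
{
	double		sel = nullfrac + eq_sel;

	if (sel < 0.0)
		sel = 0.0;
	if (sel > 1.0)
		sel = 1.0;
	return sel;
}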

Regards

Pavel


>
> Yours,
> Laurenz Albe
>
>


Re: Memory-Bounded Hash Aggregation

2019-11-28 Thread Tomas Vondra

On Wed, Nov 27, 2019 at 02:58:04PM -0800, Jeff Davis wrote:

On Wed, 2019-08-28 at 12:52 -0700, Taylor Vesely wrote:

Right now the patch always initializes 32 spill partitions. Have you
given
any thought into how to intelligently pick an optimal number of
partitions yet?


Attached a new patch that addresses this.

1. Divide hash table memory used by the number of groups in the hash
table to get the average memory used per group.
2. Multiply by the number of groups spilled -- which I pessimistically
estimate as the number of tuples spilled -- to get the total amount of
memory that we'd like to have to process all spilled tuples at once.


Isn't the "number of tuples = number of groups" estimate likely to be
way too pessimistic? IIUC the consequence is that it pushes us to pick
more partitions than necessary, correct?

Could we instead track how many tuples we actually consumed for the
in-memory groups, and then use this information to improve the estimate
of number of groups? I mean, if we know we've consumed 1000 tuples which
created 100 groups, then we know there's ~1:10 ratio.


3. Divide the desired amount of memory by work_mem to get the number of
partitions we'd like to have such that each partition can be processed
in work_mem without spilling.
4. Apply a few sanity checks, fudge factors, and limits.

Using this runtime information should be substantially better than
using estimates and projections.

Additionally, I removed some branches from the common path. I think I
still have more work to do there.

I also rebased of course, and fixed a few other things.
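
For reference, here is a condensed sketch of the arithmetic in steps 1-4
above (the names and the min/max limits are made up, not the patch's actual
code); plugging in the tuples-per-group ratio observed for the in-memory
groups, as suggested above, would just change how ngroups_spilled is
estimated:

#include <math.h>
#include <stddef.h>

#define MIN_SPILL_PARTITIONS 4		/* assumed floor */
#define MAX_SPILL_PARTITIONS 256	/* assumed cap */

static int
choose_num_spill_partitions(double mem_per_group,	/* step 1 */
							double ngroups_spilled,	/* estimate of spilled groups */
							size_t work_mem_bytes)
{
	double		mem_needed = mem_per_group * ngroups_spilled;		/* step 2 */
	double		npartitions = ceil(mem_needed / (double) work_mem_bytes);	/* step 3 */

	/* step 4: sanity limits */
	if (npartitions < MIN_SPILL_PARTITIONS)
		npartitions = MIN_SPILL_PARTITIONS;
	if (npartitions > MAX_SPILL_PARTITIONS)
		npartitions = MAX_SPILL_PARTITIONS;
	return (int) npartitions;
}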



A couple of comments based on eye-balling the patch:


1) Shouldn't the hashagg_mem_overflow use the other GUC naming, i.e.
maybe it should be enable_hashagg_mem_overflow or something similar?


2) I'm a bit puzzled by this code in ExecInterpExpr (there are multiple
such blocks, this is just an example)

aggstate = op->d.agg_init_trans.aggstate;
pergroup_allaggs = aggstate->all_pergroups[op->d.agg_init_trans.setoff];
pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];

/* If transValue has not yet been initialized, do so now. */
if (pergroup_allaggs != NULL && pergroup->noTransValue)
{ ... }

How could the (pergroup_allaggs != NULL) protect against anything? Let's
assume the pointer really is NULL. Surely we'll get a segfault on the
preceding line which does dereference it

pergroup = &pergroup_allaggs[op->d.agg_init_trans.transno];

Or am I missing anything?


3) execGrouping.c

A couple of functions would deserve a comment, explaining what it does.

 - LookupTupleHashEntryHash
 - prepare_hash_slot
 - calculate_hash

And it's not clear to me why we should remove part of the comment before
TupleHashTableHash.


4) I'm not sure I agree with this reasoning that HASH_PARTITION_FACTOR
making the hash tables smaller is desirable - it may be, but if that was
generally the case we'd just use small hash tables all the time. It's a
bit annoying to give the user the capability to set work_mem and then kinda
override that.

 * ... Another benefit of having more, smaller partitions is that small
 * hash tables may perform better than large ones due to memory caching
 * effects.


5) Not sure what "directly" means in this context?

 * partitions at the time we need to spill, and because this algorithm
 * shouldn't depend too directly on the internal memory needs of a
 * BufFile.

#define HASH_PARTITION_MEM (HASH_MIN_PARTITIONS * BLCKSZ)

Does that mean we don't want to link to PGAlignedBlock, or what?


6) I think we should have some protection against underflows in this
piece of code:

- this would probably deserve some protection against underflow if 
HASH_PARTITION_MEM gets too big

    if (hashagg_mem_overflow)
        aggstate->hash_mem_limit = SIZE_MAX;
    else
        aggstate->hash_mem_limit = (work_mem * 1024L) - HASH_PARTITION_MEM;

At the moment it's safe because work_mem is 64kB at least, and
HASH_PARTITION_MEM is 32kB (4 partitions, 8kB each). But if we happen to
bump HASH_MIN_PARTITIONS up, this can underflow.


7) Shouldn't lookup_hash_entry briefly explain why/how it handles the
memory limit?


8) The comment before lookup_hash_entries says:

 ...
 * Return false if hash table has exceeded its memory limit.
 ..

But that's clearly bogus, because that's a void function.


9) Shouldn't the hash_finish_initial_spills calls in agg_retrieve_direct
have a comment, similar to the surrounding code? Might be an overkill,
not sure.


10) The comment for agg_refill_hash_table says

 * Should only be called after all in memory hash table entries have been
 * consumed.

Can we enforce that with an assert, somehow?


11) The hash_spill_npartitions naming seems a bit confusing, because it
seems to imply it's about the "spill" while in practice it just chooses the
number of spill partitions. Maybe hash_choose_num_spill_partitions would
be better?


12) It's not clear to me why we need HASH_MAX_PARTITIONS? What's the
reasoning behind the 

RE: [Incident report]Backend process crashed when executing 2pc transaction

2019-11-28 Thread Ranier Vilela
Marco wrote:
> One interesting thing is the prepared transaction name generated by
> the coordinator, which follows the form: citus_ id>___ number in session>. The server-wide transaction number is a 64-bit
> counter that is kept in shared memory and starts at 1. That means that
> over 4 billion (4207001212) transactions happened on the coordinator
> since the server started, which quite possibly resulted in 4 billion
> prepared transactions on this particular server. I'm wondering if some
> counter is overflowing.

Amit wrote:
>Interesting.  This does kind of gets us closer to figuring out what
>might have gone wrong, but hard to tell without the core dump at hand.

If something is corrupting memory only rarely, it would be interesting to
consider all the possibilities.
The MemSet call on line 785 (twophase.c) raises an alert: the size of the
"_vstart" buffer is not a multiple of the size of type long.
Maybe it's filling more than it should.

Ranier Vilela




Re: allow_system_table_mods stuff

2019-11-28 Thread Tom Lane
Peter Eisentraut  writes:
> On 2019-11-27 09:26, Michael Paquier wrote:
>> Peter, are you planning to look at that again?  Note: the patch has no
>> reviewers registered.

> Here is an updated patch series.

> After re-reading the discussion again, I have kept the existing name of 
> the option.  I have also moved the tests to the "unsafe_tests" suite, 
> which seems like a better place.  And I have split the patch into three.

Personally I'd have gone with the renaming to allow_system_table_ddl,
but it's not a huge point.  Updating the code to agree with that
naming would make the patch much more invasive, so maybe it's not
worth it.

> Other than those cosmetic changes, I think everything here has been 
> discussed and agreed to, so unless anyone expresses any concern or a 
> wish to do more review, I think this is ready to commit.

I read through the patch set and have just one quibble: in the
proposed new docs,

+Allows modification of the structure of system tables as well as
+certain other risky actions on system tables.  This is otherwise not
+allowed even for superusers.  This is used by
+initdb.  Inconsiderate use of this setting can
+cause irretrievable data loss or seriously corrupt the database
+system.  Only superusers can change this setting.

"Inconsiderate" doesn't seem like le mot juste.  Maybe "Ill-advised"?

(I'm also wondering whether the sentence about initdb is worth keeping.)

I marked the CF entry RFC.

regards, tom lane




Re: Yet another vectorized engine

2019-11-28 Thread Konstantin Knizhnik



On 28.11.2019 12:23, Hubert Zhang wrote:

Hi hackers,

We just want to introduce another POC for vectorized execution engine 
https://github.com/zhangh43/vectorize_engine and want to get some 
feedback on the idea.


The basic idea is to extend the TupleTableSlot and introduce 
VectorTupleTableSlot, which is an array of datums organized by 
projected columns.  The array of datums per column is contiguous in 
memory. This makes expression evaluation cache-friendly and allows SIMD 
to be utilized. We have currently refactored the SeqScanNode and AggNode to 
support VectorTupleTableSlot.
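
To make the layout concrete, here is a hedged sketch of what such a
column-major batch could look like; the field names and batch size are
illustrative guesses, not the extension's actual definitions:

#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t Datum;			/* stand-in for PostgreSQL's Datum */

#define VECTOR_BATCH_SIZE 1024		/* tuples processed per batch */

typedef struct VectorColumn
{
	Datum		values[VECTOR_BATCH_SIZE];	/* contiguous per-column datums */
	bool		isnull[VECTOR_BATCH_SIZE];
} VectorColumn;

typedef struct VectorTupleBatch
{
	int			ncols;			/* number of projected columns */
	int			nrows;			/* rows currently held in the batch */
	VectorColumn *columns;		/* one VectorColumn per projected column */
} VectorTupleBatch;

Keeping each column's values contiguous is what makes tight per-column loops
cache-friendly and lets the compiler (or explicit intrinsics) apply SIMD.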


Below are features in our design.
1. Pure extension. We don't hack any code into postgres kernel.

2. CustomScan node. We use CustomScan framework to replace original 
executor node such as SeqScan, Agg etc. Based on CustomScan, we could 
extend the CustomScanState, BeginCustomScan(), ExecCustomScan(), 
EndCustomScan() interface to implement vectorize executor logic.


3. Post planner hook. After plan is generated, we use plan_tree_walker 
to traverse the plan tree and check whether it could be vectorized. If 
yes, the non-vectorized nodes (SeqScan, Agg etc.) are replaced with 
vectorized nodes (in form of CustomScan node) and use vectorized 
executor. If no, we will revert to the original plan and use 
non-vectorized executor. In future this part could be enhanced, for 
example, instead of revert to original plan when some nodes cannot be 
vectorized, we could add Batch/UnBatch node to generate a plan with 
both vectorized as well as non-vectorized node.


4. Support implement new vectorized executor node gradually. We 
currently only vectorized SeqScan and Agg but other queries which 
including Join could also be run when vectorize extension is enabled.


5. Inherit original executor code. Instead of rewriting the whole 
executor, we choose a more smooth method to modify current Postgres 
executor node and make it vectorized. We copy the current executor 
node's c file into our extension, and add vectorize logic based on it. 
When Postgres enhance its executor, we could relatively easily merge 
them back. We want to know whether this is a good way to write 
vectorized executor extension?


6. Pluggable storage. Postgres has supported pluggable storage now. 
TupleTableSlot is refactored as abstract struct TupleTableSlotOps. 
VectorTupleTableSlot could be implemented under this framework when we 
upgrade the extension to latest PG.


We ran the TPC-H (10GB) benchmark, and the result for Q1 is 50 sec (PG) vs.
28 sec (vectorized PG). The performance gain can be improved further:
1. Heap tuple deforming occupies a lot of CPU. We will try zedstore in the
future, since a vectorized executor is more compatible with a column store.


2. Vectorized agg is not fully vectorized and we still have many 
optimizations to do, for example batch-computing the hash values and 
optimizing the hash table for vectorized HashAgg.


3. The conversion cost from Datum to the actual type and vice versa is also 
high, for example DatumGetFloat4 & Float4GetDatum. One optimization may be 
to store the actual type in the VectorTupleTableSlot directly, instead of 
an array of datums.


Related works:
1. VOPS is a vectorized execution extension. Link: 
https://github.com/postgrespro/vops.
It doesn't use custom scan framework and use UDF to do the vectorized 
operation e.g. it changes the SQL syntax to do aggregation.


2. Citus vectorized executor is another POC. Link: 
https://github.com/citusdata/postgres_vectorization_test.
It uses ExecutorRun_hook to run the vectorized executor and uses 
cstore fdw to support column storage.


Note that the vectorized executor engine is based on PG9.6 now, but it 
could be ported to master / zedstore with some effort.  We would 
appreciate some feedback before moving further in that direction.


Thanks,
Hubert Zhang, Gang Xiong, Ning Yu, Asim Praveen


Hi,

I think that a vectorized executor is an absolutely necessary thing for 
Postgres, especially taking into account that we now have a columnar store 
prototype (zedstore).
To take full advantage of a columnar store we definitely need a vectorized 
executor.


But I do not completely understand why you are proposing to implement it 
as an extension.
Yes, custom nodes make it possible to provide vector execution without 
affecting the Postgres core.
But for efficient integration of zedstore and a vectorized executor we 
need to extend the table AM (VectorTupleTableSlot and corresponding scan 
functions).
Certainly it is easier to contribute a vectorized executor as an extension, 
but sooner or later I think it should be added to the Postgres core.


As far as I understand, you already have some prototype implementation 
(otherwise how did you get the performance results)?
If so, are you planning to publish it, or do you think the executor should 
be developed from scratch?


Some of my concerns, based on the VOPS experience:

1. The vertical (columnar) model is preferable for some kinds of queries, but 
there are some classes of queries for which it is less efficient.

Re: remove useless returns

2019-11-28 Thread Tom Lane
Alvaro Herrera  writes:
> Here's a silly insomnia-inspired little patch that removes useless
> "return;" lines from some routines.

+1, I always thought that was poor style.

regards, tom lane




Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.

2019-11-28 Thread Andy Fan
On Thu, Nov 28, 2019 at 7:19 PM Jinbao Chen  wrote:

> Hi Andy,
>
> I just test the query on 12.1. But pg use big_table as inner.
>
> demo=# explain (costs off) select * from t_small, t_big where a = b;
>          QUERY PLAN
> -----------------------------------
>  Hash Join
>    Hash Cond: (t_small.a = t_big.b)
>    ->  Seq Scan on t_small
>    ->  Hash
>          ->  Seq Scan on t_big
>
> Do you insert data and  set max_parallel_workers_per_gather to 0 like
> above?
>

Sorry for the noise.. you are right.  I loaded the data and ran the query
immediately without running ANALYZE.

now it is using big table as inner table.


> create table t_small(a int);
> create table t_big(b int);
> insert into t_small select i%100 from generate_series(0, 3000)i;
> insert into t_big select i%10 from generate_series(1, 1)i ;
> analyze t_small;
> analyze t_big;
> set max_parallel_workers_per_gather = 0;
>
> On Thu, Nov 28, 2019 at 5:46 PM Andy Fan  wrote:
>
>>
>>
>> On Fri, Nov 22, 2019 at 6:51 PM Jinbao Chen  wrote:
>>
>>> Hi hackers,
>>>
>>> I have made a patch to fix the problem.
>>>
>>> Added the selection rate of the inner table non-empty bucket
>>>
>>> The planner will use the big table as the inner table in a hash join
>>> if the small table has fewer unique values. But this plan is
>>> much slower than using the small table as the inner table.
>>>
>>> In general, the cost of creating a hash table is higher
>>> than the cost of querying a hash table. So we tend to use
>>> small tables as inner tables. But if the average chain
>>> length of the buckets is large, the situation is just the
>>> opposite.
>>>
>>> If virtualbuckets is much larger than innerndistinct, and
>>> outerndistinct is much larger than innerndistinct, then most
>>> tuples of the outer table will match an empty bucket. So when
>>> we calculate the cost of traversing the bucket, we need to
>>> ignore the tuples matching empty buckets.
>>>
>>> So we add the selection rate of the inner table's non-empty
>>> buckets. The formula is:
>>> (1 - ((outerndistinct - innerndistinct)/outerndistinct) *
>>>  ((virtualbuckets - innerndistinct)/virtualbuckets))
>>>
>>>
>>> On Tue, Nov 19, 2019 at 5:56 PM Jinbao Chen  wrote:
>>>
 I think we have the same understanding of this issue.

 Sometimes using smaller costs for scanning the chain in a bucket, like
 below, would be better:
 run_cost += outer_path_rows * some_small_probe_cost;
 run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
 In some versions of Greenplum (a database based on postgres), we just
 disabled the cost of scanning the bucket chain. In most cases this gives
 a better query plan. But I am worried that it will be worse in some cases.

 Only when the small table's number of distinct values is much smaller
 than the bucket count, and much smaller than the number of distinct
 values of the large table, will the planner get the wrong plan.

 For example, suppose the inner table has 100 distinct values and 3000
 rows, the hash table has 1000 buckets, and the outer table has 10000
 distinct values.
 We can assume that all 100 distinct values of the inner table are
 included in the 10000 distinct values of the outer table. So
 (100/10000)*outer_rows tuples will probe buckets that have a chain, and
 (9900/10000)*outer_rows tuples will probe all the 1000 buckets randomly,
 so (9900/10000)*outer_rows*(900/1000) tuples will probe empty buckets.
 So the cost of scanning the bucket chains is

 hash_qual_cost.per_tuple * innerbucketsize * outer_rows *
 (1 - ((outer_distinct - inner_distinct)/outer_distinct) *
  ((buckets_num - inner_distinct)/buckets_num))

 Do you think this assumption is reasonable?
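
(Putting numbers on the example above -- just the arithmetic of the proposed
formula, not planner code:)

#include <stdio.h>

int
main(void)
{
	/* example above: 100 inner distinct values, 1000 buckets, 10000 outer distinct values */
	double		inner_nd = 100.0;
	double		outer_nd = 10000.0;
	double		nbuckets = 1000.0;

	/* fraction of outer-table probes that land on an empty bucket */
	double		empty = ((outer_nd - inner_nd) / outer_nd) *
						((nbuckets - inner_nd) / nbuckets);

	printf("fraction probing a non-empty bucket: %.3f\n", 1.0 - empty);	/* 0.109 */
	return 0;
}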


 On Tue, Nov 19, 2019 at 3:46 PM Thomas Munro 
 wrote:

> On Mon, Nov 18, 2019 at 7:48 PM Jinbao Chen 
> wrote:
> > In the test case above, the small table has 3000 tuples and 100
> distinct values on column ‘a’.
> > If we use the small table as the inner table, the chain length of the bucket
> is 30. And we need to
> > search the whole chain on probing the hash table. So the cost of
> probing is bigger than build
> > hash table, and we need to use big table as inner.
> >
> > But in fact this is not true. We initialized 620,000 buckets in
> hashtable. But only 100 buckets
> > has chains with length 30. Other buckets are empty. Only hash values
> need to be compared.
> > Its costs are very small. We have 100,000 distinct key and
> 100,000,000 tuple on outer table.
> > Only (100/10)* tuple_num tuples will search the whole chain. The
> other tuples
> > (number = (98900/10)*tuple_num*) in outer
> > table just compare with the hash value. So the actual cost is much
> smaller than the planner
> > calculated. This is the reason why using a small table as inner is
> faster.
>
> So 

Re: missing estimation for coalesce function

2019-11-28 Thread Laurenz Albe
On Wed, 2019-11-27 at 08:47 +0100, Pavel Stehule wrote:
> The most significant issue was missing correct estimation for coalesce 
> function.
> He had to rewrite coalesce(var, X) = X to "var IS NULL or var = X".
> Then the result was very satisfactory.
> 
> postgres=# explain analyze select * from xxx where coalesce(a, 0) = 0;
>                                       QUERY PLAN
> ------------------------------------------------------------------------------------------------
>  Seq Scan on xxx  (cost=0.00..194.00 rows=60 width=4) (actual time=0.041..4.276 rows=11000 loops=1)

I think that this is asking for a planner support function:
https://www.postgresql.org/docs/current/xfunc-optimization.html

Yours,
Laurenz Albe





Re: FETCH FIRST clause WITH TIES option

2019-11-28 Thread Alvaro Herrera
On 2019-Nov-28, Surafel Temesgen wrote:

> On Thu, Nov 28, 2019 at 12:36 AM Alvaro Herrera 
> wrote:
> 
> > I think you should add a /* fall-through */ comment after changing state.
> > Like this (this flow seems clearer; also DRY):
> >
> > if (!node->noCount &&
> >     node->position - node->offset >= node->count)
> > {
> >     if (node->limitOption == LIMIT_OPTION_COUNT)
> >     {
> >         node->lstate = LIMIT_WINDOWEND;
> >         return NULL;
> >     }
> >     else
> >     {
> >         node->lstate = LIMIT_WINDOWEND_TIES;
> >         /* fall-through */
> >     }
> > }
> > else
> >     ...
>
> changed

But you did not read my code snippet, did you ...?

> > I think you need to stare a that thing a little harder.
>
> This is because the new state didn't know about backward scans, and
> there was an off-by-one error: it scanned one position too far when
> detecting ties. The attached patch fixes both

Oh, thanks, glad it was that easy.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




remove useless returns

2019-11-28 Thread Alvaro Herrera
Here's a silly insomnia-inspired little patch that removes useless
"return;" lines from some routines.  (I left some alone, because they
seemed to be there more for documentation purposes, such as the ones in
from_char_set_mode and from_char_set_int; also regcomp.c since the
pedigree there is unclear.)

This seems pretty uncontroversial, so I'm not thinking of waiting over
the US holidays to get opinions about it ...

-- 
Álvaro HerreraPostgreSQL Expert, https://www.2ndQuadrant.com/
>From 630a65869f8824ed6a77810f26143924a7d95fc1 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera 
Date: Wed, 27 Nov 2019 10:28:37 -0300
Subject: [PATCH] Remove useless "return;" lines

---
 src/backend/access/gist/gist.c | 2 --
 src/backend/access/rmgrdesc/genericdesc.c  | 2 --
 src/backend/access/transam/twophase.c  | 2 --
 src/backend/access/transam/xact.c  | 1 -
 src/backend/executor/nodeTableFuncscan.c   | 2 --
 src/backend/jit/llvm/llvmjit.c | 2 --
 src/backend/libpq/hba.c| 1 -
 src/backend/replication/logical/launcher.c | 2 --
 src/backend/replication/walsender.c| 3 ---
 src/backend/rewrite/rowsecurity.c  | 2 --
 src/backend/utils/adt/datetime.c   | 2 --
 src/bin/pg_ctl/pg_ctl.c| 2 --
 src/bin/pg_dump/compress_io.c  | 4 
 src/bin/pg_dump/pg_backup_archiver.c   | 5 -
 src/bin/pg_dump/pg_backup_custom.c | 7 ---
 src/bin/pg_dump/pg_backup_directory.c  | 7 ---
 src/bin/pg_dump/pg_backup_null.c   | 3 ---
 src/bin/pg_dump/pg_backup_tar.c| 3 ---
 src/bin/pg_dump/pg_dump.c  | 2 --
 src/bin/pg_upgrade/parallel.c  | 4 
 src/bin/pg_upgrade/relfilenode.c   | 4 
 src/bin/pg_upgrade/tablespace.c| 2 --
 src/bin/psql/common.c  | 2 --
 src/fe_utils/print.c   | 2 --
 src/interfaces/ecpg/compatlib/informix.c   | 2 --
 src/interfaces/ecpg/pgtypeslib/datetime.c  | 1 -
 src/interfaces/ecpg/pgtypeslib/dt_common.c | 2 --
 src/interfaces/ecpg/pgtypeslib/timestamp.c | 1 -
 src/interfaces/libpq/fe-auth-scram.c   | 1 -
 src/pl/plperl/plperl.c | 4 
 src/pl/plpgsql/src/pl_exec.c   | 2 --
 src/port/win32error.c  | 1 -
 32 files changed, 82 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8d9c8d025d..3800f58ce7 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1088,8 +1088,6 @@ gistFindCorrectParent(Relation r, GISTInsertStack *child)
 		LockBuffer(child->parent->buffer, GIST_EXCLUSIVE);
 		gistFindCorrectParent(r, child);
 	}
-
-	return;
 }
 
 /*
diff --git a/src/backend/access/rmgrdesc/genericdesc.c b/src/backend/access/rmgrdesc/genericdesc.c
index 0fe02dc85a..3bad0578eb 100644
--- a/src/backend/access/rmgrdesc/genericdesc.c
+++ b/src/backend/access/rmgrdesc/genericdesc.c
@@ -43,8 +43,6 @@ generic_desc(StringInfo buf, XLogReaderState *record)
 		else
 			appendStringInfo(buf, "offset %u, length %u", offset, length);
 	}
-
-	return;
 }
 
 /*
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index b3ad0d08e3..529976885f 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -2436,6 +2436,4 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 	if (gxact->ondisk)
 		RemoveTwoPhaseFile(xid, giveWarning);
 	RemoveGXact(gxact);
-
-	return;
 }
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5c0d0f2af0..5353b6ab0b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -3428,7 +3428,6 @@ CheckTransactionBlock(bool isTopLevel, bool throwError, const char *stmtType)
 	/* translator: %s represents an SQL statement name */
 			 errmsg("%s can only be used in transaction blocks",
 	stmtType)));
-	return;
 }
 
 /*
diff --git a/src/backend/executor/nodeTableFuncscan.c b/src/backend/executor/nodeTableFuncscan.c
index 21cf3ca0c9..ffcf4613f4 100644
--- a/src/backend/executor/nodeTableFuncscan.c
+++ b/src/backend/executor/nodeTableFuncscan.c
@@ -335,8 +335,6 @@ tfuncFetchRows(TableFuncScanState *tstate, ExprContext *econtext)
 
 	MemoryContextSwitchTo(oldcxt);
 	MemoryContextReset(tstate->perTableCxt);
-
-	return;
 }
 
 /*
diff --git a/src/backend/jit/llvm/llvmjit.c b/src/backend/jit/llvm/llvmjit.c
index 1c851e00fc..3b52d8cc21 100644
--- a/src/backend/jit/llvm/llvmjit.c
+++ b/src/backend/jit/llvm/llvmjit.c
@@ -834,8 +834,6 @@ llvm_create_types(void)
 	 * Leave the module alive, otherwise references to function would be
 	 * dangling.
 	 */
-
-	return;
 }
 
 /*
diff --git a/src/backend/libpq/hba.c b/src/backend/libpq/hba.c
index cc4b661433..9989904769 100644
--- a/src/backend/libpq/hba.c
+++ b/src/backend/libpq/hba.c
@@ -2903,7 +2903,6 @@ check_ident_usermap(IdentLine *identLine, const char 

Re: Option to dump foreign data in pg_dump

2019-11-28 Thread Alvaro Herrera
On 2019-Nov-12, Luis Carril wrote:

> The nitpicks have been addressed.  However, it seems that the new file
> containing the test FDW seems missing from the new version of the patch.  Did
> you forget to git add the file?
> 
> Yes, I forgot, thanks for noticing. New patch attached again.

Luis,

It seems you've got enough support for this concept, so let's move
forward with this patch.  There are some comments from Tom about the
patch; would you like to send an updated version perhaps?

Thanks

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: checkpointer: PANIC: could not fsync file: No such file or directory

2019-11-28 Thread Thomas Munro
On Wed, Nov 27, 2019 at 7:53 PM Justin Pryzby  wrote:
>  2019-11-26 23:41:50.009-05 | could not fsync file 
> "pg_tblspc/16401/PG_12_201909212/16460/973123799.10": No such file or 
> directory

I managed to reproduce this (see below).  I think I know what the
problem is: mdsyncfiletag() uses _mdfd_getseg() to open the segment to
be fsync'd, but that function opens all segments up to the one you
requested, so if a lower-numbered segment has already been unlinked,
it can fail.  Usually that's unlikely because it's hard to get the
request queue to fill up and therefore hard to split up the cancel
requests for all the segments for a relation, but your workload and
the repro below do it.  In fact, the path it shows in the error
message is not even the problem file, that's the one it really wanted,
but first it was trying to open lower-numbered ones.  I can see a
couple of solutions to the problem (unlink in reverse order, send all
the forget messages first before unlinking anything, or go back to
using a single atomic "forget everything for this rel" message instead
of per-segment messages), but I'll have to think more about that
tomorrow.

=== repro ===

Recompile with RELSEG_SIZE 2 in pg_config.h.  Run with
checkpoint_timeout=30s and shared_buffers=128kB.  Then:

create table t (i int primary key);
cluster t using t_pkey;
insert into t select generate_series(1, 1);

Session 1:
cluster t;
\watch 1

Session 2:
update t set i = i;
\watch 1.1




Re: Implementing Incremental View Maintenance

2019-11-28 Thread Alvaro Herrera
One thing pending in this development line is how to catalogue aggregate
functions that can be used in incrementally-maintainable views.
I saw a brief mention somewhere that the devels knew it needed to be
done, but I don't see in the thread that they got around to doing it.
Did you guys have any thoughts on how it can be represented in catalogs?
It seems sine-qua-non ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: FETCH FIRST clause WITH TIES option

2019-11-28 Thread Surafel Temesgen
On Thu, Nov 28, 2019 at 12:36 AM Alvaro Herrera 
wrote:

> Thanks.
>
> (I would suggest renaming the new state LIMIT_WINDOWEND_TIES, because it
> seems to convey what it does a little better.)
>
> I think you should add a /* fall-through */ comment after changing state.
> Like this (this flow seems clearer; also DRY):
>
> if (!node->noCount &&
>     node->position - node->offset >= node->count)
> {
>     if (node->limitOption == LIMIT_OPTION_COUNT)
>     {
>         node->lstate = LIMIT_WINDOWEND;
>         return NULL;
>     }
>     else
>     {
>         node->lstate = LIMIT_WINDOWEND_TIES;
>         /* fall-through */
>     }
> }
> else
>     ...
>
>
changed


> I've been playing with this a little more, and I think you've overlooked
> a few places in ExecLimit that need to be changed.  In the regression
> database, I tried this:
>
> 55432 13devel 17282=# begin;
> BEGIN
> 55432 13devel 17282=# declare c1 cursor for select * from int8_tbl order
> by q1 fetch first 2 rows with ties;
> DECLARE CURSOR
> 55432 13devel 17282=# fetch all in c1;
>  q1  │        q2        
> ─────┼──────────────────
>  123 │              456
>  123 │ 4567890123456789
> (2 filas)
>
> 55432 13devel 17282=# fetch all in c1;
>  q1 │ q2 
> ────┼────
> (0 filas)
>
> 55432 13devel 17282=# fetch forward all in c1;
>  q1 │ q2 
> ────┼────
> (0 filas)
>
> So far so good .. things look normal.  But here's where things get
> crazy:
>
> 55432 13devel 17282=# fetch backward all in c1;
>         q1        │        q2        
> ──────────────────┼──────────────────
>  4567890123456789 │              123
>               123 │ 4567890123456789
> (2 filas)
>
> (huh???)
>
> 55432 13devel 17282=# fetch backward all in c1;
>  q1 │ q2 
> ────┼────
> (0 filas)
>
> Okay -- ignoring the fact that we got a row that should not be in the
> result, we've exhausted the cursor going back.  Let's scan it again from
> the start:
>
> 55432 13devel 17282=# fetch forward all in c1;
>         q1        │         q2        
> ──────────────────┼────────────────────
>               123 │  4567890123456789
>  4567890123456789 │               123
>  4567890123456789 │  4567890123456789
>  4567890123456789 │ -4567890123456789
> (4 filas)
>
> This is just crazy.
>
> I think you need to stare a that thing a little harder.
>
>
This is because the new state didn't know about backward scans, and there
was an off-by-one error: it scanned one position too far when detecting
ties. The attached patch fixes both.
regards
Surafel
From cad411bf6bd5ffc7f5bb5cb8d9ce456fc55ac694 Mon Sep 17 00:00:00 2001
From: Surafel Temesgen 
Date: Thu, 28 Nov 2019 16:18:45 +0300
Subject: [PATCH] FETCH FIRST clause WITH TIES option

---
 doc/src/sgml/ref/select.sgml|  15 ++-
 src/backend/catalog/sql_features.txt|   2 +-
 src/backend/executor/nodeLimit.c| 156 +---
 src/backend/nodes/copyfuncs.c   |   7 ++
 src/backend/nodes/equalfuncs.c  |   2 +
 src/backend/nodes/outfuncs.c|   7 ++
 src/backend/nodes/readfuncs.c   |   6 +
 src/backend/optimizer/plan/createplan.c |  44 ++-
 src/backend/optimizer/plan/planner.c|   1 +
 src/backend/optimizer/util/pathnode.c   |   2 +
 src/backend/parser/analyze.c|   3 +
 src/backend/parser/gram.y   | 117 ++
 src/backend/utils/adt/ruleutils.c   |  10 +-
 src/include/nodes/execnodes.h   |   4 +
 src/include/nodes/nodes.h   |  13 ++
 src/include/nodes/parsenodes.h  |   2 +
 src/include/nodes/pathnodes.h   |   1 +
 src/include/nodes/plannodes.h   |   5 +
 src/include/optimizer/pathnode.h|   1 +
 src/include/optimizer/planmain.h|   3 +-
 src/test/regress/expected/limit.out | 104 
 src/test/regress/sql/limit.sql  |  31 +
 22 files changed, 482 insertions(+), 54 deletions(-)

diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 691e402803..2b11c38087 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -44,7 +44,7 @@ SELECT [ ALL | DISTINCT [ ON ( expressionexpression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
 [ LIMIT { count | ALL } ]
 [ OFFSET start [ ROW | ROWS ] ]
-[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
+[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES } ]
 [ FOR { UPDATE | NO KEY UPDATE | SHARE | KEY SHARE } [ OF table_name [, ...] ] [ NOWAIT | SKIP LOCKED ] [...] ]
 
 where from_item can be one of:
@@ -1438,7 +1438,7 @@ OFFSET start
 which PostgreSQL also supports.  It is:
 
 OFFSET start { ROW | ROWS }
-FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY
+FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { 

Re: progress report for ANALYZE

2019-11-28 Thread Alvaro Herrera
On 2019-Nov-28, Tatsuro Yamada wrote:

> Hmmm... I understand your opinion but I'd like to get more opinions too. :)
> Do you prefer these column names? See below:

Here's my take on it:

 
  pid
  datid
  datname
  relid
  phase
  sample_blks_total
  sample_blks_scanned
  ext_stats_total
  ext_stats_computed
  child_tables_total
  child_tables_done
  current_child_table_relid

It seems to make sense to keep using the "child table" terminology in
that last column; but since the column carries an OID then as Robert
said it should have "relid" in the name.  For the other two "child
tables" columns, not using "relid" is appropriate because what they have
is not relids.


I think there should be an obvious correspondence in columns that are
closely related, which there isn't if you use "sample" in one and "heap"
in the other, so I'd go for "sample" in both.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: How to prohibit parallel scan through tableam?

2019-11-28 Thread Rafia Sabih
On Wed, 27 Nov 2019 at 12:33, Konstantin Knizhnik 
wrote:

> Hi hackers,
>
> I wonder how it is possible to prohibit parallel scan for the external
> storage accessed through tableam?
> For example if I want to implement specialized tableam for fast access
> to temp tables, how can I inform optimizer that
> parallel scan is not possible (because table data is local to the backend)?
>
One moment, aren't parallel scans already restricted for temp tables, or
have I misunderstood something here...?



-- 
Regards,
Rafia Sabih


Re: Collation versioning

2019-11-28 Thread Julien Rouhaud
On Thu, Nov 28, 2019 at 5:50 AM Michael Paquier  wrote:
>
> On Wed, Nov 27, 2019 at 04:09:57PM -0500, Robert Haas wrote:
> > Yeah, I like #3 too. If we're going to the trouble to build all of
> > this mechanism, it seems worth it to build the additional machinery to
> > be precise about the difference between "looks like a problem" and "we
> > don't know".
>
> Indeed, #3 sounds like a sensible way of doing things.  The two others
> may cause random problems which are harder to actually detect and act
> on as we should avoid as much as possible a forced system-wide REINDEX
> after an upgrade to a post-13 PG.

Thanks everyone for the feedback!  Since there seems to be an
agreement on #3, here's a proposal.

What we could do is store an empty string if the compatibility is
unknown, and detect it in index_check_collation_version() to report a
slightly different message.  I'm assuming that not knowing the
compatibility would be system-wide rather than per-collation, so we
could use an SQL query like this:

ALTER INDEX idx_name DEPENDS ON COLLATION UNKNOWN VERSION

If adding (un)reserved keywords is not an issue, we could also instead
use something along ALTER INDEX idx_name DEPENDS ON ALL COLLATIONS
and/or ALL VERSIONS UNKNOWN, or switch to:

ALTER INDEX idx_name ALTER [ COLLATION coll_name | ALL COLLATIONS ]
DEPENDS ON [ UNKNOWN VERSION | VERSION 'version_string' ]

Obviously, specific versions would require a specific collation, and
at least UNKNOWN VERSION would only be allowed in binary upgrade mode,
and not documented.  I have also some other ideas for additional DDL
to let users deal with catalog update after a compatible collation
library upgrade, but let's deal with that later.
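
For instance, the detection in index_check_collation_version() could look
roughly like the sketch below, under the empty-string convention described
above (the warning texts and the helper's shape are made up, not actual
patch code):

#include <stdio.h>
#include <string.h>

/* recorded == "" means: imported from a pre-13 cluster, compatibility unknown */
static void
warn_collation_version(const char *collname,
					   const char *recorded, const char *current)
{
	if (recorded == NULL)
		return;					/* no version tracking for this collation */

	if (recorded[0] == '\0')
		printf("WARNING: index depends on collation \"%s\" with an unknown version; "
			   "reindex if the collation may have changed\n", collname);
	else if (strcmp(recorded, current) != 0)
		printf("WARNING: index depends on collation \"%s\" version %s, "
			   "but the current version is %s\n", collname, recorded, current);
}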

Then for pg_upgrade, we can add a --collations-are-binary-compatible
switch or similar:

If upgrading from pre-v13
  - without the switch, we'd generate the VERSION UNKNOWN for all
indexes during pg_dump in upgrade_mode
  - with the switch, do nothing as all indexes would already be
pointing to the currently installed version

If upgrading from post v13, the switch shouldn't be useful as versions
will be restored, and if there was a collation library upgrade it
should be handled manually, same as if such an upgrade is done without
pg_upgrade-ing the cluster.  I'd personally disallow it to keep users
from shooting themselves in the foot.

Does this sound sensible?




Re: Remove page-read callback from XLogReaderState.

2019-11-28 Thread Kyotaro Horiguchi
At Wed, 27 Nov 2019 01:11:40 -0300, Alvaro Herrera  
wrote in 
> On 2019-Nov-27, Kyotaro Horiguchi wrote:
> 
> > At Thu, 24 Oct 2019 14:51:01 +0900 (JST), Kyotaro Horiguchi 
> >  wrote in 
> > > Rebased.
> > 
> > 0dc8ead463 hit this. Rebased.
> 
> Please review the pg_waldump.c hunks in 0001; they revert recent changes.

Ughhh! I'll check it. Thank you for noticing!!

At Wed, 27 Nov 2019 12:57:47 +0900, Michael Paquier  wrote 
in 
> Note: Moved to next CF.

Thanks.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-28 Thread Kyotaro Horiguchi
I measured the performance with the latest patch set.

> 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
>minute when done via syncs.
> 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> 3. Wait 10s.
> 4. Start one DDL backend that runs $DDL_COUNT transactions.
> 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

I did the following benchmarking.

1. Initialize bench database

  $ pgbench -i -s 20

2. Start server with wal_level = replica (all other variables are not
changed) then run the attached ./bench.sh

  $ ./bench.sh <count> <pages> <mode>

where <count> is the number of repetitions, <pages> is the number of pages
to write in a run, and <mode> is "s" (sync) or "w" (WAL). The <mode>
argument has no effect if wal_level = replica. The script shows the
following result.

| before: tps 240.2, lat 44.087 ms (29 samples)
| during: tps 109.1, lat 114.887 ms (14 samples)
| after : tps 269.9, lat 39.557 ms (107 samples)
| DDL time = 13965 ms
| # transaction type: 

before: mean numbers before "the DDL" starts.
during: mean numbers while "the DDL" is running.
after : mean numbers after "the DDL" ends.
DDL time: the time it took to run "the DDL".

3. Restart server with wal_level = replica then run the bench.sh
twice.

  $ ./bench.sh <count> <pages> s
  $ ./bench.sh <count> <pages> w


Finally I got three graphs. (attached 1, 2, 3. PNGs)

* Graph 1 - The effect of the DDL on pgbench's TPS

 The vertical axis means "during TPS" / "before TPS" in %. Larger is
 better. The horizontal axis means the table size in pages.

 Replica and Minimal-sync are almost flat.  Minimal-WAL gets worse
 as the table size increases. 500 pages seems to be the crossover point.


* Graph 2 - The effect of the DDL on pgbench's latency.

 The vertical axis means "during-latency" / "before-latency" in
 %. Smaller is better. As with TPS, but WAL latency gets worse more
 quickly as the table size increases. The crossover point seems to be
 300 pages or so.


* Graph 3 - The effect of pgbench's workload on DDL runtime.

 The vertical axis means "time the DDL takes to run with pgbench" /
 "time the DDL takes to run alone". Smaller is better. Replica and
 Minimal-SYNC show a similar tendency. On Minimal-WAL the DDL runs
 quite fast with small tables. The crossover point seems to be about
 2500 pages.

Seeing this, I have become worried that the optimization might give a far
smaller advantage than expected. Putting that aside, it seems to me
that the default value for the threshold should be 500-1000, the same as
the previous benchmark showed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


Re: [HACKERS] WAL logging problem in 9.4.3?

2019-11-28 Thread Kyotaro Horiguchi
At Tue, 26 Nov 2019 21:37:52 +0900 (JST), Kyotaro Horiguchi wrote in
> It is not fully checked. I didn't merge and measure performance yet,
> but I post the status-quo patch for now.

It was actually an inconsistency caused by swap_relation_files.

1. rd_createSubid of the relcache entry for r2 is not turned off. This
   prevents the relcache entry from being flushed. Commit processes
   pendingSyncs and leaves the relcache entry with rd_createSubid !=
   Invalid, which is an inconsistency.

2. relation_open(r1) returns a relcache entry whose relfilenode still has
   the old value (relfilenode1), since the command counter has not been
   incremented. On the other hand, if it is incremented just before,
   AssertPendingSyncConsistency() aborts because of the inconsistency
   between relfilenode and rd_firstRel*.

As a result, I have come back to thinking that we need to modify both
relcache entries with the right relfilenode.

I once thought that taking AEL in the function had no side effect, but
the code path is also executed when wal_level = replica or higher. And
as I mentioned upthread, we can even get there without taking any lock
on r1, or sometimes only ShareLock. So upgrading to AEL emits a
Standby/LOCK WAL record and propagates it to standbys. After all, I'd
like to take the weakest lock (AccessShareLock) there.

The attached is the new version of the patch.

- v26-0001-version-nm24.patch
  Same with v24

- v26-0002-change-swap_relation_files.patch

 Changes to swap_relation_files as mentioned above.

- v26-0003-Improve-the-performance-of-relation-syncs.patch

 Do multiple pending syncs by one shared_buffers scanning.

- v26-0004-Revert-FlushRelationBuffersWithoutRelcache.patch

 v26-0003 makes the function useless. Remove it.

- v26-0005-Fix-gistGetFakeLSN.patch

 gistGetFakeLSN fix.

- v26-0006-Sync-files-shrinked-by-truncation.patch

 Fix the problem of commit-time-FPI after truncation after checkpoint.
 I'm not sure this is the right direction but pendingSyncHash is
 removed from pendingDeletes list again.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From ee96bb1e14969823eab79ab1531d68e8aadc1915 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Nov 2019 15:28:06 +0900
Subject: [PATCH v26 1/6] version nm24

Noah Misch's version 24.
---
 doc/src/sgml/config.sgml |  43 +++--
 doc/src/sgml/perform.sgml|  47 ++
 src/backend/access/gist/gistutil.c   |   7 +-
 src/backend/access/heap/heapam.c |  45 +-
 src/backend/access/heap/heapam_handler.c |  22 +--
 src/backend/access/heap/rewriteheap.c|  21 +--
 src/backend/access/nbtree/nbtsort.c  |  41 ++---
 src/backend/access/transam/README|  47 +-
 src/backend/access/transam/xact.c|  14 ++
 src/backend/access/transam/xloginsert.c  |  10 +-
 src/backend/access/transam/xlogutils.c   |  17 +-
 src/backend/catalog/heap.c   |   4 +
 src/backend/catalog/storage.c| 198 +--
 src/backend/commands/cluster.c   |  11 ++
 src/backend/commands/copy.c  |  58 +--
 src/backend/commands/createas.c  |  11 +-
 src/backend/commands/matview.c   |  12 +-
 src/backend/commands/tablecmds.c |  11 +-
 src/backend/storage/buffer/bufmgr.c  |  37 +++--
 src/backend/storage/smgr/md.c|   9 +-
 src/backend/utils/cache/relcache.c   | 122 ++
 src/backend/utils/misc/guc.c |  13 ++
 src/include/access/heapam.h  |   3 -
 src/include/access/rewriteheap.h |   2 +-
 src/include/access/tableam.h |  18 +--
 src/include/catalog/storage.h|   5 +
 src/include/storage/bufmgr.h |   5 +
 src/include/utils/rel.h  |  57 +--
 src/include/utils/relcache.h |   8 +-
 src/test/regress/pg_regress.c|   2 +
 30 files changed, 551 insertions(+), 349 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d4d1fe45cc..d0f7dbd7d7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2483,21 +2483,14 @@ include_dir 'conf.d'
 levels.  This parameter can only be set at server start.


-In minimal level, WAL-logging of some bulk
-operations can be safely skipped, which can make those
-operations much faster (see ).
-Operations in which this optimization can be applied include:
-
- CREATE TABLE AS
- CREATE INDEX
- CLUSTER
- COPY into tables that were created or truncated in the same
- transaction
-
-But minimal WAL does not contain enough information to reconstruct the
-data from a base backup and the WAL logs, so replica or
-higher must be used to enable WAL archiving
-() and streaming replication.
+In minimal level, no information is logged for
+tables or indexes for the remainder of a transaction that creates or
+truncates them.  This 

Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.

2019-11-28 Thread Jinbao Chen
Hi Andy,

I just tested the query on 12.1, but PG uses big_table as the inner table.

demo=# explain (costs off) select * from t_small, t_big where a = b;
 QUERY PLAN

 Hash Join
   Hash Cond: (t_small.a = t_big.b)
   ->  Seq Scan on t_small
   ->  Hash
 ->  Seq Scan on t_big

Did you insert the data and set max_parallel_workers_per_gather to 0 as above?

create table t_small(a int);
create table t_big(b int);
insert into t_small select i%100 from generate_series(0, 3000)i;
insert into t_big select i%10 from generate_series(1, 1)i ;
analyze t_small;
analyze t_big;
set max_parallel_workers_per_gather = 0;

On Thu, Nov 28, 2019 at 5:46 PM Andy Fan  wrote:

>
>
> On Fri, Nov 22, 2019 at 6:51 PM Jinbao Chen  wrote:
>
>> Hi hackers,
>>
>> I have made a patch to fix the problem.
>>
>> Added the selection rate of the inner table non-empty bucket
>>
>> The planner will use big table as inner table in hash join
>> if small table have fewer unique values. But this plan is
>> much slower than using small table as inner table.
>>
>> In general, the cost of creating a hash table is higher
>> than the cost of querying a hash table. So we tend to use
>> small tables as internal tables. But if the average chain
>> length of the bucket is large, the situation is just the
>> opposite.
>>
>> If virtualbuckets is much larger than innerndistinct, and
>> outerndistinct is much larger than innerndistinct. Then most
>> tuples of the outer table will match the empty bucket. So when
>> we calculate the cost of traversing the bucket, we need to
>> ignore the tuple matching empty bucket.
>>
>> So we add the selection rate of the inner table non-empty
>> bucket. The formula is:
>> (1 - ((outerndistinct - innerndistinct)/outerndistinct)*
>> ((virtualbuckets - innerndistinct)/virtualbuckets))
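
For illustration, here is a small standalone C sketch of the correction
factor quoted above, plugged in with the bucket and distinct-value counts
mentioned further down in this thread; the function and variable names are
mine, not the planner's:

    #include <stdio.h>

    /* Fraction of outer probes expected to land on a non-empty bucket,
     * following the quoted formula. */
    static double
    nonempty_bucket_sel(double inner_ndistinct, double outer_ndistinct,
                        double virtual_buckets)
    {
        return 1.0 - ((outer_ndistinct - inner_ndistinct) / outer_ndistinct)
                   * ((virtual_buckets - inner_ndistinct) / virtual_buckets);
    }

    int
    main(void)
    {
        /* 100 inner distinct values, 100,000 outer distinct values and
         * 620,000 buckets, as in the example discussed below */
        double sel = nonempty_bucket_sel(100.0, 100000.0, 620000.0);

        /* the bucket-scan term of the join cost would then be scaled:
         * run_cost += hash_qual_cost.per_tuple * innerbucketsize
         *     * outer_path_rows * sel */
        printf("non-empty bucket selectivity = %f\n", sel);
        return 0;
    }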
>>
>>
>> On Tue, Nov 19, 2019 at 5:56 PM Jinbao Chen  wrote:
>>
>>> I think we have the same understanding of this issue.
>>>
>>> Sometimes use smaller costs on scanning the chain in bucket like below
>>> would
>>> be better.
>>> run_cost += outer_path_rows * some_small_probe_cost;
>>> run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
>>> In some version of GreenPlum(a database based on postgres), we just
>>> disabled
>>> the cost on scanning the bucket chain. In most cases, this can get a
>>> better query
>>> plan. But I am worried that it will be worse in some cases.
>>>
>>> Now only the small table's distinct value is much smaller than the
>>> bucket number,
>>> and much smaller than the distinct value of the large table, the planner
>>> will get the
>>> wrong plan.
>>>
>>> For example, if inner table has 100 distinct values, and 3000 rows. Hash
>>> table
>>> has 1000 buckets. Outer table has 1 distinct values.
>>> We can assume that all the 100 distinct values of the inner table are
>>> included in the
>>> 1 distinct values of the outer table. So (100/1)*outer_rows
>>> tuples will
>>> probe the buckets has chain. And (9900/1)*outer_rows tuples will
>>> probe
>>> all the 1000 buckets randomly. So (9900/1)*outer_rows*(900/1000)
>>> tuples will
>>> probe empty buckets. So the costs on scanning bucket chain is
>>>
>>> hash_qual_cost.per_tuple*innerbucketsize*outer_rows*
>>> (1 - ((outer_distinct - inner_distinct)/outer_distinct)*((buckets_num -
>>> inner_disttinct)/buckets_num))
>>>
>>> Do you think this assumption is reasonable?
>>>
>>>
>>> On Tue, Nov 19, 2019 at 3:46 PM Thomas Munro 
>>> wrote:
>>>
 On Mon, Nov 18, 2019 at 7:48 PM Jinbao Chen  wrote:
 > In the test case above, the small table has 3000 tuples and 100
 distinct values on column ‘a’.
 > If we use small table as inner table.  The chan length of the bucket
 is 30. And we need to
 > search the whole chain on probing the hash table. So the cost of
 probing is bigger than build
 > hash table, and we need to use big table as inner.
 >
 > But in fact this is not true. We initialized 620,000 buckets in
 hashtable. But only 100 buckets
 > has chains with length 30. Other buckets are empty. Only hash values
 need to be compared.
 > Its costs are very small. We have 100,000 distinct key and
 100,000,000 tuple on outer table.
 > Only (100/10)* tuple_num tuples will search the whole chain. The
 other tuples
 > (number = (98900/10)*tuple_num*) in outer
 > table just compare with the hash value. So the actual cost is much
 smaller than the planner
 > calculated. This is the reason why using a small table as inner is
 faster.

 So basically we think that if t_big is on the outer side, we'll do
 100,000,000 probes and each one is going to scan a t_small bucket with
 chain length 30, so that looks really expensive.  Actually only a
 small percentage of its probes find tuples with the right hash value,
 but final_cost_hash_join() doesn't know that.  So we hash t_big
 instead, which we 

Re: Parallel grouping sets

2019-11-28 Thread Pengzhou Tang
Hi Hackers,

Richard pointed out that he got incorrect results with the patch I
attached; there were bugs somewhere. I have fixed them now and attached the
newest version; please refer to [1] for the fix.
the fix.

Thanks,
Pengzhou

References:
[1] https://github.com/greenplum-db/postgres/tree/parallel_groupingsets_3

On Mon, Sep 30, 2019 at 5:41 PM Pengzhou Tang  wrote:

> Hi Richard & Tomas:
>
> I followed the idea of the second approach: add a gset_id to the
> targetlist of the first stage of grouping sets and use it to combine the
> aggregates in the final stage. The gset_id stuff is still kept because
> GROUPING() cannot uniquely identify a grouping set; grouping sets may
> contain duplicated sets, e.g. group by grouping sets((c1, c2), (c1, c2)).
>
> There are some differences in implementing the second approach compared
> with Richard's original idea: gset_id is not used as an additional group
> key in the final stage; instead, we use it to dispatch each input tuple to
> the specified grouping set directly and then do the aggregation (see the
> sketch after this paragraph). One advantage of this is that we can handle
> multiple rollups with better performance, without an APPEND node.
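
To illustrate the dispatch idea, here is a purely schematic, standalone C
sketch; none of the names below come from the patch:

    #include <stdio.h>

    /* Each partially aggregated row carries a gset_id; the combine step
     * routes it straight to that grouping set's state instead of
     * re-evaluating the grouping keys. */
    #define NUM_GROUPING_SETS 3

    typedef struct { long count; } SetAggState;

    static void
    combine_row(SetAggState *states, int gset_id, long partial_count)
    {
        states[gset_id].count += partial_count;
    }

    int
    main(void)
    {
        SetAggState states[NUM_GROUPING_SETS] = {{0}, {0}, {0}};

        combine_row(states, 0, 10);
        combine_row(states, 2, 5);
        combine_row(states, 0, 7);

        for (int i = 0; i < NUM_GROUPING_SETS; i++)
            printf("gset %d: count = %ld\n", i, states[i].count);
        return 0;
    }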
>
> the plan now looks like:
>
> gpadmin=# explain select c1, c2 from gstest group by grouping
> sets(rollup(c1, c2), rollup(c3));
>  QUERY PLAN
>
> 
>  Finalize MixedAggregate  (cost=1000.00..73108.57 rows=8842 width=12)
>Dispatched by: (GROUPINGSETID())
>Hash Key: c1, c2
>Hash Key: c1
>Hash Key: c3
>Group Key: ()
>  Group Key: ()
>->  Gather  (cost=1000.00..71551.48 rows=17684 width=16)
>  Workers Planned: 2
>  ->  Partial MixedAggregate  (cost=0.00..68783.08 rows=8842
> width=16)
>Hash Key: c1, c2
>Hash Key: c1
>Hash Key: c3
>Group Key: ()
>Group Key: ()
>->  Parallel Seq Scan on gstest  (cost=0.00..47861.33
> rows=208 width=12)
> (16 rows)
>
> gpadmin=# set enable_hashagg to off;
> gpadmin=# explain select c1, c2 from gstest group by grouping
> sets(rollup(c1, c2), rollup(c3));
>QUERY PLAN
>
> 
>  Finalize GroupAggregate  (cost=657730.66..663207.45 rows=8842 width=12)
>Dispatched by: (GROUPINGSETID())
>Group Key: c1, c2
>Sort Key: c1
>  Group Key: c1
>  Group Key: ()
>  Group Key: ()
>Sort Key: c3
>  Group Key: c3
>->  Sort  (cost=657730.66..657774.87 rows=17684 width=16)
>  Sort Key: c1, c2
>  ->  Gather  (cost=338722.94..656483.04 rows=17684 width=16)
>Workers Planned: 2
>->  Partial GroupAggregate  (cost=337722.94..653714.64
> rows=8842 width=16)
>  Group Key: c1, c2
>  Group Key: c1
>  Group Key: ()
>  Group Key: ()
>  Sort Key: c3
>Group Key: c3
>  ->  Sort  (cost=337722.94..342931.28 rows=208
> width=12)
>Sort Key: c1, c2
>->  Parallel Seq Scan on gstest
>  (cost=0.00..47861.33 rows=208 width=12)
>
> References:
> [1] https://github.com/greenplum-db/postgres/tree/parallel_groupingsets_3
>
> On Wed, Jul 31, 2019 at 4:07 PM Richard Guo  wrote:
>
>> On Tue, Jul 30, 2019 at 11:05 PM Tomas Vondra <
>> tomas.von...@2ndquadrant.com> wrote:
>>
>>> On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:
>>> >On Wed, Jun 12, 2019 at 10:58 AM Richard Guo  wrote:
>>> >
>>> >> Hi all,
>>> >>
>>> >> Paul and I have been hacking recently to implement parallel grouping
>>> >> sets, and here we have two implementations.
>>> >>
>>> >> Implementation 1
>>> >> 
>>> >>
>>> >> Attached is the patch and also there is a github branch [1] for this
>>> >> work.
>>> >>
>>> >
>>> >Rebased with the latest master.
>>> >
>>>
>>> Hi Richard,
>>>
>>> thanks for the rebased patch. I think the patch is mostly fine (at least
>>> I
>>> don't see any serious issues). A couple minor comments:
>>>
>>
>> Hi Tomas,
>>
>> Thank you for reviewing this patch.
>>
>>
>>>
>>> 1) I think get_number_of_groups() would deserve a short explanation why
>>> it's OK to handle (non-partial) grouping sets and regular GROUP BY in the
>>> same branch. Before these cases were clearly separated, now it seems a
>>> bit
>>> mixed up and it may not be immediately obvious why it's OK.
>>>
>>
>> Added a short comment in get_number_of_groups() explaining the behavior
>> when doing partial aggregation for grouping sets.
>>
>>
>>>
>>> 2) There are new regression 

Re: [HACKERS] Block level parallel vacuum

2019-11-28 Thread Amit Kapila
On Thu, Nov 28, 2019 at 4:10 PM Mahendra Singh  wrote:
>
> On Thu, 28 Nov 2019 at 13:32, Masahiko Sawada 
>  wrote:
>>
>> On Wed, 27 Nov 2019 at 19:21, Mahendra Singh  wrote:
>> >
>> >
>> > Thanks for the re-based patches.
>> >
>> > On the top of v35 patch, I can see one compilation warning.
>> >>
>> >> parallel.c: In function ‘LaunchParallelWorkers’:
>> >> parallel.c:502:2: warning: ISO C90 forbids mixed declarations and code 
>> >> [-Wdeclaration-after-statement]
>> >>   int   i;
>> >>   ^
>> >
>> >
>> > Above warning is due to one extra semicolon added at the end of 
>> > declaration line in v35-0003 patch. Please fix this in next version.
>> > +   int nworkers_to_launch = Min(nworkers, pcxt->nworkers);;
>>
>> Thanks. I will fix it in the next version patch.
>>
>> >
>> > I will continue my testing on the top of v35 patch set and will post 
>> > results.
>
>
> While reviewing v35 patch set and doing testing, I found that if we disable 
> leader participation, then we are launching 1 less parallel worker than total 
> number of indexes. (I am using max_parallel_workers = 20, 
> max_parallel_maintenance_workers = 20)
>
> For example: If table have 3 indexes and we gave 6 parallel vacuum 
> degree(leader participation is disabled), then I think, we should launch 3 
> parallel workers but we are launching 2 workers due to below check.
> +   nworkers = lps->nindexes_parallel_bulkdel - 1;
> +
> +   /* Cap by the worker we computed at the beginning of parallel lazy vacuum 
> */
> +   nworkers = Min(nworkers, lps->pcxt->nworkers);
>
> Please let me know your thoughts for this.
>

I think that is probably because this part of the code doesn't consider
PARALLEL_VACUUM_DISABLE_LEADER_PARTICIPATION.  We can change it if we want,
but I am slightly nervous about the code complexity this will bring; then
again, maybe that is fine.
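
For what it's worth, here is a minimal standalone sketch of the computation
under discussion; the function name and the leader_participates flag are
illustrative assumptions, not the patch's actual code:

    #include <stdio.h>

    static int
    compute_nworkers(int nindexes_parallel_bulkdel, int pcxt_nworkers,
                     int leader_participates)
    {
        int         nworkers = nindexes_parallel_bulkdel;

        if (leader_participates)
            nworkers--;         /* the leader vacuums one index itself */

        return nworkers < pcxt_nworkers ? nworkers : pcxt_nworkers;
    }

    int
    main(void)
    {
        /* 3 indexes, parallel degree 6, leader participation disabled:
         * we would expect 3 workers rather than 2. */
        printf("%d\n", compute_nworkers(3, 6, 0));
        return 0;
    }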

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: progress report for ANALYZE

2019-11-28 Thread Tatsuro Yamada

Hi Amit-san,

On 2019/11/28 10:59, Amit Langote wrote:

Yamada-san,

Thank you for updating the patch.

On Wed, Nov 27, 2019 at 12:46 PM Tatsuro Yamada
 wrote:

But I just remembered I replaced column name "*_table" with "*_relid"
based on Robert's comment three months ago, see below:


/me reviews.

+  scanning_table

I think this should be retitled to something that ends in 'relid',
like all of the corresponding cases in existing progress views.
Perhaps 'active_relid' or 'current_relid'.


So, it would be better to use same rule against child_tables_total and
child_tables_done. Thus I changed these column names to others and added
to the view.


Robert's comment seems OK for a column that actually reports an OID,
but for columns that report counts, names containing "relids" sound a
bit strange to me.  So, I prefer child_tables_total /
child_tables_done over child_relids_total / child_relids_done.  Would
like to hear more opinions.


Hmmm... I understand your opinion but I'd like to get more opinions too. :)
Do you prefer these column names? See below:


 pid
 datid
 datname
 relid
 phase
 sample_blks_total
 heap_blks_scanned
 ext_stats_total
 ext_stats_computed
 child_tables_total  <= Renamed: s/relid/table/
 child_tables_done   <= Renamed: s/relid/table/
 current_child_table <= Renamed: s/relid/table/




Also, inheritance tree stats are created *after* creating single table
stats, so I think that it would be better to have a distinct phase
name for that, say "acquiring inherited sample rows".  In
do_analyze_rel(), you can select which of two phases to set based on
whether inh is true or not.


Okay! I'll also add the new phase "acquiring inherited sample rows" on
the next patch.



Fixed.

I tried to abbreviate it to "acquiring inh sample rows" because I thought
"acquiring inherited sample rows" is a little long for the phase name.


I think phase names should contain full English words, because they
are supposed to be descriptive.  Users are very likely to not
understand what "inh" means without looking up the docs.



Okay, I will fix it.
   s/acquiring inh sample rows/acquiring inherited sample rows/


Thanks,
Tatsuro Yamada






Re: [HACKERS] Block level parallel vacuum

2019-11-28 Thread Mahendra Singh
On Thu, 28 Nov 2019 at 13:32, Masahiko Sawada <
masahiko.saw...@2ndquadrant.com> wrote:

> On Wed, 27 Nov 2019 at 19:21, Mahendra Singh  wrote:
> >
> >
> > Thanks for the re-based patches.
> >
> > On the top of v35 patch, I can see one compilation warning.
> >>
> >> parallel.c: In function ‘LaunchParallelWorkers’:
> >> parallel.c:502:2: warning: ISO C90 forbids mixed declarations and code
> [-Wdeclaration-after-statement]
> >>   int   i;
> >>   ^
> >
> >
> > Above warning is due to one extra semicolon added at the end of
> declaration line in v35-0003 patch. Please fix this in next version.
> > +   int nworkers_to_launch = Min(nworkers, pcxt->nworkers);;
>
> Thanks. I will fix it in the next version patch.
>
> >
> > I will continue my testing on the top of v35 patch set and will post
> results.
>

While reviewing v35 patch set and doing testing, I found that if we disable
leader participation, then we are launching 1 less parallel worker than
total number of indexes. (I am using max_parallel_workers = 20,
max_parallel_maintenance_workers = 20)

For example: If table have 3 indexes and we gave 6 parallel vacuum
degree(leader participation is disabled), then I think, we should launch 3
parallel workers but we are launching 2 workers due to below check.
+   nworkers = lps->nindexes_parallel_bulkdel - 1;
+
+   /* Cap by the worker we computed at the beginning of parallel lazy
vacuum */
+   nworkers = Min(nworkers, lps->pcxt->nworkers);

Please let me know your thoughts for this.

Thanks and Regards
Mahendra Thalor
EnterpriseDB: http://www.enterprisedb.com


Re: Planner chose a much slower plan in hashjoin, using a large table as the inner table.

2019-11-28 Thread Andy Fan
On Fri, Nov 22, 2019 at 6:51 PM Jinbao Chen  wrote:

> Hi hackers,
>
> I have made a patch to fix the problem.
>
> Added the selection rate of the inner table non-empty bucket
>
> The planner will use big table as inner table in hash join
> if small table have fewer unique values. But this plan is
> much slower than using small table as inner table.
>
> In general, the cost of creating a hash table is higher
> than the cost of querying a hash table. So we tend to use
> small tables as internal tables. But if the average chain
> length of the bucket is large, the situation is just the
> opposite.
>
> If virtualbuckets is much larger than innerndistinct, and
> outerndistinct is much larger than innerndistinct. Then most
> tuples of the outer table will match the empty bucket. So when
> we calculate the cost of traversing the bucket, we need to
> ignore the tuple matching empty bucket.
>
> So we add the selection rate of the inner table non-empty
> bucket. The formula is:
> (1 - ((outerndistinct - innerndistinct)/outerndistinct)*
> ((virtualbuckets - innerndistinct)/virtualbuckets))
>
>
> On Tue, Nov 19, 2019 at 5:56 PM Jinbao Chen  wrote:
>
>> I think we have the same understanding of this issue.
>>
>> Sometimes use smaller costs on scanning the chain in bucket like below
>> would
>> be better.
>> run_cost += outer_path_rows * some_small_probe_cost;
>> run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
>> In some version of GreenPlum(a database based on postgres), we just
>> disabled
>> the cost on scanning the bucket chain. In most cases, this can get a
>> better query
>> plan. But I am worried that it will be worse in some cases.
>>
>> Now only the small table's distinct value is much smaller than the bucket
>> number,
>> and much smaller than the distinct value of the large table, the planner
>> will get the
>> wrong plan.
>>
>> For example, if inner table has 100 distinct values, and 3000 rows. Hash
>> table
>> has 1000 buckets. Outer table has 1 distinct values.
>> We can assume that all the 100 distinct values of the inner table are
>> included in the
>> 1 distinct values of the outer table. So (100/1)*outer_rows
>> tuples will
>> probe the buckets has chain. And (9900/1)*outer_rows tuples will probe
>> all the 1000 buckets randomly. So (9900/1)*outer_rows*(900/1000)
>> tuples will
>> probe empty buckets. So the costs on scanning bucket chain is
>>
>> hash_qual_cost.per_tuple*innerbucketsize*outer_rows*
>> (1 - ((outer_distinct - inner_distinct)/outer_distinct)*((buckets_num -
>> inner_disttinct)/buckets_num))
>>
>> Do you think this assumption is reasonable?
>>
>>
>> On Tue, Nov 19, 2019 at 3:46 PM Thomas Munro 
>> wrote:
>>
>>> On Mon, Nov 18, 2019 at 7:48 PM Jinbao Chen  wrote:
>>> > In the test case above, the small table has 3000 tuples and 100
>>> distinct values on column ‘a’.
>>> > If we use small table as inner table.  The chan length of the bucket
>>> is 30. And we need to
>>> > search the whole chain on probing the hash table. So the cost of
>>> probing is bigger than build
>>> > hash table, and we need to use big table as inner.
>>> >
>>> > But in fact this is not true. We initialized 620,000 buckets in
>>> hashtable. But only 100 buckets
>>> > has chains with length 30. Other buckets are empty. Only hash values
>>> need to be compared.
>>> > Its costs are very small. We have 100,000 distinct key and 100,000,000
>>> tuple on outer table.
>>> > Only (100/10)* tuple_num tuples will search the whole chain. The
>>> other tuples
>>> > (number = (98900/10)*tuple_num*) in outer
>>> > table just compare with the hash value. So the actual cost is much
>>> smaller than the planner
>>> > calculated. This is the reason why using a small table as inner is
>>> faster.
>>>
>>> So basically we think that if t_big is on the outer side, we'll do
>>> 100,000,000 probes and each one is going to scan a t_small bucket with
>>> chain length 30, so that looks really expensive.  Actually only a
>>> small percentage of its probes find tuples with the right hash value,
>>> but final_cost_hash_join() doesn't know that.  So we hash t_big
>>> instead, which we estimated pretty well and it finishes up with
>>> buckets of length 1,000 (which is actually fine in this case, they're
>>> not unwanted hash collisions, they're duplicate keys that we need to
>>> emit) and we probe them 3,000 times (which is also fine in this case),
>>> but we had to do a bunch of memory allocation and/or batch file IO and
>>> that turns out to be slower.
>>>
>>> I am not at all sure about this but I wonder if it would be better to
>>> use something like:
>>>
>>>   run_cost += outer_path_rows * some_small_probe_cost;
>>>   run_cost += hash_qual_cost.per_tuple * approximate_tuple_count();
>>>
>>> If we can estimate how many tuples will actually match accurately,
>>> that should also be the number of times we have to run the quals,
>>> since we don't usually expect hash collisions (bucket collisions, yes,

Re: Why JIT speed improvement is so modest?

2019-11-28 Thread Konstantin Knizhnik

Hi,

On 28.11.2019 10:36, guangnan he wrote:

Hi,
you mean if we don't add new compiler options the compiler will do the 
loop unrolling using SIMD automatically?


Yes, most modern compilers do it.
GCC requires the -O3 option (-O2 is not enough), but clang does it even
with -O2.


But Postgres uses the more sophisticated Youngs-Cramer algorithm for
calculating SUM/AVG aggregates, and here SIMD instructions do not help much.
My original assumption was that the huge difference in speed between
VOPS/ISPRAS JIT and vanilla JIT could be explained by the difference in
the accumulation algorithm.


This is why I implemented the calculation of AVG in VOPS using the
Youngs-Cramer algorithm.
And it certainly affects performance: Q1 with SUM aggregates is executed by
VOPS almost three times faster than with AVG aggregates (700 msec vs.
2000 msec).
But even with the Youngs-Cramer algorithm, VOPS is 6 times faster than
standard Postgres with JIT and 5 times faster than my in-memory storage.
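
For reference, a minimal standalone sketch of the Youngs-Cramer
accumulation that float8_accum follows (running N, Sx and Sxx); this is an
illustration of the algorithm, not the actual aggregate transition function:

    #include <stdio.h>

    typedef struct { double N, Sx, Sxx; } YCState;

    static void
    yc_accum(YCState *st, double newval)
    {
        double      n = st->N + 1.0;
        double      Sx = st->Sx + newval;

        if (st->N > 0.0)
        {
            double      tmp = newval * n - Sx;

            /* running sum of squared deviations, updated incrementally */
            st->Sxx += tmp * tmp / (n * st->N);
        }
        st->N = n;
        st->Sx = Sx;
    }

    int
    main(void)
    {
        YCState     st = {0.0, 0.0, 0.0};
        double      vals[] = {1.0, 2.0, 3.0, 4.0};

        for (int i = 0; i < 4; i++)
            yc_accum(&st, vals[i]);

        /* expected: avg = 2.5, sum = 10, Sxx = 5 */
        printf("avg = %g, sum = %g, Sxx = %g\n", st.Sx / st.N, st.Sx, st.Sxx);
        return 0;
    }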


Beside the function calls, cache miss etc, for VOPS I think the call 
stack is squeezing too, but the JIT optimize still process rows one by 
one.
If we do not take into account the overhead of heap traversal and tuple
packing, then the amount of calculation doesn't depend on the data model,
whether it is vertical or horizontal.
By implementing an in-memory storage which just keeps unpacked tuples in an
L2 list in the backend's private memory, and so doesn't spend time on
unpacking or visibility checks,
I wanted to exclude this overhead and reach almost the same speed as VOPS.
But that doesn't happen.





Konstantin Knizhnik wrote on Thu, Nov 28, 2019 at 3:08 PM:




On 27.11.2019 19:05, Tomas Vondra wrote:
> On Wed, Nov 27, 2019 at 06:38:45PM +0300, Konstantin Knizhnik wrote:
>>
>>
>> On 25.11.2019 18:24, Merlin Moncure wrote:
>>> On Mon, Nov 25, 2019 at 9:09 AM Konstantin Knizhnik
>>> mailto:k.knizh...@postgrespro.ru>>
wrote:
 JIT was not able to significantly (times) increase speed on
Q1 query?
 Experiment with VOPS shows that used aggregation algorithm
itself
 is not
 a bottleneck.
 Profile also give no answer for this question.
 Any ideas?
>>> Well, in the VOPS variant around 2/3 of the time is spent in
routines
>>> that are obviously aggregation.  In the JIT version, it's
around 20%.
>>> So this suggests that the replacement execution engine is more
>>> invasive.  I would also guess (!) that the VOPS engine
optimizes fewer
>>> classes of query plan.   ExecScan for example, looks to be
completely
>>> optimized out VOPS but is still utilized in the JIT engine.
>>
>> The difference in fraction of time spent in aggregate
calculation is
>> not so large (2 times vs. 10 times).
>> I suspected that a lot of time is spent in relation traversal
code,
>> tuple unpacking and visibility checks.
>> To check this hypothesis I have implement in-memory table access
>> method which stores tuples in unpacked form and
>> doesn't perform any visibility checks at all.
>> Results were not so existed. I have to disable parallel execution
>> (because it is not possible for tuples stored in backend private
>> memory).
>> Results are the following:
>>
>> lineitem:   13736 msec
>> inmem_lineitem:  10044 msec
>> vops_lineitem:    1945 msec
>>
>> The profile of inmem_lineitem is the following:
>>
>>   16.79%  postgres  postgres [.] float4_accum
>>   12.86%  postgres  postgres [.] float8_accum
>>    5.83%  postgres  postgres [.]
TupleHashTableHash.isra.8
>>    4.44%  postgres  postgres [.] lookup_hash_entries
>>    3.37%  postgres  postgres [.] check_float8_array
>>    3.11%  postgres  postgres [.] tuplehash_insert
>>    2.91%  postgres  postgres [.] hash_uint32
>>    2.83%  postgres  postgres [.] ExecScan
>>    2.56%  postgres  postgres [.] inmem_getnextslot
>>    2.22%  postgres  postgres [.] FunctionCall1Coll
>>    2.14%  postgres  postgres [.] LookupTupleHashEntry
>>    1.95%  postgres  postgres [.]
TupleHashTableMatch.isra.9
>>    1.76%  postgres  postgres [.] pg_detoast_datum
>>    1.58%  postgres  postgres [.] AggCheckCallContext
>>    1.57%  postgres  postgres [.] tts_minimal_clear
>>    1.35%  postgres  perf-3054.map    [.] 0x7f558db60010
>>    1.23%  postgres  postgres [.] fetch_input_tuple
>>    1.15%  postgres  postgres [.] SeqNext
>>    1.06%  postgres  postgres [.] ExecAgg
>>    1.00%  postgres  postgres [.]
tts_minimal_store_tuple
>>
>> So now fraction of time spent in aggregation is 

Yet another vectorized engine

2019-11-28 Thread Hubert Zhang
Hi hackers,

We would like to introduce another POC of a vectorized execution engine,
https://github.com/zhangh43/vectorize_engine, and get some feedback on the
idea.

The basic idea is to extend TupleTableSlot and introduce
VectorTupleTableSlot, which holds an array of datums organized by projected
column.  The array of datums for each column is contiguous in memory. This
makes expression evaluation cache friendly and allows SIMD to be utilized.
We have currently refactored SeqScanNode and AggNode to support
VectorTupleTableSlot; a rough sketch of the slot layout follows.
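
The sketch below is only an illustration of the layout described above, not
the actual definition from the extension (the batch size, field names and
null representation are assumptions):

    #include "postgres.h"
    #include "executor/tuptable.h"

    #define VECTOR_BATCH_SIZE 1024

    typedef struct VectorColumn
    {
        Datum       values[VECTOR_BATCH_SIZE]; /* contiguous per-column data */
        bool        isnull[VECTOR_BATCH_SIZE];
    } VectorColumn;

    typedef struct VectorTupleTableSlot
    {
        TupleTableSlot  base;       /* standard slot header */
        int             dim;        /* number of tuples in the current batch */
        VectorColumn   *columns;    /* one entry per projected column */
    } VectorTupleTableSlot;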

Below are features in our design.
1. Pure extension. We don't hack any code into the Postgres kernel.

2. CustomScan node. We use the CustomScan framework to replace the original
executor nodes such as SeqScan, Agg, etc. Based on CustomScan, we extend the
CustomScanState, BeginCustomScan(), ExecCustomScan() and EndCustomScan()
interfaces to implement the vectorized executor logic.

3. Post-planner hook. After the plan is generated, we use plan_tree_walker
to traverse the plan tree and check whether it can be vectorized. If yes,
the non-vectorized nodes (SeqScan, Agg, etc.) are replaced with vectorized
nodes (in the form of CustomScan nodes) and the vectorized executor is used
(see the sketch after this list). If no, we revert to the original plan and
use the non-vectorized executor. In future this part could be enhanced; for
example, instead of reverting to the original plan when some nodes cannot
be vectorized, we could add Batch/UnBatch nodes to generate a plan
containing both vectorized and non-vectorized nodes.

4. Support for implementing new vectorized executor nodes gradually. We
currently vectorize only SeqScan and Agg, but other queries, including
those with joins, can still be run when the vectorize extension is enabled.

5. Inherit the original executor code. Instead of rewriting the whole
executor, we chose a smoother method: modify the current Postgres executor
nodes and make them vectorized. We copy the current executor node's C file
into our extension and add the vectorized logic based on it. When Postgres
enhances its executor, we can relatively easily merge the changes back. We
would like to know whether this is a good way to write a vectorized
executor extension.

6. Pluggable storage. Postgres now supports pluggable storage, and
TupleTableSlot has been refactored around the abstract struct
TupleTableSlotOps.  VectorTupleTableSlot could be implemented under this
framework when we upgrade the extension to the latest PG.
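
As referenced in item 3 above, here is a minimal sketch of the post-planner
hook shape; the vectorization check and the node-replacement helper are
hypothetical names, not functions from the extension:

    #include "postgres.h"
    #include "nodes/params.h"
    #include "optimizer/planner.h"

    static planner_hook_type prev_planner_hook = NULL;

    static PlannedStmt *
    vectorize_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
    {
        PlannedStmt *stmt = prev_planner_hook
            ? prev_planner_hook(parse, cursorOptions, boundParams)
            : standard_planner(parse, cursorOptions, boundParams);

        /* plan_is_vectorizable() and replace_with_custom_scans() are
         * placeholders for the plan-tree walker and the CustomScan
         * substitution described above */
        if (plan_is_vectorizable(stmt->planTree))
            stmt->planTree = replace_with_custom_scans(stmt->planTree);

        return stmt;
    }

    void
    _PG_init(void)
    {
        prev_planner_hook = planner_hook;
        planner_hook = vectorize_planner;
    }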

We ran the TPC-H (10G) benchmark, and the result for Q1 is 50 sec (PG) vs.
28 sec (vectorized PG). The performance gain could be improved further:

1. Heap tuple deforming occupies a lot of CPU. We will try zedstore in the
future, since a vectorized executor is more compatible with a column store.

2. The vectorized agg is not fully vectorized and we still have many
optimizations to do, for example batch-computing the hash values and
optimizing the hash table for vectorized HashAgg.

3. The conversion cost from Datum to the actual type and vice versa is also
high, for example DatumGetFloat4 & Float4GetDatum. One optimization may be
to store the actual type in VectorTupleTableSlot directly, instead of an
array of datums.

Related work:
1. VOPS is a vectorized execution extension. Link:
https://github.com/postgrespro/vops.
It doesn't use the custom scan framework and uses UDFs to do the vectorized
operations, i.e. it changes the SQL syntax to do aggregation.

2. The Citus vectorized executor is another POC. Link:
https://github.com/citusdata/postgres_vectorization_test.
It uses ExecutorRun_hook to run the vectorized executor and uses cstore_fdw
to support column storage.

Note that the vectorized executor engine is based on PG9.6 now, but it
could be ported to master / zedstore with some effort.  We would appreciate
some feedback before moving further in that direction.

Thanks,
Hubert Zhang, Gang Xiong, Ning Yu, Asim Praveen


Re: [Incident report]Backend process crashed when executing 2pc transaction

2019-11-28 Thread Amit Langote
Hi Marco,

On Thu, Nov 28, 2019 at 5:02 PM Marco Slot  wrote:
>
> On Thu, Nov 28, 2019 at 6:18 AM Amit Langote  wrote:
> > Interesting.  Still, I think you'd be in better position than anyone
> > else to come up with reproduction steps for vanilla PostgreSQL by
> > analyzing the stack trace if and when the crash next occurs (or using
> > the existing core dump).  It's hard to tell by only guessing what may
> > have gone wrong when there is external code involved, especially
> > something like Citus that hooks into many points within vanilla
> > PostgreSQL.
>
> To clarify: In a Citus cluster you typically have a coordinator which
> contains the "distributed tables" and one or more workers which
> contain the data. All are PostgreSQL servers with the citus extension.
> The coordinator uses every available hook in PostgreSQL to make the
> distributed tables behave like regular tables. Any crash on the
> coordinator is likely to be attributable to Citus, because most of the
> code that is exercised is Citus code. The workers are used as regular
> PostgreSQL servers with the coordinator acting as a regular client. On
> the worker, the ProcessUtility hook will just pass on the arguments to
> standard_ProcessUtility without any processing. The crash happened on
> a worker.

Thanks for clarifying.

> One interesting thing is the prepared transaction name generated by
> the coordinator, which follows the form: citus_ id>___ number in session>. The server-wide transaction number is a 64-bit
> counter that is kept in shared memory and starts at 1. That means that
> over 4 billion (4207001212) transactions happened on the coordinator
> since the server started, which quite possibly resulted in 4 billion
> prepared transactions on this particular server. I'm wondering if some
> counter is overflowing.

Interesting.  This does kind of get us closer to figuring out what
might have gone wrong, but it's hard to tell without the core dump at hand.

Thanks,
Amit




Re: format of pg_upgrade loadable_libraries warning

2019-11-28 Thread Daniel Gustafsson
> On 28 Nov 2019, at 02:26, Bruce Momjian  wrote:

> I have applied the patch, with improved wording.  I only applied this to
> PG 13 since I was worried old tools might be checking for the old error
> text.  Should this be backpatched more?

I don't think it's unreasonable to assume that there is a lot of in-house
tooling for pg_upgrade orchestration which greps for specific messages.
Stopping at 13 seems perfectly reasonable for this change.

cheers ./daniel




Re: Implementing Incremental View Maintenance

2019-11-28 Thread Yugo Nagata
On Thu, 28 Nov 2019 11:26:40 +0900 (JST)
Tatsuo Ishii  wrote:

> > Note that this is the last patch in the series of IVM patches: now we
> > would like focus on blushing up the patches, rather than adding new
> > SQL support to IVM, so that the patch is merged into PostgreSQL 13
> > (hopefully). We are very welcome reviews, comments on the patch.
> > 
> > BTW, the SGML docs in the patch is very poor at this point. I am going
> > to add more descriptions to the doc.
> 
> As promised, I have created the doc (CREATE MATERIALIZED VIEW manual)
> patch.

-  because the triggers will be invoked.
+  because the triggers will be invoked. We call this form of materialized
+  view as "Incremantal materialized View Maintenance" (IVM).

This part seems incorrect to me.  Incremental (materialized) View
Maintenance (IVM) is a way to maintain materialized views; it is not a term
that refers to the views being maintained.

However, it would be useful to have a term for views which can be
maintained using IVM. Off the top of my head, we could call them
Incrementally Maintainable Views (IMVs), but this might be confusable with
IVM, so I'll think about it a little more.

Regards,
Yugo Nagata

> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp


-- 
Yugo Nagata 




Re: [Incident report]Backend process crashed when executing 2pc transaction

2019-11-28 Thread Marco Slot
On Thu, Nov 28, 2019 at 6:18 AM Amit Langote  wrote:
> Interesting.  Still, I think you'd be in better position than anyone
> else to come up with reproduction steps for vanilla PostgreSQL by
> analyzing the stack trace if and when the crash next occurs (or using
> the existing core dump).  It's hard to tell by only guessing what may
> have gone wrong when there is external code involved, especially
> something like Citus that hooks into many points within vanilla
> PostgreSQL.

To clarify: In a Citus cluster you typically have a coordinator which
contains the "distributed tables" and one or more workers which
contain the data. All are PostgreSQL servers with the citus extension.
The coordinator uses every available hook in PostgreSQL to make the
distributed tables behave like regular tables. Any crash on the
coordinator is likely to be attributable to Citus, because most of the
code that is exercised is Citus code. The workers are used as regular
PostgreSQL servers with the coordinator acting as a regular client. On
the worker, the ProcessUtility hook will just pass on the arguments to
standard_ProcessUtility without any processing. The crash happened on
a worker.

One interesting thing is the prepared transaction name generated by
the coordinator, which follows the form: citus. The server-wide transaction number is a 64-bit
counter that is kept in shared memory and starts at 1. That means that
over 4 billion (4207001212) transactions happened on the coordinator
since the server started, which quite possibly resulted in 4 billion
prepared transactions on this particular server. I'm wondering if some
counter is overflowing.
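
Purely as an illustration of where a narrower counter could bite: the
reported value is above the signed 32-bit range but still within the
unsigned 32-bit range, so any code path that funnels it through a signed
32-bit integer would wrap negative. The types and checks below are
assumptions for illustration, not Citus's actual code:

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    int
    main(void)
    {
        uint64_t    txnum = UINT64_C(4207001212); /* value from the report */

        printf("fits in uint32: %s\n", txnum <= UINT32_MAX ? "yes" : "no");
        printf("fits in int32:  %s\n", txnum <= INT32_MAX ? "yes" : "no");
        printf("as int32: %" PRId32 "\n", (int32_t) txnum); /* wraps negative */
        return 0;
    }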

cheers,
Marco




Re: [HACKERS] Block level parallel vacuum

2019-11-28 Thread Masahiko Sawada
On Wed, 27 Nov 2019 at 19:21, Mahendra Singh  wrote:
>
>
> Thanks for the re-based patches.
>
> On the top of v35 patch, I can see one compilation warning.
>>
>> parallel.c: In function ‘LaunchParallelWorkers’:
>> parallel.c:502:2: warning: ISO C90 forbids mixed declarations and code 
>> [-Wdeclaration-after-statement]
>>   int   i;
>>   ^
>
>
> Above warning is due to one extra semicolon added at the end of declaration 
> line in v35-0003 patch. Please fix this in next version.
> +   int nworkers_to_launch = Min(nworkers, pcxt->nworkers);;

Thanks. I will fix it in the next version patch.

>
> I will continue my testing on the top of v35 patch set and will post results.

Thank you!

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services