Re: Why are wait events not reported even though it reads/writes a timeline history file?

2020-04-30 Thread Fujii Masao



On 2020/05/01 10:07, Masahiro Ikeda wrote:

On 2020-05-01 00:25, Fujii Masao wrote:

On 2020/04/28 17:42, Masahiro Ikeda wrote:

On 2020-04-28 15:09, Michael Paquier wrote:

On Tue, Apr 28, 2020 at 02:49:00PM +0900, Fujii Masao wrote:

Isn't it safer to report the wait event during fgets() rather than putting
those calls around the whole loop, like other code does? For example,
writeTimeLineHistory() reports the wait event during read() rather than
the whole loop.


Yeah, I/O wait events should be taken only during the duration of the
system calls.  Particularly here, you may finish with an elog() that
causes the wait event to be set longer than it should, leading to a
rather incorrect state if a snapshot of pg_stat_activity is taken.
--


Thanks for your comments.

I fixed it to report the wait event during fgets() only.
Please review the v2 patch I attached.


Thanks for updating the patch! Here are the review comments from me.

+    char   *result;
+    pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_READ);
+    result = fgets(fline, sizeof(fline), fd);
+    pgstat_report_wait_end();
+    if (result == NULL)
+    break;
+
 /* skip leading whitespace and check for # comment */
 char   *ptr;

Since the variable name "result" has already been used in this function,
it should be renamed.


Sorry for that.

I thought about renaming it, but I changed it to use feof()
to clarify the difference from ferror().



The code should not be injected into the variable declaration block.


Thanks for the comment.
I moved the code block after the variable declaration block.



When reading this patch, I found that an I/O error in fgets() is not
checked there. Though this is not the issue that you reported,
it seems better to fix it together. So what about adding
the following code?

if (ferror(fd))
    ereport(ERROR,
    (errcode_for_file_access(),
 errmsg("could not read file \"%s\": %m", path)));


Thanks, I agree with your comment.
I added the above code to the v3 patch I attached.


Thanks for updating the patch! It looks good to me.

I applied cosmetic changes to the patch (attached). Barring any objection,
I will push this patch (and also back-patch it to v10, where the wait events
for timeline history files were added).

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/src/backend/access/transam/timeline.c 
b/src/backend/access/transam/timeline.c
index de57d699af..aeaaf44a4a 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -78,7 +78,6 @@ readTimeLineHistory(TimeLineID targetTLI)
List   *result;
charpath[MAXPGPATH];
charhistfname[MAXFNAMELEN];
-   charfline[MAXPGPATH];
FILE   *fd;
TimeLineHistoryEntry *entry;
TimeLineID  lasttli = 0;
@@ -123,15 +122,27 @@ readTimeLineHistory(TimeLineID targetTLI)
 * Parse the file...
 */
prevend = InvalidXLogRecPtr;
-   while (fgets(fline, sizeof(fline), fd) != NULL)
+   for (;;)
{
-   /* skip leading whitespace and check for # comment */
+   charfline[MAXPGPATH];
+   char   *res;
char   *ptr;
TimeLineID  tli;
uint32  switchpoint_hi;
uint32  switchpoint_lo;
int nfields;
 
+   pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_READ);
+   res = fgets(fline, sizeof(fline), fd);
+   pgstat_report_wait_end();
+   if (ferror(fd))
+   ereport(ERROR,
+   (errcode_for_file_access(),
+errmsg("could not read file \"%s\": 
%m", path)));
+   if (res == NULL)
+   break;
+
+   /* skip leading whitespace and check for # comment */
for (ptr = fline; *ptr; ptr++)
{
if (!isspace((unsigned char) *ptr))
@@ -393,6 +404,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 
nbytes = strlen(buffer);
errno = 0;
+   pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_WRITE);
if ((int) write(fd, buffer, nbytes) != nbytes)
{
int save_errno = errno;
@@ -408,6 +420,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
(errcode_for_file_access(),
 errmsg("could not write to file \"%s\": %m", 
tmppath)));
}
+   pgstat_report_wait_end();
 
pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_SYNC);
if (pg_fsync(fd) != 0)


Re: SLRU statistics

2020-04-30 Thread Fujii Masao




On 2020/05/01 3:19, Tomas Vondra wrote:

On Fri, May 01, 2020 at 03:02:59AM +0900, Fujii Masao wrote:



On 2020/04/02 9:41, Tomas Vondra wrote:

Hi,

I've pushed this after some minor cleanup and improvements.


+static char *slru_names[] = {"async", "clog", "commit_timestamp",
+  "multixact_offset", "multixact_member",
+  "oldserxid", "pg_xact", "subtrans",
+  "other" /* has to be last */};

When I tried pg_stat_slru, I found that it returns a row for "pg_xact".
But since there is no "pg_xact" SLRU (the "clog" SLRU exists instead),
shouldn't "pg_xact" be removed? Patch attached.



Yeah, I think I got confused and accidentally added both "clog" and
"pg_xact". I'll get "pg_xact" removed.


Thanks!

Another thing I found: shouldn't pgstat_send_slru() also be called by
processes other than backends? For example, since clog data is basically
flushed by the checkpointer, the checkpointer seems to need to send SLRU stats.
Otherwise, pg_stat_slru.flushes would not be updated.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Binary search in ScalarArrayOpExpr for OR'd constant arrays

2020-04-30 Thread James Coleman
On Tue, Apr 28, 2020 at 8:25 AM Tomas Vondra
 wrote:
...
> Any particular reasons to pick dynahash over simplehash? ISTM we're
> using simplehash elsewhere in the executor (grouping, tidbitmap, ...),
> while there are not many places using dynahash for simple short-lived
> hash tables. Of course, that alone is a weak reason to insist on using
> simplehash here, but I suppose there were reasons for not using dynahash
> and we'll end up facing the same issues here.

I've attached a patch series that includes switching to simplehash.
Obviously we'd really just collapse all of these patches, but it's
perhaps interesting to see the changes required to use each style
(binary search, dynahash, simplehash).

As before, there are clearly comments and naming things to be
addressed, but the implementation should be reasonably clean.
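
As a rough, self-contained sketch of the binary-search idea only (this is not
an excerpt of the attached patches; plain ints stand in for Datums, and the
comparator stands in for the type's real comparison support function):

/* Toy sketch: sort the IN-list constants once, then bsearch per row. */
#include <stdbool.h>
#include <stdlib.h>

static int
cmp_int(const void *a, const void *b)
{
	int			ia = *(const int *) a;
	int			ib = *(const int *) b;

	return (ia > ib) - (ia < ib);
}

/* one-time setup, analogous to what ExecInitExprRec() does for the step */
static void
setup_constants(int *consts, size_t nconsts)
{
	qsort(consts, nconsts, sizeof(int), cmp_int);
}

/* per-row evaluation, replacing the linear scan over the array */
static bool
scalar_in_array(int value, const int *consts, size_t nconsts)
{
	return bsearch(&value, consts, nconsts, sizeof(int), cmp_int) != NULL;
}

The dynahash and simplehash variants replace the sorted array with a hash
table built over the constants once and probed per row.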

James
From d455a57701df064f96e5f2a3be2b1c736bacc485 Mon Sep 17 00:00:00 2001
From: James Coleman 
Date: Tue, 28 Apr 2020 21:33:12 -0400
Subject: [PATCH v3 3/3] Try simple hash

---
 src/backend/executor/execExpr.c   | 24 +++
 src/backend/executor/execExprInterp.c | 99 +--
 src/include/executor/execExpr.h   | 26 ++-
 3 files changed, 100 insertions(+), 49 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 6249db5426..37c1669153 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -950,13 +950,10 @@ ExecInitExprRec(Expr *node, ExprState *state,
 			{
 ScalarArrayOpExpr *opexpr = (ScalarArrayOpExpr *) node;
 Oid			func;
-Oid			hash_func;
 Expr	   *scalararg;
 Expr	   *arrayarg;
 FmgrInfo   *finfo;
 FunctionCallInfo fcinfo;
-FmgrInfo   *hash_finfo;
-FunctionCallInfo hash_fcinfo;
 AclResult	aclresult;
 bool		useBinarySearch = false;
 
@@ -996,18 +993,20 @@ ExecInitExprRec(Expr *node, ExprState *state,
 	nitems = ArrayGetNItems(ARR_NDIM(arr), ARR_DIMS(arr));
 	if (nitems >= MIN_ARRAY_SIZE_FOR_BINARY_SEARCH)
 	{
+		Oid			hash_func;
+
 		/*
-		 * Find the hash op that matches the originally
-		 * planned equality op.
+		 * Find the hash op that matches the originally planned
+		 * equality op. If we don't have one, we'll just fall
+		 * back to the default linear scan implementation.
 		 */
 		useBinarySearch = get_op_hash_functions(opexpr->opno, NULL, &hash_func);
 
-		/*
-		 * But we may not have one, so fall back to the
-		 * default implementation if necessary.
-		 */
 		if (useBinarySearch)
 		{
+			FmgrInfo   *hash_finfo;
+			FunctionCallInfo hash_fcinfo;
+
 			hash_finfo = palloc0(sizeof(FmgrInfo));
 			hash_fcinfo = palloc0(SizeForFunctionCallInfo(2));
 			fmgr_info(hash_func, hash_finfo);
@@ -1015,6 +1014,10 @@ ExecInitExprRec(Expr *node, ExprState *state,
 			InitFunctionCallInfoData(*hash_fcinfo, hash_finfo, 2,
 	 opexpr->inputcollid, NULL, NULL);
 			InvokeFunctionExecuteHook(hash_func);
+
+			scratch.d.scalararraybinsearchop.hash_finfo = hash_finfo;
+			scratch.d.scalararraybinsearchop.hash_fcinfo_data = hash_fcinfo;
+			scratch.d.scalararraybinsearchop.hash_fn_addr = hash_finfo->fn_addr;
 		}
 	}
 }
@@ -1044,9 +1047,6 @@ ExecInitExprRec(Expr *node, ExprState *state,
 	scratch.d.scalararraybinsearchop.finfo = finfo;
 	scratch.d.scalararraybinsearchop.fcinfo_data = fcinfo;
 	scratch.d.scalararraybinsearchop.fn_addr = finfo->fn_addr;
-	scratch.d.scalararraybinsearchop.hash_finfo = hash_finfo;
-	scratch.d.scalararraybinsearchop.hash_fcinfo_data = hash_fcinfo;
-	scratch.d.scalararraybinsearchop.hash_fn_addr = hash_finfo->fn_addr;
 }
 else
 {
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index e54e807c6b..aa7f5ae0df 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -178,6 +178,27 @@ ExecAggPlainTransByRef(AggState *aggstate, AggStatePerTrans pertrans,
 	   AggStatePerGroup pergroup,
 	   ExprContext *aggcontext, int setno);
 
+static bool
+element_match(struct saophash_hash *tb, Datum key1, Datum key2);
+static uint32 element_hash(struct saophash_hash *tb, Datum key);
+
+/*
+ * Define parameters for ScalarArrayOpExpr hash table code generation. The interface is
+ * *also* declared in execnodes.h (to generate the types, which are externally
+ * visible).
+ */
+#define SH_PREFIX saophash
+#define SH_ELEMENT_TYPE ScalarArrayOpExprHashEntryData
+#define SH_KEY_TYPE Datum
+#define SH_KEY key
+#define SH_HASH_KEY(tb, key) element_hash(tb, key)
+#define SH_EQUAL(tb, a, b) element_match(tb, a, b)
+#define SH_SCOPE extern
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) a->hash
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
 /*
  * Prepare ExprState for interpreted execution.
  */
@@ -3591,8 +3612,6 @@ ExecEvalScalarArrayOp(ExprState *state, ExprEvalStep *op)
 	

Comment simplehash/dynahash trade-offs

2020-04-30 Thread James Coleman
In another thread [1] I'd mused that "there might be some value in a
README or comments
addition that would be a guide to what the various hash
implementations are useful for...so that we have something to make the
code base a bit more discoverable."

I'd solicited feedback from Andres (as the author of the simplehash
implementation) and gotten further explanation from Tomas (both cc'd
here) and have tried to condense that into the comment changes in this
patch series.

v1-0001-Summarize-trade-offs-between-simplehash-and-dynah.patch
Contains the summaries mentioned above.

v1-0002-Improve-simplehash-usage-notes.patch
I'd noticed while adding a simplehash implementation in that other
thread that the following facts:
- The element type requires a status member.
- Why storing a hash in the element type is useful (i.e., when to
define SH_STORE_HASH/SH_GET_HASH).
- The availability of a private_data member for metadata access from callbacks.
are either undocumented or hard to discover, and so I've added the
information directly to the usage notes section (a minimal sketch touching
all three points follows below).
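
As a concrete, hypothetical illustration of all three points (this is not part
of the patches in this thread; the names are made up, and it assumes PG13-era
headers such as common/hashfn.h for murmurhash32), a minimal instantiation
might look like this:

/* Hypothetical simplehash instantiation: Oid key -> counter. */
#include "postgres.h"
#include "common/hashfn.h"

typedef struct DemoEntry
{
	Oid			key;			/* hash key */
	uint64		count;			/* payload */
	uint32		hash;			/* cached hash value, see SH_GET_HASH */
	char		status;			/* required by simplehash (EMPTY/IN_USE) */
} DemoEntry;

#define SH_PREFIX demo
#define SH_ELEMENT_TYPE DemoEntry
#define SH_KEY_TYPE Oid
#define SH_KEY key
#define SH_HASH_KEY(tb, key) murmurhash32((uint32) key)
#define SH_EQUAL(tb, a, b) ((a) == (b))
#define SH_STORE_HASH
#define SH_GET_HASH(tb, a) a->hash
#define SH_SCOPE static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

/* private_data lets SH_HASH_KEY/SH_EQUAL (and friends) reach caller state */
static demo_hash *
make_demo_table(MemoryContext cxt, void *caller_state)
{
	/* caller_state is stored as tb->private_data */
	return demo_create(cxt, 128, caller_state);
}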

v1-0003-Show-sample-simplehash-method-signatures.patch
I find it hard to read the macro code "templating" particularly for
seeing what the available API is and so added sample method signatures
in comments to the macro generated method signature defines.

James

[1]: 
https://www.postgresql.org/message-id/CAAaqYe956a-zbm1qR8pqz%3DiLbF8LW5vBrQGrzXVHXdLk3at5_Q%40mail.gmail.com
From 36f6b670b30194efb4b7d0e622afc3085ecd49b6 Mon Sep 17 00:00:00 2001
From: James Coleman 
Date: Thu, 30 Apr 2020 21:39:04 -0400
Subject: [PATCH v1 2/3] Improve simplehash usage notes

---
 src/include/lib/simplehash.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index e5101f2f1d..8daf7c4d1a 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -48,6 +48,19 @@
  *	  - SH_STORE_HASH - if defined the hash is stored in the elements
  *	  - SH_GET_HASH(tb, a) - return the field to store the hash in
  *
+ *The element type is required to contain a "uint32 status" member.
+ *
+ *While SH_STORE_HASH (and subsequently SH_GET_HASH) are optional, because
+ *the hash table implementation needs to compare hashes to move elements
+ *(particularly when growing the hash), it's preferable, if possible, to
+ *store the element's hash in the element's data type. If the hash is so
+ *stored, the hash table will also compare hashes before calling SH_EQUAL
+ *when comparing two keys.
+ *
+ *For convenience the hash table create functions accept a void pointer
+ *that will be stored in the hash table type's member private_data. This
+ *allows callbacks to reference caller provided data.
+ *
  *	  For examples of usage look at tidbitmap.c (file local definition) and
  *	  execnodes.h/execGrouping.c (exposed declaration, file local
  *	  implementation).
-- 
2.17.1

From 9116329e51419cb71f00d31f6819294cf2608e74 Mon Sep 17 00:00:00 2001
From: James Coleman 
Date: Thu, 30 Apr 2020 21:40:44 -0400
Subject: [PATCH v1 3/3] Show sample simplehash method signatures

---
 src/include/lib/simplehash.h | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 8daf7c4d1a..d7f7134541 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -176,24 +176,47 @@ typedef struct SH_ITERATOR
 
 /* externally visible function prototypes */
 #ifdef SH_RAW_ALLOCATOR
+/* <prefix>_hash <prefix>_create(uint32 nelements, void *private_data) */
 SH_SCOPE	SH_TYPE *SH_CREATE(uint32 nelements, void *private_data);
 #else
+/*
+ * <prefix>_hash <prefix>_create(MemoryContext ctx, uint32 nelements,
+ * void *private_data)
+ */
 SH_SCOPE	SH_TYPE *SH_CREATE(MemoryContext ctx, uint32 nelements,
 			   void *private_data);
 #endif
+/* void <prefix>_destroy(<prefix>_hash *tb) */
 SH_SCOPE void SH_DESTROY(SH_TYPE * tb);
+/* void <prefix>_reset(<prefix>_hash *tb) */
 SH_SCOPE void SH_RESET(SH_TYPE * tb);
+/* void <prefix>_grow(<prefix>_hash *tb) */
 SH_SCOPE void SH_GROW(SH_TYPE * tb, uint32 newsize);
+/* <element> <prefix>_insert(<prefix>_hash *tb, <key> key, bool *found) */
 SH_SCOPE	SH_ELEMENT_TYPE *SH_INSERT(SH_TYPE * tb, SH_KEY_TYPE key, bool *found);
+/*
+ * <element> <prefix>_insert_hash(<prefix>_hash *tb, <key> key, uint32 hash,
+ *   bool *found)
+ */
 SH_SCOPE	SH_ELEMENT_TYPE *SH_INSERT_HASH(SH_TYPE * tb, SH_KEY_TYPE key,
 			uint32 hash, bool *found);
+/* <element> <prefix>_lookup(<prefix>_hash *tb, <key> key) */
 SH_SCOPE	SH_ELEMENT_TYPE *SH_LOOKUP(SH_TYPE * tb, SH_KEY_TYPE key);
+/* <element> <prefix>_lookup_hash(<prefix>_hash *tb, <key> key, uint32 hash) */
 SH_SCOPE	SH_ELEMENT_TYPE *SH_LOOKUP_HASH(SH_TYPE * tb, SH_KEY_TYPE key,
 			uint32 hash);
+/* bool <prefix>_delete(<prefix>_hash *tb, <key> key) */
 SH_SCOPE bool SH_DELETE(SH_TYPE * tb, SH_KEY_TYPE key);
+/* void <prefix>_start_iterate(<prefix>_hash *tb, <prefix>_iterator *iter) */
 SH_SCOPE void SH_START_ITERATE(SH_TYPE * tb, SH_ITERATOR * iter);
+/*
+ * void <prefix>_start_iterate_at(<prefix>_hash *tb, <prefix>_iterator *iter,
+ *  uint32 at)
 */
 SH_SCOPE void

Re: do {} while (0) nitpick

2020-04-30 Thread Bruce Momjian
On Thu, Apr 30, 2020 at 09:51:10PM -0400, Tom Lane wrote:
> John Naylor  writes:
> > As I understand it, the point of having "do {} while (0)" in a
> > multi-statement macro is to turn it into a simple statement.
> 
> Right.
> 
> > As such,
> > ending with a semicolon in both the macro definition and the
> > invocation will turn it back into multiple statements, creating
> > confusion if someone were to invoke the macro in an "if" statement.
> 
> Yeah.  I'd call these actual bugs, and perhaps even back-patch worthy.

Agreed.  Those semicolons could easily create bugs.

-- 
  Bruce Momjian  https://momjian.us
  EnterpriseDB https://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +




Re: do {} while (0) nitpick

2020-04-30 Thread Tom Lane
John Naylor  writes:
> As I understand it, the point of having "do {} while (0)" in a
> multi-statement macro is to turn it into a simple statement.

Right.

> As such,
> ending with a semicolon in both the macro definition and the
> invocation will turn it back into multiple statements, creating
> confusion if someone were to invoke the macro in an "if" statement.

Yeah.  I'd call these actual bugs, and perhaps even back-patch worthy.

regards, tom lane




do {} while (0) nitpick

2020-04-30 Thread John Naylor
Hi,

As I understand it, the point of having "do {} while (0)" in a
multi-statement macro is to turn it into a simple statement. As such,
ending with a semicolon in both the macro definition and the
invocation will turn it back into multiple statements, creating
confusion if someone were to invoke the macro in an "if" statement.
Even if that never happens, it seems good to keep them all consistent,
as in the attached patch.
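
To make the failure mode concrete, here is a tiny hypothetical example (not
taken from the attached patch) of what happens when the macro definition
itself ends in a semicolon:

#include <stdbool.h>

/* multi-statement macro whose definition (incorrectly) ends in ";" */
#define SET_TWO(a, b) \
	do { \
		(a) = 1; \
		(b) = 2; \
	} while (0);				/* <-- stray semicolon */

void
example(bool cond, int *x, int *y)
{
	if (cond)
		SET_TWO(*x, *y);		/* expands to "do { ... } while (0); ;" */
	else						/* error: "else" without a previous "if" */
		*x = *y = 0;
}

Dropping the semicolon from the definition makes the expansion a single
statement again, and the if/else pairs up as expected.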

-- 
John Naylor    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


do-while-zero.patch
Description: Binary data


Re: Why are wait events not reported even though it reads/writes a timeline history file?

2020-04-30 Thread Masahiro Ikeda

On 2020-05-01 00:25, Fujii Masao wrote:

On 2020/04/28 17:42, Masahiro Ikeda wrote:

On 2020-04-28 15:09, Michael Paquier wrote:

On Tue, Apr 28, 2020 at 02:49:00PM +0900, Fujii Masao wrote:
Isn't it safer to report the wait event during fgets() rather than putting
those calls around the whole loop, like other code does? For example,
writeTimeLineHistory() reports the wait event during read() rather than
the whole loop.


Yeah, I/O wait events should be taken only during the duration of the
system calls.  Particularly here, you may finish with an elog() that
causes the wait event to be set longer than it should, leading to a
rather incorrect state if a snapshot of pg_stat_activity is taken.
--


Thanks for your comments.

I fixed it to report the wait event during fgets() only.
Please review the v2 patch I attached.


Thanks for updating the patch! Here are the review comments from me.

+   char   *result;
+   pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_READ);
+   result = fgets(fline, sizeof(fline), fd);
+   pgstat_report_wait_end();
+   if (result == NULL)
+   break;
+
/* skip leading whitespace and check for # comment */
char   *ptr;

Since the variable name "result" has already been used in this function,
it should be renamed.


Sorry for that.

I thought about renaming it, but I changed it to use feof()
to clarify the difference from ferror().



The code should not be injected into the variable declaration block.


Thanks for the comment.
I moved the code block after the variable declaration block.



When reading this patch, I found that an I/O error in fgets() is not
checked there. Though this is not the issue that you reported,
it seems better to fix it together. So what about adding
the following code?

if (ferror(fd))
ereport(ERROR,
(errcode_for_file_access(),
 errmsg("could not read file \"%s\": %m", path)));


Thanks, I agree with your comment.
I added the above code to the v3 patch I attached.


Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION
From 976a796c1ff7ed191d9f3df6ca333e700e3a Mon Sep 17 00:00:00 2001
From: Masahiro Ikeda 
Date: Fri, 1 May 2020 09:38:51 +0900
Subject: [PATCH v3] Fix to report wait events about timeline history file.

Even though a timeline history file is read or written,
some wait events are not reported. This patch fixes those issues.
---
 src/backend/access/transam/timeline.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index de57d699af..4dfe9fddd1 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -123,7 +123,7 @@ readTimeLineHistory(TimeLineID targetTLI)
 	 * Parse the file...
 	 */
 	prevend = InvalidXLogRecPtr;
-	while (fgets(fline, sizeof(fline), fd) != NULL)
+	for (;;)
 	{
 		/* skip leading whitespace and check for # comment */
 		char	   *ptr;
@@ -132,6 +132,17 @@ readTimeLineHistory(TimeLineID targetTLI)
 		uint32		switchpoint_lo;
 		int			nfields;
 
+		pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_READ);
+		fgets(fline, sizeof(fline), fd);
+		pgstat_report_wait_end();
+
+		if (feof(fd))
+			break;
+		else if (ferror(fd))
+			ereport(ERROR,
+(errcode_for_file_access(),
+errmsg("could not read file \"%s\": %m", path)));
+
 		for (ptr = fline; *ptr; ptr++)
 		{
 			if (!isspace((unsigned char) *ptr))
@@ -393,6 +404,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 
 	nbytes = strlen(buffer);
 	errno = 0;
+	pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_WRITE);
 	if ((int) write(fd, buffer, nbytes) != nbytes)
 	{
 		int			save_errno = errno;
@@ -408,6 +420,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 (errcode_for_file_access(),
  errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
+	pgstat_report_wait_end();
 
 	pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_SYNC);
 	if (pg_fsync(fd) != 0)
-- 
2.25.1



Re: Raw device on PostgreSQL

2020-04-30 Thread Jonah H. Harris
On Wed, Apr 29, 2020 at 8:34 PM Jonah H. Harris 
wrote:

> On Tue, Apr 28, 2020 at 8:10 AM Andreas Karlsson 
> wrote:
>
>> To get the performance benefits from using raw devices I think you would
>> want to add support for asynchronous IO to PostgreSQL rather than
>> implementing your own layer to emulate the kernel's buffered IO.
>>
>> Andres Freund did a talk on aync IO in PostgreSQL earlier this year. It
>> was not recorded but the slides are available.
>>
>>
>> https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/
>
>
> FWIW, in 2007/2008, when I was at EnterpriseDB, Inaam Rana and I
> implemented a benchmarkable proof-of-concept patch for direct I/O and
> asynchronous I/O (for libaio and POSIX). We made that patch public, so it
> should be on the list somewhere. But, we began to run into performance
> issues related to buffer manager scaling in terms of locking and,
> specifically, replacement. We began prototyping alternate buffer managers
> (going back to the old MRU/LRU model with midpoint insertion and testing a
> 2Q variant) but that wasn't public. I had also prototyped raw device
> support, which is a good amount of work and required implementing a custom
> filesystem (similar to Oracle's ASM) within the storage manager. It's
> probably a bit harder now than it was then, given the number of different
> types of file access.
>

Here's a hack job merge of that preliminary PoC AIO/DIO patch against
13devel. This was designed to keep the buffer manager clean using AIO and
is write-only. I'll have to dig through some of my other old Postgres 8.x
patches to find the AIO-based prefetching version with aio_req_t modified
to handle read vs. write in FileAIO. Also, this will likely have an issue
with O_DIRECT as additional buffer manager alignment is needed and I
haven't tracked it down in 13 yet. As my default development is on a Mac, I
have POSIX AIO only. As such, I can't natively play with the O_DIRECT or
libaio paths to see if they work without going into Docker or VirtualBox -
and I don't care that much right now :)

The code is nasty, but maybe it will give someone ideas. If I get some time
to work on it, I'll rewrite it properly.

-- 
Jonah H. Harris


13dev_aiodio_latest.patch
Description: Binary data


Re: design for parallel backup

2020-04-30 Thread Robert Haas
On Thu, Apr 30, 2020 at 3:52 PM Andres Freund  wrote:
> Why 8kb? That's smaller than what we currently do in pg_basebackup,
> afaictl, and you're actually going to be bottlenecked by syscall
> overhead at that point (unless you disable / don't have the whole intel
> security mitigation stuff).

I just picked something. Could easily try other things.

> > , divided into 1, 2, 4, 8, or 16 equal size files, with each file
> > written by a separate process, and an fsync() at the end before
> > process exit. So in this test, there is no question of whether the
> > master can read the data fast enough, nor is there any issue of
> > network bandwidth. It's purely a test of whether it's faster to have
> > one process write a big file or whether it's faster to have multiple
> > processes each write a smaller file.
>
> That's not necessarily the only question though, right? There's also the
> approach of one process writing out multiple files (via buffered, not async
> IO), e.g. one basebackup connecting to multiple backends, or just
> shuffling multiple files through one copy stream.

Sure, but that seems like it can't scale better than this. You have
the scaling limitations of the filesystem, plus the possibility that
the process is busy doing something else when it could be writing to
any particular file.

> If you can provide me with the test program, I'd happily run it on some
> decent, but not upper end, NVMe SSDs.

It was attached, but I forgot to mention that in the body of the email.

> I think you might also be seeing some interaction with write caching on
> the raid controller here. The file sizes are small enough to fit in
> there to a significant degree for the single file tests.

Yeah, that's possible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options

2020-04-30 Thread Krasiyan Andreev
Thank you very much for the feedback, and yes, that is very useful SQL syntax.
Maybe you missed my previous answer, but you are right that the patch is
currently dead, because some important design questions must be discussed here
before rewriting the patch.

I have dropped support for FROM FIRST/LAST for nth_value(),
but I also reimplemented it in a different way,
by using a negative number for the position argument, to be able to get the
same frame in exact reverse order.
After that the patch becomes much simpler and some concerns about the
precedence hack are gone.

I have not renamed the special bool type "ignorenulls"
(I know that it is not an acceptable way of calling an extra version
of window functions, but it also makes things very easy and it can reuse
frames), but I removed the other special bool type "fromlast".

The attached file was for PostgreSQL 13 (master git branch, last commitfest);
everything was working, the patch was in very good shape at the time, and all
tests passed.

I read the previous review and suggestions from Tom about the special bool
type and unreserved keywords, and also that IGNORE NULLS could be implemented
inside the WinGetFuncArgXXX functions,
but I am not sure how exactly to proceed (an example would be very
helpful).

On Thu, Apr 30, 2020 at 21:58, Stephen Frost wrote:

> Greetings,
>
> This seems to have died out, and that's pretty unfortunate because this
> is awfully useful SQL standard syntax that people look for and wish we
> had.
>
> * Andrew Gierth (and...@tao11.riddles.org.uk) wrote:
> > So I've tried to rough out a decision tree for the various options on
> > how this might be implemented (discarding the "use precedence hacks"
> > option). Opinions? Additions?
> >
> > (formatted for emacs outline-mode)
> >
> > * 1. use lexical lookahead
> >
> >   +: relatively straightforward parser changes
> >   +: no new reserved words
> >   +: has the option of working extensibly with all functions
> >
> >   -: base_yylex needs extending to 3 lookahead tokens
>
> This sounds awful grotty and challenging to do and get right, and the
> alternative (just reserving these, as the spec does) doesn't seem so
> draconian as to be that much of an issue.
>
> > * 2. reserve nth_value etc. as functions
> >
> >   +: follows the spec reasonably well
> >   +: less of a hack than extending base_yylex
> >
> >   -: new reserved words
> >   -: more parser rules
> >   -: not extensible
> >
>
> For my 2c, at least, reserving these strikes me as entirely reasonable.
> Yes, it sucks that we have to partially-reserve some additional
> keywords, but such is life.  I get that we'll throw syntax errors
> sometimes when we might have given a better error, but I think we can
> accept that.
>
> >   (now goto 1.2.1)
>
> Hmm, not sure this was right?  but sure, I'll try...
>
> > *** 1.2.1. Check the function name in parse analysis against a fixed
> list.
> >
> >   +: simple
> >   -: not extensible
>
> Seems like this is more-or-less required since we'd be reserving them..?
>
> > *** 1.2.2. Provide some option in CREATE FUNCTION
> >
> >   +: extensible
> >   -: fairly intrusive, adding stuff to create function and pg_proc
>
> How would this work though, if we reserve the functions as keywords..?
> Maybe I'm not entirely following, but wouldn't attempts to use other
> functions end up with syntax errors in at least some of the cases,
> meaning that having other functions support this wouldn't really work?
> I don't particularly like the idea that some built-in functions would
> always work but others would work but only some of the time.
>
> > *** 1.2.3. Do something magical with function argument types
> >
> >   +: doesn't need changes in create function / pg_proc
> >   -: it's an ugly hack
>
> Not really a fan of 'ugly hack'.
>
> > * 3. "just say no" to the spec
> >
> >   e.g. add new functions like lead_ignore_nulls(), or add extra boolean
> >   args to lead() etc. telling them to skip nulls
> >
> >   +: simple
> >   -: doesn't conform to spec
> >   -: using extra args isn't quite the right semantics
>
> Ugh, no thank you.
>
> Thanks!
>
> Stephen
>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 28035f1635..3d73c96891 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -15702,7 +15702,7 @@ SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;

  lag(value anyelement
  [, offset integer
- [, default anyelement ]])
+ [, default anyelement ]]) [null_treatment]

   
   
@@ -15731,7 +15731,7 @@ SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;

  lead(value anyelement
   [, offset integer
-  [, default anyelement ]])
+  [, default anyelement ]]) [null_treatment]

   
   
@@ -15757,7 +15757,7 @@ SELECT xmlagg(x) FROM (SELECT x FROM test ORDER BY y DESC) AS tab;

 first_value

-   first_value(value any)
+   first_value(

Re: Avoiding hash join batch explosions with extreme skew and weird stats

2020-04-30 Thread Thomas Munro
On Fri, May 1, 2020 at 2:30 AM Melanie Plageman
 wrote:
> I made a few edits to the message and threw it into a draft patch (on
> top of master, of course). I didn't want to junk up peoples' inboxes, so
> I didn't start a separate thread, but, it will be pretty hard to
> collaboratively edit the comment/ever register it for a commitfest if it
> is wedged into this thread. What do you think?

+1, this is a good description and I'm sure you're right about the
name of the algorithm.  It's a "hybrid" between a simple no-partition
hash join and partitioning like the Grace machine, since batch 0 is
processed directly without touching the disk.

You mention that PHJ finalises the number of batches during build
phase while SHJ can extend it later.  There's also a difference in the
probe phase: although inner batch 0 is loaded into the hash table
directly and not written to disk during the build phase (= classic
hybrid, just like the serial algorithm), outer batch 0 *is* written
out to disk at the start of the probe phase (unlike classic hybrid at
least as we have it for serial hash join).  That's because I couldn't
figure out how to begin emitting tuples before partitioning was
finished, without breaking the deadlock-avoidance programming rule
that you can't let the program counter escape from the node when
someone might wait for you.  So maybe it's erm, a hybrid between
hybrid and Grace...




Re: design for parallel backup

2020-04-30 Thread Andres Freund
Hi,

On 2020-04-30 14:50:34 -0400, Robert Haas wrote:
> On Mon, Apr 20, 2020 at 4:19 PM Andres Freund  wrote:
> > One question I have not really seen answered well:
> >
> > Why do we want parallelism here. Or to be more precise: What do we hope
> > to accelerate by making what part of creating a base backup
> > parallel. There's several potential bottlenecks, and I think it's
> > important to know the design priorities to evaluate a potential design.
>
> I spent some time today trying to understand just one part of this,
> which is how long it will take to write the base backup out to disk
> and whether having multiple independent processes helps. I settled on
> writing and fsyncing 64GB of data, written in 8kB chunks

Why 8kb? That's smaller than what we currently do in pg_basebackup,
afaict, and you're actually going to be bottlenecked by syscall
overhead at that point (unless you disable / don't have the whole intel
security mitigation stuff).


> , divided into 1, 2, 4, 8, or 16 equal size files, with each file
> written by a separate process, and an fsync() at the end before
> process exit. So in this test, there is no question of whether the
> master can read the data fast enough, nor is there any issue of
> network bandwidth. It's purely a test of whether it's faster to have
> one process write a big file or whether it's faster to have multiple
> processes each write a smaller file.

That's not necessarily the only question though, right? There's also the
approach of one process writing out multiple files (via buffered, not async
IO), e.g. one basebackup connecting to multiple backends, or just
shuffling multiple files through one copy stream.


> I tested this on EDB's cthulhu. It's an older server, but it happens
> to have 4 mount points available for testing, one with XFS + magnetic
> disks, one with ext4 + magnetic disks, one with XFS + SSD, and one
> with ext4 + SSD.

IIRC cthulhu's SSDs are not that fast, compared to NVMe storage (by
nearly an order of magnitude IIRC). So this might be disadvantaging the
parallel case more than it should. Also perhaps the ext4 disadvantage is
smaller on more modern kernel versions?

If you can provide me with the test program, I'd happily run it on some
decent, but not upper end, NVMe SSDs.


> The fastest write performance of any test was the 16-way XFS-SSD test,
> which wrote at about 2.56 gigabytes per second. The fastest
> single-file test was on ext4-magnetic, though ext4-ssd and
> xfs-magnetic were similar, around 0.66 gigabytes per second.

I think you might also be seeing some interaction with write caching on
the raid controller here. The file sizes are small enough to fit in
there to a significant degree for the single file tests.


> Your system must be a LOT faster, because you were seeing
> pg_basebackup running at, IIUC, ~3 gigabytes per second, and that
> would have been a second process both writing and doing other
> things.

Right. On my workstation I have an NVMe SSD that can do ~2.5 GiB/s
sustained; in my laptop, one that peaks at ~3.2GiB/s but then quickly drops
to ~2GiB/s.

FWIW, I ran a "benchmark" just now just using dd, on my laptop, on
battery (so take this with a huge grain of salt). With 1 dd writing out
150GiB in 8kB blocks I get 1.8GiB/s, with two writing 75GiB each
~840MiB/s, and with three writing 50GiB each, 550MiB/s.


> Now, I don't know how much this matters. To get limited by this stuff,
> you'd need an incredibly fast network - 10 or maybe 40 or 100 Gigabit
> Ethernet or something like that - or to be doing a local backup. But I
> thought that it was interesting and that I should share it, so here
> you go! I do wonder if the apparent concurrency problems with ext4
> might matter on systems with high connection counts just in normal
> operation, backups aside.

I have seen such problems. Some of them have gotten better though. For
most (all?) linux filesystems we can easily run into filesystem
concurrency issues from within postgres. There's basically a file level
exclusive lock for buffered writes (only for the copy into the page
cache though), due to posix requirements about the effects of a write
being atomic.

Greetings,

Andres Freund




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2020-04-30 Thread Alvaro Herrera
On 2020-Apr-30, Melanie Plageman wrote:

> On Tue, Apr 28, 2020 at 11:50 PM Heikki Linnakangas  wrote:

> > I haven't looked at the patch in detail, but thanks [...]

> Thanks for taking a look, Heikki!

Hmm.  We don't have Heikki's message in the archives.  In fact, the last
message from Heikki we seem to have in any list is
cca4e4dc-32ac-b9ab-039d-98dcb5650...@iki.fi dated February 19 in
pgsql-bugs.  I wonder if there's some problem between Heikki and the
lists.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Bug with subqueries in recursive CTEs?

2020-04-30 Thread Laurenz Albe
On Thu, 2020-04-30 at 04:37 +0100, Andrew Gierth wrote:
> "Laurenz" == Laurenz Albe  writes:
> 
>  Laurenz> I played with a silly example and got a result that surprises
>  Laurenz> me:
> 
>  Laurenz>   WITH RECURSIVE fib AS (
>  Laurenz> SELECT n, "fibₙ"
>  Laurenz> FROM (VALUES (1, 1::bigint), (2, 1)) AS f(n,"fibₙ")
>  Laurenz>  UNION ALL
>  Laurenz> SELECT max(n) + 1,
>  Laurenz>sum("fibₙ")::bigint
>  Laurenz> FROM (SELECT n, "fibₙ"
>  Laurenz>   FROM fib
>  Laurenz>   ORDER BY n DESC
>  Laurenz>   LIMIT 2) AS tail
>  Laurenz> HAVING max(n) < 10
>  Laurenz>   )
>  Laurenz>   SELECT * FROM fib;
> 
>  Laurenz> I would have expected either the Fibonacci sequence or
> 
>  Laurenz>   ERROR: aggregate functions are not allowed in a recursive
>  Laurenz>   query's recursive term

Thanks for looking at this!

> You don't get a Fibonacci sequence because the recursive term only sees
> the rows (in this case only one row) added by the previous iteration,
> not the entire result set so far.

Ah, of course.  You are right.

> So the result seems correct as far as that goes. The reason the
> "aggregate functions are not allowed" error isn't hit is that the
> aggregate and the recursive reference aren't ending up in the same query
> - the check for aggregates is looking at the rangetable of the query
> level containing the agg to see if it has an RTE_CTE entry which is a
> recursive reference.

But I wonder about that.

The source says that
  "Per spec, aggregates can't appear in a recursive term."
Is that the only reason for that error message, or is there a deeper reason
to forbid it?

It feels wrong that a subquery would make using an aggregate legal when
it is illegal without the subquery.
But then, it doesn't bother me enough to research, and as long as the result
as such is correct, I feel much better.

Yours,
Laurenz Albe





Re: Add RESPECT/IGNORE NULLS and FROM FIRST/LAST options

2020-04-30 Thread Stephen Frost
Greetings,

This seems to have died out, and that's pretty unfortunate because this
is awfully useful SQL standard syntax that people look for and wish we
had.

* Andrew Gierth (and...@tao11.riddles.org.uk) wrote:
> So I've tried to rough out a decision tree for the various options on
> how this might be implemented (discarding the "use precedence hacks"
> option). Opinions? Additions?
> 
> (formatted for emacs outline-mode)
> 
> * 1. use lexical lookahead
> 
>   +: relatively straightforward parser changes
>   +: no new reserved words
>   +: has the option of working extensibly with all functions
> 
>   -: base_yylex needs extending to 3 lookahead tokens

This sounds awful grotty and challenging to do and get right, and the
alternative (just reserving these, as the spec does) doesn't seem so
draconian as to be that much of an issue.

> * 2. reserve nth_value etc. as functions
> 
>   +: follows the spec reasonably well
>   +: less of a hack than extending base_yylex
> 
>   -: new reserved words
>   -: more parser rules
>   -: not extensible
> 

For my 2c, at least, reserving these strikes me as entirely reasonable.
Yes, it sucks that we have to partially-reserve some additional
keywords, but such is life.  I get that we'll throw syntax errors
sometimes when we might have given a better error, but I think we can
accept that.

>   (now goto 1.2.1)

Hmm, not sure this was right?  but sure, I'll try...

> *** 1.2.1. Check the function name in parse analysis against a fixed list.
> 
>   +: simple
>   -: not extensible

Seems like this is more-or-less required since we'd be reserving them..?

> *** 1.2.2. Provide some option in CREATE FUNCTION
> 
>   +: extensible
>   -: fairly intrusive, adding stuff to create function and pg_proc

How would this work though, if we reserve the functions as keywords..?
Maybe I'm not entirely following, but wouldn't attempts to use other
functions end up with syntax errors in at least some of the cases,
meaning that having other functions support this wouldn't really work?
I don't particularly like the idea that some built-in functions would
always work but others would work but only some of the time.

> *** 1.2.3. Do something magical with function argument types
> 
>   +: doesn't need changes in create function / pg_proc
>   -: it's an ugly hack

Not really a fan of 'ugly hack'.

> * 3. "just say no" to the spec
> 
>   e.g. add new functions like lead_ignore_nulls(), or add extra boolean
>   args to lead() etc. telling them to skip nulls
> 
>   +: simple
>   -: doesn't conform to spec
>   -: using extra args isn't quite the right semantics

Ugh, no thank you.

Thanks!

Stephen


signature.asc
Description: PGP signature


Re: design for parallel backup

2020-04-30 Thread Robert Haas
On Mon, Apr 20, 2020 at 4:19 PM Andres Freund  wrote:
> One question I have not really seen answered well:
>
> Why do we want parallelism here. Or to be more precise: What do we hope
> to accelerate by making what part of creating a base backup
> parallel. There's several potential bottlenecks, and I think it's
> important to know the design priorities to evaluate a potential design.

I spent some time today trying to understand just one part of this,
which is how long it will take to write the base backup out to disk
and whether having multiple independent processes helps. I settled on
writing and fsyncing 64GB of data, written in 8kB chunks, divided into
1, 2, 4, 8, or 16 equal size files, with each file written by a
separate process, and an fsync() at the end before process exit. So in
this test, there is no question of whether the master can read the
data fast enough, nor is there any issue of network bandwidth. It's
purely a test of whether it's faster to have one process write a big
file or whether it's faster to have multiple processes each write a
smaller file.
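
For anyone who wants to reproduce something similar, here is a minimal sketch
of that kind of test driver (this is not the attached program; the file
names, total size, and argument handling are made up):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLCKSZ 8192

static void
write_one_file(const char *path, unsigned long long bytes)
{
	char		buf[BLCKSZ];
	int			fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);

	if (fd < 0)
	{
		perror("open");
		exit(1);
	}
	memset(buf, 'x', sizeof(buf));
	for (unsigned long long done = 0; done < bytes; done += BLCKSZ)
	{
		if (write(fd, buf, BLCKSZ) != BLCKSZ)
		{
			perror("write");
			exit(1);
		}
	}
	if (fsync(fd) != 0)
	{
		perror("fsync");
		exit(1);
	}
	close(fd);
}

int
main(int argc, char **argv)
{
	int			nprocs = (argc > 1) ? atoi(argv[1]) : 4;
	unsigned long long total = 64ULL * 1024 * 1024 * 1024;	/* 64GB overall */
	struct timeval start,
				end;

	gettimeofday(&start, NULL);
	for (int i = 0; i < nprocs; i++)
	{
		if (fork() == 0)
		{
			char		path[64];

			snprintf(path, sizeof(path), "testfile.%d", i);
			write_one_file(path, total / nprocs);
			_exit(0);
		}
	}
	while (wait(NULL) > 0)		/* wait for every child to finish */
		;
	gettimeofday(&end, NULL);
	printf("total: %ld seconds\n", (long) (end.tv_sec - start.tv_sec));
	return 0;
}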

I tested this on EDB's cthulhu. It's an older server, but it happens
to have 4 mount points available for testing, one with XFS + magnetic
disks, one with ext4 + magnetic disks, one with XFS + SSD, and one
with ext4 + SSD. I did the experiment described above on each mount
point separately, and then I also tried 4, 8, or 16 equal size files
spread evenly across the 4 mount points. To summarize the results very
briefly:

1. ext4 degraded really badly with >4 concurrent writers. XFS did not.
2. SSDs were faster than magnetic disks, but you had to use XFS and
>=4 concurrent writers to get the benefit.
3. Spreading writes across the mount points works well, but the
slowest mount point sets the pace.

Here are more detailed results, with times in seconds:

filesystem  media   1@64GB  2@32GB  4@16GB  8@8GB  16@4GB
xfs         mag         97      53      60     67      71
ext4        mag         94      68      66    335     549
xfs         ssd         97      55      33     27      25
ext4        ssd        116      70      66    227     450
spread      spread     n/a     n/a      48     42      44

The spread test with 16 files @ 4GB looks like this:

[/mnt/data-ssd/robert.haas/test14] open: 0, write: 7, fsync: 0, close: 0, total: 7
[/mnt/data-ssd/robert.haas/test10] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd/robert.haas/test2] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd/robert.haas/test6] open: 0, write: 7, fsync: 2, close: 0, total: 9
[/mnt/data-ssd2/robert.haas/test3] open: 0, write: 16, fsync: 0, close: 0, total: 16
[/mnt/data-ssd2/robert.haas/test11] open: 0, write: 16, fsync: 0, close: 0, total: 16
[/mnt/data-ssd2/robert.haas/test15] open: 0, write: 17, fsync: 0, close: 0, total: 17
[/mnt/data-ssd2/robert.haas/test7] open: 0, write: 18, fsync: 0, close: 0, total: 18
[/mnt/data-mag/robert.haas/test16] open: 0, write: 7, fsync: 18, close: 0, total: 25
[/mnt/data-mag/robert.haas/test4] open: 0, write: 7, fsync: 19, close: 0, total: 26
[/mnt/data-mag/robert.haas/test12] open: 0, write: 7, fsync: 19, close: 0, total: 26
[/mnt/data-mag/robert.haas/test8] open: 0, write: 7, fsync: 22, close: 0, total: 29
[/mnt/data-mag2/robert.haas/test9] open: 0, write: 20, fsync: 23, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test13] open: 0, write: 18, fsync: 25, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test5] open: 0, write: 19, fsync: 24, close: 0, total: 43
[/mnt/data-mag2/robert.haas/test1] open: 0, write: 18, fsync: 25, close: 0, total: 43

The fastest write performance of any test was the 16-way XFS-SSD test,
which wrote at about 2.56 gigabytes per second. The fastest
single-file test was on ext4-magnetic, though ext4-ssd and
xfs-magnetic were similar, around 0.66 gigabytes per second. Your
system must be a LOT faster, because you were seeing pg_basebackup
running at, IIUC, ~3 gigabytes per second, and that would have been a
second process both writing and doing other things. For comparison,
some recent local pg_basebackup testing on this machine by some of my
colleagues ran at about 0.82 gigabytes per second.

I suspect it would be possible to get significantly higher numbers on
this hardware by (1) changing all the filesystems over to XFS and (2)
dividing the data dynamically based on write speed rather than writing
the same amount of it everywhere. I bet we could reach 6-8 gigabytes
per second if we did all that.

Now, I don't know how much this matters. To get limited by this stuff,
you'd need an incredibly fast network - 10 or maybe 40 or 100 Gigabit
Ethernet or something like that - or to be doing a local backup. But I
thought that it was interesting and that I should share it, so here
you go! I do wonder if the apparent concurrency problems with ext4
might matter on systems with high connection counts just in normal
operation, backups aside.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
#include 
#include 
#include 
#include 
#include 

#define BLCKSZ 8192

extern void runtest(unsigned long long blocks_total, cha

Re: SLRU statistics

2020-04-30 Thread Tomas Vondra

On Fri, May 01, 2020 at 03:02:59AM +0900, Fujii Masao wrote:



On 2020/04/02 9:41, Tomas Vondra wrote:

Hi,

I've pushed this after some minor cleanup and improvements.


+static char *slru_names[] = {"async", "clog", "commit_timestamp",
+ "multixact_offset", 
"multixact_member",
+ "oldserxid", "pg_xact", 
"subtrans",
+ "other" /* has to be 
last */};

When I tried pg_stat_slru, I found that it returns a row for "pg_xact".
But since there is no "pg_xact" SLRU (the "clog" SLRU exists instead),
shouldn't "pg_xact" be removed? Patch attached.



Yeah, I think I got confused and accidentally added both "clog" and
"pg_xact". I'll get "pg_xact" removed.


Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6562cc400b..ba6d8d2123 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3483,7 +3483,7 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
   predefined list (async, clog,
   commit_timestamp, multixact_offset,
   multixact_member, oldserxid,
-   pg_xact, subtrans and
+   subtrans and
   other) resets counters for only that entry.
   Names not included in this list are treated as other.
  
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e8a8..2ba3858d31 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -152,7 +152,7 @@ PgStat_MsgBgWriter BgWriterStats;
 */
static char *slru_names[] = {"async", "clog", "commit_timestamp",
  "multixact_offset", 
"multixact_member",
- "oldserxid", "pg_xact", 
"subtrans",
+ "oldserxid", 
"subtrans",
  "other" /* has to be 
last */};

/* number of elemenents of slru_name array */



--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WIP/PoC for parallel backup

2020-04-30 Thread David Zhang

On 2020-04-30 2:18 a.m., Sumanta Mukherjee wrote:


Hi,

Would it be possible to put in the absolute numbers from perf,
so that it is easier to understand the amount of improvement with
and without the patch and with different loads and workers?


Here are the parameters used to record the perf data on both the server and
client side, for example, after applying patch15 using 4 workers with
load:


perf record -o postgres_patch_j4_load -e block:block_rq_insert -e 
cpu-clock -e cycles:k -e skb:consume_skb -aR -s -- 
/home/ec2-user/after/bin/postgres -D /mnt/test/data


perf record -o backup_patch_j4_load -e block:block_rq_insert -e 
cpu-clock -e cycles:k -e skb:consume_skb -aR -s -- 
/home/ec2-user/after/bin/pg_basebackup -h ${PG_SERVER} -p 5432 -D 
/mnt/backup/data -v


And this is how the report is generated.
perf report  -i postgres_patch_j4_load --stdio > postgres_patch_j4_load.txt

The original perf data files are still available. Can you please clarify
which parameter you would like to be added for regenerating the report,
or whether any other parameters need to be added to recreate the perf.data
and then generate the report?




I am also unsure why the swapper is taking such a huge percentage of
the absolute time in the base run of just the postgres server and
pg_basebackup client.

With Regards,
Sumanta Mukherjee.
EnterpriseDB: http://www.enterprisedb.com


On Thu, Apr 30, 2020 at 1:18 PM David Zhang  wrote:


Hi,

Thanks a lot for sharing the test results. Here are our test
results using perf on three AWS t2.xlarge instances with the configuration below.

Machine configuration:
  Instance Type      : t2.xlarge
  Volume type        : io1
  Memory             : 16 GB
  vCPU #             : 4
  Architecture       : x86_64
  IOPS               : 6000
  Database Size (GB) : 45 (Server)

case 1: postgres server: without patch and without load

* Disk I/O:

# Samples: 342K of event 'block:block_rq_insert'
# Event count (approx.): 342834
#
# Overhead  Command  Shared Object Symbol
#   ...  . .
#
    97.65%  postgres [kernel.kallsyms]  [k] __elv_add_request
 2.27%  kworker/u30:0    [kernel.kallsyms]  [k] __elv_add_request


* CPU:

# Samples: 6M of event 'cpu-clock'
# Event count (approx.): 155944475
#
# Overhead  Command  Shared Object Symbol
#   ...  
.
#
    64.73%  swapper  [kernel.kallsyms] [k]
native_safe_halt
    10.89%  postgres [vdso]    [.]
__vdso_gettimeofday
 5.64%  postgres [kernel.kallsyms] [k] do_syscall_64
 5.43%  postgres libpthread-2.26.so
    [.] __libc_recv
 1.72%  postgres [kernel.kallsyms] [k]
pvclock_clocksource_read

* Network:

# Samples: 2M of event 'skb:consume_skb'
# Event count (approx.): 2739785
#
# Overhead  Command  Shared Object Symbol
#   ...  .
...
#
    91.58%  swapper  [kernel.kallsyms]  [k] consume_skb
 7.09%  postgres [kernel.kallsyms]  [k] consume_skb
 0.61%  kswapd0  [kernel.kallsyms]  [k] consume_skb
 0.44%  ksoftirqd/3  [kernel.kallsyms]  [k] consume_skb


case 1: pg_basebackup client: without patch and without load

* Disk I/O:

# Samples: 371K of event 'block:block_rq_insert'
# Event count (approx.): 371362
#
# Overhead  Command  Shared Object Symbol
#   ...  . .
#
    96.78%  kworker/u30:0    [kernel.kallsyms]  [k] __elv_add_request
 2.82%  pg_basebackup    [kernel.kallsyms]  [k] __elv_add_request
 0.29%  kworker/u30:1    [kernel.kallsyms]  [k] __elv_add_request
 0.09%  xfsaild/xvda1    [kernel.kallsyms]  [k] __elv_add_request


* CPU:

# Samples: 3M of event 'cpu-clock'
# Event count (approx.): 90352700
#
# Overhead  Command  Shared Object Symbol
#   ...  ..
.
#
    87.99%  swapper  [kernel.kallsyms]   [k] native_safe_halt
 3.14%  swapper  [kernel.kallsyms]   [k] __lock_text_start
 0.48%  swapper  [kernel.kallsyms]   [k]
__softirqentry_text_start
 0.37%  pg_basebackup    [kernel.kallsyms]   [k]
copy_user_enhanced_fast_string
 0.35%  swapper  [kernel.kallsyms]   [k] do_csum

* Network:

# Samples: 12M of event 'skb:consume_skb'
# Event count (approx.): 12260713
#
# Overhead  Command  Shared Ob

Re: SLRU statistics

2020-04-30 Thread Fujii Masao



On 2020/04/02 9:41, Tomas Vondra wrote:

Hi,

I've pushed this after some minor cleanup and improvements.


+static char *slru_names[] = {"async", "clog", "commit_timestamp",
+ "multixact_offset", 
"multixact_member",
+ "oldserxid", "pg_xact", 
"subtrans",
+ "other" /* has to be 
last */};

When I tried pg_stat_slru, I found that it returns a row for "pg_xact".
But since there is no "pg_xact" SLRU (the "clog" SLRU exists instead),
shouldn't "pg_xact" be removed? Patch attached.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6562cc400b..ba6d8d2123 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3483,7 +3483,7 @@ SELECT pid, wait_event_type, wait_event FROM 
pg_stat_activity WHERE wait_event i
predefined list (async, clog,
commit_timestamp, 
multixact_offset,
multixact_member, oldserxid,
-   pg_xact, subtrans and
+   subtrans and
other) resets counters for only that entry.
Names not included in this list are treated as other.
   
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 50eea2e8a8..2ba3858d31 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -152,7 +152,7 @@ PgStat_MsgBgWriter BgWriterStats;
  */
 static char *slru_names[] = {"async", "clog", "commit_timestamp",
  "multixact_offset", 
"multixact_member",
- "oldserxid", 
"pg_xact", "subtrans",
+ "oldserxid", 
"subtrans",
  "other" /* has to be 
last */};
 
 /* number of elemenents of slru_name array */


Re: Proposing WITH ITERATIVE

2020-04-30 Thread Jeff Davis
On Tue, 2020-04-28 at 11:57 -0400, Jonah H. Harris wrote:
> Yeah, in that specific case, one of the other implementations seems
> to carry the counters along in the executor itself. But, as not all
> uses of this functionality are iteration-count-based, I think that's
> a little limiting. Using a terminator expression (of some kind) seems
> most adaptable, I think. I'll give some examples of both types of
> cases.

In my experience, graph algorithms or other queries doing more
specialized analysis tend to get pretty complex with lots of special
cases. Users may want to express these algorithms in a more familiar
language (R, Python, etc.), and to version the code (probably in an
extension).

Have you considered taking this to the extreme and implementing
something like User-Defined Table Operators[1]? Or is there a
motivation for writing such algorithms inline in SQL?

Regards,
Jeff Davis

[1] http://www.vldb.org/conf/1999/P47.pdf






Re: Why are wait events not reported even though it reads/writes a timeline history file?

2020-04-30 Thread Fujii Masao




On 2020/04/28 17:42, Masahiro Ikeda wrote:

On 2020-04-28 15:09, Michael Paquier wrote:

On Tue, Apr 28, 2020 at 02:49:00PM +0900, Fujii Masao wrote:

Isn't it safer to report the wait event during fgets() rather than putting
those calls around the whole loop, like other code does? For example,
writeTimeLineHistory() reports the wait event during read() rather than
the whole loop.


Yeah, I/O wait events should be taken only during the duration of the
system calls.  Particularly here, you may finish with an elog() that
causes the wait event to be set longer than it should, leading to a
rather incorrect state if a snapshot of pg_stat_activity is taken.
--


Thanks for your comments.

I fixed it to report the wait event during fgets() only.
Please review the v2 patch I attached.


Thanks for updating the patch! Here are the review comments from me.

+   char   *result;
+   pgstat_report_wait_start(WAIT_EVENT_TIMELINE_HISTORY_READ);
+   result = fgets(fline, sizeof(fline), fd);
+   pgstat_report_wait_end();
+   if (result == NULL)
+   break;
+
/* skip leading whitespace and check for # comment */
char   *ptr;

Since the variable name "result" has already been used in this function,
it should be renamed.

The code should not be injected into the variable declaration block.

When reading this patch, I found that an I/O error in fgets() is not
checked there. Though this is not the issue that you reported,
it seems better to fix it together. So what about adding
the following code?

if (ferror(fd))
ereport(ERROR,
(errcode_for_file_access(),
 errmsg("could not read file \"%s\": %m", path)));

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: +(pg_lsn, int8) and -(pg_lsn, int8) operators

2020-04-30 Thread Fujii Masao




On 2020/04/28 15:03, Michael Paquier wrote:

On Tue, Apr 28, 2020 at 12:56:19PM +0900, Fujii Masao wrote:

Yes. Attached is the updated version of the patch, which introduces
+(pg_lsn, numeric) and -(pg_lsn, numeric) operators.
To implement them, I added also numeric_pg_lsn() function that converts
numeric to pg_lsn.


-those write-ahead log locations.
+those write-ahead log locations. Also the number of bytes can be added
+into and substracted from LSN using the + and
+- operators, respectively.
That's short.  Should this mention the restriction with numeric (or
just recommend its use) because we don't have a 64b unsigned type
internally, basically Robert's point?


Thanks for the review! What about the following description?

-
Also, the number of bytes can be added to and subtracted from an LSN using the
+(pg_lsn,numeric) and -(pg_lsn,numeric)
operators, respectively. Note that the calculated LSN should be in the range
of the pg_lsn type, i.e., between 0/0 and
FFFFFFFF/FFFFFFFF.
-
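
As a standalone C illustration of the arithmetic above (the literal LSN and the
byte count are made-up example values, not taken from the patch): an LSN is a
64-bit position printed as two 32-bit hex halves, so adding N bytes is plain
64-bit addition, which is why the result has to stay within 0/0 and
FFFFFFFF/FFFFFFFF.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main(void)
{
	/* made-up example: pg_lsn '16/B374D848' plus 16384 bytes */
	uint32_t	hi = 0x16;
	uint32_t	lo = 0xB374D848;
	uint64_t	lsn = ((uint64_t) hi << 32) | lo;
	uint64_t	plus = lsn + 16384;

	printf("%" PRIX32 "/%" PRIX32 " + 16384 = %" PRIX32 "/%" PRIX32 "\n",
		   hi, lo,
		   (uint32_t) (plus >> 32), (uint32_t) plus);
	printf("valid range: 0/0 .. FFFFFFFF/FFFFFFFF\n");
	return 0;
}
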



+   /* XXX would it be better to return NULL? */
+   if (NUMERIC_IS_NAN(num))
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("cannot convert NaN to pg_lsn")));
That would be good to test, and an error sounds fine to me.


You mean that we should add a test that goes through this code block
to the regression tests?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: Avoiding hash join batch explosions with extreme skew and weird stats

2020-04-30 Thread Melanie Plageman
On Tue, Apr 28, 2020 at 11:50 PM Heikki Linnakangas  wrote:

> On 29/04/2020 05:03, Melanie Plageman wrote:
> > I've attached a patch which should address some of the previous feedback
> > about code complexity. Two of my co-workers and I wrote what is
> > essentially a new prototype of the idea. It uses the main state machine
> > to route emitting unmatched tuples instead of introducing a separate
> > state. The logic for falling back is also more developed.
>
> I haven't looked at the patch in detail, but thanks for the commit
> message; it describes very well what this is all about. It would be nice
> to copy that explanation to the top comment in nodeHashJoin.c in some
> form. I think we're missing a high level explanation of how the batching
> works even before this new patch, and that commit message does a good
> job at it.
>
>
Thanks for taking a look, Heikki!

I made a few edits to the message and threw it into a draft patch (on
top of master, of course). I didn't want to junk up people's inboxes, so
I didn't start a separate thread, but it will be pretty hard to
collaboratively edit the comment or ever register it for a commitfest if
it is wedged into this thread. What do you think?

-- 
Melanie Plageman
From 1deb1d777693ffcb73c96130ac51b282cd968577 Mon Sep 17 00:00:00 2001
From: Melanie Plageman 
Date: Thu, 30 Apr 2020 07:16:28 -0700
Subject: [PATCH v1] Describe hybrid hash join implementation

This is just a draft to spark conversation on what a good comment might
be like in this file on how the hybrid hash join algorithm is
implemented in Postgres. I'm pretty sure this is the accepted term for
this algorithm https://en.wikipedia.org/wiki/Hash_join#Hybrid_hash_join
---
 src/backend/executor/nodeHashjoin.c | 36 +
 1 file changed, 36 insertions(+)

diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index cc8edacdd0..86bfdaef7f 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -10,6 +10,42 @@
  * IDENTIFICATION
  *	  src/backend/executor/nodeHashjoin.c
  *
+ *   HYBRID HASH JOIN
+ *
+ *  If the inner side tuples of a hash join do not fit in memory, the hash join
+ *  can be executed in multiple batches.
+ *
+ *  If the statistics on the inner side relation are accurate, the planner
+ *  chooses a multi-batch strategy and estimates the number of batches.
+ *
+ *  The query executor measures the real size of the hashtable and increases the
+ *  number of batches if the hashtable grows too large.
+ *
+ *  The number of batches is always a power of two, so an increase in the number
+ *  of batches doubles it.
+ *
+ *  Serial hash join measures batch size lazily -- waiting until it is loading a
+ *  batch to determine whether it will fit in memory. While inserting tuples into
+ *  the hashtable, if adding a tuple would exceed work_mem, serial hash join
+ *  dumps tuples out of the hashtable and reassigns them either to other batch
+ *  files or to the current batch, which remains resident in the hashtable.
+ *
+ *  Parallel hash join, on the other hand, completes all changes to the number
+ *  of batches during the build phase. If it increases the number of batches, it
+ *  dumps out all the tuples from all batches and reassigns them to entirely new
+ *  batch files. Then it checks every batch to ensure it will fit in the space
+ *  budget for the query.
+ *
+ *  In both parallel and serial hash join, the executor currently makes a best
+ *  effort. If a particular batch will not fit in memory, it tries doubling the
+ *  number of batches. If, after such an increase, there is a batch which
+ *  retained all or none of its tuples, the executor disables further growth in
+ *  the number of batches globally. Once growth is disabled, batches that would
+ *  previously have triggered an increase in the number of batches are instead
+ *  allowed to exceed the space budget.
+ *
+ *  TODO: should we discuss that tuples can only spill forward?
+ *
  * PARALLELISM
  *
  * Hash joins can participate in parallel query execution in several ways.  A
-- 
2.20.1
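
To illustrate the batching arithmetic the draft comment describes, here is a
small self-contained sketch (not part of the patch; the helper name and the
bit selection are simplified assumptions, since the real executor derives
bucket and batch numbers from different bits of the hash value): with nbatch
a power of two and the batch number taken from the low bits, doubling nbatch
either keeps a tuple in its current batch or moves it to a strictly later one,
which is why tuples only ever "spill forward".

#include <stdio.h>

/* Hypothetical helper: batch number taken from the low bits of the hash,
 * with nbatch always a power of two. */
static unsigned
batch_for(unsigned hashvalue, unsigned nbatch)
{
	return hashvalue & (nbatch - 1);
}

int
main(void)
{
	unsigned	hashes[] = {0x1234abcd, 0xdeadbeef, 0x00000007, 0xfffffff0};
	unsigned	old_nbatch = 4;
	unsigned	new_nbatch = old_nbatch * 2;	/* growth always doubles nbatch */

	for (int i = 0; i < 4; i++)
	{
		unsigned	before = batch_for(hashes[i], old_nbatch);
		unsigned	after = batch_for(hashes[i], new_nbatch);

		/* After doubling, a tuple either stays in its batch or moves to
		 * batch (before + old_nbatch), i.e. it can only spill forward. */
		printf("hash %08x: batch %u -> %u (%s)\n",
			   hashes[i], before, after,
			   after == before ? "stays" : "spills forward");
	}
	return 0;
}
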



Back-patch is necessary? Re: Don't try fetching future segment of a TLI.

2020-04-30 Thread Fujii Masao




On 2020/04/08 1:49, Fujii Masao wrote:



On 2020/04/07 20:21, David Steele wrote:


On 4/7/20 3:48 AM, Kyotaro Horiguchi wrote:

At Tue, 7 Apr 2020 12:15:00 +0900, Fujii Masao  
wrote in

This doesn't seem a bug, so I'm thinking to merge this to next *major*
version release, i.e., v13.

Not a bug, perhaps, but I think we do consider back-patching
performance problems. The rise in S3 usage has just exposed how poorly
this performed code in high-latency environments.


I understood the situation and am fine to back-patch that. But I'm not
sure
if it's fair to do that. Maybe we need to hear more opinions about
this?
OTOH, feature freeze for v13 is today, so what about committing the
patch
in v13 at first, and then doing the back-patch after hearing opinions
and
receiving many +1?


+1 for commit only v13 today, then back-patch if people wants and/or
accepts.


Please let me revisit this. Currently Grigory Smolkin, David Steele,
Michael Paquier and Pavel Suderevsky agree to the back-patch and
there has been no objection to that. So we should do the back-patch?
Or does anyone object to that?

I don't think that this is a bug because archive recovery works
fine from a functional perspective without this commit. OTOH,
I understand that, without the commit, there are complaints that
archive recovery may be unnecessarily slow when archival storage is
located remotely, e.g., on Amazon S3, and it takes a long time to fetch
a non-existent archive WAL file. So I'm OK with the back-patch unless
there is a strong objection to it.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION




Re: WIP/PoC for parallel backup

2020-04-30 Thread Amit Kapila
On Thu, Apr 30, 2020 at 4:15 PM Amit Kapila  wrote:
>
> On Wed, Apr 29, 2020 at 6:11 PM Suraj Kharage
>  wrote:
> >
> > Hi,
> >
> > We at EnterpriseDB did some performance testing around this parallel backup 
> > to check how this is beneficial and below are the results. In this testing, 
> > we run the backup -
> > 1) Without Asif’s patch
> > 2) With Asif’s patch and combination of workers 1,2,4,8.
> >
> > We run those test on two setup
> >
> > 1) Client and Server both on the same machine (Local backups)
> >
> > 2) Client and server on a different machine (remote backups)
> >
> >
> > Machine details:
> >
> > 1: Server (on which local backups performed and used as server for remote 
> > backups)
> >
> > 2: Client (Used as a client for remote backups)
> >
> >
> ...
> >
> >
> > Client & Server on the same machine, the result shows around 50% 
> > improvement in parallel run with worker 4 and 8.  We don’t see the huge 
> > performance improvement with more workers been added.
> >
> >
> > Whereas, when the client and server on a different machine, we don’t see 
> > any major benefit in performance.  This testing result matches the testing 
> > results posted by David Zhang up thread.
> >
> >
> >
> > We ran the test for 100GB backup with parallel worker 4 to see the CPU 
> > usage and other information. What we noticed is that server is consuming 
> > the CPU almost 100% whole the time and pg_stat_activity shows that server 
> > is busy with ClientWrite most of the time.
> >
> >
>
> Was this for a setup where the client and server were on the same
> machine or where the client was on a different machine?  If it was for
> the case where both are on the same machine, then ideally, we should
> see ClientRead events in a similar proportion?
>

During an offlist discussion with Robert, he pointed out that the current
basebackup code doesn't account for the wait event for the reading
of files, which can change what pg_stat_activity shows.  Can you please
apply his latest patch to improve basebackup.c's code [1], which will
take care of that wait event, before gathering the data again?

[1] - 
https://www.postgresql.org/message-id/CA%2BTgmobBw-3573vMosGj06r72ajHsYeKtksT_oTxH8XvTL7DxA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Postgres Windows build system doesn't work with python installed in Program Files

2020-04-30 Thread Victor Wagner

Collegues,

Accidentally I've come across a minor bug in Mkvcbuild.pm.
It happens that it doesn't tolerate spaces in the $config->{python}
path, because it wants to call python in order to find out the version,
prefix and so on, and doesn't properly quote the command.

Fix is very simple, see attach.

Patch is made against REL_12_STABLE, but probably applicable to other
versions as well.
--
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 99f39caa522..6c95b7068d8 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -497,7 +497,7 @@ sub mkvcbuild
 		my $pythonprog = "import sys;print(sys.prefix);"
 		  . "print(str(sys.version_info[0])+str(sys.version_info[1]))";
 		my $prefixcmd =
-		  $solution->{options}->{python} . "\\python -c \"$pythonprog\"";
+		'"' . $solution->{options}->{python} . "\\python\" -c \"$pythonprog\"";
 		my $pyout = `$prefixcmd`;
 		die "Could not query for python version!\n" if $?;
 		my ($pyprefix, $pyver) = split(/\r?\n/, $pyout);


Re: Transactions involving multiple postgres foreign servers, take 2

2020-04-30 Thread Masahiko Sawada
On Tue, 28 Apr 2020 at 19:37, Muhammad Usama  wrote:
>
>
>
> On Wed, Apr 8, 2020 at 11:16 AM Masahiko Sawada 
>  wrote:
>>
>> On Fri, 27 Mar 2020 at 22:06, Muhammad Usama  wrote:
>> >
>> > Hi Sawada San,
>> >
>> > I have been further reviewing and testing the transaction involving 
>> > multiple server patches.
>> > Overall the patches are working as expected bar a few important exceptions.
>> > So as discussed over the call I have fixed the issues I found during the 
>> > testing
>> > and also rebased the patches with the current head of the master branch.
>> > So can you please have a look at the attached updated patches.
>>
>> Thank you for reviewing and updating the patch!
>>
>> >
>> > Below is the list of changes I have made on top of V18 patches.
>> >
>> > 1- In register_fdwxact(), As we are just storing the callback function 
>> > pointers from
>> > FdwRoutine in fdw_part structure, So I think we can avoid calling
>> > GetFdwRoutineByServerId() in TopMemoryContext.
>> > So I have moved the MemoryContextSwitch to TopMemoryContext after the
>> > GetFdwRoutineByServerId() call.
>>
>> Agreed.
>>
>> >
>> >
>> > 2- If PrepareForeignTransaction functionality is not present in some FDW 
>> > then
>> > during the registration process we should only set the 
>> > XACT_FLAGS_FDWNOPREPARE
>> > transaction flag if the modified flag is also set for that server. As for 
>> > the server that has
>> > not done any data modification within the transaction we do not do 
>> > two-phase commit anyway.
>>
>> Agreed.
>>
>> >
>> > 3- I have moved the foreign_twophase_commit in sample file after
>> > max_foreign_transaction_resolvers because the default value of 
>> > max_foreign_transaction_resolvers
>> > is 0 and enabling the foreign_twophase_commit produces an error with 
>> > default
>> > configuration parameter positioning in postgresql.conf
>> > Also, foreign_twophase_commit configuration was missing the comments
>> > about allowed values in the sample config file.
>>
>> Sounds good. Agreed.
>>
>> >
>> > 4- Setting ForeignTwophaseCommitIsRequired in 
>> > is_foreign_twophase_commit_required()
>> > function does not seem to be the correct place. The reason being, even when
>> > is_foreign_twophase_commit_required() returns true after setting 
>> > ForeignTwophaseCommitIsRequired
>> > to true, we could still end up not using the two-phase commit in the case 
>> > when some server does
>> > not support two-phase commit and foreign_twophase_commit is set to 
>> > FOREIGN_TWOPHASE_COMMIT_PREFER
>> > mode. So I have moved the ForeignTwophaseCommitIsRequired assignment to 
>> > PreCommit_FdwXacts()
>> > function after doing the prepare transaction.
>>
>> Agreed.
>>
>> >
>> > 6- In prefer mode, we commit the transaction in single-phase if the server 
>> > does not support
>> > the two-phase commit. But instead of doing the single-phase commit right 
>> > away,
>> > IMHO the better way is to wait until all the two-phase transactions are 
>> > successfully prepared
>> > on servers that support the two-phase. Since an error during a "PREPARE" 
>> > stage would
>> > rollback the transaction and in that case, we would end up with committed 
>> > transactions on
>> > the server that lacks the support of the two-phase commit.
>>
>> When an error occurred before the local commit, a 2pc-unsupported
>> server could be rolled back or committed depending on the error
>> timing. On the other hand all 2pc-supported servers are always rolled
>> back when an error occurred before the local commit. Therefore even if
>> we change the order of COMMIT and PREPARE it is still possible that we
>> will end up committing the part of 2pc-unsupported servers while
>> rolling back others including 2pc-supported servers.
>>
>> I guess the motivation of your change is that since errors are likely
>> to happen during executing PREPARE on foreign servers, we can minimize
>> the possibility of rolling back 2pc-unsupported servers by deferring
>> the commit of 2pc-unsupported server as much as possible. Is that
>> right?
>
>
> Yes, that is correct. The idea of doing the COMMIT on NON-2pc-supported 
> servers
> after all the PREPAREs are successful is to minimize the chances of partial 
> commits.
> And as you mentioned there will still be chances of getting a partial commit 
> even with
> this approach but the probability of that would be less than what it is with 
> the
> current sequence.
>
>
>>
>>
>> > So I have modified the flow a little bit and instead of doing a one-phase 
>> > commit right away
>> > the servers that do not support a two-phase commit is added to another 
>> > list and that list is
>> > processed after once we have successfully prepared all the transactions on 
>> > two-phase supported
>> > foreign servers. Although this technique is also not bulletproof, still it 
>> > is better than doing
>> > the one-phase commits before doing the PREPAREs.
>>
>> Hmm the current logic seems complex. Maybe we can just reverse the
>> order of COMMIT an

Re: Optimization for hot standby XLOG_STANDBY_LOCK redo

2020-04-30 Thread Amit Kapila
On Thu, Apr 30, 2020 at 4:07 PM 邱宇航  wrote:
>
> I noticed that in hot standby, XLOG_STANDBY_LOCK redo is sometimes block by 
> another query, and all the rest redo is blocked by this lock getting 
> operation, which is not good and often happed in my database, so the hot 
> standby will be left behind and master will store a lot of WAL which can’t be 
> purged.
>
> So here is the idea:
> We can do XLOG_STANDBY_LOCK redo asynchronously, and the rest redo will 
> continue.
>

Hmm, I don't think we can do this.  The XLOG_STANDBY_LOCK WAL is used
for an AccessExclusiveLock on a relation, which means it is a lock for a
DDL operation.  If you skip processing the WAL for this lock, the
behavior of queries running on the standby will be unpredictable.
Consider a case where, on the master, the user has dropped the table
t1. When the standby replays such an operation, concurrent queries on t1
are blocked due to the replay of the XLOG_STANDBY_LOCK WAL. If you skip
that WAL, the drop of the table and a query on the same table can happen
simultaneously, leading to unpredictable behavior.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: WIP/PoC for parallel backup

2020-04-30 Thread Amit Kapila
On Wed, Apr 29, 2020 at 6:11 PM Suraj Kharage
 wrote:
>
> Hi,
>
> We at EnterpriseDB did some performance testing around this parallel backup 
> to check how this is beneficial and below are the results. In this testing, 
> we run the backup -
> 1) Without Asif’s patch
> 2) With Asif’s patch and combination of workers 1,2,4,8.
>
> We run those test on two setup
>
> 1) Client and Server both on the same machine (Local backups)
>
> 2) Client and server on a different machine (remote backups)
>
>
> Machine details:
>
> 1: Server (on which local backups performed and used as server for remote 
> backups)
>
> 2: Client (Used as a client for remote backups)
>
>
...
>
>
> Client & Server on the same machine, the result shows around 50% improvement 
> in parallel run with worker 4 and 8.  We don’t see the huge performance 
> improvement with more workers been added.
>
>
> Whereas, when the client and server on a different machine, we don’t see any 
> major benefit in performance.  This testing result matches the testing 
> results posted by David Zhang up thread.
>
>
>
> We ran the test for 100GB backup with parallel worker 4 to see the CPU usage 
> and other information. What we noticed is that server is consuming the CPU 
> almost 100% whole the time and pg_stat_activity shows that server is busy 
> with ClientWrite most of the time.
>
>

Was this for a setup where the client and server were on the same
machine or where the client was on a different machine?  If it was for
the case where both are on the same machine, then ideally, we should
see ClientRead events in a similar proportion?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Optimization for hot standby XLOG_STANDBY_LOCK redo

2020-04-30 Thread 邱宇航
I noticed that in hot standby, XLOG_STANDBY_LOCK redo is sometimes blocked by
another query, and all the remaining redo is blocked behind this lock-acquisition
operation, which is not good and often happens in my database, so the hot standby
falls behind and the master stores a lot of WAL which can't be purged.

So here is the idea:
We can do XLOG_STANDBY_LOCK redo asynchronously, and the rest of the redo will
continue.
I also wonder whether LogStandbySnapshot will affect consistency in hot standby,
since the redo is no longer applied in order, and how to avoid that.

// -- startup --
StartupXLOG()
{
    while (readRecord())
    {
        check_lock_get_state();
        if (record.tx is in pending tbl)
            append this record to the pending tx for further redo;
        redo_record();
    }
}

check_lock_get_state()
{
    for (tx in pending_tx)
        if (tx.all_lock are got)
        {
            redo the rest of the records for this tx;
            free this tx;
        }
}

standby_redo()
{
    if (XLOG_STANDBY_LOCK redo failed)
    {
        add_lock_to_pending_tx_tbl();
    }
}

// -- worker process --
main()
{
    while (true)
    {
        for (lock in pending locks, ordered by lsn)
            try_to_get_lock_from_pending_tbl();
    }
}


regards.
Yuhang



Re: WIP/PoC for parallel backup

2020-04-30 Thread Sumanta Mukherjee
Hi,

Would it be possible to put in the absolute numbers from perf,
so that it is easier to understand the amount of improvement with
and without the patch under different loads and worker counts?

I am also unsure why the swapper is taking such a huge percentage of the
absolute time
in the base run of just the postgres server and pg_basebackup client.

With Regards,
Sumanta Mukherjee.
EnterpriseDB: http://www.enterprisedb.com


On Thu, Apr 30, 2020 at 1:18 PM David Zhang  wrote:

> Hi,
>
> Thanks a lot for sharing the test results. Here are our test results
> using perf on three AWS t2.xlarge instances with the configuration below.
>
> Machine configuration:
>   Instance Type:t2.xlarge
>   Volume type  :io1
>   Memory (MiB) :16GB
>   vCPU #   :4
>   Architecture   :x86_64
>   IOP :6000
>   Database Size (GB)  :45 (Server)
>
> case 1: postgres server: without patch and without load
>
> * Disk I/O:
>
> # Samples: 342K of event 'block:block_rq_insert'
> # Event count (approx.): 342834
> #
> # Overhead  Command  Shared Object  Symbol
> #   ...  .  .
> #
> 97.65%  postgres [kernel.kallsyms]  [k] __elv_add_request
>  2.27%  kworker/u30:0[kernel.kallsyms]  [k] __elv_add_request
>
>
> * CPU:
>
> # Samples: 6M of event 'cpu-clock'
> # Event count (approx.): 155944475
> #
> # Overhead  Command  Shared Object
> Symbol
> #   ...  
> .
> #
> 64.73%  swapper  [kernel.kallsyms] [k] native_safe_halt
> 10.89%  postgres [vdso][.] __vdso_gettimeofday
>  5.64%  postgres [kernel.kallsyms] [k] do_syscall_64
>  5.43%  postgres libpthread-2.26.so[.] __libc_recv
>  1.72%  postgres [kernel.kallsyms] [k]
> pvclock_clocksource_read
>
> * Network:
>
> # Samples: 2M of event 'skb:consume_skb'
> # Event count (approx.): 2739785
> #
> # Overhead  Command  Shared Object  Symbol
> #   ...  .  ...
> #
> 91.58%  swapper  [kernel.kallsyms]  [k] consume_skb
>  7.09%  postgres [kernel.kallsyms]  [k] consume_skb
>  0.61%  kswapd0  [kernel.kallsyms]  [k] consume_skb
>  0.44%  ksoftirqd/3  [kernel.kallsyms]  [k] consume_skb
>
>
> case 1: pg_basebackup client: without patch and without load
>
> * Disk I/O:
>
> # Samples: 371K of event 'block:block_rq_insert'
> # Event count (approx.): 371362
> #
> # Overhead  Command  Shared Object  Symbol
> #   ...  .  .
> #
> 96.78%  kworker/u30:0[kernel.kallsyms]  [k] __elv_add_request
>  2.82%  pg_basebackup[kernel.kallsyms]  [k] __elv_add_request
>  0.29%  kworker/u30:1[kernel.kallsyms]  [k] __elv_add_request
>  0.09%  xfsaild/xvda1[kernel.kallsyms]  [k] __elv_add_request
>
>
> * CPU:
>
> # Samples: 3M of event 'cpu-clock'
> # Event count (approx.): 90352700
> #
> # Overhead  Command  Shared Object
> Symbol
> #   ...  ..
> .
> #
> 87.99%  swapper  [kernel.kallsyms]   [k] native_safe_halt
>  3.14%  swapper  [kernel.kallsyms]   [k] __lock_text_start
>  0.48%  swapper  [kernel.kallsyms]   [k]
> __softirqentry_text_start
>  0.37%  pg_basebackup[kernel.kallsyms]   [k]
> copy_user_enhanced_fast_string
>  0.35%  swapper  [kernel.kallsyms]   [k] do_csum
>
> * Network:
>
> # Samples: 12M of event 'skb:consume_skb'
> # Event count (approx.): 12260713
> #
> # Overhead  Command  Shared Object  Symbol
> #   ...  .  ...
> #
> 95.12%  swapper  [kernel.kallsyms]  [k] consume_skb
>  3.23%  pg_basebackup[kernel.kallsyms]  [k] consume_skb
>  0.83%  ksoftirqd/1  [kernel.kallsyms]  [k] consume_skb
>  0.45%  kswapd0  [kernel.kallsyms]  [k] consume_skb
>
>
> case 2: postgres server: with patch and with load, 4 backup workers on
> client side
>
> * Disk I/O:
>
> # Samples: 3M of event 'block:block_rq_insert'
> # Event count (approx.): 3634542
> #
> # Overhead  Command  Shared Object  Symbol
> #   ...  .  .
> #
> 98.88%  postgres [kernel.kallsyms]  [k] __elv_add_request
>  0.66%  perf [kernel.kallsyms]  [k] __elv_add_request
>  0.42%  kworker/u30:1[kernel.kallsyms]  [k] __elv_add_request
>  0.01%  sshd [kernel.kallsyms]  [k] __elv_add_request
>
> * CPU:
>
> # Samples: 9M of event 'cpu-clock'
> # Event count (approx.): 229912925
> #
> # Overhead  Command  Shared Object
> Symbol
> # .

Re: WAL usage calculation patch

2020-04-30 Thread Julien Rouhaud
On Thu, Apr 30, 2020 at 9:18 AM Julien Rouhaud  wrote:
>
> On Thu, Apr 30, 2020 at 5:05 AM Amit Kapila  wrote:
> >
> > On Tue, Apr 28, 2020 at 7:38 AM Amit Kapila  wrote:
> > >
> > > On Mon, Apr 27, 2020 at 1:22 PM Julien Rouhaud  wrote:
> > > >
> > >
> > > > I agree with that definition.  I can send a cleanup patch if there's
> > > > no objection.
> > > >
> > >
> > > Okay, feel free to send the patch.  Thanks for taking the initiative
> > > to write a patch for this.
> > >
> >
> > Julien, are you planning to write a cleanup patch for this open item?
>
> Sorry Amit, I've been quite busy at work for the last couple of days.
> I'll take care of that this morning for sure!

Here's the patch.  I included the content of
v3-fix_explain_wal_output.patch you provided before, and tried to
consistently replace full page writes/fpw with full page images/fpi
everywhere on top of it (so documentation, command output, variable
names and comments).


v4-fix_wal_usage.diff
Description: Binary data


Re: Improve errors when setting incorrect bounds for SSL protocols

2020-04-30 Thread Daniel Gustafsson
> On 30 Apr 2020, at 01:14, Michael Paquier  wrote:
> 
> On Wed, Apr 29, 2020 at 01:57:49PM +0200, Daniel Gustafsson wrote:
>> Working in the TLS corners of the backend, I found while re-reviewing and
>> re-testing for the release that this patch actually was a small, but vital,
>> brick shy of a load.  The error handling is always invoked due to a set of
>> missing braces.  Going into the check will cause the context to be freed and
>> be_tls_open_server error out.  The tests added narrowly escapes it by not
>> setting the max version in the final test, but I'm not sure it's worth 
>> changing
>> that now as not setting a value is an interesting testcase too.  Sorry for
>> missing that at the time of reviewing.
> 
> Good catch, fixed.  We would still have keep around the SSL old
> context if both bounds were set.  Testing this case would mean one
> extra full restart of the server, and I am not sure either if that's
> worth the extra cost here.

Agreed.  I don't think the cost is warranted given the low probability of new
errors around here, so I think the commit as it stands is sufficient.  Thanks.

cheers ./daniel



Re: Avoiding hash join batch explosions with extreme skew and weird stats

2020-04-30 Thread David Kimura
On Wed, Apr 29, 2020 at 4:39 PM Melanie Plageman
 wrote:
>
> In addition to many assorted TODOs in the code, there are a few major
> projects left:
> - Batch 0 falling back
> - Stripe barrier deadlock
> - Performance improvements and testing
>

Batch 0 never spills.  That behavior is an artifact of the existing design,
which, as an optimization, special-cases batch 0 to fill the initial hash table.
This means it can skip the loading phase and doesn't need to create a batch file.

However, in the pathological case where all tuples hash to batch 0, there is no
way to redistribute those tuples to other batches. So the existing hash join
implementation allows work_mem to be exceeded for batch 0.

In the adaptive hash join approach, there is another way to deal with a batch
that exceeds work_mem. If increasing the number of batches does not work, then
the batch can be split into stripes that will not exceed work_mem. Doing this
requires spilling the excess tuples to batch files. The following patch adds
logic to create a batch 0 file for serial hash join so that even in the
pathological case we do not need to exceed work_mem.
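
To make the batch 0 fallback concrete, here is a minimal self-contained toy
model (not the patch itself; the budget, the tuple strings, and the tmpfile()
handle are made-up stand-ins for work_mem and a batch 0 spill file): tuples
destined for batch 0 stay resident only while the space budget holds, and the
overflow is appended to the spill file instead of exceeding the budget.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SPACE_ALLOWED 32		/* made-up stand-in for work_mem, in bytes */

int
main(void)
{
	const char *tuples[] = {"alpha", "bravo", "charlie", "delta",
							"echo", "foxtrot", "golf", "hotel"};
	size_t		space_used = 0;
	FILE	   *batch0_file = tmpfile();	/* stand-in for a batch 0 spill file */

	if (batch0_file == NULL)
		return 1;

	for (size_t i = 0; i < sizeof(tuples) / sizeof(tuples[0]); i++)
	{
		size_t		sz = strlen(tuples[i]) + 1;

		if (space_used + sz <= SPACE_ALLOWED)
		{
			/* fits: keep it resident in the (imaginary) batch 0 hashtable */
			space_used += sz;
			printf("resident: %s (used %zu)\n", tuples[i], space_used);
		}
		else
		{
			/* would exceed the budget: spill to the batch 0 file instead */
			fprintf(batch0_file, "%s\n", tuples[i]);
			printf("spilled:  %s\n", tuples[i]);
		}
	}

	fclose(batch0_file);
	return 0;
}
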

Thanks,
David


v6-0002-Implement-fallback-of-batch-0-for-serial-adaptive.patch
Description: Binary data


Re: Fixes for two separate bugs in nbtree VACUUM's page deletion

2020-04-30 Thread Masahiko Sawada
On Thu, 30 Apr 2020 at 07:29, Peter Geoghegan  wrote:
>
> On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada
>  wrote:
> > For the first fix it seems better to push down the logic to the page
> > deletion code as your 0001 patch does so. The following change changes
> > the page deletion code so that it emits LOG message indicating the
> > index corruption when a deleted page is passed. But we used to ignore
> > in this case.
>
> Attached is v2 of the patch set, which includes a new patch that I
> want to target the master branch with. This patch (v2-0003-*)
> refactors btvacuumscan().
>
> This revision also changed the first bugfix patch. I have simplified
> some of the details in nbtree.c that were added by commit 857f9c36cda.
> Can't we call _bt_update_meta_cleanup_info() at a lower level, like in
> the patch? AFAICT it makes more sense to just call it in
> btvacuumscan(). Please let me know what you think of those changes.

Yes, I agree.

>
> The big change in the new master-only refactoring patch (v2-0003-*) is
> that I now treat a call to btvacuumpage() that has to
> "backtrack/recurse" and doesn't find a page that looks like the
> expected right half of a page split (a page split that occurred since
> the VACUUM operation began) as indicating corruption. This corruption
> is logged. I believe that we should never see this happen, for reasons
> that are similar to the reasons why _bt_pagedel() never finds a
> deleted page when moving right, as I went into in my recent e-mail to
> this thread. I think that this is worth tightening up for Postgres 13.
>
> I will hold off on committing v2-0003-*, since I need to nail down the
> reasoning for treating this condition as corruption. Plus it's not
> urgent. I think that the general direction taken in v2-0003-* is the
> right one, in any case. The "recursion" in btvacuumpage() doesn't make
> much sense -- it obscures what's really going on IMV. Also, using the
> variable name "scanblkno" at every level -- btvacuumscan(),
> btvacuumpage(), and _bt_pagedel() -- makes it clear that it's the same
> block at all levels of the code. And that the "backtrack/recursion"
> case doesn't kill tuples from the "scanblkno" block (and doesn't
> attempt to call _bt_pagedel(), which is designed only to handle blocks
> passed by btvacuumpage() when they are the current "scanblkno"). It
> seems unlikely that the VACUUM VERBOSE pages deleted accounting bug
> would have happened if the code was already structured this way.

+1 for refactoring. I often get confused by the fact that btvacuumpage() takes
two block numbers (blkno and orig_blkno) even though they hold the same value.
Your change makes that clear as well.

For the part treating that case as index corruption, I will need some
time to review it because my knowledge of btree indexes is lacking, so
I'll review it later.


Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: WAL usage calculation patch

2020-04-30 Thread Julien Rouhaud
On Thu, Apr 30, 2020 at 5:05 AM Amit Kapila  wrote:
>
> On Tue, Apr 28, 2020 at 7:38 AM Amit Kapila  wrote:
> >
> > On Mon, Apr 27, 2020 at 1:22 PM Julien Rouhaud  wrote:
> > >
> >
> > > I agree with that definition.  I can send a cleanup patch if there's
> > > no objection.
> > >
> >
> > Okay, feel free to send the patch.  Thanks for taking the initiative
> > to write a patch for this.
> >
>
> Julien, are you planning to write a cleanup patch for this open item?

Sorry Amit, I've been quite busy at work for the last couple of days.
I'll take care of that this morning for sure!




Re: PATCH: logical_work_mem and logical streaming of large in-progress transactions

2020-04-30 Thread Amit Kapila
On Wed, Apr 29, 2020 at 3:19 PM Dilip Kumar  wrote:
>
> On Wed, Apr 29, 2020 at 2:56 PM Dilip Kumar  wrote:
> >
> > On Tue, Apr 28, 2020 at 3:55 PM Dilip Kumar  wrote:
> > >
> > > On Tue, Apr 28, 2020 at 3:11 PM Amit Kapila  
> > > wrote:
> > > >
> > > > On Mon, Apr 27, 2020 at 4:05 PM Dilip Kumar  
> > > > wrote:
> > > > >
> > > > [latest patches]
> > > >
> > > > v16-0004-Gracefully-handle-concurrent-aborts-of-uncommitt
> > > > - Any actions leading to transaction ID assignment are prohibited.
> > > > That, among others,
> > > > + Note that access to user catalog tables or regular system catalog 
> > > > tables
> > > > + in the output plugins has to be done via the
> > > > systable_* scan APIs only.
> > > > + Access via the heap_* scan APIs will error out.
> > > > + Additionally, any actions leading to transaction ID assignment
> > > > are prohibited. That, among others,
> > > > ..
> > > > @@ -1383,6 +1392,14 @@ heap_fetch(Relation relation,
> > > >   bool valid;
> > > >
> > > >   /*
> > > > + * We don't expect direct calls to heap_fetch with valid
> > > > + * CheckXidAlive for regular tables. Track that below.
> > > > + */
> > > > + if (unlikely(TransactionIdIsValid(CheckXidAlive) &&
> > > > + !(IsCatalogRelation(relation) || 
> > > > RelationIsUsedAsCatalogTable(relation
> > > > + elog(ERROR, "unexpected heap_fetch call during logical decoding");
> > > > +
> > > >
> > > > I think comments and code don't match.  In the comment, we are saying
> > > > that via output plugins access to user catalog tables or regular
> > > > system catalog tables won't be allowed via heap_* APIs but code
> > > > doesn't seem to reflect it.  I feel only
> > > > TransactionIdIsValid(CheckXidAlive) is sufficient here.  See, the
> > > > original discussion about this point [1] (Refer "I think it'd also be
> > > > good to add assertions to codepaths not going through systable_*
> > > > asserting that ...").
> > >
> > > Right,  So I think we can just add an assert in these function that
> > > Assert(!TransactionIdIsValid(CheckXidAlive)) ?
> > >
> > > >
> > > > Isn't it better to block the scan to user catalog tables or regular
> > > > system catalog tables for tableam scan APIs rather than at the heap
> > > > level?  There might be some APIs like heap_getnext where such a check
> > > > might still be required but I guess it is still better to block at
> > > > tableam level.
> > > >
> > > > [1] - 
> > > > https://www.postgresql.org/message-id/20180726200241.aje4dv4jsv25v4k2%40alap3.anarazel.de
> > >
> > > Okay, let me analyze this part.  Because someplace we have to keep at
> > > heap level like heap_getnext and other places at tableam level so it
> > > seems a bit inconsistent.  Also, I think the number of checks might
> > > going to increase because some of the heap functions like
> > > heap_hot_search_buffer are being called from multiple tableam calls,
> > > so we need to put check at every place.
> > >
> > > Another point is that I feel some of the checks what we have today
> > > might not be required like heap_finish_speculative, is not fetching
> > > any tuple for us so why do we need to care about this function?
> >
> > While testing these changes, I have noticed that the systable_* APIs
> > internally, calls tableam apis and so if we just put assert
> > Assert(!TransactionIdIsValid(CheckXidAlive)) then it will always hit
> > that assert.  Whether we put these assert in heap APIs or the tableam
> > APIs because systable_ always access heap through tableam APIs.
> >
..
..
>
> Putting some more thought upon this, I am just wondering what do we
> really want any such check because, we are always getting relation
> description from the reorder buffer code, not from the pgoutput
> plugin.
>

But can't they access other catalogs like pg_publication*?  I think
the basic thing we want to ensure here is that all historic accesses
always use the systable_* APIs to access catalogs.  We can ensure that by
having Asserts (or elog(ERROR, ...)) in the heap/tableam APIs.
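
As a rough self-contained model of that invariant (not the actual patch; the
flag and function names are hypothetical stand-ins for CheckXidAlive and a
"reached via systable scan" indicator), the guard errors out whenever a
heap-level access happens while decoding an in-progress transaction unless it
was reached through a systable_*-style wrapper:

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>

/* Model state: decoding an in-progress xact, and "inside systable scan" flag. */
static bool check_xid_alive_set = false;
static bool in_systable_scan = false;

/* Modeled guard that the discussion proposes to put in heap/tableam APIs. */
static void
heap_access_guard(const char *where)
{
	if (check_xid_alive_set && !in_systable_scan)
	{
		fprintf(stderr, "ERROR: unexpected %s call during logical decoding\n", where);
		exit(1);
	}
}

/* A systable_*-style wrapper sets the flag around its internal heap access. */
static void
systable_style_scan(void)
{
	in_systable_scan = true;
	heap_access_guard("heap_getnext");	/* OK: reached via the wrapper */
	in_systable_scan = false;
}

int
main(void)
{
	check_xid_alive_set = true;			/* pretend we are decoding a running xact */

	systable_style_scan();				/* allowed */
	heap_access_guard("heap_fetch");	/* direct access: trips the error */
	return 0;
}
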

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com