Re: Tid scan improvements

2019-07-31 Thread Thomas Munro
On Thu, Aug 1, 2019 at 5:51 PM Edmund Horner  wrote:
> On Thu, 1 Aug 2019 at 17:47, Thomas Munro  wrote:
> > On Thu, Aug 1, 2019 at 5:34 PM Edmund Horner  wrote:
> > > Should we move to CF?  We have been in the CF cycle for almost a year 
> > > now...
> >
> > Hi Edmund,
> >
> > No worries, that's how it goes sometimes.  Please move it to the next
> > CF if you think you'll find some time before September.  Don't worry
> > if it might have to be moved again.  We want the feature, and good
> > things take time!
>
> Ok thanks.
>
> I tried moving it to the new commitfest, but it seems I cannot from
> its current state.

Done.  You have to move it to "Needs review" first, and then move to
next.  (Perhaps we should change that... I don't think that obstacle
achieves anything?)

-- 
Thomas Munro
https://enterprisedb.com




Re: Tid scan improvements

2019-07-31 Thread Edmund Horner
On Thu, 1 Aug 2019 at 17:47, Thomas Munro  wrote:
> On Thu, Aug 1, 2019 at 5:34 PM Edmund Horner  wrote:
> > Should we move to CF?  We have been in the CF cycle for almost a year now...
>
> Hi Edmund,
>
> No worries, that's how it goes sometimes.  Please move it to the next
> CF if you think you'll find some time before September.  Don't worry
> if it might have to be moved again.  We want the feature, and good
> things take time!

Ok thanks.

I tried moving it to the new commitfest, but it seems I cannot from
its current state.

If it's ok, I'll just leave it to you in 7 hours' time!

Edmund




Re: Tid scan improvements

2019-07-31 Thread Thomas Munro
On Thu, Aug 1, 2019 at 5:34 PM Edmund Horner  wrote:
> Sadly it is the end of the CF and I have not had much time to work on
> this.  I'll probably get some time in the next month to look at the
> heapam changes.
>
> Should we move to CF?  We have been in the CF cycle for almost a year now...

Hi Edmund,

No worries, that's how it goes sometimes.  Please move it to the next
CF if you think you'll find some time before September.  Don't worry
if it might have to be moved again.  We want the feature, and good
things take time!

-- 
Thomas Munro
https://enterprisedb.com




Re: concerns around pg_lsn

2019-07-31 Thread Jeevan Ladhe
Hi Michael,


> On further review of the first patch, I think that it could be a good
> idea to apply the same safeguards within float8in_internal_opt_error.
> Jeevan, what do you think?
>

Sure, I agree it makes sense to address float8in_internal_opt_error();
there might be more occurrences of such instances in other functions as
well.  If we agree, we should fix them as and when we encounter them
while touching those areas.

What is more dangerous with float8in_internal_opt_error() is that it
takes a have_error flag which is never set or used inside that function.
A further risk is that the callers of this function, e.g.
executeItemOptUnwrapTarget(), pass a non-null pointer to it (with the
pointed-to flag initialized to false) and expect an error to be thrown
if float8in_internal_opt_error() hits a problem, *but*
float8in_internal_opt_error() never actually touches the have_error
flag.  In this case it happens to be fine because the flag was
initialized to false; if it had not been, a garbage value could read as
true and an error would be raised every time!
Here is the relevant code from executeItemOptUnwrapTarget():

{code}
	if (jb->type == jbvNumeric)
	{
		char	   *tmp = DatumGetCString(DirectFunctionCall1(numeric_out,
															  NumericGetDatum(jb->val.numeric)));
		bool		have_error = false;

		(void) float8in_internal_opt_error(tmp,
										   NULL,
										   "double precision",
										   tmp,
										   &have_error);

		if (have_error)
			RETURN_ERROR(ereport(ERROR,
								 (errcode(ERRCODE_NON_NUMERIC_JSON_ITEM),
								  errmsg("jsonpath item method .%s() can only be applied to a numeric value",
										 jspOperationName(jsp->type)))));
		res = jperOk;
	}
	else if (jb->type == jbvString)
	{
		/* cast string as double */
		double		val;
		char	   *tmp = pnstrdup(jb->val.string.val,
								   jb->val.string.len);
		bool		have_error = false;

		val = float8in_internal_opt_error(tmp,
										  NULL,
										  "double precision",
										  tmp,
										  &have_error);

		if (have_error || isinf(val))
			RETURN_ERROR(ereport(ERROR,
								 (errcode(ERRCODE_NON_NUMERIC_JSON_ITEM),
								  errmsg("jsonpath item method .%s() can only be applied to a numeric value",
										 jspOperationName(jsp->type)))));

		jb = &jbv;
		jb->type = jbvNumeric;
		jb->val.numeric = DatumGetNumeric(DirectFunctionCall1(float8_numeric,
															  Float8GetDatum(val)));
		res = jperOk;
	}
{code}

I will also check whether any later commits have by mistake removed the
assignments from float8in_internal_opt_error(), evaluate that, and send
out a patch.

This is one of the reasons I was saying it is good practice for a
function that accepts an out parameter to always set it to some definite
value.
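
To illustrate the practice I mean, here is a minimal sketch (the
function name is made up for illustration; this is not the actual
float8in_internal_opt_error() code): the out parameter is initialized
unconditionally before anything else, so a caller can never observe a
stale or uninitialized value.

{code}
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Sketch only: parse_double_checked() is a hypothetical function. */
static double
parse_double_checked(const char *str, bool *have_error)
{
	char	   *endptr;
	double		val;

	if (have_error)
		*have_error = false;	/* always set the out parameter up front */

	errno = 0;
	val = strtod(str, &endptr);
	if (errno != 0 || endptr == str || *endptr != '\0')
	{
		if (have_error)
			*have_error = true; /* report failure instead of throwing */
		return 0.0;
	}

	return val;
}
{code}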

Regards,
Jeevan Ladhe


Re: Rearranging ALTER TABLE to avoid multi-operations bugs

2019-07-31 Thread Thomas Munro
On Tue, Jul 2, 2019 at 2:00 PM Tom Lane  wrote:
> movead li  writes:
> > I applied the 'alter-table-with-recursive-process-utility-calls-wip.patch'
> > on the master(e788e849addd56007a0e75f3b5514f294a0f3bca). And
> > when I test the cases, I find it works well on 'alter table t1 add column
> > f2 int not null, alter column f2 add generated always as identity' case,
> > but it doesn't work on #14827, #15180, #15670, #15710.
>
> This review seems not very on-point, because I made no claim to have fixed
> any of those bugs.  The issue at the moment is how to structure the code
> to allow ALTER TABLE to call other utility statements --- or, if we aren't
> going to do that as Robert seems not to want to, what exactly we're going
> to do instead.
>
> The patch at hand does fix some ALTER TABLE ... IDENTITY bugs, because
> fixing those doesn't require any conditional execution of utility
> statements.  But we'll need infrastructure for such conditional execution
> to fix the original bugs.  I don't see much point in working on that part
> until we have some agreement about how to handle what this patch is
> already doing.

With my CF manager hat:  I've moved this to the next CF so we can
close this one soon, but since it's really a bug report it might be
good to get more eyeballs on the problem sooner than September.

With my hacker hat:  Hmm.  I haven't looked at the patch, but not
passing down the QueryEnvironment when recursing is probably my fault,
and folding all such things into a new mechanism that would avoid such
bugs in the future sounds like a reasonable approach, if potentially
complicated to back-patch.  I'm hoping to come back and look at this
properly in a while.

-- 
Thomas Munro
https://enterprisedb.com




Re: Tid scan improvements

2019-07-31 Thread Edmund Horner
On Mon, 22 Jul 2019 at 19:44, Edmund Horner  wrote:
> On Mon, 22 Jul 2019 at 19:25, David Rowley 
> > On Sat, 20 Jul 2019 at 06:21, Andres Freund  wrote:
> > > Yea, I was thinking of something like 2.  We already have a few extra
> > > types of scan nodes (bitmap heap and sample scan), it'd not be bad to
> > > add another. And as you say, they can just share most of the guts: For
> > > heap I'd just implement pagemode, and perhaps split heapgettup_pagemode
> > > into two parts (one to do the page processing, the other to determine
> > > the relevant page).
> > >
> > > You say that we'd need new fields in HeapScanDescData - not so sure
> > > about that, it seems feasible to just provide the boundaries in the
> > > call? But I think it'd also just be fine to have the additional fields.
> >
> > Thanks for explaining.
> >
> > I've set the CF entry for the patch back to waiting on author.
> >
> > I think if we get this part the way Andres would like it, then we're
> > pretty close.

> [...]

> I'll have to spend a bit of time looking at heapgettup_pagemode to
> figure how to split it as Andres suggests.

Hi everyone,

Sadly it is the end of the CF and I have not had much time to work on
this.  I'll probably get some time in the next month to look at the
heapam changes.

Should we move to CF?  We have been in the CF cycle for almost a year now...

Edmund




Re: Multivariate MCV list vs. statistics target

2019-07-31 Thread Thomas Munro
On Thu, Aug 1, 2019 at 12:16 PM Jamison, Kirk  wrote:
> > On Saturday, July 27, 2019 7:06 AM(GMT+9), Tomas Vondra wrote:
> > > On Fri, Jul 26, 2019 at 07:03:41AM +, Jamison, Kirk wrote:
> > > >Overall, the patch is almost already in good shape for commit.
> > > >I'll wait for the next update.

> > The patch passed the make-world and regression tests as well.
> > I've marked this as "ready for committer".
>
> The patch does not apply anymore.

Based on the above, it sounds like this patch is super close and the
only problem is bitrot, so I've set it back to Ready for Committer.
Over to Tomas to rebase and commit, or move to the next CF if that's
more appropriate.

-- 
Thomas Munro
https://enterprisedb.com




Re: Commitfest 2019-07, the first of five* for PostgreSQL 13

2019-07-31 Thread Robert Eckhardt
On Thu, Aug 1, 2019 at 12:10 AM Thomas Munro  wrote:

> Hi all,
>
> CF1 officially ends in about 8 hours,  when August arrives on the
> volcanic islands of Howard and Baker, according to CURRENT_TIMESTAMP
> AT TIME ZONE '+12'.  I'll probably mark it closed at least 8 hours
> later than that because I'll be asleep.  Anything that is waiting on
> author and hasn't had any recent communication, I'm planning to mark
> as returned with feedback.  Anything that is clearly making good
> progress but isn't yet ready for committer, I'm going to move to the
> next CF.  If you're a patch owner or reviewer and you can help move
> your patches in the right direction, or have other feedback on the
> appropriate state for any or all patches, then please speak up, I'd
> really appreciate it.  In all cases please feel free to change the
> state or complain if you think I or someone else got it wrong; if I
> recall correctly there is a way to get from "returned" to "moved to
> next CF", perhaps via an intermediate state.  Thanks!
>

As a normal lurker on hackers, it has been nice seeing the weekly updates.
Thanks for those.

-- Rob


>
> --
> Thomas Munro
>
> https://enterprisedb.com
>
>
>


Re: refactoring - share str2*int64 functions

2019-07-31 Thread Michael Paquier
On Mon, Jul 29, 2019 at 07:04:09AM +0200, Fabien COELHO wrote:
> Bonjour Michaël,
> Yep, I will try for this week.

Please note that for now I have marked the patch as returned with
feedback as the CF is ending.
--
Michael


signature.asc
Description: PGP signature


More refactoring for BuildIndexInfo

2019-07-31 Thread Michael Paquier
Hi all,

7cce1593 has introduced a new routine makeIndexInfo to create
IndexInfo nodes.  It happens that we can do a bit more refactoring as
per the attached, as BuildIndexInfo can make use of this routine,
removing some duplication on the way (filling in IndexInfo was still
duplicated in a couple of places before 7cce159).

Any thoughts or objections?
--
Michael
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 99ae159f98..5753870158 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2229,7 +2229,7 @@ index_drop(Oid indexId, bool concurrent, bool concurrent_lock_mode)
 IndexInfo *
 BuildIndexInfo(Relation index)
 {
-	IndexInfo  *ii = makeNode(IndexInfo);
+	IndexInfo  *ii;
 	Form_pg_index indexStruct = index->rd_index;
 	int			i;
 	int			numAtts;
@@ -2239,22 +2239,20 @@ BuildIndexInfo(Relation index)
 	if (numAtts < 1 || numAtts > INDEX_MAX_KEYS)
 		elog(ERROR, "invalid indnatts %d for index %u",
 			 numAtts, RelationGetRelid(index));
-	ii->ii_NumIndexAttrs = numAtts;
-	ii->ii_NumIndexKeyAttrs = indexStruct->indnkeyatts;
-	Assert(ii->ii_NumIndexKeyAttrs != 0);
-	Assert(ii->ii_NumIndexKeyAttrs <= ii->ii_NumIndexAttrs);
 
+	ii = makeIndexInfo(indexStruct->indnatts,
+	   indexStruct->indnkeyatts,
+	   index->rd_rel->relam,
+	   RelationGetIndexExpressions(index),
+	   RelationGetIndexPredicate(index),
+	   indexStruct->indisunique,
+	   indexStruct->indisready,
+	   false);
+
+	/* fill in attribute numbers */
 	for (i = 0; i < numAtts; i++)
 		ii->ii_IndexAttrNumbers[i] = indexStruct->indkey.values[i];
 
-	/* fetch any expressions needed for expressional indexes */
-	ii->ii_Expressions = RelationGetIndexExpressions(index);
-	ii->ii_ExpressionsState = NIL;
-
-	/* fetch index predicate if any */
-	ii->ii_Predicate = RelationGetIndexPredicate(index);
-	ii->ii_PredicateState = NULL;
-
 	/* fetch exclusion constraint info if any */
 	if (indexStruct->indisexclusion)
 	{
@@ -2263,30 +2261,6 @@ BuildIndexInfo(Relation index)
 								 &ii->ii_ExclusionProcs,
 								 &ii->ii_ExclusionStrats);
 	}
-	else
-	{
-		ii->ii_ExclusionOps = NULL;
-		ii->ii_ExclusionProcs = NULL;
-		ii->ii_ExclusionStrats = NULL;
-	}
-
-	/* other info */
-	ii->ii_Unique = indexStruct->indisunique;
-	ii->ii_ReadyForInserts = indexStruct->indisready;
-	/* assume not doing speculative insertion for now */
-	ii->ii_UniqueOps = NULL;
-	ii->ii_UniqueProcs = NULL;
-	ii->ii_UniqueStrats = NULL;
-
-	/* initialize index-build state to default */
-	ii->ii_Concurrent = false;
-	ii->ii_BrokenHotChain = false;
-	ii->ii_ParallelWorkers = 0;
-
-	/* set up for possible use by index AM */
-	ii->ii_Am = index->rd_rel->relam;
-	ii->ii_AmCache = NULL;
-	ii->ii_Context = CurrentMemoryContext;
 
 	return ii;
 }


signature.asc
Description: PGP signature


Re: Commitfest 2019-07, the first of five* for PostgreSQL 13

2019-07-31 Thread Thomas Munro
Hi all,

CF1 officially ends in about 8 hours,  when August arrives on the
volcanic islands of Howard and Baker, according to CURRENT_TIMESTAMP
AT TIME ZONE '+12'.  I'll probably mark it closed at least 8 hours
later than that because I'll be asleep.  Anything that is waiting on
author and hasn't had any recent communication, I'm planning to mark
as returned with feedback.  Anything that is clearly making good
progress but isn't yet ready for committer, I'm going to move to the
next CF.  If you're a patch owner or reviewer and you can help move
your patches in the right direction, or have other feedback on the
appropriate state for any or all patches, then please speak up, I'd
really appreciate it.  In all cases please feel free to change the
state or complain if you think I or someone else got it wrong; if I
recall correctly there is a way to get from "returned" to "moved to
next CF", perhaps via an intermediate state.  Thanks!

-- 
Thomas Munro
https://enterprisedb.com




Refactoring code stripping trailing \n and \r from strings

2019-07-31 Thread Michael Paquier
Hi Tom,

b654714 has reworked the way we handle removal of CLRF for several
code paths, and has repeated the same code patterns to do that in 8
different places.  Could it make sense to refactor things as per the
attached with a new routine in common/string.c?
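
For reference, the new routine itself would presumably be little more
than the loop currently repeated in each caller, roughly like the
sketch below (the exact name and signature in common/string.c may of
course end up different):

/*
 * pg_strip_crlf -- remove any trailing newline and carriage return
 *
 * Strips any trailing \n and \r from a null-terminated string and
 * returns its new length.  (Sketch only; relies on <string.h> via c.h.)
 */
int
pg_strip_crlf(char *str)
{
	int			len = strlen(str);

	while (len > 0 && (str[len - 1] == '\n' ||
					   str[len - 1] == '\r'))
		str[--len] = '\0';

	return len;
}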

Thanks,
--
Michael
diff --git a/src/backend/libpq/be-secure-common.c b/src/backend/libpq/be-secure-common.c
index 4abbef5bf1..e8f27bc782 100644
--- a/src/backend/libpq/be-secure-common.c
+++ b/src/backend/libpq/be-secure-common.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 
+#include "common/string.h"
 #include "libpq/libpq.h"
 #include "storage/fd.h"
 
@@ -112,11 +113,8 @@ run_ssl_passphrase_command(const char *prompt, bool is_server_start, char *buf,
 		goto error;
 	}
 
-	/* strip trailing newline, including \r in case we're on Windows */
-	len = strlen(buf);
-	while (len > 0 && (buf[len - 1] == '\n' ||
-	   buf[len - 1] == '\r'))
-		buf[--len] = '\0';
+	/* strip trailing newline and carriage return */
+	len = pg_strip_crlf(buf);
 
 error:
 	pfree(command.data);
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 068b65a2af..6ef004f192 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -27,6 +27,7 @@
 #include "common/controldata_utils.h"
 #include "common/file_perm.h"
 #include "common/logging.h"
+#include "common/string.h"
 #include "getopt_long.h"
 #include "utils/pidfile.h"
 
@@ -2219,12 +2220,8 @@ adjust_data_dir(void)
 	pclose(fd);
 	free(my_exec_path);
 
-	/* Remove trailing newline, handling Windows newlines as well */
-	len = strlen(filename);
-	while (len > 0 &&
-		   (filename[len - 1] == '\n' ||
-			filename[len - 1] == '\r'))
-		filename[--len] = '\0';
+	/* strip trailing newline and carriage return */
+	len = pg_strip_crlf(filename);
 
 	free(pg_data);
 	pg_data = pg_strdup(filename);
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 9d9c33d78c..702986539b 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -54,6 +54,7 @@
 #include "common/file_perm.h"
 #include "common/logging.h"
 #include "common/restricted_token.h"
+#include "common/string.h"
 #include "storage/large_object.h"
 #include "pg_getopt.h"
 #include "getopt_long.h"
@@ -557,12 +558,8 @@ CheckDataVersion(void)
 		exit(1);
 	}
 
-	/* remove trailing newline, handling Windows newlines as well */
-	len = strlen(rawline);
-	while (len > 0 &&
-		   (rawline[len - 1] == '\n' ||
-			rawline[len - 1] == '\r'))
-		rawline[--len] = '\0';
+	/* strip trailing newline and carriage return */
+	len = pg_strip_crlf(rawline);
 
 	if (strcmp(rawline, PG_MAJORVERSION) != 0)
 	{
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 7e3d3f1bb2..c74f05c032 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -411,7 +411,6 @@ adjust_data_dir(ClusterInfo *cluster)
 cmd_output[MAX_STRING];
 	FILE	   *fp,
 			   *output;
-	int			len;
 
 	/* Initially assume config dir and data dir are the same */
 	cluster->pgconfig = pg_strdup(cluster->pgdata);
@@ -452,12 +451,8 @@ adjust_data_dir(ClusterInfo *cluster)
 
 	pclose(output);
 
-	/* Remove trailing newline, handling Windows newlines as well */
-	len = strlen(cmd_output);
-	while (len > 0 &&
-		   (cmd_output[len - 1] == '\n' ||
-			cmd_output[len - 1] == '\r'))
-		cmd_output[--len] = '\0';
+	/* strip trailing newline and carriage return */
+	(void) pg_strip_crlf(cmd_output);
 
 	cluster->pgdata = pg_strdup(cmd_output);
 
@@ -518,15 +513,9 @@ get_sock_dir(ClusterInfo *cluster, bool live_check)
 	sscanf(line, "%hu", &old_cluster.port);
 if (lineno == LOCK_FILE_LINE_SOCKET_DIR)
 {
-	int			len;
-
+	/* strip trailing newline and carriage return */
 	cluster->sockdir = pg_strdup(line);
-	/* strip off newline, handling Windows newlines as well */
-	len = strlen(cluster->sockdir);
-	while (len > 0 &&
-		   (cluster->sockdir[len - 1] == '\n' ||
-			cluster->sockdir[len - 1] == '\r'))
-		cluster->sockdir[--len] = '\0';
+	(void) pg_strip_crlf(cluster->sockdir);
 }
 			}
 			fclose(fp);
diff --git a/src/bin/psql/prompt.c b/src/bin/psql/prompt.c
index 59afbc793a..0fcb8c7783 100644
--- a/src/bin/psql/prompt.c
+++ b/src/bin/psql/prompt.c
@@ -22,6 +22,7 @@
 #include "prompt.h"
 #include "settings.h"
 
+#include "common/string.h"
 
 /*--
  * get_prompt
@@ -264,7 +265,6 @@ get_prompt(promptStatus_t status, ConditionalStack cstack)
 		FILE	   *fd;
 		char	   *file = pg_strdup(p + 1);
 		int			cmdend;
-		int			buflen;
 
 		cmdend = strcspn(file, "`");
 		file[cmdend] = '\0';
@@ -275,10 +275,10 @@ get_prompt(promptStatus_t status, ConditionalStack cstack)
 buf[0] = '\0';
 			pclose(fd);
 		}
-		buflen = strlen(buf);
-		while (buflen > 0 && (buf[buflen - 1] == '\n' ||
-			  buf[buflen - 1] == '\r'))
-			buf[--buflen] = '\0';
+
+		/* strip trailing newline and carriage 

Re: Global temporary tables

2019-07-31 Thread Craig Ringer
On Wed, 31 Jul 2019 at 23:05, Konstantin Knizhnik 
wrote:

> Current Postgres implementation of temporary table causes number of
> problems:
>
> 1. Catalog bloating: if client creates and deletes too many temporary
> tables, then autovacuum get stuck on catalog.
>

This also upsets logical decoding a little - AFAICS it still has to treat
transactions that use temporary tables as catalog-modifying transactions,
tracking them in its historic catalog snapshots and doing extra cache
flushes etc when decoding them.

This will become even more important as we work to support eager/optimistic
output plugin processing of in-progress transactions. We'd have to switch
snapshots more, and that can get quite expensive so using temp tables could
really hurt performance. Or we'd have to serialize on catalog-changing
transactions, in which case using temp tables would negate the benefits of
optimistic streaming of in-progress transactions.


> 3. It is not possible to use temporary tables at replica.


For physical replicas, yes.


> Hot standby
> configuration is frequently used to run OLAP queries on replica
> and results of such queries are used to be saved in temporary tables.
> Right now it is not possible (except "hackers" solution with storing
> results in file_fdw).
>

Right. Because we cannot modify pg_class, pg_attribute etc, even though we
could reasonably enough write to local-only relfilenodes on a replica if we
didn't have to change WAL-logged catalog tables.

I've seen some hacks suggested around this where we have an unlogged fork
of each of the needed catalog tables, allowing replicas to write temp table
info to them. We'd scan both the logged and unlogged forks when doing
relcache management etc. But there are plenty of ugly issues with this.
We'd have to reserve oid ranges for them which is ugly; to make it BC
friendly those reservations would probably have to take the form of some
kind of placeholder entry in the real pg_class. And it gets ickier from
there. It hardly seems worth it when we should probably just implement
global temp tables instead.


> 5. Inefficient memory usage and possible memory overflow: each backend
> maintains its own local buffers for work with temporary tables.
>

Is there any reason that would change with global temp tables? We'd still
be creating a backend-local relfilenode for each backend that actually
writes to the temp table, and I don't see how it'd be useful or practical
to keep those in shared_buffers.

Using local buffers has big advantages too. It saves shared_buffers space
for data where there's actually some possibility of getting cache hits, or
for where we can benefit from lazy/async writeback and write combining. I
wouldn't want to keep temp data there if I had the option.

If you're concerned about the memory use of backend local temp buffers, or
about how we account for and limit those, that's worth looking into. But I
don't think it'd be something that should be affected by global-temp vs
backend-local-temp tables.


> Default size of temporary buffers is 8Mb. It seems to be too small for
> modern servers having hundreds of gigabytes of RAM, causing extra
> copying of data between OS cache and local buffers. But if there are
> thousands of backends, each executing queries with temporary tables,
> then total amount of memory used for temporary buffers can exceed
> several tens of gigabytes.
>

Right. But what solution do you propose for this? Putting that in
shared_buffers will do nothing except deprive shared_buffers of space that
can be used for other more useful things. A server-wide temp buffer would
add IPC and locking overheads and AFAICS little benefit. One of the big
appeals of temp tables is that we don't need any of that.

If you want to improve server-wide temp buffer memory accounting and
management that makes sense. I can see it being useful to have things like
a server-wide DSM/DSA pool of temp buffers that backends borrow from and
return to based on memory pressure on a LRU-ish basis, maybe. But I can
also see how that'd be complex and hard to get right. It'd also be prone to
priority inversion problems where an idle/inactive backend must be woken up
to release memory or release locks, depriving an actively executing backend
of runtime. And it'd be as likely to create inefficiencies with copying and
eviction as solve them since backends could easily land up taking turns
kicking each other out of memory and re-reading their own data.

I don't think this is something that should be tackled as part of work on
global temp tables personally.



> 6. Connection pooler can not reschedule session which has created
> temporary tables to some other backend because it's data is stored in local
> buffers.
>

Yeah, if you're using transaction-associative pooling. That's just part of
a more general problem though, there are piles of related issues with temp
tables, session GUCs, session advisory locks and more.

I don't see how global temp tables will do you 

Re: concerns around pg_lsn

2019-07-31 Thread Craig Ringer
On the topic of pg_lsn, I recently noticed that there's no
operator(+)(pg_lsn,bigint) nor is there an operator(-)(pg_lsn,bigint) so
you can't compute offsets easily. We don't have a cast between pg_lsn and
bigint because we don't expose an unsigned bigint type in SQL, so you can't
work around it that way.

I may be missing the obvious, but I suggest (and will follow with a patch
for) adding + and - operators for computing offsets. I was considering an
age() function for it too, but I think it's best to force the user to be
clear about what current LSN they want to compare with so I'll skip that.
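
To make that concrete, the underlying support function could look
roughly like the sketch below; pg_lsn_add_bytes() is a name invented
here for illustration (it is not an existing server function), and
wiring it up would additionally need a pg_proc/pg_operator entry or a
CREATE OPERATOR in an extension:

#include "postgres.h"

#include "fmgr.h"
#include "utils/pg_lsn.h"

/* Sketch only: function name and error wording are illustrative. */
Datum
pg_lsn_add_bytes(PG_FUNCTION_ARGS)
{
	XLogRecPtr	lsn = PG_GETARG_LSN(0);
	int64		nbytes = PG_GETARG_INT64(1);
	XLogRecPtr	result;

	if (nbytes >= 0)
	{
		result = lsn + (uint64) nbytes;
		if (result < lsn)		/* wrapped past the top of the LSN space */
			ereport(ERROR,
					(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
					 errmsg("pg_lsn out of range")));
	}
	else
	{
		/* compute |nbytes| without risking -INT64_MIN overflow */
		uint64		delta = (uint64) 0 - (uint64) nbytes;

		if (delta > lsn)		/* would go below 0/0 */
			ereport(ERROR,
					(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
					 errmsg("pg_lsn out of range")));
		result = lsn - delta;
	}

	PG_RETURN_LSN(result);
}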

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 2ndQuadrant - PostgreSQL Solutions for the Enterprise


Re: [PATCH] minor bugfix for pg_basebackup (9.6 ~ )

2019-07-31 Thread Ian Barwick

On 7/27/19 6:52 AM, Alvaro Herrera wrote:
> On 2019-Jul-25, Michael Paquier wrote:
>> On Wed, Jul 24, 2019 at 11:23:30AM -0400, Alvaro Herrera wrote:
>>> Heh, yesterday I revised the original patch as attached and was about to
>>> push when the bell rang.  I like this one because it keeps the comment
>>> to one line and it mentions the function name in charge of the
>>> validation (useful for grepping later on).  It's a bit laconic because
>>> of the long function name and the desire to keep it to one line, but it
>>> seems sufficient to me.
>>
>> Looks fine to me.  A nit: addition of braces for the if block.  Even
>> if that's a one-liner, there is a comment, so I think that this makes
>> the code more readable.
>
> Yeah, I had removed those on purpose, but that was probably inconsistent
> with my own reviews of others' patches.  I pushed it with them.

Thanks

Regards

Ian Barwick


--
 Ian Barwick   https://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: partition routing layering in nodeModifyTable.c

2019-07-31 Thread Amit Langote
Fujita-san,

Thanks for the reply and sorry I didn't wait a bit more before posting
the patch.

On Wed, Jul 31, 2019 at 9:04 PM Etsuro Fujita  wrote:
> On Wed, Jul 31, 2019 at 5:05 PM Amit Langote  wrote:
> > On Tue, Jul 30, 2019 at 4:20 PM Amit Langote  
> > wrote:
> > > On Sat, Jul 20, 2019 at 1:52 AM Andres Freund  wrote:
> > > > IOW, we should just change the direct modify calls to get the relevant
> > > > ResultRelationInfo or something in that vein (perhaps just the relevant
> > > > RT index?).
> > >
> > > It seems easy to make one of the two functions that constitute the
> > > direct modify API, IterateDirectModify(), access the result relation
> > > from ForeignScanState by saving either the result relation RT index or
> > > ResultRelInfo pointer itself into the ForeignScanState's FDW-private
> > > area.  For example, for postgres_fdw, one would simply add a new
> > > member to PgFdwDirectModifyState struct.
> > >
> > > Doing that for the other function BeginDirectModify() seems a bit more
> > > involved.  We could add a new field to ForeignScan, say
> > > resultRelation, that's set by either PlanDirectModify() (the FDW code)
> > > or make_modifytable() (the core code) if the ForeignScan node contains
> > > the command for direct modification.  BeginDirectModify() can then use
> > > that value instead of relying on es_result_relation_info being set.
> > >
> > > Thoughts?  Fujita-san, do you have any opinion on whether that would
> > > be a good idea?
>
> I'm still not sure that it's a good idea to remove
> es_result_relation_info, but if I had to say then I think the latter
> would probably be better.

Could you please clarify what you meant by the "latter"?

If it's the approach of adding a resultRelation Index field to
ForeignScan node, I tried and had to give up, realizing that we don't
maintain ResultRelInfos in an array that is indexable by RT indexes.
It would've worked if es_result_relations had mirrored es_range_table,
although that probably complicates how the individual ModifyTable
nodes attach to that array.  In any case, given this discussion,
further hacking on a global variable like es_result_relations may be a
course we might not want to pursue.

>  I'm planning to rework on direct
> modification to base it on upper planner pathification so we can
> perform direct modification without the ModifyTable node.  (I'm not
> sure we can really do this for inherited UPDATE/DELETE, though.)  For
> that rewrite, I'm thinking to call BeginDirectModify() from the
> ForeignScan node (ie, ExecInitForeignScan()) as-is.  The latter
> approach would allow that without any changes and avoid changing that
> API many times.  That's the reason why I think the latter would
> probably be better.

Will the new planning approach you're thinking of get rid of needing
any result relations at all (and so the ResultRelInfos in the
executor)?

Thanks,
Amit




Re: Possible race condition in pg_mkdir_p()?

2019-07-31 Thread Ning Yu
On Wed, Jul 31, 2019 at 2:21 PM Michael Paquier  wrote:
> That's my impression as well.  Please note that this does not involve
> an actual bug in Postgres and that this is rather invasive, so this
> does not really qualify for a back-patch.  No objections to simplify
> the backend on HEAD though.  It would be good if you could actually
> register a patch to the commit fest app [1] and also rework the patch
> set so as at least PathNameCreateTemporaryDir wins its simplifications
> for the first problem (double-checking the other code paths would be
> nice as well).  The EEXIST handling, and the confusion about EEXIST
> showing for both a path and a file need some separate handling (not
> sure what to do on these parts yet).

Thanks for the suggestion and information; we will rework the patch and
register it.  The planned changes are: 1) remove the tests (cool!), 2)
simplify PathNameCreateTemporaryDir() and other callers.  The EEXIST
handling will be put in a separate patch, so depending on the reviews we
can easily accept or drop it.

Best Regards
Ning




Re: POC: converting Lists into arrays

2019-07-31 Thread Andres Freund
Hi,

On 2019-07-31 19:40:09 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > Unfortunately foreach(ListCell *lc, ...) doesn't work with the current
> > definition. Which I think isn't great, because the large scopes for loop
> > iteration variables imo makes the code harder to reason about.
>
> Yeah, I tried to make that possible when I redid those macros, but
> couldn't find a way :-(.  Even granting that we're willing to have
> a different macro for this use-case, it doesn't seem easy, because
> you can only put one declaration into the first element of a
> for (;;).

I remember hitting that at one point and being annoyed/confused as to
where that restriction came from. Probably some grammar difficulties.
But still, odd.


> That makes the other idea (of a foreach-ish macro declaring the
> listcell value variable) problematic, too :-(.

Hm. One way partially around that would be using an anonymous struct
inside the for(). Something like

#define foreach_node(membertype, name, lst) \
    for (struct {membertype *node; ListCell *lc; const List *l; int i;} name = {...}; \
         ...)

which then would allow code like

foreach_node(OpExpr, cur, list)
{
do_something_with_node(cur.node);

foreach_delete_current(cur);
}
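
Fleshed out against the v12 array-based List representation, that could
look roughly like the following (illustrative only; foreach_delete_current
support is left out, and do_something_with_node()/clauses are placeholder
names):

#define foreach_node(membertype, name, lst) \
	for (struct {membertype *node; const List *l; int i;} name = {NULL, (lst), 0}; \
		 name.l != NIL && name.i < name.l->length && \
		 (name.node = lfirst_node(membertype, &name.l->elements[name.i]), true); \
		 name.i++)

foreach_node(OpExpr, cur, clauses)
{
	do_something_with_node(cur.node);
}

Whether defining a struct type inside the for() init clause is something
we want style-wise is of course a separate question.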


That's quite similar to your:

> One idea is that we could do something like
>
> foreach_variant(identifier, list_value)
> {
>type *v = (type *) lfirst_variant(identifier);
>...
> }
>
> where the "identifier" isn't actually a variable name but just something
> we use to construct the ForEachState variable's name.  (The only reason
> we need it is to avoid confusion in cases with nested foreach's.)  The
> lfirst_variant macro would fetch the correct value just by looking
> at the ForEachState, so there's no separate ListCell* variable at all.

but would still allow to avoid the variable.

Greetings,

Andres Freund




Re: concerns around pg_lsn

2019-07-31 Thread Michael Paquier
On Wed, Jul 31, 2019 at 09:51:30AM +0900, Michael Paquier wrote:
> I am adding Peter Eisentraut in CC as 21f428e is his commit.  I think
> that the first patch is a good idea, so I would be fine to apply it,
> but let's see the original committer's opinion first.

On further review of the first patch, I think that it could be a good
idea to apply the same safeguards within float8in_internal_opt_error.
Jeevan, what do you think?
--
Michael


signature.asc
Description: PGP signature


Re: POC: converting Lists into arrays

2019-07-31 Thread Tom Lane
I wrote:
> BTW, I think we could make equivalent macros in the old regime,
> which would be a good thing because then it would be possible to
> back-patch code using this notation.

Oh, wait-a-second.  I was envisioning that

for (ListCell *anonymous__lc = ...)

would work for that, but of course that requires C99, so we could
only put it into v12.

But that might still be worth doing.  It'd mean that the backpatchability
of this notation is the same as that of "for (int x = ...)", which
seems worth something.

regards, tom lane




RE: Multivariate MCV list vs. statistics target

2019-07-31 Thread Jamison, Kirk
Hi, 

> From: Jamison, Kirk [mailto:k.jami...@jp.fujitsu.com]
> Sent: Monday, July 29, 2019 10:53 AM
> To: 'Tomas Vondra' 
> Cc: Dean Rasheed ; PostgreSQL Hackers
> 
> Subject: RE: Multivariate MCV list vs. statistics target
> 
> On Saturday, July 27, 2019 7:06 AM(GMT+9), Tomas Vondra wrote:
> > On Fri, Jul 26, 2019 at 07:03:41AM +, Jamison, Kirk wrote:
> > >On Sat, July 20, 2019 8:12 AM (GMT+9), Tomas Vondra wrote:
> > >
> > >> >+   /* XXX What if the target is set to 0? Reset the statistic? */
> > >> >
> > >> >This also makes me wonder. I haven't looked deeply into the code,
> > >> >but since 0 is a valid value, I believe it should reset the stats.
> > >>
> > >> I agree resetting the stats after setting the target to 0 seems
> > >> quite reasonable. But that's not what we do for attribute stats,
> > >> because in that case we simply skip the attribute during the future
> > >> ANALYZE runs
> > >> - we don't reset the stats, we keep the existing stats. So I've
> > >> done the same thing here, and I've removed the XXX comment.
> > >>
> > >> If we want to change that, I'd do it in a separate patch for both
> > >> the regular and extended stats.
> > >
> > >Hi, Tomas
> > >
> > >Sorry for my late reply.
> > >You're right. I have no strong opinion whether we'd want to change
> > >that
> > behavior.
> > >I've also confirmed the change in the patch where setting statistics
> > >target 0 skips the statistics.
> > >
> >
> > OK, thanks.
> >
> > >Maybe only some minor nitpicks in the source code comments below:
> > >1. "it's" should be "its":
> > >> + * Compute statistic target, based on what's set for the
> > statistic
> > >> + * object itself, and for it's attributes.
> > >
> > >2. Consistency whether you'd use either "statistic " or "statisticS ".
> > >Ex. statistic target vs statisticS target, statistics object vs
> > >statistic
> > object, etc.
> > >
> > >> Attached is v4 of the patch, with a couple more improvements:
> > >>
> > >> 1) I've renamed the if_not_exists flag to missing_ok, because
> > >> that's more consistent with the "IF EXISTS" clause in the grammar
> > >> (the old flag was kinda the exact opposite), and I've added a NOTICE
> about the skip.
> > >
> > >+  boolmissing_ok;  /* do nothing if statistics does
> > not exist */
> > >
> > >Confirmed. So we ignore if statistic does not exist, and skip the error.
> > >Maybe to make it consistent with other data structures in
> > >parsernodes.h, you can change the comment of missing_ok to:
> > >/* skip error if statistics object does not exist */
> > >
> >
> > Thanks, I've fixed all those places in the attached v5.
> 
> I've confirmed the fix.
> 
> > >> 2) I've renamed ComputeExtStatsTarget to ComputeExtStatsRows,
> > >> because that's what the function was doing anyway (computing sample
> size).
> > >>
> > >> 3) I've added a couple of regression tests to stats_ext.sql
> > >>
> > >> Aside from that, I've cleaned up a couple of places and improved a
> > >> bunch of comments. Nothing huge.
> > >
> > >I have a question though regarding ComputeExtStatisticsRows.
> > >I'm just curious with the value 300 when computing sample size.
> > >Where did this value come from?
> > >
> > >+  /* compute sample size based on the statistic target */
> > >+  return (300 * result);
> > >
> > >Overall, the patch is almost already in good shape for commit.
> > >I'll wait for the next update.
> > >
> >
> > That's how we compute number of rows to sample, based on the statistics
> target.
> > See std_typanalyze() in analyze.c, which also cites the paper where
> > this comes from.
> Noted. Found it. Thank you for the reference.
> 
> 
> There's just a small whitespace (extra space) below after running git diff
> --check.
> >src/bin/pg_dump/pg_dump.c:7226: trailing whitespace.
> >+
> It would be better to post an updated patch, but other than that, I've 
> confirmed
> the fixes.
> The patch passed the make-world and regression tests as well.
> I've marked this as "ready for committer".

The patch does not apply anymore.
In addition to the whitespace detected,
kindly rebase the patch as there were changes from recent commits
in the following files:
   src/backend/commands/analyze.c
   src/bin/pg_dump/pg_dump.c
   src/bin/pg_dump/t/002_pg_dump.pl
   src/test/regress/expected/stats_ext.out
   src/test/regress/sql/stats_ext.sql

Regards,
Kirk Jamison




Re: POC: converting Lists into arrays

2019-07-31 Thread Tom Lane
I wrote:
> One idea is that we could do something like
> foreach_variant(identifier, list_value)
> {
>type *v = (type *) lfirst_variant(identifier);
>...
> }
> where the "identifier" isn't actually a variable name but just something
> we use to construct the ForEachState variable's name.  (The only reason
> we need it is to avoid confusion in cases with nested foreach's.)

On second thought, there seems no strong reason why you should need
to fetch the current value of a foreach-ish loop that's not the most
closely nested one.  So forget the dummy identifier, and consider
this straw-man proposal:

#define aforeach(list_value) ...

(I'm thinking "anonymous foreach", but bikeshedding welcome.)  This
is just like the current version of foreach(), except it uses a
fixed name for the ForEachState variable and doesn't attempt to
assign to a "cell" variable.

#define aforeach_current() ...

Retrieves the current value of the most-closely-nested aforeach
loop, based on knowing the fixed name of aforeach's loop variable.
This replaces "lfirst(lc)", and we'd also need aforeach_current_int()
and so on for the other variants of lfirst().

So usage would look like, say,

aforeach(my_list)
{
type *my_value = (type *) aforeach_current();
...
}

We'd also want aforeach_delete_current() and aforeach_current_index(),
to provide functionality equivalent to foreach_delete_current() and
foreach_current_index().

These names are a bit long, and maybe we should try to make them
shorter, but more shortness might also mean less clarity.

BTW, I think we could make equivalent macros in the old regime,
which would be a good thing because then it would be possible to
back-patch code using this notation.
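
In the new regime, a strawman implementation could be as simple as
something like this (illustrative only, relying on the ForEachState
struct the List rewrite introduced):

#define aforeach(lst) \
	for (ForEachState aforeach__state = {(lst), 0}; \
		 aforeach__state.l != NIL && \
		 aforeach__state.i < aforeach__state.l->length; \
		 aforeach__state.i++)

#define aforeach_current() \
	(aforeach__state.l->elements[aforeach__state.i].ptr_value)

A nested aforeach would simply shadow the outer loop's state variable,
which gives exactly the most-closely-nested behavior described above.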

Thoughts?

regards, tom lane




Re: POC: converting Lists into arrays

2019-07-31 Thread Tom Lane
Andres Freund  writes:
> Unfortunately foreach(ListCell *lc, ...) doesn't work with the current
> definition. Which I think isn't great, because the large scopes for loop
> iteration variables imo makes the code harder to reason about.

Yeah, I tried to make that possible when I redid those macros, but
couldn't find a way :-(.  Even granting that we're willing to have
a different macro for this use-case, it doesn't seem easy, because
you can only put one declaration into the first element of a
for (;;).

That makes the other idea (of a foreach-ish macro declaring the
listcell value variable) problematic, too :-(.

One idea is that we could do something like

foreach_variant(identifier, list_value)
{
   type *v = (type *) lfirst_variant(identifier);
   ...
}

where the "identifier" isn't actually a variable name but just something
we use to construct the ForEachState variable's name.  (The only reason
we need it is to avoid confusion in cases with nested foreach's.)  The
lfirst_variant macro would fetch the correct value just by looking
at the ForEachState, so there's no separate ListCell* variable at all.

regards, tom lane




Re: Unused header file inclusion

2019-07-31 Thread Tom Lane
Andres Freund  writes:
> On 2019-07-31 16:55:31 -0400, Tom Lane wrote:
>> Yeah.  I seem to recall a proposal that nodes.h should contain
>> 
>> typedef struct Foo Foo;
>> 
>> for every node type Foo, and then the other headers would just
>> fill in the structs, and we could get rid of a lot of ad-hoc
>> forward struct declarations and other hackery.

> That to me just seems objectively worse. Now adding a new struct as a
> minor implementation detail of some subsystem doesn't just require
> recompiling the relevant files, but just about all of pg.

Er, what?  This list of typedefs would change exactly when enum NodeTag
changes, so AFAICS your objection is bogus.

It's true that this proposal doesn't help for structs that aren't Nodes,
but my sense is that > 90% of our ad-hoc struct references are for Nodes.

> Right now we really have weird dependencies between largely independent
> subsystem.

Agreed, but I think fixing that will take some actually serious design
work.  It's not going to mechanically fall out of changing typedef rules.

> The only reason the explicit forward declaration is needed in the common
> case of a 'struct foo*' parameter is that C has weird visibility rules
> about the scope of forward declarations in paramters.

Yeah, but there's not much we can do about that, nor is getting rid
of typedefs in favor of "struct" going to make it even a little bit
better.

regards, tom lane




Re: POC: converting Lists into arrays

2019-07-31 Thread Petr Jelinek

On 01/08/2019 01:04, Andres Freund wrote:
> Hi,
>
> On 2019-07-31 16:00:47 -0700, Andres Freund wrote:
>> On 2019-07-31 15:57:56 -0700, Andres Freund wrote:
>>> I also wonder if a foreach version that includes the typical
>>> (Type *) var = (Type *) lfirst(lc);
>>> or
>>> (Type *) var = castNode(Type, lfirst(lc));
>>> or
>>> OpExpr *hclause = lfirst_node(OpExpr, lc);
>>>
>>> would make it nicer to use lists.
>>>
>>> foreach_node_in(Type, name, list) could mean something like
>>>
>>> foreach(ListCell *name##_cell, list)
>>> {
>>>     Type* name = lfirst_node(Type, name##_cell);
>>> }
>>
>> s/lfirst/linitial/ of course. Was looking at code that also used
>> lfirst...
>
> Bullshit, of course.
>
> /me performs a tactical withdrawal into his brown paper bag.
>
>> Reminds me that one advantage of macros like the second one would also
>> be to reduce the use of the confusingly named linitial*(), helping newer
>> hackers.
>
> But that point just had two consecutive embarrassing demonstrations...

Yeah, pg_list.h is one file I never close.

--
Petr Jelinek
2ndQuadrant - PostgreSQL Solutions for the Enterprise
https://www.2ndQuadrant.com/




Re: POC: converting Lists into arrays

2019-07-31 Thread Andres Freund
Hi,

On 2019-07-31 16:00:47 -0700, Andres Freund wrote:
> On 2019-07-31 15:57:56 -0700, Andres Freund wrote:
> > I also wonder if a foreach version that includes the typical
> > (Type *) var = (Type *) lfirst(lc);
> > or
> > (Type *) var = castNode(Type, lfirst(lc));
> > or
> > OpExpr *hclause = lfirst_node(OpExpr, lc);
> > 
> > would make it nicer to use lists.
> > 
> > foreach_node_in(Type, name, list) could mean something like
> > 
> > foreach(ListCell *name##_cell, list)
> > {
> > Type* name = lfirst_node(Type, name##_cell);
> > }
> 
> s/lfirst/linitial/ of course. Was looking at code that also used
> lfirst...

Bullshit, of course.

/me performs a tactical withdrawal into his brown paper bag.


> Reminds me that one advantage of macros like the second one would also
> be to reduce the use of the confusingly named linitial*(), helping newer
> hackers.

But that point just had two consecutive embarrassing demonstrations...

- Andres




Re: POC: converting Lists into arrays

2019-07-31 Thread Andres Freund
Hi,

On 2019-07-31 15:57:56 -0700, Andres Freund wrote:
> I also wonder if a foreach version that includes the typical
> (Type *) var = (Type *) lfirst(lc);
> or
> (Type *) var = castNode(Type, lfirst(lc));
> or
> OpExpr   *hclause = lfirst_node(OpExpr, lc);
> 
> would make it nicer to use lists.
> 
> foreach_node_in(Type, name, list) could mean something like
> 
> foreach(ListCell *name##_cell, list)
> {
> Type* name = lfirst_node(Type, name##_cell);
> }

s/lfirst/linitial/ of course. Was looking at code that also used
lfirst...

Reminds me that one advantage of macros like the second one would also
be to reduce the use of the confusingly named linitial*(), helping newer
hackers.

Greetings,

Andres Freund




Re: POC: converting Lists into arrays

2019-07-31 Thread Andres Freund
Hi,

I was just looking at the diff for a fix, which adds a "ListCell *lc;"
to function scope, even though it's only needed in a pretty narrow
scope.

Unfortunately foreach(ListCell *lc, ...) doesn't work with the current
definition. Which I think isn't great, because the large scopes for loop
iteration variables imo makes the code harder to reason about.

I wonder if we could either have a different version of foreach() that
allows that, or find a way to make the above work. For the latter I
don't immediately have a good idea of how to accomplish that. For the
former it's easy enough if we either don't include the typename (thereby
looking more alien), or if we reference the name separately (making it
more complicated to use).


I also wonder if a foreach version that includes the typical
(Type *) var = (Type *) lfirst(lc);
or
(Type *) var = castNode(Type, lfirst(lc));
or
OpExpr *hclause = lfirst_node(OpExpr, lc);

would make it nicer to use lists.

foreach_node_in(Type, name, list) could mean something like

foreach(ListCell *name##_cell, list)
{
Type* name = lfirst_node(Type, name##_cell);
}

(using a hypothetical foreach that supports defining the ListCell in
scope, just for display simplicity's sake).

Greetings,

Andres Freund




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-07-31 Thread Andres Freund
Hi,

On 2019-07-31 23:51:40 +0200, Julien Rouhaud wrote:
> On Wed, Jul 31, 2019 at 10:55 AM Evgeny Efimkin  
> wrote:
> > What reason to use pg_atomic_uint64?
> 
> The queryid is read and written without holding any lock on the PGPROC
> entry, so the pg_atomic_uint64 will guarantee that we get a consistent
> value in pg_stat_get_activity().  Other reads shouldn't be a problem
> as far as I remember.

Hm, I don't think that's necessary in this case. That's what the
st_changecount protocol is trying to ensure, no?

	/*
	 * To avoid locking overhead, we use the following protocol: a backend
	 * increments st_changecount before modifying its entry, and again after
	 * finishing a modification.  A would-be reader should note the value of
	 * st_changecount, copy the entry into private memory, then check
	 * st_changecount again.  If the value hasn't changed, and if it's even,
	 * the copy is valid; otherwise start over.  This makes updates cheap
	 * while reads are potentially expensive, but that's the tradeoff we want.
	 *
	 * The above protocol needs memory barriers to ensure that the apparent
	 * order of execution is as it desires.  Otherwise, for example, the CPU
	 * might rearrange the code so that st_changecount is incremented twice
	 * before the modification on a machine with weak memory ordering.  Hence,
	 * use the macros defined below for manipulating st_changecount, rather
	 * than touching it directly.
	 */
	int			st_changecount;
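
For reference, the reader side of that protocol looks roughly like the
sketch below (simplified from pgstat_read_current_status(); the function
name is made up, the real code uses the dedicated read-activity macros,
and it also copies the out-of-line strings separately):

#include "postgres.h"

#include "miscadmin.h"
#include "pgstat.h"
#include "port/atomics.h"

/* Sketch only: read_one_beentry() is not an actual pgstat function. */
static void
read_one_beentry(volatile PgBackendStatus *beentry, PgBackendStatus *localentry)
{
	for (;;)
	{
		int			before_changecount;
		int			after_changecount;

		before_changecount = beentry->st_changecount;
		pg_read_barrier();

		memcpy(localentry, (PgBackendStatus *) beentry, sizeof(PgBackendStatus));

		pg_read_barrier();
		after_changecount = beentry->st_changecount;

		/* the copy is valid only if the counter is unchanged and even */
		if (before_changecount == after_changecount &&
			(before_changecount & 1) == 0)
			break;

		CHECK_FOR_INTERRUPTS();
	}
}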


And if it were necessary, why wouldn't any of the other fields in
PgBackendStatus need it? There's plenty of other fields written to
without a lock, and several of those are also 8 bytes (so it's not a
case of assuming that 8 byte reads might not be atomic, but for byte
reads are).

Greetings,

Andres Freund




Re: Unused header file inclusion

2019-07-31 Thread Andres Freund
Hi,

On 2019-07-31 16:55:31 -0400, Tom Lane wrote:
> Alvaro Herrera  writes:
> > On 2019-Jul-31, Andres Freund wrote:
> >> * I think a lot of the interlinking stems from the bad idea to use
> >> typedef's everywhere. In contrast to structs they cannot be forward
> >> declared portably in our version of C. We should use a lot more struct
> >> forward declarations, and just not use the typedef.
> 
> > I don't know about that ... I think the problem is that we both declare
> > the typedef *and* define the struct in the same place.  If we were to
> > split those things to separate files, the required rebuilds would be
> > much less, I think, because changing a struct would no longer require
> > recompiles of files that merely pass those structs around (that's very
> > common for Node-derived structs).  Forward-declaring structs in
> > unrelated header files just because they need them, feels a bit like
> > cheating to me.
> 
> Yeah.  I seem to recall a proposal that nodes.h should contain
> 
>   typedef struct Foo Foo;
> 
> for every node type Foo, and then the other headers would just
> fill in the structs, and we could get rid of a lot of ad-hoc
> forward struct declarations and other hackery.

That to me just seems objectively worse. Now adding a new struct as a
minor implementation detail of some subsystem doesn't just require
recompiling the relevant files, but just about all of pg. And just about
every header would acquire a nodes.h include - there's still a lot of them
that today don't.

I don't understand why you guys consider forward declaring structs
ugly. It's what just about every other C project does. The only reason
it's sometimes problematic in postgres is that we "hide the pointer"
within some typedefs, making it not as obvious which type we're
referring to (because the struct usage will be struct FooData*, instead
of just Foo). But we also have confusion due to that in a lot of other
places, so I don't really buy that this is a significant issue.


Right now we really have weird dependencies between largely independent
subsystem. Some are partially because functions aren't always in the
right file, but it's also not often clear what the right one would
be. E.g. snapmgr.h depending on relcache.h (for
TransactionIdLimitedForOldSnapshots() having a Relation parameter), on
resowner.h (for RegisterSnapshotOnOwner()) is imo not good.  For one
they lead to a number of .c files that actually use functionality from
resowner.h to not have the necessary includes.  There's a lot of things
like that.

.oO(the fmgr.h include in snapmgr.h has been unnecessary since 352a24a1f9)


We could of course be super rigorous and have an accompanying
foo_structs.h or something for every foo.h. But that seems to add no
actual advantages, and makes things more complicated.


The only reason the explicit forward declaration is needed in the common
case of a 'struct foo*' parameter is that C has weird visibility rules
about the scope of forward declarations in parameters. If you instead
first have e.g. a function *return* type of that struct type, the
explicit forward declaration isn't even needed - it's visible
afterwards. But for parameters it's basically a *different* struct, that
cannot be referenced again. Note that in C++ the visibility rules are
more consistent, and you don't need an explicit forward declaration in
either case (I'd also be OK with requiring it in both cases, it's just
weird to only need them in one).
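
To illustrate that quirk in plain C (function names here are made up):

/*
 * A struct tag first seen in a parameter list has prototype scope only;
 * gcc warns that it "will not be visible outside of this definition or
 * declaration", and a struct foo defined later is a distinct type as
 * far as this prototype is concerned.
 */
void		consume_foo(struct foo *f);

/* An explicit file-scope forward declaration avoids that: */
struct foo;
void		consume_foo_ok(struct foo *f);

/*
 * Whereas a tag that first appears in a *return* type is declared at
 * file scope, so later references see the same type without any
 * explicit forward declaration:
 */
struct bar *make_bar(void);
void		consume_bar(struct bar *b);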

Greetings,

Andres Freund




Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-07-31 Thread Julien Rouhaud
Hello,

On Wed, Jul 31, 2019 at 10:55 AM Evgeny Efimkin  wrote:
>
> What reason to use pg_atomic_uint64?

The queryid is read and written without holding any lock on the PGPROC
entry, so the pg_atomic_uint64 will guarantee that we get a consistent
value in pg_stat_get_activity().  Other reads shouldn't be a problem
as far as I remember.

> In docs:
> occured - > occurred

Thanks!  I fixed it on my local branch.




Re: pgbench - implement strict TPC-B benchmark

2019-07-31 Thread Peter Geoghegan
On Wed, Jul 31, 2019 at 2:11 PM Tom Lane  wrote:
> > I agree with this. When I was at EnterpriseDB, while it wasn't audited, we
> > had to develop an actual TPC-B implementation because pgbench was too
> > different. pgbench itself isn't that useful as a benchmark tool, imo, but
> > if we have the ability to make it better (i.e. closer to an actual
> > benchmark kit), I think we should.
>
> [ shrug... ]  TBH, the proposed patch does not look to me like actual
> benchmark kit; it looks like a toy.  Nobody who was intent on making their
> benchmark numbers look good would do a significant amount of work in a
> slow, ad-hoc interpreted language.

According to TPC themselves, "In contrast to TPC-A, TPC-B is not an
OLTP benchmark. Rather, TPC-B can be looked at as a database stress
test..." [1]. Sounds like classic pgbench to me.

Not sure where that leaves this patch. What problem is it actually
trying to solve?

[1] http://www.tpc.org/tpcb/
-- 
Peter Geoghegan




Re: pgbench - implement strict TPC-B benchmark

2019-07-31 Thread Tom Lane
"Jonah H. Harris"  writes:
> On Wed, Jul 31, 2019 at 10:10 AM Fabien COELHO  wrote:
>> I agree that nobody really cares about TPC-B per se. The point of this
>> patch is to provide a built-in example of recent and useful pgbench
>> features that match a real specification.

> I agree with this. When I was at EnterpriseDB, while it wasn't audited, we
> had to develop an actual TPC-B implementation because pgbench was too
> different. pgbench itself isn't that useful as a benchmark tool, imo, but
> if we have the ability to make it better (i.e. closer to an actual
> benchmark kit), I think we should.

[ shrug... ]  TBH, the proposed patch does not look to me like actual
benchmark kit; it looks like a toy.  Nobody who was intent on making their
benchmark numbers look good would do a significant amount of work in a
slow, ad-hoc interpreted language.  I also wonder to what extent the
numbers would reflect pgbench itself being the bottleneck.  Which is
really the fundamental problem I've got with all the stuff that's been
crammed into pgbench of late --- the more computation you're doing there,
the less your results measure the server's capabilities rather than
pgbench's implementation details.

In any case, even if I were in love with the script itself, we cannot
commit something that claims to be "standard TPC-B".  It needs weasel
wording that makes it clear that it isn't TPC-B, and then you have a
problem of user confusion about why we have both not-quite-TPC-B-1
and not-quite-TPC-B-2, and which one to use, or which one was used in
somebody else's tests.

I think if you want to show off what these pgbench features are good
for, it'd be better to find some other example that's not in the
midst of a legal minefield.

regards, tom lane




Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-07-31 Thread Andres Freund
Hi,

On 2019-07-31 12:37:58 -0700, Ashwin Agrawal wrote:
> On Wed, Jul 31, 2019 at 10:55 AM Andres Freund  wrote:
>
> > Hi,
> >
> > On 2019-07-31 10:42:50 -0700, Ashwin Agrawal wrote:
> > > On Tue, Jul 30, 2019 at 2:58 PM Thomas Munro 
> > wrote:
> > >
> > > > On Tue, Jun 25, 2019 at 6:02 AM Andres Freund 
> > wrote:
> > > > > > - CheckForSerializableConflictOut() no more takes HeapTuple nor
> > > > > >   buffer, instead just takes xid. Push heap specific parts from
> > > > > >   CheckForSerializableConflictOut() into its own function
> > > > > >   HeapCheckForSerializableConflictOut() which calls
> > > > > >   CheckForSerializableConflictOut(). The alternative option could
> > be
> > > > > >   CheckForSerializableConflictOut() take callback function and
> > > > > >   callback arguments, which gets called if required after
> > performing
> > > > > >   prechecks. Though currently I fell AM having its own wrapper to
> > > > > >   perform AM specific task and then calling
> > > > > >   CheckForSerializableConflictOut() is fine.
> > > > >
> > > > > I think it's right to move the xid handling out of
> > > > > CheckForSerializableConflictOut(). But I think we also ought to move
> > the
> > > > > subtransaction handling out of the function - e.g. zheap doesn't
> > > > > want/need that.
> > > >
> > > > Thoughts on this Ashwin?
> > > >
> > >
> > > I think the only part its doing for sub-transaction is
> > > SubTransGetTopmostTransaction(xid). If xid passed to this function is
> > > already top most transaction which is case for zheap and zedstore, then
> > > there is no downside to keeping that code here in common place.
> >
> > Well, it's far from a cheap function. It'll do unnecessary on-disk
> > lookups in many cases. I'd call that quite a downside.
> >
>
> Okay, agree, its costly function and better to avoid the call if possible.
>
> Instead of moving the handling out of the function, how do feel about
> adding boolean isTopTransactionId argument to function
> CheckForSerializableConflictOut(). The AMs, which implicitly know, only
> pass top transaction Id to this function, can pass true and avoid the
> function call to SubTransGetTopmostTransaction(xid). With this
> subtransaction code remains in generic place and AMs intending to use it
> continue to leverage the common code, plus explicitly clarifies the
> behavior as well.

Looking at the code as of master, we currently have:

- PredicateLockTuple() calls SubTransGetTopmostTransaction() to figure
  out whether the tuple has been locked by the current
  transaction. That check afaict just should be
  TransactionIdIsCurrentTransactionId(), without all the other
  stuff that's done today.

  TransactionIdIsCurrentTransactionId() imo ought to be optimized to
  always check for the top level transactionid first - that's a good bet
  today, but even moreso for the upcoming AMs that won't have separate
  xids for subtransactions.  Alternatively we shouldn't make that a
  binary search for each subtrans level, but just have a small
  simplehash hashtable for xids.

- CheckForSerializableConflictOut() wants to get the toplevel xid for
  the tuple, because that's the one the predicate hashtable stores.

  In your patch you've already moved the HTSV() call etc out of
  CheckForSerializableConflictOut(). I'm somewhat inclined to think that
  the SubTransGetTopmostTransaction() call ought to go along with that.
  I don't really think that belongs in predicate.c, especially if
  most/all new AMs don't use subtransaction ids.

  The only downside is that currently the
  TransactionIdEquals(xid, GetTopTransactionIdIfAny()) check
  avoids the SubTransGetTopmostTransaction() check.

  But again, the better fix for that seems to be to improve the generic
  code. As written the check won't prevent a subtrans lookup for heap
  when subtransactions are in use, and it's IME pretty common for tuples
  to get looked at again in the transaction that has created them. So
  I'm somewhat inclined to think that SubTransGetTopmostTransaction()
  should have a fast-path for the current transaction - probably just
  employing TransactionIdIsCurrentTransactionId().
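
Roughly, such a fast path could look like this (a sketch only; GetTopmostXidCheap() is a made-up name, while TransactionIdIsCurrentTransactionId(), GetTopTransactionIdIfAny() and SubTransGetTopmostTransaction() are the existing functions):

#include "postgres.h"
#include "access/subtrans.h"
#include "access/xact.h"

/*
 * Sketch only: resolve a tuple xid to its toplevel xid, trying the cheap
 * in-memory checks before falling back to the pg_subtrans lookup.
 */
static TransactionId
GetTopmostXidCheap(TransactionId xid)
{
	/* xid belongs to the current (sub)transaction: top xid is in memory */
	if (TransactionIdIsCurrentTransactionId(xid))
		return GetTopTransactionIdIfAny();

	/* otherwise fall back to the (possibly on-disk) subtrans lookup */
	return SubTransGetTopmostTransaction(xid);
}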

I don't really see what we gain by having the subtrans handling in the
predicate code. Especially given that we've already moved the HTSV()
handling out, it seems architecturally the wrong place to me - but I
admit that that's a fuzzy argument.  The relevant mapping should be one
line in the caller.

I wonder if it'd be worth combining the
TransactionIdIsCurrentTransactionId() calls in the heap cases that
currently do both, PredicateLockTuple() and
HeapCheckForSerializableConflictOut(). The heap_fetch() case probably
isn't commonly that hot a path, but heap_hot_search_buffer() is.


Minor notes:
- I don't think 'insert_xid' is necessarily great - it could also be the
  updating xid etc. And while you can argue that an update is an insert
  in the current heap, that's not the case for future AMs.
- to me
@@ -1621,7 +1622,8 

Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-07-31 Thread Ashwin Agrawal
On Wed, Jul 31, 2019 at 12:37 PM Ashwin Agrawal  wrote:

>
> On Wed, Jul 31, 2019 at 10:55 AM Andres Freund  wrote:
>
>> Hi,
>>
>> On 2019-07-31 10:42:50 -0700, Ashwin Agrawal wrote:
>> > On Tue, Jul 30, 2019 at 2:58 PM Thomas Munro 
>> wrote:
>> >
>> > > On Tue, Jun 25, 2019 at 6:02 AM Andres Freund 
>> wrote:
>> > > > > - CheckForSerializableConflictOut() no more takes HeapTuple nor
>> > > > >   buffer, instead just takes xid. Push heap specific parts from
>> > > > >   CheckForSerializableConflictOut() into its own function
>> > > > >   HeapCheckForSerializableConflictOut() which calls
>> > > > >   CheckForSerializableConflictOut(). The alternative option could
>> be
>> > > > >   CheckForSerializableConflictOut() take callback function and
>> > > > >   callback arguments, which gets called if required after
>> performing
>> > > > >   prechecks. Though currently I fell AM having its own wrapper to
>> > > > >   perform AM specific task and then calling
>> > > > >   CheckForSerializableConflictOut() is fine.
>> > > >
>> > > > I think it's right to move the xid handling out of
>> > > > CheckForSerializableConflictOut(). But I think we also ought to
>> move the
>> > > > subtransaction handling out of the function - e.g. zheap doesn't
>> > > > want/need that.
>> > >
>> > > Thoughts on this Ashwin?
>> > >
>> >
>> > I think the only part its doing for sub-transaction is
>> > SubTransGetTopmostTransaction(xid). If xid passed to this function is
>> > already top most transaction which is case for zheap and zedstore, then
>> > there is no downside to keeping that code here in common place.
>>
>> Well, it's far from a cheap function. It'll do unnecessary on-disk
>> lookups in many cases. I'd call that quite a downside.
>>
>
> Okay, agree, its costly function and better to avoid the call if possible.
>
> Instead of moving the handling out of the function, how do feel about
> adding boolean isTopTransactionId argument to function
> CheckForSerializableConflictOut(). The AMs, which implicitly know, only
> pass top transaction Id to this function, can pass true and avoid the
> function call to SubTransGetTopmostTransaction(xid). With this
> subtransaction code remains in generic place and AMs intending to use it
> continue to leverage the common code, plus explicitly clarifies the
> behavior as well.
>

Added an argument to the function to make the subtransaction handling optional
in the attached v2 of the patch.
From 4ca67592f34b63cf80247068a128407d800c62a6 Mon Sep 17 00:00:00 2001
From: Ashwin Agrawal 
Date: Wed, 31 Jul 2019 13:53:52 -0700
Subject: [PATCH v2] Remove HeapTuple dependency for predicate locking
 functions.

Following changes help to make predicate locking functions generic and
not tied to particular AM.

- PredicateLockTuple() is renamed to PredicateLockTID(). It takes
  ItemPointer and inserting transaction id now instead of HeapTuple.

- CheckForSerializableConflictIn() takes blocknum instead of buffer

- CheckForSerializableConflictOut() no more takes HeapTuple nor buffer
---
 src/backend/access/gin/ginbtree.c|   2 +-
 src/backend/access/gin/ginfast.c |   2 +-
 src/backend/access/gin/gininsert.c   |   4 +-
 src/backend/access/gist/gist.c   |   2 +-
 src/backend/access/hash/hashinsert.c |   2 +-
 src/backend/access/heap/heapam.c | 104 --
 src/backend/access/heap/heapam_handler.c |   7 +-
 src/backend/access/index/indexam.c   |   4 +-
 src/backend/access/nbtree/nbtinsert.c|   4 +-
 src/backend/storage/lmgr/predicate.c | 134 ---
 src/include/access/heapam.h  |   2 +
 src/include/storage/predicate.h  |  10 +-
 12 files changed, 162 insertions(+), 115 deletions(-)

diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 11a8ed7bbc2..e795375495b 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -89,7 +89,7 @@ ginFindLeafPage(GinBtree btree, bool searchMode,
 	stack->predictNumber = 1;
 
 	if (rootConflictCheck)
-		CheckForSerializableConflictIn(btree->index, NULL, stack->buffer);
+		CheckForSerializableConflictIn(btree->index, NULL, btree->rootBlkno);
 
 	for (;;)
 	{
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index 439a91b3e61..d7b52476817 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -246,7 +246,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
 	 * tree, so it conflicts with all serializable scans.  All scans acquire a
 	 * predicate lock on the metabuffer to represent that.
 	 */
-	CheckForSerializableConflictIn(index, NULL, metabuffer);
+	CheckForSerializableConflictIn(index, NULL, GIN_METAPAGE_BLKNO);
 
 	if (collector->sumsize + collector->ntuples * sizeof(ItemIdData) > GinListPageSize)
 	{
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 55eab146173..046a20a3d41 100644
--- 

Re: Unused header file inclusion

2019-07-31 Thread Tom Lane
Alvaro Herrera  writes:
> On 2019-Jul-31, Andres Freund wrote:
>> * I think a lot of the interlinking stems from the bad idea to use
>> typedef's everywhere. In contrast to structs they cannot be forward
>> declared portably in our version of C. We should use a lot more struct
>> forward declarations, and just not use the typedef.

> I don't know about that ... I think the problem is that we both declare
> the typedef *and* define the struct in the same place.  If we were to
> split those things to separate files, the required rebuilds would be
> much less, I think, because changing a struct would no longer require
> recompiles of files that merely pass those structs around (that's very
> common for Node-derived structs).  Forward-declaring structs in
> unrelated header files just because they need them, feels a bit like
> cheating to me.

Yeah.  I seem to recall a proposal that nodes.h should contain

typedef struct Foo Foo;

for every node type Foo, and then the other headers would just
fill in the structs, and we could get rid of a lot of ad-hoc
forward struct declarations and other hackery.
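
For illustration, the split could look like this ("Foo" and foonodes.h are made-up names, not an actual node type or header):

/* nodes.h: typedef only, no struct body, so any header can use Foo *. */
typedef struct Foo Foo;

/* foonodes.h (the header that owns the node): the actual definition. */
struct Foo
{
	NodeTag		type;
	int			foo_field;
};

/* Headers that merely pass pointers around no longer need foonodes.h. */
extern void process_foo(Foo *foo);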

regards, tom lane




Re: Replication & recovery_min_apply_delay

2019-07-31 Thread Alvaro Herrera
On 2019-Jul-31, Konstantin Knizhnik wrote:

> On 08.07.2019 11:05, Michael Paquier wrote:

> > Please note that I have not looked at that stuff in details, but I
> > find the patch proposed kind of ugly with the scan of the last segment
> > using a WAL reader to find out what is the last LSN and react on
> > that..  This does not feel right.
> 
> I am sorry for delay with answer.
> Looks like I have not noticed your reply:(
> Can you explain me please why it is not correct to iterate through WAL using
> WAL reader to get last LSN?

I agree that it's conceptually ugly, but as I mentioned in my previous
reply, I tried several other strategies before giving up and ended up
concluding that this way was a good way to solve the problem.

I don't endorse the exact patch submitted, though.  I think it should
have a lot more commentary on what the code is doing and why.

As for the test module, the one I submitted takes a lot of time to run
(well, 60s) and I don't think it's a good idea to include it as
something to be run all the time by every buildfarm member.  I'm not
sure that there's a leaner way to test for this bug, though, but
certainly it'd be a good idea to ensure that this continues to work.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Unused header file inclusion

2019-07-31 Thread Alvaro Herrera
On 2019-Jul-31, Andres Freund wrote:

> IDK, I find the compilation times annoying. And it's gotten quite
> noticably worse with all the speculative execution mitigations. Although
> to some degree that's not really the fault of individual compilations,
> but our buildsystem being pretty slow.

We're in a much better position now than a decade ago, in terms of clock
time.  Back then I would resort to many tricks to avoid spurious
compiles, even manually touching some files to dates in the past to
avoid them.  Nowadays I never bother with such things.  But yes,
reducing the build time even more would be welcome for sure.

> * I think a lot of the interlinking stems from the bad idea to use
> typedef's everywhere. In contrast to structs they cannot be forward
> declared portably in our version of C. We should use a lot more struct
> forward declarations, and just not use the typedef.

I don't know about that ... I think the problem is that we both declare
the typedef *and* define the struct in the same place.  If we were to
split those things to separate files, the required rebuilds would be
much less, I think, because changing a struct would no longer require
recompiles of files that merely pass those structs around (that's very
common for Node-derived structs).  Forward-declaring structs in
unrelated header files just because they need them, feels a bit like
cheating to me.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: block-level incremental backup

2019-07-31 Thread Robert Haas
On Wed, Jul 31, 2019 at 1:59 PM vignesh C  wrote:
> I feel Robert's suggestion is good.
> We can probably keep one meta file for each backup with some basic information
> of all the files being backed up, this metadata file will be useful in the
> below case:
> Table dropped before incremental backup
> Table truncated and Insert/Update/Delete operations before incremental backup

There's really no need for this with the design I proposed.  The files
that should exist when you restore an incremental backup are exactly
the set of files that exist in the final incremental backup, except
that any .partial files need to be replaced with a correct
reconstruction of the underlying file.  You don't need to know what
got dropped or truncated; you only need to know what's supposed to be
there at the end.

You may be thinking, as I once did, that restoring an incremental
backup would consist of restoring the full backup first and then
layering the incrementals over it, but if you read what I proposed, it
actually works the other way around: you restore the files that are
present in the incremental, and as needed, pull pieces of them from
earlier incremental and/or full backups.  I think this is a *much*
better design than doing it the other way; it avoids any risk of
getting the wrong answer due to truncations or drops, and it also is
faster, because you only read older backups to the extent that you
actually need their contents.
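
To make that concrete, reconstructing one file could go roughly like this (a sketch only: Backup, backup_has_partial_file(), backup_has_full_file(), copy_changed_blocks() and copy_remaining_blocks() are hypothetical names, not code from the proposal):

/*
 * Sketch: walk from the newest backup toward older ones, taking each block
 * from the newest backup that has it, and stop at the first full copy.
 */
static void
reconstruct_file(const char *relpath, List *backups_newest_first)
{
	Bitmapset  *written = NULL;		/* block numbers already restored */
	ListCell   *lc;

	foreach(lc, backups_newest_first)
	{
		Backup	   *backup = (Backup *) lfirst(lc);

		if (backup_has_partial_file(backup, relpath))
		{
			/* incremental: copy its blocks unless a newer backup had them */
			written = copy_changed_blocks(backup, relpath, written);
		}
		else if (backup_has_full_file(backup, relpath))
		{
			/* full copy: everything still missing comes from here */
			copy_remaining_blocks(backup, relpath, written);
			return;
		}
		/* else: this backup does not mention the file; keep looking */
	}

	elog(ERROR, "no full copy of \"%s\" found in the backup chain", relpath);
}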

I think it's a good idea to try to keep all the information about a
single file being backed up in one place. It's just less confusing.  If,
for example, you have a metadata file that tells you which files are
dropped - that is, which files you DON'T have - then what happens if
one of those files is present in the data directory after all?  Well,
then you have inconsistent information and are confused, and maybe
your code won't even notice the inconsistency.  Similarly, if the
metadata file is separate from the block data, then what happens if
one file is missing, or isn't from the same backup as the other file?
That shouldn't happen, of course, but if it does, you'll get confused.
There's no perfect solution to these kinds of problems: if we suppose
that the backup can be corrupted by having missing or extra files, why
not also corruption within a single file? Still, on balance I tend to
think that keeping related stuff together minimizes the surface area
for bugs.  I realize that's arguable, though.

One consideration that goes the other way: if you have a manifest file
that says what files are supposed to be present in the backup, then
you can detect a disappearing file, which is impossible with the
design I've proposed (and with the current full backup machinery).
That might be worth fixing, but it's a separate feature that has
little to do with incremental backup.

> Probably it can also help us to decide which work the worker needs to do
> if we are planning to backup in parallel.

I don't think we need a manifest file for parallel backup.  One
process or thread can scan the directory tree, make a list of which
files are present, and then hand individual files off to other
processes or threads. In short, the directory listing serves as the
manifest.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Cached plans and statement generalization

2019-07-31 Thread Konstantin Knizhnik




On 31.07.2019 19:56, Heikki Linnakangas wrote:

On 09/07/2019 23:59, Konstantin Knizhnik wrote:

Fixed patch version of the path is attached.


Much of the patch and the discussion has been around the raw parsing, 
and guessing which literals are actually parameters that have been 
inlined into the SQL text. Do we really need to do that, though? The 
documentation mentions pgbouncer and other connection poolers, where 
you don't have session state, as a use case for this. But such 
connection poolers could and should still be using the extended query 
protocol, with Parse+Bind messages and query parameters, even if they 
don't use named prepared statements. I'd want to encourage 
applications and middleware to use out-of-band query parameters, to 
avoid SQL injection attacks, regardless of whether they use prepared 
statements or cache query plans. So how about dropping all the raw 
parse tree stuff, and doing the automatic caching of plans based on 
the SQL string, somewhere in the exec_parse_message? Check the 
autoprepare cache in exec_parse_message(), if it was an "unnamed" 
prepared statement, i.e. if the prepared statement name is an empty 
string.


(I'm actually not sure if exec_parse/bind_message is the right place 
for this, but I saw that your current patch did it in 
exec_simple_query, and exec_parse/bind_message are the equivalent of 
that for the extended query protocol).


It will significantly simplify this patch and eliminate all the complexity
and trouble caused by replacing string literals with parameters if the
assumption that all client applications are using the extended query
protocol is true.
But I am not sure that we can expect that. At least I myself see many
applications which construct queries by embedding literal values.
Maybe it is not so good and safe (SQL injection attacks), but many
applications are doing it. And the idea was to improve the execution speed
of existing applications without changing them.


Also please notice that the extended protocol requires passing more messages,
which has a negative effect on performance.
On my notebook I get about 21k TPS with "pgbench -S" and 18k TPS with
"pgbench -M extended -S".
And that is with a unix socket connection! I think that in case of remote
connections the difference will be even larger.


So maybe committing a simple version of this patch which does not need to
solve any of the challenging problems is a good idea.
But I am afraid that it will significantly reduce the positive effect of
this patch.





Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-07-31 Thread Ashwin Agrawal
On Wed, Jul 31, 2019 at 10:55 AM Andres Freund  wrote:

> Hi,
>
> On 2019-07-31 10:42:50 -0700, Ashwin Agrawal wrote:
> > On Tue, Jul 30, 2019 at 2:58 PM Thomas Munro 
> wrote:
> >
> > > On Tue, Jun 25, 2019 at 6:02 AM Andres Freund 
> wrote:
> > > > > - CheckForSerializableConflictOut() no more takes HeapTuple nor
> > > > >   buffer, instead just takes xid. Push heap specific parts from
> > > > >   CheckForSerializableConflictOut() into its own function
> > > > >   HeapCheckForSerializableConflictOut() which calls
> > > > >   CheckForSerializableConflictOut(). The alternative option could
> be
> > > > >   CheckForSerializableConflictOut() take callback function and
> > > > >   callback arguments, which gets called if required after
> performing
> > > > >   prechecks. Though currently I fell AM having its own wrapper to
> > > > >   perform AM specific task and then calling
> > > > >   CheckForSerializableConflictOut() is fine.
> > > >
> > > > I think it's right to move the xid handling out of
> > > > CheckForSerializableConflictOut(). But I think we also ought to move
> the
> > > > subtransaction handling out of the function - e.g. zheap doesn't
> > > > want/need that.
> > >
> > > Thoughts on this Ashwin?
> > >
> >
> > I think the only part its doing for sub-transaction is
> > SubTransGetTopmostTransaction(xid). If xid passed to this function is
> > already top most transaction which is case for zheap and zedstore, then
> > there is no downside to keeping that code here in common place.
>
> Well, it's far from a cheap function. It'll do unnecessary on-disk
> lookups in many cases. I'd call that quite a downside.
>

Okay, agreed, it's a costly function and better to avoid the call if possible.

Instead of moving the handling out of the function, how do you feel about
adding a boolean isTopTransactionId argument to
CheckForSerializableConflictOut()? The AMs which implicitly know that they
only pass a top transaction id to this function can pass true and avoid the
call to SubTransGetTopmostTransaction(xid). With this, the subtransaction
code remains in a generic place and AMs intending to use it continue to
leverage the common code, plus it explicitly clarifies the behavior as well.
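
Roughly, I am thinking of something along these lines (a simplified sketch, not the code in the attached patch; the real function takes more arguments):

void
CheckForSerializableConflictOut(Relation relation, TransactionId xid,
								bool isTopTransactionId, Snapshot snapshot)
{
	if (!SerializationNeededForRead(relation, snapshot))
		return;

	/* callers that already hold a toplevel xid can skip the subtrans lookup */
	if (!isTopTransactionId)
		xid = SubTransGetTopmostTransaction(xid);

	/* ... the existing conflict detection against the toplevel xid ... */
}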


Re: block-level incremental backup

2019-07-31 Thread vignesh C
On Tue, Jul 30, 2019 at 1:58 AM Robert Haas  wrote:
>
> On Wed, Jul 10, 2019 at 2:17 PM Anastasia Lubennikova
>  wrote:
> > In attachments, you can find a prototype of incremental pg_basebackup,
> > which consists of 2 features:
> >
> > 1) To perform incremental backup one should call pg_basebackup with a
> > new argument:
> >
> > pg_basebackup -D 'basedir' --prev-backup-start-lsn 'lsn'
> >
> > where lsn is a start_lsn of parent backup (can be found in
> > "backup_label" file)
> >
> > It calls BASE_BACKUP replication command with a new argument
> > PREV_BACKUP_START_LSN 'lsn'.
> >
> > For datafiles, only pages with LSN > prev_backup_start_lsn will be
> > included in the backup.
>>
One thought: if the file is not modified, there is no need to check the LSN.
>>
> > They are saved into 'filename.partial' file, 'filename.blockmap' file
> > contains an array of BlockNumbers.
> > For example, if we backuped blocks 1,3,5, filename.partial will contain
> > 3 blocks, and 'filename.blockmap' will contain array {1,3,5}.
>
> I think it's better to keep both the information about changed blocks
> and the contents of the changed blocks in a single file.  The list of
> changed blocks is probably quite short, and I don't really want to
> double the number of files in the backup if there's no real need. I
> suspect it's just overall a bit simpler to keep everything together.
> I don't think this is a make-or-break thing, and welcome contrary
> arguments, but that's my preference.
>
I feel Robert's suggestion is good.
We can probably keep one metadata file for each backup with some basic
information about all the files being backed up; this metadata file will be
useful in the cases below:
Table dropped before the incremental backup
Table truncated and Insert/Update/Delete operations before the incremental backup

I feel that if we have the metadata, we can add some optimization to detect
the above scenarios from the metadata information, identifying file deletions
and avoiding the write and delete in pg_combinebackup that Jeevan mentioned in
his previous mail.

It can probably also help us decide which work each worker needs to do
if we are planning to back up in parallel.

Regards,
vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-07-31 Thread Andres Freund
Hi,

On 2019-07-31 10:42:50 -0700, Ashwin Agrawal wrote:
> On Tue, Jul 30, 2019 at 2:58 PM Thomas Munro  wrote:
> 
> > On Tue, Jun 25, 2019 at 6:02 AM Andres Freund  wrote:
> > > > - CheckForSerializableConflictOut() no more takes HeapTuple nor
> > > >   buffer, instead just takes xid. Push heap specific parts from
> > > >   CheckForSerializableConflictOut() into its own function
> > > >   HeapCheckForSerializableConflictOut() which calls
> > > >   CheckForSerializableConflictOut(). The alternative option could be
> > > >   CheckForSerializableConflictOut() take callback function and
> > > >   callback arguments, which gets called if required after performing
> > > >   prechecks. Though currently I fell AM having its own wrapper to
> > > >   perform AM specific task and then calling
> > > >   CheckForSerializableConflictOut() is fine.
> > >
> > > I think it's right to move the xid handling out of
> > > CheckForSerializableConflictOut(). But I think we also ought to move the
> > > subtransaction handling out of the function - e.g. zheap doesn't
> > > want/need that.
> >
> > Thoughts on this Ashwin?
> >
> 
> I think the only part its doing for sub-transaction is
> SubTransGetTopmostTransaction(xid). If xid passed to this function is
> already top most transaction which is case for zheap and zedstore, then
> there is no downside to keeping that code here in common place.

Well, it's far from a cheap function. It'll do unnecessary on-disk
lookups in many cases. I'd call that quite a downside.

Greetings,

Andres Freund




Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-07-31 Thread Andres Freund
Hi,

On 2019-07-31 09:57:58 +1200, Thomas Munro wrote:
> On Tue, Jun 25, 2019 at 6:02 AM Andres Freund  wrote:
> > Hm.  I wonder if we somehow ought to generalize the granularity scheme
> > for predicate locks to not be tuple/page/relation.  But even if, that's
> > probably a separate patch.
> 
> What do you have in mind?

My concern is that continuing to infer the granularity levels from
the tid doesn't seem like a great path forward. An AM's use of tids might
not necessarily be very amenable to that, if the mapping isn't actually
block based.


> Perhaps you just want to give those things different labels, "TID
> range" instead of page, for the benefit of "logical" TID users?
> Perhaps you want to permit more levels?  That seems premature as long
> as TIDs are defined in terms of blocks and offsets, so this stuff is
> reflected all over the source tree...

I'm mostly wondering if the different levels shouldn't be computed
outside of predicate.c.

Greetings,

Andres Freund




Re: Patch for SortSupport implementation on inet/cdir

2019-07-31 Thread Peter Geoghegan
On Fri, Jul 26, 2019 at 7:25 PM Peter Geoghegan  wrote:
> I guess that the idea here was to prevent masking on ipv6 addresses,
> though not on ipv4 addresses. Obviously we're only dealing with a
> prefix with ipv6 addresses, whereas we usually have the whole raw
> ipaddr with ipv4. Not sure if I'm doing the right thing there in v3,
> even though the tests pass. In any case, this will need to be a lot
> clearer in the final version.

This turned out to be borked for certain IPv6 cases, as suspected.
Attached is a revised v6, which fixes the issue by adding the explicit
handling needed when ipaddr_datum is just a prefix of the full ipaddr
from the authoritative representation. Also made sure that the tests
will catch issues like this. Separately, it occurred to me that it's
probably not okay to do straight type punning of the ipaddr unsigned
char array to a Datum on alignment-picky platforms. Using a memcpy()
seems like the right approach, which is what we do in the latest
revision.
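
The pattern in question is essentially this (a generic sketch, not the exact code from v6; abbrev_from_bytes() is a made-up name):

#include "postgres.h"
#include "port/pg_bswap.h"

/*
 * Sketch: build a Datum-sized abbreviated key from the leading bytes of a
 * network address.  Going through memcpy() avoids reading the unsigned
 * char buffer as a wider integer, which could be misaligned on
 * alignment-picky platforms.
 */
static Datum
abbrev_from_bytes(const unsigned char *addr_bytes, int nbytes)
{
	Datum		res = 0;
	int			copylen = Min(nbytes, (int) sizeof(Datum));

	memcpy(&res, addr_bytes, copylen);

	/* flip from big-endian to native so unsigned comparisons sort bytewise */
	res = DatumBigEndianToNative(res);

	return res;
}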

I accepted almost all of Brandur's comment revisions from v5 for v6.

I'm probably going to commit this tomorrow morning Pacific time. Do
you have any final input on the testing, Brandur? I would like to hear
your thoughts on the possibility of edge cases that still don't have
coverage. The tests will break if the new "if (ip_bits(authoritative)
== 0)" branch is removed, but only at one exact point. I'm pretty sure
that there are no remaining subtleties like that one, but two heads
are better than one.

-- 
Peter Geoghegan


v6-0001-Add-sort-support-for-inet-cidr-opfamily.patch
Description: Binary data


Re: Remove HeapTuple and Buffer dependency for predicate locking functions

2019-07-31 Thread Ashwin Agrawal
On Tue, Jul 30, 2019 at 2:58 PM Thomas Munro  wrote:

> On Tue, Jun 25, 2019 at 6:02 AM Andres Freund  wrote:
> > > - CheckForSerializableConflictOut() no more takes HeapTuple nor
> > >   buffer, instead just takes xid. Push heap specific parts from
> > >   CheckForSerializableConflictOut() into its own function
> > >   HeapCheckForSerializableConflictOut() which calls
> > >   CheckForSerializableConflictOut(). The alternative option could be
> > >   CheckForSerializableConflictOut() take callback function and
> > >   callback arguments, which gets called if required after performing
> > >   prechecks. Though currently I fell AM having its own wrapper to
> > >   perform AM specific task and then calling
> > >   CheckForSerializableConflictOut() is fine.
> >
> > I think it's right to move the xid handling out of
> > CheckForSerializableConflictOut(). But I think we also ought to move the
> > subtransaction handling out of the function - e.g. zheap doesn't
> > want/need that.
>
> Thoughts on this Ashwin?
>

I think the only part it is doing for sub-transactions is
SubTransGetTopmostTransaction(xid). If the xid passed to this function is
already the topmost transaction, which is the case for zheap and zedstore,
then there is no downside to keeping that code here in the common place.


Re: Assertion for logically decoding multi inserts into the catalog

2019-07-31 Thread Heikki Linnakangas

On 08/07/2019 22:42, Daniel Gustafsson wrote:

My patch for using heap_multi_insert in the catalog [1] failed the logical
decoding part of test/recovery [2].

The assertion it failed on seems to not take multi inserts into the catalog
into consideration, while the main logic does.  This assertion hasn't tripped
since there are no multi inserts into the catalog, but if we introduce them it
will so I’m raising it in a separate thread as it is sort of unrelated from the
patch in question.

The attached patch fixes my test failure and makes sense to me, but this code
is far from my neck of the tree, so I’m really not sure this is the best way to
express the assertion.

--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -974,7 +974,8 @@ DecodeMultiInsert(LogicalDecodingContext *ctx, 
XLogRecordBuffer *buf)
ReorderBufferQueueChange(ctx->reorder, XLogRecGetXid(r),
 buf->origptr, 
change);
}
-   Assert(data == tupledata + tuplelen);
+   Assert(xlrec->flags & XLH_INSERT_CONTAINS_NEW_TUPLE &&
+  data == tupledata + tuplelen);
 }
 
 /*


This patch makes the assertion more strict than it was before. I don't
see how it could possibly make a regression failure go away. On the
contrary. So, huh?


- Heikki




Re: [HACKERS] Cached plans and statement generalization

2019-07-31 Thread Heikki Linnakangas

On 09/07/2019 23:59, Konstantin Knizhnik wrote:

Fixed patch version of the path is attached.


Much of the patch and the discussion has been around the raw parsing, 
and guessing which literals are actually parameters that have been 
inlined into the SQL text. Do we really need to do that, though? The 
documentation mentions pgbouncer and other connection poolers, where you 
don't have session state, as a use case for this. But such connection 
poolers could and should still be using the extended query protocol, 
with Parse+Bind messages and query parameters, even if they don't use 
named prepared statements. I'd want to encourage applications and 
middleware to use out-of-band query parameters, to avoid SQL injection 
attacks, regardless of whether they use prepared statements or cache 
query plans. So how about dropping all the raw parse tree stuff, and 
doing the automatic caching of plans based on the SQL string, somewhere 
in the exec_parse_message? Check the autoprepare cache in 
exec_parse_message(), if it was an "unnamed" prepared statement, i.e. if 
the prepared statement name is an empty string.


(I'm actually not sure if exec_parse/bind_message is the right place for 
this, but I saw that your current patch did it in exec_simple_query, and 
exec_parse/bind_message are the equivalent of that for the extended 
query protocol).
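
A bare-bones sketch of that shape, just to be concrete (autoprepare_lookup(), AutopreparedQuery and the list here are made-up names; a real version would of course use a hash table and handle invalidation):

#include "postgres.h"
#include "nodes/pg_list.h"
#include "utils/plancache.h"

typedef struct AutopreparedQuery
{
	char	   *query_string;		/* key: SQL text exactly as sent */
	CachedPlanSource *psrc;			/* saved plancache entry */
} AutopreparedQuery;

static List *autoprepared_queries = NIL;

/* Called from exec_parse_message() only when stmt_name is "". */
static CachedPlanSource *
autoprepare_lookup(const char *query_string)
{
	ListCell   *lc;

	foreach(lc, autoprepared_queries)
	{
		AutopreparedQuery *aq = (AutopreparedQuery *) lfirst(lc);

		if (strcmp(aq->query_string, query_string) == 0)
			return aq->psrc;	/* reuse, skip raw parsing */
	}
	return NULL;				/* first time: parse normally, then remember */
}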


- Heikki




Re: How to retain lesser paths at add_path()?

2019-07-31 Thread Tom Lane
Robert Haas  writes:
> Yeah, but I have to admit that this whole design makes me kinda
> uncomfortable.  Every time somebody comes up with a new figure of
> merit, it increases not only the number of paths retained but also the
> cost of comparing two paths to possibly reject one of them. A few
> years ago, you came up with the (good) idea of rejecting some join
> paths before actually creating the paths, and I wonder if we ought to
> try to go further with that somehow. Or maybe, as Peter Geoghegan, has
> been saying, we ought to think about planning top-down with
> memoization instead of bottom up (yeah, I know that's a huge change).
> It just feels like the whole idea of a list of paths ordered by cost
> breaks down when there are so many ways that a not-cheapest path can
> still be worth keeping. Not sure exactly what would be better, though.

Yeah, I agree that add_path is starting to feel creaky.  I don't
know what to do instead though.  Changing to a top-down design
sounds like it would solve some problems while introducing others
(not to mention the amount of work and breakage involved).

regards, tom lane




Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-31 Thread Anastasia Lubennikova

24.07.2019 4:22, Peter Geoghegan wrote:


Attached is a revised version of your v2 that fixes this issue -- I'll
call this v3. In general, my goal for the revision was to make sure
that all of my old tests from the v12 work passed, and to make sure
that amcheck can detect almost any possible problem. I tested the
amcheck changes by corrupting random state in a test index using
pg_hexedit, then making sure that amcheck actually complained in each
case.

I also fixed one or two bugs in passing, including the bug that caused
an assertion failure in _bt_truncate(). That was down to a subtle
off-by-one issue within _bt_insertonpg_in_posting(). Overall, I didn't
make that many changes to your v2. There are probably some things
about the patch that I still don't understand, or things that I have
misunderstood.


Thank you for this review and fixes.

* Changed the custom binary search code within _bt_compare_posting()
to look more like _bt_binsrch() and _bt_binsrch_insert(). Do you know
of any reason not to do it that way?

It's ok to update it. There was no particular reason, just my habit.

* Added quite a few "FIXME"/"XXX" comments at various points, to
indicate where I have general concerns that need more discussion.


+     * FIXME:  The calls to BTreeGetNthTupleOfPosting() allocate memory,


If we only need to check TIDs, we don't need BTreeGetNthTupleOfPosting(),
we can use BTreeTupleGetPostingN() instead and iterate over TIDs, not
tuples.


Fixed in version 4.


* Included my own pageinspect hack to visualize the minimum TIDs in
posting lists. It's broken out into a separate patch file. The code is
very rough, but it might help someone else, so I thought I'd include
it.

Cool, I think we should add it to the final patchset,
probably as a separate function, by analogy with tuple_data_split.


I also have some new concerns about the code in the patch that I will
point out now (though only as something to think about a solution on
-- I am unsure myself):

* It's a bad sign that compression involves calls to PageAddItem()
that are allowed to fail (we just give up on compression when that
happens). For one thing, all existing calls to PageAddItem() in
Postgres are never expected to fail -- if they do fail we get a "can't
happen" error that suggests corruption. It was a good idea to take
this approach to get the patch to work, and to prove the general idea,
but we now need to fully work out all the details about the use of
space. This includes complicated new questions around how alignment is
supposed to work.


The main reason to implement this gentle error handling is the fact that
deduplication could cause storage overhead, which leads to running out
of space on the page.

First of all, it is a legacy of the previous versions where
BTreeFormPostingTuple was not able to form a non-posting tuple even in the
case where the number of posting items is 1.

Another case that was in my mind is the situation where we have 2 tuples:
t_tid | t_info | key + t_tid | t_info | key

and compressed result is:
t_tid | t_info | key | t_tid | t_tid

If sizeof(t_info) + sizeof(key) < sizeof(t_tid), the resulting posting tuple
can be larger. It may happen if keysize <= 4 bytes.
In this situation the original tuples must have been aligned to a size of 16
bytes each, and the resulting tuple is at most 24 bytes (6+2+4+6+6). So this
case is also safe.


I changed the DEBUG message to ERROR in v4 and it passes all regression tests.
I doubt that this covers all corner cases, so I'll try to add more special
tests.



Alignment in nbtree is already complicated today -- you're supposed to
MAXALIGN() everything in nbtree, so that the MAXALIGN() within
bufpage.c routines cannot be different to the lp_len/IndexTupleSize()
length (note that heapam can have tuples whose lp_len isn't aligned,
so nbtree could do it differently if it proved useful). Code within
nbtsplitloc.c fully understands the space requirements for the
bufpage.c routines, and is very careful about it. (The bufpage.c
details are supposed to be totally hidden from code like
nbtsplitloc.c, but I guess that that ideal isn't quite possible in
reality. Code comments don't really explain the situation today.)

I'm not sure what it would look like for this patch to be as precise
about free space as nbtsplitloc.c already is, even though that seems
desirable (I just know that it would mean you would trust
PageAddItem() to work in all cases). The patch is different to what we
already have today in that it tries to add *less than* a single
MAXALIGN() quantum at a time in some places (when a posting list needs
to grow by one item). The devil is in the details.

* As you know, the current approach to WAL logging is very
inefficient. It's okay for now, but we'll need a fine-grained approach
for the patch to be commitable. I think that this is subtly related to
the last item (i.e. the one about alignment). I have done basic
performance tests using unlogged tables. The patch seems to either
make big INSERT 

Re: TopoSort() fix

2019-07-31 Thread Robert Haas
On Tue, Jul 30, 2019 at 2:10 PM Tom Lane  wrote:
> Sure.  But I think what we can foresee is that if there are any bugs
> reachable this way, they'd be reachable and need fixing regardless.
> We've already established that parallel workers can take and release locks
> that their leader isn't holding.  Apparently, they won't take anything
> stronger than RowExclusiveLock; but even AccessShare is enough to let a
> process participate in all interesting behaviors of the lock manager,
> including blocking, being blocked, and being released early by deadlock
> resolution.  And the advisory-lock functions are pretty darn thin wrappers
> around the lock manager.  So I'm finding it hard to see where there's
> incremental risk, even if a user does intentionally bypass the parallel
> safety markings.  And what we get in return is an easier way to add tests
> for this area.

Sure, I was basically just asking whether you could foresee any
crash-risk of the proposed change.  It sounds like the answer is "no,"
so I'm fine with it on that basis.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: How to retain lesser paths at add_path()?

2019-07-31 Thread Robert Haas
On Wed, Jul 31, 2019 at 11:07 AM Tom Lane  wrote:
> What you'd want to do for something like the above, I think, is to
> have some kind of figure of merit or other special marking for paths
> that will have some possible special advantage in later planning
> steps.  Then you can teach add_path that that's another dimension it
> should consider, in the same way that paths with different sort orders
> or parallizability attributes don't dominate each other.

Yeah, but I have to admit that this whole design makes me kinda
uncomfortable.  Every time somebody comes up with a new figure of
merit, it increases not only the number of paths retained but also the
cost of comparing two paths to possibly reject one of them. A few
years ago, you came up with the (good) idea of rejecting some join
paths before actually creating the paths, and I wonder if we ought to
try to go further with that somehow. Or maybe, as Peter Geoghegan has
been saying, we ought to think about planning top-down with
memoization instead of bottom up (yeah, I know that's a huge change).
It just feels like the whole idea of a list of paths ordered by cost
breaks down when there are so many ways that a not-cheapest path can
still be worth keeping. Not sure exactly what would be better, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Unused header file inclusion

2019-07-31 Thread Andres Freund
Hi,

On 2019-07-31 11:23:22 -0400, Alvaro Herrera wrote:
> I think removing unnecessary include lines from header files is much
> more useful than from .c files.  However, nowadays even I am not very
> convinced that that is a very fruitful use of time, since many/most
> developers use ccache which will reduce the compile times anyway in many
> cases; and development machines are typically much faster than ten years
> ago.

IDK, I find the compilation times annoying. And it's gotten quite
noticeably worse with all the speculative execution mitigations. Although
to some degree that's not really the fault of individual compilations,
but our buildsystem being pretty slow.

I think there's also just modularity reasons for removing includes from
headers. We've some pretty oddly interlinked systems, often without good
reason (*). Cleaning those up imo is well worth the time - but hardly
can be automated.

If one really wanted to automate removal of header files, it'd need to
be a lot smarter than just checking whether a file fails to compile if
one header is removed. In the general case we'd have to test if the .c
file itself uses any of the symbols from the possibly-to-be-removed
header. That's hard to do without using something like llvm's
libclang. The one case that's perhaps a bit easier to automate, and
possibly worthwhile: If a header is not indirectly included (possibly
testable via #ifndef HEADER_NAME_H #error 'already included' etc), and
compilation doesn't fail with it removed, *then* it's actually likely
useless (except for portability cases).
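
Concretely, the probe could be as simple as this (FOO_H and foo.h stand in for the header's include guard and file name):

/*
 * Step 1: place this immediately before the direct #include "foo.h" in the
 * .c file.  If it compiles, foo.h is not arriving indirectly.  Step 2:
 * remove the direct include; if the file still compiles, the include was
 * likely useless (portability-only headers excepted).
 */
#ifdef FOO_H
#error "foo.h already included indirectly"
#endif
#include "foo.h"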

* I think a lot of the interlinking stems from the bad idea to use
typedef's everywhere. In contrast to structs they cannot be forward
declared portably in our version of C. We should use a lot more struct
forward declarations, and just not use the typedef.

Greetings,

Andres Freund




Re: partition routing layering in nodeModifyTable.c

2019-07-31 Thread Andres Freund
Hi,

On 2019-07-31 21:03:58 +0900, Etsuro Fujita wrote:
> I'm still not sure that it's a good idea to remove
> es_result_relation_info, but if I had to say then I think the latter
> would probably be better.  I'm planning to rework on direct
> modification to base it on upper planner pathification so we can
> perform direct modification without the ModifyTable node.  (I'm not
> sure we can really do this for inherited UPDATE/DELETE, though.)  For
> that rewrite, I'm thinking to call BeginDirectModify() from the
> ForeignScan node (ie, ExecInitForeignScan()) as-is.  The latter
> approach would allow that without any changes and avoid changing that
> API many times.  That's the reason why I think the latter would
> probably be better.

I think if we did that, it'd become *more* urgent to remove
es_result_relation. Having more and more plan nodes change global
resources is a recipe for disaster.

Greetings,

Andres Freund




Re: Unused header file inclusion

2019-07-31 Thread Alvaro Herrera
On 2019-Jul-31, vignesh C wrote:

> I noticed that there are many header files being
> included which need not be included.

Yeah, we have tooling for this already in src/tools/pginclude.  It's
been used before, and it has wreaked considerable havoc; see "git log
--grep pgrminclude".

I think doing this sort of cleanup is useful to a point -- as Andres
mentions, some includes are somewhat more "important" than others, so
judgement is needed in each case.

I think removing unnecessary include lines from header files is much
more useful than from .c files.  However, nowadays even I am not very
convinced that that is a very fruitful use of time, since many/most
developers use ccache which will reduce the compile times anyway in many
cases; and development machines are typically much faster than ten years
ago.

Also, I think addition of new include lines to existing .c files should
be a point worth specific attention in patch review, to avoid breaking
reasonable modularity boundaries unnecessarily.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: How to retain lesser paths at add_path()?

2019-07-31 Thread Tom Lane
Kohei KaiGai  writes:
> When we add a new path using add_path(), it checks estimated cost and 
> path-keys,
> then it also removes dominated paths, if any.
> Do we have a reasonable way to retain these "dominated" paths? Once it
> is considered
> lesser paths at a level, however, it may have a combined cheaper cost
> with upper pathnode.

You do *not* want to have add_path fail to remove dominated paths in
general.  Don't even think about it, because otherwise you will have
plenty of time to regret your folly while you wait for the planner
to chew through an exponential number of possible join plans.

What you'd want to do for something like the above, I think, is to
have some kind of figure of merit or other special marking for paths
that will have some possible special advantage in later planning
steps.  Then you can teach add_path that that's another dimension it
should consider, in the same way that paths with different sort orders
or parallelizability attributes don't dominate each other.

regards, tom lane




Global temporary tables

2019-07-31 Thread Konstantin Knizhnik
The current Postgres implementation of temporary tables causes a number of
problems:


1. Catalog bloating: if a client creates and deletes too many temporary
tables, then autovacuum gets stuck on the catalog.
2. Parallel queries: right now usage of temporary tables in a query
disables parallel plans.
3. It is not possible to use temporary tables on a replica. A hot standby
configuration is frequently used to run OLAP queries on a replica,
and the results of such queries would naturally be saved in temporary tables.
Right now that is not possible (except for the "hackers" solution of storing
results in file_fdw).

4. Temporary tables can not be used in prepared transactions.
5. Inefficient memory usage and possible memory overflow: each backend
maintains its own local buffers for working with temporary tables.
The default size of temporary buffers is 8MB. That seems too small for
modern servers having hundreds of gigabytes of RAM, causing extra copying
of data between the OS cache and local buffers. But if there are thousands
of backends, each executing queries with temporary tables, then the total
amount of memory used for temporary buffers can exceed several tens of
gigabytes.
6. A connection pooler cannot reschedule a session which has created
temporary tables to some other backend, because its data is stored in local
buffers.

There were several attempts to address these problems.
For example, Alexandr Alekseev implemented a patch which allows creating
fast temporary tables without accessing the system catalog:

https://www.postgresql.org/message-id/flat/20160301182500.2c81c3dc%40fujitsu
Unfortunately this patch was too invasive and was rejected by the community.

There was also an attempt to allow, under some conditions, the use of
temporary tables in 2PC transactions:

https://www.postgresql.org/message-id/flat/m2d0pllvqy.fsf%40dimitris-macbook-pro.home
https://www.postgresql.org/message-id/flat/3a4b3c88-4fa5-1edb-a878-1ed76fa1c82b%40postgrespro.ru#d8a8342d07317d12e3405b903d3b15e4
They were also rejected.

I am making yet another attempt to address these problems, first of all
1), 2), 5) and 6).
To solve them I propose the notion of "global temporary" tables, similar
to the ones in Oracle.
The definition of such a table (its metadata) is shared by all backends, but
the data is private to each backend. After session termination the data is
obviously lost.


Suggested syntax for creation of global temporary tables:

    create global temp table
or
    create session table

Once created, such a table can be used by all backends.
Global temporary tables are accessed through shared buffers (to solve
problem 2).
Cleanup of temporary table data (release of shared buffers and deletion
of relation files) is performed on backend termination.
In case of abnormal server termination, the files of global temporary tables
are cleaned up in the same way as those of local temporary tables.
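
For illustration, the cleanup hook could be registered roughly like this (a sketch only; gtt_cleanup_session_data() and gtt_register_cleanup() are made-up names, and the attached patch may organize this differently):

#include "postgres.h"
#include "storage/ipc.h"

/* Drop this backend's private data for all global temporary tables. */
static void
gtt_cleanup_session_data(int code, Datum arg)
{
	/*
	 * Release the session's shared buffers for global temporary tables and
	 * unlink the backend-private relation files.  The bookkeeping of which
	 * relations this session touched is elided in this sketch.
	 */
}

static void
gtt_register_cleanup(void)
{
	/* run once per backend, before shared memory access goes away */
	before_shmem_exit(gtt_cleanup_session_data, (Datum) 0);
}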


Certainly there are cases where global temporary tables cannot be used,
i.e. when the application dynamically constructs the name and columns of the
temporary table.
Also access to local buffers is more efficient than access to shared
buffers because it doesn't require any synchronization.
But please notice that it is always possible to create old-style (local)
temporary tables, which preserves the current behavior.


The problem with replicas is still not solved, but shared metadata is a
step in this direction.
I am thinking about a reimplementation of temporary tables using the new
table access method API.
The drawback of such an approach is that it would be necessary to
reimplement a large bulk of the heapam code.
But this approach would allow eliminating the visibility check for temporary
table tuples and decreasing the size of the tuple header.
I am still not sure whether implementing a special table access method for
temporary tables is a good idea.


A patch for global temporary tables is attached to this mail.
The known limitation is that for now it supports only B-Tree indexes.
Any feedback is welcome.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

commit 4d7a1f51405cdcb4d9dc421b0cb3a0d3439ba362
Author: Konstantin Knizhnik 
Date:   Tue Jun 25 14:49:50 2019 +0300

Add session tables

diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index c945b28..14d4e48 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -95,13 +95,13 @@ ginRedoInsertEntry(Buffer buffer, bool isLeaf, BlockNumber rightblkno, void *rda
 
 	if (PageAddItem(page, (Item) itup, IndexTupleSize(itup), offset, false, false) == InvalidOffsetNumber)
 	{
-		RelFileNode node;
+		RelFileNodeBackend rnode;
 		ForkNumber	forknum;
 		BlockNumber blknum;
 
-		BufferGetTag(buffer, &node, &forknum, &blknum);
+		BufferGetTag(buffer, &rnode, &forknum, &blknum);
 		elog(ERROR, "failed to add item to index page in %u/%u/%u",
-			 node.spcNode, node.dbNode, node.relNode);
+			 rnode.node.spcNode, rnode.node.dbNode, rnode.node.relNode);
 	}
 }
 
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

2019-07-31 Thread Rafia Sabih
On Thu, 25 Jul 2019 at 05:49, Peter Geoghegan  wrote:
>
> On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan  wrote:
> > There seems to be a kind of "synergy" between the nbtsplitloc.c
> > handling of pages that have lots of duplicates and posting list
> > compression. It seems as if the former mechanism "sets up the bowling
> > pins", while the latter mechanism "knocks them down", which is really
> > cool. We should try to gain a better understanding of how that works,
> > because it's possible that it could be even more effective in some
> > cases.
>
> I found another important way in which this synergy can fail to take
> place, which I can fix.
>
> By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes
> from my test suite become much smaller, while most are not affected.
> These indexes were not helped too much by the patch before. For
> example, the TPC-E i_t_st_id index is 50% smaller. It is entirely full
> of duplicates of a single value (that's how it appears after an
> initial TPC-E bulk load), as are a couple of other TPC-E indexes.
> TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its
> idx_lineitem_orderkey index becomes ~15% smaller.
>
> I believe that this happened because rightmost page splits were an
> inefficient case for compression. But rightmost page split heavy
> indexes with lots of duplicates are not that uncommon. Think of any
> index with many NULL values, for example.
>
> I don't know for sure if BT_COMPRESS_THRESHOLD should be removed. I'm
> not sure what the idea is behind it. My sense is that we're likely to
> benefit by delaying page splits, no matter what. Though I am still
> looking at it purely from a space utilization point of view, at least
> for now.
>
Minor comment fix: pointes --> pointers. Plus, are we really splitting into
halves, or just splitting into two?
/*
+ * Split posting tuple into two halves.
+ *
+ * Left tuple contains all item pointes less than the new one and
+ * right tuple contains new item pointer and all to the right.
+ *
+ * TODO Probably we can come up with more clever algorithm.
+ */

Some remains of 'he'.
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list.
+ * Caller is responsible for checking BTreeTupleIsPosting to ensure that
+ * it will get what he expects
+ */

Everything reads just fine without 'us'.
/*
+ * This field helps us to find beginning of the remaining tuples from
+ * postings which follow array of offset numbers.
+ */
-- 
Regards,
Rafia Sabih




Re: pgbench - implement strict TPC-B benchmark

2019-07-31 Thread Jonah H. Harris
On Wed, Jul 31, 2019 at 10:10 AM Fabien COELHO  wrote:

>
> Hello Tom,
>
> >>> I'm also highly dubious about labeling this script "standard TPC-B",
> >>> when it resolves only some of the reasons why our traditional script
> >>> is not really TPC-B.  That's treading on being false advertising.
> >
> >> IANAL, but it may not even be permissible to claim that we have
> >> implemented "standard TPC-B".
> >
> > Yeah, very likely you can't legally say that unless the TPC
> > has certified your test.  (Our existing code and docs are quite
> > careful to call pgbench's version "TPC-like" or similar weasel
> > wording, and never claim that it is really TPC-B or even a close
> > approximation.)
>
> Hmmm.
>
> I agree that nobody really cares about TPC-B per se. The point of this
> patch is to provide a built-in example of recent and useful pgbench
> features that match a real specification.
>

I agree with this. When I was at EnterpriseDB, while it wasn't audited, we
had to develop an actual TPC-B implementation because pgbench was too
different. pgbench itself isn't that useful as a benchmark tool, imo, but
if we have the ability to make it better (i.e. closer to an actual
benchmark kit), I think we should.

-- 
Jonah H. Harris


Re: pgbench - implement strict TPC-B benchmark

2019-07-31 Thread Fabien COELHO


Hello Tom,


I'm also highly dubious about labeling this script "standard TPC-B",
when it resolves only some of the reasons why our traditional script
is not really TPC-B.  That's treading on being false advertising.



IANAL, but it may not even be permissible to claim that we have
implemented "standard TPC-B".


Yeah, very likely you can't legally say that unless the TPC
has certified your test.  (Our existing code and docs are quite
careful to call pgbench's version "TPC-like" or similar weasel
wording, and never claim that it is really TPC-B or even a close
approximation.)


Hmmm.

I agree that nobody really cares about TPC-B per se. The point of this
patch is to provide a built-in example of recent and useful pgbench
features that match a real specification.


The "strict" only refers to the test script. It cannot match the whole 
spec which addresses many other things, some of them more process than 
tool: table creation and data types, performance data collection, database 
configuration, pricing of hardware used in the tests, post-benchmark run 
checks, auditing constraints, whatever…



[about pgbench] it's got too many "features" already.


I disagree with this judgement.

Although not all features are that useful, the accumulation of recent 
additions (int/float/bool expressions, \if, \gset, non-uniform prng, …) 
makes it suitable for testing various realistic scenarios which could not 
be tested before. As said above, my point was to have a builtin 
illustration of the available capabilities.


It did not occur to me that a script which "strictly" implements a 
particular section of a 25-year-old obsolete benchmark could raise any legal 
issue.


--
Fabien.

Re: Replication & recovery_min_apply_delay

2019-07-31 Thread Konstantin Knizhnik




On 08.07.2019 11:05, Michael Paquier wrote:

On Mon, Jul 08, 2019 at 07:56:25PM +1200, Thomas Munro wrote:

On Thu, Jan 31, 2019 at 3:34 AM Alvaro Herrera  wrote:

On 2019-Jan-30, Konstantin Knizhnik wrote:

I wonder if it can be considered as acceptable solution of the problem or
there can be some better approach?

I didn't find one.

It sounds like you are in agreement that there is a problem and this
is the best solution.  I didn't look at these patches, I'm just asking
with my Commitfest manager hat on: did I understand correctly, does
this need a TAP test, possibly the one Alvaro posted, and if so, could
we please have a fresh patch that includes the test, so we can see it
passing the test in CI?

Please note that I have not looked at that stuff in details, but I
find the patch proposed kind of ugly with the scan of the last segment
using a WAL reader to find out what is the last LSN and react on
that..  This does not feel right.
--
Michael


I am sorry for the delayed answer.
It looks like I did not notice your reply :(
Can you please explain why it is not correct to iterate through the WAL 
using a WAL reader to get the last LSN?
From my point of view it may not be the most efficient way, but it should 
return the correct value, shouldn't it?

Can you suggest a better way to calculate the last LSN?
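For reference, here is a rough sketch of the kind of loop being discussed
(PG12-era xlogreader API, usual backend includes assumed; the page-read
callback and its private data are supplied by the caller, and this is not
the code from the actual patch):

/*
 * Illustrative sketch only: scan forward from a known-good starting point,
 * e.g. the last checkpoint's redo pointer, and return the end of the last
 * record that can still be decoded.
 */
static XLogRecPtr
find_last_lsn(XLogRecPtr startptr, XLogPageReadCB page_read, void *private_data)
{
	XLogReaderState *xlogreader;
	XLogRecord *record;
	char	   *errormsg;
	XLogRecPtr	last_end = InvalidXLogRecPtr;

	xlogreader = XLogReaderAllocate(wal_segment_size, page_read, private_data);
	if (xlogreader == NULL)
		elog(ERROR, "out of memory");

	/* the first call positions the reader; later calls continue from there */
	record = XLogReadRecord(xlogreader, startptr, &errormsg);
	while (record != NULL)
	{
		last_end = xlogreader->EndRecPtr;
		record = XLogReadRecord(xlogreader, InvalidXLogRecPtr, &errormsg);
	}

	XLogReaderFree(xlogreader);
	return last_end;
}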

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: Unused header file inclusion

2019-07-31 Thread Tom Lane
Michael Paquier  writes:
> On Wed, Jul 31, 2019 at 11:55:37AM +0530, Amit Kapila wrote:
>> If we can come up with some such tool, we might be able to integrate
>> it with Thomas's patch tester [1] wherein it can apply the patch,
>> verify if there are unnecessary includes in the patch and report the
>> same.

> Or even get something into src/tools/?  If the produced result is
> clean enough, that could be interesting.

I take it nobody has actually bothered to *look* in src/tools.

src/tools/pginclude/README

Note that our experience with this sort of thing has not been very good.
See particularly 1609797c2.

regards, tom lane




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-31 Thread Sehrope Sarkuni
On Wed, Jul 31, 2019 at 2:32 AM Masahiko Sawada 
wrote:

> Just to confirm, we have 21 bits left for nonce in CTR? We have LSN (8
> bytes), page-number (4 bytes) and counter (11 bits) in 16 bytes nonce
> space. Even though we have 21 bits left we cannot store relfilenode to
> the IV.
>

Fields like relfilenode, database, or tablespace could be added to the
derived key, not the per-page IV. There are no space limitations as they are
additional inputs into the HKDF (key derivation function).

Additional salt of any size, either some stored random value or something
deterministic like the relfilenode/database/tablespace, can be added to the
HKDF. There are separate issues with including those specific fields, but it's
not a size limitation.
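Purely as an illustration (the helper and layout below are made up for the
sketch, not taken from any patch), those identifiers would simply be
serialized into the HKDF "info" input, so they never compete with the 16
bytes available for the IV:

#include <stdint.h>
#include <string.h>

#define KDF_LABEL "PAGE-KEY"

/*
 * Hypothetical sketch: pack a label plus relation identifiers into a buffer
 * that is fed to HKDF as the "info" input when deriving a per-relation key
 * from the master key.  Nothing here is size-limited the way the IV is.
 */
static size_t
build_kdf_info(uint8_t *out, uint32_t spcoid, uint32_t dboid, uint32_t relfilenode)
{
	size_t		off = 0;

	memcpy(out + off, KDF_LABEL, sizeof(KDF_LABEL));	/* includes NUL as separator */
	off += sizeof(KDF_LABEL);
	memcpy(out + off, &spcoid, sizeof(spcoid));
	off += sizeof(spcoid);
	memcpy(out + off, &dboid, sizeof(dboid));
	off += sizeof(dboid);
	memcpy(out + off, &relfilenode, sizeof(relfilenode));
	off += sizeof(relfilenode);

	return off;					/* caller passes (out, off) as the HKDF info */
}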


> For WAL encryption,  before flushing WAL we encrypt whole 8k WAL page
> and then write only the encrypted data of the new WAL record using
> pg_pwrite() rather than write whole encrypted page. So each time we
> encrypt 8k WAL page we end up with encrypting different data with the
> same key+nonce but since we don't write to the disk other than space
> where we actually wrote WAL records it's not a problem. Is that right?
>

Ah, this is what I was referring to in my previous mail. I'm not familiar
with how the writes happen yet (reading up...) but, yes, we would need to
ensure that encrypted data is not written more than once (i.e. no writing
of encrypt(zero) followed by writing of encrypt(non-zero) at the same spot).

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: proposal - patch: psql - sort_by_size

2019-07-31 Thread Fabien COELHO



Hello Jeremy,


Comments, notes?


One oddity about pg_relation_size and pg_table_size is that they can be
easily blocked by user activity.  In fact it happens to us often in
reporting environments and we have instead written different versions of
them that avoid the lock contention and still give "close enough" results.

This blocking could result in quite unexpected behavior: someone uses
your proposed command and it never returns.  Has that been considered, at
least as something to be documented?


ISTM that it does not change anything wrt the current behavior because of 
the prudent lazy approach: the sorting is *only* performed when the size 
is already available in one of the printed columns.


Maybe the more general question could be "is there a caveat somewhere that 
when doing \d.+ a user may have issues with locks because of the size 
computations?".


--
Fabien.




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-31 Thread Sehrope Sarkuni
On Tue, Jul 30, 2019 at 4:48 PM Bruce Momjian  wrote:

> I had more time to think about the complexity of adding relfilenode to
> the IV.  Since relfilenode is only unique within a database/tablespace,
> we would need to have pg_upgrade preserve database/tablespace oids
> (which I assume are the same as the directory and tablespace symlinks).
> Then, to decode a page, you would need to look up those values.  This is
> in addition to the new complexity of CREATE DATABASE and moving files
> between tablespaces.  I am also concerned that crash recovery operations
> and cluster forensics and repair would need to also deal with this.
>
> I am not even clear if pg_upgrade preserving relfilenode is possible ---
> when we wrap the relfilenode counter, does it start at 1 or at the
> first-user-relation-oid?  If the former, it could conflict with oids
> assigned to new system tables in later major releases.  Tying the
> preservation of relations to two restrictions seems risky.
>

Agreed. Unless you know for sure the input is going to be immutable across
copies or upgrades, including anything in either the IV or key derivation
gets risky and could tie you down for the future. That's partly why I like
the idea of a separate salt (basically you directly pay for the complexity by
tracking that).

Even if we do not include a separate per-relation salt or things like
relfilenode when generating a derived key, we can still include other types
of immutable attributes. For example the fork type could be included to
eventually allow multiple forks for the same relation to be encrypted with
the same IV = LSN + Page Number as the derived key per-fork would be
distinct.


> Using just the page LSN and page number allows a page to be be
> decrypted/encrypted independently of its file name, tablespace, and
> database, and I think that is a win for simplicity.  Of course, if it is
> insecure we will not do it.
>

As LSN + Page Number combo is unique for all relations (not just one
relation) I think we're good for pages.

I am thinking for the heap/index IV, it would be:
>
> uint64 lsn;
> unint32 page number;
> /* only uses 11 bits for a zero-based CTR counter for 32k pages */
> uint32 counter;
>

Looks good.
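For concreteness, a minimal sketch of assembling such a 16-byte IV (purely
illustrative; the field order and widths simply follow the proposal quoted
above):

#include <stdint.h>
#include <string.h>

/*
 * Illustrative sketch: build the 16-byte CTR IV for one 16-byte AES block
 * of a heap/index page.  The last word is the zero-based block counter
 * within the page (11 bits are enough for 32 kB pages), so one page's
 * counter range can never spill into another page's IV.
 */
static void
build_page_iv(uint8_t iv[16], uint64_t page_lsn, uint32_t page_no,
			  uint32_t block_in_page)
{
	memcpy(iv, &page_lsn, 8);
	memcpy(iv + 8, &page_no, 4);
	memcpy(iv + 12, &block_in_page, 4);	/* only the low 11 bits are ever used */
}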


> and for WAL it would be:
>
> uint64 segment_number;
> uint32counter;
> /* guarantees this IV doesn't match any relation IV */
> uint32   2^32-1 /* all 1's */
>

I need to read up more on the structure of the WAL records but here's some
high level thoughts:

WAL encryption should not use the same key as page encryption so there's no
need to design the IV to try to avoid matching the page IVs. Even a basic
derivation with a single fixed WDEK = HKDF(MDEK, "WAL") and TDEK =
HKDF(MDEK, "PAGE") would ensure separate keys. That's the literal
string "WAL" or "PAGE" being added as a salt to generate the respective
keys; all that matters is that they're different.
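For example, a minimal sketch of such a derivation with OpenSSL 1.1's HKDF
interface (illustrative only; key sizes, labels and error handling are
simplified, and this is not code from any patch):

#include <openssl/evp.h>
#include <openssl/kdf.h>
#include <string.h>

/*
 * Derive a 256-bit subkey from the master key (MDEK) with HKDF-SHA256,
 * using a context label such as "WAL" or "PAGE" as the info input so the
 * WAL key and the relation key can never collide.
 */
static int
derive_subkey(const unsigned char *mdek, size_t mdek_len,
			  const char *label, unsigned char out[32])
{
	EVP_PKEY_CTX *pctx = EVP_PKEY_CTX_new_id(EVP_PKEY_HKDF, NULL);
	size_t		outlen = 32;
	int			ok;

	if (pctx == NULL)
		return 0;
	ok = EVP_PKEY_derive_init(pctx) > 0 &&
		EVP_PKEY_CTX_set_hkdf_md(pctx, EVP_sha256()) > 0 &&
		EVP_PKEY_CTX_set1_hkdf_key(pctx, mdek, mdek_len) > 0 &&
		EVP_PKEY_CTX_add1_hkdf_info(pctx, (const unsigned char *) label,
									strlen(label)) > 0 &&
		EVP_PKEY_derive(pctx, out, &outlen) > 0;
	EVP_PKEY_CTX_free(pctx);
	return ok;
}

/* Usage: derive_subkey(mdek, 32, "WAL", wdek); derive_subkey(mdek, 32, "PAGE", tdek); */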

Ideally WAL encryption would generate new derived keys as part of the WAL
stream. The WAL stream is not fixed so you have the luxury of being able to
add a "Use new random salt XYZ going forward" record. Forcing generation
of a new salt/key upon promotion of a replica would ensure that at least
the WAL is unique going forward. One could also generate a new one upon server
startup, after every N bytes, or a new one for each new WAL file. There's
much more flexibility compared to page encryption.

As WAL is a single continuous stream, we can start the IV for each derived
WAL key from zero. There's no need to complicate it further as Key + IV
will never be reused.

If WAL is always written as full pages we need to ensure that the empty
parts of the page are actual zeros and not "encrypted zeroes". Otherwise an
XOR of the empty section of the first write of a page against a subsequent
one would give you the plain text.

The non-fixed size of the WAL allows for the addition of a MAC though I'm
not sure yet the best way to incorporate it. It could be part of each
encrypted record or its own summary record (providing a MAC for a series of
WAL records). After I've gone through this a bit more I'm looking to put
together a write up with this and some other thoughts in one place.

Regards,
-- Sehrope Sarkuni
Founder & CEO | JackDB, Inc. | https://www.jackdb.com/


Re: POC: Cleaning up orphaned files using undo logs

2019-07-31 Thread Amit Kapila
On Wed, Jul 31, 2019 at 10:13 AM Amit Kapila  wrote:
>
> On Tue, Jul 30, 2019 at 5:26 PM Thomas Munro  wrote:
> >
> > Hi Amit
> >
> > I've been testing some undo worker workloads (more on that soon),
> >
>
> One small point, there is one small bug in the error queues which is
> that the element pushed into error queue doesn't have an updated value
> of to_urec_ptr which is important to construct the hash key.  This
> will lead to undolauncher/worker think that the action for the same is
> already processed and it removes the same from the hash table. I have
> a fix for the same which I will share in next version of the patch
> (which I am going to share in the next day or two).
>
> >  but
> > here's a small thing: I managed to reach an LWLock self-deadlock in
> > the undo worker launcher:
> >
>
> I could see the problem, will fix in next version.
>

Fixed both of these problems in the patch just posted by me [1].

[1] - 
https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: POC: Cleaning up orphaned files using undo logs

2019-07-31 Thread Amit Kapila
On Wed, Jul 24, 2019 at 10:00 AM Dilip Kumar  wrote:
>
> On Mon, Jul 22, 2019 at 3:51 PM Dilip Kumar  wrote:
> >
> Please find my review comments for
> 0013-Allow-foreground-transactions-to-perform-undo-action
>
> + /* initialize undo record locations for the transaction */
> + for (i = 0; i < UndoLogCategories; i++)
> + {
> + s->start_urec_ptr[i] = InvalidUndoRecPtr;
> + s->latest_urec_ptr[i] = InvalidUndoRecPtr;
> + s->undo_req_pushed[i] = false;
> + }
>
> Can't we just memset this memory?
>

Yeah, that sounds better, so changed.
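(For illustration only, a standalone sketch of the memset form; it assumes
InvalidUndoRecPtr is the all-zeroes value, otherwise memset would not be
equivalent. The struct and constants below are invented for the sketch.)

#include <string.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t UndoRecPtr;
#define InvalidUndoRecPtr ((UndoRecPtr) 0)	/* assumed: all-zero bytes */
#define UndoLogCategories 4					/* assumed count, sketch only */

typedef struct TransactionUndoState
{
	UndoRecPtr	start_urec_ptr[UndoLogCategories];
	UndoRecPtr	latest_urec_ptr[UndoLogCategories];
	bool		undo_req_pushed[UndoLogCategories];
} TransactionUndoState;

static void
reset_undo_locations(TransactionUndoState *s)
{
	/* valid only because InvalidUndoRecPtr and false are all-zero bytes */
	memset(s->start_urec_ptr, 0, sizeof(s->start_urec_ptr));
	memset(s->latest_urec_ptr, 0, sizeof(s->latest_urec_ptr));
	memset(s->undo_req_pushed, 0, sizeof(s->undo_req_pushed));
}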

>
> + * We can't postpone applying undo actions for subtransactions as the
> + * modifications made by aborted subtransaction must not be visible even if
> + * the main transaction commits.
> + */
> + if (IsSubTransaction())
> + return;
>
> I am not completely sure but is it possible that the outer function
> CommitTransactionCommand/AbortCurrentTransaction can avoid
> calling this function in the switch case based on the current state,
> so that under subtransaction this will never be called?
>

I have already explained as a separate response to this email why I
don't think this is a very good idea.

>
> + /*
> + * Prepare required undo request info so that it can be used in
> + * exception.
> + */
> + ResetUndoRequestInfo();
> + urinfo.dbid = dbid;
> + urinfo.full_xid = fxid;
> + urinfo.start_urec_ptr = start_urec_ptr[per_level];
> +
>
> I see that we are preparing urinfo before execute_undo_actions so that
> in case of an error in CATCH we can use that to
> insert into the queue, but can we just initialize urinfo right there
> before inserting into the queue, we have all the information
> Am I missing something?
>

IIRC, the only idea was that we can use the same variable
(urinfo.full_xid) in execute_undo_actions call and in the catch block,
but I think your suggestion sounds better as we can avoid declaring
urinfo as volatile in that case.

> +
> + /*
> + * We need the locations of the start and end undo record pointers when
> + * rollbacks are to be performed for prepared transactions using undo-based
> + * relations.  We need to store this information in the file as the user
> + * might rollback the prepared transaction after recovery and for that we
> + * need it's start and end undo locations.
> + */
> + UndoRecPtr start_urec_ptr[UndoLogCategories];
> + UndoRecPtr end_urec_ptr[UndoLogCategories];
>
> it's -> its
>
>
..
>
> We must have some comments to explain how performUndoActions is used,
> where it's set.  If it's explained somewhere else then we can
> give reference to that code.
>
> + for (i = 0; i < UndoLogCategories; i++)
> + {
> + if (s->latest_urec_ptr[i])
> + {
> + s->performUndoActions = true;
> + break;
> + }
> + }
>
> I think we should chek UndoRecPtrIsValid(s->latest_urec_ptr[i])
>

Changed as per suggestion.

> + PG_TRY();
> + {
> + /*
> + * Prepare required undo request info so that it can be used in
> + * exception.
> + */
> + ResetUndoRequestInfo();
> + urinfo.dbid = dbid;
> + urinfo.full_xid = fxid;
> + urinfo.start_urec_ptr = start_urec_ptr[per_level];
> +
> + /* for subtransactions, we do partial rollback. */
> + execute_undo_actions(urinfo.full_xid,
> + end_urec_ptr[per_level],
> + start_urec_ptr[per_level],
> + !isSubTrans);
> + }
> + PG_CATCH();
>
> Wouldn't it be good to explain in comments that we are not rethrowing
> the error in PG_CATCH but because we don't want the main
> transaction to get an error if there is an error while applying to
> undo action for the main transaction and we will abort the transaction
> in the caller of this function?
>

I have added a comment atop of the function containing this code.

> +tables are only accessible in the backend that has created them.  We can't
> +postpone applying undo actions for subtransactions as the modifications
> +made by aborted subtransaction must not be visible even if the main 
> transaction
> +commits.
>
> I think we need to give detail reasoning why subtransaction changes
> will be visible if we don't apply it's undo and the main
> the transaction commits by mentioning that we don't use separate
> transaction id for the subtransaction and that will make all the
> changes of the transaction id visible when it commits.
>

I have added a detailed explanation in execute_undo_actions() and
given a reference of same here.

The changes are present in the patch series just posted by me [1].

[1] - 
https://www.postgresql.org/message-id/CAA4eK1KKAFBCJuPnFtgdc89djv4xO%3DZkAdXvKQinqN4hWiRbvA%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: proposal - patch: psql - sort_by_size

2019-07-31 Thread Jeremy Finzel
On Fri, Jun 28, 2019 at 10:13 AM Pavel Stehule 
wrote:

> Hi
>
> I returned to possibility to sort output of \d* and \l by size. There was
> more a experiments in this area, but without success. Last patch was
> example of over engineering, and now, I try to implement this feature
> simply how it is possible. I don't think so we need too complex solution -
> if somebody needs specific report, then it is not hard to run psql with
> "-E" option, get and modify used query (and use a power of SQL). But
> displaying databases objects sorted by size is very common case.
>
> This proposal is based on new psql variable "SORT_BY_SIZE". This variable
> will be off by default. The value of this variable is used only in verbose
> mode (when the size is displayed - I don't see any benefit sort of size
> without showing size). Usage is very simple and implementation too:
>
> \dt -- sorted by schema, name
> \dt+ -- still sorted  by schema, name
>
> \set SORT_BY_SIZE on
> \dt -- sorted by schema, name (size is not calculated and is not visible)
> \dt+ -- sorted by size
>
> \dt+ public.* -- sorted by size from schema public
>
> Comments, notes?
>
> Regards
>
> Pavel
>
>
One oddity about pg_relation_size and pg_table_size is that they can be
easily blocked by user activity.  In fact it happens to us often in
reporting environments and we have instead written different versions of
them that avoid the lock contention and still give "close enough" results.

This blocking could result in quite unexpected behavior: someone uses
your proposed command and it never returns.  Has that been considered, at
least as something to be documented?

Thanks,
Jeremy


Re: proposal - patch: psql - sort_by_size

2019-07-31 Thread Rafia Sabih
On Mon, 15 Jul 2019 at 06:12, Pavel Stehule  wrote:
>
> Hi
>
> pá 12. 7. 2019 v 15:10 odesílatel Fabien COELHO  napsal:
>>
>>
>> Hello Pavel,
>>
>> > rebased patch attached
>>
>> I prefer patches with a number rather than a date, if possible. For one
>> thing, there may be several updates in one day.
>>
>> About this version (20180708, probably v3): applies cleanly, compiles,
>> make check ok, doc build ok. No tests.
>
>
> attached version 4
>
>>
>> It works for me on a few manual tests against a 11.4 server.
>>
>> Documentation: if you say "\d*+", then it already applies to \db+ and
>> \dP+, so why listing them? Otherwise, state all commands or make it work
>> on all commands that have a size?
>>
>> About the text:
>>- remove , before "sorts"
>>- ... outputs by decreasing size, when size is displayed.
>>- add: When size is not displayed, the output is sorted by names.
>
>
> fixed
>
>>
>> I still think that the object name should be kept as a secondary sort
>> criterion, in case of size equality, so that the output is deterministic.
>> Having plenty of objects of the same size out of alphabetical order looks
>> very strange.
>
>
> fixed
>
> Regards
>
> Pavel
>>
>>
>> I still do not like much the boolean approach. I understand that the name
>> approach has been rejected, and I can understand why.
>>
>> I've been thinking about another more generic interface, that I'm putting
>> here for discussion, I do not claim that it is a good idea. Probably could
>> fall under "over engineering", but it might not be much harder to
>> implement, and it solves a few potential problems.
>>
>> The idea is to add an option to \d commands, such as "\echo -n":
>>
>>\dt+ [-o 1d,2a] ...
>>
>> meaning do the \dt+, order by column 1 descending, column 2 ascending.
>> With this there would be no need for a special variable nor other
>> extensions to specify some ordering, whatever the user wishes.
>>
>> Maybe it could be "\dt+ [-o '1 DESC, 2 ASC'] ..." so that the string
>> is roughly used as an ORDER BY specification by the query, but it would be
>> longer to specify.
>>
>> It also solves the issue that if someone wants another sorting order we
>> would end with competing boolean variables such as SORT_BY_SIZE,
>> SORT_BY_TYPE, SORT_BY_SCHEMA, which would be pretty unpractical. The
>> boolean approach works for *one* sorting extension and breaks at the next
>> extension.
>>
>> Also, the boolean does not say that it is a descending order. I could be
>> interested in looking at the small tables.
>>
>> Another benefit for me is that I do not like much variables with side
>> effects, whereas with an explicit syntax there would be no such thing, the
>> user has what was asked for. Ok, psql is full of them, but I cannot say I
>> like it for that.
>>
>> The approach could be extended to specify a limit, eg \dt -l 10 would
>> add a LIMIT 10 on the query.
>>
>> Also, the implementation could be high enough so that the description
>> handlers would not have to deal with it individually, it could return
>> the query which would then be completed with SORT/LIMIT clauses before
>> being executed, possibly with a default order if none is specified.

I had a look at this patch; it seems like a useful thing to have.
One clarification though: what is the reason for checking different
server versions in describeTablespaces and listAllDbs? Specifically,

if (verbose && pset.sversion >= 90200)
+ {
  appendPQExpBuffer(&buf,
",\n  pg_catalog.pg_size_pretty(pg_catalog.pg_tablespace_size(oid))
AS \"%s\"",
gettext_noop("Size"));
+ sizefunc = "pg_catalog.pg_tablespace_size(oid)";
+ }
in describeTablespaces but
if (verbose && pset.sversion >= 80200)
+ {
  appendPQExpBuffer(&buf,
",\n   CASE WHEN pg_catalog.has_database_privilege(d.datname,
'CONNECT')\n"
"THEN
pg_catalog.pg_size_pretty(pg_catalog.pg_database_size(d.datname))\n"
"ELSE 'No Access'\n"
"   END as \"%s\"",
gettext_noop("Size"));
+ sizefunc = "pg_catalog.pg_database_size(d.datname)";
+ }
in listAllDbs.


-- 
Regards,
Rafia Sabih




Re: Problem with default partition pruning

2019-07-31 Thread Alvaro Herrera
On 2019-Jul-31, Amit Langote wrote:

> Hello,
> 
> I noticed that the patch is still marked as "Waiting on Author" ever
> since Shawn set it that way on June 17.  Since Hosoya-san posted
> updated patches on June 27, the status should've been changed to
> "Needs Review".  Or maybe "Ready for Committer", because the last time
> I looked, at least the default partition pruning issue seems to be
> sufficiently taken care of by the latest patch.  Whether or not we
> should apply the other patch (more aggressive use of constraint
> exclusion by partprune.c on partitioned partitions), I'm not sure, but
> maybe a committer can decide in an instant. :)

Thanks for the status update.  I intend to get this patch pushed before
the next set of minors.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: partition routing layering in nodeModifyTable.c

2019-07-31 Thread Etsuro Fujita
On Wed, Jul 31, 2019 at 5:05 PM Amit Langote  wrote:
> On Tue, Jul 30, 2019 at 4:20 PM Amit Langote  wrote:
> > On Sat, Jul 20, 2019 at 1:52 AM Andres Freund  wrote:
> > > IOW, we should just change the direct modify calls to get the relevant
> > > ResultRelationInfo or something in that vein (perhaps just the relevant
> > > RT index?).
> >
> > It seems easy to make one of the two functions that constitute the
> > direct modify API, IterateDirectModify(), access the result relation
> > from ForeignScanState by saving either the result relation RT index or
> > ResultRelInfo pointer itself into the ForeignScanState's FDW-private
> > area.  For example, for postgres_fdw, one would simply add a new
> > member to PgFdwDirectModifyState struct.
> >
> > Doing that for the other function BeginDirectModify() seems a bit more
> > involved.  We could add a new field to ForeignScan, say
> > resultRelation, that's set by either PlanDirectModify() (the FDW code)
> > or make_modifytable() (the core code) if the ForeignScan node contains
> > the command for direct modification.  BeginDirectModify() can then use
> > that value instead of relying on es_result_relation_info being set.
> >
> > Thoughts?  Fujita-san, do you have any opinion on whether that would
> > be a good idea?

I'm still not sure that it's a good idea to remove
es_result_relation_info, but if I had to say then I think the latter
would probably be better.  I'm planning to rework direct
modification to base it on upper planner pathification so we can
perform direct modification without the ModifyTable node.  (I'm not
sure we can really do this for inherited UPDATE/DELETE, though.)  For
that rewrite, I'm thinking to call BeginDirectModify() from the
ForeignScan node (ie, ExecInitForeignScan()) as-is.  The latter
approach would allow that without any changes and avoid changing that
API many times.  That's the reason why I think the latter would
probably be better.

> I looked into trying to do the things I mentioned above and it seems
> to me that revising BeginDirectModify()'s API to receive the
> ResultRelInfo directly as Andres suggested might be the best way
> forward.  I've implemented that in the attached 0001.  Patches that
> were previously 0001, 0002, and 0003 are now 0002, 003, and 0004,
> respectively.  0002 is now a patch to "remove"
> es_result_relation_info.

Sorry for speaking this late.

Best regards,
Etsuro Fujita




Re: concerns around pg_lsn

2019-07-31 Thread Jeevan Ladhe
On Tue, Jul 30, 2019 at 6:06 PM Robert Haas  wrote:

> On Tue, Jul 30, 2019 at 4:52 AM Jeevan Ladhe
>  wrote:
> > My only concern was something that we internally treat as invalid, why do
> > we allow, that as a valid value for that type. While I am not trying to
> > reinvent the wheel here, I am trying to understand if there had been any
> > idea behind this and I am missing it.
>
> Well, the word "invalid" can mean more than one thing.  Something can
> be valid or invalid depending on context.  I can't have -2 dollars in
> my wallet, but I could have -2 dollars in my bank account, because the
> bank will allow me to pay out slightly more money than I actually have
> on the idea that I will pay them back later (and with interest!).  So
> as an amount of money in my wallet, -2 is invalid, but as an amount of
> money in my bank account, it is valid.
>
> 0/0 is not a valid LSN in the sense that (in current releases) we
> never write a WAL record there, but it's OK to compute with it.
> Subtracting '0/0'::pg_lsn seems useful as a way to convert an LSN to
> an absolute byte-index in the WAL stream.
>

Thanks Robert for such a nice and detailed explanation.
I now understand why LSN '0/0' can still be useful.

Regards,
Jeevan Ladhe


Re: how to run encoding-dependent tests by default

2019-07-31 Thread Peter Eisentraut
On 2019-07-29 16:47, Tom Lane wrote:
>> (The two tests create the same schema name, so they cannot be run in
>> parallel.  I opted against changing that here, since it would blow up
>> the patch and increase the diff between the two tests.)
> 
> This does create one tiny nit, which is that the order of the
> parallel and serial schedule files don't match.  Possibly I'm
> overly anal-retentive about that, but I think it's confusing
> when they don't.

Right.  Committed with adjustment to keep these consistent.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: idea: log_statement_sample_rate - bottom limit for sampling

2019-07-31 Thread Adrien Nayrat
On 7/28/19 12:19 AM, Tomas Vondra wrote:
> Hi,
> 
> I've started reviewing this patch, thinking that maybe I could get it
> committed as it's marked as RFC. In general I agree with having this
> fuature, but I think we need to rethink the GUC because the current
> approach is just confusing.
> 
> The way the current patch works is that we have three GUCs:
> 
>  log_min_duration_statement
>  log_statement_sample_limit
>  log_statement_sample_rate
> 
> and it essentially works like this:
> 
> - If the duration exceeds log_min_duration_statement, we start sampling
>  the commands with log_statement_sample rate.
> 
> - If the duration exceeds log_statement_sample_limit, we just log the
>  command every time (i.e. we disable sampling, using sample rate 1.0).
> 
> IMO that's bound to be confusing for users, because one threshold
> behaves as minimum while the other behaves as maximum.

I agree, it took me a while to understand how it behaves with the three GUCs. That's
why I tried to enrich the documentation, but it may mean that the functionality is
not properly implemented.

> 
> 
> What I think we should do instead is to use two minimum thresholds.
> 
> 1) log_min_duration_sample - enables sampling of commands, using the
> existing GUC log_statement_sample_rate
> 
> 2) log_min_duration_statement - logs all commands exceeding this
> 
> 
> I think this is going to be much easier for users to understand.

+1, I like this idea.

I don't really have an opinion on whether we have to revert the whole feature or try to
fix it for v12. But it is true that it is late to fix it.

Furthermore, users who really want this feature in v12 can use an extension for
that purpose [1].

1: I made this extension with same kind of features :
https://github.com/anayrat/pg_sampletolog





Re: Implementing Incremental View Maintenance

2019-07-31 Thread Yugo Nagata
Hi,

Attached is the latest patch for supporting min and max aggregate functions.

When new tuples are inserted into base tables, if the new values are smaller
(for min) or larger (for max) than the current ones, the matview just has to be
updated with these new values. Otherwise, the old values simply remain.

However, the case of deletion is more complicated. If a deleted value
exists in the matview as the current min or max, we have to recompute
the new min or max value from the base tables for the affected groups, and the
matview should be updated with these recomputed values.

Also, regression tests for min/max are also added.

In addition, the incremental update algorithm for avg aggregate values is a bit
improved. If an avg result in a materialized view is updated incrementally
by using the old avg value, numerical errors in the avg values accumulate
and the values eventually become wrong. To prevent this, both the sum and count
values are kept in the view as hidden columns and are used to calculate the
new avg value instead of using the old avg value.
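To illustrate the idea, here is a standalone sketch (not code from the
patch) of keeping sum and count as the maintained state and deriving avg
from them on demand:

#include <stdio.h>

/*
 * Sketch of maintaining avg via hidden sum/count columns.  Applying the
 * insert/delete deltas to sum and count and deriving avg on demand avoids
 * feeding an already-rounded avg back into the next incremental update.
 */
typedef struct AvgState
{
	double		sum;			/* hidden column */
	long		count;			/* hidden column */
} AvgState;

static void
apply_delta(AvgState *st, double delta_sum, long delta_count)
{
	/* the same path handles inserts (+) and deletes (-) for the group */
	st->sum += delta_sum;
	st->count += delta_count;
}

static double
current_avg(const AvgState *st)
{
	return st->count > 0 ? st->sum / st->count : 0.0;
}

int
main(void)
{
	AvgState	st = {0.0, 0};

	apply_delta(&st, 10.0 + 20.0, 2);	/* INSERT 10, 20 */
	apply_delta(&st, -10.0, -1);		/* DELETE 10 */
	printf("avg = %g\n", current_avg(&st));		/* prints 20 */
	return 0;
}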

Regards,

On Fri, 26 Jul 2019 11:35:53 +0900
Yugo Nagata  wrote:

> Hi,
> 
> I've updated the wiki page of Incremental View Maintenance.
> 
> https://wiki.postgresql.org/wiki/Incremental_View_Maintenance
> 
> On Thu, 11 Jul 2019 13:28:04 +0900
> Yugo Nagata  wrote:
> 
> > Hi Thomas,
> > 
> > Thank you for your review and discussion on this patch!
> > 
> > > > 2019年7月8日(月) 15:32 Thomas Munro :
> > > > 
> > > > > On Fri, Jun 28, 2019 at 10:56 PM Yugo Nagata  
> > > > > wrote:
> > > > > > Attached is a WIP patch of IVM which supports some aggregate 
> > > > > > functions.
> > > > >
> > > > > Hi Nagata-san and Hoshiai-san,
> > > > >
> > > > > Thank you for working on this.  I enjoyed your talk at PGCon.  I've
> > > > > added Kevin Grittner just in case he missed this thread; he has talked
> > > > > often about implementing the counting algorithm, and he wrote the
> > > > > "trigger transition tables" feature to support exactly this.  While
> > > > > integrating trigger transition tables with the new partition features,
> > > > > we had to make a number of decisions about how that should work, and
> > > > > we tried to come up with answers that would work for IMV, and I hope
> > > > > we made the right choices!
> > 
> > Transition tables is a great feature. I am now using this in my 
> > implementation
> > of IVM, but the first time I used this feature was when I implemented a PoC
> > for extending view updatability of PostgreSQL[1]. At that time, I didn't 
> > know
> > that this feature is made originally aiming to support IVM. 
> > 
> > [1] https://www.pgcon.org/2017/schedule/events/1074.en.html
> > 
> > I think transition tables is a good choice to implement a statement level
> > immediate view maintenance where materialized views are refreshed in a 
> > statement
> > level after trigger. However, when implementing a transaction level 
> > immediate
> > view maintenance where views are refreshed per transaction, or deferred view
> > maintenance, we can't update views in a after trigger, and we will need an
> > infrastructure to manage change logs of base tables. Transition tables can 
> > be
> > used to collect these logs, but using logical decoding of WAL is another 
> > candidate.
> > In any way, if these logs can be collected in a tuplestore, we might able to
> > convert this to "ephemeral named relation (ENR)" and use this to calculate 
> > diff
> > sets for views.
> > 
> > > > >
> > > > > I am quite interested to learn how IVM interacts with SERIALIZABLE.
> > > > >
> > > > 
> > > >  Nagata-san has been studying this. Nagata-san, any comment?
> > 
> > In SERIALIZABLE or REPEATABLE READ level, table changes occurred in other 
> > ransactions are not visible, so views can not be maintained correctly in 
> > AFTER
> > triggers. If a view is defined on two tables and each table is modified in
> > different concurrent transactions respectively, the result of view 
> > maintenance
> > done in trigger functions can be incorrect due to the race condition. This 
> > is the
> > reason why such transactions are aborted immediately in that case in my 
> > current
> > implementation.
> > 
> > One idea to resolve this is performing view maintenance in two phases. 
> > Firstly, 
> > views are updated using only table changes visible in this transaction. 
> > Then, 
> > just after this transaction is committed, views have to be updated 
> > additionally 
> > using changes happened in other transactions to keep consistency. This is a 
> > just 
> > idea, but to implement this idea, I think, we will need to keep and 
> > maintain change logs.
> > 
> > > > > A couple of superficial review comments:
> > 
> > 
> >  
> > > > > +const char *aggname = get_func_name(aggref->aggfnoid);
> > > > > ...
> > > > > +else if (!strcmp(aggname, "sum"))
> > > > >
> > > > > I guess you need a more robust way to detect the supported aggregates
> > > > > than their name, or I guess some way for aggregates 

Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

2019-07-31 Thread Evgeny Efimkin
What is the reason to use pg_atomic_uint64?
In the docs:
occured -> occurred

Re: Built-in connection pooler

2019-07-31 Thread Konstantin Knizhnik



On 26.07.2019 19:20, Ryan Lambert wrote:


> PgPRO EE version of connection pooler has "idle_pool_worker_timeout"
> parameter which allows to terminate idle workers.

+1



I have implemented idle_pool_worker_timeout.
I have also added _idle_clients and n_idle_backends fields to the proxy statistics 
returned by the pg_pooler_state() function.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 84341a3..acd7041 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -719,6 +719,137 @@ include_dir 'conf.d'
   
  
 
+ 
+  max_sessions (integer)
+  
+   max_sessions configuration parameter
+  
+  
+  
+
+  The maximum number of client sessions that can be handled by
+  one connection proxy when session pooling is enabled.
+  This parameter does not add any memory or CPU overhead, so
+  specifying a large max_sessions value
+  does not affect performance.
+  If the max_sessions limit is reached new connections are not accepted.
+
+
+  The default value is 1000. This parameter can only be set at server start.
+   
+  
+ 
+
+ 
+  session_pool_size (integer)
+  
+   session_pool_size configuration parameter
+  
+  
+  
+
+  Enables session pooling and defines the maximum number of
+  backends that can be used by client sessions for each database/user combination.
+  Launched non-tainted backends are never terminated even if there are no active sessions.
+  A backend is considered tainted if the client updates GUCs, creates a temporary table or prepared statements.
+  A tainted backend can serve only one client.
+
+
+  The default value is 10, so up to 10 backends will serve each database,
+
+  
+ 
+
+ 
+  proxy_port (integer)
+  
+   proxy_port configuration parameter
+  
+  
+  
+
+  Sets the TCP port for the connection pooler.
+  Clients connected to main "port" will be assigned dedicated backends,
+  while client connected to proxy port will be connected to backends through proxy which
+  performs transaction level scheduling. 
+   
+
+  The default value is 6543.
+
+  
+ 
+
+ 
+  connection_proxies (integer)
+  
+   connection_proxies configuration parameter
+  
+  
+  
+
+  Sets number of connection proxies.
+  Postmaster spawns separate worker process for each proxy. Postmaster scatters connections between proxies using one of scheduling policies (round-robin, random, load-balancing).
+  Each proxy launches its own subset of backends.
+  So maximal number of non-tainted backends is  session_pool_size*connection_proxies*databases*roles.
+   
+
+  The default value is 0, so session pooling is disabled.
+
+  
+ 
+
+ 
+  session_schedule (enum)
+  
+   session_schedule configuration parameter
+  
+  
+  
+
+  Specifies scheduling policy for assigning session to proxies in case of
+  connection pooling. Default policy is round-robin.
+
+
+  With the round-robin policy the postmaster cyclically scatters sessions between proxies.
+
+
+  With random policy postmaster randomly choose proxy for new session.
+
+
+  With load-balancing policy postmaster choose proxy with lowest load average.
+  Load average of proxy is estimated by number of clients connection assigned to this proxy with extra weight for SSL connections.
+   
+  
+ 
+
+ 
+  idle_pool_worker_timeout (integer)
+  
+   idle_pool_worker_timeout configuration parameter
+  
+  
+  
+
+ Terminate an idle connection pool worker after the specified number of milliseconds.
+ The default value is 0, so pool workers are never terminated.
+   
+  
+ 
+
+ 
+  restart_pooler_on_reload (string)
+  
+   restart_pooler_on_reload configuration parameter
+  
+  
+  
+
+  Restart session pool workers once pg_reload_conf() is called.
+  The default value is false.
+   
+  
+ 
+
  
   unix_socket_directories (string)
   
diff --git a/doc/src/sgml/connpool.sgml b/doc/src/sgml/connpool.sgml
new file mode 100644
index 000..bc6547b
--- /dev/null
+++ b/doc/src/sgml/connpool.sgml
@@ -0,0 +1,174 @@
+
+
+ 
+  Connection pooling
+
+  
+   built-in connection pool proxy
+  
+
+  
+PostgreSQL spawns a separate process (backend) for each client.
+For large number of clients this model can consume a large number of system
+resources and 

Re: Problem with default partition pruning

2019-07-31 Thread Amit Langote
Hello,

I noticed that the patch is still marked as "Waiting on Author" ever
since Shawn set it that way on June 17.  Since Hosoya-san posted
updated patches on June 27, the status should've been changed to
"Needs Review".  Or maybe "Ready for Committer", because the last time
I looked, at least the default partition pruning issue seems to be
sufficiently taken care of by the latest patch.  Whether or not we
should apply the other patch (more aggressive use of constraint
exclusion by partprune.c on partitioned partitions), I'm not sure, but
maybe a committer can decide in an instant. :)

I've marked it RfC for now.

Thanks,
Amit




Re: Parallel grouping sets

2019-07-31 Thread Richard Guo
On Tue, Jul 30, 2019 at 11:05 PM Tomas Vondra 
wrote:

> On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:
> >On Wed, Jun 12, 2019 at 10:58 AM Richard Guo  wrote:
> >
> >> Hi all,
> >>
> >> Paul and I have been hacking recently to implement parallel grouping
> >> sets, and here we have two implementations.
> >>
> >> Implementation 1
> >> 
> >>
> >> Attached is the patch and also there is a github branch [1] for this
> >> work.
> >>
> >
> >Rebased with the latest master.
> >
>
> Hi Richard,
>
> thanks for the rebased patch. I think the patch is mostly fine (at least I
> don't see any serious issues). A couple minor comments:
>

Hi Tomas,

Thank you for reviewing this patch.


>
> 1) I think get_number_of_groups() would deserve a short explanation why
> it's OK to handle (non-partial) grouping sets and regular GROUP BY in the
> same branch. Before these cases were clearly separated, now it seems a bit
> mixed up and it may not be immediately obvious why it's OK.
>

Added a short comment in get_number_of_groups() explaining the behavior
when doing partial aggregation for grouping sets.


>
> 2) There are new regression tests, but they are not added to any schedule
> (parallel or serial), and so are not executed as part of "make check". I
> suppose this is a mistake.
>

Yes, thanks. Added the new regression test in parallel_schedule and
serial_schedule.


>
> 3) The regression tests do check plan and results like this:
>
> EXPLAIN (COSTS OFF, VERBOSE) SELECT ...;
> SELECT ... ORDER BY 1, 2, 3;
>
> which however means that the query might easily use a different plan than
> what's verified in the eplain (thanks to the additional ORDER BY clause).
> So I think this should explain and execute the same query.
>
> (In this case the plans seems to be the same, but that may easily change
> in the future, and we could miss it here, failing to verify the results.)
>

Thank you for pointing this out. Fixed it in V4 patch.


>
> 4) It might be a good idea to check the negative case too, i.e. a query on
> data set that we should not parallelize (because the number of partial
> groups would be too high).
>

Yes, agree. Added a negative case.


>
>
> Do you have any plans to hack on the second approach too? AFAICS those two
> approaches are complementary (address different data sets / queries), and
> it would be nice to have both. One of the things I've been wondering is if
> we need to invent gset_id as a new concept, or if we could simply use the
> existing GROUPING() function - that uniquely identifies the grouping set.
>
>
Yes, I'm planning to hack on the second approach in short future. I'm
also reconsidering the gset_id stuff since it brings a lot of complexity
for the second approach.  I agree with you that we can try GROUPING()
function to see if it can replace gset_id.

Thanks
Richard


v4-0001-Implementing-parallel-grouping-sets.patch
Description: Binary data


Re: partition routing layering in nodeModifyTable.c

2019-07-31 Thread Amit Langote
On Tue, Jul 30, 2019 at 4:20 PM Amit Langote  wrote:
> On Sat, Jul 20, 2019 at 1:52 AM Andres Freund  wrote:
> > > The first one (0001) deals with reducing the core executor's reliance
> > > on es_result_relation_info to access the currently active result
> > > relation, in favor of receiving it from the caller as a function
> > > argument.  So no piece of core code relies on it being correctly set
> > > anymore.  It still needs to be set correctly for the third-party code
> > > such as FDWs.
> >
> > I'm inclined to just remove it. There's not much code out there relying
> > on it, as far as I can tell. Most FDWs don't support the direct modify
> > API, and that's afaict the case where we one needs to use
> > es_result_relation_info?
>
> Right, only the directly modify API uses it.
>
> > In fact, I searched through alllFDWs listed on 
> > https://wiki.postgresql.org/wiki/Foreign_data_wrappers
> > that are on github and in first few categories (up and including to
> > "file wrappers"), and there was only one reference to
> > es_result_relation_info, and that just in comments in a test:
> > https://github.com/pgspider/griddb_fdw/search?utf8=%E2%9C%93=es_result_relation_info=
> > which I think was just copied from our source code.
> >
> > IOW, we should just change the direct modify calls to get the relevant
> > ResultRelationInfo or something in that vein (perhaps just the relevant
> > RT index?).
>
> It seems easy to make one of the two functions that constitute the
> direct modify API, IterateDirectModify(), access the result relation
> from ForeignScanState by saving either the result relation RT index or
> ResultRelInfo pointer itself into the ForeignScanState's FDW-private
> area.  For example, for postgres_fdw, one would simply add a new
> member to PgFdwDirectModifyState struct.
>
> Doing that for the other function BeginDirectModify() seems a bit more
> involved.  We could add a new field to ForeignScan, say
> resultRelation, that's set by either PlanDirectModify() (the FDW code)
> or make_modifytable() (the core code) if the ForeignScan node contains
> the command for direct modification.  BeginDirectModify() can then use
> that value instead of relying on es_result_relation_info being set.
>
> Thoughts?  Fujita-san, do you have any opinion on whether that would
> be a good idea?

I looked into trying to do the things I mentioned above and it seems
to me that revising BeginDirectModify()'s API to receive the
ResultRelInfo directly as Andres suggested might be the best way
forward.  I've implemented that in the attached 0001.  Patches that
were previously 0001, 0002, and 0003 are now 0002, 003, and 0004,
respectively.  0002 is now a patch to "remove"
es_result_relation_info.

Thanks,
Amit


v3-0001-Revise-BeginDirectModify-API-to-pass-ResultRelInf.patch
Description: Binary data


v3-0002-Remove-es_result_relation_info.patch
Description: Binary data


v3-0004-Refactor-transition-tuple-capture-code-a-bit.patch
Description: Binary data


v3-0003-Rearrange-partition-update-row-movement-code-a-bi.patch
Description: Binary data


Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-31 Thread Masahiko Sawada
On Wed, Jul 31, 2019 at 3:29 PM Masahiko Sawada  wrote:
>
>
> For WAL encryption,  before flushing WAL we encrypt whole 8k WAL page
> and then write only the encrypted data of the new WAL record using
> pg_pwrite() rather than write whole encrypted page. So each time we
> encrypt 8k WAL page we end up with encrypting different data with the
> same key+nonce but since we don't write to the disk other than space
> where we actually wrote WAL records it's not a problem. Is that right?

Hmm, that's incorrect. We always write an entire 8k WAL page even if we
write only a few WAL records into a page. It's bad because we encrypt
different data with the same key+IV, but we cannot change the IV for each
WAL write as we would end up also changing
already-flushed WAL records. So we might need to change the WAL write code
so that it writes only the WAL records we actually wrote.


Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: [HACKERS] Restricting maximum keep segments by repslots

2019-07-31 Thread Kyotaro Horiguchi
At Tue, 30 Jul 2019 21:30:45 +0900 (Tokyo Standard Time), Kyotaro Horiguchi 
 wrote in 
<20190730.213045.221405075.horikyota@gmail.com>
> I attach the revised patch. I'll repost the polished version
> sooner.

This is the revised patch.

- The status criteria have been changed.

  "streaming" : restart_lsn is within max_wal_size. (and kept)

  "keeping" : restart_lsn is behind max_wal_size but still kept
   by max_slot_wal_keep_size or wal_keep_segments.

  "losing" : The segment for restart_lsn is being lost or has
   been lost, but active walsender (or session) using the
   slot is still running. If the walsender caught up before
   stopped, the state will transfer to "keeping" or
   "streaming" again.

  "lost" : The segment for restart_lsn has been lost and the
   active session on the slot is gone. The standby cannot
   continue replication using this slot.

  null : restart_lsn is null (never activated).

- remain is null if restart_lsn is null (never activated) or
  wal_status is "losing" or "lost".

- catalogs.sgml is updated.

- Refactored GetLsnAvailability and GetOldestKeepSegment and
  pg_get_replication_slots.

- The TAP test is fixed. But the test for the "losing" state cannot be done
  since it needs an interactive session. (I think using the isolation
  tester is too much.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
>From a7f2efdcf64eae2cd4fd707981658f29090d36ee Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi 
Date: Thu, 21 Dec 2017 21:20:20 +0900
Subject: [PATCH 1/6] Add WAL relief vent for replication slots

Replication slots are useful to maintain a replication connection in
configurations where replication is so delayed that the connection would
otherwise be broken. On the other hand, a long delay can let so many WAL
files accumulate that they fill up the disk and bring the master down. This
feature, which is activated by a GUC "max_slot_wal_keep_size", protects
master servers from suffering disk full by limiting the number of WAL files
reserved by replication slots.
---
 src/backend/access/transam/xlog.c | 128 +-
 src/backend/replication/slot.c|  62 +
 src/backend/utils/misc/guc.c  |  12 +++
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/xlog.h |   1 +
 src/include/replication/slot.h|   1 +
 6 files changed, 180 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f553523857..3989f6e54a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -104,6 +104,7 @@ int			wal_level = WAL_LEVEL_MINIMAL;
 int			CommitDelay = 0;	/* precommit delay in microseconds */
 int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
+int			max_slot_wal_keep_size_mb = -1;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -872,6 +873,7 @@ static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
 static void LocalSetXLogInsertAllowed(void);
 static void CreateEndOfRecoveryRecord(void);
 static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
+static XLogSegNo GetOldestKeepSegment(XLogRecPtr currpos, XLogRecPtr minSlotPtr);
 static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
 static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
 
@@ -9288,6 +9290,54 @@ CreateRestartPoint(int flags)
 	return true;
 }
 
+/*
+ * Returns minimum segment number that the next checkpoint must leave
+ * considering wal_keep_segments, replication slots and
+ * max_slot_wal_keep_size.
+ *
+ * currLSN is the current insert location.
+ * minSlotLSN is the minimum restart_lsn of all active slots.
+ */
+static XLogSegNo
+GetOldestKeepSegment(XLogRecPtr currLSN, XLogRecPtr minSlotLSN)
+{
+	XLogSegNo	currSeg;
+	XLogSegNo	minSlotSeg;
+	uint64		keepSegs = 0;	/* # of segments actually kept */
+
+	XLByteToSeg(currLSN, currSeg, wal_segment_size);
+	XLByteToSeg(minSlotLSN, minSlotSeg, wal_segment_size);
+
+	/*
+	 * Calculate how many segments are kept by slots first. The second
+	 * term of the condition is just a sanity check.
+	 */
+	if (minSlotLSN != InvalidXLogRecPtr && minSlotSeg <= currSeg)
+		keepSegs = currSeg - minSlotSeg;
+
+	/* Cap keepSegs by max_slot_wal_keep_size */
+	if (max_slot_wal_keep_size_mb >= 0)
+	{
+		uint64 limitSegs;
+
+		limitSegs = ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);
+
+		/* Reduce it if slots already reserves too many. */
+		if (limitSegs < keepSegs)
+			keepSegs = limitSegs;
+	}
+
+	/* but, keep at least wal_keep_segments segments if any */
+	if (wal_keep_segments > 0 && keepSegs < wal_keep_segments)
+		keepSegs = wal_keep_segments;
+
+	/* avoid underflow, don't go below 1 */
+	if (currSeg <= keepSegs)
+		return 1;
+
+	return currSeg - keepSegs;
+}
+
 /*
  * Retreat *logSegNo to the last segment that we need to retain because of
  * either wal_keep_segments or replication slots.
@@ 

Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-31 Thread Masahiko Sawada
On Tue, Jul 30, 2019 at 10:45 PM Sehrope Sarkuni  wrote:
>
> On Tue, Jul 30, 2019 at 8:16 AM Masahiko Sawada  wrote:
>>
>> On Mon, Jul 29, 2019 at 8:18 PM Sehrope Sarkuni  wrote:
>> >
>> > On Mon, Jul 29, 2019 at 6:42 AM Masahiko Sawada  
>> > wrote:
>> > > > An argument could be made to push that problem upstream, i.e. let the
>> > > > supplier of the passphrase deal with the indirection. You would still
>> > > > need to verify the supplied passphrase/key is correct via something
>> > > > like authenticating against a stored MAC.
>> > >
>> > > So do we need the key for MAC of passphrase/key in order to verify?
>> >
>> > Yes. Any 128 or 256-bit value is a valid AES key and any 16-byte input
>> > can be "decrypted" with it in both CTR and CBC mode, you'll just end
>> > up with garbage data if the key does not match. Verification of the
>> > key prior to usage (i.e. starting DB and encrypting/decrypting data)
>> > is a must as otherwise you'll end up with all kinds of corruption or
>> > data loss.
>> >
>>
>> Do you mean that we encrypt and store a 16 byte input with the correct
>> key to the disk, and then decrypt it with the user supplied key and
>> compare the result to the input data?
>
>
> Yes but we don't compare via decryption of a known input. We instead compute 
> a MAC of the encrypted master key using the user supplied key, and compare 
> that against an expected MAC stored alongside the encrypted master key.
>
> The pseudo code would be something like:
>
> // Read key text from user:
> string raw_kek = read_from_user()
> // Normalize it to a fixed size of 64-bytes
> byte[64] kek = SHA512(SHA512(raw_kek))
> // Split the 64-bytes into a separate encryption and MAC key
> byte[32] user_encryption_key = kek.slice(0,32)
> byte[32] user_mac_key = kek.slice(32, 64)
>
> // Read our saved MAC and encrypted master key
> byte[80] mac_iv_encrypted_master_key = read_from_file()
> // First 32-bytes is the MAC of the rest
> byte[32] expected_mac = mac_iv_encrypted_master_key.slice(0, 32)
> // Rest is a random IV + Encrypted master key
> byte[48] iv_encrypted_master_key = mac_iv_encrypted_master_key(32, 80)
>
> // Compute the MAC with the user supplied key
> byte[32] actual_mac = HMAC(user_mac_key, iv_encrypted_master_key)
> // If it does not match then the user key is invalid
> if (actual_mac != expected_mac) {
>   print_err_and_exit("Bad user key!")
> }
>
> // Our MAC was correct
> // ... so we know user supplied key is valid
> // ... and we know our iv and encrypted_key are valid
> byte[16] iv = iv_encrypted_master_key.slice(0,16)
> byte[32] encrypted_master_key = iv_encrypted_master_key.slice(16, 48)
> // ... so we can use all three to decrypt the master key (MDEK)
> byte[32] master_key = decrypt_aes_cbc(user_encryption_key, iv, 
> encrypted_master_key)
>
>
>> > From a single user supplied passphrase you would derive the MDEK and
>> > compute a MAC (either using the same key or via a separate derived
>> > MDEK-MAC key). If the computed MAC matches against the previously
>> > stored value then you know the MDEK is correct as well.
>>
>> You meant KEK, not MDEK?
>
>
> If the KEK is incorrect then the MAC validation would fail and the decrypt 
> would never be attempted.
>
> If the MAC matches then both the KEK (user supplied key) and MDEK 
> ("master_key" in the pseudo code above) would be confirmed to be valid. So 
> the MDEK is safe to use for deriving keys for encrypt / decrypt.
>
> I'm using the definitions for "KEK" and "MDEK" from Joe's mail 
> https://www.postgresql.org/message-id/c878de71-a0c3-96b2-3e11-9ac2c35357c3%40joeconway.com
>

I understand now. Thank you for the explanation!

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: minor fixes after pgindent prototype fixes

2019-07-31 Thread Andres Freund
On 2019-07-28 00:09:51 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > a few prototypes look odd. It appears to be cases where previously the
> > odd indentation was put to some use, by indenting parameters less:
> > ...
> > but now that looks odd:
> > extern void DefineCustomBoolVariable(
> >  const char *name,
> >  const char *short_desc,
> 
> > Unless somebody protests I'm going to remove the now pretty useless
> > looking newline in the cases I can find.
> 
> +1.  I think Alvaro was muttering something about doing this,
> but you beat him to it.

And pushed...

- Andres




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-31 Thread Masahiko Sawada
On Wed, Jul 31, 2019 at 5:48 AM Bruce Momjian  wrote:
>
> On Tue, Jul 30, 2019 at 10:14:14AM -0400, Sehrope Sarkuni wrote:
> > > In general it's fine to use the same IV with different keys. Only 
> > reuse
> > of Key
> > > + IV is a problem and the entire set of possible counter values (IV + 
> > 0,
> > IV +
> > > 1, ...) generated with a key must be unique. That's also why we must
> > leave at
> > > least log2(PAGE_SIZE / AES_BLOCK_SIZE) bits at the end of the IV to be
> > filled
> > > in with 0, 1, 2, ... for each 16-byte AES-block on the page. If our
> > per-page IV
> > > prefix used any of those bits then the counter could overflow into the
> > next
> > > page's IV's range.
> >
> > Agreed.
> >
> > Attached is an updated patch that checks only main relation forks, which
> > I think are the only files we are going to encrypt, and it has better
> > comments.
> >
> >
> > Okay that makes sense in the context of using a single key and relying
> > on the LSN based IV to be unique.
>
> I had more time to think about the complexity of adding relfilenode to
> the IV.  Since relfilenode is only unique within a database/tablespace,
> we would need to have pg_upgrade preserve database/tablespace oids
> (which I assume are the same as the directory and tablespace symlinks).
> Then, to decode a page, you would need to look up those values.  This is
> in addition to the new complexity of CREATE DATABASE and moving files
> between tablespaces.  I am also concerned that crash recovery operations
> and cluster forensics and repair would need to also deal with this.
>
> I am not even clear if pg_upgrade preserving relfilenode is possible ---
> when we wrap the relfilenode counter, does it start at 1 or at the
> first-user-relation-oid?  If the former, it could conflict with oids
> assigned to new system tables in later major releases.  Tying the
> preservation of relations to two restrictions seems risky.
>
> Using just the page LSN and page number allows a page to be
> decrypted/encrypted independently of its file name, tablespace, and
> database, and I think that is a win for simplicity.  Of course, if it is
> insecure we will not do it.
>
> I am thinking for the heap/index IV, it would be:
>
> uint64 lsn;
> uint32 page number;
> /* only uses 11 bits for a zero-based CTR counter for 32k pages */
> uint32 counter;
>

+1
IIUC, since this needs to ensure uniqueness via the key+IV combination, we
need to use different keys for different relations. Is that right?

> and for WAL it would be:
>
> uint64 segment_number;
> uint32 counter;
> /* guarantees this IV doesn't match any relation IV */
> uint32   2^32-1 /* all 1's */

I would propose including the page number within a WAL segment in the IV
so that we can encrypt each WAL page with the counter always starting
from 0. And if we use different encryption keys for tables/indexes and
WAL, I think we don't need the all-ones 2^32-1 field.
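
Concretely, the IV packing being discussed might look roughly like this
(just a sketch: the big-endian layout and all names are my assumption,
and the WAL variant assumes the separate WAL key suggested above, so it
has no all-ones marker):

#include <stdint.h>
#include <string.h>

/*
 * Build a 16-byte AES-CTR IV: 64-bit LSN, 32-bit page number, and the
 * remaining 4 bytes left at zero for the per-block counter (of which
 * only 11 bits are ever used for a 32kB page).
 */
static void
make_heap_iv(unsigned char iv[16], uint64_t page_lsn, uint32_t page_no)
{
    int         i;

    memset(iv, 0, 16);
    for (i = 0; i < 8; i++)         /* LSN, big-endian */
        iv[i] = (unsigned char) (page_lsn >> (56 - 8 * i));
    for (i = 0; i < 4; i++)         /* page number, big-endian */
        iv[8 + i] = (unsigned char) (page_no >> (24 - 8 * i));
    /* iv[12..15] stay zero; CTR mode increments them per 16-byte block */
}

/* WAL variant: segment number plus page number within the segment. */
static void
make_wal_iv(unsigned char iv[16], uint64_t segno, uint32_t page_in_seg)
{
    make_heap_iv(iv, segno, page_in_seg);
}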

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: Unused header file inclusion

2019-07-31 Thread Andres Freund
Hi,

On 2019-07-31 11:19:08 +0530, vignesh C wrote:
> I noticed that there are many header files being
> included which need not be included.
> I have tried this in a few files and found the
> compilation and regression to be working.
> I have attached the patch for the files that
> I tried.
> I tried this in CentOS, I did not find the header
> files to be platform specific.
> Should we pursue this further and cleanup in
> all the files?

These types of changes imo have a good chance of making things more
fragile. A lot of the includes in header files are purely due to needing
one or two definitions (which often could even be avoided by forward
declarations). If you remove all the includes directly from the c files
that are also included from some .h file, you increase the reliance on
the indirect includes - making it harder to clean up.

If anything, we should *increase* the number of includes, so we don't
rely on indirect includes.  But that's also not necessarily the right
choice, because it adds unnecessary dependencies.
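
To illustrate the forward-declaration point with a made-up example
(nothing here is from the patch):

/*
 * some_header.h: a pointer-only interface does not need the full
 * definition of struct Foo; a forward declaration is enough.
 */
struct Foo;                                 /* forward declaration */
extern int  foo_count(const struct Foo *f);
extern void foo_release(struct Foo *f);

Only the .c files that actually dereference struct Foo then need to
include the header that defines it.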


> --- a/src/backend/utils/mmgr/freepage.c
> +++ b/src/backend/utils/mmgr/freepage.c
> @@ -56,7 +56,6 @@
>  #include "miscadmin.h"
>  
>  #include "utils/freepage.h"
> -#include "utils/relptr.h"

I don't think it's a good idea to remove this header, for example. A
*lot* of code in freepage.c relies on it. The fact that freepage.h also
includes it here is just due to needing other parts of it.


>  /* Magic numbers to identify various page types */
> diff --git a/src/backend/utils/mmgr/portalmem.c 
> b/src/backend/utils/mmgr/portalmem.c
> index 334e35b..67268fd 100644
> --- a/src/backend/utils/mmgr/portalmem.c
> +++ b/src/backend/utils/mmgr/portalmem.c
> @@ -26,7 +26,6 @@
>  #include "utils/builtins.h"
>  #include "utils/memutils.h"
>  #include "utils/snapmgr.h"
> -#include "utils/timestamp.h"

Similarly, this file uses timestamp.h functions directly. The fact that
timestamp.h already is included is just due to implementation details of
xact.h that this file shouldn't depend on.

Greetings,

Andres Freund




Re: Unused header file inclusion

2019-07-31 Thread Michael Paquier
On Wed, Jul 31, 2019 at 11:55:37AM +0530, Amit Kapila wrote:
> If we can come up with some such tool, we might be able to integrate
> it with Thomas's patch tester [1] wherein it can apply the patch,
> verify if there are unnecessary includes in the patch and report the
> same.
> 
> [1] - http://commitfest.cputube.org/

Or even get something into src/tools/?  If the produced result is
clean enough, that could be interesting.
--
Michael




Re: Unused header file inclusion

2019-07-31 Thread vignesh C
On Wed, Jul 31, 2019 at 11:55 AM Amit Kapila  wrote:
>
> On Wed, Jul 31, 2019 at 11:31 AM vignesh C  wrote:
> >
> > On Wed, Jul 31, 2019 at 11:26 AM Michael Paquier  
> > wrote:
> > >
> > > On Wed, Jul 31, 2019 at 11:19:08AM +0530, vignesh C wrote:
> > > > I noticed that there are many header files being included which need
> > > > not be included.  I have tried this in a few files and found the
> > > > compilation and regression to be working.  I have attached the patch
> > > > for the files that  I tried.  I tried this in CentOS, I did not find
> > > > the header files to be platform specific.
> > > > Should we pursue this further and cleanup in all the files?
> > >
> > > Do you use a particular method here or just manual deduction after
> > > looking at each file individually?  If this can be cleaned up a bit, I
> > > think that's welcome.  The removal of headers is easily forgotten when
> > > moving code from one file to another...
> > >
> > Thanks Michael.
> > I'm writing some Perl scripts to identify this.
> > The scripts will scan through all the files, make the changes,
> > and verify them.
>
> If we can come up with some such tool, we might be able to integrate
> it with Thomas's patch tester [1] wherein it can apply the patch,
> verify if there are unnecessary includes in the patch and report the
> same.
>
> [1] - http://commitfest.cputube.org/
>
Thanks Amit.
I will post the tool along with the patch once I complete this activity. We
can enhance the tool further based on feedback and take it forward.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com




Re: [Proposal] Table-level Transparent Data Encryption (TDE) and Key Management Service (KMS)

2019-07-31 Thread Masahiko Sawada
On Mon, Jul 29, 2019 at 10:44 PM Bruce Momjian  wrote:
>
> On Mon, Jul 29, 2019 at 08:43:06PM +0900, Masahiko Sawada wrote:
> > > I am thinking of writing some Assert() code that checks that all buffers
> > > using a single LSN are from the same relation (and therefore different
> > > page numbers).  I would do it by creating a static array, clearing it on
> > > XLogBeginInsert(), adding to it for each  XLogInsert(), then checking on
> > > PageSetLSN() that everything in the array is from the same file.  Does
> > > that make sense?
> >
> > I had the same concern before. We could have BKPBLOCK_SAME_REL flag in
> > XLogRecordBlockHeader, which indicates that the relation of the block
> > is the same as the previous block and therefore we skip to write
> > RelFileNode. At first glance I thought it's possible that one WAL
> > record can contain different RelFileNodes but I didn't find any code
> > attempting to do that.
>
> Yes, the point is that the WAL record makes it possible, so we either
> have to test for it or allow it.
>
> > Checking that all buffers using a single LSN are from the same
> > relation would be a good idea but I think it's hard to test it and
> > regard the test result as okay. Even if we passed 'make check-world',
> > it might still be possible to happen. And even assertion failures
>
> Yes, the problem is that if you embed the relfilenode or tablespace or
> database in the encryption IV, you then need to then make sure you
> re-encrypt any files that move between these.  I am hesitant to do that
> since it then requires these workarounds for encryption going forward.
> We know that most people will not be using encryption, so that will not
> be well tested either.  For pg_upgrade, I used a minimal-impact
> approach, and it has allowed dramatic changes in our code without
> requiring changes and retesting of pg_upgrade.
>
> > don't happen in production environment. So I guess it would be better
> > to have an IV so that we never reuse it in a different relation with
> > the same page. An idea I came up with is that we make the IV from (PageLSN,
> > PageNumber, relNode) and have the encryption keys per tablespace.
> > That way, we never reuse IV in a different relation with the same page
> > number because relNode is unique within a database in a particular
> > tablespace as you mentioned.
>
> Yes, this is what we are discussing.  Whether the relfilenode is part of
> the IV, or we derive a key with a mix of the master encryption key and
> relfilenode is mostly a matter of what fits into which bits.  With CTR,
> I think we agreed it has to be LSN and page-number (and CTR counter),
> and we only have 5 bits left.  If we wanted to add anything else, it
> would be done via the creation of a derived key;  this was covered here:
>

Just to confirm, we have 21 bits left for the nonce in CTR?  We have the
LSN (8 bytes), page number (4 bytes) and counter (11 bits) in the
16-byte nonce space, i.e. 128 - (64 + 32 + 11) = 21 bits.  Even with
those 21 bits left we cannot fit the relfilenode (32 bits) in the IV.

BTW I've received a review of the current design from some
cryptologists in our company. They recommended using CTR rather than
CBC. The main reason is that with a block cipher it's important to
ensure uniqueness for every input block to the cipher. That is hard to
guarantee with CBC because each block's input depends on the previous
block's output, whereas CTR encrypts each block independently with
key+nonce, so if we can ensure the uniqueness of the IV we meet that
requirement. Also, it's not necessary to encrypt the IV, as it's okay
for it to be predictable. So I vote for CTR, at least for tables/indexes
encryption, though there may already be consensus on that.
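
For reference, CTR page encryption might look roughly like this with
OpenSSL's EVP interface (a sketch only; the function and parameter
names are mine, not from any patch):

#include <openssl/evp.h>

/*
 * Encrypt (or decrypt -- CTR is symmetric) one page in place with
 * AES-256-CTR.  Returns 1 on success, 0 on failure.
 */
static int
encrypt_page_ctr(const unsigned char *key,  /* 32-byte derived key */
                 const unsigned char *iv,   /* 16-byte LSN/pageno/counter IV */
                 unsigned char *page, int page_len)
{
    EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
    int         len;
    int         ok;

    ok = ctx != NULL &&
        EVP_EncryptInit_ex(ctx, EVP_aes_256_ctr(), NULL, key, iv) == 1 &&
        EVP_EncryptUpdate(ctx, page, &len, page, page_len) == 1 &&
        EVP_EncryptFinal_ex(ctx, page + len, &len) == 1;

    EVP_CIPHER_CTX_free(ctx);
    return ok;
}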

For WAL encryption, before flushing WAL we encrypt the whole 8kB WAL
page and then write only the encrypted bytes of the new WAL record
using pg_pwrite(), rather than writing the whole encrypted page. So each
time we encrypt an 8kB WAL page we end up encrypting different data with
the same key+nonce, but since we never write to disk anything other than
the space where we actually wrote WAL records, it's not a problem. Is
that right?
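
In other words, something along these lines (just a sketch; I use plain
pwrite() here where the backend would go through its pg_pwrite()
wrapper, and all names are illustrative):

#include <sys/types.h>
#include <unistd.h>

/*
 * 'page' is the fully encrypted in-memory copy of an 8kB WAL page that
 * starts at file offset 'page_start'; only bytes [old_end, new_end),
 * i.e. the newly added WAL record data, are written to disk.
 */
static ssize_t
write_new_wal_bytes(int fd, off_t page_start, const char *page,
                    int old_end, int new_end)
{
    return pwrite(fd, page + old_end, new_end - old_end,
                  page_start + old_end);
}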

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center




Re: Possible race condition in pg_mkdir_p()?

2019-07-31 Thread Michael Paquier
On Wed, Jul 31, 2019 at 02:02:03PM +0800, Ning Yu wrote:
> Yes, in current postgres source code there are several wrappers of
> mkdir() that do similar jobs.  If we could have a safe mkdir_p()
> implementation then we could use it directly in all these wrappers, that
> could save a lot of maintenance effort in the long run.  I'm not saying
> that our patches are enough to make it safe and reliable, and I agree
> that any patches may introduce new bugs, but I think that having a safe
> and unified mkdir_p() is a good direction to go.

That's my impression as well.  Please note that this does not involve
an actual bug in Postgres and that this is rather invasive, so this
does not really qualify for a back-patch.  No objections to simplifying
the backend on HEAD though.  It would be good if you could actually
register a patch to the commit fest app [1] and also rework the patch
set so that at least PathNameCreateTemporaryDir wins its simplifications
for the first problem (double-checking the other code paths would be
nice as well).  The EEXIST handling, and the confusion about EEXIST
showing for both a path and a file need some separate handling (not
sure what to do on these parts yet).

> Yes, that's why we put the testing module in a separate patch from the
> fix, feel free to ignore it.  In fact we ourselves have concerns about it ;)

I think that it is nice that you took the time to do so, as it gets you
more familiar with the TAP infrastructure in the tree and proves your
point.  For this case, I would not have gone this far myself though ;p

[1]: https://commitfest.postgresql.org/24/
--
Michael




Re: Possible race condition in pg_mkdir_p()?

2019-07-31 Thread Ning Yu
On Wed, Jul 31, 2019 at 1:31 PM Michael Paquier  wrote:
>
> On Tue, Jul 30, 2019 at 10:19:45PM -0700, Andres Freund wrote:
> > I don't really have a problem fixing this case if we think it's
> > useful. But I'm a bit bothered by all the "fixes" being submitted that
> > don't matter for PG itself. They do eat up resources.
>
> Sure.  In this particular case, we can simplify at least one code path
> in the backend though for temporary path creation.  Such cleanup rings
> like a sufficient argument to me.

Yes, in current postgres source code there are several wrappers of
mkdir() that do similar jobs.  If we could have a safe mkdir_p()
implementation then we could use it directly in all these wrappers, that
could save a lot of maintenance effort in the long run.  I'm not saying
that our patches are enough to make it safe and reliable, and I agree
that any patches may introduce new bugs, but I think that having a safe
and unified mkdir_p() is a good direction to go.
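
For what it's worth, the per-component step we have in mind is roughly
the following (a sketch under plain POSIX assumptions, not the exact
code from our patches):

#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>

/*
 * Create one path component, tolerating a concurrent creator: EEXIST is
 * only accepted if the path is now a directory.  Returns 0 on success,
 * -1 with errno set otherwise.
 */
static int
mkdir_if_missing(const char *path, mode_t mode)
{
    if (mkdir(path, mode) == 0)
        return 0;
    if (errno == EEXIST)
    {
        struct stat st;

        if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
            return 0;
        errno = EEXIST;         /* exists, but is not a directory */
    }
    return -1;
}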

>
> > And sorry, adding in-backend threading to test testing mkdir_p doesn't
> > make me inclined to believe that this is all well considered.  There's
> > minor issues like us not supporting threads in the backend, pthread not
> > being portable, and also it being entirely out of proportion to the
> > issue.
>
> Agreed on this one.  The test case may be useful for the purpose of
> testing the presented patches, but it does not have enough value to be
> merged.

Yes, that's why we put the testing module in a separate patch from the
fix, feel free to ignore it.  In fact we ourselves have concerns about it ;)

Best Regards
Ning

> --
> Michael




Re: Unused header file inclusion

2019-07-31 Thread vignesh C
On Wed, Jul 31, 2019 at 11:26 AM Michael Paquier  wrote:
>
> On Wed, Jul 31, 2019 at 11:19:08AM +0530, vignesh C wrote:
> > I noticed that there are many header files being included which need
> > not be included.  I have tried this in a few files and found the
> > compilation and regression to be working.  I have attached the patch
> > for the files that  I tried.  I tried this in CentOS, I did not find
> > the header files to be platform specific.
> > Should we pursue this further and cleanup in all the files?
>
> Do you use a particular method here or just manual deduction after
> looking at each file individually?  If this can be cleaned up a bit, I
> think that's welcome.  The removal of headers is easily forgotten when
> moving code from one file to another...
>
Thanks Michael.
I'm writing some Perl scripts to identify this.
The scripts will scan through all the files, make the changes,
and verify them.
Finally they will report the changed files.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com