Re: [HACKERS] [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-07 Thread Andres Freund
On 2015-06-05 20:47:33 +0200, Andres Freund wrote:
> On 2015-06-05 14:33:12 -0400, Tom Lane wrote:
> > Robert Haas  writes:
> > > 1. The problem that we might truncate an SLRU members page away when
> > > it's in the buffers, but not drop it from the buffers, leading to a
> > > failure when we try to write it later.
> 
> I've got a fix for this, and about three other issues I found during
> development of the new truncation codepath.
> 
> I'll commit the fix tomorrow.

I've looked through multixact.c/slru.c and, as observed by Thomas, afaics
there is currently no codepath that exercises the broken behaviour. Due
to the way checkpoints and SLRU truncation are linked, "problematic"
pages will have been flushed beforehand.

I think we should fix this either way as it seems like a bad trap, but
I'd rather commit it after the next minor releases are out.




[HACKERS] Memory leak fixes for pg_dump, pg_dumpall, initdb and pg_upgrade

2015-06-07 Thread Michael Paquier
Hi all,

Please find attached a set of fixes for a couple of things in src/bin:
- pg_dump/pg_dumpall:
-- getFormattedTypeName, convertTSFunction and myFormatType return
strdup'd results that are never free'd.
-- convertTSFunction returns a const char pointer. I fail to see the point of
that... In my opinion we are fine with just returning a plain char pointer,
which is strdup'd so that it can be freed by the caller.
- initdb and pg_regress use getaddrinfo(), but do not free the
returned result with freeaddrinfo().
- Coverity also noticed some leaked memory in pg_upgrade's
equivalent_locale().

Most of these issues were spotted by Coverity; I found some of the others
while looking at similar code paths... In any case that's
Coverity's win ;)
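
To illustrate the getaddrinfo() item, the fix is just to pair each successful
call with freeaddrinfo(); a minimal standalone sketch of the pattern (not the
initdb/pg_regress code itself):

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int
main(void)
{
    struct addrinfo hints;
    struct addrinfo *result = NULL;
    int ret;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    ret = getaddrinfo("localhost", "5432", &hints, &result);
    if (ret != 0)
    {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(ret));
        return 1;
    }

    /* ... use result ... */

    freeaddrinfo(result);    /* without this, the whole list is leaked */
    return 0;
}
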
Regards,
-- 
Michael
From 7898cecd7ebf6e729a0a33b14b1d7d0512314bd1 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 8 Jun 2015 14:52:28 +0900
Subject: [PATCH 1/2] Fix some memory leaks in pg_dump and pg_dumpall

---
 src/bin/pg_dump/pg_dump.c| 83 
 src/bin/pg_dump/pg_dumpall.c |  7 
 2 files changed, 60 insertions(+), 30 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index a72dfe9..7458389 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -225,7 +225,7 @@ static char *format_function_signature(Archive *fout,
 static char *convertRegProcReference(Archive *fout,
 		const char *proc);
 static char *convertOperatorReference(Archive *fout, const char *opr);
-static const char *convertTSFunction(Archive *fout, Oid funcOid);
+static char *convertTSFunction(Archive *fout, Oid funcOid);
 static Oid	findLastBuiltinOid_V71(Archive *fout, const char *);
 static Oid	findLastBuiltinOid_V70(Archive *fout);
 static void selectSourceSchema(Archive *fout, const char *schemaName);
@@ -10454,10 +10454,12 @@ dumpFunc(Archive *fout, DumpOptions *dopt, FuncInfo *finfo)
 		parseOidArray(protrftypes, typeids, FUNC_MAX_ARGS);
 		for (i = 0; typeids[i]; i++)
 		{
+			char	   *elemType;
 			if (i != 0)
 appendPQExpBufferStr(q, ", ");
-			appendPQExpBuffer(q, "FOR TYPE %s",
-		 getFormattedTypeName(fout, typeids[i], zeroAsNone));
+			elemType = getFormattedTypeName(fout, typeids[i], zeroAsNone);
+			appendPQExpBuffer(q, "FOR TYPE %s", elemType);
+			free(elemType);
 		}
 	}
 
@@ -10593,6 +10595,8 @@ dumpCast(Archive *fout, DumpOptions *dopt, CastInfo *cast)
 	PQExpBuffer delqry;
 	PQExpBuffer labelq;
 	FuncInfo   *funcInfo = NULL;
+	char	   *sourceType;
+	char	   *targetType;
 
 	/* Skip if not to be dumped */
 	if (!cast->dobj.dump || dopt->dataOnly)
@@ -10616,13 +10620,13 @@ dumpCast(Archive *fout, DumpOptions *dopt, CastInfo *cast)
 	delqry = createPQExpBuffer();
 	labelq = createPQExpBuffer();
 
+	sourceType = getFormattedTypeName(fout, cast->castsource, zeroAsNone);
+	targetType = getFormattedTypeName(fout, cast->casttarget, zeroAsNone);
 	appendPQExpBuffer(delqry, "DROP CAST (%s AS %s);\n",
-	getFormattedTypeName(fout, cast->castsource, zeroAsNone),
-   getFormattedTypeName(fout, cast->casttarget, zeroAsNone));
+	  sourceType, targetType);
 
 	appendPQExpBuffer(defqry, "CREATE CAST (%s AS %s) ",
-	getFormattedTypeName(fout, cast->castsource, zeroAsNone),
-   getFormattedTypeName(fout, cast->casttarget, zeroAsNone));
+	  sourceType, targetType);
 
 	switch (cast->castmethod)
 	{
@@ -10660,8 +10664,7 @@ dumpCast(Archive *fout, DumpOptions *dopt, CastInfo *cast)
 	appendPQExpBufferStr(defqry, ";\n");
 
 	appendPQExpBuffer(labelq, "CAST (%s AS %s)",
-	getFormattedTypeName(fout, cast->castsource, zeroAsNone),
-   getFormattedTypeName(fout, cast->casttarget, zeroAsNone));
+	  sourceType, targetType);
 
 	if (dopt->binary_upgrade)
 		binary_upgrade_extension_member(defqry, &cast->dobj, labelq->data);
@@ -10679,6 +10682,9 @@ dumpCast(Archive *fout, DumpOptions *dopt, CastInfo *cast)
 NULL, "",
 cast->dobj.catId, 0, cast->dobj.dumpId);
 
+	free(sourceType);
+	free(targetType);
+
 	destroyPQExpBuffer(defqry);
 	destroyPQExpBuffer(delqry);
 	destroyPQExpBuffer(labelq);
@@ -10696,6 +10702,7 @@ dumpTransform(Archive *fout, DumpOptions *dopt, TransformInfo *transform)
 	FuncInfo   *fromsqlFuncInfo = NULL;
 	FuncInfo   *tosqlFuncInfo = NULL;
 	char	   *lanname;
+	char	   *transformType;
 
 	/* Skip if not to be dumped */
 	if (!transform->dobj.dump || dopt->dataOnly)
@@ -10723,13 +10730,14 @@ dumpTransform(Archive *fout, DumpOptions *dopt, TransformInfo *transform)
 	labelq = createPQExpBuffer();
 
 	lanname = get_language_name(fout, transform->trflang);
+	transformType = getFormattedTypeName(fout, transform->trftype, zeroAsNone);
 
 	appendPQExpBuffer(delqry, "DROP TRANSFORM FOR %s LANGUAGE %s;\n",
-  getFormattedTypeName(fout, transform->trftype, zeroAsNone),
+	  transformType,
 	  lanname);
 
 	appendPQExpBuffer(defqry, "CREATE TRANSFORM FOR %s LANGUAGE %s (",
-  getFormattedTypeName(fout, transform->trftype, zeroAsNo

Re: [HACKERS] amcheck prototype

2015-06-07 Thread Peter Geoghegan
On Wed, Nov 19, 2014 at 2:09 PM, Peter Geoghegan  wrote:
> Attached is a revision of what I previously called btreecheck, which
> is now renamed to amcheck.

This never really went anywhere, because as a project I don't think
that it has very crisp goals. My sense is that it could be developed
in a new direction, with the goal of finding bugs in the master
branch. This seems like something that could be possible without a
large additional effort; committing the tool itself can come later.

Right now, the code that is actually tested by the tool isn't
particularly likely to have bugs. I used a slightly revised version to
constantly verify B-Trees as the regression tests are run. That didn't
catch anything, but since the tool doesn't consult the heap at all I'm
not surprised. Also, I didn't incorporate any testing of recovery with
that stress test.

I wrote amcheck with the assumption that it is useful to have a tool
that verifies several nbtree invariants, a couple of which are fairly
elaborate. amcheck *is* probably useful for detecting corruption due
to hardware failure and so on today, but that is another problem
entirely.

-- 
Peter Geoghegan




Re: [HACKERS] Reducing tuple overhead

2015-06-07 Thread Amit Kapila
On Sun, Jun 7, 2015 at 3:02 PM, Simon Riggs  wrote:
>
> On 23 April 2015 at 17:24, Andres Freund  wrote:
>>
>
>
> It's hard to see how to save space there without reference to a specific
use case. I see different solutions depending upon whether we assume a low
number of transactions or a high number of transactions.
>

I have tried to check theoretically how much difference such a
change could give us.

Assuming a block size of 8192 bytes, a page header of 24 bytes, a
line pointer of 4 bytes per tuple, and an average tuple of 150 bytes,
roughly 53 tuples can be accommodated in one page.  Each of these
tuples carries 12 bytes of transaction information (xmin, xmax,
cid/combocid).  Assuming that in an average workload 4 transactions
operate on a page at the same time (for a workload like pgbench
tpc-b it shouldn't be more, otherwise it would have been visible in
perf reports), 4 additional tuples [1] could be accommodated on a
page, which is approximately 7% savings in space (which in turn
means that much less I/O).  This gain will vary with tuple size, the
number of transactions operating on a page, page size, etc.

Some additional benefits that I could see from such a change:

1. I think we don't need to traverse the whole page while freezing,
so there should be some savings in freeze operation as well.

2. Now I think with this, we might be able to reduce WAL also
if we can avoid some transaction related info in the cases where
it is currently stored (update/delete).

3. Another gain could come if we want to add transaction information
to the index segment as well: if such information can be stored at
the page level, then there won't be much impact in adding it there,
which would help us avoid multiple passes of vacuum (heap and indexes
could be vacuumed separately, which will definitely help with I/O and
bloat reduction).

[1]
Calculation for 4 additional tuples:
(saving from removing trans info from tuples - new space consumed by trans info)
/ new tuple size
= (53 * 12 - 12 * 4) / (150 - 12) = 4.26
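
As a quick sanity check of that arithmetic, a throwaway calculation using the
assumptions above (not backend code):

#include <stdio.h>

int
main(void)
{
    double tuples_per_page = (8192.0 - 24) / (150 + 4);    /* ~53 */
    double freed = 53 * 12 - 12 * 4;       /* bytes freed per page */
    double extra = freed / (150 - 12);     /* additional tuples: ~4.26 */

    printf("tuples/page: %.1f, extra tuples: %.2f, savings: ~%.1f%%\n",
           tuples_per_page, extra, 100.0 * 4 / 53);
    return 0;
}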

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


[HACKERS] Collection of memory leaks for ECPG driver

2015-06-07 Thread Michael Paquier
Hi all,

Please find attached a patch aimed at fixing a couple of memory leaks
in the ECPG driver. Coverity found them (and sometimes I did, after
reading some other code paths).
Regards,
-- 
Michael
diff --git a/src/interfaces/ecpg/compatlib/informix.c b/src/interfaces/ecpg/compatlib/informix.c
index d6de3ea..c1e3dfb 100644
--- a/src/interfaces/ecpg/compatlib/informix.c
+++ b/src/interfaces/ecpg/compatlib/informix.c
@@ -672,6 +672,7 @@ intoasc(interval * i, char *str)
 	if (!str)
 		return -errno;
 
+	free(str);
 	return 0;
 }
 
diff --git a/src/interfaces/ecpg/ecpglib/connect.c b/src/interfaces/ecpg/ecpglib/connect.c
index 55c5680..12f445d 100644
--- a/src/interfaces/ecpg/ecpglib/connect.c
+++ b/src/interfaces/ecpg/ecpglib/connect.c
@@ -298,7 +298,8 @@ ECPGconnect(int lineno, int c, const char *name, const char *user, const char *p
 		envname = getenv("PG_DBPATH");
 		if (envname)
 		{
-			ecpg_free(dbname);
+			if (dbname)
+ecpg_free(dbname);
 			dbname = ecpg_strdup(envname, lineno);
 		}
 
@@ -314,14 +315,19 @@ ECPGconnect(int lineno, int c, const char *name, const char *user, const char *p
 	/* check if the identifier is unique */
 	if (ecpg_get_connection(connection_name))
 	{
-		ecpg_free(dbname);
+		if (dbname)
+			ecpg_free(dbname);
 		ecpg_log("ECPGconnect: connection identifier %s is already in use\n",
  connection_name);
 		return false;
 	}
 
 	if ((this = (struct connection *) ecpg_alloc(sizeof(struct connection), lineno)) == NULL)
+	{
+		if (dbname)
+			ecpg_free(dbname);
 		return false;
+	}
 
 	if (dbname != NULL)
 	{
diff --git a/src/interfaces/ecpg/preproc/descriptor.c b/src/interfaces/ecpg/preproc/descriptor.c
index 053a7af..ebd95d3 100644
--- a/src/interfaces/ecpg/preproc/descriptor.c
+++ b/src/interfaces/ecpg/preproc/descriptor.c
@@ -175,6 +175,7 @@ output_get_descr(char *desc_name, char *index)
 	for (results = assignments; results != NULL; results = results->next)
 	{
 		const struct variable *v = find_variable(results->variable);
+		char *str_zero = mm_strdup("0");
 
 		switch (results->value)
 		{
@@ -188,7 +189,8 @@ output_get_descr(char *desc_name, char *index)
 break;
 		}
 		fprintf(yyout, "%s,", get_dtype(results->value));
-		ECPGdump_a_type(yyout, v->name, v->type, v->brace_level, NULL, NULL, -1, NULL, NULL, mm_strdup("0"), NULL, NULL);
+		ECPGdump_a_type(yyout, v->name, v->type, v->brace_level, NULL, NULL, -1, NULL, NULL, str_zero, NULL, NULL);
+		free(str_zero);
 	}
 	drop_assignments();
 	fputs("ECPGd_EODT);\n", yyout);
@@ -292,8 +294,12 @@ output_set_descr(char *desc_name, char *index)
 			case ECPGd_indicator:
 			case ECPGd_length:
 			case ECPGd_type:
-fprintf(yyout, "%s,", get_dtype(results->value));
-ECPGdump_a_type(yyout, v->name, v->type, v->brace_level, NULL, NULL, -1, NULL, NULL, mm_strdup("0"), NULL, NULL);
+{
+	char *str_zero = mm_strdup("0");
+	fprintf(yyout, "%s,", get_dtype(results->value));
+	ECPGdump_a_type(yyout, v->name, v->type, v->brace_level, NULL, NULL, -1, NULL, NULL, str_zero, NULL, NULL);
+	free(str_zero);
+}
 break;
 
 			default:
diff --git a/src/interfaces/ecpg/preproc/pgc.l b/src/interfaces/ecpg/preproc/pgc.l
index c70f298..0453409 100644
--- a/src/interfaces/ecpg/preproc/pgc.l
+++ b/src/interfaces/ecpg/preproc/pgc.l
@@ -27,7 +27,7 @@
 extern YYSTYPE yylval;
 
 static int		xcdepth = 0;	/* depth of nesting in slash-star comments */
-static char	   *dolqstart;		/* current $foo$ quote start string */
+static char	   *dolqstart = NULL;	/* current $foo$ quote start string */
 static YY_BUFFER_STATE scanbufhandle;
 static char *scanbuf;
 
@@ -550,6 +550,8 @@ cppline			{space}*#([^i][A-Za-z]*|{if}|{ifdef}|{ifndef}|{import})(.*\\{space})*.
 			}
 {dolqdelim} {
 token_start = yytext;
+if (dolqstart)
+	free(dolqstart);
 dolqstart = mm_strdup(yytext);
 BEGIN(xdolq);
 startlit();
diff --git a/src/interfaces/ecpg/preproc/variable.c b/src/interfaces/ecpg/preproc/variable.c
index 2ad7b5d..bb4b664 100644
--- a/src/interfaces/ecpg/preproc/variable.c
+++ b/src/interfaces/ecpg/preproc/variable.c
@@ -437,11 +437,13 @@ remove_variable_from_list(struct arguments ** list, struct variable * var)
 void
 dump_variables(struct arguments * list, int mode)
 {
-	char	   *str_zero = mm_strdup("0");
+	char	   *str_zero;
 
 	if (list == NULL)
 		return;
 
+	str_zero = mm_strdup("0");
+
 	/*
 	 * The list is build up from the beginning so lets first dump the end of
 	 * the list:



Re: [HACKERS] skipping pg_log in basebackup

2015-06-07 Thread Abhijit Menon-Sen
At 2015-06-08 13:09:02 +0900, michael.paqu...@gmail.com wrote:
>
> It seems to be that:
> http://www.postgresql.org/message-id/cahgqgwh0okz6ckpjkcwojga3ejwffm1enrmro3dkdoteaai...@mail.gmail.com

(Note that this is about calculating the wrong size, whereas my bug is
about the file being too large to be written to a tar archive.)

> And a recent discussion about that is this one:
> http://www.postgresql.org/message-id/82897A1301080E4B8E461DDAA0FFCF142A1B2660@SYD1216

Oh, sorry, I somehow did miss that thread entirely. Thanks for the
pointer. (I've added Vaishnavi to the Cc: list here.)

I'm not convinced that we need a mechanism to let people exclude the
torrent files they've stored in their data directory, but if we have to
do it, the idea of having a GUC setting rather than specifying excludes
on the basebackup command line each time does have a certain appeal.

Anyone else interested in doing it that way?

-- Abhijit




Re: [HACKERS] Restore-reliability mode

2015-06-07 Thread Peter Geoghegan
On Sat, Jun 6, 2015 at 12:58 PM, Noah Misch  wrote:
>   - Call VALGRIND_MAKE_MEM_NOACCESS() on a shared buffer when its local pin
> count falls to zero.  Under CLOBBER_FREED_MEMORY, wipe a shared buffer
> when its global pin count falls to zero.

Did a patch for this ever materialize?
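
For anyone picking this up, the first part of the quoted idea boils down to
calling the standard Valgrind client-request macro on a buffer once it is no
longer pinned; a tiny standalone illustration (requires the valgrind headers;
not an actual bufmgr.c patch):

#include <stdlib.h>
#include <valgrind/memcheck.h>

int
main(void)
{
    char *buf = malloc(8192);    /* stand-in for a shared buffer */

    buf[0] = 'x';                /* fine: the buffer is "pinned" */

    /* pin count drops to zero: make the region off-limits */
    VALGRIND_MAKE_MEM_NOACCESS(buf, 8192);

    /* under valgrind, uncommenting this is reported as an invalid write */
    /* buf[1] = 'y'; */

    free(buf);
    return 0;
}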


-- 
Peter Geoghegan




Re: [HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-07 Thread Michael Paquier
On Mon, Jun 8, 2015 at 12:42 PM, Abhijit Menon-Sen  wrote:
> This is a followup to a 2014-02 discussion that led to pg_stats_temp
> being excluded from pg_basebackup. At the time, it was discussed to
> exclude pg_log as well, but nothing eventually came of that.

It seems to be that:
http://www.postgresql.org/message-id/cahgqgwh0okz6ckpjkcwojga3ejwffm1enrmro3dkdoteaai...@mail.gmail.com

> Recently, one of our customers has had a basebackup fail because pg_log
> contained files that were >8GB:
> FATAL: archive member "pg_log/postgresql-20150119.log" too large for tar 
> format
>
> I think pg_basebackup should also skip pg_log entries, as it does for
> pg_stats_temp and pg_replslot, etc. I've attached a patch along those
> lines for discussion.

And a recent discussion about that is this one:
http://www.postgresql.org/message-id/82897A1301080E4B8E461DDAA0FFCF142A1B2660@SYD1216
The point being: some users may want to keep log files in a base
backup, and some users may want to skip other files, not only
pg_log. Hence we may want more flexibility than what is proposed here.
Regards,
-- 
Michael




Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-06-07 Thread Amit Kapila
On Mon, Jun 8, 2015 at 5:52 AM, Andrew Dunstan  wrote:
>
> On 06/05/2015 11:08 PM, Amit Kapila wrote:
>>
>>
>> Okay, I think I can understand why you want to be cautious for
>> having a different check for this path, but in that case there is a
>> chance that recovery might fail when it will try to create a symlink
>> for that file.  Shouldn't we then try to call this new function only
>> when we are trying to restore from tablespace_map file and also
>> is there a need of ifdef S_ISLINK in remove_tablespace_link?
>>
>>
>> Shall I send an updated patch on these lines or do you want to
>> proceed with v3 version?
>>
>>
>
> The point seems to me that we should not be removing anything that's not
an empty directory or symlink, and that nothing else has any business being
in pg_tblspc. If we encounter such a thing whose name conflicts with the
name of a tablespace link we wish to create, rather than quietly blowing it
away we should complain loudly, and error out, making it the user's
responsibility to clean up their mess. Am I missing something?
>

What if it is just a flat file with the same name as the tablespace link?
Why do we want to give an error for that case?  I think right now we just
don't do anything with that file (unlink will fail with ENOENT and be
ignored; at least that's the way it currently behaves on Windows) and
create a separate symlink with the same name, which seems okay to me,
whereas with the change you propose it will give an error.  Do you see
any reason for doing so?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] pg_stat_archiver issue with aborted archiver

2015-06-07 Thread Michael Paquier
On Sun, Jun 7, 2015 at 1:11 AM, Julien Rouhaud
 wrote:
> I just noticed that if the archiver aborts (for instance if the
> archive_command exited with a return code > 127), pg_stat_archiver won't
> report those failed attempts. This happens with both 9.4 and 9.5 branches.
>
> Please find attached a patch that fix this issue, based on current head.

The current code seems right to me. When the archive command dies
because of a signal (exit code > 128), the server should fail
immediately with FATAL and not do any extra processing. It will
also try to archive the same segment file again after restart. When
trying again, if the command fails for a reason other than a signal,
the failure will be reported to pg_stat_archiver.
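
To illustrate the distinction, here is a generic sketch of classifying a
system()-style return code with the <sys/wait.h> macros (the >128 convention
for signal-killed children is a shell convention; this is not the actual
pgarch.c code):

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

int
main(void)
{
    int rc = system("exit 1");    /* stand-in for running archive_command */

    if (rc == 0)
        printf("archived OK\n");
    else if (WIFSIGNALED(rc) || (WIFEXITED(rc) && WEXITSTATUS(rc) > 128))
        printf("shell or command died on a signal: FATAL, no stats update, "
               "retry the same segment after restart\n");
    else
        printf("ordinary failure (exit code %d): counted in pg_stat_archiver\n",
               WEXITSTATUS(rc));
    return 0;
}
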
-- 
Michael




[HACKERS] skipping pg_log in basebackup (was Re: pg_basebackup and pg_stat_tmp directory)

2015-06-07 Thread Abhijit Menon-Sen
Hi.

This is a followup to a 2014-02 discussion that led to pg_stats_temp
being excluded from pg_basebackup. At the time, it was discussed to
exclude pg_log as well, but nothing eventually came of that.

Recently, one of our customers has had a basebackup fail because pg_log
contained files that were >8GB:

FATAL: archive member "pg_log/postgresql-20150119.log" too large for tar format
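
(For reference, the 8GB figure comes from the 11-digit octal size field in
the classic ustar header; assuming that format, the largest representable
member works out as:)

#include <stdio.h>

int
main(void)
{
    /* ustar "size" field: 12 bytes, i.e. 11 octal digits plus a terminator */
    unsigned long long max_size = 1ULL;
    int i;

    for (i = 0; i < 11; i++)
        max_size *= 8;            /* 8^11 = 2^33 */

    printf("largest tar member: %llu bytes (just under 8GB)\n", max_size - 1);
    return 0;
}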

I think pg_basebackup should also skip pg_log entries, as it does for
pg_stats_temp and pg_replslot, etc. I've attached a patch along those
lines for discussion.

-- Abhijit

P.S. Aren't we leaking statrelpath?
>From 8db162c1385b1cdd2b0e666975b76aa814f09f35 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen 
Date: Mon, 27 Apr 2015 12:58:52 +0530
Subject: Skip files in pg_log during basebackup

---
 src/backend/replication/basebackup.c | 29 +
 1 file changed, 29 insertions(+)

diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 1e86e4c..cc75a03 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -27,6 +27,7 @@
 #include "nodes/pg_list.h"
 #include "pgtar.h"
 #include "pgstat.h"
+#include "postmaster/syslogger.h"
 #include "replication/basebackup.h"
 #include "replication/walsender.h"
 #include "replication/walsender_private.h"
@@ -72,6 +73,9 @@ static bool backup_started_in_recovery = false;
 /* Relative path of temporary statistics directory */
 static char *statrelpath = NULL;
 
+/* Relative path to log directory */
+static char logpath[MAXPGPATH];
+
 /*
  * Size of each block sent into the tar stream for larger files.
  */
@@ -157,6 +161,18 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 		else
 			statrelpath = pgstat_stat_directory;
 
+		/*
+		 * Do the same for the log_directory.
+		 */
+
+		if (is_absolute_path(Log_directory) &&
+			path_is_prefix_of_path(DataDir, Log_directory))
+			snprintf(logpath, MAXPGPATH, "./%s", Log_directory + datadirpathlen + 1);
+		else if (strncmp(Log_directory, "./", 2) != 0)
+			snprintf(logpath, MAXPGPATH, "./%s", Log_directory);
+		else
+			strncpy(logpath, Log_directory, MAXPGPATH);
+
 		/* Add a node for the base directory at the end */
 		ti = palloc0(sizeof(tablespaceinfo));
 		ti->size = opt->progress ? sendDir(".", 1, true, tablespaces, true) : -1;
@@ -965,6 +981,19 @@ sendDir(char *path, int basepathlen, bool sizeonly, List *tablespaces,
 		}
 
 		/*
+		 * We can skip pg_log (or whatever log_directory is set to, if
+		 * it's under the data directory), but include it as an empty
+		 * directory anyway, so we get permissions right.
+		 */
+		if (strcmp(pathbuf, logpath) == 0)
+		{
+			if (!sizeonly)
+_tarWriteHeader(pathbuf + basepathlen + 1, NULL, &statbuf);
+			size += 512;
+			continue;
+		}
+
+		/*
 		 * Skip pg_replslot, not useful to copy. But include it as an empty
 		 * directory anyway, so we get permissions right.
 		 */
-- 
1.9.1




Re: [HACKERS] Memory leak with XLogFileCopy since de768844 (WAL file with .partial)

2015-06-07 Thread Michael Paquier
On Fri, Jun 5, 2015 at 10:45 PM, Fujii Masao wrote:
> Why don't we call InstallXLogFileSegment() at the end of XLogFileCopy()?
> If we do that, the risk of memory leak you're worried about will disappear altogether.

Yes, that looks fine: XLogFileCopy() would copy to a temporary file,
then install it in place. Updated patch attached.
-- 
Michael
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 666fa37..c03f70e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -807,7 +807,7 @@ static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	   bool find_free, XLogSegNo max_segno,
-	   bool use_lock);
+	   bool use_lock, int elevel);
 static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 			 int source, bool notexistOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
@@ -3012,7 +3012,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	max_segno = logsegno + CheckPointSegments;
 	if (!InstallXLogFileSegment(&installed_segno, tmppath,
 *use_existent, max_segno,
-use_lock))
+use_lock, LOG))
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3041,18 +3041,16 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 /*
  * Copy a WAL segment file in pg_xlog directory.
  *
- * dstfname		destination filename
  * srcfname		source filename
  * upto			how much of the source file to copy? (the rest is filled with
  *zeros)
+ * segno		identify segment to install.
  *
- * If dstfname is not given, the file is created with a temporary filename,
- * which is returned.  Both filenames are relative to the pg_xlog directory.
- *
- * NB: Any existing file with the same name will be overwritten!
+ * The file is first copied with a temporary filename, and then installed as
+ * a newly-created segment.
  */
-static char *
-XLogFileCopy(char *dstfname, char *srcfname, int upto)
+static void
+XLogFileCopy(char *srcfname, int upto, XLogSegNo segno)
 {
 	char		srcpath[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3150,25 +3148,9 @@ XLogFileCopy(char *dstfname, char *srcfname, int upto)
 
 	CloseTransientFile(srcfd);
 
-	/*
-	 * Now move the segment into place with its final name.  (Or just return
-	 * the path to the file we created, if the caller wants to handle the rest
-	 * on its own.)
-	 */
-	if (dstfname)
-	{
-		char		dstpath[MAXPGPATH];
-
-		snprintf(dstpath, MAXPGPATH, XLOGDIR "/%s", dstfname);
-		if (rename(tmppath, dstpath) < 0)
-			ereport(ERROR,
-	(errcode_for_file_access(),
-	 errmsg("could not rename file \"%s\" to \"%s\": %m",
-			tmppath, dstpath)));
-		return NULL;
-	}
-	else
-		return pstrdup(tmppath);
+	/* install the new file */
+	(void) InstallXLogFileSegment(&segno, tmppath, false,
+  0, false, ERROR);
 }
 
 /*
@@ -3195,6 +3177,8 @@ XLogFileCopy(char *dstfname, char *srcfname, int upto)
  * place.  This should be TRUE except during bootstrap log creation.  The
  * caller must *not* hold the lock at call.
  *
+ * elevel: log level used by this routine.
+ *
  * Returns TRUE if the file was installed successfully.  FALSE indicates that
  * max_segno limit was exceeded, or an error occurred while renaming the
  * file into place.
@@ -3202,7 +3186,7 @@ XLogFileCopy(char *dstfname, char *srcfname, int upto)
 static bool
 InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	   bool find_free, XLogSegNo max_segno,
-	   bool use_lock)
+	   bool use_lock, int elevel)
 {
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
@@ -3247,7 +3231,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
-		ereport(LOG,
+		ereport(elevel,
 (errcode_for_file_access(),
  errmsg("could not link file \"%s\" to \"%s\" (initialization of log file): %m",
 		tmppath, path)));
@@ -3259,7 +3243,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
-		ereport(LOG,
+		ereport(elevel,
 (errcode_for_file_access(),
  errmsg("could not rename file \"%s\" to \"%s\" (initialization of log file): %m",
 		tmppath, path)));
@@ -3748,7 +3732,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 	if (endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(&endlogSegNo, path,
-			   true, recycleSegNo, true))
+			   true, recycleSegNo, true, LOG))
 	{
 		ereport(DEBUG2,
 (errmsg("recycled transaction log file \"%s\"",
@@ -5227,8 +5211,6 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 */
 	if (endLogSegNo == startLogSegNo)
 	{
-		char	   *tmpfname;
-
 		XLogFileName(xlogfname, endTLI, endLogSegNo);
 
 		/*
@@ -5238,9 

Re: [HACKERS] [CORE] Restore-reliability mode

2015-06-07 Thread Alvaro Herrera
Joshua D. Drake wrote:
> 
> On 06/05/2015 08:07 PM, Bruce Momjian wrote:
> 
> >> From my side, it is only recently I got some clear answers to my questions
> >> about how it worked. I think it is very important that major features have
> >> extensive README type documentation with them so the underlying principles
> >> used in the development are clear. I would define the measure of a good
> >> feature as whether another committer can read the code comments and get a
> >> good feel. A bad feature is one where committers walk away from it, saying
> >> I don't really get it and I can't read an explanation of why it does that.
> >> Tom's most significant contribution is his long descriptive comments on
> >> what the problem is that needs to be solved, the options and the method
> >> chosen. Clarity of thought is what solves bugs.
> >
> >Yes, I think we should have done that early-on for multi-xact, and I am
> >hopeful we will learn to do that more often when complex features are
> >implemented, or when we identify areas that are more complex than we
> >thought.
> 
> I see this idea of the README as very useful. There are far more people like
> me in this community than Simon or Alvaro. I can test, I can break things, I
> can script up a harness, but I need to understand HOW, and the README would
> help allow for that.

There is a src/backend/access/README.tuplock that attempts to describe
multixacts.  Is that not sufficient?

Now that I think about it, this file hasn't been updated with the latest
changes, so it's probably a bit outdated now.

-- 
Álvaro Herrera    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] could not truncate directory "pg_subtrans": apparent wraparound

2015-06-07 Thread Thomas Munro
On Mon, Jun 8, 2015 at 12:29 PM, Thomas Munro
 wrote:
> Here's a repro script and a suggested patch.

Argh... I realised immediately after sending this that subtransaction
truncation doesn't even use the oldest XID computed by vacuum, it uses
GetOldestXmin (the "oldest transaction that was running when any
current transaction was started").  So here is an even simpler repro,
though the fix is the same (with different comments).

-- 
Thomas Munro
http://www.enterprisedb.com


checkpoint-subtrans-boundary-v2.sh
Description: Bourne shell script


fix-bogus-subtrans-wraparound-error-v2.patch
Description: Binary data



Re: [HACKERS] could not truncate directory "pg_subtrans": apparent wraparound

2015-06-07 Thread Thomas Munro
On Sat, Jun 6, 2015 at 4:51 PM, Thomas Munro
 wrote:
> On Sat, Jun 6, 2015 at 1:25 PM, Alvaro Herrera  
> wrote:
>> Thomas Munro wrote:
>>
>>> My idea was that if I could get oldestXact == next XID in
>>> TruncateSUBTRANS, then TransactionIdToPage(oldestXact) for a value of
>>> oldestXact that happens to be immediately after a page boundary (so
>>> that xid % 2048 == 0) might give page number that is >=
>>> latest_page_number, causing SimpleLruTruncate to print that message.
>>> But I can't figure out how to get next XID == oldest XID, because
>>> vacuumdb --freeze --all consumes xids itself, so in my first attempt
>>> at this, next XID is always 3 ahead of the oldest XID when a
>>> checkpoint is run.
>>
>> vacuumdb starts by querying pg_database, which eats one XID.
>>
>> Vacuum itself only uses one XID when vac_truncate_clog() is called.
>> This is called from vac_update_datfrozenxid(), which always happen at
>> the end of each user-invoked VACUUM (so three times for vacuumdb if you
>> have three databases); autovacuum does it also at the end of each run.
>> Maybe you can get autovacuum to quit before doing it.
>>
>> OTOH, if the values in the pg_database entry do not change,
>> vac_truncate_clog is not called, and thus vacuum would finish without
>> consuming an XID.
>
> I have managed to reproduce it a few times but haven't quite found the
> right synchronisation hacks to make it reliable so I'm not posting a
> repro script yet.
>
> I think it's a scary sounding message but very rare and entirely
> harmless (unless you really have wrapped around...).  The fix is
> probably something like: if oldest XID == next XID, then just don't
> call SimpleLruTruncate (truncation is deferred until the next
> checkpoint), or perhaps (if we can confirm this doesn't cause problems
> for dirty pages or that there can't be any dirty pages before cutoff
> page because of the preceding flush (as I suspect)) we could use
> cutoffPage = TransactionIdToPage(oldestXact - 1) if oldest == next, or
> maybe even always.

Here's a repro script and a suggested patch.  (What I said about dirty
pages in parentheses above was nonsense, I was confusing this with
something else.)
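
To spell out the boundary condition, here is a toy calculation assuming the
usual 8kB pages, i.e. 2048 xids per pg_subtrans page (not the slru.c code
itself):

#include <stdio.h>

#define XACTS_PER_PAGE 2048    /* assumed: BLCKSZ / sizeof(TransactionId) */

int
main(void)
{
    unsigned int next_xid = 100 * XACTS_PER_PAGE;    /* exactly on a page boundary */
    unsigned int oldest_xid = next_xid;              /* nothing running: oldest == next */

    unsigned int latest_page = (next_xid - 1) / XACTS_PER_PAGE;    /* 99: last page used */
    unsigned int cutoff_page = oldest_xid / XACTS_PER_PAGE;        /* 100 */

    if (cutoff_page > latest_page)
        printf("cutoff page %u is ahead of latest page %u: "
               "SimpleLruTruncate complains about an apparent wraparound\n",
               cutoff_page, latest_page);
    return 0;
}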

-- 
Thomas Munro
http://www.enterprisedb.com


fix-bogus-subtrans-wraparound-error.patch
Description: Binary data


checkpoint-subtrans-boundary.sh
Description: Bourne shell script



Re: [HACKERS] Re: [COMMITTERS] pgsql: Map basebackup tablespaces using a tablespace_map file

2015-06-07 Thread Andrew Dunstan


On 06/05/2015 11:08 PM, Amit Kapila wrote:
On Fri, Jun 5, 2015 at 10:51 AM, Amit Kapila wrote:


On Fri, Jun 5, 2015 at 9:57 AM, Andrew Dunstan wrote:


On 06/04/2015 11:35 PM, Amit Kapila wrote:


Theoretically, I don't see much problem by changing the checks
way you have done in patch, but it becomes different than what
we have in destroy_tablespace_directories() and it is slightly
changing the way check was originally done in
create_tablespace_directories(), basically original check
will try
unlink if lstat returns non-zero return code. If you want
to proceed
with the changed checks as in v3, then may be we can modify
comments on top of function remove_tablespace_symlink() which
indicates that it works like destroy_tablespace_directories().



The difference is that here we're getting the list from a base
backup and it seems to me the risk of having a file we don't
really want to unlink is significantly greater.


Okay, I think I can understand why you want to be cautious for
having a different check for this path, but in that case there is a
chance that recovery might fail when it will try to create a symlink
for that file.  Shouldn't we then try to call this new function only
when we are trying to restore from tablespace_map file and also
is there a need of ifdef S_ISLINK in remove_tablespace_link?


Shall I send an updated patch on these lines or do you want to
proceed with v3 version?




The point seems to me that we should not be removing anything that's not 
an empty directory or symlink, and that nothing else has any business 
being in pg_tblspc. If we encounter such a thing whose name conflicts 
with the name of a tablespace link we wish to create, rather than 
quietly blowing it away we should complain loudly, and error out, making 
it the user's responsibility to clean up their mess. Am I missing something?
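
In code terms, the rule being argued for is roughly the following (a sketch
with plain POSIX calls and error reporting, not the v3 patch):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Remove a pre-existing pg_tblspc entry only if it is a symlink or empty dir. */
static void
remove_tablespace_entry(const char *path)
{
    struct stat st;

    if (lstat(path, &st) < 0)
    {
        if (errno == ENOENT)
            return;              /* nothing there, nothing to do */
        perror("lstat");
        exit(1);
    }

    if (S_ISLNK(st.st_mode))
    {
        if (unlink(path) < 0)
        {
            perror("unlink");
            exit(1);
        }
    }
    else if (S_ISDIR(st.st_mode) && rmdir(path) == 0)
    {
        /* it was an empty directory and is now gone */
    }
    else
    {
        /*
         * A regular file, a non-empty directory, or anything else that has
         * no business being in pg_tblspc: complain loudly and make the user
         * clean it up rather than quietly blowing it away.
         */
        fprintf(stderr, "\"%s\" is not a symlink or empty directory; remove it manually\n",
                path);
        exit(1);
    }
}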


cheers

andrew




Re: [HACKERS] Resource Owner reassign Locks

2015-06-07 Thread Andres Freund
On 2015-06-07 13:44:08 -0700, Jeff Janes wrote:
> I'd like to advocate for back-patching this to 9.0, 9.1, and 9.2.  It has
> run without problems for a while now, and it can be considered a bug that
> systems with a very large number of objects cannot be upgraded in a
> reasonable time.

+1

Greetings,

Andres Freund




Re: [HACKERS] Resource Owner reassign Locks

2015-06-07 Thread Jeff Janes
On Thu, Jun 21, 2012 at 5:32 AM, Heikki Linnakangas <
heikki.linnakan...@enterprisedb.com> wrote:

> On 18.06.2012 13:59, Heikki Linnakangas wrote:
>
>> On 10.06.2012 23:39, Jeff Janes wrote:
>> I found the interface between resowner.c and lock.c a bit confusing.
>> resowner.c would sometimes call LockReassignCurrentOwner() to reassign
>> all the locks, and sometimes it would call LockReassignCurrentOwner() on
>> each individual lock, with the net effect that's the same as calling
>> LockReleaseCurrentOwner(). And it doesn't seem right for
>> ResourceOwnerRemember/ForgetLock to have to accept a NULL owner.
>>
>> I rearranged that so that there's just a single
>> LockReassignCurrentOwner() function, like before this patch. But it
>> takes as an optional argument a list of locks held by the current
>> resource owner. If the caller knows it, it can pass them to make the
>> call faster, but if it doesn't it can just pass NULL and the function
>> will traverse the hash table the old-fashioned way. I think that's a
>> better API.
>>
>> Please take a look to see if I broke something.
>>
>
> I hear no complaints, so committed. Thanks!


I'd like to advocate for back-patching this to 9.0, 9.1, and 9.2.  It has
run without problems for a while now, and it can be considered a bug that
systems with a very large number of objects cannot be upgraded in a
reasonable time.

There are possible ways to make the upgrade smoother by changing the new
system's pg_upgrade rather than the old system's server, but those methods
will not be simpler, and not as well tested.

I'll add this proposal to the commit fest.

Thanks,

Jeff


Re: [HACKERS] [CORE] Restore-reliability mode

2015-06-07 Thread Jeff Janes
On Sat, Jun 6, 2015 at 7:35 AM, Geoff Winkless  wrote:

> On 6 June 2015 at 13:41, Sehrope Sarkuni  wrote:
>
>>
>>
> It's much easier to work into dev/test setups if there are system
>> packages as it's just a config change to an existing script. Building
>> from source would require a whole new workflow that I don't have time
>> to incorporate.
>>
>
> Really? You genuinely don't have time to paste, say:
>
> mkdir -p ~/src/pgdevel
> cd ~/src/pgdevel
> wget
> https://ftp.postgresql.org/pub/snapshot/dev/postgresql-snapshot.tar.bz2
> tar xjf postgresql-snapshot.tar.bz2
> mkdir bld
> cd bld
> ../postgresql-9.5devel/configure $(pg_config --configure | sed -e
> 's/\(pg\|postgresql[-\/]\)\(doc-\)\?9\.[0-9]*\(dev\)\?/\1\29.5dev/g')
> make world
> make check
> make world-install
>

I think this is rather uncharitable.  You don't include yum, zypper, or
apt-get anywhere in there, and I vaguely recall it took me quite a bit of
time to install all the prereqs in order to get it to compile several years
ago when I first started trying to compile it.  And then even more time get
it to compile with several of the config flags I wanted.  And then even
more time to get the docs to compile.

And now after I got all of that, when I try your code, it still doesn't
work.  $(pg_config ) seems to not quote things the way that configure
wants them quoted, or something.  And the package from which I was using
pg_config uses more config options than I was set up for when compiling
myself, so I either have to manually remove the flags, or find more
dependencies (pam, xslt, ossp-uuid, tcl, tcl-dev, and counting).  This is
not very fun, and I didn't even need to get bureaucratic approval to do any
of this stuff, like a lot of people do.

And then when I try to install it, it tries to overwrite some of the files
which were initially installed by the package manager in /usr/lib.  That
doesn't seem good.

And how do I, as a hypothetical package manager user, start this puppy up?
Where is pg_ctlcluster?  How does one do pg_upgrade between a
package-controlled data directory and this new binary?

And then when I find a bug, how do I know it is a bug and not me doing
something wrong in the build process, and getting the wrong .so to load
with the wrong binary or something like that?


> and yet you think you have enough time to provide more than a "looks like
> it's working" report to the developers?
>

If it isn't working, reports that it isn't working, with error messages, are
pretty helpful and don't take a whole lot of time.  If it is working,
reports of that probably aren't terribly helpful without putting a lot more
work into it.  But people might be willing to put a lot of work into, say,
performance regression testing, if that is their area of expertise, provided
they don't also have to put a lot of work into getting the software to
compile in the first place, which is not their area.

Now I don't see a lot of evidence of beta testing from the public (i.e.
unfamiliar names) on -hackers and -bugs lists.  But a lot of hackers report
things that "a client" or "someone on IRC" reported to them, so I'm willing
to believe that a lot of useful beta testing does go on, even though I
don't directly see it, and if there were alpha releases, why wouldn't it
extend to that?


>
> (NB the sed for the pg_config line will probably need work, it looks like
> it should work on the two types of system I have here but I have to admit I
> changed the config line manually when I built it)
>

Right, and are the people who use apt-get to install everything likely to
know how to do that work?


Cheers,

Jeff


Re: [HACKERS] PoC: Partial sort

2015-06-07 Thread Peter Geoghegan
On Sun, Jun 7, 2015 at 8:10 AM, Andreas Karlsson  wrote:
> Are you planning to work on this patch for 9.6?

FWIW I hope so. It's a nice patch.

-- 
Peter Geoghegan




Re: [HACKERS] PoC: Partial sort

2015-06-07 Thread Andreas Karlsson

On 09/15/2014 01:58 PM, Alexander Korotkov wrote:

On Sun, Sep 14, 2014 at 9:32 AM, Peter Geoghegan wrote:

I think we might be better off if a tuplesort function was called
shortly after tuplesort_begin_heap() is called. How top-n heap sorts
work is something that largely lives in tuplesort's head. Today, we
call tuplesort_set_bound() to hint to tuplesort "By the way, this is a
top-n heap sort applicable sort". I think that with this patch, we
should then hint (where applicable) "by the way, you won't actually be
required to sort those first n indexed attributes; rather, you can
expect to scan those in logical order. You may work the rest out
yourself, and may be clever about exploiting the sorted-ness of the
first few columns". The idea of managing a bunch of tiny sorts from
with ExecSort(), and calling the new function tuplesort_reset() seems
questionable. tuplesortstate is supposed to be private/opaque to
nodeSort.c, and the current design strains that.

I'd like to keep nodeSort.c simple. I think it's pretty clear that the
guts of this do not belong within ExecSort(), in any case. Also, the
additions there should be much better commented, wherever they finally
end up.


As I understand it, you propose to encapsulate the partial sort algorithm in
tuplesort. However, in this case we anyway need some significant change
to its interface: let tuplesort decide when it's able to return a tuple.
Otherwise, we would miss a significant part of the LIMIT clause optimization.
tuplesort_set_bound() can't solve all the cases. There could be other
planner nodes between the partial sort and LIMIT.


Hi,

Are you planning to work on this patch for 9.6?

I generally agree with Peter that the changes to the sorting probably 
belong in the tuplesort code rather than in the executor. This way it 
should also be theoretically possible to support mark/restore.


Andreas




Re: [HACKERS] checkpointer continuous flushing

2015-06-07 Thread Fabien COELHO


Hello Andres,


They pretty much can't if you flush things frequently. That's why I
think this won't be acceptable without the sorting in the checkpointer.



* VERSION 2 "WORK IN PROGRESS".

The implementation is more a proof of concept for gathering feedback than
clean code. What it does:

 - as version 1: simplified asynchronous flush based on Andres Freund's
   patch, with sync_file_range/posix_fadvise used to hint the OS that
   the buffer must be sent to disk "now" (a sketch of these calls
   follows this list).

 - added: checkpoint buffer sorting based on a 2007 patch by Takahiro Itagaki
   but with a smaller and static buffer allocated once. Also,
   sorting is done by chunks in the current version.

 - also added: sync/advise calls are now merged if possible,
   so less calls are used, especially when buffers are sorted,
   but also if there are few files.
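
A minimal sketch of the kind of hint used in the first point above, assuming
Linux sync_file_range() behind a configure-style HAVE_SYNC_FILE_RANGE guard,
with posix_fadvise() as a coarser fallback (this is not the patch itself):

#define _GNU_SOURCE
#include <sys/types.h>
#include <fcntl.h>

/*
 * Ask the kernel to start writing back the given file range now,
 * without waiting for completion.
 */
static void
hint_writeback(int fd, off_t offset, off_t nbytes)
{
#if defined(HAVE_SYNC_FILE_RANGE)
    sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
#elif defined(POSIX_FADV_DONTNEED)
    /* coarser hint: may also drop the pages from the OS cache */
    (void) posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
#endif
}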


* PERFORMANCE TESTS

Impacts on "pgbench -M prepared -N -P 1" scale 10  (simple update pgbench
with a mostly-write activity),  with checkpoint_completion_target=0.8
and shared_buffers=1GB.

Contrary to v1, I have not tested bgwriter flushing, as its impact
in the first round was close to nought. This does not mean that particular
loads may not benefit from, or be harmed by, flushing from the bgwriter.

- 100 tps throttled max 100 ms latency over 6400 seconds
  with checkpoint_timeout=30s

  flush | sort | late transactions
    off |  off | 6.0 %
    off |   on | 6.1 %
     on |  off | 0.4 %
     on |   on | 0.4 % (93% improvement)

- 100 tps throttled max 100 ms latency over 4000 seconds
  with checkpoint_timeout=10mn

  flush | sort | late transactions
    off |  off | 1.5 %
    off |   on | 0.6 % (?!)
     on |  off | 0.8 %
     on |   on | 0.6 % (60% improvement)

- 150 tps throttled max 100 ms latency over 19600 seconds (5.5 hours)
  with checkpoint_timeout=30s

  flush | sort | late transactions
    off |  off | 8.5 %
    off |   on | 8.1 %
     on |  off | 0.5 %
     on |   on | 0.4 % (95% improvement)

- full speed bgbench over 6400 seconds with checkpoint_timeout=30s

  flush | sort | tps performance over per second data
    off |  off | 676 +- 230
    off |   on | 683 +- 213
     on |  off | 712 +- 130
     on |   on | 725 +- 116 (7.2% avg/50% stddev improvements)

- full speed bgbench over 4000 seconds with checkpoint_timeout=10mn

  flush | sort | tps performance over per second data
    off |  off | 885 +- 188
    off |   on | 940 +- 120 (6%/36%!)
     on |  off | 778 +- 245 (hmmm... not very consistent?)
     on |   on | 927 +- 108 (4.5% avg/43% stddev improvements)

- full speed bgbench "-j2 -c4" over 6400 seconds with checkpoint_timeout=30s

  flush | sort | tps performance over per second data
    off |  off | 2012 +- 747
    off |   on | 2086 +- 708
     on |  off | 2099 +- 459
     on |   on | 2114 +- 422 (5% avg/44% stddev improvements)


* CONCLUSION :

For all these HDD tests, when both options are activated the tps performance
is improved, the latency is reduced and the performance is more stable
(smaller standard deviation).

Overall the effects of the two options are, not surprisingly, quite
orthogonal (with exceptions):

 - latency is essentially improved (60 to 95% reduction) by flushing
 - throughput is improved (4 to 7% better) thanks to sorting

In detail, some loads may benefit more from having only one option activated.
Also, on SSDs both options would probably have limited benefit.

Usual caveat: these are only benchmarks on one host at a particular time and
location, which may or may not be reproducible or representative of any
other load.  The good news is that all these tests tell the same thing.


* LOOK FOR THOUGHTS

- The bgwriter flushing option seems ineffective; should it be removed
  from the patch?

- Move fsync as early as possible, suggested by Andres Freund?

In these tests, when the flush option is activated, the fsync duration
at the end of the checkpoint is small: of more than 5525 checkpoint
fsyncs, 0.5% are above 1 second when flush is on, but the figure rises
to 24% when it is off.  This suggests that doing the fsync as soon as
possible would probably have no significant effect on these tests.

My opinion is that this should be left out for the nonce.


- Take into account tablespaces, as pointed out by Andres Freund?

The issue is that if writes are sorted, they are not distributed
randomly over tablespaces, which can induce lower performance on such systems.


How to do it: while scanning shared_buffers, count dirty buffers for each
tablespace. Then start as many threads as tablespaces, each one doing
its own independent throttling for a tablespace? For some obscure reason
there are 2 tablespaces by default (pg_global and pg_default), so that
would mean at least 2 threads.


Alternatively, maybe it can be done from one thread, but it would probably 
involve some strange hocus-pocus to switch frequently between tablespaces.


--
Fabien.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1da7dfb..2e6bb10 100644
--- a/doc/src/sgml/config.sgml
+++ b/

Re: [HACKERS] [CORE] Restore-reliability mode

2015-06-07 Thread Kevin Grittner
Joshua D. Drake  wrote:
> On 06/06/2015 07:14 PM, Peter Geoghegan wrote:
>> On Sat, Jun 6, 2015 at 7:07 PM, Robert Haas  wrote:

>>> Perhaps we're honoring this more in the breach than in the
>>> observance, but I'm not making up what Tom has said about this:
>>>
>>> http://www.postgresql.org/message-id/27310.1251410...@sss.pgh.pa.us

That's 9.0 release discussion:

| I think that the traditional criterion is that we don't release beta1
| as long as there are any known issues that might force an initdb.
| We were successful in avoiding a post-beta initdb this time, although
| IIRC the majority of release cycles have had one --- so maybe you
| could argue that that's not so important.  It would certainly be
| less important if we had working pg_migrator functionality to ease
| the pain of going from beta to final.

>>> http://www.postgresql.org/message-id/19174.1299782...@sss.pgh.pa.us

That's 9.1 release discussion:

| Historically we've declared it beta once we think we are done with
| initdb-forcing problems.

| In any case, the existence of pg_upgrade means that "might we need
| another initdb?" is not as strong a consideration as it once was, so
| I'm not sure if we should still use that as a criterion.  I don't know
| quite what "ready for beta" should mean otherwise, though.

>>> http://www.postgresql.org/message-id/3413.1301154...@sss.pgh.pa.us

Also 9.1, it is listed as one criterion:

| * No open issues that are expected to result in a catversion bump.
| (With pg_upgrade, this is not as critical as it used to be, but
| I still think catalog stability is a good indicator of a release's
| maturity)

>>> http://www.postgresql.org/message-id/3261.1401915...@sss.pgh.pa.us

Here we jump to 9.4 discussion:

| > Agreed. Additionally I also agree with Stefan that the price of a initdb
| > during beta isn't that high these days.
|
| Yeah, if nothing else it gives testers another opportunity to exercise
| pg_upgrade ;-).  The policy about post-beta1 initdb is "avoid if
| practical", not "avoid at all costs".

So I think these examples show that the policy has shifted from a
pretty strong requirement to "it's probably nice if" status, with
some benefits seen in pg_upgrade testing to actually having a bump.

>> Of course, not doing a catversion bump after beta1 doesn't necessarily
>> have much value in and of itself. *Promising* to not do a catversion
>> bump, and then usually keeping that promise definitely has a certain
>> value, but clearly we are incapable of that.

As someone who was able to bring up a new production application on
8.2 because it was all redundant data and not yet mission-critical,
I appreciate that in very rare circumstances that combination could
have benefit.  But really, how often are people in that position?

> It seems to me that a cat bump during Alpha or Beta should be absolutely
> fine and reservedly fine respectively. Where we should absolutely not
> cat bump unless there is absolutely no other choice is during an RC.

+1 on all of that.  And for a while now we've been talking about an
alpha test release, right?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [Proposal] More Vacuum Statistics

2015-06-07 Thread Tomas Vondra

Hi,

On 06/05/15 14:10, Naoya Anzai wrote:

Thank you for quick feedback, and I'm sorry for slow response.
All of your opinions were very helpful for me.

I have confirmed Greg's Idea "Timing events".
http://www.postgresql.org/message-id/509300f7.5000...@2ndquadrant.com

Greg said at first,
"Parsing log files for commonly needed performance data is no fun."
Yes, I completely agree with him.

>

That looks like a nice idea, but I don't know why it has
not been committed yet. Does anybody know?


Most likely lack of time, I guess.



I have reworked my idea since I heard dear hacker's opinions.


pg_stat_vacuum view


I understand it is not good to simply add more counters in
pg_stat_*_tables. For now, I'd like to suggest an extension
which can confirm vacuum statistics like pg_stat_statements.


I don't see how you want to collect the necessary information from an
extension. pg_stat_statements gets most of the stats from the BufferUsage
structure, but vacuum keeps all this internal, AFAIK.


So it'd be necessary to make this somehow public - either by creating
something like BufferUsage with all the vacuum stats, or perhaps a set
of callbacks (either a single log_relation_vacuum or different callbacks
for tables and indexes).


IMHO the callbacks are a better idea - for example because they naturally
handle database-wide vacuum. The global structure makes this difficult,
because you'll only see data for all the vacuumed objects (or it'd have
to track per-object stats internally, somehow).
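
To make the callback idea a bit more concrete, it could look something like
this, following the usual *_hook convention; every name and field below is
hypothetical, not an existing API (and it assumes the normal postgres.h
environment):

/* Hypothetical per-relation vacuum statistics, filled in by lazy_vacuum_rel() */
typedef struct VacuumRelStats
{
    Oid         relid;
    double      elapsed_ms;
    BlockNumber pages_removed;
    BlockNumber pages_remaining;
    double      tuples_removed;
    double      tuples_remaining;
    bool        is_autovacuum;
} VacuumRelStats;

/* Hypothetical hook, called once per vacuumed relation */
typedef void (*vacuum_stats_hook_type) (const VacuumRelStats *stats);
extern PGDLLIMPORT vacuum_stats_hook_type vacuum_stats_hook;

/*
 * An extension like the proposed pg_stat_vacuum would install itself from
 * _PG_init(), much as pg_stat_statements chains onto its hooks.
 */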



VACUUM is one of the most important features in PostgreSQL, but a
special view for vacuum does not exist. Don't you think that is
inconvenient? At least, I am disgruntled that we need to parse
pg_log to tune VACUUM.


+1


My first design of pg_stat_vacuum view is following.
(There are two views.)

pg_stat_vacuum_table
---
dbid
schemaname
relid
relname
elapsed
page_removed
page_remain
page_skipped
tuple_removed
tuple_remain
tuple_notremovable
buffer_hit
buffer_miss
buffer_dirty
avg_read
avg_write
vm_count
vac_start
vac_end
is_autovacuum

pg_stat_vacuum_index
---
dbid
shemaname
relid
indexrelid
indexname
elapsed
num_index_tuples
num_pages
tuples_removed
pages_deleted
pages_free
is_autovacuum

At present, I think the memory design of pg_stat_statements can be
reused for this feature. And I think this module needs to provide the
following parameters, like pg_stat_statements.


I'm not really sure about this.

Firstly, the very first response from TL in this thread was that adding 
per-table counters is not a particularly good idea, as it'll bloat the 
statistics files. It's true you're not adding the data into the main 
stats files, but you effectively establish a new 'vertical partition' 
with one record per table/index. It might be worth the overhead, if it 
really brings useful functionality (especially if it's an opt-in feature, 
like pg_stat_statements).


Secondly, the main issue of this design IMHO is that it only tracks the 
very last vacuum run (or do I understand it wrong?). That means even if 
you snapshot the pg_stat_vacuum views, you'll not know how many vacuums 
executed in between (and the more frequently you snapshot that, the 
greater the overhead). The other stats counters have the same issue, but 
the snapshotting works a bit better because the counters are cumulative 
(so you can easily do deltas etc.). But that's not the case here - 
certainly not with the timestamps, for example.


I don't think the vacuum start/end timestamps are particularly 
interesting, TBH - we already have them in pg_stat_all_tables anyway, 
including the vacuum_count etc. So I'd propose dropping the timestamps, 
possibly replacing them with a single 'elapsed time', and making all the 
counters cumulative (so that you can do snapshots and deltas).


I'm also wondering whether this should track the vacuum costs (because 
that determines how aggressive the vacuum is, and how much work will be 
done in a particular time), if it was anti-wraparound vacuum, if there 
was also ANALYZE performed, if the autovacuum was interrupted because of 
user activity, etc.



pg_stat_vacuum.max(integer)
pg_stat_vacuum.save(boolean)
pg_stat_vacuum.excluded_dbnames(text)
pg_stat_vacuum.excluded_schemas(text)
pg_stat_vacuum.min_duration(integer)
... and so on.

To implement this feature, I have to collect vacuum stats in each
lazy_vacuum_* function, and I need to embed a hook point where needed
(probably at the end of lazy_vacuum_rel). Would you hesitate to add
the hook only for this function?


Aha! So you plan to use the callbacks.



A similar feature is already provided by the pg_statsinfo package.
But it is a full-stack package for PostgreSQL stats, and introducing it
means redesigning pg_log and designing a repository database.
And it is not a core extension for PostgreSQL.
(I don't mean to knock pg_statsinfo;
  I think this package is a very convenient tool)

Everyone will be able to do more easily tuning o

Re: [HACKERS] Reducing tuple overhead

2015-06-07 Thread Simon Riggs
On 23 April 2015 at 17:24, Andres Freund  wrote:

> Split into a new thread, the other one is already growing fast
> enough. This discussion started at
> http://archives.postgresql.org/message-id/55391469.5010506%40iki.fi
>
> On April 23, 2015 6:48:57 PM GMT+03:00, Heikki Linnakangas <
> hlinn...@iki.fi> wrote:
> >Stop right there. You need to reserve enough space on the page to store
> >
> >an xmax for *every* tuple on the page. Because if you don't, what are
> >you going to do when every tuple on the page is deleted by a different
> >transaction.
> >
> >Even if you store the xmax somewhere else than the page header, you
> >need
> >to reserve the same amount of space for them, so it doesn't help at
> >all.
>
> Depends on how you do it and what you optimize for (disk space, runtime,
> code complexity..).  You can e.g. apply a somewhat similar trick to
> xmin/xmax as done to cmin/cmax; only that the data structure needs to be
> persistent.
>
> In fact, we already have combocid like structure for xids that's
> persistent - multixacts. We could just have one xid saved that's either
> xmin or xmax (indicated by bits) or a multixact.  When a tuple is
> updated/deleted whose xmin is still required we could replace the former
> xmin with a multixact, otherwise just change the flag that it's now a
> xmax without a xmin.  To check visibility and if the xid is a multixact
> we'd just have to look for the relevant member for the actual xmin and
> xmax.
>
> To avoid exessive overhead when a tuple is repeatedly updated within one
> session we could store some of the data in the combocid entry that we
> anyway need in that case.
>
> Whether that's feasible complexity wise is debatable, but it's certainly
> possible.
>
>
> I do wonder what, in realistic cases, is actually the bigger contributor
> to the overhead. The tuple header or the padding we liberally add in
> many cases...
>

It's hard to see how to save space there without reference to a specific
use case. I see different solutions depending upon whether we assume a low
number of transactions or a high number of transactions.

A case we could optimize for is insert-mostly tables. But in that case if
you get rid of the xmax then you still have to freeze the tuples later.

I would have thought a better optimization would be to use the xmax for the
xid epoch by default, so that such rows would never need freezing. Then at
least we are using the xmax for something useful in a larger insert-mostly
database rather than just leaving it at zero.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Reducing tuple overhead

2015-06-07 Thread Peter Geoghegan
On Thu, Apr 30, 2015 at 6:54 AM, Robert Haas  wrote:
> The other, related problem is that the ordering operator might start
> to return different results than it did at index creation time.  For
> example, consider a btree index built on a text column.  Now consider
> 'yum update'.  glibc gets updated, collation ordering of various
> strings change, and now you've got tuples that are in the "wrong
> place" in the index, because when the index was built, we thought A <
> B, but now we think B < A.  You would think the glibc maintainers
> might avoid such changes in minor releases, or that the Red Hat guys
> would avoid packaging and shipping those changes in minor releases,
> but you'd be wrong.

I would not think that. Unicode Technical Standard #10 states:

"""
Collation order is not fixed.

Over time, collation order will vary: there may be fixes needed as
more information becomes available about languages; there may be new
government or industry standards for the language that require
changes; and finally, new characters added to the Unicode Standard
will interleave with the previously-defined ones. This means that
collations must be carefully versioned.
"""

Also, in the paper "Modern B-Tree Techniques", by Goetz Graefe, page
238, it states:

"""
In many operating systems, appropriate functions are provided to
compute a normalized key from a localized string value, date value, or
time value. This functionality is used, for example, to list files in
a directory as appropriate for the local language. Adding
normalization for numeric data types is relatively straightforward, as
is concatenation of multiple normalized values. Database code must not
rely on such operating system code, however. The problem with relying
on operating systems support for database indexes is the update
frequency. An operating system might update its normalization code due
to an error or extension in the code or in the definition of a local
sort order; it is unacceptable, however, if such an update silently
renders existing large database indexes incorrect.
"""

Unfortunately, it is simply not the case that we can rely on OS
collations being immutable. We have *no* contract with any C standard
library concerning collation stability whatsoever. I'm surprised that
we don't hear more about this kind of corruption.
-- 
Peter Geoghegan

