Re: [HACKERS] CINE in CREATE TABLE AS ... and CREATE MATERIALIZED VIEW ...

2014-10-27 Thread Rushabh Lathia
Hi All,

- The patch applies cleanly.
- The regression tests (make check) run fine.
- The patch covers the documentation changes.

Here are a few comments:

1) What is the need for the following change?

diff --git a/src/backend/storage/lmgr/lwlock.c
b/src/backend/storage/lmgr/lwlock.c
index bcec173..9fe6855 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1005,12 +1005,6 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr,
uint64 oldval, uint64 *newval)
 lock->tail = proc;
 lock->head = proc;

-	/*
-	 * Set releaseOK, to make sure we get woken up as soon as the lock is
-	 * released.
-	 */
-	lock->releaseOK = true;
-
 	/* Can release the mutex now */
 	SpinLockRelease(&lock->mutex);


It doesn't look related to this patch.

2)

diff --git a/src/bin/psql/help.c b/src/bin/psql/help.c
index ae5fe88..4d11952 100644
--- a/src/bin/psql/help.c
+++ b/src/bin/psql/help.c
@@ -247,7 +247,7 @@ slashUsage(unsigned short int pager)
 fprintf(output, _(  \\f [STRING]show or set field
separator for unaligned query output\n));
 fprintf(output, _(  \\H toggle HTML output mode
(currently %s)\n),
 ON(pset.popt.topt.format == PRINT_HTML));
-fprintf(output, _(  \\pset [NAME [VALUE]]   set table output option\n
+fprintf(output, _(  \\pset [NAME [VALUE]] set table output
option\n
(NAME :=
{format|border|expanded|fieldsep|fieldsep_zero|footer|null|\n
   
numericlocale|recordsep|recordsep_zero|tuples_only|title|tableattr|pager|\n
   
unicode_border_linestyle|unicode_column_linestyle|unicode_header_linestyle})\n));


Why are the above changes required?

3) This patch adds IF NOT EXISTS for CREATE TABLE AS and CREATE MATERIALIZED
VIEW, but test case coverage for CREATE MATERIALIZED VIEW is missing (a
possible sketch follows below).
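For example, something along these lines could close the gap (a sketch only;
the object name is made up and the exact NOTICE wording depends on the final
patch):

CREATE MATERIALIZED VIEW cine_mv AS SELECT 1 AS a;
CREATE MATERIALIZED VIEW IF NOT EXISTS cine_mv AS SELECT 2 AS a;  -- expect a NOTICE, not an error
SELECT a FROM cine_mv;  -- still returns 1
DROP MATERIALIZED VIEW cine_mv;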

Apart from this, the changes look good to me.


On Tue, Oct 14, 2014 at 10:28 PM, Fabrízio de Royes Mello 
fabriziome...@gmail.com wrote:

 On Wed, Oct 1, 2014 at 9:17 AM, Fabrízio de Royes Mello 
 fabriziome...@gmail.com wrote:
 
  Hi all,
 
  We already have IF NOT EXISTS for CREATE TABLE. Is there some reason not to
 have it for CREATE TABLE AS and CREATE MATERIALIZED VIEW?
 

 Patch attached to add CINE support to:

 - CREATE TABLE AS
 - CREATE MATERIALIZED VIEW

 Regards,

 --
 Fabrízio de Royes Mello
 Consultoria/Coaching PostgreSQL
  Timbira: http://www.timbira.com.br
  Blog: http://fabriziomello.github.io
  Linkedin: http://br.linkedin.com/in/fabriziomello
  Twitter: http://twitter.com/fabriziomello
  Github: http://github.com/fabriziomello


 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers




-- 
Rushabh Lathia


Re: [HACKERS] [v9.5] Custom Plan API

2014-10-27 Thread Kouhei Kaigai
 FYI, patch v12 part 2 no longer applies cleanly.

Thanks. I rebased the patch set against the latest master branch.
The attached v13 applies to master.
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei kai...@ak.jp.nec.com


 -Original Message-
 From: thombr...@gmail.com [mailto:thombr...@gmail.com] On Behalf Of Thom
 Brown
 Sent: Sunday, October 26, 2014 9:22 PM
 To: Kaigai Kouhei(海外 浩平)
 Cc: Robert Haas; Kohei KaiGai; Tom Lane; Alvaro Herrera; Shigeru Hanada;
 Simon Riggs; Stephen Frost; Andres Freund; PgHacker; Jim Mlodgenski; Peter
 Eisentraut
 Subject: Re: [HACKERS] [v9.5] Custom Plan API
 
 On 30 September 2014 07:27, Kouhei Kaigai kai...@ak.jp.nec.com wrote:
 
 
On Mon, Sep 29, 2014 at 9:04 AM, Kohei KaiGai
 kai...@kaigai.gr.jp wrote:
 Do we need to match the prototype of wrapper function with
 callback?
   
Yes.
   
    OK, I fixed up the part-2 patch so that the declaration of
  GetSpecialCustomVar()
    matches the corresponding callback.
  
    Also, some noise introduced into the part-3 patch by git pull was removed.
 
 
 FYI, patch v12 part 2 no longer applies cleanly.
 
 --
 Thom


pgsql-v9.5-custom-scan.part-3.v13.patch
Description: pgsql-v9.5-custom-scan.part-3.v13.patch


pgsql-v9.5-custom-scan.part-2.v13.patch
Description: pgsql-v9.5-custom-scan.part-2.v13.patch


pgsql-v9.5-custom-scan.part-1.v13.patch
Description: pgsql-v9.5-custom-scan.part-1.v13.patch

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Function array_agg(array)

2014-10-27 Thread Ali Akbar
2014-10-27 9:11 GMT+07:00 Ali Akbar the.ap...@gmail.com:


 2014-10-27 1:38 GMT+07:00 Pavel Stehule pavel.steh...@gmail.com:

 Hi

 My idea is to use a new ArrayBuilder optimized for building multidimensional
 arrays, with its own state type. I think the casting to ArrayBuildState is the
 root of our problems, so I would not do that. The code in array_agg_* is
 simple; slightly more complex code is in nodeSubplan.c. Some schematic changes
 are in the attachments.


 Thanks! The structure looks clear, and thanks for the example in
 nodeSubplan.c. I will restructure v10 of the patch along these lines.


Patch attached.
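For readers of the archive, a minimal usage sketch of the aggregate this patch
adds (the output shape is assumed from the documentation hunk below and not
verified against this exact patch version):

SELECT array_agg(a)
FROM (VALUES (ARRAY[1,2]), (ARRAY[3,4]), (ARRAY[5,6])) v(a);

      array_agg
---------------------
 {{1,2},{3,4},{5,6}}
(1 row)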

Regards,
-- 
Ali Akbar
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7e5bcd9..f59738a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12046,6 +12046,22 @@ NULL baz</literallayout>(3 rows)</entry>
  <row>
   <entry>
    <indexterm>
+    <primary>array_agg</primary>
+   </indexterm>
+   <function>array_agg(<replaceable class="parameter">anyarray</replaceable>)</function>
+  </entry>
+  <entry>
+   any
+  </entry>
+  <entry>
+   the same array type as input type
+  </entry>
+  <entry>input arrays, aggregated into higher-order multidimensional array. Rejects NULL and empty array as input.</entry>
+ </row>
+
+ <row>
+  <entry>
+   <indexterm>
    <primary>average</primary>
   </indexterm>
   <indexterm>
diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index 2f0680f..8c182a4 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -2238,6 +2238,11 @@ SELECT ARRAY(SELECT oid FROM pg_proc WHERE proname LIKE 'bytea%');
  array
 ---
  {2011,1954,1948,1952,1951,1244,1950,2005,1949,1953,2006,31,2412,2413}
+
+SELECT ARRAY(SELECT array(select i) FROM generate_series(1,5) a(i));
+ array
+---
+ {{1},{2},{3},{4},{5}}
 (1 row)
 </programlisting>
The subquery must return a single column. The resulting
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 401bad4..eb4de3b 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -231,7 +231,10 @@ ExecScanSubPlan(SubPlanState *node,
 	bool		found = false;	/* TRUE if got at least one subplan tuple */
 	ListCell   *pvar;
 	ListCell   *l;
+	bool		use_md_array_builder;
+	Oid			md_array_element_type;
 	ArrayBuildState *astate = NULL;
+	MdArrayBuildState *mdastate = NULL;
 
 	/*
 	 * MULTIEXPR subplans, when executed, just return NULL; but first we
@@ -366,8 +369,25 @@ ExecScanSubPlan(SubPlanState *node,
 			/* stash away current value */
 			Assert(subplan->firstColType == tdesc->attrs[0]->atttypid);
 			dvalue = slot_getattr(slot, 1, &disnull);
-			astate = accumArrayResult(astate, dvalue, disnull,
-									  subplan->firstColType, oldcontext);
+			/*
+			 * Use the fast multidimensional array builder when the input is an
+			 * array. Only check this on the first iteration; on subsequent
+			 * ones, use the cached values.
+			 */
+			if (astate == NULL && mdastate == NULL)
+			{
+				md_array_element_type = get_element_type(subplan->firstColType);
+				use_md_array_builder = OidIsValid(md_array_element_type);
+			}
+
+			if (use_md_array_builder)
+				mdastate = accumMdArray(mdastate,
+										disnull ? NULL :
+											DatumGetArrayTypeP(dvalue),
+										disnull, md_array_element_type,
+										oldcontext);
+			else
+				astate = accumArrayResult(astate, dvalue, disnull,
+										  subplan->firstColType, oldcontext);
 			/* keep scanning subplan to collect all values */
 			continue;
 		}
@@ -439,6 +459,8 @@ ExecScanSubPlan(SubPlanState *node,
 		/* We return the result in the caller's context */
 		if (astate != NULL)
 			result = makeArrayResult(astate, oldcontext);
+		else if (mdastate != NULL)
+			result = makeMdArray(mdastate, oldcontext);
 		else
 			result = PointerGetDatum(construct_empty_array(subplan->firstColType));
 	}
@@ -951,7 +973,10 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	ListCell   *pvar;
 	ListCell   *l;
 	bool		found = false;
+	bool		use_md_array_builder;
+	Oid			md_array_element_type;
 	ArrayBuildState *astate = NULL;
+	MdArrayBuildState *mdastate = NULL;
 
 	if (subLinkType == ANY_SUBLINK ||
 		subLinkType == ALL_SUBLINK)
@@ -1018,8 +1043,25 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 			/* stash away current value */
 			Assert(subplan->firstColType == tdesc->attrs[0]->atttypid);
 			dvalue = slot_getattr(slot, 1, &disnull);
-			astate = accumArrayResult(astate, dvalue, disnull,
-									  subplan->firstColType, oldcontext);
+			/*
+			 * Use the fast multidimensional array builder when the input is an
+			 * array. Only check this on the first iteration; on subsequent
+			 * ones, use the cached values.
+			 */
+			if (astate == NULL && mdastate == NULL)
+			{
+				md_array_element_type = get_element_type(subplan->firstColType);
+				use_md_array_builder = 

Re: [HACKERS] DISTINCT with btree skip scan

2014-10-27 Thread David Rowley
On Sat, Jul 5, 2014 at 12:17 PM, Thomas Munro mu...@ip9.org wrote:

 postgres=# set enable_hashagg = false;
 SET
 Time: 0.302 ms
 postgres=# explain select distinct a from foo;

 ┌─┐
 │ QUERY PLAN
│

 ├─┤
 │ Index Only Scan for distinct prefix 1 using foo_pkey on foo
  (cost=0.43..263354.20 rows=10 width=4) │
 │ Planning time: 0.063 ms
 │

 └─┘
 (2 rows)

 Time: 0.443 ms
 postgres=# select distinct a from foo;
 ┌───┐
 │ a │
 ├───┤
 │ 0 │
 │ 1 │
 │ 2 │
 │ 3 │
 │ 4 │
 │ 5 │
 │ 6 │
 │ 7 │
 │ 8 │
 │ 9 │
 └───┘
 (10 rows)

 Time: 0.565 ms



Hi Thomas,

I've had a quick look at this and it seems like a great win! I'm quite
surprised that we've not got this already. I think this technology could
also really help performance of queries such as SELECT * from bigtable bt
WHERE EXISTS(SELECT 1 FROM otherbigtable obt WHERE bt.somecol =
obt.someIndexedButNonUniqueColumn); I know you're not proposing to improve
that first off, but it could be done later once the benefits of this are
more commonly realised.

I think our shortfalls in this area have not gone unnoticed. I was reading
this post
https://www.periscope.io/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html
about comparing performance of COUNT(DISTINCT col). I think this would give
a big win for query 3 in that post. I'm trying to find some time to make some
changes to transform queries so that the GROUP BY can happen before the
joins when possible, so between that and your patch we'd be two steps closer
to making query 1 in the link above perform a little better on PostgreSQL.

Do you think you'll manage to get time to look at this a bit more? I'd be
keen to look into the costing side of it if you think that'll help you any?

Regards

David Rowley


Re: [HACKERS] On partitioning

2014-10-27 Thread Amit Langote

Hi,

 On Mon, Oct 13, 2014 at 04:38:39PM -0300, Alvaro Herrera wrote:
  Bruce Momjian wrote:
   I realize there hasn't been much progress on this thread, but I wanted
   to chime in to say I think our current partitioning implementation is
   too heavy administratively, error-prone, and performance-heavy.
 
  On the contrary, I think there was lots of progress; there's lots of
  useful feedback from the initial design proposal I posted.  I am a bit
  sad to admit that I'm not working on it at the moment as I had
  originally planned, though, because other priorities slipped in and I am
  not able to work on this for a while.  Therefore if someone else wants
  to work on this topic, be my guest -- otherwise I hope to get on it in a
  few months.
 
 Oh, I just meant code progress --- I agree the discussion was fruitful.
 

FWIW, I think Robert's criticism regarding not basing this on the inheritance
scheme was not responded to. He mentions a patch by Itagaki-san (from four
years ago, unfortunately abandoned); details here:

https://wiki.postgresql.org/wiki/Table_partitioning#Active_Work_In_Progress

This patch could be resurrected, fixing some parts of it as was suggested at
the time. But the most important decisions regarding the patch, like storage
structure, syntax, etc., would require building some consensus on whether this
is a worthwhile direction. At least some consideration must be given to the
idea that we might want remote partitions backed by FDW infrastructure in the
near future, although that may not be the primary goal of the partitioning
effort. What do others think?

--
Amit




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Function array_agg(array)

2014-10-27 Thread Pavel Stehule
Hi

I made some minor changes in the code:

* moved the test for old vs. new builder style for array sublinks out of the
main loops
* some API simplification of the new builder - we should not create an
identical API; that mostly makes no sense

Regards

Pavel Stehule


2014-10-27 8:12 GMT+01:00 Ali Akbar the.ap...@gmail.com:

 2014-10-27 9:11 GMT+07:00 Ali Akbar the.ap...@gmail.com:


 2014-10-27 1:38 GMT+07:00 Pavel Stehule pavel.steh...@gmail.com:

 Hi

 My idea is using new ArrayBuilder optimized for building
 multidimensional arrays with own State type. I think so casting to
 ArrayBuildState is base of our problems, so I don't would to do. Code in
 array_agg_* is simple, little bit more complex code is in nodeSubplan.c.
 Some schematic changes are in attachments.


 Thanks! The structure looks clear, and thanks for the example on
 nodeSubplan.c. I will restructure the v10 of the patch to this structure.


 Patch attached.

 Regards,
 --
 Ali Akbar

commit f09f41f5fe780bed4cf25949961a4e68e6402ff0
Author: Pavel Stehule pavel.steh...@gooddata.com
Date:   Mon Oct 27 10:03:07 2014 +0100

initial

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7e5bcd9..f59738a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12046,6 +12046,22 @@ NULL baz</literallayout>(3 rows)</entry>
  <row>
   <entry>
    <indexterm>
+    <primary>array_agg</primary>
+   </indexterm>
+   <function>array_agg(<replaceable class="parameter">anyarray</replaceable>)</function>
+  </entry>
+  <entry>
+   any
+  </entry>
+  <entry>
+   the same array type as input type
+  </entry>
+  <entry>input arrays, aggregated into higher-order multidimensional array. Rejects NULL and empty array as input.</entry>
+ </row>
+
+ <row>
+  <entry>
+   <indexterm>
    <primary>average</primary>
   </indexterm>
   <indexterm>
diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index 2f0680f..8c182a4 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -2238,6 +2238,11 @@ SELECT ARRAY(SELECT oid FROM pg_proc WHERE proname LIKE 'bytea%');
  array
 ---
  {2011,1954,1948,1952,1951,1244,1950,2005,1949,1953,2006,31,2412,2413}
+
+SELECT ARRAY(SELECT array(select i) FROM generate_series(1,5) a(i));
+ array
+---
+ {{1},{2},{3},{4},{5}}
 (1 row)
 </programlisting>
The subquery must return a single column. The resulting
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 401bad4..21b8de1 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -231,7 +231,10 @@ ExecScanSubPlan(SubPlanState *node,
 	bool		found = false;	/* TRUE if got at least one subplan tuple */
 	ListCell   *pvar;
 	ListCell   *l;
+	bool		use_md_array_builder = false;
+	Oid			md_array_element_type = InvalidOid;
 	ArrayBuildState *astate = NULL;
+	MdArrayBuildState *mdastate = NULL;
 
 	/*
 	 * MULTIEXPR subplans, when executed, just return NULL; but first we
@@ -260,6 +263,16 @@ ExecScanSubPlan(SubPlanState *node,
 	}
 
 	/*
+	 * Use the fast multidimensional array builder when the input is an array.
+	 * Only check this on the first iteration; on subsequent ones, use the
+	 * cached values.
+	 */
+	if (subLinkType == ARRAY_SUBLINK)
+	{
+		md_array_element_type = get_element_type(subplan->firstColType);
+		use_md_array_builder = OidIsValid(md_array_element_type);
+	}
+
+	/*
 	 * We are probably in a short-lived expression-evaluation context. Switch
 	 * to the per-query context for manipulating the child plan's chgParam,
 	 * calling ExecProcNode on it, etc.
@@ -366,8 +379,16 @@ ExecScanSubPlan(SubPlanState *node,
 			/* stash away current value */
 			Assert(subplan->firstColType == tdesc->attrs[0]->atttypid);
 			dvalue = slot_getattr(slot, 1, &disnull);
-			astate = accumArrayResult(astate, dvalue, disnull,
-									  subplan->firstColType, oldcontext);
+
+			if (use_md_array_builder)
+				mdastate = accumMdArray(mdastate,
+										disnull ? NULL :
+											DatumGetArrayTypeP(dvalue),
+										disnull, md_array_element_type,
+										oldcontext);
+			else
+				astate = accumArrayResult(astate, dvalue, disnull,
+										  subplan->firstColType, oldcontext);
 			/* keep scanning subplan to collect all values */
 			continue;
 		}
@@ -439,6 +460,8 @@ ExecScanSubPlan(SubPlanState *node,
 		/* We return the result in the caller's context */
 		if (astate != NULL)
 			result = makeArrayResult(astate, oldcontext);
+		else if (mdastate != NULL)
+			result = PointerGetDatum(makeMdArray(mdastate, oldcontext, true));
 		else
 			result = PointerGetDatum(construct_empty_array(subplan->firstColType));
 	}
@@ -951,7 +974,10 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	ListCell   *pvar;
 	ListCell   *l;
 	bool		found = false;
+	bool		use_md_array_builder = false;
+	Oid			md_array_element_type = InvalidOid;
 	ArrayBuildState *astate = 

Re: [HACKERS] On partitioning

2014-10-27 Thread Alvaro Herrera
Amit Langote wrote:

  On Mon, Oct 13, 2014 at 04:38:39PM -0300, Alvaro Herrera wrote:
   Bruce Momjian wrote:
I realize there hasn't been much progress on this thread, but I wanted
to chime in to say I think our current partitioning implementation is
too heavy administratively, error-prone, and performance-heavy.
  
   On the contrary, I think there was lots of progress; there's lots of
   useful feedback from the initial design proposal I posted.  I am a bit
   sad to admit that I'm not working on it at the moment as I had
   originally planned, though, because other priorities slipped in and I am
   not able to work on this for a while.  Therefore if someone else wants
   to work on this topic, be my guest -- otherwise I hope to get on it in a
   few months.
  
  Oh, I just meant code progress --- I agree the discussion was fruitful.
 
 FWIW, I think Robert's criticism regarding not basing this on inheritance
 scheme was not responded to.

It was responded to by ignoring it.  I didn't see anybody else
supporting the idea that inheritance is in any way a sane thing to base
partitioning on.  Sure, we have accumulated lots of kludges over the
years to cope with the fact that, really, it doesn't work very well.  So
what.  We can keep them, I don't care.

Anyway as I said above, I'm not particularly interested in any more
discussion on this topic for the time being, since I don't have time to
work on this patch.  If anybody wants to continue discussing to improve
the design some more, and even implement it or parts of it, that's fine
with me -- but please expect me not to answer.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Heikki Linnakangas

On 10/26/2014 11:47 PM, Tomas Vondra wrote:

After eyeballing the code for an hour or two, I think CREATE DATABASE
should be fine with performing only a 'partial checkpoint' on the
template database - calling FlushDatabaseBuffers and processing unlink
requests, as suggested by the comment in createdb().


Hmm. You could replace the first checkpoint with that, but I don't think 
that's enough for the second. To get any significant performance 
benefit, you need to get rid of both checkpoints, because doing two 
checkpoints one after another is almost as fast as doing a single 
checkpoint; the second checkpoint has very little work to do because the 
first checkpoint already flushed out everything.


The second checkpoint, after copying but before commit, is done because 
(from the comments in createdb function):



 * #1: When PITR is off, we don't XLOG the contents of newly created
 * indexes; therefore the drop-and-recreate-whole-directory behavior
 * of DBASE_CREATE replay would lose such indexes.

 * #2: Since we have to recopy the source database during DBASE_CREATE
 * replay, we run the risk of copying changes in it that were
 * committed after the original CREATE DATABASE command but before the
 * system crash that led to the replay.  This is at least unexpected
 * and at worst could lead to inconsistencies, eg duplicate table
 * names.


Doing only FlushDatabaseBuffers would not prevent these issues - you 
need a full checkpoint. These issues are better explained here: 
http://www.postgresql.org/message-id/28884.1119727...@sss.pgh.pa.us


To solve #1, we could redesign CREATE DATABASE so that replaying the 
DBASE_CREATE record doesn't zap the old directory, and also doesn't copy 
any files. We could instead just assume that if the transaction commits, 
all the files have been copied and fsync'd already, like we assume that 
if a CREATE INDEX commits in wal_level=minimal, the underlying file was 
fsync'd before the commit.


That would also solve #2, when doing crash recovery. But it would remain 
when doing archive recovery. I guess we could still redo the copy when 
in archive recovery mode. I believe it would be the first time we have a 
WAL record that's replayed differently in crash recovery than in archive 
recovery, so here be dragons...



It's not exactly a trivial change, but it does not seem frighteningly
difficult coding either.

The templates are usually static, so this would minimize both the CREATE
DATABASE duration and disruption to the cluster it causes.


I wonder if we should bite the bullet and start WAL-logging all the 
files that are copied from the template database to the new database. 
When the template database is small (template0 is 6.4MB currently), that 
wouldn't generate too much WAL. We could perhaps do that only if the 
template database is small, and do the checkpoints otherwise, although I 
wouldn't like to have subtly different behavior depending on database 
size like that.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strip nulls functions for json and jsonb

2014-10-27 Thread Pavel Stehule
Hi

I am sending a final review of this patch:

0. This patch implements stripping of null fields. It does exactly what was
proposed, and we want this feature in core. It is a requested feature for the
JSON types.

1. The patch applies and compiles without problems - one warning is fixed in
the attachment.

2. The code is relatively small and clean; I have no objections.

3. The necessary regression tests and related documentation are present.


I have no objections - this patch is ready for committer.

Thank you for patch

Regards

Pavel

2014-10-26 21:57 GMT+01:00 Andrew Dunstan and...@dunslane.net:


 On 10/26/2014 04:22 PM, Pavel Stehule wrote:



 2014-10-26 21:18 GMT+01:00 Andrew Dunstan and...@dunslane.net mailto:
 and...@dunslane.net:


 On 10/26/2014 04:14 PM, Thom Brown wrote:

 On 26 October 2014 20:07, Andrew Dunstan and...@dunslane.net
 mailto:and...@dunslane.net mailto:and...@dunslane.net
 mailto:and...@dunslane.net wrote:


 On 10/26/2014 03:50 PM, Pavel Stehule wrote:

 Hi

 I have a question,

 what is expected result of null strip of

  {"a": {"b": null, "c", null} }

 ?



 Please remember not to top-post.

 The above is not legal json, so the answer would be an error.


 I believe Pavel means:

  {"a": {"b": null, "c": null} }


 This is the expected result:

andrew=# select json_strip_nulls('{"a": {"b": null, "c": null} }');
 json_strip_nulls
------------------
 {"a":{}}
(1 row)


 It is NOT expected that we replace an empty object with NULL (and
 then strip it if it's a field value of an outer level object).


 ok,

 This case should probably be covered by a regression test
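 For instance (a sketch; the expected output is taken from Andrew's example
 above, and the jsonb variant would be analogous):

 select json_strip_nulls('{"a": {"b": null, "c": null}}');
  json_strip_nulls
 ------------------
  {"a":{}}
 (1 row)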



 Patch attached.

 cheers

 andrew

commit 03dc02f601c6b9097402317e2dfa34e0f5d33ba5
Author: Pavel Stehule pavel.steh...@gooddata.com
Date:   Mon Oct 27 10:53:55 2014 +0100

initial

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f59738a..e453da4 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10716,6 +10716,19 @@ table2-mapping
 </programlisting>
   </entry>
  </row>
+  <row>
+   <entry><para><literal>json_strip_nulls(from_json json)</literal>
+     </para><para><literal>jsonb_strip_nulls(from_json jsonb)</literal>
+   </para></entry>
+   <entry><para><type>json</type></para><para><type>jsonb</type></para></entry>
+   <entry>
+     Returns <replaceable>from_json</replaceable>
+     with all object fields that have null values omitted. Other null values
+     are untouched.
+   </entry>
+   <entry><literal>json_strip_nulls('[{"f1":1,"f2":null},2,null,3]')</literal></entry>
+   <entry><literal>[{"f1":1},2,null,3]</literal></entry>
+  </row>
  </tbody>
 </tgroup>
</table>
@@ -10752,6 +10765,16 @@ table2-mapping
 /para
   /note
 
+  <note>
+    <para>
+      If the argument to <literal>json_strip_nulls</> contains duplicate
+      field names in any object, the result could be semantically somewhat
+      different, depending on the order in which they occur. This is not an
+      issue for <literal>jsonb_strip_nulls</> since jsonb values never have
+      duplicate object field names.
+    </para>
+  </note>
+
   <para>
  See also <xref linkend="functions-aggregate"> for the aggregate
  function <function>json_agg</function> which aggregates record
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index 2d00dbe..efbec7f 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -105,6 +105,15 @@ static void populate_recordset_object_end(void *state);
 static void populate_recordset_array_start(void *state);
 static void populate_recordset_array_element_start(void *state, bool isnull);
 
+/* semantic action functions for json_strip_nulls */
+static void sn_object_start(void *state);
+static void sn_object_end(void *state);
+static void sn_array_start(void *state);
+static void sn_array_end(void *state);
+static void sn_object_field_start (void *state, char *fname, bool isnull);
+static void sn_array_element_start (void *state, bool isnull);
+static void sn_scalar(void *state, char *token, JsonTokenType tokentype);
+
 /* worker function for populate_recordset and to_recordset */
 static Datum populate_recordset_worker(FunctionCallInfo fcinfo, const char *funcname,
 		  bool have_record_arg);
@@ -225,6 +234,13 @@ typedef struct PopulateRecordsetState
 	MemoryContext fn_mcxt;		/* used to stash IO funcs */
 } PopulateRecordsetState;
 
+/* state for json_strip_nulls */
+typedef struct StripnullState{
+	JsonLexContext *lex;
+	StringInfo  strval;
+	bool skip_next_null;
+} StripnullState;
+
 /* Turn a jsonb object into a record */
 static void make_row_from_rec_and_jsonb(Jsonb *element,
 			PopulateRecordsetState *state);
@@ -2996,3 +3012,184 @@ findJsonbValueFromContainerLen(JsonbContainer *container, uint32 flags,
 
 	return findJsonbValueFromContainer(container, 

Re: [HACKERS] Function array_agg(array)

2014-10-27 Thread Ali Akbar
2014-10-27 16:15 GMT+07:00 Pavel Stehule pavel.steh...@gmail.com:

 Hi

 I did some minor changes in code

 * move tests of old or new builder style for array sublink out of main
 cycles
 * some API simplification of new builder - we should not to create
 identical API, mainly it has no sense


Minor changes in the patch:
* removed the array_agg_finalfn_internal declaration at the top of array_userfuncs.c
* fixed the comment of makeMdArray
* fixed the return value of makeMdArray
* removed unnecessary changes to array_agg_finalfn

Regards,
--
Ali Akbar
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7e5bcd9..f59738a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12046,6 +12046,22 @@ NULL baz</literallayout>(3 rows)</entry>
  <row>
   <entry>
    <indexterm>
+    <primary>array_agg</primary>
+   </indexterm>
+   <function>array_agg(<replaceable class="parameter">anyarray</replaceable>)</function>
+  </entry>
+  <entry>
+   any
+  </entry>
+  <entry>
+   the same array type as input type
+  </entry>
+  <entry>input arrays, aggregated into higher-order multidimensional array. Rejects NULL and empty array as input.</entry>
+ </row>
+
+ <row>
+  <entry>
+   <indexterm>
    <primary>average</primary>
   </indexterm>
   <indexterm>
diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index 2f0680f..8c182a4 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -2238,6 +2238,11 @@ SELECT ARRAY(SELECT oid FROM pg_proc WHERE proname LIKE 'bytea%');
  array
 ---
  {2011,1954,1948,1952,1951,1244,1950,2005,1949,1953,2006,31,2412,2413}
+
+SELECT ARRAY(SELECT array(select i) FROM generate_series(1,5) a(i));
+ array
+---
+ {{1},{2},{3},{4},{5}}
 (1 row)
 </programlisting>
The subquery must return a single column. The resulting
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 401bad4..21b8de1 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -231,7 +231,10 @@ ExecScanSubPlan(SubPlanState *node,
 	bool		found = false;	/* TRUE if got at least one subplan tuple */
 	ListCell   *pvar;
 	ListCell   *l;
+	bool		use_md_array_builder = false;
+	Oid			md_array_element_type = InvalidOid;
 	ArrayBuildState *astate = NULL;
+	MdArrayBuildState *mdastate = NULL;
 
 	/*
 	 * MULTIEXPR subplans, when executed, just return NULL; but first we
@@ -260,6 +263,16 @@ ExecScanSubPlan(SubPlanState *node,
 	}
 
 	/*
+	 * Use the fast multidimensional array builder when the input is an array.
+	 * Only check this on the first iteration; on subsequent ones, use the
+	 * cached values.
+	 */
+	if (subLinkType == ARRAY_SUBLINK)
+	{
+		md_array_element_type = get_element_type(subplan->firstColType);
+		use_md_array_builder = OidIsValid(md_array_element_type);
+	}
+
+	/*
 	 * We are probably in a short-lived expression-evaluation context. Switch
 	 * to the per-query context for manipulating the child plan's chgParam,
 	 * calling ExecProcNode on it, etc.
@@ -366,8 +379,16 @@ ExecScanSubPlan(SubPlanState *node,
 			/* stash away current value */
 			Assert(subplan->firstColType == tdesc->attrs[0]->atttypid);
 			dvalue = slot_getattr(slot, 1, &disnull);
-			astate = accumArrayResult(astate, dvalue, disnull,
-									  subplan->firstColType, oldcontext);
+
+			if (use_md_array_builder)
+				mdastate = accumMdArray(mdastate,
+										disnull ? NULL :
+											DatumGetArrayTypeP(dvalue),
+										disnull, md_array_element_type,
+										oldcontext);
+			else
+				astate = accumArrayResult(astate, dvalue, disnull,
+										  subplan->firstColType, oldcontext);
 			/* keep scanning subplan to collect all values */
 			continue;
 		}
@@ -439,6 +460,8 @@ ExecScanSubPlan(SubPlanState *node,
 		/* We return the result in the caller's context */
 		if (astate != NULL)
 			result = makeArrayResult(astate, oldcontext);
+		else if (mdastate != NULL)
+			result = PointerGetDatum(makeMdArray(mdastate, oldcontext, true));
 		else
 			result = PointerGetDatum(construct_empty_array(subplan->firstColType));
 	}
@@ -951,7 +974,10 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 	ListCell   *pvar;
 	ListCell   *l;
 	bool		found = false;
+	bool		use_md_array_builder = false;
+	Oid			md_array_element_type = InvalidOid;
 	ArrayBuildState *astate = NULL;
+	MdArrayBuildState *mdastate = NULL;
 
 	if (subLinkType == ANY_SUBLINK ||
 		subLinkType == ALL_SUBLINK)
@@ -960,6 +986,16 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
 		elog(ERROR, "CTE subplans should not be executed via ExecSetParamPlan");
 
 	/*
+	 * Use the fast multidimensional array builder when the input is an array.
+	 * Only check this on the first iteration; on subsequent ones, use the
+	 * cached values.
+	 */
+	if (subLinkType == ARRAY_SUBLINK)
+	{
+		md_array_element_type = get_element_type(subplan->firstColType);
+		use_md_array_builder = 

Re: [HACKERS] Better support of exported snapshots with pg_dump

2014-10-27 Thread Petr Jelinek

On 17/10/14 06:25, Michael Paquier wrote:

Two votes in favor of that from two committers sounds like a deal. Here
is a refreshed version of the patch introducing --snapshot from here,
after fixing a couple of things and adding documentation:
http://www.postgresql.org/message-id/ca+u5nmk9+ttcff_-4mfdxwhnastauhuq7u7uedd57vay28a...@mail.gmail.com



Hi,

I have two minor things:

+	printf(_("  --snapshot   use given synchronous 
snapshot for the dump\n"));


I think this should say --snapshot=NAME or something like that to make 
it visible that you are supposed to provide the parameter.
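For context, the intended usage is roughly as follows (a sketch; --snapshot is
the option added by the patch under review, and the exporting transaction must
stay open while pg_dump runs):

-- in an already-open transaction on the same server (e.g. a logical replication client):
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();   -- note the snapshot name it returns

$ pg_dump --snapshot=<name returned above> mydb > mydb.dump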


The other thing is that you changed the logic of the parallel dump behavior a
little - before your patch, one connection exports the snapshot and the others
do SET TRANSACTION SNAPSHOT; after your patch, even the connection that
exported the snapshot does the SET TRANSACTION SNAPSHOT, which is unnecessary.
I don't see it as too big a deal, but I don't see the need to change that
behavior either.


You could do something like this instead maybe?:
if (AH->sync_snapshot_id)
    SET TRANSACTION SNAPSHOT
else if (AH->numWorkers > 1 && AH->remoteVersion >= 90200 &&
         !dopt->no_synchronized_snapshots)
    AH->sync_snapshot_id = get_synchronized_snapshot(AH);

And remove the else if for the if (dumpsnapshot) part.

--
 Petr Jelinek  http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] END_OF_RECOVERY shutdowns and ResetUnloggedRelations()

2014-10-27 Thread Abhijit Menon-Sen
At 2014-09-25 22:41:18 +0200, and...@2ndquadrant.com wrote:

 On 2014-09-24 17:06:05 +0530, Abhijit Menon-Sen wrote:
  
  1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to
 earlier in StartupXLOG.
  
  2. Inside that function, issue fsync()s for the main forks we create by
 copying the _init fork.
 
 These two imo should definitely be backpatched.

I've attached updated versions of these two patches. The only changes
are to revise the comments as you suggested.

  3. A small fixup to add a const to typedef char *FileName, because the
 earlier patch gave me warnings about discarding const-ness. This is
 consistent with many other functions in fd.c that take const char *.
 
 I'm happy with consider that one for master (although I seem to recall
 having had a patch for it rejected?), but I don't think we want to do
 that in the backbranches.

(I gave up on this for now as an unrelated diversion.)

  4. Issue an fsync() on the data directory at startup if we need to
  perform crash recovery.
 
 I personally think this one should be backpatched too. Does anyone
 believe differently?

I revised this patch as well, I'll rebase and send it separately along
with the initdb -S patch shortly.

-- Abhijit
From 8487185a5f2073ed475f971e6203d49e1e21a987 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Wed, 8 Oct 2014 11:48:25 +0530
Subject: Call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier

We need to call this after recovery, but not after the END_OF_RECOVERY
checkpoint is written. If we crash after that checkpoint, for example
because of ENOSPC in PreallocXlogFiles, the checkpoint has already set
the ControlFile->state to DB_SHUTDOWNED, so we don't enter recovery
again at startup.

Because we did call ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP)
in the first cleanup, this leaves the database with a bunch of _init
forks for unlogged relations, but no main forks, leading to scary
errors.

See thread from 20140912112246.ga4...@alap3.anarazel.de for details.
---
 src/backend/access/transam/xlog.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3c9aeae..2ab9da5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6871,6 +6871,16 @@ StartupXLOG(void)
 	ShutdownWalRcv();
 
 	/*
+	 * Reset unlogged relations to the contents of their INIT fork. This
+	 * is done AFTER recovery is complete so as to include any unlogged
+	 * relations created during recovery, but BEFORE recovery is marked
+	 * as having completed successfully (because this step can fail,
+	 * e.g. due to ENOSPC).
+	 */
+	if (InRecovery)
+		ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+
+	/*
 	 * We don't need the latch anymore. It's not strictly necessary to disown
 	 * it, but let's do it for the sake of tidiness.
 	 */
@@ -7138,14 +7148,6 @@ StartupXLOG(void)
 	PreallocXlogFiles(EndOfLog);
 
 	/*
-	 * Reset initial contents of unlogged relations.  This has to be done
-	 * AFTER recovery is complete so that any unlogged relations created
-	 * during recovery also get picked up.
-	 */
-	if (InRecovery)
-		ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
-
-	/*
 	 * Okay, we're officially UP.
 	 */
 	InRecovery = false;
-- 
1.9.1

From 944e7c4e27fca5589b8a103f7f470df23a5161c2 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen a...@2ndquadrant.com
Date: Wed, 24 Sep 2014 16:01:37 +0530
Subject: Make ResetUnloggedRelations(…_INIT) fsync newly-created main forks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During UNLOGGED_RELATION_INIT, after we have copied ${x}_init to $x, we
issue fsync()s for the newly-created files. We depend on their existence
and a checkpoint isn't going to fsync them for us during recovery.

See thread from 20140912112246.ga4...@alap3.anarazel.de for details.
---
 src/backend/storage/file/reinit.c | 38 ++
 1 file changed, 38 insertions(+)

diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 3229f41..4ad5987 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -339,6 +339,44 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
 		}
 
 		FreeDir(dbspace_dir);
+
+		/*
+		 * copy_file() above has already called pg_flush_data() on the
+		 * files it created. Now we need to fsync those files, because
+		 * a checkpoint won't do it for us while we're in recovery. We
+		 * do this in a separate pass to allow the kernel to perform
+		 * all the flushes at once.
+		 */
+
+		dbspace_dir = AllocateDir(dbspacedirname);
+		while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+		{
+			ForkNumber forkNum;
+			int			oidchars;
+			char		oidbuf[OIDCHARS + 1];
+			char		mainpath[MAXPGPATH];
+
+			/* Skip 

[HACKERS] Missing FIN_CRC32 calls in logical replication code

2014-10-27 Thread Heikki Linnakangas
replication/slot.c and replication/logical/snapbuild.c use a CRC on the 
physical slot and snapshot files. It uses the same algorithm as used 
e.g. for the WAL. However, they are not doing the finalization step, 
FIN_CRC32() on the calculated checksums. Not that it matters much, but 
it's a bit weird and inconsistent, and was probably an oversight.
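For comparison, the usual pattern elsewhere in the backend includes the
finalization step (a minimal sketch using the macros from
src/include/utils/pg_crc.h; variable names are illustrative only):

	pg_crc32	crc;

	INIT_CRC32(crc);
	COMP_CRC32(crc, data, len);
	FIN_CRC32(crc);		/* <- this is the step slot.c and snapbuild.c skip */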


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Function array_agg(array)

2014-10-27 Thread Pavel Stehule
2014-10-27 11:20 GMT+01:00 Ali Akbar the.ap...@gmail.com:


 2014-10-27 16:15 GMT+07:00 Pavel Stehule pavel.steh...@gmail.com:

 Hi

 I did some minor changes in code

 * move tests of old or new builder style for array sublink out of main
 cycles
 * some API simplification of new builder - we should not to create
 identical API, mainly it has no sense


 minor changes in the patch:
 * remove array_agg_finalfn_internal declaration in top of array_userfuncs.c
 * fix comment of makeMdArray
 * fix return of makeMdArray
 * remove unnecesary changes to array_agg_finalfn


super

I tested the last version and I have no objections.

1. We want this feature - it has long been on our ToDo list.

2. The proposal and design of multidimensional aggregation are clean, and
nobody has objected here.

3. There is zero impact on the current implementation. For performance reasons
it uses a new array-optimized aggregator - 30% faster for this purpose than the
current (scalar) array aggregator.

4. The code is clean and respects the PostgreSQL coding rules.

5. Documentation and the necessary regression tests are included.

6. Patching and compilation are clean, without warnings.

This patch is ready for committer.

Thank you for the patch

Regards

Pavel



 Regards,
 --
 Ali Akbar



Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Atri Sharma



 To solve #1, we could redesign CREATE DATABASE so that replaying the
 DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
 any files. We could instead just assume that if the transaction commits,
 all the files have been copied and fsync'd already, like we assume that if
 a CREATE INDEX commits in wal_level=minimal, the underlying file was
 fsync'd before the commit.


Do you mean that during a recovery, we just let the database directory be
and assume that it is in good shape since the transaction committed
originally?


 I wonder if we should bite the bullet and start WAL-logging all the files
 that are copied from the template database to the new database. When the
 template database is small (template0 is 6.4MB currently), that wouldn't
 generate too much WAL. We could perhaps do that only if the template
 database is small, and do the checkpoints otherwise, although I wouldn't
 like to have subtly different behavior depending on database size like that.


For the sort of workload Tomas described above (creating a lot of databases
on the fly), we may end up with a lot of WAL eventually if we do this.

Regards,

Atri


Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback

2014-10-27 Thread Fujii Masao
On Fri, Oct 24, 2014 at 11:21 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 10/24/2014 01:24 PM, furu...@pm.nttdata.co.jp wrote:

  Sorry, I'm going around in circles. But I'd like to say again, I
  don't think this is a good idea. It prevents asynchronous
 pg_receivexlog from fsyncing WAL data and sending feedbacks more
 frequently at all. They are useful, for example, when we want to
 monitor the write location of asynchronous pg_receivexlog in almost
 realtime. But if we adopt the idea, since feedback cannot be sent
 soon in async mode, pg_stat_replication always returns the

 not-up-to-date location.


 Why not send a message every 10 seconds when its not sync rep?


 Or even after every write(). It's a tiny amount of network traffic
 anyway.


 I understand that sending feedback messages frequently will keep
 pg_stat_replication in an up-to-date state.

 Are there really no users who want to fsync even in async mode?
 I think people who dislike data loss would like that idea.


 The OS will not sit on the written but not fsync'd data forever, it will get
 flushed to disk in a few seconds even without the fsync. It's just that
 there is no guarantee on when it will hit the disk, but there are no strong
 guarantees in asynchronous replication anyway.

This makes me think of adding a new GUC parameter like wal_receiver_fsync so
that a user can stop walreceiver from fsyncing WAL after every write(). IOW,
it would allow walreceiver to work like the current pg_receivexlog.

One problem with the above idea is that the startup process can replay WAL
which has not been fsync'd yet. This violates the WAL rule. To resolve this,
the startup process would need to ensure that the WAL to replay has already
been fsync'd, and otherwise issue the fsync itself. This basically makes WAL
replay take longer, but we can reduce the amount of I/O done by walreceiver.

This seems also useful even with synchronous replication if synchronous_commit
is set to remote_write. Since walreceiver doesn't issue the fsync and can focus
on receiving and writing WAL, the performance of replication would be improved.

 Either way, sync or async, is returning feedback and executing
 fsync() just like walreceiver does really such a problem?


 Probably would be OK. It would increase the I/O a lot, thanks to a lot of
 small writes and fsyncs, which might be surprising for a tool like
 pg_receivexlog.

 One idea is to change the default behavior to be like walreceiver, and
 fsync() after every write. But provide an option to get the old behavior,
 --no-fsync.

I'm OK with this. That is, by default, pg_receivexlog issues the fsync and
sends the feedback message back after every write(). But if something like
--no-fsync is specified, the WAL file is fsync'd only when it's closed and
a feedback message is sent back only when the --status-interval time has elapsed.
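To illustrate the two behaviours being discussed (a sketch only; --no-fsync is
the option proposed above, not an existing switch):

$ pg_receivexlog -D /mnt/archive             # proposed default: fsync + feedback after every write()
$ pg_receivexlog -D /mnt/archive --no-fsync  # fsync only at file close; feedback per --status-interval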

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Heikki Linnakangas

On 10/27/2014 01:06 PM, Atri Sharma wrote:






To solve #1, we could redesign CREATE DATABASE so that replaying the
DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
any files. We could instead just assume that if the transaction commits,
all the files have been copied and fsync'd already, like we assume that if
a CREATE INDEX commits in wal_level=minimal, the underlying file was
fsync'd before the commit.



Do you mean that during a recovery, we just let the database directory be
and assume that it is in good shape since the transaction committed
originally?


Right.


I wonder if we should bite the bullet and start WAL-logging all the files
that are copied from the template database to the new database. When the
template database is small (template0 is 6.4MB currently), that wouldn't
generate too much WAL. We could perhaps do that only if the template
database is small, and do the checkpoints otherwise, although I wouldn't
like to have subtly different behavior depending on database size like that.


For the sort of workload Tomas described above (creating a lot of databases
on the fly), we may end up with a lot of WAL eventually if we do this.


I think writing 6 MB for each CREATE DATABASE would be OK. For 
comparison, a pg_switch_xlog() call will waste on average half a 
segment, i.e. 8MB. It's a lot cheaper than a checkpoint on a busy 
system, for sure. But of course, if the template database is larger, 
then you will generate more WAL.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Master ip from hot_standby..

2014-10-27 Thread sudalai
Hi,
   I need to query the master IP from a hot standby. The
*pg_stat_replication* view only shows the slave replication status.

Is there any way to get the *master IP* from a standby node, apart from
checking the *recovery.conf* file?







Thanks,
Sudalai



-
sudalai
--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Master-ip-from-hot-standby-tp5824415.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Atri Sharma
On Mon, Oct 27, 2014 at 4:44 PM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 10/27/2014 01:06 PM, Atri Sharma wrote:




  To solve #1, we could redesign CREATE DATABASE so that replaying the
 DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
 any files. We could instead just assume that if the transaction commits,
 all the files have been copied and fsync'd already, like we assume that
 if
 a CREATE INDEX commits in wal_level=minimal, the underlying file was
 fsync'd before the commit.


 Do you mean that during a recovery, we just let the database directory be
 and assume that it is in good shape since the transaction committed
 originally?


 Right.


It does make sense. However, with the checkpoint after creating the files
gone, the window between the creation of the files and the actual commit might
increase, raising the possibility of a crash during that period and leaving an
orphan database behind. Then again, my understanding of the consequences of
removing the checkpoint might be incorrect, so my fears might be unfounded.

Regards,

Atri


Re: [HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions

2014-10-27 Thread Stephen Frost
* Peter Eisentraut (pete...@gmx.net) wrote:
 On 10/16/14 12:01 PM, Stephen Frost wrote:
  This started out as a request for a non-superuser to be able to review
  the log files without needing access to the server.
 
 I think that can be done with a security-definer function.

Of course it can be.  We could replace the entire authorization system
with security definer functions too.  I don't view this as an argument
against this feature, particularly as we know other systems have it,
users have asked for it multiple times, and large PG deployments have had
to hack around our lack of it.
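(For the record, the security-definer workaround Peter alludes to looks roughly
like this - a sketch assuming the in-core pg_read_file() function, the default
'pg_log' log directory, and a hypothetical log_readers role; a real version
would also need to sanitize the file name:)

CREATE FUNCTION read_log_file(fname text) RETURNS text
LANGUAGE sql SECURITY DEFINER
AS $$ SELECT pg_read_file('pg_log/' || fname) $$;

REVOKE ALL ON FUNCTION read_log_file(text) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION read_log_file(text) TO log_readers;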

Thanks,

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Index-only scans for GIST

2014-10-27 Thread Thom Brown
On 18 August 2014 09:05, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 On 08/17/2014 07:15 PM, Anastasia Lubennikova wrote:

 2014-08-07 0:30 GMT+04:00 Heikki Linnakangas hlinnakan...@vmware.com:

 * I'm getting two regression failures with this (opr_sanity and join).



 The opr_sanity failure is corrected.
 But a question remains with the join test.
 I checked the latest version of my github repo and there's no failure in the
 join regression test.
 All 145 tests passed.
 To tell the truth, I don't understand which changes could have led to this
 failure.
 Could you show me the regression diffs?


 Sure, here you go. It seems like a change in a plan. At a quick glance it
 seems harmless: the new plan is identical except that the left and right
 side of a join have been reversed. But I don't understand either why this
 patch would change that, so it needs to be investigated.

  * The regression test queries that use LIMIT are not guaranteed to always

 return the same rows, hence they're not very good regression test cases.
 I'd suggest using more restricting WHERE clauses, so that each query only
 returns a handful of rows.


 Thank you for the comment; I rewrote the wrong queries. But could you explain
 why LIMIT queries may return different results? Does it happen because of
 different query optimization?


 Imagine that you have a table with two rows, A and B. If you run a query
 like SELECT * FROM table LIMIT 1, the system can legally return either
 row A or B, because there's no ORDER BY.

 Now, admittedly you have a similar problem even without the LIMIT - the
 system can legally return the rows in either order - but it's less of an
 issue because at least you can quickly see from the diff that the result
 set is in fact the same, the rows are just in different order. You could
 fix that by adding an ORDER BY to all test queries, but we haven't done
 that in the regression suite because then we would not have any test
 coverage for cases where you don't have an ORDER BY. As a compromise, test
 cases are usually written without an ORDER BY, but if e.g. the buildfarm
 starts failing because of differences in the result set order across
 platforms, then we add an ORDER BY to make it stable.
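
 In other words (a trivial sketch):

 CREATE TABLE t (x int);
 INSERT INTO t VALUES (1), (2);
 SELECT x FROM t LIMIT 1;             -- may legally return either 1 or 2
 SELECT x FROM t ORDER BY x LIMIT 1;  -- always returns 1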

  * I think it's leaking memory, in GIST scan context. I tested this with a

 variant of the regression tests:

 insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i,
 0.05*i)),
   point(0.05*i, 0.05*i) FROM generate_series(0,
 1000) as i;
 CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p);

 set enable_seqscan=off;
 set enable_bitmapscan=off;

 explain analyze  select p from gist_tbl where p @ box(point(0,0),
 point(999,999)) and length(p::text)  10;

 while the final query runs, 'top' shows constantly increasing memory
 usage.


 I don't think it's a memory leak. After some increase, memory usage remains
 the same. It works similarly without using an index-only scan.


 No, it's definitely a leak caused by the patch. Test with the attached
 patch, which prints out to the server log the amount of memory used by the
 GiST scan memory context every 1 rows. It clearly shows increasing
 memory usage all the way to the end of the query.

 It's cleaned up at the end of the query, but that's not good enough
 because for a large query you might accumulate gigabytes of leaked memory
 until the query has finished. If you (manually) apply the same patch to git
 master, you'll see that the memory usage stays consistent and small.


Hi Anastasia,

Do you have time to address the issues brought up in Heikki's review?  It
would be good if we could get your work into PostgreSQL 9.5.

Thanks

Thom


Re: [HACKERS] Index scan optimization

2014-10-27 Thread Rajeev rastogi
On 26 October 2014 10:42, Haribabu Kommi wrote:
 
 Hi,
 
 I reviewed index scan optimization patch, the following are the
 observations.
 
 - Patch applies cleanly.
 - Compiles without warnings
 - All regress tests are passed.
 
 There is a good performance gain with the patch in almost all scenarios.

Thanks for reviewing.

 I have a question regarding the setting of the key "matched" flags. Only the
 first key was set as matched even if we have multiple index conditions.
 Is there any reason behind that?

Actually, keysCount can be more than one only if the key strategy is
BTEqualStrategyNumber (i.e. the '=' condition), but our optimization is only
for the '<' and '<=' conditions. So adding code to set the flag for all keys
is of no use.

Let me know if I am missing anything here.

 If any volatile function is present in the index condition, the index
 scan itself is not chosen. Is there any need to handle the same
 similarly to NULLs?

Do you mean to ask whether we should add any special handling for volatile
functions?
If yes, then as you said, since an index scan itself is not selected for such
functions,
I feel we don't need to add anything for that.
Any opinion?

Thanks and Regards,
Kumar Rajeev Rastogi



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] KNN searches support for SP-GiST [GSOC'14]

2014-10-27 Thread Thom Brown
On 20 August 2014 08:09, Heikki Linnakangas hlinnakan...@vmware.com wrote:

 On 08/20/2014 03:35 AM, Vladislav Sterzhanov wrote:

 Hi there, pg-Hackers!
 Here is the patch which adds the possibility to perform
 nearest-neighbour searches on SP-GiST indexes (as of now it includes
 implementations for quad and kd trees). Pre-reviewed by my GSoC mentor
 Alexander Korotkov. A sample benchmarking script is included in the attachment
 (it dumps the current geonames archive and runs several searches on the
 (latitude, longitude) points), which demonstrates the dramatic improvement
 over plain searches and sorting.


 Cool!

 I think this needs a high-level description in access/spgist/README on how
 this works. You can probably copy-paste the similar description from gist's
 README, because it's the same design. If I understand correctly, the
 support functions return distances between the nodes and the query, and the
 SP-GiST code uses those distances to return the rows in order. An important
 detail there is that the distance returned for an inner node is a lower
 bound of the distance of any node in that branch.

 A binary heap would be a better data structure than a red-black tree for
 the queue of nodes to visit/return. I know that GiST also uses a red-black
 for the same thing, but that's no reason not to do better here. There is a
 binary heap implementation in the backend too (see
 src/backend/lib/binaryheap.c), so it's not any more difficult to use.
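
 A rough shape of such a visit queue on top of the in-core binary heap might be
 (a sketch, not code from the patch; see src/include/lib/binaryheap.h for the
 exact API - ScanItem, cmp_distance, scanstate and max_queue_size are made-up
 names):

	binaryheap *queue;

	/* comparator arranged so that the item with the smallest distance surfaces first */
	queue = binaryheap_allocate(max_queue_size, cmp_distance, scanstate);
	binaryheap_add(queue, PointerGetDatum(item));	/* push an inner node or leaf with its distance */
	item = (ScanItem *) DatumGetPointer(binaryheap_remove_first(queue));	/* pop the closest */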

 Does the code handle ties correctly? The distance function is just an
 approximation, so it's possible that there are two items with same
 distance, but are not equal according to the ordering operator. As an
 extreme example, if you hack the distance function to always return 0.0, do
 you still get correct results (albeit slowly)?

 The new suppLen field in spgConfigOut is not mentioned in the docs. Why
 not support pass-by-value supplementary values?

 A couple of quick comments on the code:

  @@ -137,8 +138,8 @@ spg_quad_picksplit(PG_FUNCTION_ARGS)
  {
 spgPickSplitIn *in = (spgPickSplitIn *) PG_GETARG_POINTER(0);
 spgPickSplitOut *out = (spgPickSplitOut *) PG_GETARG_POINTER(1);
 -   int i;
 -   Point  *centroid;
 +   int i;
 +   Point *centroid;

  #ifdef USE_MEDIAN
 /* Use the median values of x and y as the centroid point */


 Please remove all unnecessary whitespace changes like this.

  @@ -213,14 +215,19 @@ spg_quad_inner_consistent(PG_FUNCTION_ARGS)
 }

 Assert(in-nNodes == 4);
 +
 +   if (in-norderbys  0)
 +   {
 +   out-distances = palloc(in-nNodes * sizeof (double *));
 +   }


 I think that should be sizeof(double).

  +   *distances = (double *) malloc(norderbys * sizeof (double *));


 No mallocs allowed in the backend, should be palloc.


Hi Vlad,

I've been revisiting the status of the GSoC projects for this year and saw
this had stalled.  Do you have time to address the remaining issues so we
can get your work committed?

Thanks

Thom


Re: [HACKERS] On partitioning

2014-10-27 Thread Andres Freund
On 2014-10-27 06:29:33 -0300, Alvaro Herrera wrote:
 Amit Langote wrote:
 
   On Mon, Oct 13, 2014 at 04:38:39PM -0300, Alvaro Herrera wrote:
Bruce Momjian wrote:
 I realize there hasn't been much progress on this thread, but I wanted
 to chime in to say I think our current partitioning implementation is
 too heavy administratively, error-prone, and performance-heavy.
   
On the contrary, I think there was lots of progress; there's lots of
useful feedback from the initial design proposal I posted.  I am a bit
sad to admit that I'm not working on it at the moment as I had
originally planned, though, because other priorities slipped in and I am
not able to work on this for a while.  Therefore if someone else wants
to work on this topic, be my guest -- otherwise I hope to get on it in a
few months.
   
   Oh, I just meant code progress --- I agree the discussion was fruitful.
  
  FWIW, I think Robert's criticism regarding not basing this on inheritance
  scheme was not responded to.
 
 It was responded to by ignoring it.  I didn't see anybody else
 supporting the idea that inheritance is in any way a sane thing to base
 partitioning on.  Sure, we have accumulated lots of kludges over the
 years to cope with the fact that, really, it doesn't work very well.  So
 what.  We can keep them, I don't care.

As far as I understood Robert's criticism, it was more about the
internals, than about the userland representation. To me it's absolutely
clear that 'real partitioning' userland shouldn't be based on the
current hacks to allow it. But I do think that a first step very well
might reuse the planner/executor smarts about it. Even a good chunk of
the tablecmd.c logic might be reusable for individual partitions without
much change.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Dynamically change Master(recovery info) without restarting standby server..

2014-10-27 Thread sudalai
Hi,

   Is there any way to change the *master* without restarting the *standby*
server?

The PostgreSQL documentation says:
   -- The recovery.conf file is only read at startup in standby mode. In that
file we specify the master IP.
   -- If you want to change the master, you need to change the recovery.conf
file and restart the standby node.
But
 *I need to change the master IP without restarting the standby node.*





-
sudalai
--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/Dynamically-change-Master-recovery-info-without-restarting-standby-server-tp5824423.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]

2014-10-27 Thread Amit Kapila
On Sat, Oct 25, 2014 at 5:52 PM, Amit Kapila amit.kapil...@gmail.com
wrote:


 ***
 *** 358,363  handle_sigint(SIGNAL_ARGS)
 --- 358,364 

   /* Send QueryCancel if we are processing a database query */
   if (cancelConn != NULL)
   {
 + inAbort = true;
   if (PQcancel(cancelConn, errbuf, sizeof(errbuf)))
   fprintf(stderr, _(Cancel request sent\n));
   else

 Do we need to set inAbort flag incase PQcancel is successful?
 Basically if PQCancel fails due to any reason, I think behaviour
 can be undefined as the executing thread can assume that cancel is
 done.

 *** 391,396  consoleHandler(DWORD dwCtrlType)
 --- 392,399 
   EnterCriticalSection
 (cancelConnLock);
   if (cancelConn != NULL)
   {
 + inAbort =
 true;
 +

 You have set this flag in case of windows handler, however the same
 is never used incase of windows, are you expecting any use of this
 flag for windows?
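
For reference, here is a minimal self-contained sketch of the handler shape
under discussion, mirroring the hunk quoted above; apart from inAbort and
PQcancel(), the names are illustrative rather than the patch's:

#include <signal.h>
#include <stdio.h>
#include <libpq-fe.h>

static PGcancel *volatile cancelConn = NULL;    /* set via PQgetCancel() elsewhere */
static volatile sig_atomic_t inAbort = 0;

static void
handle_sigint(int signum)
{
    char        errbuf[256];

    /* Send QueryCancel if we are processing a database query */
    if (cancelConn != NULL)
    {
        /* mark the abort first, so it stays visible even if PQcancel() fails */
        inAbort = 1;

        if (PQcancel(cancelConn, errbuf, sizeof(errbuf)))
            fprintf(stderr, "Cancel request sent\n");
        else
            fprintf(stderr, "Could not send cancel request: %s", errbuf);
    }
}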

Going further with verification of this patch, I found below issue:
Run the testcase.sql file at below link:
http://www.postgresql.org/message-id/4205e661176a124faf891e0a6ba9135266347...@szxeml509-mbs.china.huawei.com
./vacuumdb --analyze-in-stages -j 8 -d postgres
Generating minimal optimizer statistics (1 target)
Segmentation fault

Server Log:
ERROR:  syntax error at or near minimal at character 12
STATEMENT:  ANALYZE ng minimal optimizer statistics (1 target)
LOG:  could not receive data from client: Connection reset by peer


Fixed below issues and attached an updated patch with mail:

1.
make check for docs gives below errors:
{ \
  echo !ENTITY version \9.5devel\; \
  echo !ENTITY majorversion \9.5\; \
}  version.sgml
'/usr/bin/perl' ./mk_feature_tables.pl YES
../../../src/backend/catalog/sql_feature_packages.txt
../../../src/backend/catalog/sql_features.txt  features-supported.sgml
'/usr/bin/perl' ./mk_feature_tables.pl NO
../../../src/backend/catalog/sql_feature_packages.txt
../../../src/backend/catalog/sql_features.txt  features-unsupported.sgml
'/usr/bin/perl' ./generate-errcodes-table.pl
../../../src/backend/utils/errcodes.txt  errcodes-table.sgml
onsgmls -wall -wno-unused-param -wno-empty -wfully-tagged -D . -D . -s
postgres.sgml
onsgmls:ref/vacuumdb.sgml:224:15:E: end tag for LISTITEM omitted, but
OMITTAG NO was specified
onsgmls:ref/vacuumdb.sgml:209:8: start tag was here
onsgmls:ref/vacuumdb.sgml:224:15:E: end tag for VARLISTENTRY omitted, but
OMITTAG NO was specified
onsgmls:ref/vacuumdb.sgml:206:5: start tag was here
onsgmls:ref/vacuumdb.sgml:224:15:E: end tag for VARIABLELIST omitted, but
OMITTAG NO was specified
onsgmls:ref/vacuumdb.sgml:79:4: start tag was here
onsgmls:ref/vacuumdb.sgml:225:18:E: end tag for element LISTITEM which is
not open
onsgmls:ref/vacuumdb.sgml:226:21:E: end tag for element VARLISTENTRY
which is not open
onsgmls:ref/vacuumdb.sgml:228:18:E: document type does not allow element
VARLISTENTRY here; assuming
missing VARIABLELIST start-tag
onsgmls:ref/vacuumdb.sgml:260:9:E: end tag for element PARA which is not
open
make: *** [check] Error 1

Fixed.

2.
Below multi-line comment is wrong:
/* Otherwise, we got a stage from vacuum_all_databases(), so run
 * only that one. */

Fixed.

3.
fprintf(stderr, _(%s: vacuuming of database \%s\ failed: %s),
progname, dbname, PQerrorMessage(conn));
indentation of fprintf is not proper.

Fixed.

4.
/* This can only happen if user has sent the cacle request using
 * Ctrl+C, Cancle is
handled by 0th slot, so fetch the error result
 */
spelling of cancel is wrong and multi-line comment is not proper.

Fixed

5.
/* This can only happen if user has sent the cacle request using
 * Ctrl+C, Cancle is handled by 0th slot, so fetch the error result
 */

GetQueryResult(pSlot[0].connection, dbname, progname,
completedb);

indentation of completedb parameter is wrong.

Fixed.

6.
/*
 * vacuum_parallel
 * This function will open the multiple concurrent connections as
 * suggested by used, it will derive the table list using server call
 * if table list is not given by user and perform the vacuum call
 */

s/used/user

Fixed.

In general, I think you can once go through all the comments
and code to see if similar issues exist at other places as well.

I have done some performance test with the patch, data for which is
as below:
Performance Data
--
IBM POWER-7 16 cores, 64 hardware threads
RAM = 64GB
max_connections = 128
checkpoint_segments=256
checkpoint_timeout=15min
shared_buffers = 1GB

Before each test, run the testcase.sql file at below link:
http://www.postgresql.org/message-id/4205e661176a124faf891e0a6ba9135266347...@szxeml509-mbs.china.huawei.com

Un-patched -
time ./vacuumdb -t t1 -t t2 -t t3 -t t4 -t t5 -t t6 -t t7 -t t8 -d postgres
real 0m2.454s
user 0m0.002s
sys 0m0.006s

Patched -
time ./vacuumdb -t t1 -t t2 -t t3 -t t4 -t t5 -t t6 -t t7 -t t8 -j 4 -d
postgres

real 0m1.691s
user 0m0.001s
sys 0m0.004s

time ./vacuumdb -t t1 -t t2 -t t3 -t t4 -t t5 -t t6 -t t7 -t t8 -j 8 -d
postgres

real 0m1.496s
user 

Re: [HACKERS] Missing FIN_CRC32 calls in logical replication code

2014-10-27 Thread Andres Freund
On 2014-10-27 12:51:44 +0200, Heikki Linnakangas wrote:
 replication/slot.c and replication/logical/snapbuild.c use a CRC on the
 physical slot and snapshot files. It uses the same algorithm as used e.g.
 for the WAL. However, they are not doing the finalization step, FIN_CRC32()
 on the calculated checksums. Not that it matters much, but it's a bit weird
 and inconsistent, and was probably an oversight.

Hm. Good catch - that's stupid. I wonder what to do about it. I'm
tempted to just add a comment about it to 9.4 and fix it on master as
changing it is essentially a catversion bump. Any objections to that?
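
For readers unfamiliar with the macros involved, the usual pattern looks
roughly like this sketch (macro names as in src/include/utils/pg_crc.h of this
era; the helper function itself is illustrative):

#include "postgres.h"
#include "utils/pg_crc.h"

/* Compute the finalized CRC of a buffer; FIN_CRC32() is the step at issue. */
static pg_crc32
buffer_crc(const char *data, int len)
{
    pg_crc32    crc;

    INIT_CRC32(crc);
    COMP_CRC32(crc, data, len);
    FIN_CRC32(crc);             /* the finalization step slot.c/snapbuild.c omit */

    return crc;
}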

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Tomas Vondra
On 27 October 2014 at 10:47, Heikki Linnakangas wrote:
 On 10/26/2014 11:47 PM, Tomas Vondra wrote:
 After eyeballing the code for an hour or two, I think CREATE DATABASE
 should be fine with performing only a 'partial checkpoint' on the
 template database - calling FlushDatabaseBuffers and processing unlink
 requests, as suggested by the comment in createdb().

 Hmm. You could replace the first checkpoint with that, but I don't think
 that's enough for the second. To get any significant performance
 benefit, you need to get rid of both checkpoints, because doing two
 checkpoints one after another is almost as fast as doing a single
 checkpoint; the second checkpoint has very little work to do because the
 first checkpoint already flushed out everything.

Yes, that's why I wrote that two checkpoints not too far away are
effectively a single checkpoint. OTOH if the template database is not
small, it may take a while to copy the data, increasing the distance
between the checkpoints.

While our template databases are small (in the order of ~10-100MB), there
are probably people using this to clone much larger databases.

 The second checkpoint, after copying but before commit, is done because
 (from the comments in createdb function):

  * #1: When PITR is off, we don't XLOG the contents of newly created
  * indexes; therefore the drop-and-recreate-whole-directory behavior
  * of DBASE_CREATE replay would lose such indexes.

  * #2: Since we have to recopy the source database during DBASE_CREATE
  * replay, we run the risk of copying changes in it that were
  * committed after the original CREATE DATABASE command but before the
  * system crash that led to the replay.  This is at least unexpected
  * and at worst could lead to inconsistencies, eg duplicate table
  * names.

 Doing only FlushDatabaseBuffers would not prevent these issues - you
 need a full checkpoint. These issues are better explained here:
 http://www.postgresql.org/message-id/28884.1119727...@sss.pgh.pa.us

 To solve #1, we could redesign CREATE DATABASE so that replaying the
 DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
 any files. We could instead just assume that if the transaction commits,
 all the files have been copied and fsync'd already, like we assume that
 if a CREATE INDEX commits in wal_level=minimal, the underlying file was
 fsync'd before the commit.

 That would also solve #2, when doing crash recovery. But it would remain
 when doing archive recovery. I guess we could still redo the copy when
 in archive recovery mode. I believe it would be the first time we have a
 WAL record that's replayed differently in crash recovery than in archive
 recovery, so here be dragons...

Yeah ...

 It's not exactly trivial change, but it does not seem frighteningly
 difficult coding either.

 The templates are usually static, so this would minimize both the CREATE
 DATABASE duration and disruption to the cluster it causes.

 I wonder if we should bite the bullet and start WAL-logging all the
 files that are copied from the template database to the new database.
 When the template database is small (template0 is 6.4MB currently), that
 wouldn't generate too much WAL. We could perhaps do that only if the
 template database is small, and do the checkpoints otherwise, although I
 wouldn't like to have subtly different behavior depending on database
 size like that.

IMHO writing all the data into a WAL would be the cleanest solution.

Also, what is a small database? I don't think a static value will work,
because the sweet spot between the current approach (forcing two
checkpoints) and writing everything in WAL depends on the amount of dirty
buffers that need to be checkpointed. Which is mostly driven by the size
of shared buffers and write activity - for small shared buffers and/or
mostly-read workload, checkpoints are cheap, so the 'small database'
threshold (when choosing the WAL approach) is much higher than for large
shared buffers or write-heavy workloads.

So maybe if we could determine the amount of data to be checkpointed, and
then base the decision on that, that'd work better? This would also have
to take into account that writing into WAL is sequential, while
checkpoints usually cause random writes all over the datafiles (which is
more expensive).

Another option might be forcing just a spread checkpoint, not the
immediate one (which is what we do now). That would not fix the CREATE
DATABASE duration (actually, it would make it longer), but it would lower
the impact on other activity on the machine.

regards
Tomas



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

2014-10-27 Thread Fujii Masao
On Fri, Oct 24, 2014 at 10:05 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:

 At least for master, we should consider changing the way the archiving
 works so that we only archive WAL that was generated in the same server.
 I.e. we should never try to archive WAL files belonging to another
 timeline.

 I just remembered that we discussed a different problem related to this
 some time ago, at

 http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyot...@lab.ntt.co.jp.
 The conclusion of that was that at promotion, we should not archive the
 last, partial, segment from the old timeline.


 So, this is what I came up with for master. Does anyone see a problem with
 it?

What about the problem that I raised upthread? This is, the patch
prevents the last, partial, WAL file of the old timeline from being archived.
So we can never PITR the database to the point that the last, partial WAL
file has. Isn't this problem? For example, please imagine the
following scenario.

1. The important data was deleted but no one noticed that. This deletion was
logged in last, partial WAL file.
2. The server crashed and DBA started an archive recovery from old backup.
3. After recovery, all WAL files of the old timeline were recycled.
4. Finally DBA noticed the loss of important data and tried to do PITR
to the point
where the data was deleted.

HOWEVER, the WAL file containing that deletion operation no longer exists.
So DBA will never be able to recover that important data 

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] alter user set local_preload_libraries.

2014-10-27 Thread Fujii Masao
On Tue, Oct 21, 2014 at 3:16 PM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
 Wow.

  By the way, I became unable to log in at all after wrongly setting
  *_preload_libraries for all available users. Are there any relief
  measures for the situation? I think it's okay even if there's no
  way to log in again, but I want to know if there are any.

 Yep, that's a problem. You can login to the server even in that case
 by, for example, executing the following commands, though.

 $ export PGOPTIONS=-c *_preload_libraries=
 $ psql

 Thank you. I see. It seems enough for me. The section 18.1.3 of
 9.4 document describes about it,

 http://www.postgresql.org/docs/9.4/static/config-setting.html

 18.1.4. Parameter Interaction via Shell
 ..
   On the libpq-client, command-line options can be specified
   using the PGOPTIONS environment variable. When connecting to
   the server, the contents of this variable are sent to the
   server as if they were being executed via SQL SET at the
   beginning of the session.

 This implicitly denies PGC_BACKEND variables to be set by this
 method but it also seems not proper to describe precisely...

Sorry, probably I failed to understand your point. Could you tell me
what you're suggesting? You're thinking that the description needs to
be updated? How?

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing lock strength of adding foreign keys

2014-10-27 Thread Robert Haas
Thanks for weighing in, Noah.

On Sat, Oct 25, 2014 at 2:00 PM, Noah Misch n...@leadboat.com wrote:
 http://www.postgresql.org/message-id/ca+tgmoy4glsxzk0tao29-ljtcuj0sl1xwcwq51xb-hfysgi...@mail.gmail.com
 http://www.postgresql.org/message-id/20893.1393892...@sss.pgh.pa.us
 http://www.postgresql.org/message-id/20140306224340.ga3551...@tornado.leadboat.com

 As far as triggers are concerned, the issue of skew between the
 transaction snapshot and what the ruleutils.c snapshots do seems to be
 the principal issue.  Commit e5550d5fec66aa74caad1f79b79826ec64898688
 changed pg_get_constraintdef() to use an MVCC snapshot rather than a
 current MVCC snapshot; if that change is safe, I am not aware of any
 reason why we couldn't change pg_get_triggerdef() similarly.

 pg_get_triggerdef() is fine as-is with concurrent CREATE TRIGGER.  The
 pg_get_constraintdef() change arose to ensure a consistent result when
 concurrent ALTER TABLE VALIDATE CONSTRAINT mutates a constraint definition.
 (Reducing the lock level of DROP TRIGGER or ALTER TRIGGER, however, would
 create the analogous problem for pg_get_triggerdef().)

Maybe so, but I'd favor changing it anyway and getting it over with.
The current situation seems to have little to recommend it; moreover,
it would be nice, if it's possible and safe, to weaken the lock levels
for all three of those commands at the same time.  Do you see any
hazards for ALTER or DROP that do not exist for CREATE?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing lock strength of adding foreign keys

2014-10-27 Thread Robert Haas
On Sun, Oct 26, 2014 at 9:48 PM, Andreas Karlsson andr...@proxel.se wrote:
 Agreed.. But I think reducing the lock level of the secondary table is much
 more important than doing the same for the primary table due to the case
 where the secondary table is an existing table which is hit by a workload of
 long running queries and DML while the primary is a new table which is added
 now. In my dream world I could add the new table without any disruption at
 all of queries using the secondary table, no matter the duration of the
 transaction adding the table (barring insertion of actual data into the
 primary table, which would take row locks).

That would indeed be nice, but it doesn't seem very practical, because
the parent needs triggers, too: if you try to delete a row from the
parent, or update the key, it's got to go look at the child and see
whether there are rows against the old value.  Then it's got to either
update those rows, or null out the value, or throw an error.
Regardless of which of those things it does (which depends on the ON
DELETE and ON UPDATE settings you choose), it's hard to imagine that
it would be a good idea for any of those things to start happening in
the middle of a transaction or statement.

So, let's take what we can get.  :-)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Atri Sharma



 IMHO writing all the data into a WAL would be the cleanest solution.

 Also, what is a small database? I don't think a static value will work,
 because the sweet spot between the current approach (forcing two
 checkpoints) and writing everything in WAL depends on the amount of dirty
 buffers that need to be checkpointed. Which is mostly driven by the size
 of shared buffers and write activity - for small shared buffers and/or
 mostly-read workload, checkpoints are cheap, so the 'small database'
 threshold (when choosing the WAL approach) is much higher than for large
 shared buffers or write-heavy workloads.


So are you proposing having a heuristic based on the amount of data in
shared buffers and write activity? Do you have something in mind that works
for general workloads as well?


 So maybe if we could determine the amount of data to be checkpointed, and
 then base the decision on that, that'd work better? This would also have
 to take into account that writing into WAL is sequential, while
 checkpoints usually cause random writes all over the datafiles (which is
 more expensive).

 Another option might be forcing just a spread checkpoint, not the
 immediate one (which is what we do now). That would not fix the CREATE
 DATABASE duration (actually, it would make it longer), but it would lower
 the impact on other activity on the machine.



I believe this to be the cleanest way to reduce the amount of I/O
generated. If I understand correctly, the original problem you mentioned
was not the time CREATE DATABASE is taking but rather the amount of I/O
each one is generating.

This also leads me to wonder whether it makes sense to explore group commits
around the creation of files for a new database (for the same backend, of
course). This might be available on demand, if the user knows he/she is going to
create a lot of databases in the near future and is fine with a large spike in
I/O in one go. Again, it might be even more broken than the current scenario,
but it depends on what the user wants...

Regards,

Atri



-- 
Regards,

Atri
*l'apprenant*


Re: [HACKERS] Possible problem with shm_mq spin lock

2014-10-27 Thread Robert Haas
On Sat, Oct 25, 2014 at 9:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Haribabu Kommi kommi.harib...@gmail.com writes:
 Thanks for the details. I am sorry It is not proc_exit. It is the exit
 callback functions that can cause problem.

 The following is the callstack where the problem can happen, if the signal
 handler is called after the spin lock took by the worker.

 Breakpoint 1, 0x0072dd83 in shm_mq_detach ()
 (gdb) bt
 #0  0x0072dd83 in shm_mq_detach ()
 #1  0x0072e7db in shm_mq_detach_callback ()
 #2  0x00726d71 in dsm_detach ()
 #3  0x00726c43 in dsm_backend_shutdown ()
 #4  0x00727450 in shmem_exit ()
 #5  0x007272fc in proc_exit_prepare ()
 #6  0x00727501 in atexit_callback ()
 #7  0x0030ff435da2 in exit () from /lib64/libc.so.6
 #8  0x006ddaec in bgworker_quickdie ()

 Or in other words, Robert broke it.  This control path should absolutely
 not occur: the entire point of the on_exit_reset call in quickdie() is to
 prevent any callbacks from being executed when we get to shmem_exit().
 DSM-related functions DO NOT get an exemption.

All true.  However, Robert also fixed it, in commit
cb9a0c7987466b130fbced01ab5d5481cf3a16df, when you complained about it
previously.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Tomas Vondra
On 27 October 2014 at 13:50, Atri Sharma wrote:



 IMHO writing all the data into a WAL would be the cleanest solution.

 Also, what is a small database? I don't think a static value will work,
 because the sweet spot between the current approach (forcing two
 checkpoints) and writing everything in WAL depends on the amount of
 dirty
 buffers that need to be checkpointed. Which is mostly driven by the size
 of shared buffers and write activity - for small shared buffers and/or
 mostly-read workload, checkpoints are cheap, so the 'small database'
 threshold (when choosing the WAL approach) is much higher than for large
 shared buffers or write-heavy workloads.


 So are you proposing having a heuristic based on the amount of data in
 shared buffers and write activity? Do you have something in mind that
 works
 for general workloads as well?


 So maybe if we could determine the amount of data to be checkpointed,
 and
 then base the decision on that, that'd work better? This would also have
 to take into account that writing into WAL is sequential, while
 checkpoints usually cause random writes all over the datafiles (which is
 more expensive).

 Another option might be forcing just a spread checkpoint, not the
 immediate one (which is what we do now). That would not fix the CREATE
 DATABASE duration (actually, it would make it longer), but it would
 lower
 the impact on other activity on the machine.



 I believe this to be the cleanest way to reduce the amount of I/O
 generated. If I understand correctly, the original problem you mentioned
 was not the time CREATE DATABASE is taking but rather the amount of I/O
 each one is generating.

Not exactly. There are two related issues, both caused by the I/O activity
from the CHECKPOINT.

(a) I/O spike, because of checkpointing everything
(b) long CREATE DATABASE durations

This spread CREATE DATABASE only fixes (a), and makes (b) a bit worse.

It however makes the 'create databases in advance' a bit more flexible,
because it allows more frequent runs of the cron job that actually creates
them. Right now we're forced to schedule it rather rarely (say, 1/day) to
minimize the impact, which has downsides (higher probability of running
out of spare dbs, slow response to updated template etc.).

If we can fix both (a) and (b), that'd be great. If we can fix only (a),
so be it.

 This also leads me to think if it makes sense to explore group commits
 around the creation of files for a new database (for a same backend, of
 course). This might be on call, if the user knows he/she is going to
 create
 a lot of databases in the near future and is fine with a large spike in
 I/O
 at one go. Again, might be even more broken than the current scenario, but
 depends on what the user wants...

That seems much more complex than what I proposed to do, but maybe I'm
wrong. If we could get rid of the 'checkpoint everything' step altogether, we
wouldn't really need the group commit, no?

regards
Tomas



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Master ip from hot_standby..

2014-10-27 Thread Michael Paquier
On Mon, Oct 27, 2014 at 8:15 PM, sudalai sudala...@gmail.com wrote:
I need to query master ip from hot_standby.
 *pg_stat_replication*  view only shows the slave replication status.

 Is there any way to get *Master IP* from standby node apart from checking
 *recovery.conf* file.
That's the way to go as it is in primary_conninfo. Recovery parameters
are only loaded by the startup process after reading it from
recovery.conf.

Note that this would be more a question for pgsql-general or
pgsql-novice but surely not hackers.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes

2014-10-27 Thread Fujii Masao
On Fri, Oct 17, 2014 at 1:52 PM, Rahila Syed rahilasye...@gmail.com wrote:
 Hello,

 Please find the updated patch attached.

Thanks for updating the patch! Here are the comments.

The patch isn't applied to the master cleanly.

I got the following compiler warnings.

xlog.c:930: warning: ISO C90 forbids mixed declarations and code
xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code
xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code

The compilation of the document failed with the following error message.

openjade:config.sgml:2188:12:E: end tag for element TERM which is not open
make[3]: *** [HTML.index] Error 1

Only the backend calls CompressBackupBlocksPagesAlloc when SIGHUP is sent.
Why does only the backend need to do that? What about other processes which
can write FPWs, e.g., autovacuum?

Do we release the buffers for compressed data when fpw is changed from
compress to on?

+if (uncompressedPages == NULL)
+{
+uncompressedPages = (char *)malloc(XLR_TOTAL_BLCKSZ);
+if (uncompressedPages == NULL)
+outOfMem = 1;
+}

The memory is always (i.e., even when fpw=on) allocated to uncompressedPages,
but not to compressedPages. Why? I guess that the test of fpw needs to be there.

Regards,

-- 
Fujii Masao


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Missing FIN_CRC32 calls in logical replication code

2014-10-27 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-10-27 12:51:44 +0200, Heikki Linnakangas wrote:
 replication/slot.c and replication/logical/snapbuild.c use a CRC on the
 physical slot and snapshot files. It uses the same algorithm as used e.g.
 for the WAL. However, they are not doing the finalization step, FIN_CRC32()
 on the calculated checksums. Not that it matters much, but it's a bit weird
 and inconsistent, and was probably an oversight.

 Hm. Good catch - that's stupid. I wonder what to do about it. I'm
 tempted to just add a comment about it to 9.4 and fix it on master as
 changing it is essentially a catversion bump. Any objections to that?

Yeah, I think you should get it right the first time.  It hardly seems
likely that any 9.4 beta testers are depending on those files to be stable
yet.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Lockless StrategyGetBuffer() clock sweep

2014-10-27 Thread Andres Freund
Hi,

I've previously posted a patch at
http://archives.postgresql.org/message-id/20141010160020.GG6670%40alap3.anarazel.de
that reduces contention in StrategyGetBuffer() by making the clock sweep
lockless.  Robert asked me to post it to a new thread; I originally
wrote it to see some other contention in more detail, that's why it
ended up in that thread...

The performance numbers I've quoted over there are:
 Test:
 pgbench  -M prepared -P 5 -S -c 496 -j 496 -T 5000
 on a scale=1000 database, with 4GB of shared buffers.
 
 Before:
 progress: 40.0 s, 136252.3 tps, lat 3.628 ms stddev 4.547
 progress: 45.0 s, 135049.0 tps, lat 3.660 ms stddev 4.515
 progress: 50.0 s, 135788.9 tps, lat 3.640 ms stddev 4.398
 progress: 55.0 s, 135268.4 tps, lat 3.654 ms stddev 4.469
 progress: 60.0 s, 134991.6 tps, lat 3.661 ms stddev 4.739
 
 after:
 progress: 40.0 s, 207701.1 tps, lat 2.382 ms stddev 3.018
 progress: 45.0 s, 208022.4 tps, lat 2.377 ms stddev 2.902
 progress: 50.0 s, 209187.1 tps, lat 2.364 ms stddev 2.970
 progress: 55.0 s, 206462.7 tps, lat 2.396 ms stddev 2.871
 progress: 60.0 s, 210263.8 tps, lat 2.351 ms stddev 2.914

Imo the patch doesn't complicate the logic noticeably...

I do wonder if we could make the freelist accesses lockless as well -
but I think that's harder. So I don't want to mix that with this.
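
For readers skimming the diff, the core of the lockless hot path is just an
atomic fetch-and-add mapped onto a buffer id, roughly as below; the function
name is illustrative, the atomic call is the one the patch uses:

static inline int
ClockSweepTick(void)
{
    uint32      victim;

    /* the hand only ever advances; concurrent callers each get a distinct tick */
    victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

    /* map the monotonically increasing counter onto an actual buffer id */
    return victim % NBuffers;
}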

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
From 6b486e5b467e94ab9297d7656a5b39b816c5c55a Mon Sep 17 00:00:00 2001
From: Andres Freund and...@anarazel.de
Date: Fri, 10 Oct 2014 17:36:46 +0200
Subject: [PATCH] WIP: lockless StrategyGetBuffer hotpath

---
 src/backend/storage/buffer/freelist.c | 154 --
 1 file changed, 90 insertions(+), 64 deletions(-)

diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 5966beb..0c634a0 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -18,6 +18,12 @@
 #include storage/buf_internals.h
 #include storage/bufmgr.h
 
+#include port/atomics.h
+
+
+#define LATCHPTR_ACCESS_ONCE(var)	((Latch *)(*((volatile Latch **)(var
+#define INT_ACCESS_ONCE(var)	((int)(*((volatile int *)(var
+
 
 /*
  * The shared freelist control information.
@@ -27,8 +33,12 @@ typedef struct
 	/* Spinlock: protects the values below */
 	slock_t		buffer_strategy_lock;
 
-	/* Clock sweep hand: index of next buffer to consider grabbing */
-	int			nextVictimBuffer;
+	/*
+	 * Clock sweep hand: index of next buffer to consider grabbing. Note that
+	 * this isn't a concrete buffer - we only ever increase the value. So, to
+	 * get an actual buffer, it needs to be used modulo NBuffers.
+	 */
+	pg_atomic_uint32 nextVictimBuffer;
 
 	int			firstFreeBuffer;	/* Head of list of unused buffers */
 	int			lastFreeBuffer; /* Tail of list of unused buffers */
@@ -42,8 +52,8 @@ typedef struct
 	 * Statistics.  These counters should be wide enough that they can't
 	 * overflow during a single bgwriter cycle.
 	 */
-	uint32		completePasses; /* Complete cycles of the clock sweep */
-	uint32		numBufferAllocs;	/* Buffers allocated since last reset */
+	pg_atomic_uint32 completePasses; /* Complete cycles of the clock sweep */
+	pg_atomic_uint32 numBufferAllocs;	/* Buffers allocated since last reset */
 
 	/*
 	 * Notification latch, or NULL if none.  See StrategyNotifyBgWriter.
@@ -124,87 +134,107 @@ StrategyGetBuffer(BufferAccessStrategy strategy)
 			return buf;
 	}
 
-	/* Nope, so lock the freelist */
-	SpinLockAcquire(StrategyControl-buffer_strategy_lock);
-
 	/*
 	 * We count buffer allocation requests so that the bgwriter can estimate
 	 * the rate of buffer consumption.  Note that buffers recycled by a
 	 * strategy object are intentionally not counted here.
 	 */
-	StrategyControl-numBufferAllocs++;
+	pg_atomic_fetch_add_u32(StrategyControl-numBufferAllocs, 1);
 
 	/*
-	 * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-	 * not do so while holding buffer_strategy_lock; so release and re-grab.
-	 * This is annoyingly tedious, but it happens at most once per bgwriter
-	 * cycle, so the performance hit is minimal.
+	 * If bgwriterLatch is set, we need to waken the bgwriter. We don't want
+	 * to check this while holding the spinlock, so we the latch from memory
+	 * once to see whether it's set. There's no need to do so with a lock
+	 * present - we'll just set the latch next call if we missed it once.
+	 *
+	 * Since we're not guaranteed atomic 8 byte reads we need to acquire the
+	 * spinlock if not null to be sure we get a correct pointer. Because we
+	 * don't want to set the latch while holding the buffer_strategy_lock we
+	 * just grab the lock to read and reset the pointer.
 	 */
-	bgwriterLatch = StrategyControl-bgwriterLatch;
+	bgwriterLatch = LATCHPTR_ACCESS_ONCE(StrategyControl-bgwriterLatch);
 	if (bgwriterLatch)
 	{
+		/* we don't have guaranteed 

Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 10/27/2014 03:21 PM, Tomas Vondra wrote:
 Thinking about this a bit more, do we really need a full checkpoint? That
 is a checkpoint of all the databases in the cluster? Why checkpointing the
 source database is not enough?

 A full checkpoint ensures that you always begin recovery *after* the 
 DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during 
 crash recovery (except when you crash before the transaction commits, in 
 which case it doesn't matter if the new database's directory is borked).

Yeah.  After re-reading the 2005 thread, I wonder if we shouldn't just
bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
all the copied files instead of doing a cp -r-equivalent directory copy.
That would fix a number of existing replay hazards as well as making it
safe to do what Tomas wants.  In the small scale this would cause more I/O
(2 copies of the template database's data) but in production situations
we might well come out ahead by avoiding a forced checkpoint of the rest
of the cluster.  Also I guess we could skip WAL-logging if WAL archiving
is off, similarly to the existing optimization for CREATE INDEX etc.
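
For concreteness, a block-by-block copy that WAL-logs every page might look
loosely like the existing copy_relation_data() used by ALTER TABLE ... SET
TABLESPACE; the sketch below is illustrative only, not a worked-out patch:

#include "postgres.h"
#include "access/heapam_xlog.h"     /* log_newpage() */
#include "storage/bufpage.h"
#include "storage/smgr.h"

/*
 * Copy one fork block by block, WAL-logging every page image when required,
 * so that replay can rebuild the file without re-copying the template.
 */
static void
copy_fork_wal_logged(SMgrRelation src, SMgrRelation dst,
                     ForkNumber forkNum, bool use_wal)
{
    BlockNumber nblocks = smgrnblocks(src, forkNum);
    BlockNumber blkno;
    char       *buf = (char *) palloc(BLCKSZ);

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        smgrread(src, forkNum, blkno, buf);

        /* WAL-log the page image so recovery can reconstruct it */
        if (use_wal)
            log_newpage(&dst->smgr_rnode.node, forkNum, blkno,
                        (Page) buf, false);

        smgrextend(dst, forkNum, blkno, buf, true);
    }

    /* make sure the new file reaches disk before the creating xact commits */
    smgrimmedsync(dst, forkNum);

    pfree(buf);
}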

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Andres Freund
On 2014-10-27 09:46:41 -0400, Tom Lane wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
  On 10/27/2014 03:21 PM, Tomas Vondra wrote:
  Thinking about this a bit more, do we really need a full checkpoint? That
  is a checkpoint of all the databases in the cluster? Why checkpointing the
  source database is not enough?
 
  A full checkpoint ensures that you always begin recovery *after* the 
  DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during 
  crash recovery (except when you crash before the transaction commits, in 
  which case it doesn't matter if the new database's directory is borked).
 
 Yeah.  After re-reading the 2005 thread, I wonder if we shouldn't just
 bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
 all the copied files instead of doing a cp -r-equivalent directory copy.
 That would fix a number of existing replay hazards as well as making it
 safe to do what Tomas wants.  In the small scale this would cause more I/O
 (2 copies of the template database's data) but in production situations
 we might well come out ahead by avoiding a forced checkpoint of the rest
 of the cluster.  Also I guess we could skip WAL-logging if WAL archiving
 is off, similarly to the existing optimization for CREATE INDEX etc.

+1.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb generator functions

2014-10-27 Thread Andrew Dunstan


On 10/15/2014 03:54 PM, Andrew Dunstan wrote:




I checked the code, and I have only two small objections - the name 
jsonb_object_two_arg is not good - maybe json_object_keys_values?


It's consistent with the existing json_object_two_arg. In all cases I 
think I kept the names the same except for changing json to jsonb. 
Note that these _two_arg functions are not visible at the SQL level - 
they are only visible in the C code.


I'm happy to be guided by others in changing or keeping these names.



Next: there are no tests for the to_jsonb function.




Oh, my bad. I'll add some.

Thank for the review.




Here is a new patch that includes documentation and addresses all these 
issues, except that I didn't change the name of jsonb_object_two_arg to 
keep it consistent with the name of json_object_two_arg. I'm happy to 
change both if people feel it matters.


cheers

andrew
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7e5bcd9..21ce293 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10245,9 +10245,10 @@ table2-mapping
 
   para
xref linkend=functions-json-creation-table shows the functions that are
-   available for creating typejson/type values.
-   (Currently, there are no equivalent functions for typejsonb/, but you
-   can cast the result of one of these functions to typejsonb/.)
+   available for creating typejson/type and typejsonb/type values.
+   (There are no equivalent functions for typejsonb/, of the literalrow_to_json/
+   and literalarray_to_json/ functions. However, the literalto_jsonb/
+   function supplies much the same functionality as these functions would.)
   /para
 
   indexterm
@@ -10268,6 +10269,18 @@ table2-mapping
   indexterm
primaryjson_object/primary
   /indexterm
+  indexterm
+   primaryto_jsonb/primary
+  /indexterm
+  indexterm
+   primaryjsonb_build_array/primary
+  /indexterm
+  indexterm
+   primaryjsonb_build_object/primary
+  /indexterm
+  indexterm
+   primaryjsonb_object/primary
+  /indexterm
 
   table id=functions-json-creation-table
 titleJSON Creation Functions/title
@@ -10282,17 +10295,19 @@ table2-mapping
  /thead
  tbody
   row
+   entryparaliteralto_json(anyelement)/literal
+  /paraparaliteralto_jsonb(anyelement)/literal
+   /para/entry
entry
- literalto_json(anyelement)/literal
-   /entry
-   entry
- Returns the value as JSON.  Arrays and composites are converted
+ Returns the value as typejson/ or typejsonb/.
+ Arrays and composites are converted
  (recursively) to arrays and objects; otherwise, if there is a cast
- from the type to typejson/type, the cast function will be used to
- perform the conversion; otherwise, a JSON scalar value is produced.
+ from the type to typejson/type or typejsonb/ in the case of 
+ literalto_jsonb/, the cast function will be used to
+ perform the conversion; otherwise, a scalar value is produced.
  For any scalar type other than a number, a Boolean, or a null value,
- the text representation will be used, properly quoted and escaped
- so that it is a valid JSON string.
+ the text representation will be used, in such a fashion that it is a 
+ valid typejson/ or typejsonb/ value.
/entry
entryliteralto_json('Fred said Hi.'::text)/literal/entry
entryliteralFred said \Hi.\/literal/entry
@@ -10321,9 +10336,9 @@ table2-mapping
entryliteral{f1:1,f2:foo}/literal/entry
   /row
   row
-   entry
- literaljson_build_array(VARIADIC any)/literal
-   /entry
+   entryparaliteraljson_build_array(VARIADIC any)/literal
+  /paraparaliteraljsonb_build_array(VARIADIC any)/literal
+   /para/entry
entry
  Builds a possibly-heterogeneously-typed JSON array out of a variadic
  argument list.
@@ -10332,9 +10347,9 @@ table2-mapping
entryliteral[1, 2, 3, 4, 5]/literal/entry
   /row
   row
-   entry
- literaljson_build_object(VARIADIC any)/literal
-   /entry
+   entryparaliteraljson_build_object(VARIADIC any)/literal
+  /paraparaliteraljsonb_build_object(VARIADIC any)/literal
+   /para/entry
entry
  Builds a JSON object out of a variadic argument list.  By
  convention, the argument list consists of alternating
@@ -10344,9 +10359,9 @@ table2-mapping
entryliteral{foo: 1, bar: 2}/literal/entry
   /row
   row
-   entry
- literaljson_object(text[])/literal
-   /entry
+   entryparaliteraljson_object(text[])/literal
+  /paraparaliteraljsonb_object(text[])/literal
+   /para/entry
entry
  Builds a JSON object out of a text array.  The array must have either
  exactly one dimension with an even number of members, in which case
@@ -10359,9 +10374,9 @@ table2-mapping
entryliteral{a: 1, b: def, c: 

Re: [HACKERS] Review of GetUserId() Usage

2014-10-27 Thread Stephen Frost
* Stephen Frost (sfr...@snowman.net) wrote:
 Attached is a patch to address the pg_cancel/terminate_backend and the
 statistics info as discussed previously.  It sounds like we're coming to

And I forgot the attachment, of course.  Apologies.

Thanks,

Stephen
diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c
new file mode 100644
index 67539ec..6924fb7
*** a/src/backend/utils/adt/misc.c
--- b/src/backend/utils/adt/misc.c
***
*** 37,42 
--- 37,43 
  #include utils/lsyscache.h
  #include utils/ruleutils.h
  #include tcop/tcopprot.h
+ #include utils/acl.h
  #include utils/builtins.h
  #include utils/timestamp.h
  
*** pg_signal_backend(int pid, int sig)
*** 113,121 
  		return SIGNAL_BACKEND_ERROR;
  	}
  
! 	if (!(superuser() || proc-roleId == GetUserId()))
  		return SIGNAL_BACKEND_NOPERMISSION;
  
  	/*
  	 * Can the process we just validated above end, followed by the pid being
  	 * recycled for a new process, before reaching here?  Then we'd be trying
--- 114,127 
  		return SIGNAL_BACKEND_ERROR;
  	}
  
! 	/* Only allow superusers to signal superuser-owned backends. */
! 	if (superuser_arg(proc-roleId)  !superuser())
  		return SIGNAL_BACKEND_NOPERMISSION;
  
+ 	/* Users can signal their own backends (including through membership) */
+ 	if (!has_privs_of_role(GetUserId(), proc-roleId))
+ 			return SIGNAL_BACKEND_NOPERMISSION;
+ 
  	/*
  	 * Can the process we just validated above end, followed by the pid being
  	 * recycled for a new process, before reaching here?  Then we'd be trying
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
new file mode 100644
index d621a68..3663e93
*** a/src/backend/utils/adt/pgstatfuncs.c
--- b/src/backend/utils/adt/pgstatfuncs.c
***
*** 20,25 
--- 20,26 
  #include libpq/ip.h
  #include miscadmin.h
  #include pgstat.h
+ #include utils/acl.h
  #include utils/builtins.h
  #include utils/inet.h
  #include utils/timestamp.h
*** pg_stat_get_activity(PG_FUNCTION_ARGS)
*** 674,681 
  		else
  			nulls[15] = true;
  
! 		/* Values only available to same user or superuser */
! 		if (superuser() || beentry-st_userid == GetUserId())
  		{
  			SockAddr	zero_clientaddr;
  
--- 675,682 
  		else
  			nulls[15] = true;
  
! 		/* Values only available to role member */
! 		if (has_privs_of_role(GetUserId(), beentry-st_userid))
  		{
  			SockAddr	zero_clientaddr;
  
*** pg_stat_get_backend_activity(PG_FUNCTION
*** 877,883 
  
  	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
  		activity = backend information not available;
! 	else if (!superuser()  beentry-st_userid != GetUserId())
  		activity = insufficient privilege;
  	else if (*(beentry-st_activity) == '\0')
  		activity = command string not enabled;
--- 878,884 
  
  	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
  		activity = backend information not available;
! 	else if (!has_privs_of_role(GetUserId(), beentry-st_userid))
  		activity = insufficient privilege;
  	else if (*(beentry-st_activity) == '\0')
  		activity = command string not enabled;
*** pg_stat_get_backend_waiting(PG_FUNCTION_
*** 898,904 
  	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
  		PG_RETURN_NULL();
  
! 	if (!superuser()  beentry-st_userid != GetUserId())
  		PG_RETURN_NULL();
  
  	result = beentry-st_waiting;
--- 899,905 
  	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
  		PG_RETURN_NULL();
  
! 	if (!has_privs_of_role(GetUserId(), beentry-st_userid))
  		PG_RETURN_NULL();
  
  	result = beentry-st_waiting;
*** pg_stat_get_backend_activity_start(PG_FU
*** 917,923 
  	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
  		PG_RETURN_NULL();
  
! 	if (!superuser()  beentry-st_userid != GetUserId())
  		PG_RETURN_NULL();
  
  	result = beentry-st_activity_start_timestamp;
--- 918,924 
  	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
  		PG_RETURN_NULL();
  
! 	if (!has_privs_of_role(GetUserId(), beentry-st_userid))
  		PG_RETURN_NULL();
  
  	result = beentry-st_activity_start_timestamp;
*** pg_stat_get_backend_xact_start(PG_FUNCTI
*** 943,949 
  	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
  		PG_RETURN_NULL();
  
! 	if (!superuser()  beentry-st_userid != GetUserId())
  		PG_RETURN_NULL();
  
  	result = beentry-st_xact_start_timestamp;
--- 944,950 
  	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
  		PG_RETURN_NULL();
  
! 	if (!has_privs_of_role(GetUserId(), beentry-st_userid))
  		PG_RETURN_NULL();
  
  	result = beentry-st_xact_start_timestamp;
*** pg_stat_get_backend_start(PG_FUNCTION_AR
*** 965,971 
  	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
  		PG_RETURN_NULL();
  
! 	if (!superuser()  beentry-st_userid != GetUserId())
  		PG_RETURN_NULL();
  
  	result = beentry-st_proc_start_timestamp;

Re: [HACKERS] Review of GetUserId() Usage

2014-10-27 Thread Stephen Frost
All,

* Peter Eisentraut (pete...@gmx.net) wrote:
 It would be weird if it were inconsistent: some things require role
 membership, some things require SET ROLE.  Try explaining that.

Attached is a patch to address the pg_cancel/terminate_backend and the
statistics info as discussed previously.  It sounds like we're coming to
consensus that this is the correct approach.  Comments are always
welcome, of course, but it's a pretty minor patch and I'd like to move
forward with it soon against master.

We'll rebase the role attributes patch with these changes and update
that patch for review (with a few other changes) after.

Thanks!

Stephen


signature.asc
Description: Digital signature


[HACKERS] cost_index()

2014-10-27 Thread Teodor Sigaev

Hi!

Some fragment of code (src/backend/optimizer/path/costsize.c, lineno ~400):
/*
 * Normal case: apply the Mackert and Lohman formula, and then
 * interpolate between that and the correlation-derived result.
 */
pages_fetched = index_pages_fetched(tuples_fetched,
baserel-pages,
(double) index-pages,
root);
if (indexonly)
pages_fetched = ceil(pages_fetched * (1.0 - baserel-allvisfrac));

As I understand the code, index_pages_fetched() returns the total number of pages 
read for the index and heap together. But on the next line this is recalculated with 
the all-visible fraction of the heap. After a recent vacuum that fraction could be 1.0, 
and then pages_fetched will be zero. That seems obviously wrong to me, because for an 
index-only scan it could be true only for the heap pages, not for the index pages.


Am I wrong, or am I missing something?



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cost_index()

2014-10-27 Thread Tom Lane
Teodor Sigaev teo...@sigaev.ru writes:
  if (indexonly)
      pages_fetched = ceil(pages_fetched * (1.0 - baserel-allvisfrac));

 As I understand the code, index_pages_fetched() returns the total number of
 pages read for the index and heap together.

No.  Costs for touching the index were computed by the amcostestimate
function; this code is solely about estimating costs for touching the
heap.

 But on the next line this is recalculated with the all-visible fraction of
 the heap. After a recent vacuum that fraction could be 1.0, and then
 pages_fetched will be zero. That seems obviously wrong to me, because for an
 index-only scan it could be true only for the heap pages, not for the index pages.

Seems correct to me: if it's an index-only scan, and all pages are
all-visible, then indeed no heap pages will be fetched.

(Note that the cost model doesn't charge anything for touching the
visibility map.  This might be too simplistic, but certainly that cost
should be much smaller than touching the heap, since the map is small
enough that it should stay resident in shared buffers.)
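
(Worked example with invented numbers: if the Mackert-Lohman estimate says 200
heap pages would be fetched and allvisfrac is 0.95, an index-only scan is
charged ceil(200 * 0.05) = 10 heap page fetches; at allvisfrac = 1.0 it is
charged none, which is the intended behavior, since the index pages were
already costed by the amcostestimate function.)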

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] superuser() shortcuts

2014-10-27 Thread Stephen Frost
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
 Brightwell, Adam wrote:
   If we were to make it consistent and use the old wording, what do you
   think about providing an errhint as well?
  
   Perhaps for example in slotfuncs.c#pg_create_physical_replication_slot:
  
   errmsg - permission denied to create physical replication slot
   errhint - You must be superuser or replication role to use replication 
   slots.
 
 Sure.

Sounds reasonable to me and looks to be in-line with wording elsewhere.

  As I started looking at this, there are multiple other places where
  these types of error messages occur (opclasscmds.c, user.c,
  postinit.c, miscinit.c are just a few), not just around the changes in
  this patch.  If we change them in one place, wouldn't it be best to
  change them in the rest?  If that is the case, I'm afraid that might
  distract from the purpose of this patch.  Perhaps, if we want to
  change them, then that should be submitted as a separate patch?
 
 Yeah.  I'm just saying that maybe this patch should adopt whatever
 wording we agree to, not that we need to change other places.  On the
 other hand, since so many other places have adopted the different
 wording, maybe there's a reason for it and if so, does anybody know what
 it is.  But I have to say that it does look inconsistent to me.

I agree that this patch should adopt the wording agreed to above and
that it's 'more similar' to most of the usage in the backend.  As for
the general question, I'd suggest we add to the error message policy
that more-or-less anything with errcode(ERRCODE_INSUFFICIENT_PRIVILEGE)
have errmsg(permission denied to X) with any comments about 'must be a
superuser' or similar relegated to errhint().  We could add that as a
TODO and have more junior developers review these cases and come up with
patches (or argue for cases where they feel deviation is warranted).
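
To make that policy concrete, a slot-creation check following it would look
roughly like the sketch below; the surrounding condition is illustrative, and
the wording is the one suggested above:

    if (!superuser() && !has_rolreplication(GetUserId()))
        ereport(ERROR,
                (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
                 errmsg("permission denied to create physical replication slot"),
                 errhint("You must be superuser or replication role to use replication slots.")));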

Either way, that larger rework isn't for this patch and since you seem
happy with it (modulo the change above), I'm going to make that change,
do my own review, and move foward with it unless someone else wants to
speak up.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] Typo fixes for pg_recvlogical documentation

2014-10-27 Thread Robert Haas
On Fri, Oct 24, 2014 at 5:14 PM, Michael Paquier
michael.paqu...@gmail.com wrote:
 On Fri, Oct 24, 2014 at 7:42 PM, Robert Haas robertmh...@gmail.com wrote:
 [rhaas pgsql]$ patch -p1  ~/Downloads/20141023_pg_recvlogical_fixes.patch
 patching file doc/src/sgml/ref/pg_recvlogical.sgml
 Hunk #1 succeeded at 270 with fuzz 1 (offset 165 lines).
 Hunk #2 FAILED at 282.
 Hunk #3 FAILED at 295.
 Hunk #4 FAILED at 308.
 3 out of 4 hunks FAILED -- saving rejects to file
 doc/src/sgml/ref/pg_recvlogical.sgml.rej
 Sorry, I did not rebase much these days, there were conflicts with 52c1ae2...

Committed and back-patched to 9.4.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Typo fixes for pg_recvlogical documentation

2014-10-27 Thread Michael Paquier
On Tue, Oct 28, 2014 at 12:12 AM, Robert Haas robertmh...@gmail.com wrote:
 Committed and back-patched to 9.4.
Thanks.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] cost_index()

2014-10-27 Thread Teodor Sigaev

No.  Costs for touching the index were computed by the amcostestimate
function; this code is solely about estimating costs for touching the
heap.

I see. Thank you.

--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] superuser() shortcuts

2014-10-27 Thread Stephen Frost
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:
  As I started looking at this, there are multiple other places where
  these types of error messages occur (opclasscmds.c, user.c,
  postinit.c, miscinit.c are just a few), not just around the changes in
  this patch.  If we change them in one place, wouldn't it be best to
  change them in the rest?  If that is the case, I'm afraid that might
  distract from the purpose of this patch.  Perhaps, if we want to
  change them, then that should be submitted as a separate patch?
 
 Yeah.  I'm just saying that maybe this patch should adopt whatever
 wording we agree to, not that we need to change other places.  On the
 other hand, since so many other places have adopted the different
 wording, maybe there's a reason for it and if so, does anybody know what
 it is.  But I have to say that it does look inconsistent to me.

Updated patch attached.  Comments welcome.

Thanks!

Stephen
diff --git a/contrib/test_decoding/expected/permissions.out b/contrib/test_decoding/expected/permissions.out
new file mode 100644
index 212fd1d..f011955
*** a/contrib/test_decoding/expected/permissions.out
--- b/contrib/test_decoding/expected/permissions.out
*** RESET ROLE;
*** 54,66 
  -- plain user *can't* can control replication
  SET ROLE lr_normal;
  SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
! ERROR:  must be superuser or replication role to use replication slots
  INSERT INTO lr_test VALUES('lr_superuser_init');
  ERROR:  permission denied for relation lr_test
  SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
! ERROR:  must be superuser or replication role to use replication slots
  SELECT pg_drop_replication_slot('regression_slot');
! ERROR:  must be superuser or replication role to use replication slots
  RESET ROLE;
  -- replication users can drop superuser created slots
  SET ROLE lr_superuser;
--- 54,69 
  -- plain user *can't* can control replication
  SET ROLE lr_normal;
  SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding');
! ERROR:  permission denied to use replication slots
! HINT:  You must be superuser or replication role to use replication slots.
  INSERT INTO lr_test VALUES('lr_superuser_init');
  ERROR:  permission denied for relation lr_test
  SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1');
! ERROR:  permission denied to use replication slots
! HINT:  You must be superuser or replication role to use replication slots.
  SELECT pg_drop_replication_slot('regression_slot');
! ERROR:  permission denied to use replication slots
! HINT:  You must be superuser or replication role to use replication slots.
  RESET ROLE;
  -- replication users can drop superuser created slots
  SET ROLE lr_superuser;
*** SELECT 'init' FROM pg_create_logical_rep
*** 90,96 
  RESET ROLE;
  SET ROLE lr_normal;
  SELECT pg_drop_replication_slot('regression_slot');
! ERROR:  must be superuser or replication role to use replication slots
  RESET ROLE;
  -- all users can see existing slots
  SET ROLE lr_superuser;
--- 93,100 
  RESET ROLE;
  SET ROLE lr_normal;
  SELECT pg_drop_replication_slot('regression_slot');
! ERROR:  permission denied to use replication slots
! HINT:  You must be superuser or replication role to use replication slots.
  RESET ROLE;
  -- all users can see existing slots
  SET ROLE lr_superuser;
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
new file mode 100644
index 133143d..a05de7e
*** a/src/backend/access/transam/xlogfuncs.c
--- b/src/backend/access/transam/xlogfuncs.c
***
*** 27,32 
--- 27,33 
  #include miscadmin.h
  #include replication/walreceiver.h
  #include storage/smgr.h
+ #include utils/acl.h
  #include utils/builtins.h
  #include utils/numeric.h
  #include utils/guc.h
*** pg_start_backup(PG_FUNCTION_ARGS)
*** 54,63 
  
  	backupidstr = text_to_cstring(backupid);
  
! 	if (!superuser() && !has_rolreplication(GetUserId()))
  		ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! 		   errmsg("must be superuser or replication role to run a backup")));
  
  	startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL);
  
--- 55,65 
  
  	backupidstr = text_to_cstring(backupid);
  
! 	if (!has_replication_privilege(GetUserId()))
  		ereport(ERROR,
  (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! 			 	 errmsg("permission denied to run a backup"),
! 			 	 errhint("You must be superuser or replication role to run a backup.")));
  
  	startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL);
  
*** pg_stop_backup(PG_FUNCTION_ARGS)
*** 82,91 
  {
  	XLogRecPtr	stoppoint;
  
! 	if (!superuser() && !has_rolreplication(GetUserId()))
  		ereport(ERROR,
  

Re: [HACKERS] TAP test breakage on MacOS X

2014-10-27 Thread Robert Haas
On Sun, Oct 26, 2014 at 12:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The larger issue though is that even with both the above things fixed,
 the TAP tests would still be an expensive no-op on the majority of
 buildfarm members.  AFAICT, I do not own a single machine on which the
 current TAP tests will consent to run, and in most cases that's after
 going out of my way to fetch CPAN modules that aren't in the vendor Perl
 installs.  Peter's mostly been fixing the portability issues by disabling
 tests, which I guess is better than no fix at all, but it leaves darn
 little useful functionality.

I agree, emphatically.  Honestly, if we can't get these tests running
everywhere with reasonable effort, we should just rip them out.  We've
gone to a lot of trouble in general to make sure that our source code
can be ported even to systems that arguably nobody uses any more, and
instrumental to that effort is keeping the system requirements to
install and test PostgreSQL minimal.  At this point, I wouldn't mind
moving the goalposts from C89 to C89 + a bunch of C99 features that
are available on all the platforms we have buildfarm coverage for, and
I wouldn't mind requiring perl to compile and install, full stop.  But
this patch has gone much further than that: you need a new-enough
version of perl, and a new-enough version of a bunch of modules that
aren't installed by default, and maybe not even in the vendor install,
and the documentation on how to make it work is an embarrassment:

http://www.postgresql.org/docs/devel/static/regress-tap.html

Beyond all that, I have serious doubts about whether, even if we
eventually get these tests mostly working in most places, they
will actually catch any bugs.  For example, consider dropdb.  The TAP
tests cover the following points:

- When run with --help or --version, it should exit with code 0, print
something on stderr, and print nothing on stdout.
- When run with --not-a-valid-option, it should exit with a non-zero
exit code, print something on stderr, and print nothing on stdout.
- dropdb foobar1 should cause statement: DROP DATABASE foobar1 to
show up in the postgresql log file.
- dropdb nonexistent should exit non-zero.

These are certainly good things to test, but I'd argue that once
you've verified that they are working, they're unlikely to get broken
again in the future.  I'm generally supportive of the idea that we
need more automated tests, but these tests seem pretty low-value to
me, even if they were running everywhere, and even moreso if they
aren't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Materialized views don't show up in information_schema

2014-10-27 Thread Stephen Frost
* Robert Haas (robertmh...@gmail.com) wrote:
 On Fri, Oct 17, 2014 at 8:10 PM, Stephen Frost sfr...@snowman.net wrote:
  * Peter Eisentraut (pete...@gmx.net) wrote:
  On 10/16/14 9:45 AM, Stephen Frost wrote:
   Alright, coming back to this, I have to ask- how are matviews different
   from views from the SQL standard's perspective?  I tried looking through
   the standard to figure it out (and I admit that I probably missed
   something), but the only thing appears to be a statement in the standard
that (paraphrased) functions are run when the view is queried and that
   strikes me as a relatively minor point..
 
  To me, the main criterion is that you cannot DROP VIEW a materialized view.
 
  That is an entirely correctable situation.  We don't require 'DROP
  UNLOGGED TABLE'.
 
 I think that's an inapposite comparison.  The fact that a table is
 unlogged is merely a property of the table; it does not change the
 fact that it is a table.  A materialized view, on the other hand, is
 different kind of object from a view.  This manifests itself the fact
 that it's represented by a different relkind; and that different
 syntax is used not only for DROP but also for COMMENT, ALTER VIEW,
 SECURITY LABEL, and ALTER EXTENSION .. ADD/DROP; and that the set of
 supported operations on a materialized view is different from a
 regular view (and will probably be more different in the future).  If
 we really want to change this, we can't just change DROP VIEW; we need
 to change all of the places in a consistent fashion, and we probably
 have to continue to support the old syntax so that we don't break
 existing dumps.

Yes, clearly we'd have to adjust the syntax for all of the commands to
support both with MATERIALIZED and without.  I don't have an issue with
that.  The argument makes more sense if there are operations which ONLY
work against a regular view and do *not* work against a matview.
Operations which operate only against a matview simply must have
MATERIALIZED used for them.

 But I think it's the wrong thing anyway, because it presumes that,
 when Kevin chose to make materialized views a different relkind and a
 different object type, rather than just a property of an object, he
 made the wrong call, and I don't agree with that.  I think he got it
 exactly right.  A materialized view is really much more like a table
 than a view: it has storage and can be vacuumed, clustered, analyzed,
 and so on.  That's far more significant IMV than the difference
 between a table and unlogged table.

I don't think Kevin was wrong to use a different relkind, but I don't
buy into the argument that a different relkind means it's not a view.
As for the other comments, I agree that a matview is *more* than a view,
but at its base, in my view (pun intended), it's still a view.  Why not
call it a materialized query?

Thanks,

Stephen




Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-10-27 Thread Robert Haas
On Sun, Oct 26, 2014 at 4:39 PM, Peter Geoghegan p...@heroku.com wrote:
 I don't care whether you actually generate index-paths or not, and in
 fact I suspect it makes no sense to do so.  But I do care that you do
 a cost comparison between the available indexes and pick the one that
 looks cheapest.  If somebody's got a bloated index and builds a new
 one, they will want it to get used even before they drop the old one.

 That seems extremely narrow. I am about ready to just do what you say,
 by costing the index based on something like relpages, but for the
 record I see no point. If I see no point, and you're sure that there
 is a point, then there is a danger that I'll miss the point, which
 contributes to my giving push back. I know I said that already, but I
 reiterate it once more for emphasis.

I don't think it's either narrow or difficult: I think you just need
to apply the existing index costing function.  CLUSTER does a similar
calculation to figure out whether to seq-scan or sort, even though it
doesn't use index paths per se, or at least I don't think it does.

 Even if that weren't an issue, though, the fact that you can't
 re-parse but you can re-plan is a killer point AFAICS. It means you
 are borked if the statement gets re-executed after the index you
 picked is dropped.

 That simply isn't the case. As specifically noted in comments in the
 patch, relcache invalidation works in such a way as to invalidate the
 query proper, even when only an index has been dropped. For a time
 during development, the semantic dependence was more directly
 represented by actually having extract_query_dependencies() extract
 the arbitrating unique index pg_class Oid as a dependency for the
 benefit of the plan cache, but I ultimately decided against doing
 anything more than noting the potential future problem (the problem
 that arises in a world where relcache invalidation is more selective).

I don't know what you mean by invalidate the query proper.  Here's
my chain of reasoning: If the index selection occurs in parse
analysis, then it's embedded in the parse tree.  If this can used a
writable CTE, then the parse tree can get stored someplace.  If the
index is then dropped, the parse tree is no good any more.  Where's
the flaw in that reasoning?

I haven't looked at the patch to know exactly what will happen in that
case, but it seems to me like it must either not work or there must be
some ugly hack to make it work.

 But, I digress: I'll have inference occur during planning during the
 next revision since you think that's important.

Great.

 I think that it's accurate to say that I asked the community to do
 that once, last year. I was even very explicit about it at the time.
 I'm surprised that you think that's the case now, though. I learned my
 lesson a little later: don't do that, because fellow contributors are
 unwilling to go along with it, even for something as important as
 this.

I'm just telling you how I feel.  I can't vouch for it being perfectly
fair or accurate, though I do strive for that.

 As recently as a few months ago, you wanted me to go a *completely*
 different direction with this (i.e. attempt an UPDATE first). I'm
 surprised that you're emphasizing the EXCLUDED()/CONFLICTING() thing
 so much. Should I take it that I more or less have your confidence
 that the way the patch fits together at a high level is sound?

No.  Commenting on one aspect of a patch doesn't imply agreement with
other aspects of the patch.  Please don't put words into my mouth.  I
haven't reviewed this patch in detail; I've only commented on specific
aspects of it as they have arisen in discussion.  I may or may not
someday review it in detail, but not before I'm fairly confident that
the known issues raised by other community members have been addressed
as thoroughly as possible.

 In reality, the
 main reason for that is: getting this feature to a point where it
 might plausibly be committed is bloody difficult, as evidenced by the
 fact that it took this long for someone to produce a patch. Please
 don't lose sight of that.

I'm not.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TAP test breakage on MacOS X

2014-10-27 Thread Heikki Linnakangas

On 10/27/2014 05:41 PM, Robert Haas wrote:

On Sun, Oct 26, 2014 at 12:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:

The larger issue though is that even with both the above things fixed,
the TAP tests would still be an expensive no-op on the majority of
buildfarm members.  AFAICT, I do not own a single machine on which the
current TAP tests will consent to run, and in most cases that's after
going out of my way to fetch CPAN modules that aren't in the vendor Perl
installs.  Peter's mostly been fixing the portability issues by disabling
tests, which I guess is better than no fix at all, but it leaves darn
little useful functionality.


I agree, emphatically.  Honestly, if we can't get these tests running
everywhere with reasonable effort, we should just rip them out.  We've
gone to a lot of trouble in general to make sure that our source code
can be ported even to systems that arguably nobody uses any more, and
instrumental to that effort is keeping the system requirements to
install and test PostgreSQL minimal.  At this point, I wouldn't mind
moving the goalposts from C89 to C89 + a bunch of C99 features that
are available on all the platforms we have buildfarm coverage for, and
I wouldn't mind require perl to compile and install, full stop.  But
this patch has gone much further than that: you need a new-enough
version of perl, and a new-enough version of a bunch of modules that
aren't installed by default, and maybe not even in the vendor install,
and the documentation on how to make it work is an embarrassment:

http://www.postgresql.org/docs/devel/static/regress-tap.html

Beyond all that, I have serious doubts about whether, even if we
eventually get these tests mostly working in most places, they
will actually catch any bugs.


The existing tests are not very useful, but it certainly would be nice 
to have some infrastructure like TAP to write useful tests. For example, 
I recently wrote a test suite for SSL, using TAP 
(http://www.postgresql.org/message-id/53df9afe.4090...@vmware.com), and 
it would be nice to have some TAP tests for hot standby, replication, 
PITR etc.


I'm not sure if the current infrastructure is very good for that, but we 
need something. If the current infrastructure is beyond repair, then 
let's discuss what we could replace it with.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TAP test breakage on MacOS X

2014-10-27 Thread Andrew Dunstan


On 10/27/2014 11:41 AM, Robert Haas wrote:

On Sun, Oct 26, 2014 at 12:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:

The larger issue though is that even with both the above things fixed,
the TAP tests would still be an expensive no-op on the majority of
buildfarm members.  AFAICT, I do not own a single machine on which the
current TAP tests will consent to run, and in most cases that's after
going out of my way to fetch CPAN modules that aren't in the vendor Perl
installs.  Peter's mostly been fixing the portability issues by disabling
tests, which I guess is better than no fix at all, but it leaves darn
little useful functionality.

I agree, emphatically.  Honestly, if we can't get these tests running
everywhere with reasonable effort, we should just rip them out.  We've
gone to a lot of trouble in general to make sure that our source code
can be ported even to systems that arguably nobody uses any more, and
instrumental to that effort is keeping the system requirements to
install and test PostgreSQL minimal.  At this point, I wouldn't mind
moving the goalposts from C89 to C89 + a bunch of C99 features that
are available on all the platforms we have buildfarm coverage for, and
I wouldn't mind requiring perl to compile and install, full stop.  But
this patch has gone much further than that: you need a new-enough
version of perl, and a new-enough version of a bunch of modules that
aren't installed by default, and maybe not even in the vendor install,
and the documentation on how to make it work is an embarrassment:

http://www.postgresql.org/docs/devel/static/regress-tap.html

Beyond all that, I have serious doubts about whether, even if we
eventually get these tests mostly working in most places, they
will actually catch any bugs.  For example, consider dropdb.  The TAP
tests cover the following points:

- When run with --help or --version, it should exit with code 0, print
something on stderr, and print nothing on stdout.
- When run with --not-a-valid-option, it should exit with a non-zero
exit code, print something on stderr, and print nothing on stdout.
- dropdb foobar1 should cause statement: DROP DATABASE foobar1 to
show up in the postgresql log file.
- dropdb nonexistent should exit non-zero.

These are certainly good things to test, but I'd argue that once
you've verified that they are working, they're unlikely to get broken
again in the future.  I'm generally supportive of the idea that we
need more automated tests, but these tests seem pretty low-value to
me, even if they were running everywhere, and even moreso if they
aren't.




Yeah. I think at the very least they should be removed from the 
check-world and installcheck-world targets until this is sorted out.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] What exactly is our CRC algorithm?

2014-10-27 Thread Heikki Linnakangas

On 10/09/2014 12:13 AM, Andres Freund wrote:

On 2014-10-08 22:13:46 +0300, Heikki Linnakangas wrote:

As far as I can tell, PostgreSQL's so-called CRC algorithm doesn't
correspond to any bit-by-bit CRC variant and polynomial. My math skills are
not strong enough to determine what the consequences of that are. It might
still be a decent checksum. Or not. I couldn't tell if the good error
detection properties of the normal CRC-32 polynomial apply to our algorithm
or not.


Additional interesting datapoints are that hstore and ltree contain the
same tables - but properly use the reflected computation.


Thoughts?


It clearly seems like a bad idea to continue with this - I don't think
anybody here knows which guarantees this gives us.

The question is how can we move away from this. There's unfortunately
two places that embed PGC32 that are likely to prove problematic when
fixing the algorithm: pg_trgm and tsgist both seem to include crc's in
their logic in a persistent way.  I think we should provide
INIT/COMP/FIN_PG32 using the current algorithm for these.


Agreed, it's not worth breaking pg_upgrade for this.


If we're switching to a saner computation, we should imo also switch to
a better polynom - CRC-32C has better error detection capabilities than
CRC32 and is available in hardware. As we're paying the price of
breaking compat anyway...


Agreed.


Arguably we could also say that given that there's been little evident
problems with the borked computation we could also switch to a much
faster hash instead of continuing to use crc...


I don't feel like taking the leap. Once we switch to slice-by-4/8 and/or 
use a hardware instruction when available, CRC is fast enough.


I came up with the attached patches. They do three things:

1. Get rid of the 64-bit CRC code. It's not used for anything, and 
hasn't been for years, so it doesn't seem worth spending any effort to 
fix them.


2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't 
need to remain compatible across major versions.


3. Use the same lookup table for hstore and ltree, as used for the 
legacy almost CRC-32 algorithm. The tables are identical, so might as 
well.


Any objections?
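
For reference, a minimal bit-at-a-time sketch of the reflected CRC-32C
(Castagnoli) computation; illustrative only, the names below are not from
the patch, which uses the usual table-driven macros:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Bit-at-a-time reflected CRC-32C, polynomial 0x1EDC6F41
     * (0x82F63B78 in reversed form).
     */
    static uint32_t
    crc32c_bytewise(const void *data, size_t len)
    {
        const uint8_t *p = (const uint8_t *) data;
        uint32_t    crc = 0xFFFFFFFF;
        int         bit;

        while (len-- > 0)
        {
            crc ^= *p++;
            for (bit = 0; bit < 8; bit++)
                crc = (crc >> 1) ^ (0x82F63B78 & (0 - (crc & 1)));
        }
        return crc ^ 0xFFFFFFFF;
    }

    int
    main(void)
    {
        /* well-known check value: CRC-32C("123456789") = 0xE3069283 */
        printf("%08X\n", crc32c_bytewise("123456789", 9));
        return 0;
    }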

- Heikki

From 4fe6472976f2efc22035e802067a70d3cca61d79 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas heikki.linnakan...@iki.fi
Date: Mon, 27 Oct 2014 12:01:19 +0200
Subject: [PATCH 1/2] Remove support for 64-bit CRC.

It hasn't been used for anything in backend code for a long time.
---
 src/include/utils/pg_crc.h|  96 -
 src/include/utils/pg_crc_tables.h | 416 --
 2 files changed, 512 deletions(-)

diff --git a/src/include/utils/pg_crc.h b/src/include/utils/pg_crc.h
index 375c405..f43f4aa 100644
--- a/src/include/utils/pg_crc.h
+++ b/src/include/utils/pg_crc.h
@@ -10,9 +10,6 @@
  * We use a normal (not reflected, in Williams' terms) CRC, using initial
  * all-ones register contents and a final bit inversion.
  *
- * The 64-bit variant is not used as of PostgreSQL 8.1, but we retain the
- * code for possible future use.
- *
  *
  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -56,97 +53,4 @@ do { \
 /* Constant table for CRC calculation */
 extern CRCDLLIMPORT const uint32 pg_crc32_table[];
 
-
-#ifdef PROVIDE_64BIT_CRC
-
-/*
- * If we use a 64-bit integer type, then a 64-bit CRC looks just like the
- * usual sort of implementation.  However, we can also fake it with two
- * 32-bit registers.  Experience has shown that the two-32-bit-registers code
- * is as fast as, or even much faster than, the 64-bit code on all but true
- * 64-bit machines.  We use SIZEOF_VOID_P to check the native word width.
- */
-
-#if SIZEOF_VOID_P < 8
-
-/*
- * crc0 represents the LSBs of the 64-bit value, crc1 the MSBs.  Note that
- * with crc0 placed first, the output of 32-bit and 64-bit implementations
- * will be bit-compatible only on little-endian architectures.  If it were
- * important to make the two possible implementations bit-compatible on
- * all machines, we could do a configure test to decide how to order the
- * two fields, but it seems not worth the trouble.
- */
-typedef struct pg_crc64
-{
-	uint32		crc0;
-	uint32		crc1;
-}	pg_crc64;
-
-/* Initialize a CRC accumulator */
-#define INIT_CRC64(crc) ((crc).crc0 = 0xffffffff, (crc).crc1 = 0xffffffff)
-
-/* Finish a CRC calculation */
-#define FIN_CRC64(crc)	((crc).crc0 ^= 0xffffffff, (crc).crc1 ^= 0xffffffff)
-
-/* Accumulate some (more) bytes into a CRC */
-#define COMP_CRC64(crc, data, len)	\
-do { \
-	uint32		__crc0 = (crc).crc0; \
-	uint32		__crc1 = (crc).crc1; \
-	unsigned char *__data = (unsigned char *) (data); \
-	uint32		__len = (len); \
-\
-	while (__len-- > 0) \
-	{ \
-		int		__tab_index = ((int) (__crc1 >> 24) ^ *__data++) & 0xFF; \
-		__crc1 = pg_crc64_table1[__tab_index] ^ ((__crc1 << 8) | (__crc0 >> 24)); \
-		__crc0 = 

Re: [HACKERS] BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

2014-10-27 Thread Heikki Linnakangas

On 10/27/2014 02:12 PM, Fujii Masao wrote:

On Fri, Oct 24, 2014 at 10:05 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:

On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:


At least for master, we should consider changing the way the archiving
works so that we only archive WAL that was generated in the same server.
I.e. we should never try to archive WAL files belonging to another
timeline.

I just remembered that we discussed a different problem related to this
some time ago, at

http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyot...@lab.ntt.co.jp.
The conclusion of that was that at promotion, we should not archive the
last, partial, segment from the old timeline.



So, this is what I came up with for master. Does anyone see a problem with
it?


What about the problem that I raised upthread? This is, the patch
prevents the last, partial, WAL file of the old timeline from being archived.
So we can never PITR the database to the point that the last, partial WAL
file has.


A partial WAL file is never archived in the master server to begin with, 
so if it's ever used in archive recovery, the administrator must have 
performed some manual action to copy the partial WAL file from the 
original server. When he does that, he can also copy it manually to the 
archive, or whatever he wants to do with it.


Note that the same applies to any complete, but not-yet archived WAL 
files. But we've never had any mechanism in place to archive those in 
the new instance, after PITR.



Isn't this a problem? For example, please imagine the
following scenario.

1. The important data was deleted but no one noticed that. This deletion was
 logged in the last, partial WAL file.
2. The server crashed and DBA started an archive recovery from old backup.
3. After recovery, all WAL files of the old timeline were recycled.
4. Finally DBA noticed the loss of important data and tried to do PITR
to the point
 where the data was deleted.

HOWEVER, the WAL file containing that deletion operation no longer exists.
So the DBA will never be able to recover that important data ...


I think you're missing a step above:

1.5: The administrator copies the last, partial WAL file (and any 
complete but not yet-archived files) to the new server's pg_xlog directory.


Without that, it won't be available for PITR anyway, and the new server 
won't see it or try to archive it, no matter what.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Heikki Linnakangas

On 10/27/2014 03:46 PM, Tom Lane wrote:

Heikki Linnakangas hlinnakan...@vmware.com writes:

On 10/27/2014 03:21 PM, Tomas Vondra wrote:

Thinking about this a bit more, do we really need a full checkpoint? That
is a checkpoint of all the databases in the cluster? Why checkpointing the
source database is not enough?



A full checkpoint ensures that you always begin recovery *after* the
DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
crash recovery (except when you crash before the transaction commits, in
which case it doesn't matter if the new database's directory is borked).


Yeah.  After re-reading the 2005 thread, I wonder if we shouldn't just
bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
all the copied files instead of doing a cp -r-equivalent directory copy.
That would fix a number of existing replay hazards as well as making it
safe to do what Tomas wants.  In the small scale this would cause more I/O
(2 copies of the template database's data) but in production situations
we might well come out ahead by avoiding a forced checkpoint of the rest
of the cluster.  Also I guess we could skip WAL-logging if WAL archiving
is off, similarly to the existing optimization for CREATE INDEX etc.


That would be a nasty surprise for anyone who's using CREATE DATABASE as 
a fast way to clone a large database. But I would be OK with that, at 
least if we can skip the WAL-logging with wal_level=minimal.


- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TAP test breakage on MacOS X

2014-10-27 Thread Tom Lane
Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 10/27/2014 05:41 PM, Robert Haas wrote:
 Beyond all that, I have serious doubts about whether, even if we
 eventually get these tests mostly working in most places, whether they
 will actually catch any bugs.

 The existing tests are not very useful, but it certainly would be nice 
 to have some infrastructure like TAP to write useful tests.

Yeah.  I think this is an "if you build it they will come" situation.
The existing tests are really only proof-of-concept; if we can get the
infrastructure to the point of being widely usable, we can certainly
use it to write many more-useful tests.

So I'm not for giving up and ripping it out in the near future.  But I do
think there's a lot of test-infrastructure work left here to get to the
point of being widely usable, and I'm not sure if anyone is committed to
making that happen.

But let's stop talking in generalities and try to quantify where we think
the bar needs to be set.  Is it okay to depend on non-core CPAN modules
at all?  How old versions of Perl, Test::More, etc do we need it to work
with?

For my own purposes, I'd be satisfied if it ran on RHEL6 and reasonably
recent OS X.  For RHEL6, I'd not want to have to install any CPAN modules
other than the versions supplied by Red Hat (which means no Test::More
subplans).  This platform is only four years old, if you can't cope with
that then you're definitely too bleeding-edge to be a portable test
infrastructure.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TAP test breakage on MacOS X

2014-10-27 Thread Tom Lane
Andrew Dunstan and...@dunslane.net writes:
 Yeah. I think at the very least they should be removed from the 
 check-world and installcheck-world targets until this is sorted out.

+1 for doing that in the 9.4 branch; I'm surprised we've not already
heard bitching from packagers about how far we've moved the goalposts
for what's required to get make check-world to work.

As I said in my other reply, I'm not giving up on this (yet) so I'd
be happy with leaving it in check-world in the master branch.  But it's
not ready for prime time, and 9.4 needs to ship soon.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-10-27 Thread Simon Riggs
On 27 October 2014 15:55, Robert Haas robertmh...@gmail.com wrote:

 Commenting on one aspect of a patch doesn't imply agreement with
 other aspects of the patch.  Please don't put words into my mouth.  I
 haven't reviewed this patch in detail; I've only commented on specific
 aspects of it as they have arisen in discussion.  I may or may not
 someday review it in detail, but not before I'm fairly confident that
 the known issues raised by other community members have been addressed
 as thoroughly as possible.

+1

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-10-27 Thread Peter Geoghegan
On Mon, Oct 27, 2014 at 9:43 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On 27 October 2014 15:55, Robert Haas robertmh...@gmail.com wrote:

 Commenting on one aspect of a patch doesn't imply agreement with
 other aspects of the patch.  Please don't put words into my mouth.  I
 haven't reviewed this patch in detail; I've only commented on specific
 aspects of it as they have arisen in discussion.  I may or may not
 someday review it in detail, but not before I'm fairly confident that
 the known issues raised by other community members have been addressed
 as thoroughly as possible.

 +1

I wasn't putting words in anyone's mouth; I *don't* think that it's
true that Robert thinks patches 0001-* and 0002-* are perfectly fine,
and implied as much myself. I just think the *strongly* worded
disapproval of the user-visible interface of 0003-* was odd; it was
way out of proportion to its immediate importance to getting this
patch on track. AFAICT it was the *only* feedback that I didn't act on
with V1.3 (Robert's complaint about how inference happens during parse
analysis was a response to V1.3).

I'm not always going to be able to act on every item of feedback
immediately, or I'll have my own ideas about how to handle certain
things. I don't think that's all that big of a deal, since I've acted
on almost all feedback. I think by far the biggest problem here is the
lack of attention to the design from others.

I did a lot of copy-editing to the Wiki page yesterday. There are
actually few clear open items now:
https://wiki.postgresql.org/wiki/UPSERT#Open_Items

Some previous open items have been moved to here:
https://wiki.postgresql.org/wiki/UPSERT#Miscellaneous_odd_properties_of_proposed_ON_CONFLICT_patch

This is basically a section describing things that have not been
controversial or in need of adjusting, and may well never be, but that I
wish we'd talk about because they're in some way novel or
counter-intuitive. These are the kinds of things I'd like us to discuss
more.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-10-27 Thread Simon Riggs
On 27 October 2014 17:44, Peter Geoghegan p...@heroku.com wrote:

 I did a lot of copy-editing to the Wiki page yesterday. There are
 actually few clear open items now:
 https://wiki.postgresql.org/wiki/UPSERT#Open_Items

I've read this page.

Please do these things, both of which have been requested multiple times...

1. Take the specific docs that relate to the patch and put them in one
place, so that everybody can read and understand and agree the
behaviour of the patch. So that someone reading that can see *exactly*
what is being proposed, not read through pages of other unchanged
material hoping to catch the few points that really differ.

2. Replace CONFLICTING() with what I have asked for in earlier posts.
The request is not repeated again, to avoid confusion.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] jsonb generator functions

2014-10-27 Thread Pavel Stehule
Hi

2014-10-27 15:33 GMT+01:00 Andrew Dunstan and...@dunslane.net:


 On 10/15/2014 03:54 PM, Andrew Dunstan wrote:



 I checked the code, and I have only two small objections - the name
 jsonb_object_two_arg is not good - maybe json_object_keys_values ?


 It's consistent with the existing json_object_two_arg. In all cases I
 think I kept the names the same except for changing json to jsonb. Note
 that these _two_arg functions are not visible at the SQL level - they are
 only visible in the C code.

 I'm happy to be guided by others in changing or keeping these names.


 Next: there are no tests for to_jsonb function.



 Oh, my bad. I'll add some.

 Thank for the review.



 Here is a new patch that includes documentation and addresses all these
 issues, except that I didn't change the name of jsonb_object_two_arg to
 keep it consistent with the name of json_object_two_arg. I'm happy to
 change both if people feel it matters.


I checked the last patch, jsonbmissingfunc5.patch, and I have no objections:

1. This jsonb API is consistent with the current JSON API, so we surely want
this functionality

2. The implementation is clean, without side effects and without impact on
current code.

3. Patching and compilation work without any issues or warnings

4. The source code respects PostgreSQL coding rules

5. All regression tests pass without problems

6. The documentation builds without problems

7. The patch contains the necessary regression tests

8. The patch contains the necessary documentation for the new functions.

The patch is ready for committers

Thank you for patch

Regards

Pavel



 cheers

 andrew



Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-10-27 Thread Peter Geoghegan
On Mon, Oct 27, 2014 at 11:12 AM, Simon Riggs si...@2ndquadrant.com wrote:
 1. Take the specific docs that relate to the patch and put them in one
 place, so that everybody can read and understand and agree the
 behaviour of the patch. So that someone reading that can see *exactly*
 what is being proposed, not read through pages of other unchanged
 material hoping to catch the few points that really differ.

I'm afraid I don't follow. I have links to the user-visible
documentation (v1.3) on the Wiki:
https://wiki.postgresql.org/wiki/UPSERT#Documentation

The documentation is complete. I link to every interesting page from
the documentation directly from the Wiki, too. Of course, I also
describe the issues in more detail for those with an interest in the
implementation on the Wiki page itself (and list open issues). I have
isolation tests that illustrate the new facets of visibility for READ
COMMITTED, too.

How, specifically, have I failed to do what you ask here? If you want
to see exactly what has changed, in a differential fashion, well,
that's what a diff is for. I'm not aware of any existing way of
rendering to html for readability, while highlighting what is new.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] TODO request: log_long_transaction

2014-10-27 Thread Josh Berkus
Hackers,

I just realized that there is one thing we can't log currently:
transactions which last more than #ms.  This is valuable diagnostic
information when looking for issues like causes of bloat and deadlocks.

I'd like it to be on the TODO list because it seems like part of a good
GSOC project or first-time contribution.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] TODO request: log_long_transaction

2014-10-27 Thread Thom Brown
On 27 October 2014 19:21, Josh Berkus j...@agliodbs.com wrote:

 Hackers,

 I just realized that there is one thing we can't log currently:
 transactions which last more than #ms.  This is valuable diagnostic
 information when looking for issues like causes of bloat and deadlocks.

 I'd like it to be on the TODO list because it seems like part of a good
 GSOC project or first-time contribution.


So effectively, log_min_duration_transaction?  Sounds useful.
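
A rough sketch of the kind of check this would involve, assuming a
hypothetical log_min_duration_transaction GUC (milliseconds, -1 meaning
disabled) tested somewhere around transaction end; the surrounding hook
point is deliberately left unspecified here:

    /* hypothetical GUC; the timestamp helpers are existing backend functions */
    if (log_min_duration_transaction >= 0 &&
        TimestampDifferenceExceeds(GetCurrentTransactionStartTimestamp(),
                                   GetCurrentTimestamp(),
                                   log_min_duration_transaction))
        ereport(LOG,
                (errmsg("transaction duration exceeded %d ms",
                        log_min_duration_transaction)));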

Thom


[HACKERS] Reducing the cost of sinval messaging

2014-10-27 Thread Tom Lane
I happened to be looking at sinvaladt.c and noticed the loop added in
commit b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4:

/*
 * Now that the maxMsgNum change is globally visible, we give everyone
 * a swift kick to make sure they read the newly added messages.
 * Releasing SInvalWriteLock will enforce a full memory barrier, so
 * these (unlocked) changes will be committed to memory before we exit
 * the function.
 */
for (i = 0; i < segP->lastBackend; i++)
{
ProcState  *stateP = &segP->procState[i];

stateP->hasMessages = true;
}

This seems fairly awful when there are a large number of backends.

Why could we not remove the hasMessages flags again, and change the
unlocked test

if (!stateP->hasMessages)
return 0;

into

if (stateP->nextMsgNum == segP->maxMsgNum &&
!stateP->resetState)
return 0;

If we are looking at stale shared state, this test might produce a
false positive, but I don't see why it's any less safe than testing a
hasMessages boolean.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing the cost of sinval messaging

2014-10-27 Thread Robert Haas
On Mon, Oct 27, 2014 at 3:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I happened to be looking at sinvaladt.c and noticed the loop added in
 commit b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4:

 /*
  * Now that the maxMsgNum change is globally visible, we give everyone
  * a swift kick to make sure they read the newly added messages.
  * Releasing SInvalWriteLock will enforce a full memory barrier, so
  * these (unlocked) changes will be committed to memory before we exit
  * the function.
  */
 for (i = 0; i < segP->lastBackend; i++)
 {
 ProcState  *stateP = &segP->procState[i];

 stateP->hasMessages = true;
 }

 This seems fairly awful when there are a large number of backends.

We did quite a bit of testing at the time and found that it was a
clear improvement over what we had been doing previously.  Of course,
that's not to say we can't or shouldn't improve it further.

 Why could we not remove the hasMessages flags again, and change the
 unlocked test

 if (!stateP->hasMessages)
 return 0;

 into

 if (stateP->nextMsgNum == segP->maxMsgNum &&
 !stateP->resetState)
 return 0;

 If we are looking at stale shared state, this test might produce a
 false positive, but I don't see why it's any less safe than testing a
 hasMessages boolean.

It was discussed at the time:

http://www.postgresql.org/message-id/CA+TgmoY3Q56ZR6i8h+iGhXCw6rCZyvdWJ3RQT=PMVev4-=+n...@mail.gmail.com
http://www.postgresql.org/message-id/13077.1311702...@sss.pgh.pa.us

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-10-27 Thread Robert Haas
On Mon, Oct 27, 2014 at 1:44 PM, Peter Geoghegan p...@heroku.com wrote:
 I think by far the biggest problem here is the
 lack of attention to the design from others.

I find that attitude incredible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-10-27 Thread Peter Geoghegan
On Mon, Oct 27, 2014 at 1:22 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Oct 27, 2014 at 1:44 PM, Peter Geoghegan p...@heroku.com wrote:
 I think by far the biggest problem here is the
 lack of attention to the design from others.

 I find that attitude incredible.

What I mean is that that's the thing that clearly needs scrutiny the
most. That isn't an attitude - it's a technical opinion.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing the cost of sinval messaging

2014-10-27 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Oct 27, 2014 at 3:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Why could we not remove the hasMessages flags again, and change the
 unlocked test
 
 if (!stateP->hasMessages)
 return 0;
 
 into
 
 if (stateP->nextMsgNum == segP->maxMsgNum &&
 !stateP->resetState)
 return 0;
 
 If we are looking at stale shared state, this test might produce a
 false positive, but I don't see why it's any less safe than testing a
 hasMessages boolean.

 It was discussed at the time:

 http://www.postgresql.org/message-id/CA+TgmoY3Q56ZR6i8h+iGhXCw6rCZyvdWJ3RQT=PMVev4-=+n...@mail.gmail.com
 http://www.postgresql.org/message-id/13077.1311702...@sss.pgh.pa.us

Neither of those messages seem to me to bear on this point.  AFAICS,
the unlocked hasMessages test has a race condition, which the comment
just above it argues isn't a problem in practice.  What I propose above
would have an absolutely indistinguishable race condition, which again
wouldn't be a problem in practice if you believe the comment's argument.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] alter user/role CURRENT_USER

2014-10-27 Thread Adam Brightwell
All,

I just ran through the patch giving it a good once over, some items to
address/consider/discuss:

* Trailing whitespace.
* Why are you making changes in foreigncmds.c?  These seem like unnecessary
changes.  I see that you are trying to consolidate but this refactor seems
potentially out of scope.
* To the above point, is ResolveRoleId really necessary?  Also, I'm not
sure I agree with passing in the tuple, I don't see any reason to pull that
look up into this function.  Also, couldn't you simply just add a short
circuit in get_role_oid to check for current_user and return
GetUserId() and similar for session_user?
* In gram.y, is there really a need to have a separate RoleId_or_curruser?
Why not:

-RoleId:    NonReservedWord { $$ = $1; };
+RoleId:    CURRENT_USER    { $$ = "current_user"; }
+   | USER  { $$ = "current_user"; }
+   | CURRENT_ROLE  { $$ = "current_user"; }
+   | SESSION_USER  { $$ = "session_user"; }
+   | NonReservedWord   { $$ = $1; }
+   ;

This would certainly reduce the number of changes to the grammar but might
also help with covering the cases mentioned by Rushabh?  Also are there
cases when we don't want CURRENT_USER to be considered a RoleId?

* The following seems like an unnecessary change:

-   |   RoleId  { $$ = (strcmp($1, "public") == 0) ? NULL : $1; }
+   RoleId  { $$ = (strcmp($1, "public") == 0) ?
+   NULL : $1; }

* Why is htup.h included in dbcommands.h?
* What's up with the following in user.c, did you replace tab with spaces?

-   HeapTuple   roletuple;
+   HeapTuple   roletuple;

-- Not working
 alter default privileges for role current_user grant SELECT ON TABLES TO
 current_user ;

-- Not working
 grant user1 TO current_user;


Agreed.  The latter of the two seems like an interesting case to me
though.  But they both kind of jump out at me as potential for security
concerns (but perhaps not given the role id checks, etc).  At any rate, I'd
still expect these to behave accordingly with notification/error messages
when appropriate.

There might be a few more syntaxes like this.


I think this can be covered inherently by the grammar changes recommended
above (if such changes are appropriate).  Though, I'd also recommend
investigating which commands are affected via the grammar (or RoleId) and
then making sure to update the regression tests and the documentation
accordingly.


 I understand that the patch is getting slightly bigger and more complex than
 what was
 originally proposed.


IMHO, I think this patch has become more complex than is required.


 Before going back to more review on latest patch I would
 like to understand the requirement of this new feature. Also would like
 others
 to comment on where/how we should restrict this feature ?


I think this is a fair request.  I believe I can see the potential
convenience of these changes, however, I'm uncertain of the necessity of
them and whether or not it opens any security concerns (again, perhaps not,
but I think it is worth asking).  Also, how would this affect items like
auditing?

-Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com


Re: [HACKERS] Reducing the cost of sinval messaging

2014-10-27 Thread Robert Haas
On Mon, Oct 27, 2014 at 4:27 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Mon, Oct 27, 2014 at 3:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Why could we not remove the hasMessages flags again, and change the
 unlocked test

 if (!stateP->hasMessages)
 return 0;

 into

 if (stateP->nextMsgNum == segP->maxMsgNum &&
 !stateP->resetState)
 return 0;

 If we are looking at stale shared state, this test might produce a
 false positive, but I don't see why it's any less safe than testing a
 hasMessages boolean.

 It was discussed at the time:

 http://www.postgresql.org/message-id/CA+TgmoY3Q56ZR6i8h+iGhXCw6rCZyvdWJ3RQT=PMVev4-=+n...@mail.gmail.com
 http://www.postgresql.org/message-id/13077.1311702...@sss.pgh.pa.us

 Neither of those messages seem to me to bear on this point.  AFAICS,
 the unlocked hasMessages test has a race condition, which the comment
 just above it argues isn't a problem in practice.  What I propose above
 would have an absolutely indistinguishable race condition, which again
 wouldn't be a problem in practice if you believe the comment's argument.

Yes, the read from hasMessages has a race condition, which is in fact
not a problem in practice if you believe the comment's argument.
However, your proposed formulation has a second race condition, which
is discussed in those emails, and which is precisely why we rejected
the design you're now proposing three years ago.  That second race
condition is that the read from stateP-nextMsgNum doesn't have to
happen at the same time as the read from stateP-resetState. No CPU or
compiler barrier separates them, so either can happen before the
other. SICleanupQueue updates both values, so you have the problem
first described precisely by Noah:

# (In theory, you could
# hit a case where the load of resetState gives an ancient false just as the
# counters wrap to match.

While it practically seems very unlikely that this would ever really
happen, some guy named Tom Lane told us we should worry about it, so I
did. That led to this new design:

http://www.postgresql.org/message-id/ca+tgmobefjvyqdjvbb6tss996s8zkvyrzttpx0chyo3hve5...@mail.gmail.com

...which is what ended up getting committed.
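
To spell that out (a simplified, hypothetical fragment, not the committed
code), the hazard with the two-field test is that nothing keeps the two
loads close together in time:

    /* sketch: the two loads may observe shared state from very different times */
    int     msgnum = stateP->nextMsgNum;    /* load A */
    /* nothing here forces load B to be any fresher than load A */
    bool    reset  = stateP->resetState;    /* load B */

    if (msgnum == segP->maxMsgNum && !reset)
        return 0;   /* can wrongly report "no messages" if resetState was
                     * an ancient false and the counters have since wrapped */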

Personally, I think trying to further optimize this is most likely a
waste of time.  I'm quite sure that there's a whole lot of stuff in
the sinval code that could be further optimized, but it executes so
infrequently that I believe it doesn't matter.  Noah beat on this
pretty hard at the time and found that O(N) loop you're complaining
about - in his testing - never went above 0.26% of CPU time and
reduced overall CPU usage even in a sinval-intensive workload with 50
clients per core:

http://www.postgresql.org/message-id/20110729030514.gb30...@tornado.leadboat.com

If you've got some evidence that this is a real problem as opposed to
a hypothetical one, we could certainly reopen the debate about whether
the possibility that the loads from stateP->nextMsgNum and
stateP->resetState will be widely-enough spaced in time for a
quarter-million messages to be queued up in the meantime is
sufficiently unlikely to ignore.  However, absent evidence of a
tangible performance problem, I'd be quite disinclined to throw the
guaranteed safety of the current approach out the window.  We could do
some testing where we crank that 262,144 value down to progressively
lower numbers until the race condition actually manifests, just to see
how far we are from disaster.  But presumably the goalposts there
shift every time we get a new generation of hardware; and I don't much
like coding on the basis of eh, seems unlikely the CPU will ever
delay a read by *that* much.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Tomas Vondra
On 27.10.2014 17:24, Heikki Linnakangas wrote:
 On 10/27/2014 03:46 PM, Tom Lane wrote:
 Heikki Linnakangas hlinnakan...@vmware.com writes:
 On 10/27/2014 03:21 PM, Tomas Vondra wrote:
 Thinking about this a bit more, do we really need a full checkpoint?
 That
 is a checkpoint of all the databases in the cluster? Why
 checkpointing the
 source database is not enough?

 A full checkpoint ensures that you always begin recovery *after* the
 DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
 crash recovery (except when you crash before the transaction commits, in
 which case it doesn't matter if the new database's directory is borked).

 Yeah.  After re-reading the 2005 thread, I wonder if we shouldn't just
 bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
 all the copied files instead of doing a cp -r-equivalent directory
 copy.
 That would fix a number of existing replay hazards as well as making it
 safe to do what Tomas wants.  In the small scale this would cause more
 I/O
 (2 copies of the template database's data) but in production situations
 we might well come out ahead by avoiding a forced checkpoint of the rest
 of the cluster.  Also I guess we could skip WAL-logging if WAL archiving
 is off, similarly to the existing optimization for CREATE INDEX etc.
 
 That would be a nasty surprise for anyone who's using CREATE DATABASE
 as a fast way to clone a large database. But I would be OK with that,
 at least if we can skip the WAL-logging with wal_level=minimal.

That's true. Sadly, I can't think of a solution that would address both
use cases at the same time :-(

The only thing I can think of is having two CREATE DATABASE flavors.
One keeping the current approach (suitable for fast cloning) and one
with the WAL logging (minimizing the CREATE DATABASE duration the impact
on other backends).

It will probably make the code significantly more complex, which is not
exactly desirable, I guess. Also, if we keep the current code (even if
only as a special case) it won't eliminate the existing replay hazards
(which was one of the Tom's arguments for biting the bullet).

I'm also thinking that for wal_level=archive and large databases, this
won't really eliminate the checkpoint as it will likely generate enough
WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?

That being said, our CREATE DATABASE docs currently say this

Although it is possible to copy a database other than template1 by
specifying its name as the template, this is not (yet) intended as
a general-purpose COPY DATABASE facility. The principal
limitation is that no other sessions can be connected to the
template database while it is being copied. CREATE DATABASE will
fail if any other connection exists when it starts; otherwise, new
connections to the template database are locked out until CREATE
DATABASE completes. See Section 21.3 for more information.

I think that this limitation pretty much means no one should use CREATE
DATABASE for cloning live databases in production environment (because
of the locking).

It also seems to me the general-purpose COPY DATABASE described in the
docs is what we're describing in this thread.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions

2014-10-27 Thread Adam Brightwell
All,

Attached is a patch with minor updates/corrections.

-Adam

-- 
Adam Brightwell - adam.brightw...@crunchydatasolutions.com
Database Engineer - www.crunchydatasolutions.com
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
new file mode 100644
index b257b02..8cdc5cb
*** a/src/backend/catalog/Makefile
--- b/src/backend/catalog/Makefile
*** POSTGRES_BKI_SRCS = $(addprefix $(top_sr
*** 41,47 
  	pg_foreign_data_wrapper.h pg_foreign_server.h pg_user_mapping.h \
  	pg_foreign_table.h pg_rowsecurity.h \
  	pg_default_acl.h pg_seclabel.h pg_shseclabel.h pg_collation.h pg_range.h \
! 	toasting.h indexing.h \
  )
  
  # location of Catalog.pm
--- 41,47 
  	pg_foreign_data_wrapper.h pg_foreign_server.h pg_user_mapping.h \
  	pg_foreign_table.h pg_rowsecurity.h \
  	pg_default_acl.h pg_seclabel.h pg_shseclabel.h pg_collation.h pg_range.h \
! 	pg_diralias.h toasting.h indexing.h \
  )
  
  # location of Catalog.pm
diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c
new file mode 100644
index d30612c..3717bf5
*** a/src/backend/catalog/aclchk.c
--- b/src/backend/catalog/aclchk.c
***
*** 30,35 
--- 30,36 
  #include catalog/pg_collation.h
  #include catalog/pg_conversion.h
  #include catalog/pg_database.h
+ #include catalog/pg_diralias.h
  #include catalog/pg_default_acl.h
  #include catalog/pg_event_trigger.h
  #include catalog/pg_extension.h
***
*** 48,53 
--- 49,55 
  #include catalog/pg_ts_config.h
  #include catalog/pg_ts_dict.h
  #include commands/dbcommands.h
+ #include commands/diralias.h
  #include commands/proclang.h
  #include commands/tablespace.h
  #include foreign/foreign.h
*** ExecGrant_Type(InternalGrant *istmt)
*** 3183,3188 
--- 3185,3374 
  	heap_close(relation, RowExclusiveLock);
  }
  
+ /*
+  * ExecuteGrantDirAliasStmt
+  *   handles the execution of the GRANT/REVOKE ON DIRALIAS command.
+  *
+  * stmt - the GrantDirAliasStmt that describes the directory aliases and
+  *permissions to be granted/revoked.
+  */
+ void
+ ExecuteGrantDirAliasStmt(GrantDirAliasStmt *stmt)
+ {
+ 	Relation		pg_diralias_rel;
+ 	Oid				grantor;
+ 	List		   *grantee_ids = NIL;
+ 	AclMode			permissions;
+ 	ListCell	   *item;
+ 
+ 	/* Must be superuser to grant directory alias permissions */
+ 	if (!superuser())
+ 		ereport(ERROR,
+ (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+  errmsg("must be superuser to grant directory alias permissions")));
+ 
+ 	/*
+ 	 * Grantor is optional.  If it is not provided then set it to the current
+ 	 * user.
+ 	 */
+ 	if (stmt->grantor)
+ 		grantor = get_role_oid(stmt->grantor, false);
+ 	else
+ 		grantor = GetUserId();
+ 
+ 	/* Convert grantee names to oids */
+ 	foreach(item, stmt->grantees)
+ 	{
+ 		PrivGrantee *grantee = (PrivGrantee *) lfirst(item);
+ 
+ 		if (grantee->rolname == NULL)
+ 			grantee_ids = lappend_oid(grantee_ids, ACL_ID_PUBLIC);
+ 		else
+ 		{
+ 			Oid roleid = get_role_oid(grantee->rolname, false);
+ 			grantee_ids = lappend_oid(grantee_ids, roleid);
+ 		}
+ 	}
+ 
+ 	permissions = ACL_NO_RIGHTS;
+ 
+ 	/* If ALL was provided then set permissions to ACL_ALL_RIGHTS_DIRALIAS */
+ 	if (stmt->permissions == NIL)
+ 		permissions = ACL_ALL_RIGHTS_DIRALIAS;
+ 	else
+ 	{
+ 		/* Condense all permissions */
+ 		foreach(item, stmt->permissions)
+ 		{
+ 			AccessPriv *priv = (AccessPriv *) lfirst(item);
+ 			permissions |= string_to_privilege(priv->priv_name);
+ 		}
+ 	}
+ 
+ 
+ 	/*
+ 	 * Though it shouldn't be possible to provide permissions other than READ
+ 	 * and WRITE, check to make sure no others have been set.  If they have,
+ 	 * then warn the user and correct the permissions.
+ 	 */
+ 	if (permissions  !((AclMode) ACL_ALL_RIGHTS_DIRALIAS))
+ 	{
+ 		ereport(WARNING,
+ (errcode(ERRCODE_INVALID_GRANT_OPERATION),
+  errmsg("directory aliases only support READ and WRITE permissions")));
+ 
+ 		permissions = ACL_ALL_RIGHTS_DIRALIAS;
+ 	}
+ 
+ 	pg_diralias_rel = heap_open(DirAliasRelationId, RowExclusiveLock);
+ 
+ 	/* Grant/Revoke permissions on directory aliases */
+ 	foreach(item, stmt->directories)
+ 	{
+ 		Datum			values[Natts_pg_diralias];
+ 		bool			replaces[Natts_pg_diralias];
+ 		bool			nulls[Natts_pg_diralias];
+ 		ScanKeyData		skey[1];
+ 		HeapScanDesc	scandesc;
+ 		HeapTuple		tuple;
+ 		HeapTuple		new_tuple;
+ 		Datum			datum;
+ 		Oid				owner_id;
+ 		Acl			   *dir_acl;
+ 		Acl			   *new_acl;
+ 		bool			is_null;
+ 		int				num_old_members;
+ 		int				num_new_members;
+ 		Oid			   *old_members;
+ 		Oid			   *new_members;
+ 		Oid				diralias_id;
+ 		char		   *name;
+ 
+ 		name = strVal(lfirst(item));
+ 
+ 		ScanKeyInit(skey[0],
+ 	Anum_pg_diralias_dirname,
+ 	BTEqualStrategyNumber, F_NAMEEQ,
+ 	CStringGetDatum(name));
+ 
+ 		scandesc = heap_beginscan_catalog(pg_diralias_rel, 1, skey);
+ 
+ 		tuple = heap_getnext(scandesc, ForwardScanDirection);
+ 
+ 		if (!HeapTupleIsValid(tuple))
+ 			elog(ERROR, could not find 

Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 That being said, our CREATE DATABASE docs currently say this

 Although it is possible to copy a database other than template1 by
 specifying its name as the template, this is not (yet) intended as
 a general-purpose COPY DATABASE facility. The principal
 limitation is that no other sessions can be connected to the
 template database while it is being copied. CREATE DATABASE will
 fail if any other connection exists when it starts; otherwise, new
 connections to the template database are locked out until CREATE
 DATABASE completes. See Section 21.3 for more information.

 I think that this limitation pretty much means no one should use CREATE
 DATABASE for cloning live databases in production environment (because
 of the locking).

Good point.  But the other side of that coin is that if somebody *was*
doing this, the locking would mean that slowing it down would be even
more painful than you might think.  Still, personally I'm willing to
accept that downside, given that we've pretty much always had the above
caveat in the docs.

 It also seems to me the general-purpose COPY DATABASE described in the
 docs is what we're describing in this thread.

Well, no, because the restriction "nobody can be connected to the source
database" would still apply; relaxing that would be enormously more
complicated, and probably fragile, than what we're talking about here.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgcrypto: PGP signatures

2014-10-27 Thread Jeff Janes
On Mon, Oct 20, 2014 at 3:32 PM, Marko Tiikkaja ma...@joh.to wrote:

 Hi,

 Here's the rebased patch -- as promised -- in a v7.



Hi Marko,

Using the same script as for the memory leak, I am getting seg faults using
this patch.


24425  2014-10-27 15:42:11.819 PDT LOG:  server process (PID 24452) was
terminated by signal 11: Segmentation fault
24425  2014-10-27 15:42:11.819 PDT DETAIL:  Failed process was running:
select pgp_sym_decrypt_verify(dearmor($1),$2,dearmor($3),'debug=1')

gdb backtrace:

#0  pfree (pointer=0x0) at mcxt.c:749
#1  0x7f617d7973e7 in pgp_sym_decrypt_verify_text (fcinfo=0x1d7f1f0) at
pgp-pgsql.c:1047
#2  0x0059f185 in ExecMakeFunctionResultNoSets (fcache=0x1d7f180,
econtext=0x1d7fb00, isNull=0x7fff02e902bf "", isDone=<value optimized out>)
at execQual.c:1992
#3  0x0059ae0c in ExecEvalExprSwitchContext (expression=<value optimized out>,
econtext=<value optimized out>, isNull=<value optimized out>,
isDone=<value optimized out>) at execQual.c:4320


Cheers,

Jeff


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Andrew Dunstan


On 10/27/2014 05:58 PM, Tomas Vondra wrote:

On 27.10.2014 17:24, Heikki Linnakangas wrote:

On 10/27/2014 03:46 PM, Tom Lane wrote:

Heikki Linnakangas hlinnakan...@vmware.com writes:

On 10/27/2014 03:21 PM, Tomas Vondra wrote:

Thinking about this a bit more, do we really need a full checkpoint?
That
is a checkpoint of all the databases in the cluster? Why
checkpointing the
source database is not enough?

A full checkpoint ensures that you always begin recovery *after* the
DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
crash recovery (except when you crash before the transaction commits, in
which case it doesn't matter if the new database's directory is borked).

Yeah.  After re-reading the 2005 thread, I wonder if we shouldn't just
bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
all the copied files instead of doing a cp -r-equivalent directory
copy.
That would fix a number of existing replay hazards as well as making it
safe to do what Tomas wants.  In the small scale this would cause more
I/O
(2 copies of the template database's data) but in production situations
we might well come out ahead by avoiding a forced checkpoint of the rest
of the cluster.  Also I guess we could skip WAL-logging if WAL archiving
is off, similarly to the existing optimization for CREATE INDEX etc.

That would be a nasty surprise for anyone who's using CREATE DATABASE
as a fast way to clone a large database. But I would be OK with that,
at least if we can skip the WAL-logging with wal_level=minimal.

That's true. Sadly, I can't think of a solution that would address both
use cases at the same time :-(

The only thing I can think of is having two CREATE DATABASE flavors.
One keeping the current approach (suitable for fast cloning) and one
with the WAL logging (minimizing the CREATE DATABASE duration and the impact
on other backends).

It will probably make the code significantly more complex, which is not
exactly desirable, I guess. Also, if we keep the current code (even if
only as a special case) it won't eliminate the existing replay hazards
(which was one of the Tom's arguments for biting the bullet).

I'm also thinking that for wal_level=archive and large databases, this
won't really eliminate the checkpoint as it will likely generate enough
WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?

That being said, our CREATE DATABASE docs currently say this

 Although it is possible to copy a database other than template1 by
 specifying its name as the template, this is not (yet) intended as
 a general-purpose COPY DATABASE facility. The principal
 limitation is that no other sessions can be connected to the
 template database while it is being copied. CREATE DATABASE will
 fail if any other connection exists when it starts; otherwise, new
 connections to the template database are locked out until CREATE
 DATABASE completes. See Section 21.3 for more information.

I think that this limitation pretty much means no one should use CREATE
DATABASE for cloning live databases in production environment (because
of the locking).

It also seems to me the general-purpose COPY DATABASE described in the
docs is what we're describing in this thread.




Notwithstanding what the docs say, I have seen CREATE DATABASE used 
plenty of times, and quite effectively, to clone databases. I don't 
think making it do twice the IO in the general case is going to go down 
well.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Andres Freund
On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote:
 
 On 10/27/2014 05:58 PM, Tomas Vondra wrote:
 On 27.10.2014 17:24, Heikki Linnakangas wrote:
 I'm also thinking that for wal_level=archive and large databases, this
 won't really eliminate the checkpoint as it will likely generate enough
 WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?
 
 That being said, our CREATE DATABASE docs currently say this
 
  Although it is possible to copy a database other than template1 by
  specifying its name as the template, this is not (yet) intended as
  a general-purpose COPY DATABASE facility. The principal
  limitation is that no other sessions can be connected to the
  template database while it is being copied. CREATE DATABASE will
  fail if any other connection exists when it starts; otherwise, new
  connections to the template database are locked out until CREATE
  DATABASE completes. See Section 21.3 for more information.
 
 I think that this limitation pretty much means no one should use CREATE
 DATABASE for cloning live databases in production environment (because
 of the locking).
 
 It also seems to me the general-purpose COPY DATABASE described in the
 docs is what we're describing in this thread.
 
 
 
 Notwithstanding what the docs say, I have seen CREATE DATABASE used plenty
 of times, and quite effectively, to clone databases. I don't think making it
 do twice the IO in the general case is going to go down well.

I think they're actually more likely to be happy that we wouldn't need
to do an immediate checkpoint anymore. The performance penalty from that
is likely to be much more severe than the actual IO.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions

2014-10-27 Thread Peter Eisentraut
I think the way this should work is that if you create a DIRALIAS, then
the COPY command should refer to it by logical name, e.g.,

CREATE DIRALIAS dumpster AS '/tmp/trash';
COPY mytable TO dumpster;


If you squint a bit, this is the same as a tablespace.  Maybe those two
concepts could be combined.

On the other hand, we already have file_fdw, which does something very
similar.
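
For reference, reading a server-side file through file_fdw today looks
roughly like this (the path and columns below are just placeholders):

CREATE EXTENSION file_fdw;
CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE price_list (item text, price numeric)
    SERVER file_server
    OPTIONS (filename '/tmp/prices.csv', format 'csv');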



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Andrew Dunstan


On 10/27/2014 07:01 PM, Andres Freund wrote:

On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote:

On 10/27/2014 05:58 PM, Tomas Vondra wrote:

On 27.10.2014 17:24, Heikki Linnakangas wrote:
I'm also thinking that for wal_level=archive and large databases, this
won't really eliminate the checkpoint as it will likely generate enough
WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?

That being said, our CREATE DATABASE docs currently say this

 Although it is possible to copy a database other than template1 by
 specifying its name as the template, this is not (yet) intended as
 a general-purpose COPY DATABASE facility. The principal
 limitation is that no other sessions can be connected to the
 template database while it is being copied. CREATE DATABASE will
 fail if any other connection exists when it starts; otherwise, new
 connections to the template database are locked out until CREATE
 DATABASE completes. See Section 21.3 for more information.

I think that this limitation pretty much means no one should use CREATE
DATABASE for cloning live databases in production environment (because
of the locking).

It also seems to me the general-purpose COPY DATABASE described in the
docs is what we're describing in this thread.



Notwithstanding what the docs say, I have seen CREATE DATABASE used plenty
of times, and quite effectively, to clone databases. I don't think making it
do twice the IO in the general case is going to go down well.

I think they're actually more likely to be happy that we wouldn't need
to do an immediate checkpoint anymore. The performance penalty from that
is likely to be much more severe than the actual IO.




At the very least that needs to be benchmarked.

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions

2014-10-27 Thread Peter Eisentraut
On 10/27/14 7:27 AM, Stephen Frost wrote:
 * Peter Eisentraut (pete...@gmx.net) wrote:
 On 10/16/14 12:01 PM, Stephen Frost wrote:
 This started out as a request for a non-superuser to be able to review
 the log files without needing access to the server.

 I think that can be done with a security-definer function.
 
 Of course it can be.  We could replace the entire authorization system
 with security definer functions too.

I don't think that is correct.

It's easy to do something with security definer functions if it's single
purpose, with a single argument, like "load this file into this table,
let these users do it".
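
For illustration, a minimal sketch of that kind of single-purpose wrapper
(the function and role names are made up; it assumes a superuser creates it,
since pg_read_file() is superuser-only, and as written it does nothing to
reject '..' in the argument):

CREATE FUNCTION read_server_log(fname text) RETURNS text
    LANGUAGE sql SECURITY DEFINER
    AS $$ SELECT pg_read_file('pg_log/' || fname) $$;
REVOKE ALL ON FUNCTION read_server_log(text) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION read_server_log(text) TO log_readers;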

It's not easy to do it with functions if you have many parameters, like
in a general SELECT statement.

So I would like to see at least three wildly different use cases for
this before believing that a security definer function isn't appropriate.

 I don't view this as an argument
 against this feature, particularly as we know other systems have it,
 users have asked for multiple times, and large PG deployments have had
 to hack around our lack of it.

What other systems have it?  Do you have links to their documentation?




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote:
 Notwithstanding what the docs say, I have seen CREATE DATABASE used plenty
 of times, and quite effectively, to clone databases. I don't think making it
 do twice the IO in the general case is going to go down well.

 I think they're actually more likely to be happy that we wouldn't need
 do a immediate checkpoint anymore. The performance penalty from that
 likely to be much more severe than the actual IO.

Note that currently, CREATE DATABASE requires fsync'ing each file written
into the new database.  With the proposed new implementation, we'd write
out that data to the kernel *but not have to fsync it*.  Instead, we'd
fsync just the WAL.  At least on spinning rust, that could be a
considerable win, for exactly the same reasons that we don't fsync
anything but WAL for ordinary transaction commits (ie, way fewer seeks).
Maybe not by enough to counteract doubling the write volume, but I think
we'd need some benchmarks before concluding that it's completely horrid.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-10-27 Thread Simon Riggs
On 27 October 2014 20:24, Peter Geoghegan p...@heroku.com wrote:
 On Mon, Oct 27, 2014 at 1:22 PM, Robert Haas robertmh...@gmail.com wrote:
 On Mon, Oct 27, 2014 at 1:44 PM, Peter Geoghegan p...@heroku.com wrote:
 I think by far the biggest problem here is the
 lack of attention to the design from others.

 I find that attitude incredible.

 What I mean is that that's the thing that clearly needs scrutiny the
 most. That isn't an attitude - it's a technical opinion.

Let's see if we can link these two thoughts.

1. You think the biggest problem is the lack of attention to the design.

2. I keep asking you to put the docs in a readable form.

If you can't understand the link between those two things, I am at a loss.

Good luck with the patch.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions

2014-10-27 Thread Stephen Frost
* Peter Eisentraut (pete...@gmx.net) wrote:
 On 10/27/14 7:27 AM, Stephen Frost wrote:
  * Peter Eisentraut (pete...@gmx.net) wrote:
  On 10/16/14 12:01 PM, Stephen Frost wrote:
  This started out as a request for a non-superuser to be able to review
  the log files without needing access to the server.
 
  I think that can be done with a security-definer function.
  
  Of course it can be.  We could replace the entire authorization system
  with security definer functions too.
 
 I don't think that is correct.

Of course it is- you simply have to move all the logic into the
function.

 It's easy to do something with security definer functions if it's single
 purpose, with a single argument, like load this file into this table,
 let these users do it.

The files won't be consistently named and there may be cases for making
ad-hoc runs or test runs.  No, it isn't as simple as always being a
single, specific filename, and when you consider that there needs to be
intelligence about the actual path being specified, and that we must make
sure there can't be '..' and similar, it gets to be a pretty ugly situation
to make our users deal with.

 It's not easy to do it with functions if you have many parameters, like
 in a general SELECT statement.

You could define SRFs for every table.

 So I would like to see at least three wildly different use cases for
 this before believing that a security definer function isn't appropriate.

I'm not following this- there are probably hundreds of use-cases for this, but
they're all variations on 'read and/or write data server-side instead of
through a front-end connection', which is the purpose of the
feature.  I do see this as being useful for COPY, Large Object, and
the file_fdw...

  I don't view this as an argument
  against this feature, particularly as we know other systems have it,
  users have asked for multiple times, and large PG deployments have had
  to hack around our lack of it.
 
 What other systems have it?  Do you have links to their documentation?

MySQL:
http://dev.mysql.com/doc/refman/5.1/en/privileges-provided.html#priv_file

(note they provide a way to limit access also, via secure_file_priv)

Oracle:
http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_5007.htm
http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_9013.htm#i2125999

SQL Server:
http://msdn.microsoft.com/en-us/library/ms175915.aspx
(Note: they can actually run as the user connected instead of the SQL DB
server, if Windows authentication is used, which is basically doing
Kerberos proxying unless I'm mistaken; it's unclear how the security is
maintained if it's a SQL server logon user..).

DB2:
http://www-01.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.admin.dm.doc/doc/c0004589.html?cp=SSEPGG_9.7.0

etc...

Thanks,

Stephen




Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Tomas Vondra
On 28.10.2014 00:06, Andrew Dunstan wrote:
 
 On 10/27/2014 07:01 PM, Andres Freund wrote:
 On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote:
 On 10/27/2014 05:58 PM, Tomas Vondra wrote:
 On 27.10.2014 17:24, Heikki Linnakangas wrote:
 I'm also thinking that for wal_level=archive and large databases, this
 won't really eliminate the checkpoint as it will likely generate enough
 WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?

 That being said, our CREATE DATABASE docs currently say this

  Although it is possible to copy a database other than template1 by
  specifying its name as the template, this is not (yet) intended as
  a general-purpose COPY DATABASE facility. The principal
  limitation is that no other sessions can be connected to the
  template database while it is being copied. CREATE DATABASE will
  fail if any other connection exists when it starts; otherwise, new
  connections to the template database are locked out until CREATE
  DATABASE completes. See Section 21.3 for more information.

 I think that this limitation pretty much means no one should use CREATE
 DATABASE for cloning live databases in production environment (because
 of the locking).

 It also seems to me the general-purpose COPY DATABASE described in
 the
 docs is what we're describing in this thread.


 Notwithstanding what the docs say, I have seen CREATE DATABASE used
 plenty
 of times, and quite effectively, to clone databases. I don't think
 making it
 do twice the IO in the general case is going to go down well.
 I think they're actually more likely to be happy that we wouldn't need
 to do an immediate checkpoint anymore. The performance penalty from that
 is likely to be much more severe than the actual IO.

 
 
 At the very least that needs to be benchmarked.

The question is what workload we are going to benchmark. It's unlikely
that either approach will be a clear winner in all cases, so we'll have
to decide which ones are more common / important, or somehow combine
both approaches (and thus lose some of the WAL-only benefits).

I'm pretty sure we'll see about three main cases:

(1) read-mostly workloads

Current approach wins, because checkpoint is cheap and the
WAL-based approach results in 2x the I/O.

The difference is proportional to template database size. For
small databases it's negligible, for large databases it's more
significant.

(2) write-heavy workloads / small template database

WAL-based approach wins, because it does not require explicit
checkpoint and for small databases the I/O generated by WAL-logging
everything is lower than checkpoint (which is more random).

This is the case of our PostgreSQL clusters.

(3) write-heavy workloads / large template database

Current approach wins, for two reasons: (a) for large databases the
WAL-logging overhead may generate much more I/O than a checkpoint,
and (b) it may generate so many WAL segments it eventually triggers
a checkpoint anyway (even repeatedly).

The exact boundary between the cases really depends on multiple things:

(a) shared_buffers size (the larger the more expensive checkpoint)
(b) read-write activity (more writes = more expensive checkpoint)
(c) hardware (especially how well it handles random I/O)

Not sure how to decide which case is more important, and I agree that
there are people using CREATE DATABASE to clone databases - maybe not in
production, but e.g. for testing purposes (still, it'd be rather
unfortunate to make it considerably slower for them). Not sure how to
balance this :-/

So maybe we shouldn't cling to the WAL-logging approach too much. Maybe
Heikki's idea to abandon the full checkpoint and instead assume
that once the transaction commits, all the files were fsynced OK is the
way to go. Of course, this will do nothing about the replay hazards.

regards
Tomas


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-10-27 Thread Peter Geoghegan
On Mon, Oct 27, 2014 at 4:34 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Let's see if we can link these two thoughts.

 1. You think the biggest problem is the lack of attention to the design.

 2. I keep asking you to put the docs in a readable form.

 If you can't understand the link between those two things, I am at a loss.

You've read the docs. Please be clearer. In what sense are they not
readable? The main description of the feature appears on the INSERT
reference page:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/sql-insert.html

These paragraphs were added (there is more on the INSERT reference
page than mentioned here, but this is the main part):


The optional ON CONFLICT clause specifies a path to take as an
alternative to raising a uniqueness violation error. ON CONFLICT
IGNORE simply avoids inserting any individual row when it is
determined that a uniqueness violation error would otherwise need to
be raised. ON CONFLICT UPDATE has the system take an UPDATE path in
respect of such rows instead. ON CONFLICT UPDATE guarantees an atomic
INSERT or UPDATE outcome. While rows may be updated, the top-level
statement is still an INSERT, which is significant for the purposes of
statement-level triggers and the rules system. Note that in the event
of an ON CONFLICT path being taken, RETURNING returns no value in
respect of any not-inserted rows.

ON CONFLICT UPDATE optionally accepts a WHERE clause condition. When
provided, the statement only proceeds with updating if the condition
is satisfied. Otherwise, unlike a conventional UPDATE, the row is
still locked for update. Note that the condition is evaluated last,
after a conflict has been identified as a candidate to update.

ON CONFLICT UPDATE accepts EXCLUDED expressions in both its targetlist
and WHERE clause. This allows expressions (in particular, assignments)
to reference rows originally proposed for insertion. Note that the
effects of all per-row BEFORE INSERT triggers are carried forward.
This is particularly useful for multi-insert ON CONFLICT UPDATE
statements; when inserting or updating multiple rows, constants need
only appear once.
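
For concreteness, a statement of the sort described might look like this
(hypothetical table and columns; the exact spelling of the clause is still
under discussion):

INSERT INTO upsert_tab (key, val)
VALUES (1, 'foo'), (2, 'bar')
ON CONFLICT (key)
UPDATE SET val = EXCLUDED.val
WHERE upsert_tab.val IS DISTINCT FROM EXCLUDED.val;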

There are several restrictions on the ON CONFLICT UPDATE clause that
do not apply to UPDATE statements. Subqueries may not appear in either
the UPDATE targetlist, nor its WHERE clause (although simple
multi-assignment expressions are supported). WHERE CURRENT OF cannot
be used. In general, only columns in the target table, and excluded
values originally proposed for insertion may be referenced. Operators
and functions may be used freely, though.

INSERT with an ON CONFLICT UPDATE clause is a deterministic statement.
You cannot UPDATE any single row more than once. Writing such an
INSERT statement will raise a cardinality violation error. Rows
proposed for insertion should not duplicate each other in terms of
attributes constrained by the conflict-arbitrating unique index. Note
that the ordinary rules for unique indexes with regard to NULL apply
analogously to whether or not an arbitrating unique index indicates if
the alternative path should be taken, so multiple NULL values across a
set of rows proposed for insertion are acceptable; the statement will
always insert (assuming there is no unrelated error). Note that merely
locking a row (by having it not satisfy the WHERE clause condition)
does not count towards whether or not the row has been affected
multiple times (and whether or not a cardinality violation error is
raised).

ON CONFLICT UPDATE requires a column specification, which is used to
infer an index to limit pre-checking for duplicates to. ON CONFLICT
IGNORE makes this optional. Failure to anticipate and prevent would-be
uniqueness violations originating in some other unique index than the
single unique index that was anticipated as the sole source of
would-be uniqueness violations can result in failing to insert a row
(taking the IGNORE path) when a uniqueness violation may have actually
been appropriate; omitting the specification indicates a total
indifference to where any would-be uniqueness violation could occur.
Note that the ON CONFLICT UPDATE assignment may result in a uniqueness
violation, just as with a conventional UPDATE.

The rules for unique index inference are straightforward. Columns
and/or expressions specified must match all the columns/expressions of
some existing unique index on table_name. The order of the
columns/expressions in the index definition, or whether or not the
index definition specified NULLS FIRST or NULLS LAST, or the internal
sort order of each column (whether DESC or ASC were specified) are all
irrelevant. However, partial unique indexes are not supported as
arbiters of whether an alternative ON CONFLICT path should be taken.
Deferred unique constraints are also not accepted.
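
To illustrate the inference rules (hypothetical table; the ordering options
on the index are irrelevant to whether it matches):

CREATE UNIQUE INDEX ON users (lower(email) DESC NULLS LAST);

INSERT INTO users (email, name) VALUES ('a@example.com', 'A')
ON CONFLICT (lower(email))
UPDATE SET name = EXCLUDED.name;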

You must have INSERT privilege on a table in order to insert into it,
as well as UPDATE privilege if and only if ON CONFLICT UPDATE is
specified. If a 

Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread Tom Lane
Tomas Vondra t...@fuzzy.cz writes:
 So maybe we shouldn't cling to the WAL-logging approach too much. Maybe
 Heikki's idea to abandon the full checkpoint and instead assume
 that once the transaction commits, all the files were fsynced OK is the
 way to go. Of course, this will do nothing about the replay hazards.

Well, I'm not insisting on any particular method of getting there, but
if we're going to touch this area at all then I think "fix the replay
hazards" should be a non-negotiable requirement.  We'd never have accepted
such hazards if CREATE DATABASE were being introduced for the first time;
it's only like this because nobody felt like rewriting a Berkeley-era
kluge.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Function array_agg(array)

2014-10-27 Thread Ali Akbar

 super

 I tested the last version and I have no objections.

 1. We have wanted this feature for a long time - it has been on our ToDo list.

 2. The proposal and design of multidimensional aggregation are clean and nobody
  has objections here.

 3. There is zero impact on the current implementation. For performance
  reasons it uses a new array-optimized aggregator - 30% faster for this
  purpose than the current (scalar) array aggregator.

 4. The code is clean and respects the PostgreSQL coding rules.

 5. There are documentation changes and the necessary regression tests.

 6. Patching and compilation are clean, without warnings.

 This patch is ready for commit.

 Thank you for the patch.


Thank you for the thorough review process.

Regards,
-- 
Ali Akbar


Re: [HACKERS] alter user/role CURRENT_USER

2014-10-27 Thread Marti Raudsepp
On Fri, Oct 24, 2014 at 11:29 AM, Kyotaro HORIGUCHI
horiguchi.kyot...@lab.ntt.co.jp wrote:
  - 0001-ALTER-ROLE-CURRENT_USER_v2.patch  - the patch.

+RoleId:CURRENT_USER{ $$ = current_user;}
+   | USER  { $$ = current_user;}
+   | CURRENT_ROLE  { $$ = current_user;}
+   | SESSION_USER  { $$ = session_user;}

This is kind of ugly, and it means you can't distinguish between a
CURRENT_USER keyword and a quoted user name "current_user". It's a
legitimate user name, so the behavior of the following now changes:

CREATE ROLE current_user;
ALTER ROLE current_user SET work_mem='10MB';

There ought to be a better way to represent this than using magic string values.


It accepts CURRENT_USER/USER/CURRENT_ROLE/SESSION_USER.

On a stylistic note, do we really want to support the alternative
spellings of CURRENT_USER, like CURRENT_ROLE and USER? The SQL
standard is well-hated for inventing more keywords than necessary.
There is no precedent for using/allowing these aliases in PostgreSQL
DDL commands, and ALTER USER & ROLE aren't SQL standard anyway. So
maybe we should stick with just accepting one canonical syntax
instead.

I think the word USER may reasonably arise from an editing mistake;
"ALTER USER USER" does not seem like sane syntax that should be
accepted.


From your original email:

 - Modify syntax of ALTER USER so as to be an alias of ALTER ROLE.
   - Added ALL syntax as user name to ALTER USER.

But should ALTER USER ALL and ALTER ROLE ALL really do the same thing?
A user is a role with the LOGIN option. Every user is a role, but not
every role is a user. I suspect that ambiguity was why ALTER USER ALL
wasn't originally implemented.
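
To make the distinction concrete (role names are arbitrary):

CREATE USER alice;            -- shorthand for CREATE ROLE alice LOGIN
CREATE ROLE readers NOLOGIN;  -- a role, but arguably not a user

The question is whether ALTER USER ALL should also touch readers, or only
roles that can log in.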

Regards,
Marti


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing lock strength of adding foreign keys

2014-10-27 Thread Noah Misch
On Mon, Oct 27, 2014 at 08:24:15AM -0400, Robert Haas wrote:
 On Sat, Oct 25, 2014 at 2:00 PM, Noah Misch n...@leadboat.com wrote:
  http://www.postgresql.org/message-id/ca+tgmoy4glsxzk0tao29-ljtcuj0sl1xwcwq51xb-hfysgi...@mail.gmail.com
  http://www.postgresql.org/message-id/20893.1393892...@sss.pgh.pa.us
  http://www.postgresql.org/message-id/20140306224340.ga3551...@tornado.leadboat.com
 
  As far as triggers are concerned, the issue of skew between the
  transaction snapshot and what the ruleutils.c snapshots do seems to be
  the principal issue.  Commit e5550d5fec66aa74caad1f79b79826ec64898688
  changed pg_get_constraintdef() to use an MVCC snapshot rather than a
  current MVCC snapshot; if that change is safe, I am not aware of any
  reason why we couldn't change pg_get_triggerdef() similarly.
 
  pg_get_triggerdef() is fine as-is with concurrent CREATE TRIGGER.  The
  pg_get_constraintdef() change arose to ensure a consistent result when
  concurrent ALTER TABLE VALIDATE CONSTRAINT mutates a constraint definition.
  (Reducing the lock level of DROP TRIGGER or ALTER TRIGGER, however, would
  create the analogous problem for pg_get_triggerdef().)
 
 Maybe so, but I'd favor changing it anyway and getting it over with.
 The current situation seems to have little to recommend it; moreover,
 it would be nice, if it's possible and safe, to weaken the lock levels
 for all three of those commands at the same time.  Do you see any
 hazards for ALTER or DROP that do not exist for CREATE?

ALTER TRIGGER is not bad; like you say, change pg_get_triggerdef_worker() the
way commit e5550d5 changed pg_get_constraintdef_worker().  DROP TRIGGER is
more difficult.  pg_trigger.tgqual of a dropped trigger may reference other
dropped objects, which calls for equipping get_rule_expr() to use the
transaction snapshot.  That implicates quite a bit of ruleutils.c code.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread David G Johnston
Tomas Vondra wrote
 I mean, when we use database A as a template, why do we need to checkpoint
 B, C, D and F too? (Apologies if this is somehow obvious, I'm way out of
 my comfort zone in this part of the code.)

IIUC you have to checkpoint the whole cluster because it is not possible to
checkpoint individual databases.  There is only a single WAL stream and,
while it could carry source-database markers, I don't believe it does, so there
is no way to have separate checkpoint locations recorded for different
databases.

Adding such seems to introduce a lot of book-keeping and both reload and
file size overhead for little practical gain.

David J.



--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/proposal-CREATE-DATABASE-vs-partial-CHECKPOINT-tp5824343p5824522.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}

2014-10-27 Thread Peter Geoghegan
 ---
  doc/src/sgml/ddl.sgml |  23 +++
  doc/src/sgml/indices.sgml |  11 +-
  doc/src/sgml/mvcc.sgml|  43 --
  doc/src/sgml/plpgsql.sgml |  20 ++-
  doc/src/sgml/postgres-fdw.sgml|   8 ++
  doc/src/sgml/ref/create_index.sgml|   7 +-
  doc/src/sgml/ref/create_rule.sgml |   6 +-
  doc/src/sgml/ref/create_table.sgml|   5 +-
  doc/src/sgml/ref/create_trigger.sgml  |   5 +-
  doc/src/sgml/ref/create_view.sgml |  36 -
  doc/src/sgml/ref/insert.sgml  | 258 
 --
  doc/src/sgml/ref/set_constraints.sgml |   6 +-
  doc/src/sgml/trigger.sgml |  46 --
  13 files changed, 426 insertions(+), 48 deletions(-)

 #1 internal documentation stats:

  doc/src/sgml/indexam.sgml| 133 
 ---
  src/backend/access/nbtree/README |  90 +-
  src/backend/executor/README  |  35 +++
  3 files changed, 247 insertions(+), 11 deletions(-)

 #2 internal documentation stats:

 ---
  src/backend/executor/README | 49 
 +
  1 file changed, 49 insertions(+)

Just to put that in context, here are the documentation changes from
the original LATERAL commit:

 doc/src/sgml/keywords.sgml|   2 +-
 doc/src/sgml/queries.sgml |  83
+-
 doc/src/sgml/ref/select.sgml  | 102
+---

Commit 0ef0b30 added data-modifying CTE docs (docs only). That looks like:

 doc/src/sgml/queries.sgml| 177
+---
 doc/src/sgml/ref/select.sgml |  49 ++---
 2 files changed, 214 insertions(+), 12 deletions(-)

 Have I not provided a total of 4 isolation tests illustrating
 interesting concurrency/visibility interactions? That's a lot of
 isolation tests. Here is the tests commit stat:

  31 files changed, 1159 insertions(+), 8 deletions(-)

And to put the tests in context, here are the stats from the original
Hot Standby commit:

 src/test/regress/expected/hs_standby_allowed.out|  215
++
 src/test/regress/expected/hs_standby_check.out  |   20 +++
 src/test/regress/expected/hs_standby_disallowed.out |  137 +
 src/test/regress/expected/hs_standby_functions.out  |   40 +
 src/test/regress/pg_regress.c   |   33 ++--
 src/test/regress/sql/hs_primary_extremes.sql|   74 +
 src/test/regress/sql/hs_primary_setup.sql   |   25 +++
 src/test/regress/sql/hs_standby_allowed.sql |  121 +++
 src/test/regress/sql/hs_standby_check.sql   |   16 ++
 src/test/regress/sql/hs_standby_disallowed.sql  |  105 +
 src/test/regress/sql/hs_standby_functions.sql   |   24 +++
 src/test/regress/standby_schedule   |   21 +++

So (at least as measured by raw lines of tests), this feature is
better tested than the original Hot Standby commit, and by a wide
margin. Tests also serve as an explanatory tool for the feature (in
particular, isolation tests can be used in this way).
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing the cost of sinval messaging

2014-10-27 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Mon, Oct 27, 2014 at 4:27 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Neither of those messages seem to me to bear on this point.  AFAICS,
 the unlocked hasMessages test has a race condition, which the comment
 just above it argues isn't a problem in practice.  What I propose above
 would have an absolutely indistinguishable race condition, which again
 wouldn't be a problem in practice if you believe the comment's argument.

 Yes, the read from hasMessages has a race condition, which is in fact
 not a problem in practice if you believe the comment's argument.
 However, your proposed formulation has a second race condition, which
 is discussed in those emails, and which is precisely why we rejected
 the design you're now proposing three years ago.  That second race
 condition is that the read from stateP-nextMsgNum doesn't have to
 happen at the same time as the read from stateP-resetState. No CPU or
 compiler barrier separates them, so either can happen before the
 other. SICleanupQueue updates both values, so you have the problem
 first described precisely by Noah:

 # (In theory, you could
 # hit a case where the load of resetState gives an ancient false just as the
 # counters wrap to match.

 While it practically seems very unlikely that this would ever really
 happen, some guy named Tom Lane told us we should worry about it, so I
 did.

That argument is nonsense.  I complained about a lack of close analysis,
but with close analysis I think this is perfectly safe; or at least no
less safe than what's there now, with its not terribly bulletproof
assumption that a suitable memory barrier has happened recently.

You'll grant that reads and writes of the integer msgNum variables are
atomic, I trust (if not, we've got lots of bugs elsewhere).  So what we
have to worry about does not include totally-corrupted values, but only
stale values, including possibly out-of-order reads.  Ignoring the
possibility of MSGNUMWRAPAROUND wraparound for a moment, only our own
process ever changes nextMsgNum, so we should certainly see a current
value of that.  So the argument for a hazard is that maxMsgNum could have
been advanced 2^32 times since our process last read messages, and then we
come along and see that value, but we *don't* see our resetState flag set
even though that happened 2^32 - MAXNUMMESSAGES messages ago, and was
certainly flushed into shared memory shortly thereafter.  That's nonsense.
It's especially nonsense if you assume, as the existing code already does,
that the reading process recently executed a read barrier.

Now, as to what happens when MSGNUMWRAPAROUND wraparound occurs: that code
decrements all msgNum variables whether they belong to hopelessly-behind
backends or not.  That's why the time until a possible chance match is
2^32 messages and not something else.  However, since the proposed new
unlocked test might come along and see a partially-complete wraparound
update, it could see either nextMsgNum or maxMsgNum offset by
MSGNUMWRAPAROUND compared to the other.  What would most likely happen is
that this would make for a false decision that they're unequal, resulting
in a useless trip through the locked part of SIGetDataEntries, which is
harmless.  If our backend had fallen behind by exactly 2^32-MSGNUMWRAPAROUND
messages or 2^32+MSGNUMWRAPAROUND messages, we could make a false
conclusion that nextMsgNum and maxMsgNum are equal --- but, again, our
resetState must have become set a billion or so sinval messages back, and
it's just not credible that we're not going to see that flag set.

So I still say that the current code is wasting a lot of cycles for
no actual gain in safety.

 Personally, I think trying to further optimize this is most likely a
 waste of time.  I'm quite sure that there's a whole lot of stuff in
 the sinval code that could be further optimized, but it executes so
 infrequently that I believe it doesn't matter.

Maybe not today, but it's not hard to imagine usage patterns where this
would result in O(N^2) total work in the sinval code.

 If you've got some evidence that this is a real problem as opposed to
 a hypothetical one, we could certainly reopen the debate about whether
 ignoring the possibility that the loads from stateP-nextMsgNum and
 stateP-resetState will be widely-enough spaced in time for a
 quarter-million messages to be queued up in the meantime is
 sufficiently unlikely to ignore.

It's a billion messages, not a quarter million...

 However, absent evidence of a
 tangible performance problem, I'd be quite disinclined to throw the
 guaranteed safety of the current approach out the window.

I fail to find any guaranteed safety in the current approach.  It's
dependent on the assumption of a recently-executed read barrier in the
reading backend, which assumption is sufficient for what I'm proposing
as well.

Another thought here is that the whole thing becomes moot if we stick a
read barrier into the unlocked test.  We 

Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT

2014-10-27 Thread David G Johnston
Tom Lane-2 wrote:
 Tomas Vondra <tv@...> writes:
 So maybe we shouldn't cling to the WAL-logging approach too much. Maybe
 Heikki's idea to abandon the full checkpoint and instead assume
 that once the transaction commits, all the files were fsynced OK is the
 way to go. Of course, this will do nothing about the replay hazards.
 
 Well, I'm not insisting on any particular method of getting there, but
 if we're going to touch this area at all then I think "fix the replay
 hazards" should be a non-negotiable requirement.  We'd never have accepted
 such hazards if CREATE DATABASE were being introduced for the first time;
 it's only like this because nobody felt like rewriting a Berkeley-era
 kluge.

Maybe adding a ToDo:

Fix replay hazards in CREATE DATABASE 

and listing them explicitly would be a good start.

Not sure if WAL or CREATE would be more appropriate but WAL seems like a
better fit.

To the topic at hand: would "CREATE DATABASE name WITH LOGGED = true" work?
As with UNLOGGED tables, giving the user the choice of WAL/fsync/checkpoint
behavior seems reasonable.  As Tomas said, there are a couple of diametrically
opposed use-cases here, and it seems like we've already implemented the more
difficult one.
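
For example (purely hypothetical syntax, mirroring the UNLOGGED precedent;
nothing like it exists today):

CREATE DATABASE clonedb WITH TEMPLATE bigdb LOGGED = false;   -- current fast directory copy
CREATE DATABASE newdb WITH TEMPLATE template1 LOGGED = true;  -- WAL-logged copy, no forced checkpoint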

Addressing the replay hazards in the existing implementation could be
considered separately.  What I'm not sure about is whether the logging version
in fact already addresses them and, in the interest of safe and sane defaults,
whether we should implement it and make it the default, having the user specify
logged = false to get the old behavior, with warnings emitted as to the
replay hazards that could occur should the database crash during the create.

Is it possible to emit a warning before the command completes (or, rather,
begins)?

David J.



--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/proposal-CREATE-DATABASE-vs-partial-CHECKPOINT-tp5824343p5824526.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions

2014-10-27 Thread Stephen Frost
* Peter Eisentraut (pete...@gmx.net) wrote:
 I think the way this should work is that if you create a DIRALIAS, then
 the COPY command should refer to it by logical name, e.g.,
 
 CREATE DIRALIAS dumpster AS '/tmp/trash';
 COPY mytable TO dumpster;

You'd have to be able to specify the filename also.  I'm not against the
idea of using the 'diralias' alias name this way, just saying it isn't
quite as simple as the above.
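
Something along these lines, perhaps - the CREATE form is from your example,
the GRANT form is inferred from the patch's READ/WRITE permissions, and the
'alias/filename' path in COPY is only a guess at what specifying the filename
might look like:

CREATE DIRALIAS logdir AS '/var/log/postgresql';
GRANT READ ON DIRALIAS logdir TO log_readers;
COPY audit_log FROM 'logdir/audit.csv' WITH (FORMAT csv);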

 If you squint a bit, this is the same as a tablespace.  Maybe those two
 concepts could be combined.

CREATE TABLESPACE is something else which could be supported with
diralias, though it'd have to be an independently grantable capability
and it'd be a bad idea to let a user create tablespaces in a directory
and then also be able to copy from/to files there (backend crashes,
etc).  This exact capability is more-or-less what RDS has had to hack on
to PG for their environment, as I understand it, in case you're looking
for a use-case.

 On the other hand, we already have file_fdw, which does something very
 similar.

It's really not at all the same..  Perhaps we'll get there some day, but
we're a very long way away from file_fdw having the ability to replace
normal tablespaces...

Thanks!

Stephen




Re: [HACKERS] alter user/role CURRENT_USER

2014-10-27 Thread David G Johnston
Marti Raudsepp wrote:
 On Fri, Oct 24, 2014 at 11:29 AM, Kyotaro HORIGUCHI
 <horiguchi.kyotaro@...> wrote:
 
 But should ALTER USER ALL and ALTER ROLE ALL really do the same thing?
 A user is a role with the LOGIN option. Every user is a role, but not
 every role is a user. I suspect that ambiguity was why ALTER USER ALL
 wasn't originally implemented.

There is no system-level distinction here - only a semantic one, and only during
the issuance of a CREATE without specifying an explicit value for
LOGIN/NOLOGIN.

The documentation states user and group are aliases for ROLE, not subsets
that require or disallow login privileges.

I am OK with not making them true aliases in order to minimize user
confusion w.r.t. the semantics implied by user and group but then I'd submit
we simply note those two forms as deprecated in favor of role and that all
new role-based functionality will only be attached to role-based commands.

Since all of user, current_user and session_user have special syntactic
consideration in SQL [1] I'd be generally in favor of trying to keep that
dynamic intact while implementing this feature.  And for the same reason I
would not allow current_role as a keyword.  We didn't add a current_role
function but instead chose to rely on the standard keywords even when we
introduced ROLE.

I'm not clear on the keyword confusion, since while "current_user" is a valid
literal it requires quotation, whereas the token CURRENT_USER does not.

1. http://www.postgresql.org/docs/9.4/static/functions-info.html

David J.




--
View this message in context: 
http://postgresql.1045698.n5.nabble.com/alter-user-role-CURRENT-USER-tp5822520p5824528.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Reducing the cost of sinval messaging

2014-10-27 Thread Robert Haas
On Mon, Oct 27, 2014 at 8:54 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 That argument is nonsense.  I complained about a lack of close analysis,
 but with close analysis I think this is perfectly safe; or at least no
 less safe than what's there now, with its not terribly bulletproof
 assumption that a suitable memory barrier has happened recently.

I don't think there's any reason to think that assumption is any less
than clad in iron, if not tungsten.  The entire sinval mechanism
relies for its correctness on heavyweight locking: at transaction
commit, we queue up sinval messages before releasing locks; and we
check for sinval messages - at least - after acquiring locks.  For
this to be correct, the lock held when queueing up a sinval message
must be strong enough to prevent backends who might need to see that
message from acquiring a lock of their own.  In that way, by the time
a process manages to acquire a lock, it can be confident that the
invalidation messages it needs to see have already been queued.

Since a heavyweight lock acquisition or release is a memory barrier
many times over, it's fair to say that by the time the process
queueing the invalidation messages releases locks, the messages it
wrote must be committed to memory.  And conversely, when a process has
just acquired a lock, any invalidations committed to memory by another
backend will be visible to it by the time the lock acquisition has
completed.  The only way this isn't a sufficient interlock is if the
lock acquired by the second backend doesn't conflict with the lock
held by the first one.  But if that's the case, the whole system is
broken far beyond the ability of a memory barrier to add or detract.

 You'll grant that reads and writes of the integer msgNum variables are
 atomic, I trust (if not, we've got lots of bugs elsewhere).  So what we
 have to worry about does not include totally-corrupted values, but only
 stale values, including possibly out-of-order reads.

Check and check.

 Ignoring the
 possibility of MSGNUMWRAPAROUND wraparound for a moment, only our own
 process ever changes nextMsgNum, so we should certainly see a current
 value of that.

OK...

 So the argument for a hazard is that maxMsgNum could have
 been advanced 2^32 times since our process last read messages, and then we
 come along and see that value, but we *don't* see our resetState flag set
 even though that happened 2^32 - MAXNUMMESSAGES messages ago, and was
 certainly flushed into shared memory shortly thereafter.  That's nonsense.
 It's especially nonsense if you assume, as the existing code already does,
 that the reading process recently executed a read barrier.

I'm not sure I follow this part.

 Now, as to what happens when MSGNUMWRAPAROUND wraparound occurs: that code
 decrements all msgNum variables whether they belong to hopelessly-behind
 backends or not.  That's why the time until a possible chance match is
 2^32 messages and not something else.  However, since the proposed new
 unlocked test might come along and see a partially-complete wraparound
 update, it could see either nextMsgNum or maxMsgNum offset by
 MSGNUMWRAPAROUND compared to the other.

I definitely agree with that last sentence, which I think is the key
point, but I don't understand how 2^32 comes into it.  It seems to me
that a chance match is a possibility every MSGNUMWRAPAROUND messages,
precisely because one might be offset by that amount compared to the
other.

 If our backend had fallen behind by exactly 2^32-MSGNUMWRAPAROUND
 messages or 2^32+MSGNUMWRAPAROUND messages, we could make a false
 conclusion that nextMsgNum and maxMsgNum are equal --- but, again, our
 resetState must have become set a billion or so sinval messages back, and
 it's just not credible that we're not going to see that flag set.

I'd like a more precise argument than "not credible".  How many sinval
messages back does it have to be before we consider it safe?  Why that
number and not twice as many or half as many?

 So I still say that the current code is wasting a lot of cycles for
 no actual gain in safety

 Personally, I think trying to further optimize this is most likely a
 waste of time.  I'm quite sure that there's a whole lot of stuff in
 the sinval code that could be further optimized, but it executes so
 infrequently that I believe it doesn't matter.

 Maybe not today, but it's not hard to imagine usage patterns where this
 would result in O(N^2) total work in the sinval code.

To all of this I say: show me the profile or usage pattern where this
actually matters.  We worked hard when this patch was in development
and under review to find a usage pattern where this had a significant
impact, and were unable to do so.

 Another thought here is that the whole thing becomes moot if we stick a
 read barrier into the unlocked test.  We did not have that option during
 the last go-round with this code, but I'm inclined to think we ought to
 revisit sinvaladt.c with that tool in our belts anyway; 
