Re: [HACKERS] CINE in CREATE TABLE AS ... and CREATE MATERIALIZED VIEW ...
Hi All,

- Patch applied cleanly.
- Regression tests (make check) run fine.
- The patch covers the documentation changes.

Here are a few comments:

1) What is the need for the following change?

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index bcec173..9fe6855 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -1005,12 +1005,6 @@ LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval)
   lock->tail = proc;
   lock->head = proc;

- /*
-  * Set releaseOK, to make sure we get woken up as soon as the lock is
-  * released.
-  */
- lock->releaseOK = true;
-
  /* Can release the mutex now */
  SpinLockRelease(&lock->mutex);

It doesn't look related to this patch.

2)

diff --git a/src/bin/psql/help.c b/src/bin/psql/help.c
index ae5fe88..4d11952 100644
--- a/src/bin/psql/help.c
+++ b/src/bin/psql/help.c
@@ -247,7 +247,7 @@ slashUsage(unsigned short int pager)
  fprintf(output, _("  \\f [STRING]            show or set field separator for unaligned query output\n"));
  fprintf(output, _("  \\H                     toggle HTML output mode (currently %s)\n"), ON(pset.popt.topt.format == PRINT_HTML));
- fprintf(output, _("  \\pset [NAME [VALUE]]   set table output option\n"
+ fprintf(output, _("  \\pset [NAME [VALUE]]   set table output option\n"
   "                         (NAME := {format|border|expanded|fieldsep|fieldsep_zero|footer|null|\n"
   "                         numericlocale|recordsep|recordsep_zero|tuples_only|title|tableattr|pager|\n"
   "                         unicode_border_linestyle|unicode_column_linestyle|unicode_header_linestyle})\n"));

Why are the above changes required?

3) This patch adds IF NOT EXISTS for CREATE TABLE AS and CREATE MATERIALIZED VIEW, but test coverage for CREATE MATERIALIZED VIEW is missing.

Apart from this, the changes look good to me.

On Tue, Oct 14, 2014 at 10:28 PM, Fabrízio de Royes Mello <fabriziome...@gmail.com> wrote:
> On Wed, Oct 1, 2014 at 9:17 AM, Fabrízio de Royes Mello <fabriziome...@gmail.com> wrote:
>> Hi all,
>>
>> We already have IF NOT EXISTS for CREATE TABLE. Is there some reason not to have it for CREATE TABLE AS and CREATE MATERIALIZED VIEW?
>>
>> Patch attached to add CINE support to:
>> - CREATE TABLE AS
>> - CREATE MATERIALIZED VIEW
>>
>> Regards,
>> --
>> Fabrízio de Royes Mello
>> Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog: http://fabriziomello.github.io
>> Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello
>> Github: http://github.com/fabriziomello

--
Rushabh Lathia
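For illustration, a usage sketch of the proposed syntax (assuming the patch mirrors the existing CREATE TABLE ... IF NOT EXISTS form; repeating a statement should raise a NOTICE rather than an error):

    CREATE TABLE IF NOT EXISTS t AS SELECT generate_series(1,10) AS i;
    CREATE TABLE IF NOT EXISTS t AS SELECT generate_series(1,10) AS i;  -- NOTICE, not ERROR

    CREATE MATERIALIZED VIEW IF NOT EXISTS mv AS SELECT count(*) FROM t;
    CREATE MATERIALIZED VIEW IF NOT EXISTS mv AS SELECT count(*) FROM t;  -- NOTICE, not ERROR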
Re: [HACKERS] [v9.5] Custom Plan API
> FYI, patch v12 part 2 no longer applies cleanly.

Thanks. I rebased the patch set against the latest master branch. The attached v13 applies to master.

--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kai...@ak.jp.nec.com>

> -----Original Message-----
> From: thombr...@gmail.com [mailto:thombr...@gmail.com] On Behalf Of Thom Brown
> Sent: Sunday, October 26, 2014 9:22 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Robert Haas; Kohei KaiGai; Tom Lane; Alvaro Herrera; Shigeru Hanada; Simon Riggs; Stephen Frost; Andres Freund; PgHacker; Jim Mlodgenski; Peter Eisentraut
> Subject: Re: [HACKERS] [v9.5] Custom Plan API
>
> On 30 September 2014 07:27, Kouhei Kaigai <kai...@ak.jp.nec.com> wrote:
>>> On Mon, Sep 29, 2014 at 9:04 AM, Kohei KaiGai <kai...@kaigai.gr.jp> wrote:
>>>> Do we need to match the prototype of the wrapper function with the callback?
>>> Yes.
>> OK, I fixed up the patch part-2 to fit the declaration of GetSpecialCustomVar() with the corresponding callback. Also, a noise in the part-3 patch, introduced by git-pull, was removed.
>
> FYI, patch v12 part 2 no longer applies cleanly.
>
> --
> Thom

Attachments:
pgsql-v9.5-custom-scan.part-3.v13.patch
pgsql-v9.5-custom-scan.part-2.v13.patch
pgsql-v9.5-custom-scan.part-1.v13.patch
Re: [HACKERS] Function array_agg(array)
2014-10-27 9:11 GMT+07:00 Ali Akbar <the.ap...@gmail.com>:
> 2014-10-27 1:38 GMT+07:00 Pavel Stehule <pavel.steh...@gmail.com>:
>> Hi
>>
>> My idea is to use a new ArrayBuilder optimized for building multidimensional arrays, with its own state type. I think casting to ArrayBuildState is the base of our problems, so I would not do that. The code in array_agg_* is simple; the slightly more complex code is in nodeSubplan.c. Some schematic changes are in the attachments.
>
> Thanks! The structure looks clear, and thanks for the example on nodeSubplan.c. I will restructure the v10 of the patch to this structure.

Patch attached.

Regards,
--
Ali Akbar

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7e5bcd9..f59738a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12046,6 +12046,22 @@ NULL baz</literallayout>(3 rows)</entry>
      <row>
       <entry>
        <indexterm>
+        <primary>array_agg</primary>
+       </indexterm>
+       <function>array_agg(<replaceable class="parameter">anyarray</replaceable>)</function>
+      </entry>
+      <entry>
+       any
+      </entry>
+      <entry>
+       the same array type as input type
+      </entry>
+      <entry>input arrays, aggregated into higher-order multidimensional array. Rejects NULL and empty array as input.</entry>
+     </row>
+
+     <row>
+      <entry>
+       <indexterm>
         <primary>average</primary>
        </indexterm>
        <indexterm>
diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index 2f0680f..8c182a4 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -2238,6 +2238,11 @@ SELECT ARRAY(SELECT oid FROM pg_proc WHERE proname LIKE 'bytea%');
  array
 ---------------------------------------------------------------------
  {2011,1954,1948,1952,1951,1244,1950,2005,1949,1953,2006,31,2412,2413}
+
+SELECT ARRAY(SELECT array(select i) FROM generate_series(1,5) a(i));
+         array
+-----------------------
+ {{1},{2},{3},{4},{5}}
 (1 row)
 </programlisting>
 The subquery must return a single column. The resulting
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 401bad4..eb4de3b 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -231,7 +231,10 @@ ExecScanSubPlan(SubPlanState *node,
  bool found = false; /* TRUE if got at least one subplan tuple */
  ListCell *pvar;
  ListCell *l;
+ bool use_md_array_builder;
+ Oid md_array_element_type;
  ArrayBuildState *astate = NULL;
+ MdArrayBuildState *mdastate = NULL;

  /*
   * MULTIEXPR subplans, when executed, just return NULL; but first we
@@ -366,8 +369,25 @@ ExecScanSubPlan(SubPlanState *node,
   /* stash away current value */
   Assert(subplan->firstColType == tdesc->attrs[0]->atttypid);
   dvalue = slot_getattr(slot, 1, &disnull);
-  astate = accumArrayResult(astate, dvalue, disnull,
-                            subplan->firstColType, oldcontext);
+  /*
+   * use a fast multidimensional array builder when the input is an array;
+   * only check on the first iteration, and use the cached values on
+   * subsequent ones
+   */
+  if (astate == NULL && mdastate == NULL)
+  {
+   md_array_element_type = get_element_type(subplan->firstColType);
+   use_md_array_builder = OidIsValid(md_array_element_type);
+  }
+
+  if (use_md_array_builder)
+   mdastate = accumMdArray(mdastate,
+                           disnull ? NULL :
+                           DatumGetArrayTypeP(dvalue),
+                           disnull, md_array_element_type,
+                           oldcontext);
+  else
+   astate = accumArrayResult(astate, dvalue, disnull,
+                             subplan->firstColType, oldcontext);
   /* keep scanning subplan to collect all values */
   continue;
  }
@@ -439,6 +459,8 @@ ExecScanSubPlan(SubPlanState *node,
  /* We return the result in the caller's context */
  if (astate != NULL)
   result = makeArrayResult(astate, oldcontext);
+ else if (mdastate != NULL)
+  result = makeMdArray(mdastate, oldcontext);
  else
   result = PointerGetDatum(construct_empty_array(subplan->firstColType));
 }
@@ -951,7 +973,10 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
  ListCell *pvar;
  ListCell *l;
  bool found = false;
+ bool use_md_array_builder;
+ Oid md_array_element_type;
  ArrayBuildState *astate = NULL;
+ MdArrayBuildState *mdastate = NULL;

  if (subLinkType == ANY_SUBLINK ||
      subLinkType == ALL_SUBLINK)
@@ -1018,8 +1043,25 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
  /* stash away current value */
  Assert(subplan->firstColType == tdesc->attrs[0]->atttypid);
  dvalue = slot_getattr(slot, 1, &disnull);
-  astate = accumArrayResult(astate, dvalue, disnull,
-                            subplan->firstColType, oldcontext);
+  /*
+   * use a fast multidimensional array builder when the input is an array;
+   * only check on the first iteration, and use the cached values on
+   * subsequent ones
+   */
+  if (astate == NULL && mdastate == NULL)
+  {
+   md_array_element_type = get_element_type(subplan->firstColType);
+   use_md_array_builder =
Re: [HACKERS] DISTINCT with btree skip scan
On Sat, Jul 5, 2014 at 12:17 PM, Thomas Munro <mu...@ip9.org> wrote:
> postgres=# set enable_hashagg = false;
> SET
> Time: 0.302 ms
> postgres=# explain select distinct a from foo;
> ┌──────────────────────────────────────────────────────────────────────────────────────────────────────┐
> │                                               QUERY PLAN                                              │
> ├──────────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ Index Only Scan for distinct prefix 1 using foo_pkey on foo  (cost=0.43..263354.20 rows=10 width=4)   │
> │ Planning time: 0.063 ms                                                                                │
> └──────────────────────────────────────────────────────────────────────────────────────────────────────┘
> (2 rows)
>
> Time: 0.443 ms
> postgres=# select distinct a from foo;
> ┌───┐
> │ a │
> ├───┤
> │ 0 │
> │ 1 │
> │ 2 │
> │ 3 │
> │ 4 │
> │ 5 │
> │ 6 │
> │ 7 │
> │ 8 │
> │ 9 │
> └───┘
> (10 rows)
>
> Time: 0.565 ms

Hi Thomas,

I've had a quick look at this and it seems like a great win! I'm quite surprised that we've not got this already. I think this technology could also really help the performance of queries such as:

SELECT * from bigtable bt WHERE EXISTS(SELECT 1 FROM otherbigtable obt WHERE bt.somecol = obt.someIndexedButNonUniqueColumn);

I know you're not proposing to improve that first off, but it could be done later once the benefits of this are more commonly realised.

I think our shortfalls in this area have not gone unnoticed. I was reading this post https://www.periscope.io/blog/count-distinct-in-mysql-postgres-sql-server-and-oracle.html about comparing the performance of COUNT(DISTINCT col). I think this would give a big win for query 3 in that post. I'm trying to find some time to make some changes to transform queries to allow the GROUP BY to happen before the joins when possible, so between that and your patch we'd be 2 steps closer to making query 1 in the link above perform a little better on PostgreSQL.

Do you think you'll manage to get time to look at this a bit more? I'd be keen to look into the costing side of it if you think that'll help you any?

Regards

David Rowley
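For reference, the COUNT(DISTINCT col) pattern discussed above can already be expressed as a subquery today, and the inner DISTINCT is the form a skip scan would accelerate (table and column names reused from the example above):

    -- the slow form cited in the post
    SELECT COUNT(DISTINCT a) FROM foo;

    -- equivalent rewrite whose inner DISTINCT could use the skip scan
    SELECT COUNT(*) FROM (SELECT DISTINCT a FROM foo) s;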
Re: [HACKERS] On partitioning
Hi,

On Mon, Oct 13, 2014 at 04:38:39PM -0300, Alvaro Herrera wrote:
> Bruce Momjian wrote:
>> I realize there hasn't been much progress on this thread, but I wanted to chime in to say I think our current partitioning implementation is too heavy administratively, error-prone, and performance-heavy.
>
> On the contrary, I think there was lots of progress; there's lots of useful feedback from the initial design proposal I posted. I am a bit sad to admit that I'm not working on it at the moment as I had originally planned, though, because other priorities slipped in and I am not able to work on this for a while. Therefore if someone else wants to work on this topic, be my guest -- otherwise I hope to get to it in a few months.
>
> Oh, I just meant code progress --- I agree the discussion was fruitful.

FWIW, I think Robert's criticism regarding not basing this on the inheritance scheme was not responded to. He mentions a patch by Itagaki-san (from four years ago, unfortunately abandoned); details here:

https://wiki.postgresql.org/wiki/Table_partitioning#Active_Work_In_Progress

This patch could be resurrected, fixing some parts of it as was suggested at the time. But the most important decisions regarding the patch, such as storage structure, syntax, etc., would require building some consensus on whether this is a worthwhile direction. At least some consideration must be given to the idea that we might want to have remote partitions backed by FDW infrastructure in the near future, although that may not be the primary goal of the partitioning effort.

What do others think?

--
Amit
Re: [HACKERS] Function array_agg(array)
Hi

I did some minor changes in the code:

* moved the tests for old or new builder style for the array sublink out of the main cycles
* some API simplification of the new builder - we should not create an identical API, mainly it makes no sense

Regards

Pavel Stehule

2014-10-27 8:12 GMT+01:00 Ali Akbar <the.ap...@gmail.com>:
> 2014-10-27 9:11 GMT+07:00 Ali Akbar <the.ap...@gmail.com>:
>> 2014-10-27 1:38 GMT+07:00 Pavel Stehule <pavel.steh...@gmail.com>:
>>> Hi
>>>
>>> My idea is to use a new ArrayBuilder optimized for building multidimensional arrays, with its own state type. I think casting to ArrayBuildState is the base of our problems, so I would not do that. The code in array_agg_* is simple; the slightly more complex code is in nodeSubplan.c. Some schematic changes are in the attachments.
>>
>> Thanks! The structure looks clear, and thanks for the example on nodeSubplan.c. I will restructure the v10 of the patch to this structure.
>
> Patch attached.
>
> Regards,
> --
> Ali Akbar

commit f09f41f5fe780bed4cf25949961a4e68e6402ff0
Author: Pavel Stehule <pavel.steh...@gooddata.com>
Date:   Mon Oct 27 10:03:07 2014 +0100

    initial

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7e5bcd9..f59738a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12046,6 +12046,22 @@ NULL baz</literallayout>(3 rows)</entry>
      <row>
       <entry>
        <indexterm>
+        <primary>array_agg</primary>
+       </indexterm>
+       <function>array_agg(<replaceable class="parameter">anyarray</replaceable>)</function>
+      </entry>
+      <entry>
+       any
+      </entry>
+      <entry>
+       the same array type as input type
+      </entry>
+      <entry>input arrays, aggregated into higher-order multidimensional array. Rejects NULL and empty array as input.</entry>
+     </row>
+
+     <row>
+      <entry>
+       <indexterm>
         <primary>average</primary>
        </indexterm>
        <indexterm>
diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index 2f0680f..8c182a4 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -2238,6 +2238,11 @@ SELECT ARRAY(SELECT oid FROM pg_proc WHERE proname LIKE 'bytea%');
  array
 ---------------------------------------------------------------------
  {2011,1954,1948,1952,1951,1244,1950,2005,1949,1953,2006,31,2412,2413}
+
+SELECT ARRAY(SELECT array(select i) FROM generate_series(1,5) a(i));
+         array
+-----------------------
+ {{1},{2},{3},{4},{5}}
 (1 row)
 </programlisting>
 The subquery must return a single column. The resulting
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 401bad4..21b8de1 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -231,7 +231,10 @@ ExecScanSubPlan(SubPlanState *node,
  bool found = false; /* TRUE if got at least one subplan tuple */
  ListCell *pvar;
  ListCell *l;
+ bool use_md_array_builder = false;
+ Oid md_array_element_type = InvalidOid;
  ArrayBuildState *astate = NULL;
+ MdArrayBuildState *mdastate = NULL;

  /*
   * MULTIEXPR subplans, when executed, just return NULL; but first we
@@ -260,6 +263,16 @@ ExecScanSubPlan(SubPlanState *node,
  }

  /*
+  * use a fast multidimensional array builder when the input is an array;
+  * only check on the first iteration, and use the cached values on
+  * subsequent ones
+  */
+ if (subLinkType == ARRAY_SUBLINK)
+ {
+  md_array_element_type = get_element_type(subplan->firstColType);
+  use_md_array_builder = OidIsValid(md_array_element_type);
+ }
+
+ /*
   * We are probably in a short-lived expression-evaluation context. Switch
   * to the per-query context for manipulating the child plan's chgParam,
   * calling ExecProcNode on it, etc.
@@ -366,8 +379,16 @@ ExecScanSubPlan(SubPlanState *node,
   /* stash away current value */
   Assert(subplan->firstColType == tdesc->attrs[0]->atttypid);
   dvalue = slot_getattr(slot, 1, &disnull);
-  astate = accumArrayResult(astate, dvalue, disnull,
-                            subplan->firstColType, oldcontext);
+
+  if (use_md_array_builder)
+   mdastate = accumMdArray(mdastate,
+                           disnull ? NULL :
+                           DatumGetArrayTypeP(dvalue),
+                           disnull, md_array_element_type,
+                           oldcontext);
+  else
+   astate = accumArrayResult(astate, dvalue, disnull,
+                             subplan->firstColType, oldcontext);
   /* keep scanning subplan to collect all values */
   continue;
  }
@@ -439,6 +460,8 @@ ExecScanSubPlan(SubPlanState *node,
  /* We return the result in the caller's context */
  if (astate != NULL)
   result = makeArrayResult(astate, oldcontext);
+ else if (mdastate != NULL)
+  result = PointerGetDatum(makeMdArray(mdastate, oldcontext, true));
  else
   result = PointerGetDatum(construct_empty_array(subplan->firstColType));
 }
@@ -951,7 +974,10 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
  ListCell *pvar;
  ListCell *l;
  bool found = false;
+ bool use_md_array_builder = false;
+ Oid md_array_element_type = InvalidOid;
  ArrayBuildState *astate =
Re: [HACKERS] On partitioning
Amit Langote wrote:
> On Mon, Oct 13, 2014 at 04:38:39PM -0300, Alvaro Herrera wrote:
>> Bruce Momjian wrote:
>>> I realize there hasn't been much progress on this thread, but I wanted to chime in to say I think our current partitioning implementation is too heavy administratively, error-prone, and performance-heavy.
>>
>> On the contrary, I think there was lots of progress; there's lots of useful feedback from the initial design proposal I posted. I am a bit sad to admit that I'm not working on it at the moment as I had originally planned, though, because other priorities slipped in and I am not able to work on this for a while. Therefore if someone else wants to work on this topic, be my guest -- otherwise I hope to get to it in a few months.
>>
>> Oh, I just meant code progress --- I agree the discussion was fruitful.
>
> FWIW, I think Robert's criticism regarding not basing this on the inheritance scheme was not responded to.

It was responded to by ignoring it. I didn't see anybody else supporting the idea that inheritance is in any way a sane thing to base partitioning on. Sure, we have accumulated lots of kludges over the years to cope with the fact that, really, it doesn't work very well. So what. We can keep them, I don't care.

Anyway, as I said above, I'm not particularly interested in any more discussion on this topic for the time being, since I don't have time to work on this patch. If anybody wants to continue discussing to improve the design some more, and even implement it or parts of it, that's fine with me -- but please expect me not to answer.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
On 10/26/2014 11:47 PM, Tomas Vondra wrote:
> After eyeballing the code for an hour or two, I think CREATE DATABASE should be fine with performing only a 'partial checkpoint' on the template database - calling FlushDatabaseBuffers and processing unlink requests, as suggested by the comment in createdb().

Hmm. You could replace the first checkpoint with that, but I don't think that's enough for the second. To get any significant performance benefit, you need to get rid of both checkpoints, because doing two checkpoints one after another is almost as fast as doing a single checkpoint; the second checkpoint has very little work to do because the first checkpoint already flushed out everything.

The second checkpoint, after copying but before commit, is done because (from the comments in the createdb function):

 * #1: When PITR is off, we don't XLOG the contents of newly created
 * indexes; therefore the drop-and-recreate-whole-directory behavior
 * of DBASE_CREATE replay would lose such indexes.

 * #2: Since we have to recopy the source database during DBASE_CREATE
 * replay, we run the risk of copying changes in it that were
 * committed after the original CREATE DATABASE command but before the
 * system crash that led to the replay. This is at least unexpected
 * and at worst could lead to inconsistencies, eg duplicate table
 * names.

Doing only FlushDatabaseBuffers would not prevent these issues - you need a full checkpoint. These issues are better explained here: http://www.postgresql.org/message-id/28884.1119727...@sss.pgh.pa.us

To solve #1, we could redesign CREATE DATABASE so that replaying the DBASE_CREATE record doesn't zap the old directory, and also doesn't copy any files. We could instead just assume that if the transaction commits, all the files have been copied and fsync'd already, like we assume that if a CREATE INDEX commits with wal_level=minimal, the underlying file was fsync'd before the commit. That would also solve #2, when doing crash recovery. But it would remain when doing archive recovery. I guess we could still redo the copy when in archive recovery mode. I believe it would be the first time we have a WAL record that's replayed differently in crash recovery than in archive recovery, so here be dragons...

> It's not exactly a trivial change, but it does not seem frighteningly difficult coding either. The templates are usually static, so this would minimize both the CREATE DATABASE duration and the disruption to the cluster it causes.

I wonder if we should bite the bullet and start WAL-logging all the files that are copied from the template database to the new database. When the template database is small (template0 is 6.4MB currently), that wouldn't generate too much WAL. We could perhaps do that only if the template database is small, and do the checkpoints otherwise, although I wouldn't like to have subtly different behavior depending on database size like that.

- Heikki
Re: [HACKERS] strip nulls functions for json and jsonb
Hi

I am sending a final review of this patch:

0. This patch implements null-field stripping. It does exactly what was proposed, and we want to have this feature in core. It is a requested feature for the JSON types.

1. There is no problem with patch application or compilation - one warning is fixed in the attachment.

2. The code is relatively small and clean; I have no objection.

3. The necessary regression tests and related documentation are there.

I have no objections - this patch is ready for committer.

Thank you for the patch

Regards

Pavel

2014-10-26 21:57 GMT+01:00 Andrew Dunstan <and...@dunslane.net>:
> On 10/26/2014 04:22 PM, Pavel Stehule wrote:
>> 2014-10-26 21:18 GMT+01:00 Andrew Dunstan <and...@dunslane.net>:
>>> On 10/26/2014 04:14 PM, Thom Brown wrote:
>>>> On 26 October 2014 20:07, Andrew Dunstan <and...@dunslane.net> wrote:
>>>>> On 10/26/2014 03:50 PM, Pavel Stehule wrote:
>>>>>> Hi
>>>>>>
>>>>>> I have a question: what is the expected result of null stripping of {"a": {"b": null, "c", null} }?
>>>>>
>>>>> Please remember not to top-post. The above is not legal json, so the answer would be an error.
>>>>
>>>> I believe Pavel means: {"a": {"b": null, "c": null} }
>>>
>>> This is the expected result:
>>>
>>> andrew=# select json_strip_nulls('{"a": {"b": null, "c": null} }');
>>>  json_strip_nulls
>>> ------------------
>>>  {"a":{}}
>>> (1 row)
>>>
>>> It is NOT expected that we replace an empty object with NULL (and then strip it if it's a field value of an outer-level object).
>>
>> ok, this case should probably be in the regression tests
>
> Patch attached.
>
> cheers
>
> andrew

commit 03dc02f601c6b9097402317e2dfa34e0f5d33ba5
Author: Pavel Stehule <pavel.steh...@gooddata.com>
Date:   Mon Oct 27 10:53:55 2014 +0100

    initial

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index f59738a..e453da4 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -10716,6 +10716,19 @@ table2-mapping
        </programlisting>
        </entry>
       </row>
+      <row>
+       <entry><para><literal>json_strip_nulls(from_json json)</literal>
+        </para><para><literal>jsonb_strip_nulls(from_json jsonb)</literal>
+       </para></entry>
+       <entry><para><type>json</type></para><para><type>jsonb</type></para></entry>
+       <entry>
+        Returns <replaceable>from_json</replaceable>
+        with all object fields that have null values omitted. Other null values
+        are untouched.
+       </entry>
+       <entry><literal>json_strip_nulls('[{"f1":1,"f2":null},2,null,3]')</literal></entry>
+       <entry><literal>[{"f1":1},2,null,3]</literal></entry>
+      </row>
     </tbody>
    </tgroup>
   </table>
@@ -10752,6 +10765,16 @@ table2-mapping
    </para>
   </note>

+  <note>
+   <para>
+    If the argument to <literal>json_strip_nulls</> contains duplicate
+    field names in any object, the result could be semantically somewhat
+    different, depending on the order in which they occur. This is not an
+    issue for <literal>jsonb_strip_nulls</> since jsonb values never have
+    duplicate object field names.
+   </para>
+  </note>
+
   <para>
    See also <xref linkend="functions-aggregate"> for the aggregate
    function <function>json_agg</function> which aggregates record
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index 2d00dbe..efbec7f 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -105,6 +105,15 @@ static void populate_recordset_object_end(void *state);
 static void populate_recordset_array_start(void *state);
 static void populate_recordset_array_element_start(void *state, bool isnull);

+/* semantic action functions for json_strip_nulls */
+static void sn_object_start(void *state);
+static void sn_object_end(void *state);
+static void sn_array_start(void *state);
+static void sn_array_end(void *state);
+static void sn_object_field_start(void *state, char *fname, bool isnull);
+static void sn_array_element_start(void *state, bool isnull);
+static void sn_scalar(void *state, char *token, JsonTokenType tokentype);
+
 /* worker function for populate_recordset and to_recordset */
 static Datum populate_recordset_worker(FunctionCallInfo fcinfo, const char *funcname,
                                        bool have_record_arg);
@@ -225,6 +234,13 @@ typedef struct PopulateRecordsetState
  MemoryContext fn_mcxt;  /* used to stash IO funcs */
 } PopulateRecordsetState;

+/* state for json_strip_nulls */
+typedef struct StripnullState
+{
+ JsonLexContext *lex;
+ StringInfo strval;
+ bool skip_next_null;
+} StripnullState;
+
 /* Turn a jsonb object into a record */
 static void make_row_from_rec_and_jsonb(Jsonb *element,
                                         PopulateRecordsetState *state);
@@ -2996,3 +3012,184 @@ findJsonbValueFromContainerLen(JsonbContainer *container, uint32 flags,
  return findJsonbValueFromContainer(container,
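A behavior sketch based on the examples given in this thread (the exact output spacing of the jsonb variant is illustrative):

    SELECT json_strip_nulls('{"a": {"b": null, "c": null}}');
    -- {"a":{}}

    SELECT jsonb_strip_nulls('[{"f1": 1, "f2": null}, 2, null, 3]');
    -- [{"f1": 1}, 2, null, 3]   (only object fields are stripped; array nulls stay)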
Re: [HACKERS] Function array_agg(array)
2014-10-27 16:15 GMT+07:00 Pavel Stehule <pavel.steh...@gmail.com>:
> Hi
>
> I did some minor changes in the code:
> * moved the tests for old or new builder style for the array sublink out of the main cycles
> * some API simplification of the new builder - we should not create an identical API, mainly it makes no sense

Minor changes in the patch:

* remove the array_agg_finalfn_internal declaration at the top of array_userfuncs.c
* fix the comment of makeMdArray
* fix the return of makeMdArray
* remove unnecessary changes to array_agg_finalfn

Regards,
--
Ali Akbar

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7e5bcd9..f59738a 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -12046,6 +12046,22 @@ NULL baz</literallayout>(3 rows)</entry>
      <row>
       <entry>
        <indexterm>
+        <primary>array_agg</primary>
+       </indexterm>
+       <function>array_agg(<replaceable class="parameter">anyarray</replaceable>)</function>
+      </entry>
+      <entry>
+       any
+      </entry>
+      <entry>
+       the same array type as input type
+      </entry>
+      <entry>input arrays, aggregated into higher-order multidimensional array. Rejects NULL and empty array as input.</entry>
+     </row>
+
+     <row>
+      <entry>
+       <indexterm>
         <primary>average</primary>
        </indexterm>
        <indexterm>
diff --git a/doc/src/sgml/syntax.sgml b/doc/src/sgml/syntax.sgml
index 2f0680f..8c182a4 100644
--- a/doc/src/sgml/syntax.sgml
+++ b/doc/src/sgml/syntax.sgml
@@ -2238,6 +2238,11 @@ SELECT ARRAY(SELECT oid FROM pg_proc WHERE proname LIKE 'bytea%');
  array
 ---------------------------------------------------------------------
  {2011,1954,1948,1952,1951,1244,1950,2005,1949,1953,2006,31,2412,2413}
+
+SELECT ARRAY(SELECT array(select i) FROM generate_series(1,5) a(i));
+         array
+-----------------------
+ {{1},{2},{3},{4},{5}}
 (1 row)
 </programlisting>
 The subquery must return a single column. The resulting
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 401bad4..21b8de1 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -231,7 +231,10 @@ ExecScanSubPlan(SubPlanState *node,
  bool found = false; /* TRUE if got at least one subplan tuple */
  ListCell *pvar;
  ListCell *l;
+ bool use_md_array_builder = false;
+ Oid md_array_element_type = InvalidOid;
  ArrayBuildState *astate = NULL;
+ MdArrayBuildState *mdastate = NULL;

  /*
   * MULTIEXPR subplans, when executed, just return NULL; but first we
@@ -260,6 +263,16 @@ ExecScanSubPlan(SubPlanState *node,
  }

  /*
+  * use a fast multidimensional array builder when the input is an array;
+  * only check on the first iteration, and use the cached values on
+  * subsequent ones
+  */
+ if (subLinkType == ARRAY_SUBLINK)
+ {
+  md_array_element_type = get_element_type(subplan->firstColType);
+  use_md_array_builder = OidIsValid(md_array_element_type);
+ }
+
+ /*
   * We are probably in a short-lived expression-evaluation context. Switch
   * to the per-query context for manipulating the child plan's chgParam,
   * calling ExecProcNode on it, etc.
@@ -366,8 +379,16 @@ ExecScanSubPlan(SubPlanState *node,
   /* stash away current value */
   Assert(subplan->firstColType == tdesc->attrs[0]->atttypid);
   dvalue = slot_getattr(slot, 1, &disnull);
-  astate = accumArrayResult(astate, dvalue, disnull,
-                            subplan->firstColType, oldcontext);
+
+  if (use_md_array_builder)
+   mdastate = accumMdArray(mdastate,
+                           disnull ? NULL :
+                           DatumGetArrayTypeP(dvalue),
+                           disnull, md_array_element_type,
+                           oldcontext);
+  else
+   astate = accumArrayResult(astate, dvalue, disnull,
+                             subplan->firstColType, oldcontext);
   /* keep scanning subplan to collect all values */
   continue;
  }
@@ -439,6 +460,8 @@ ExecScanSubPlan(SubPlanState *node,
  /* We return the result in the caller's context */
  if (astate != NULL)
   result = makeArrayResult(astate, oldcontext);
+ else if (mdastate != NULL)
+  result = PointerGetDatum(makeMdArray(mdastate, oldcontext, true));
  else
   result = PointerGetDatum(construct_empty_array(subplan->firstColType));
 }
@@ -951,7 +974,10 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
  ListCell *pvar;
  ListCell *l;
  bool found = false;
+ bool use_md_array_builder = false;
+ Oid md_array_element_type = InvalidOid;
  ArrayBuildState *astate = NULL;
+ MdArrayBuildState *mdastate = NULL;

  if (subLinkType == ANY_SUBLINK ||
      subLinkType == ALL_SUBLINK)
@@ -960,6 +986,16 @@ ExecSetParamPlan(SubPlanState *node, ExprContext *econtext)
   elog(ERROR, "CTE subplans should not be executed via ExecSetParamPlan");

  /*
+  * use a fast multidimensional array builder when the input is an array;
+  * only check on the first iteration, and use the cached values on
+  * subsequent ones
+  */
+ if (subLinkType == ARRAY_SUBLINK)
+ {
+  md_array_element_type = get_element_type(subplan->firstColType);
+  use_md_array_builder =
Re: [HACKERS] Better support of exported snapshots with pg_dump
On 17/10/14 06:25, Michael Paquier wrote:
> Two votes in favor of that from two committers sounds like a deal. Here is a refreshed version of the patch introducing --snapshot from here, after fixing a couple of things and adding documentation:
> http://www.postgresql.org/message-id/ca+u5nmk9+ttcff_-4mfdxwhnastauhuq7u7uedd57vay28a...@mail.gmail.com

Hi,

I have two minor things:

+ printf(_("  --snapshot                   use given synchronous snapshot for the dump\n"));

I think this should say --snapshot=NAME or something like that, to make it visible that you are supposed to provide a parameter.

The other thing is, you changed the logic of the parallel dump behavior a little - before your patch it works in such a way that one connection exports the snapshot and the others do SET TRANSACTION SNAPSHOT; after your patch, even the connection that exported the snapshot does the SET TRANSACTION SNAPSHOT, which is unnecessary. I don't see it as too big a deal, but I don't see the need to change that behavior either. You could do something like this instead, maybe:

if (AH->sync_snapshot_id)
    SET TRANSACTION SNAPSHOT
else if (AH->numWorkers > 1 && AH->remoteVersion >= 90200 && !dopt->no_synchronized_snapshots)
    AH->sync_snapshot_id = get_synchronized_snapshot(AH);

And remove the "else if" for the "if (dumpsnapshot)" part.

--
Petr Jelinek                  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
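For context, a usage sketch of the proposed option together with an exported snapshot (the snapshot name shown is illustrative; the exporting session must keep its transaction open while pg_dump runs):

    -- session 1
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT pg_export_snapshot();   -- returns a name such as '000003A1-1'

    # session 2, shell
    pg_dump --snapshot=000003A1-1 mydb > mydb.sql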
Re: [HACKERS] END_OF_RECOVERY shutdowns and ResetUnloggedRelations()
At 2014-09-25 22:41:18 +0200, and...@2ndquadrant.com wrote:
> On 2014-09-24 17:06:05 +0530, Abhijit Menon-Sen wrote:
>> 1. Move the call to ResetUnloggedRelations(UNLOGGED_RELATION_INIT) to earlier in StartupXLOG.
>>
>> 2. Inside that function, issue fsync()s for the main forks we create by copying the _init fork.
>
> These two imo should definitely be backpatched.

I've attached updated versions of these two patches. The only changes are to revise the comments as you suggested.

>> 3. A small fixup to add a const to "typedef char *FileName", because the earlier patch gave me warnings about discarding const-ness. This is consistent with many other functions in fd.c that take const char *.
>
> I'm happy to consider that one for master (although I seem to recall having had a patch for it rejected?), but I don't think we want to do that in the backbranches.

(I gave up on this for now as an unrelated diversion.)

>> 4. Issue an fsync() on the data directory at startup if we need to perform crash recovery.
>
> I personally think this one should be backpatched too. Does anyone believe differently?

I revised this patch as well; I'll rebase and send it separately along with the initdb -S patch shortly.

-- Abhijit

From 8487185a5f2073ed475f971e6203d49e1e21a987 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date: Wed, 8 Oct 2014 11:48:25 +0530
Subject: Call ResetUnloggedRelations(UNLOGGED_RELATION_INIT) earlier

We need to call this after recovery, but not after the END_OF_RECOVERY
checkpoint is written. If we crash after that checkpoint, for example
because of ENOSPC in PreallocXlogFiles, the checkpoint has already set
ControlFile->state to DB_SHUTDOWNED, so we don't enter recovery again
at startup. Because we did call
ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP) in the first cleanup,
this leaves the database with a bunch of _init forks for unlogged
relations, but no main forks, leading to scary errors.

See thread from 20140912112246.ga4...@alap3.anarazel.de for details.
---
 src/backend/access/transam/xlog.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3c9aeae..2ab9da5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6871,6 +6871,16 @@ StartupXLOG(void)
   ShutdownWalRcv();

  /*
+  * Reset unlogged relations to the contents of their INIT fork. This
+  * is done AFTER recovery is complete so as to include any unlogged
+  * relations created during recovery, but BEFORE recovery is marked
+  * as having completed successfully (because this step can fail,
+  * e.g. due to ENOSPC).
+  */
+ if (InRecovery)
+  ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+
+ /*
   * We don't need the latch anymore. It's not strictly necessary to disown
   * it, but let's do it for the sake of tidiness.
   */
@@ -7138,14 +7148,6 @@ StartupXLOG(void)
  PreallocXlogFiles(EndOfLog);

  /*
-  * Reset initial contents of unlogged relations. This has to be done
-  * AFTER recovery is complete so that any unlogged relations created
-  * during recovery also get picked up.
-  */
- if (InRecovery)
-  ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
-
- /*
   * Okay, we're officially UP.
   */
  InRecovery = false;
--
1.9.1

From 944e7c4e27fca5589b8a103f7f470df23a5161c2 Mon Sep 17 00:00:00 2001
From: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date: Wed, 24 Sep 2014 16:01:37 +0530
Subject: Make ResetUnloggedRelations(…_INIT) fsync newly-created main forks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

During UNLOGGED_RELATION_INIT, after we have copied ${x}_init to $x,
we issue fsync()s for the newly-created files. We depend on their
existence, and a checkpoint isn't going to fsync them for us during
recovery.

See thread from 20140912112246.ga4...@alap3.anarazel.de for details.
---
 src/backend/storage/file/reinit.c | 38 ++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/src/backend/storage/file/reinit.c b/src/backend/storage/file/reinit.c
index 3229f41..4ad5987 100644
--- a/src/backend/storage/file/reinit.c
+++ b/src/backend/storage/file/reinit.c
@@ -339,6 +339,44 @@ ResetUnloggedRelationsInDbspaceDir(const char *dbspacedirname, int op)
  }

  FreeDir(dbspace_dir);
+
+ /*
+  * copy_file() above has already called pg_flush_data() on the
+  * files it created. Now we need to fsync those files, because
+  * a checkpoint won't do it for us while we're in recovery. We
+  * do this in a separate pass to allow the kernel to perform
+  * all the flushes at once.
+  */
+
+ dbspace_dir = AllocateDir(dbspacedirname);
+ while ((de = ReadDir(dbspace_dir, dbspacedirname)) != NULL)
+ {
+  ForkNumber forkNum;
+  int oidchars;
+  char oidbuf[OIDCHARS + 1];
+  char mainpath[MAXPGPATH];
+
+  /* Skip
[HACKERS] Missing FIN_CRC32 calls in logical replication code
replication/slot.c and replication/logical/snapbuild.c use a CRC on the physical slot and snapshot files. They use the same algorithm as is used e.g. for the WAL. However, they are not doing the finalization step, FIN_CRC32(), on the calculated checksums. Not that it matters much, but it's a bit weird and inconsistent, and was probably an oversight.

- Heikki
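For context, a minimal sketch of the usual pattern with these macros from utils/pg_crc.h (the data/len variables are placeholders):

    pg_crc32 crc;

    INIT_CRC32(crc);            /* start value */
    COMP_CRC32(crc, data, len); /* accumulate over the bytes */
    FIN_CRC32(crc);             /* final step - this is what the slot/snapbuild code skips */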
Re: [HACKERS] Function array_agg(array)
2014-10-27 11:20 GMT+01:00 Ali Akbar <the.ap...@gmail.com>:
> 2014-10-27 16:15 GMT+07:00 Pavel Stehule <pavel.steh...@gmail.com>:
>> Hi
>>
>> I did some minor changes in the code:
>> * moved the tests for old or new builder style for the array sublink out of the main cycles
>> * some API simplification of the new builder - we should not create an identical API, mainly it makes no sense
>
> Minor changes in the patch:
> * remove the array_agg_finalfn_internal declaration at the top of array_userfuncs.c
> * fix the comment of makeMdArray
> * fix the return of makeMdArray
> * remove unnecessary changes to array_agg_finalfn
>
> Regards,
> --
> Ali Akbar

Super. I tested the last version and I have no objections.

1. We want this feature - it has long been on our ToDo list.
2. The proposal and design of multidimensional aggregation are clean, and nobody has objections here.
3. There is zero impact on the current implementation. For performance reasons it uses a new array-optimized aggregator - 30% faster for this purpose than the current (scalar) array aggregator.
4. The code is clean and respects the PostgreSQL coding rules.
5. There are documentation and the necessary regression tests.
6. Patching and compilation are clean, without warnings.

This patch is ready for commit.

Thank you for the patch

Regards

Pavel
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
> To solve #1, we could redesign CREATE DATABASE so that replaying the DBASE_CREATE record doesn't zap the old directory, and also doesn't copy any files. We could instead just assume that if the transaction commits, all the files have been copied and fsync'd already, like we assume that if a CREATE INDEX commits with wal_level=minimal, the underlying file was fsync'd before the commit.

Do you mean that during a recovery, we just let the database directory be and assume that it is in good shape since the transaction committed originally?

> I wonder if we should bite the bullet and start WAL-logging all the files that are copied from the template database to the new database. When the template database is small (template0 is 6.4MB currently), that wouldn't generate too much WAL. We could perhaps do that only if the template database is small, and do the checkpoints otherwise, although I wouldn't like to have subtly different behavior depending on database size like that.

For the sort of workload Tomas described above (creating a lot of databases on the fly), we may end up with a lot of WAL eventually if we do this.

Regards,

Atri
Re: [HACKERS] pg_receivexlog --status-interval add fsync feedback
On Fri, Oct 24, 2014 at 11:21 PM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> On 10/24/2014 01:24 PM, furu...@pm.nttdata.co.jp wrote:
>>>> Sorry, I'm going around in circles. But I'd like to say again, I don't think this is a good idea. It prevents asynchronous pg_receivexlog from fsyncing WAL data and sending feedback more frequently at all. They are useful, for example, when we want to monitor the write location of asynchronous pg_receivexlog in almost realtime. But if we adopt the idea, since feedback cannot be sent soon in async mode, pg_stat_replication always returns a not-up-to-date location.
>>>
>>> Why not send a message every 10 seconds when it's not sync rep? Or even after every write()? It's a tiny amount of network traffic anyway.
>>
>> I understand that sending the feedback message frequently will keep pg_stat_replication in an up-to-date state. Is there really no one who wants to fsync even in async mode? I think the people who dislike data loss will like that idea.
>
> The OS will not sit on the written but not fsync'd data forever; it will get flushed to disk in a few seconds even without the fsync. It's just that there is no guarantee on when it will hit the disk, but there are no strong guarantees in asynchronous replication anyway.

This makes me come up with the idea of adding a new GUC parameter like wal_receiver_fsync, so that a user can disable walreceiver fsyncing WAL after every write(). IOW, it can allow walreceiver to work like current pg_receivexlog.

One problem with the above idea is that the startup process can replay WAL which has not been fsync'd yet. This violates the WAL rule. To resolve this, the startup process would need to ensure that WAL to replay has already been fsync'd, and otherwise issue the fsync. This basically makes the WAL replay longer, but we can reduce the amount of I/O by walreceiver. This seems also useful even with synchronous replication if synchronous_commit is set to remote_write. Since walreceiver doesn't issue the fsync and can focus on receiving and writing WAL, the performance of replication would be improved.

>> Nevertheless, in sync or async, is returning feedback and executing fsync() just like walreceiver does such a problem?
>
> Probably it would be OK. It would increase the I/O a lot, thanks to a lot of small writes and fsyncs, which might be surprising for a tool like pg_receivexlog. One idea is to change the default behavior to be like walreceiver, and fsync() after every write. But provide an option to get the old behavior, --no-fsync.

I'm OK with this. That is, by default, pg_receivexlog issues the fsync and sends the feedback message back after every write(). But if something like --no-fsync is specified, the WAL file is fsync'd only when it's closed, and a feedback message is sent back only when the --status-interval time has elapsed.

Regards,

--
Fujii Masao
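To illustrate the proposal, usage would look something like this (--no-fsync is the suggested option name, not an existing flag):

    # proposed default: fsync and send feedback after every write()
    pg_receivexlog -D /archive/wal --status-interval=10

    # old behavior: fsync only on file close, feedback only every status interval
    pg_receivexlog -D /archive/wal --status-interval=10 --no-fsync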
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
On 10/27/2014 01:06 PM, Atri Sharma wrote:
>> To solve #1, we could redesign CREATE DATABASE so that replaying the DBASE_CREATE record doesn't zap the old directory, and also doesn't copy any files. We could instead just assume that if the transaction commits, all the files have been copied and fsync'd already, like we assume that if a CREATE INDEX commits with wal_level=minimal, the underlying file was fsync'd before the commit.
>
> Do you mean that during a recovery, we just let the database directory be and assume that it is in good shape since the transaction committed originally?

Right.

>> I wonder if we should bite the bullet and start WAL-logging all the files that are copied from the template database to the new database. When the template database is small (template0 is 6.4MB currently), that wouldn't generate too much WAL. We could perhaps do that only if the template database is small, and do the checkpoints otherwise, although I wouldn't like to have subtly different behavior depending on database size like that.
>
> For the sort of workload Tomas described above (creating a lot of databases on the fly), we may end up with a lot of WAL eventually if we do this.

I think writing 6 MB for each CREATE DATABASE would be OK. For comparison, a pg_switch_xlog() call will waste on average half a segment, i.e. 8MB. It's a lot cheaper than a checkpoint on a busy system, for sure. But of course, if the template database is larger, then you will generate more WAL.

- Heikki
[HACKERS] Master ip from hot_standby..
Hi,

I need to query the master IP from a hot standby. The pg_stat_replication view only shows the slave replication status. Is there any way to get the master IP from the standby node, apart from checking the recovery.conf file?

Thanks,
Sudalai
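One workaround on the standby, assuming superuser access and that recovery.conf lives in the data directory, is to read the file from SQL and look at the primary_conninfo line:

    SELECT pg_read_file('recovery.conf');
    -- then inspect the primary_conninfo = 'host=... port=...' setting in the result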
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
On Mon, Oct 27, 2014 at 4:44 PM, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> On 10/27/2014 01:06 PM, Atri Sharma wrote:
>> Do you mean that during a recovery, we just let the database directory be and assume that it is in good shape since the transaction committed originally?
>
> Right.

It does make sense. However, with the checkpoint after creating the files gone, the window between the creation of the files and the actual commit might be increased, increasing the possibility of a crash during that period and causing an orphan database. However, my understanding of the consequences of removing the checkpoint might be incorrect, so my fears might be wrong.

Regards,

Atri
Re: [HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions
* Peter Eisentraut (pete...@gmx.net) wrote:
> On 10/16/14 12:01 PM, Stephen Frost wrote:
>> This started out as a request for a non-superuser to be able to review the log files without needing access to the server.
>
> I think that can be done with a security-definer function.

Of course it can. We could replace the entire authorization system with security definer functions too. I don't view this as an argument against this feature, particularly as we know other systems have it, users have asked for it multiple times, and large PG deployments have had to hack around our lack of it.

Thanks,

Stephen
Re: [HACKERS] Index-only scans for GIST
On 18 August 2014 09:05, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> On 08/17/2014 07:15 PM, Anastasia Lubennikova wrote:
>> 2014-08-07 0:30 GMT+04:00 Heikki Linnakangas <hlinnakan...@vmware.com>:
>>> * I'm getting two regression failures with this (opr_sanity and join).
>>
>> The opr_sanity failure is corrected. But there is a remaining question with join. I checked the latest version of my github repo and there's no failure in the join regression test ("All 145 tests passed"). To tell the truth, I don't understand which changes could have led to this failure. Could you show me the regression diffs?
>
> Sure, here you go. It seems like a change in a plan. At a quick glance it seems harmless: the new plan is identical except that the left and right side of a join have been reversed. But I don't understand either why this patch would change that, so it needs to be investigated.
>
>>> * The regression test queries that use LIMIT are not guaranteed to always return the same rows, hence they're not very good regression test cases. I'd suggest using more restrictive WHERE clauses, so that each query only returns a handful of rows.
>>
>> Thank you for the comment; I rewrote the wrong queries. But could you explain why LIMIT queries may return different results? Does it happen because of different query optimization?
>
> Imagine that you have a table with two rows, A and B. If you run a query like "SELECT * FROM table LIMIT 1", the system can legally return either row A or B, because there's no ORDER BY. Now, admittedly you have a similar problem even without the LIMIT - the system can legally return the rows in either order - but it's less of an issue because at least you can quickly see from the diff that the result set is in fact the same, the rows are just in a different order. You could fix that by adding an ORDER BY to all test queries, but we haven't done that in the regression suite because then we would not have any test coverage for cases where you don't have an ORDER BY. As a compromise, test cases are usually written without an ORDER BY, but if e.g. the buildfarm starts failing because of differences in the result set order across platforms, then we add an ORDER BY to make it stable.
>
>>> * I think it's leaking memory, in the GiST scan context. I tested this with a variant of the regression tests:
>>>
>>> insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i, 0.05*i)), point(0.05*i, 0.05*i) FROM generate_series(0, 1000) as i;
>>> CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p);
>>> set enable_seqscan=off;
>>> set enable_bitmapscan=off;
>>> explain analyze select p from gist_tbl where p <@ box(point(0,0), point(999,999)) and length(p::text) < 10;
>>>
>>> While the final query runs, 'top' shows constantly increasing memory usage.
>>
>> I don't think it's a memory leak. After some increase, memory usage remains the same. It works similarly without using index-only scan.
>
> No, it's definitely a leak caused by the patch. Test with the attached patch, which prints out to the server log the amount of memory used by the GiST scan memory context every 1 rows. It clearly shows increasing memory usage all the way to the end of the query. It's cleaned up at the end of the query, but that's not good enough, because for a large query you might accumulate gigabytes of leaked memory until the query has finished. If you (manually) apply the same patch to git master, you'll see that the memory usage stays consistent and small.

Hi Anastasia,

Do you have time to address the issues brought up in Heikki's review? It would be good if we could get your work into PostgreSQL 9.5.

Thanks

Thom
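A minimal illustration of Heikki's point about LIMIT without ORDER BY (hypothetical table):

    CREATE TABLE t (x int);
    INSERT INTO t VALUES (1), (2);

    SELECT x FROM t LIMIT 1;            -- may legally return 1 or 2
    SELECT x FROM t ORDER BY x LIMIT 1; -- always returns 1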
Re: [HACKERS] Index scan optimization
On 26 October 2014 10:42, Haribabu Kommi wrote:
> Hi,
>
> I reviewed the index scan optimization patch; the following are my observations.
> - Patch applies cleanly.
> - Compiles without warnings.
> - All regression tests pass.
>
> There is a good performance gain with the patch in almost all scenarios.

Thanks for reviewing.

> I have a question regarding the setting of key flags matched. Only the first key was set as matched even if we have multiple index conditions. Is there any reason behind that?

Actually the keysCount can be more than one only if the key strategy is BTEqualStrategyNumber (i.e. the '=' condition), but our optimization is only for the '>' or '>=' conditions. So adding code to set the flag for all keys is of no use. Let me know if I am missing anything here.

> If any volatile function is present in the index condition, the index scan itself is not chosen. Is there any need to handle that case similarly to NULLs?

Do you mean to ask whether we should add any special handling for volatile functions? If yes, then as you said, since an index scan itself is not selected for such functions, I feel we don't need to add anything for that. Any opinion?

Thanks and Regards,
Kumar Rajeev Rastogi
Re: [HACKERS] KNN searches support for SP-GiST [GSOC'14]
On 20 August 2014 08:09, Heikki Linnakangas <hlinnakan...@vmware.com> wrote:
> On 08/20/2014 03:35 AM, Vladislav Sterzhanov wrote:
>> Hi there, pg-Hackers! Here I go with the patch which brings up the possibility to perform nearest-neighbour searches on SP-GiSTs (as of now it includes implementations for quad and kd trees). Pre-reviewed by my GSoC mentor Alexander Korotkov. A sample benchmarking script is included in the attachment (it dumps the current geonames archive and runs several searches on the (latitude, longitude) points), which demonstrates dramatic improvements against plain searches and sorting.
>
> Cool!
>
> I think this needs a high-level description in access/spgist/README on how this works. You can probably copy-paste the similar description from gist's README, because it's the same design. If I understand correctly, the support functions return distances between the nodes and the query, and the SP-GiST code uses those distances to return the rows in order. An important detail there is that the distance returned for an inner node is a lower bound of the distance of any node in that branch.
>
> A binary heap would be a better data structure than a red-black tree for the queue of nodes to visit/return. I know that GiST also uses a red-black tree for the same thing, but that's no reason not to do better here. There is a binary heap implementation in the backend too (see src/backend/lib/binaryheap.c), so it's not any more difficult to use.
>
> Does the code handle ties correctly? The distance function is just an approximation, so it's possible that there are two items with the same distance, but that are not equal according to the ordering operator. As an extreme example, if you hack the distance function to always return 0.0, do you still get correct results (albeit slowly)?
>
> The new suppLen field in spgConfigOut is not mentioned in the docs. Why not support pass-by-value supplementary values?
>
> A couple of quick comments on the code:
>
> @@ -137,8 +138,8 @@ spg_quad_picksplit(PG_FUNCTION_ARGS)
>  {
>   spgPickSplitIn *in = (spgPickSplitIn *) PG_GETARG_POINTER(0);
>   spgPickSplitOut *out = (spgPickSplitOut *) PG_GETARG_POINTER(1);
> - int i;
> - Point *centroid;
> + int i;
> + Point *centroid;
>
>  #ifdef USE_MEDIAN
>   /* Use the median values of x and y as the centroid point */
>
> Please remove all unnecessary whitespace changes like this.
>
> @@ -213,14 +215,19 @@ spg_quad_inner_consistent(PG_FUNCTION_ARGS)
>  }
>
>   Assert(in->nNodes == 4);
> +
> + if (in->norderbys > 0)
> + {
> +  out->distances = palloc(in->nNodes * sizeof (double *));
> + }
>
> I think that should be sizeof(double).
>
> + *distances = (double *) malloc(norderbys * sizeof (double *));
>
> No mallocs allowed in the backend; that should be palloc.

Hi Vlad,

I've been revisiting the status of the GSoC projects for this year and saw this had stalled. Do you have time to address the remaining issues so we can get your work committed?

Thanks

Thom
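For reference, a sketch of the backend binary-heap API Heikki refers to (lib/binaryheap.h); the QueueItem struct and comparator are illustrative, not part of the patch:

    #include "postgres.h"
    #include "lib/binaryheap.h"

    typedef struct QueueItem
    {
        double  distance;   /* lower-bound distance for this node/tuple */
        void   *payload;
    } QueueItem;

    /*
     * binaryheap keeps the element that compares *largest* at the root,
     * so return positive when a has the smaller distance to get a
     * nearest-first queue.
     */
    static int
    queue_cmp(Datum a, Datum b, void *arg)
    {
        QueueItem  *ia = (QueueItem *) DatumGetPointer(a);
        QueueItem  *ib = (QueueItem *) DatumGetPointer(b);

        if (ia->distance < ib->distance)
            return 1;
        if (ia->distance > ib->distance)
            return -1;
        return 0;
    }

    /* usage sketch */
    binaryheap *queue = binaryheap_allocate(16, queue_cmp, NULL);
    binaryheap_add(queue, PointerGetDatum(item));
    QueueItem *next = (QueueItem *) DatumGetPointer(binaryheap_remove_first(queue));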
Re: [HACKERS] On partitioning
On 2014-10-27 06:29:33 -0300, Alvaro Herrera wrote:
> Amit Langote wrote:
>> FWIW, I think Robert's criticism regarding not basing this on the inheritance scheme was not responded to.
>
> It was responded to by ignoring it. I didn't see anybody else supporting the idea that inheritance is in any way a sane thing to base partitioning on. Sure, we have accumulated lots of kludges over the years to cope with the fact that, really, it doesn't work very well. So what. We can keep them, I don't care.

As far as I understood Robert's criticism, it was more about the internals than about the userland representation. To me it's absolutely clear that the 'real partitioning' userland shouldn't be based on the current hacks that allow it. But I do think that a first step very well might reuse the planner/executor smarts about it. Even a good chunk of the tablecmds.c logic might be reusable for individual partitions without much change.

Greetings,

Andres Freund

--
Andres Freund                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
[HACKERS] Dynamically change Master(recovery info) without restarting standby server..
Hi,

Is there any way to change the *master* without restarting the *standby* server? The PostgreSQL documentation says:

-- The recovery.conf file is only read on startup in standby mode. In that file we specify the master IP.
-- If you want to change the master, you need to change the recovery.conf file and restart the standby node.

But *I need to change the master IP without restarting the standby node.*

- sudalai

-- View this message in context: http://postgresql.1045698.n5.nabble.com/Dynamically-change-Master-recovery-info-without-restarting-standby-server-tp5824423.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TODO : Allow parallel cores to be used by vacuumdb [ WIP ]
On Sat, Oct 25, 2014 at 5:52 PM, Amit Kapila amit.kapil...@gmail.com wrote:

*** *** 358,363 handle_sigint(SIGNAL_ARGS) --- 358,364 /* Send QueryCancel if we are processing a database query */ if (cancelConn != NULL) { + inAbort = true; if (PQcancel(cancelConn, errbuf, sizeof(errbuf))) fprintf(stderr, _(Cancel request sent\n)); else

Do we need to set the inAbort flag in case PQcancel is successful? Basically, if PQcancel fails for any reason, I think the behaviour can be undefined, as the executing thread can assume that the cancel is done.

*** 391,396 consoleHandler(DWORD dwCtrlType) --- 392,399 EnterCriticalSection (cancelConnLock); if (cancelConn != NULL) { + inAbort = true; +

You have set this flag in the Windows handler, however it is never used in the Windows case; are you expecting any use of this flag on Windows?

Going further with verification of this patch, I found the below issue. Run the testcase.sql file at the below link: http://www.postgresql.org/message-id/4205e661176a124faf891e0a6ba9135266347...@szxeml509-mbs.china.huawei.com

./vacuumdb --analyze-in-stages -j 8 -d postgres
Generating minimal optimizer statistics (1 target)
Segmentation fault

Server Log:
ERROR: syntax error at or near minimal at character 12
STATEMENT: ANALYZE ng minimal optimizer statistics (1 target)
LOG: could not receive data from client: Connection reset by peer

Fixed the below issues and attached an updated patch with this mail:

1. make check for docs gives the below errors:

{ \ echo !ENTITY version \9.5devel\; \ echo !ENTITY majorversion \9.5\; \ } version.sgml '/usr/bin/perl' ./mk_feature_tables.pl YES ../../../src/backend/catalog/sql_feature_packages.txt ../../../src/backend/catalog/sql_features.txt features-supported.sgml '/usr/bin/perl' ./mk_feature_tables.pl NO ../../../src/backend/catalog/sql_feature_packages.txt ../../../src/backend/catalog/sql_features.txt features-unsupported.sgml '/usr/bin/perl' ./generate-errcodes-table.pl ../../../src/backend/utils/errcodes.txt errcodes-table.sgml onsgmls -wall -wno-unused-param -wno-empty -wfully-tagged -D . -D . -s postgres.sgml
onsgmls:ref/vacuumdb.sgml:224:15:E: end tag for LISTITEM omitted, but OMITTAG NO was specified
onsgmls:ref/vacuumdb.sgml:209:8: start tag was here
onsgmls:ref/vacuumdb.sgml:224:15:E: end tag for VARLISTENTRY omitted, but OMITTAG NO was specified
onsgmls:ref/vacuumdb.sgml:206:5: start tag was here
onsgmls:ref/vacuumdb.sgml:224:15:E: end tag for VARIABLELIST omitted, but OMITTAG NO was specified
onsgmls:ref/vacuumdb.sgml:79:4: start tag was here
onsgmls:ref/vacuumdb.sgml:225:18:E: end tag for element LISTITEM which is not open
onsgmls:ref/vacuumdb.sgml:226:21:E: end tag for element VARLISTENTRY which is not open
onsgmls:ref/vacuumdb.sgml:228:18:E: document type does not allow element VARLISTENTRY here; assuming missing VARIABLELIST start-tag
onsgmls:ref/vacuumdb.sgml:260:9:E: end tag for element PARA which is not open
make: *** [check] Error 1

Fixed.

2. The below multi-line comment is wrong:

/* Otherwise, we got a stage from vacuum_all_databases(), so run * only that one. */

Fixed.

3. fprintf(stderr, _(%s: vacuuming of database \%s\ failed: %s), progname, dbname, PQerrorMessage(conn)); — the indentation of the fprintf is not proper.

Fixed.

4. /* This can only happen if user has sent the cacle request using * Ctrl+C, Cancle is handled by 0th slot, so fetch the error result */ — the spelling of cancel is wrong and the multi-line comment is not proper.

Fixed.

5. /* This can only happen if user has sent the cacle request using * Ctrl+C, Cancle is handled by 0th slot, so fetch the error result */ GetQueryResult(pSlot[0].connection, dbname, progname, completedb); — the indentation of the completedb parameter is wrong.

Fixed.

6. /* * vacuum_parallel * This function will open the multiple concurrent connections as * suggested by used, it will derive the table list using server call * if table list is not given by user and perform the vacuum call */ — s/used/user

Fixed.

In general, I think you can go through all the comments and code once to see if similar issues exist at other places as well.

I have done some performance tests with the patch, data for which is as below:

Performance Data
--
IBM POWER-7, 16 cores, 64 hardware threads
RAM = 64GB
max_connections = 128
checkpoint_segments = 256
checkpoint_timeout = 15min
shared_buffers = 1GB

Before each test, run the testcase.sql file at the below link: http://www.postgresql.org/message-id/4205e661176a124faf891e0a6ba9135266347...@szxeml509-mbs.china.huawei.com

Un-patched -
time ./vacuumdb -t t1 -t t2 -t t3 -t t4 -t t5 -t t6 -t t7 -t t8 -d postgres
real 0m2.454s user 0m0.002s sys 0m0.006s

Patched -
time ./vacuumdb -t t1 -t t2 -t t3 -t t4 -t t5 -t t6 -t t7 -t t8 -j 4 -d postgres
real 0m1.691s user 0m0.001s sys 0m0.004s

time ./vacuumdb -t t1 -t t2 -t t3 -t t4 -t t5 -t t6 -t t7 -t t8 -j 8 -d postgres
real 0m1.496s user
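To pin down the first review point, a minimal sketch of the handler pattern under discussion follows; PQcancel() is the real libpq call, while the inAbort flag and the surrounding code are illustrative of the quoted patch rather than its exact text. The flag is set before PQcancel() so that the worker threads treat the run as aborted even if sending the cancel request fails.

/*
 * Sketch of the SIGINT handler pattern from the quoted patch.
 */
#include <signal.h>
#include <stdio.h>
#include "libpq-fe.h"

static PGcancel *volatile cancelConn = NULL;
static volatile sig_atomic_t inAbort = 0;

static void
handle_sigint(int postgres_signal_arg)
{
    char        errbuf[256];

    /* Send QueryCancel if we are processing a database query */
    if (cancelConn != NULL)
    {
        /*
         * Set the flag first: even if PQcancel() fails, workers must not
         * assume the cancel succeeded -- the point raised in the review.
         */
        inAbort = 1;
        if (PQcancel(cancelConn, errbuf, sizeof(errbuf)))
            fprintf(stderr, "Cancel request sent\n");
        else
            fprintf(stderr, "Could not send cancel request: %s", errbuf);
    }
}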
Re: [HACKERS] Missing FIN_CRC32 calls in logical replication code
On 2014-10-27 12:51:44 +0200, Heikki Linnakangas wrote:

replication/slot.c and replication/logical/snapbuild.c use a CRC on the physical slot and snapshot files. It uses the same algorithm as used e.g. for the WAL. However, they are not doing the finalization step, FIN_CRC32(), on the calculated checksums. Not that it matters much, but it's a bit weird and inconsistent, and was probably an oversight.

Hm. Good catch - that's stupid. I wonder what to do about it. I'm tempted to just add a comment about it to 9.4 and fix it on master, as changing it is essentially a catversion bump. Any objections to that?

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
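For reference, a sketch of the conventional pattern, using the pre-9.5 macro names from utils/pg_crc.h: the reason the omission is merely inconsistent rather than broken is that FIN_CRC32() is applied (or skipped) symmetrically on both the write and the verify side.

/*
 * Sketch of checksumming a state file the conventional way; the helper
 * name is illustrative, the macros are the real utils/pg_crc.h API.
 */
#include "postgres.h"
#include "utils/pg_crc.h"

static pg_crc32
checksum_state_file(const char *buf, Size len)
{
    pg_crc32    crc;

    INIT_CRC32(crc);
    COMP_CRC32(crc, buf, len);
    FIN_CRC32(crc);             /* the step slot.c/snapbuild.c omit */

    return crc;
}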
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
On 27 October 2014 at 10:47, Heikki Linnakangas wrote:

On 10/26/2014 11:47 PM, Tomas Vondra wrote:

After eyeballing the code for an hour or two, I think CREATE DATABASE should be fine with performing only a 'partial checkpoint' on the template database - calling FlushDatabaseBuffers and processing unlink requests, as suggested by the comment in createdb().

Hmm. You could replace the first checkpoint with that, but I don't think that's enough for the second. To get any significant performance benefit, you need to get rid of both checkpoints, because doing two checkpoints one after another is almost as fast as doing a single checkpoint; the second checkpoint has very little work to do because the first checkpoint already flushed out everything.

Yes, that's why I wrote that two checkpoints not too far apart are effectively a single checkpoint. OTOH if the template database is not small, it may take a while to copy the data, increasing the distance between the checkpoints. While our template databases are small (on the order of ~10-100MB), there are probably people using this to clone much larger databases.

The second checkpoint, after copying but before commit, is done because (from the comments in the createdb function):

* #1: When PITR is off, we don't XLOG the contents of newly created * indexes; therefore the drop-and-recreate-whole-directory behavior * of DBASE_CREATE replay would lose such indexes.

* #2: Since we have to recopy the source database during DBASE_CREATE * replay, we run the risk of copying changes in it that were * committed after the original CREATE DATABASE command but before the * system crash that led to the replay. This is at least unexpected * and at worst could lead to inconsistencies, eg duplicate table * names.

Doing only FlushDatabaseBuffers would not prevent these issues - you need a full checkpoint. These issues are better explained here: http://www.postgresql.org/message-id/28884.1119727...@sss.pgh.pa.us

To solve #1, we could redesign CREATE DATABASE so that replaying the DBASE_CREATE record doesn't zap the old directory, and also doesn't copy any files. We could instead just assume that if the transaction commits, all the files have been copied and fsync'd already, like we assume that if a CREATE INDEX commits with wal_level=minimal, the underlying file was fsync'd before the commit. That would also solve #2 when doing crash recovery. But it would remain when doing archive recovery. I guess we could still redo the copy when in archive recovery mode. I believe it would be the first time we have a WAL record that's replayed differently in crash recovery than in archive recovery, so here be dragons...

Yeah ... It's not exactly a trivial change, but it does not seem frighteningly difficult coding either. The templates are usually static, so this would minimize both the CREATE DATABASE duration and the disruption to the cluster it causes.

I wonder if we should bite the bullet and start WAL-logging all the files that are copied from the template database to the new database. When the template database is small (template0 is 6.4MB currently), that wouldn't generate too much WAL. We could perhaps do that only if the template database is small, and do the checkpoints otherwise, although I wouldn't like to have subtly different behavior depending on database size like that.

IMHO writing all the data into WAL would be the cleanest solution. Also, what is a small database?

I don't think a static value will work, because the sweet spot between the current approach (forcing two checkpoints) and writing everything into WAL depends on the amount of dirty buffers that need to be checkpointed. Which is mostly driven by the size of shared buffers and write activity - for small shared buffers and/or a mostly-read workload, checkpoints are cheap, so the 'small database' threshold (when choosing the WAL approach) is much higher than for large shared buffers or write-heavy workloads.

So maybe if we could determine the amount of data to be checkpointed, and then base the decision on that, that'd work better? This would also have to take into account that writing into WAL is sequential, while checkpoints usually cause random writes all over the data files (which is more expensive).

Another option might be forcing just a spread checkpoint, not an immediate one (which is what we do now). That would not fix the CREATE DATABASE duration (actually, it would make it longer), but it would lower the impact on other activity on the machine.

regards Tomas

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
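Purely as an illustration of the heuristic floated above, the decision might take a shape like the following; none of these functions exist in the backend, and count_dirty_buffers() and the weighting factor are assumptions, not a proposal of concrete values.

/*
 * Illustrative-only sketch: prefer WAL-logging the template copy when the
 * template is small relative to the work a forced checkpoint would do.
 */
extern Size count_dirty_buffers(void);  /* hypothetical */

static bool
createdb_should_wal_log(Size template_bytes)
{
    /* rough cost of an immediate checkpoint: dirty buffers, random I/O */
    Size        checkpoint_bytes = count_dirty_buffers() * BLCKSZ;

    /*
     * Copying the template through WAL is sequential I/O, so weight it as
     * cheaper than the checkpoint's random writes (the factor is made up).
     */
    return template_bytes < checkpoint_bytes * 4;
}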
Re: [HACKERS] BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)
On Fri, Oct 24, 2014 at 10:05 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:

On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:

At least for master, we should consider changing the way the archiving works so that we only archive WAL that was generated in the same server. I.e. we should never try to archive WAL files belonging to another timeline.

I just remembered that we discussed a different problem related to this some time ago, at http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyot...@lab.ntt.co.jp. The conclusion of that was that at promotion, we should not archive the last, partial segment from the old timeline.

So, this is what I came up with for master. Does anyone see a problem with it?

What about the problem that I raised upthread? That is, the patch prevents the last, partial WAL file of the old timeline from being archived. So we can never PITR the database to a point contained only in that last, partial WAL file. Isn't this a problem? For example, please imagine the following scenario.

1. Important data was deleted, but no one noticed. This deletion was logged in the last, partial WAL file.
2. The server crashed and the DBA started an archive recovery from an old backup.
3. After recovery, all WAL files of the old timeline were recycled.
4. Finally the DBA noticed the loss of the important data and tried to do PITR to the point where the data was deleted. HOWEVER, the WAL file containing that deletion operation no longer exists.

So the DBA will never be able to recover that important data.

Regards,

-- Fujii Masao

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] alter user set local_preload_libraries.
On Tue, Oct 21, 2014 at 3:16 PM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote:

Wow. By the way, I became unable to log in at all after wrongly setting *_preload_libraries for all available users. Are there any relief measures for the situation? I think it's okay even if there's no way to log in again, but I want to know if there is any.

Yep, that's a problem. You can log in to the server even in that case by, for example, executing the following commands:

$ export PGOPTIONS=-c *_preload_libraries=
$ psql

Thank you. I see. It seems enough for me. Section 18.1.4 of the 9.4 documentation describes this: http://www.postgresql.org/docs/9.4/static/config-setting.html

18.1.4. Parameter Interaction via Shell .. On the libpq client, command-line options can be specified using the PGOPTIONS environment variable. When connecting to the server, the contents of this variable are sent to the server as if they were being executed via SQL SET at the beginning of the session.

This implicitly prevents PGC_BACKEND variables from being set by this method, but it also does not seem proper to describe this precisely...

Sorry, I probably failed to understand your point. Could you tell me what you're suggesting? Are you thinking that the description needs to be updated? How?

Regards,

-- Fujii Masao

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing lock strength of adding foreign keys
Thanks for weighing in, Noah. On Sat, Oct 25, 2014 at 2:00 PM, Noah Misch n...@leadboat.com wrote: http://www.postgresql.org/message-id/ca+tgmoy4glsxzk0tao29-ljtcuj0sl1xwcwq51xb-hfysgi...@mail.gmail.com http://www.postgresql.org/message-id/20893.1393892...@sss.pgh.pa.us http://www.postgresql.org/message-id/20140306224340.ga3551...@tornado.leadboat.com As far as triggers are concerned, the issue of skew between the transaction snapshot and what the ruleutils.c snapshots do seems to be the principal issue. Commit e5550d5fec66aa74caad1f79b79826ec64898688 changed pg_get_constraintdef() to use an MVCC snapshot rather than a current MVCC snapshot; if that change is safe, I am not aware of any reason why we couldn't change pg_get_triggerdef() similarly. pg_get_triggerdef() is fine as-is with concurrent CREATE TRIGGER. The pg_get_constraintdef() change arose to ensure a consistent result when concurrent ALTER TABLE VALIDATE CONSTRAINT mutates a constraint definition. (Reducing the lock level of DROP TRIGGER or ALTER TRIGGER, however, would create the analogous problem for pg_get_triggerdef().) Maybe so, but I'd favor changing it anyway and getting it over with. The current situation seems to have little to recommend it; moreover, it would be nice, if it's possible and safe, to weaken the lock levels for all three of those commands at the same time. Do you see any hazards for ALTER or DROP that do not exist for CREATE? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing lock strength of adding foreign keys
On Sun, Oct 26, 2014 at 9:48 PM, Andreas Karlsson andr...@proxel.se wrote: Agreed.. But I think reducing the lock level of the secondary table is much more important than doing the same for the primary table due to the case where the secondary table is an existing table which is hit by a workload of long running queries and DML while the primary is a new table which is added now. In my dream world I could add the new table without any disruption at all of queries using the secondary table, no matter the duration of the transaction adding the table (barring insertion of actual data into the primary table, which would take row locks). That would indeed be nice, but it doesn't seem very practical, because the parent needs triggers, too: if you try to delete a row from the parent, or update the key, it's got to go look at the child and see whether there are rows against the old value. Then it's got to either update those rows, or null out the value, or throw an error. Regardless of which of those things it does (which depends on the ON DELETE and ON UPDATE settings you choose), it's hard to imagine that it would be a good idea for any of those things to start happening in the middle of a transaction or statement. So, let's take what we can get. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
IMHO writing all the data into WAL would be the cleanest solution. Also, what is a small database?

I don't think a static value will work, because the sweet spot between the current approach (forcing two checkpoints) and writing everything into WAL depends on the amount of dirty buffers that need to be checkpointed. Which is mostly driven by the size of shared buffers and write activity - for small shared buffers and/or a mostly-read workload, checkpoints are cheap, so the 'small database' threshold (when choosing the WAL approach) is much higher than for large shared buffers or write-heavy workloads.

So are you proposing a heuristic based on the amount of data in shared buffers and write activity? Do you have something in mind that works for general workloads as well?

So maybe if we could determine the amount of data to be checkpointed, and then base the decision on that, that'd work better? This would also have to take into account that writing into WAL is sequential, while checkpoints usually cause random writes all over the data files (which is more expensive).

Another option might be forcing just a spread checkpoint, not an immediate one (which is what we do now). That would not fix the CREATE DATABASE duration (actually, it would make it longer), but it would lower the impact on other activity on the machine.

I believe this to be the cleanest way to reduce the amount of I/O generated. If I understand correctly, the original problem you mentioned was not the time CREATE DATABASE is taking but rather the amount of I/O each one is generating.

This also leads me to wonder whether it makes sense to explore group commits around the creation of files for a new database (for the same backend, of course). This could be opt-in, if the user knows he/she is going to create a lot of databases in the near future and is fine with a large spike in I/O in one go. Again, this might be even more broken than the current scenario, but it depends on what the user wants...

Regards, Atri

-- Regards, Atri *l'apprenant*
Re: [HACKERS] Possible problem with shm_mq spin lock
On Sat, Oct 25, 2014 at 9:12 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Haribabu Kommi kommi.harib...@gmail.com writes:

Thanks for the details. I am sorry, it is not proc_exit — it is the exit callback functions that can cause the problem. The following is the call stack where the problem can happen, if the signal handler is called after the spin lock is taken by the worker.

Breakpoint 1, 0x0072dd83 in shm_mq_detach ()
(gdb) bt
#0 0x0072dd83 in shm_mq_detach ()
#1 0x0072e7db in shm_mq_detach_callback ()
#2 0x00726d71 in dsm_detach ()
#3 0x00726c43 in dsm_backend_shutdown ()
#4 0x00727450 in shmem_exit ()
#5 0x007272fc in proc_exit_prepare ()
#6 0x00727501 in atexit_callback ()
#7 0x0030ff435da2 in exit () from /lib64/libc.so.6
#8 0x006ddaec in bgworker_quickdie ()

Or in other words, Robert broke it. This control path should absolutely not occur: the entire point of the on_exit_reset call in quickdie() is to prevent any callbacks from being executed when we get to shmem_exit(). DSM-related functions DO NOT get an exemption.

All true. However, Robert also fixed it, in commit cb9a0c7987466b130fbced01ab5d5481cf3a16df, when you complained about it previously.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
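A minimal sketch of the quickdie()-style pattern Tom refers to follows; on_exit_reset() and proc_exit() are the real backend APIs, while the handler body itself is illustrative rather than the actual quickdie() code.

/*
 * Sketch: on_exit_reset() wipes the registered on_shmem_exit /
 * before_shmem_exit callbacks (including dsm_detach callbacks), so the
 * atexit path cannot re-enter shared memory after a crash signal.
 */
#include "postgres.h"
#include "storage/ipc.h"

static void
bgworker_quickdie_sketch(SIGNAL_ARGS)
{
    /*
     * Forget all exit callbacks: shared memory may be corrupted, and a
     * callback such as shm_mq_detach() could block on a spinlock that
     * the interrupted code path still holds.
     */
    on_exit_reset();

    /* exit(2), not proc_exit(), so no callbacks run at all */
    exit(2);
}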
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
On 27 October 2014 at 13:50, Atri Sharma wrote:

IMHO writing all the data into WAL would be the cleanest solution. Also, what is a small database?

I don't think a static value will work, because the sweet spot between the current approach (forcing two checkpoints) and writing everything into WAL depends on the amount of dirty buffers that need to be checkpointed. Which is mostly driven by the size of shared buffers and write activity - for small shared buffers and/or a mostly-read workload, checkpoints are cheap, so the 'small database' threshold (when choosing the WAL approach) is much higher than for large shared buffers or write-heavy workloads.

So are you proposing a heuristic based on the amount of data in shared buffers and write activity? Do you have something in mind that works for general workloads as well?

So maybe if we could determine the amount of data to be checkpointed, and then base the decision on that, that'd work better? This would also have to take into account that writing into WAL is sequential, while checkpoints usually cause random writes all over the data files (which is more expensive).

Another option might be forcing just a spread checkpoint, not an immediate one (which is what we do now). That would not fix the CREATE DATABASE duration (actually, it would make it longer), but it would lower the impact on other activity on the machine.

I believe this to be the cleanest way to reduce the amount of I/O generated. If I understand correctly, the original problem you mentioned was not the time CREATE DATABASE is taking but rather the amount of I/O each one is generating.

Not exactly. There are two related issues, both caused by the I/O activity from the CHECKPOINT:

(a) an I/O spike, because of checkpointing everything
(b) long CREATE DATABASE durations

The spread CREATE DATABASE only fixes (a), and makes (b) a bit worse. It however makes the 'create databases in advance' approach a bit more flexible, because it allows more frequent runs of the cron job that actually creates them. Right now we're forced to schedule it rather rarely (say, once a day) to minimize the impact, which has downsides (higher probability of running out of spare dbs, slow response to an updated template, etc.). If we can fix both (a) and (b), that'd be great. If we can fix only (a), so be it.

This also leads me to wonder whether it makes sense to explore group commits around the creation of files for a new database (for the same backend, of course). This could be opt-in, if the user knows he/she is going to create a lot of databases in the near future and is fine with a large spike in I/O in one go. Again, this might be even more broken than the current scenario, but it depends on what the user wants...

That seems much more complex than what I proposed, but maybe I'm wrong. If we could get rid of the 'checkpoint everything' behavior altogether, we don't really need the group commit, no?

regards Tomas

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Master ip from hot_standby..
On Mon, Oct 27, 2014 at 8:15 PM, sudalai sudala...@gmail.com wrote: I need to query master ip from hot_standby. *pg_stat_replication* view only shows the slave replication status. Is there any way to get *Master IP* from standby node apart from checking *recovery.conf* file. That's the way to go as it is in primary_conninfo. Recovery parameters are only loaded by the startup process after reading it from recovery.conf. Note that this would be more a question for pgsql-general or pgsql-novice but surely not hackers. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [REVIEW] Re: Compression of full-page-writes
On Fri, Oct 17, 2014 at 1:52 PM, Rahila Syed rahilasye...@gmail.com wrote:

Hello, Please find the updated patch attached.

Thanks for updating the patch! Here are the comments.

The patch doesn't apply to master cleanly.

I got the following compiler warnings:

xlog.c:930: warning: ISO C90 forbids mixed declarations and code
xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code
xlogreader.c:744: warning: ISO C90 forbids mixed declarations and code

The compilation of the documentation failed with the following error message:

openjade:config.sgml:2188:12:E: end tag for element TERM which is not open
make[3]: *** [HTML.index] Error 1

Only the backend calls CompressBackupBlocksPagesAlloc when SIGHUP is sent. Why does only the backend need to do that? What about other processes which can write FPWs, e.g., autovacuum?

Do we release the buffers for compressed data when fpw is changed from compress to on?

+if (uncompressedPages == NULL)
+{
+uncompressedPages = (char *)malloc(XLR_TOTAL_BLCKSZ);
+if (uncompressedPages == NULL)
+outOfMem = 1;
+}

The memory is always (i.e., even when fpw=on) allocated to uncompressedPages, but not to compressedPages. Why? I guess that a test of fpw needs to be there.

Regards,

-- Fujii Masao

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
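To make the last review point concrete, a sketch of the allocation pattern being asked for follows. The variable names and XLR_TOTAL_BLCKSZ come from the quoted patch, but the FpwLevel enum and the guard itself are assumptions about the patch's GUC, not its actual code.

/*
 * Illustrative sketch: allocate both scratch buffers only when FPW
 * compression is enabled, and treat malloc failure as "compression
 * unavailable" rather than a hard error later in a critical section.
 */
#include <stdlib.h>

typedef enum
{
    FULL_PAGE_WRITES_OFF,
    FULL_PAGE_WRITES_ON,
    FULL_PAGE_WRITES_COMPRESS
} FpwLevel;                         /* assumed GUC values */

static FpwLevel fullPageWrites = FULL_PAGE_WRITES_ON;
static char *compressedPages = NULL;
static char *uncompressedPages = NULL;
static int outOfMem = 0;

static void
CompressBackupBlocksPagesAlloc(void)
{
    /* nothing to allocate unless FPW compression is actually enabled */
    if (fullPageWrites != FULL_PAGE_WRITES_COMPRESS)
        return;

    if (uncompressedPages == NULL)
        uncompressedPages = (char *) malloc(XLR_TOTAL_BLCKSZ);
    if (compressedPages == NULL)
        compressedPages = (char *) malloc(XLR_TOTAL_BLCKSZ);

    if (uncompressedPages == NULL || compressedPages == NULL)
        outOfMem = 1;
}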
Re: [HACKERS] Missing FIN_CRC32 calls in logical replication code
Andres Freund and...@2ndquadrant.com writes: On 2014-10-27 12:51:44 +0200, Heikki Linnakangas wrote: replication/slot.c and replication/logical/snapbuild.c use a CRC on the physical slot and snapshot files. It uses the same algorithm as used e.g. for the WAL. However, they are not doing the finalization step, FIN_CRC32() on the calculated checksums. Not that it matters much, but it's a bit weird and inconsistent, and was probably an oversight. Hm. Good catch - that's stupid. I wonder what to do about it. I'm tempted to just add a comment about it to 9.4 and fix it on master as changing it is essentially a catversion bump. Any objections to that? Yeah, I think you should get it right the first time. It hardly seems likely that any 9.4 beta testers are depending on those files to be stable yet. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Lockless StrategyGetBuffer() clock sweep
Hi, I've previously posted a patch at http://archives.postgresql.org/message-id/20141010160020.GG6670%40alap3.anarazel.de that reduces contention in StrategyGetBuffer() by making the clock sweep lockless. Robert asked me to post it to a new thread; I originally wrote it to see some other contention in more detail, that's why it ended up in that thread... The performance numbers I've quoted over there are: Test: pgbench -M prepared -P 5 -S -c 496 -j 496 -T 5000 on a scale=1000 database, with 4GB of shared buffers. Before: progress: 40.0 s, 136252.3 tps, lat 3.628 ms stddev 4.547 progress: 45.0 s, 135049.0 tps, lat 3.660 ms stddev 4.515 progress: 50.0 s, 135788.9 tps, lat 3.640 ms stddev 4.398 progress: 55.0 s, 135268.4 tps, lat 3.654 ms stddev 4.469 progress: 60.0 s, 134991.6 tps, lat 3.661 ms stddev 4.739 after: progress: 40.0 s, 207701.1 tps, lat 2.382 ms stddev 3.018 progress: 45.0 s, 208022.4 tps, lat 2.377 ms stddev 2.902 progress: 50.0 s, 209187.1 tps, lat 2.364 ms stddev 2.970 progress: 55.0 s, 206462.7 tps, lat 2.396 ms stddev 2.871 progress: 60.0 s, 210263.8 tps, lat 2.351 ms stddev 2.914 Imo the patch doesn't complicate the logic noticeably... I do wonder if we could make the freelist accesses lockless as well - but I think that's harder. So I don't want to mix that with this. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services From 6b486e5b467e94ab9297d7656a5b39b816c5c55a Mon Sep 17 00:00:00 2001 From: Andres Freund and...@anarazel.de Date: Fri, 10 Oct 2014 17:36:46 +0200 Subject: [PATCH] WIP: lockless StrategyGetBuffer hotpath --- src/backend/storage/buffer/freelist.c | 154 -- 1 file changed, 90 insertions(+), 64 deletions(-) diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c index 5966beb..0c634a0 100644 --- a/src/backend/storage/buffer/freelist.c +++ b/src/backend/storage/buffer/freelist.c @@ -18,6 +18,12 @@ #include storage/buf_internals.h #include storage/bufmgr.h +#include port/atomics.h + + +#define LATCHPTR_ACCESS_ONCE(var) ((Latch *)(*((volatile Latch **)(var +#define INT_ACCESS_ONCE(var) ((int)(*((volatile int *)(var + /* * The shared freelist control information. @@ -27,8 +33,12 @@ typedef struct /* Spinlock: protects the values below */ slock_t buffer_strategy_lock; - /* Clock sweep hand: index of next buffer to consider grabbing */ - int nextVictimBuffer; + /* + * Clock sweep hand: index of next buffer to consider grabbing. Note that + * this isn't a concrete buffer - we only ever increase the value. So, to + * get an actual buffer, it needs to be used modulo NBuffers. + */ + pg_atomic_uint32 nextVictimBuffer; int firstFreeBuffer; /* Head of list of unused buffers */ int lastFreeBuffer; /* Tail of list of unused buffers */ @@ -42,8 +52,8 @@ typedef struct * Statistics. These counters should be wide enough that they can't * overflow during a single bgwriter cycle. */ - uint32 completePasses; /* Complete cycles of the clock sweep */ - uint32 numBufferAllocs; /* Buffers allocated since last reset */ + pg_atomic_uint32 completePasses; /* Complete cycles of the clock sweep */ + pg_atomic_uint32 numBufferAllocs; /* Buffers allocated since last reset */ /* * Notification latch, or NULL if none. See StrategyNotifyBgWriter. 
@@ -124,87 +134,107 @@ StrategyGetBuffer(BufferAccessStrategy strategy) return buf; } - /* Nope, so lock the freelist */ - SpinLockAcquire(StrategyControl-buffer_strategy_lock); - /* * We count buffer allocation requests so that the bgwriter can estimate * the rate of buffer consumption. Note that buffers recycled by a * strategy object are intentionally not counted here. */ - StrategyControl-numBufferAllocs++; + pg_atomic_fetch_add_u32(StrategyControl-numBufferAllocs, 1); /* - * If bgwriterLatch is set, we need to waken the bgwriter, but we should - * not do so while holding buffer_strategy_lock; so release and re-grab. - * This is annoyingly tedious, but it happens at most once per bgwriter - * cycle, so the performance hit is minimal. + * If bgwriterLatch is set, we need to waken the bgwriter. We don't want + * to check this while holding the spinlock, so we the latch from memory + * once to see whether it's set. There's no need to do so with a lock + * present - we'll just set the latch next call if we missed it once. + * + * Since we're not guaranteed atomic 8 byte reads we need to acquire the + * spinlock if not null to be sure we get a correct pointer. Because we + * don't want to set the latch while holding the buffer_strategy_lock we + * just grab the lock to read and reset the pointer. */ - bgwriterLatch = StrategyControl-bgwriterLatch; + bgwriterLatch = LATCHPTR_ACCESS_ONCE(StrategyControl-bgwriterLatch); if (bgwriterLatch) { + /* we don't have guaranteed
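Boiled down, the hot path of the patch above rests on one idea: the clock hand is a monotonically increasing atomic counter mapped onto a buffer id with a modulo, so no spinlock is needed to advance it. A minimal sketch, with uint32 wraparound handling (which real code needs) omitted:

/*
 * Sketch of the lockless clock-sweep hand; clock_sweep_tick() is
 * illustrative shorthand, not the patch's actual function.
 */
#include "port/atomics.h"

static pg_atomic_uint32 nextVictimBuffer;   /* pg_atomic_init_u32() at startup */
static pg_atomic_uint32 completePasses;

static int
clock_sweep_tick(uint32 nbuffers)
{
    uint32      victim = pg_atomic_fetch_add_u32(&nextVictimBuffer, 1);

    /*
     * fetch_add hands each value to exactly one backend, so exactly one
     * caller sees each multiple of nbuffers and bumps the pass counter,
     * which is all the bgwriter's estimates need.
     */
    if (victim != 0 && victim % nbuffers == 0)
        pg_atomic_fetch_add_u32(&completePasses, 1);

    return (int) (victim % nbuffers);
}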
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Heikki Linnakangas hlinnakan...@vmware.com writes: On 10/27/2014 03:21 PM, Tomas Vondra wrote: Thinking about this a bit more, do we really need a full checkpoint? That is a checkpoint of all the databases in the cluster? Why checkpointing the source database is not enough? A full checkpoint ensures that you always begin recovery *after* the DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during crash recovery (except when you crash before the transaction commits, in which case it doesn't matter if the new database's directory is borked). Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log all the copied files instead of doing a cp -r-equivalent directory copy. That would fix a number of existing replay hazards as well as making it safe to do what Tomas wants. In the small scale this would cause more I/O (2 copies of the template database's data) but in production situations we might well come out ahead by avoiding a forced checkpoint of the rest of the cluster. Also I guess we could skip WAL-logging if WAL archiving is off, similarly to the existing optimization for CREATE INDEX etc. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
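As a rough sketch of what "WAL-log all the copied files" could look like for one relation fork — not a committed design — the loop would mirror what RelationCopyStorage() already does for table rewrites. log_newpage() and the smgr calls are the real backend APIs; the function itself and its error handling are illustrative.

/*
 * Sketch: copy a fork block by block, WAL-logging each page unless
 * archiving/streaming is off (the CREATE INDEX-style optimization).
 */
#include "postgres.h"
#include "access/xloginsert.h"  /* log_newpage(); access/xlog.h on old branches */
#include "storage/bufpage.h"
#include "storage/smgr.h"

static void
copy_fork_wal_logged(SMgrRelation src, SMgrRelation dst,
                     ForkNumber forkNum, bool use_wal)
{
    BlockNumber nblocks = smgrnblocks(src, forkNum);
    BlockNumber blkno;
    char       *buf = palloc(BLCKSZ);

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        smgrread(src, forkNum, blkno, buf);

        /* skip WAL when it isn't needed, as suggested above */
        if (use_wal)
            log_newpage(&dst->smgr_rnode.node, forkNum, blkno,
                        (Page) buf, false);

        smgrextend(dst, forkNum, blkno, buf, false);
    }
    pfree(buf);
}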
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
On 2014-10-27 09:46:41 -0400, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 10/27/2014 03:21 PM, Tomas Vondra wrote: Thinking about this a bit more, do we really need a full checkpoint? That is a checkpoint of all the databases in the cluster? Why checkpointing the source database is not enough? A full checkpoint ensures that you always begin recovery *after* the DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during crash recovery (except when you crash before the transaction commits, in which case it doesn't matter if the new database's directory is borked). Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log all the copied files instead of doing a cp -r-equivalent directory copy. That would fix a number of existing replay hazards as well as making it safe to do what Tomas wants. In the small scale this would cause more I/O (2 copies of the template database's data) but in production situations we might well come out ahead by avoiding a forced checkpoint of the rest of the cluster. Also I guess we could skip WAL-logging if WAL archiving is off, similarly to the existing optimization for CREATE INDEX etc. +1. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb generator functions
On 10/15/2014 03:54 PM, Andrew Dunstan wrote: I checked a code, and I have only two small objection - a name jsonb_object_two_arg is not good - maybe json_object_keys_values ? It's consistent with the existing json_object_two_arg. In all cases I think I kept the names the same except for changing json to jsonb. Note that these _two_arg functions are not visible at the SQL level - they are only visible in the C code. I'm happy to be guided by others in changing or keeping these names. Next: there are no tests for to_jsonb function. Oh, my bad. I'll add some. Thank for the review. Here is a new patch that includes documentation and addresses all these issues, except that I didn't change the name of jsonb_object_two_arg to keep it consistent with the name of json_object_two_arg. I'm happy to change both if people feel it matters. cheers andrew diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml index 7e5bcd9..21ce293 100644 --- a/doc/src/sgml/func.sgml +++ b/doc/src/sgml/func.sgml @@ -10245,9 +10245,10 @@ table2-mapping para xref linkend=functions-json-creation-table shows the functions that are - available for creating typejson/type values. - (Currently, there are no equivalent functions for typejsonb/, but you - can cast the result of one of these functions to typejsonb/.) + available for creating typejson/type and typejsonb/type values. + (There are no equivalent functions for typejsonb/, of the literalrow_to_json/ + and literalarray_to_json/ functions. However, the literalto_jsonb/ + function supplies much the same functionality as these functions would.) /para indexterm @@ -10268,6 +10269,18 @@ table2-mapping indexterm primaryjson_object/primary /indexterm + indexterm + primaryto_jsonb/primary + /indexterm + indexterm + primaryjsonb_build_array/primary + /indexterm + indexterm + primaryjsonb_build_object/primary + /indexterm + indexterm + primaryjsonb_object/primary + /indexterm table id=functions-json-creation-table titleJSON Creation Functions/title @@ -10282,17 +10295,19 @@ table2-mapping /thead tbody row + entryparaliteralto_json(anyelement)/literal + /paraparaliteralto_jsonb(anyelement)/literal + /para/entry entry - literalto_json(anyelement)/literal - /entry - entry - Returns the value as JSON. Arrays and composites are converted + Returns the value as typejson/ or typejsonb/. + Arrays and composites are converted (recursively) to arrays and objects; otherwise, if there is a cast - from the type to typejson/type, the cast function will be used to - perform the conversion; otherwise, a JSON scalar value is produced. + from the type to typejson/type or typejsonb/ in the case of + literalto_jsonb/, the cast function will be used to + perform the conversion; otherwise, a scalar value is produced. For any scalar type other than a number, a Boolean, or a null value, - the text representation will be used, properly quoted and escaped - so that it is a valid JSON string. + the text representation will be used, in such a fashion that it is a + valid typejson/ or typejsonb/ value. /entry entryliteralto_json('Fred said Hi.'::text)/literal/entry entryliteralFred said \Hi.\/literal/entry @@ -10321,9 +10336,9 @@ table2-mapping entryliteral{f1:1,f2:foo}/literal/entry /row row - entry - literaljson_build_array(VARIADIC any)/literal - /entry + entryparaliteraljson_build_array(VARIADIC any)/literal + /paraparaliteraljsonb_build_array(VARIADIC any)/literal + /para/entry entry Builds a possibly-heterogeneously-typed JSON array out of a variadic argument list. 
@@ -10332,9 +10347,9 @@ table2-mapping entryliteral[1, 2, 3, 4, 5]/literal/entry /row row - entry - literaljson_build_object(VARIADIC any)/literal - /entry + entryparaliteraljson_build_object(VARIADIC any)/literal + /paraparaliteraljsonb_build_object(VARIADIC any)/literal + /para/entry entry Builds a JSON object out of a variadic argument list. By convention, the argument list consists of alternating @@ -10344,9 +10359,9 @@ table2-mapping entryliteral{foo: 1, bar: 2}/literal/entry /row row - entry - literaljson_object(text[])/literal - /entry + entryparaliteraljson_object(text[])/literal + /paraparaliteraljsonb_object(text[])/literal + /para/entry entry Builds a JSON object out of a text array. The array must have either exactly one dimension with an even number of members, in which case @@ -10359,9 +10374,9 @@ table2-mapping entryliteral{a: 1, b: def, c:
Re: [HACKERS] Review of GetUserId() Usage
* Stephen Frost (sfr...@snowman.net) wrote: Attached is a patch to address the pg_cancel/terminate_backend and the statistics info as discussed previously. It sounds like we're coming to And I forgot the attachment, of course. Apologies. Thanks, Stephen diff --git a/src/backend/utils/adt/misc.c b/src/backend/utils/adt/misc.c new file mode 100644 index 67539ec..6924fb7 *** a/src/backend/utils/adt/misc.c --- b/src/backend/utils/adt/misc.c *** *** 37,42 --- 37,43 #include utils/lsyscache.h #include utils/ruleutils.h #include tcop/tcopprot.h + #include utils/acl.h #include utils/builtins.h #include utils/timestamp.h *** pg_signal_backend(int pid, int sig) *** 113,121 return SIGNAL_BACKEND_ERROR; } ! if (!(superuser() || proc-roleId == GetUserId())) return SIGNAL_BACKEND_NOPERMISSION; /* * Can the process we just validated above end, followed by the pid being * recycled for a new process, before reaching here? Then we'd be trying --- 114,127 return SIGNAL_BACKEND_ERROR; } ! /* Only allow superusers to signal superuser-owned backends. */ ! if (superuser_arg(proc-roleId) !superuser()) return SIGNAL_BACKEND_NOPERMISSION; + /* Users can signal their own backends (including through membership) */ + if (!has_privs_of_role(GetUserId(), proc-roleId)) + return SIGNAL_BACKEND_NOPERMISSION; + /* * Can the process we just validated above end, followed by the pid being * recycled for a new process, before reaching here? Then we'd be trying diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c new file mode 100644 index d621a68..3663e93 *** a/src/backend/utils/adt/pgstatfuncs.c --- b/src/backend/utils/adt/pgstatfuncs.c *** *** 20,25 --- 20,26 #include libpq/ip.h #include miscadmin.h #include pgstat.h + #include utils/acl.h #include utils/builtins.h #include utils/inet.h #include utils/timestamp.h *** pg_stat_get_activity(PG_FUNCTION_ARGS) *** 674,681 else nulls[15] = true; ! /* Values only available to same user or superuser */ ! if (superuser() || beentry-st_userid == GetUserId()) { SockAddr zero_clientaddr; --- 675,682 else nulls[15] = true; ! /* Values only available to role member */ ! if (has_privs_of_role(GetUserId(), beentry-st_userid)) { SockAddr zero_clientaddr; *** pg_stat_get_backend_activity(PG_FUNCTION *** 877,883 if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL) activity = backend information not available; ! else if (!superuser() beentry-st_userid != GetUserId()) activity = insufficient privilege; else if (*(beentry-st_activity) == '\0') activity = command string not enabled; --- 878,884 if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL) activity = backend information not available; ! else if (!has_privs_of_role(GetUserId(), beentry-st_userid)) activity = insufficient privilege; else if (*(beentry-st_activity) == '\0') activity = command string not enabled; *** pg_stat_get_backend_waiting(PG_FUNCTION_ *** 898,904 if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL) PG_RETURN_NULL(); ! if (!superuser() beentry-st_userid != GetUserId()) PG_RETURN_NULL(); result = beentry-st_waiting; --- 899,905 if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL) PG_RETURN_NULL(); ! if (!has_privs_of_role(GetUserId(), beentry-st_userid)) PG_RETURN_NULL(); result = beentry-st_waiting; *** pg_stat_get_backend_activity_start(PG_FU *** 917,923 if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL) PG_RETURN_NULL(); ! 
if (!superuser() beentry-st_userid != GetUserId()) PG_RETURN_NULL(); result = beentry-st_activity_start_timestamp; --- 918,924 if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL) PG_RETURN_NULL(); ! if (!has_privs_of_role(GetUserId(), beentry-st_userid)) PG_RETURN_NULL(); result = beentry-st_activity_start_timestamp; *** pg_stat_get_backend_xact_start(PG_FUNCTI *** 943,949 if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL) PG_RETURN_NULL(); ! if (!superuser() beentry-st_userid != GetUserId()) PG_RETURN_NULL(); result = beentry-st_xact_start_timestamp; --- 944,950 if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL) PG_RETURN_NULL(); ! if (!has_privs_of_role(GetUserId(), beentry-st_userid)) PG_RETURN_NULL(); result = beentry-st_xact_start_timestamp; *** pg_stat_get_backend_start(PG_FUNCTION_AR *** 965,971 if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL) PG_RETURN_NULL(); ! if (!superuser() beentry-st_userid != GetUserId()) PG_RETURN_NULL(); result = beentry-st_proc_start_timestamp;
Re: [HACKERS] Review of GetUserId() Usage
All, * Peter Eisentraut (pete...@gmx.net) wrote: It would be weird if it were inconsistent: some things require role membership, some things require SET ROLE. Try explaining that. Attached is a patch to address the pg_cancel/terminate_backend and the statistics info as discussed previously. It sounds like we're coming to consensus that this is the correct approach. Comments are always welcome, of course, but it's a pretty minor patch and I'd like to move forward with it soon against master. We'll rebase the role attributes patch with these changes and update that patch for review (with a few other changes) after. Thanks! Stephen signature.asc Description: Digital signature
[HACKERS] cost_index()
Hi!

Some fragment of code (src/backend/optimizer/path/costsize.c, lineno ~400):

/*
 * Normal case: apply the Mackert and Lohman formula, and then
 * interpolate between that and the correlation-derived result.
 */
pages_fetched = index_pages_fetched(tuples_fetched, baserel->pages, (double) index->pages, root);

if (indexonly)
    pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));

As I understand the code, index_pages_fetched() returns a total of the pages read for index and heap together. But on the next line this is recalculated with the all-visible fraction of the heap. After a recent vacuum that could be 1.0, and then pages_fetched will be zero. That seems obviously wrong to me, because for an index-only scan that could be true only for the heap, not for the index pages. Am I wrong, or am I missing something?

-- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] cost_index()
Teodor Sigaev teo...@sigaev.ru writes:

if (indexonly)
    pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));

As I understand the code, index_pages_fetched() returns a total of the pages read for index and heap together.

No. Costs for touching the index were computed by the amcostestimate function; this code is solely about estimating costs for touching the heap.

But on the next line this is recalculated with the all-visible fraction of the heap. After a recent vacuum that could be 1.0, and then pages_fetched will be zero. That seems obviously wrong to me, because for an index-only scan that could be true only for the heap, not for the index pages.

Seems correct to me: if it's an index-only scan, and all pages are all-visible, then indeed no heap pages will be fetched. (Note that the cost model doesn't charge anything for touching the visibility map. This might be too simplistic, but certainly that cost should be much smaller than touching the heap, since the map is small enough that it should stay resident in shared buffers.)

regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
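A worked illustration of Tom's point, with made-up numbers; the helper name is hypothetical, the arithmetic is exactly the quoted costsize.c line:

/*
 * Scale the heap-page estimate by the not-all-visible fraction; the
 * index pages were already charged by amcostestimate.
 */
#include <math.h>
#include <stdbool.h>

static double
heap_pages_for_scan(double pages_fetched, double allvisfrac, bool indexonly)
{
    if (indexonly)
        pages_fetched = ceil(pages_fetched * (1.0 - allvisfrac));
    return pages_fetched;
}

/*
 * heap_pages_for_scan(100.0, 0.90, true) == 10.0: with 90% of the heap
 * all-visible, an index-only scan visits only ~10 of the ~100 heap pages
 * Mackert-Lohman predicted, and allvisfrac == 1.0 correctly yields zero
 * heap fetches.
 */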
Re: [HACKERS] superuser() shortcuts
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote:

Brightwell, Adam wrote:

If we were to make it consistent and use the old wording, what do you think about providing an errhint as well? Perhaps, for example, in slotfuncs.c#pg_create_physical_replication_slot:

errmsg - permission denied to create physical replication slot
errhint - You must be superuser or replication role to use replication slots.

Sure. Sounds reasonable to me and looks to be in line with wording elsewhere.

As I started looking at this, there are multiple other places where these types of error messages occur (opclasscmds.c, user.c, postinit.c, miscinit.c are just a few), not just around the changes in this patch. If we change them in one place, wouldn't it be best to change them in the rest? If that is the case, I'm afraid that might distract from the purpose of this patch. Perhaps, if we want to change them, then that should be submitted as a separate patch?

Yeah. I'm just saying that maybe this patch should adopt whatever wording we agree to, not that we need to change other places. On the other hand, since so many other places have adopted the different wording, maybe there's a reason for it, and if so, does anybody know what it is. But I have to say that it does look inconsistent to me.

I agree that this patch should adopt the wording agreed to above and that it's 'more similar' to most of the usage in the backend.

As for the general question, I'd suggest we add to the error message policy that more-or-less anything with errcode(ERRCODE_INSUFFICIENT_PRIVILEGE) have errmsg(permission denied to X), with any comments about 'must be a superuser' or similar relegated to errhint(). We could add that as a TODO and have more junior developers review these cases and come up with patches (or argue for cases where they feel deviation is warranted).

Either way, that larger rework isn't for this patch, and since you seem happy with it (modulo the change above), I'm going to make that change, do my own review, and move forward with it unless someone else wants to speak up.

Thanks!

Stephen

signature.asc Description: Digital signature
Re: [HACKERS] Typo fixes for pg_recvlogical documentation
On Fri, Oct 24, 2014 at 5:14 PM, Michael Paquier michael.paqu...@gmail.com wrote:

On Fri, Oct 24, 2014 at 7:42 PM, Robert Haas robertmh...@gmail.com wrote:

[rhaas pgsql]$ patch -p1 < ~/Downloads/20141023_pg_recvlogical_fixes.patch
patching file doc/src/sgml/ref/pg_recvlogical.sgml
Hunk #1 succeeded at 270 with fuzz 1 (offset 165 lines).
Hunk #2 FAILED at 282.
Hunk #3 FAILED at 295.
Hunk #4 FAILED at 308.
3 out of 4 hunks FAILED -- saving rejects to file doc/src/sgml/ref/pg_recvlogical.sgml.rej

Sorry, I did not rebase much these days, there were conflicts with 52c1ae2...

Committed and back-patched to 9.4.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Typo fixes for pg_recvlogical documentation
On Tue, Oct 28, 2014 at 12:12 AM, Robert Haas robertmh...@gmail.com wrote: Committed and back-patched to 9.4. Thanks. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] cost_index()
No. Costs for touching the index were computed by the amcostestimate function; this code is solely about estimating costs for touching the heap. I see. Thank you. -- Teodor Sigaev E-mail: teo...@sigaev.ru WWW: http://www.sigaev.ru/ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] superuser() shortcuts
* Alvaro Herrera (alvhe...@2ndquadrant.com) wrote: As I started looking at this, there are multiple other places where these types of error messages occur (opclasscmds.c, user.c, postinit.c, miscinit.c are just a few), not just around the changes in this patch. If we change them in one place, wouldn't it be best to change them in the rest? If that is the case, I'm afraid that might distract from the purpose of this patch. Perhaps, if we want to change them, then that should be submitted as a separate patch? Yeah. I'm just saying that maybe this patch should adopt whatever wording we agree to, not that we need to change other places. On the other hand, since so many other places have adopted the different wording, maybe there's a reason for it and if so, does anybody know what it is. But I have to say that it does look inconsistent to me. Updated patch attached. Comments welcome. Thanks! Stephen diff --git a/contrib/test_decoding/expected/permissions.out b/contrib/test_decoding/expected/permissions.out new file mode 100644 index 212fd1d..f011955 *** a/contrib/test_decoding/expected/permissions.out --- b/contrib/test_decoding/expected/permissions.out *** RESET ROLE; *** 54,66 -- plain user *can't* can control replication SET ROLE lr_normal; SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding'); ! ERROR: must be superuser or replication role to use replication slots INSERT INTO lr_test VALUES('lr_superuser_init'); ERROR: permission denied for relation lr_test SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); ! ERROR: must be superuser or replication role to use replication slots SELECT pg_drop_replication_slot('regression_slot'); ! ERROR: must be superuser or replication role to use replication slots RESET ROLE; -- replication users can drop superuser created slots SET ROLE lr_superuser; --- 54,69 -- plain user *can't* can control replication SET ROLE lr_normal; SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 'test_decoding'); ! ERROR: permission denied to use replication slots ! HINT: You must be superuser or replication role to use replication slots. INSERT INTO lr_test VALUES('lr_superuser_init'); ERROR: permission denied for relation lr_test SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1'); ! ERROR: permission denied to use replication slots ! HINT: You must be superuser or replication role to use replication slots. SELECT pg_drop_replication_slot('regression_slot'); ! ERROR: permission denied to use replication slots ! HINT: You must be superuser or replication role to use replication slots. RESET ROLE; -- replication users can drop superuser created slots SET ROLE lr_superuser; *** SELECT 'init' FROM pg_create_logical_rep *** 90,96 RESET ROLE; SET ROLE lr_normal; SELECT pg_drop_replication_slot('regression_slot'); ! ERROR: must be superuser or replication role to use replication slots RESET ROLE; -- all users can see existing slots SET ROLE lr_superuser; --- 93,100 RESET ROLE; SET ROLE lr_normal; SELECT pg_drop_replication_slot('regression_slot'); ! ERROR: permission denied to use replication slots ! HINT: You must be superuser or replication role to use replication slots. 
RESET ROLE; -- all users can see existing slots SET ROLE lr_superuser; diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c new file mode 100644 index 133143d..a05de7e *** a/src/backend/access/transam/xlogfuncs.c --- b/src/backend/access/transam/xlogfuncs.c *** *** 27,32 --- 27,33 #include miscadmin.h #include replication/walreceiver.h #include storage/smgr.h + #include utils/acl.h #include utils/builtins.h #include utils/numeric.h #include utils/guc.h *** pg_start_backup(PG_FUNCTION_ARGS) *** 54,63 backupidstr = text_to_cstring(backupid); ! if (!superuser() !has_rolreplication(GetUserId())) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), ! errmsg(must be superuser or replication role to run a backup))); startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL); --- 55,65 backupidstr = text_to_cstring(backupid); ! if (!has_replication_privilege(GetUserId())) ereport(ERROR, (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), ! errmsg(permission denied to run a backup), ! errhint(You must be superuser or replication role to run a backup.))); startpoint = do_pg_start_backup(backupidstr, fast, NULL, NULL); *** pg_stop_backup(PG_FUNCTION_ARGS) *** 82,91 { XLogRecPtr stoppoint; ! if (!superuser() !has_rolreplication(GetUserId())) ereport(ERROR,
Re: [HACKERS] TAP test breakage on MacOS X
On Sun, Oct 26, 2014 at 12:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:

The larger issue though is that even with both the above things fixed, the TAP tests would still be an expensive no-op on the majority of buildfarm members. AFAICT, I do not own a single machine on which the current TAP tests will consent to run, and in most cases that's after going out of my way to fetch CPAN modules that aren't in the vendor Perl installs. Peter's mostly been fixing the portability issues by disabling tests, which I guess is better than no fix at all, but it leaves darn little useful functionality.

I agree, emphatically. Honestly, if we can't get these tests running everywhere with reasonable effort, we should just rip them out. We've gone to a lot of trouble in general to make sure that our source code can be ported even to systems that arguably nobody uses any more, and instrumental to that effort is keeping the system requirements to install and test PostgreSQL minimal. At this point, I wouldn't mind moving the goalposts from C89 to C89 plus a bunch of C99 features that are available on all the platforms we have buildfarm coverage for, and I wouldn't mind requiring perl to compile and install, full stop. But this patch has gone much further than that: you need a new-enough version of perl, and a new-enough version of a bunch of modules that aren't installed by default, and maybe not even in the vendor install, and the documentation on how to make it work is an embarrassment: http://www.postgresql.org/docs/devel/static/regress-tap.html

Beyond all that, I have serious doubts about whether, even if we eventually get these tests mostly working in most places, they will actually catch any bugs. For example, consider dropdb. The TAP tests cover the following points:

- When run with --help or --version, it should exit with code 0, print something on stderr, and print nothing on stdout.
- When run with --not-a-valid-option, it should exit with a non-zero exit code, print something on stderr, and print nothing on stdout.
- dropdb foobar1 should cause statement: DROP DATABASE foobar1 to show up in the postgresql log file.
- dropdb nonexistent should exit non-zero.

These are certainly good things to test, but I'd argue that once you've verified that they are working, they're unlikely to get broken again in the future. I'm generally supportive of the idea that we need more automated tests, but these tests seem pretty low-value to me, even if they were running everywhere, and even more so if they aren't.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Materialized views don't show up in information_schema
* Robert Haas (robertmh...@gmail.com) wrote: On Fri, Oct 17, 2014 at 8:10 PM, Stephen Frost sfr...@snowman.net wrote: * Peter Eisentraut (pete...@gmx.net) wrote: On 10/16/14 9:45 AM, Stephen Frost wrote: Alright, coming back to this, I have to ask - how are matviews different from views from the SQL standard's perspective? I tried looking through the standard to figure it out (and I admit that I probably missed something), but the only thing appears to be a statement in the standard that (paraphrased) functions are run when the view is queried, and that strikes me as a relatively minor point.. To me, the main criterion is that you cannot DROP VIEW a materialized view. That is an entirely correctable situation. We don't require 'DROP UNLOGGED TABLE'. I think that's an inapposite comparison. The fact that a table is unlogged is merely a property of the table; it does not change the fact that it is a table. A materialized view, on the other hand, is a different kind of object from a view. This manifests itself in the fact that it's represented by a different relkind; and that different syntax is used not only for DROP but also for COMMENT, ALTER VIEW, SECURITY LABEL, and ALTER EXTENSION .. ADD/DROP; and that the set of supported operations on a materialized view is different from a regular view (and will probably be more different in the future). If we really want to change this, we can't just change DROP VIEW; we need to change all of the places in a consistent fashion, and we probably have to continue to support the old syntax so that we don't break existing dumps. Yes, clearly we'd have to adjust the syntax for all of the commands to support both with MATERIALIZED and without. I don't have an issue with that. The argument makes more sense if there are operations which ONLY work against a regular view and do *not* work against a matview. Operations which operate only against a matview simply must have MATERIALIZED used for them. But I think it's the wrong thing anyway, because it presumes that, when Kevin chose to make materialized views a different relkind and a different object type, rather than just a property of an object, he made the wrong call, and I don't agree with that. I think he got it exactly right. A materialized view is really much more like a table than a view: it has storage and can be vacuumed, clustered, analyzed, and so on. That's far more significant IMV than the difference between a table and an unlogged table. I don't think Kevin was wrong to use a different relkind, but I don't buy into the argument that a different relkind means it's not a view. As for the other comments, I agree that a matview is *more* than a view, but at its base, in my view (pun intended), it's still a view. Why not call it a materialized query? Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Sun, Oct 26, 2014 at 4:39 PM, Peter Geoghegan p...@heroku.com wrote: I don't care whether you actually generate index-paths or not, and in fact I suspect it makes no sense to do so. But I do care that you do a cost comparison between the available indexes and pick the one that looks cheapest. If somebody's got a bloated index and builds a new one, they will want it to get used even before they drop the old one. That seems extremely narrow. I am about ready to just do what you say, by costing the index based on something like relpages, but for the record I see no point. If I see no point, and you're sure that there is a point, then there is a danger that I'll miss the point, which contributes to my giving push back. I know I said that already, but I reiterate it once more for emphasis. I don't think it's either narrow or difficult: I think you just need to apply the existing index costing function. CLUSTER does a similar calculation to figure out whether to seq-scan or sort, even though it doesn't use index paths per se, or at least I don't think it does. Even if that weren't an issue, though, the fact that you can't re-parse but you can re-plan is a killer point AFAICS. It means you are borked if the statement gets re-executed after the index you picked is dropped. That simply isn't the case. As specifically noted in comments in the patch, relcache invalidation works in such a way as to invalidate the query proper, even when only an index has been dropped. For a time during development, the semantic dependence was more directly represented by actually having extract_query_dependencies() extract the arbitrating unique index pg_class Oid as a dependency for the benefit of the plan cache, but I ultimately decided against doing anything more than noting the potential future problem (the problem that arises in a world where relcache invalidation is more selective). I don't know what you mean by "invalidate the query proper". Here's my chain of reasoning: If the index selection occurs in parse analysis, then it's embedded in the parse tree. If this can be used in a writable CTE, then the parse tree can get stored someplace. If the index is then dropped, the parse tree is no good any more. Where's the flaw in that reasoning? I haven't looked at the patch to know exactly what will happen in that case, but it seems to me like it must either not work or there must be some ugly hack to make it work. But, I digress: I'll have inference occur during planning during the next revision since you think that's important. Great. I think that it's accurate to say that I asked the community to do that once, last year. I was even very explicit about it at the time. I'm surprised that you think that's the case now, though. I learned my lesson a little later: don't do that, because fellow contributors are unwilling to go along with it, even for something as important as this. I'm just telling you how I feel. I can't vouch for it being perfectly fair or accurate, though I do strive for that. As recently as a few months ago, you wanted me to go a *completely* different direction with this (i.e. attempt an UPDATE first). I'm surprised that you're emphasizing the EXCLUDED()/CONFLICTING() thing so much. Should I take it that I more or less have your confidence that the way the patch fits together at a high level is sound? No. Commenting on one aspect of a patch doesn't imply agreement with other aspects of the patch. Please don't put words into my mouth. 
I haven't reviewed this patch in detail; I've only commented on specific aspects of it as they have arisen in discussion. I may or may not someday review it in detail, but not before I'm fairly confident that the known issues raised by other community members have been addressed as thoroughly as possible. In reality, the main reason for that is: getting this feature to a point where it might plausibly be committed is bloody difficult, as evidenced by the fact that it took this long for someone to produce a patch. Please don't lose sight of that. I'm not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TAP test breakage on MacOS X
On 10/27/2014 05:41 PM, Robert Haas wrote: On Sun, Oct 26, 2014 at 12:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: The larger issue though is that even with both the above things fixed, the TAP tests would still be an expensive no-op on the majority of buildfarm members. AFAICT, I do not own a single machine on which the current TAP tests will consent to run, and in most cases that's after going out of my way to fetch CPAN modules that aren't in the vendor Perl installs. Peter's mostly been fixing the portability issues by disabling tests, which I guess is better than no fix at all, but it leaves darn little useful functionality. I agree, emphatically. Honestly, if we can't get these tests running everywhere with reasonable effort, we should just rip them out. We've gone to a lot of trouble in general to make sure that our source code can be ported even to systems that arguably nobody uses any more, and instrumental to that effort is keeping the system requirements to install and test PostgreSQL minimal. At this point, I wouldn't mind moving the goalposts from C89 to C89 + a bunch of C99 features that are available on all the platforms we have buildfarm coverage for, and I wouldn't mind requiring perl to compile and install, full stop. But this patch has gone much further than that: you need a new-enough version of perl, and a new-enough version of a bunch of modules that aren't installed by default, and maybe not even in the vendor install, and the documentation on how to make it work is an embarrassment: http://www.postgresql.org/docs/devel/static/regress-tap.html Beyond all that, I have serious doubts about whether, even if we eventually get these tests mostly working in most places, they will actually catch any bugs. The existing tests are not very useful, but it certainly would be nice to have some infrastructure like TAP to write useful tests. For example, I recently wrote a test suite for SSL, using TAP (http://www.postgresql.org/message-id/53df9afe.4090...@vmware.com), and it would be nice to have some TAP tests for hot standby, replication, PITR etc. I'm not sure if the current infrastructure is very good for that, but we need something. If the current infrastructure is beyond repair, then let's discuss what we could replace it with. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TAP test breakage on MacOS X
On 10/27/2014 11:41 AM, Robert Haas wrote: On Sun, Oct 26, 2014 at 12:29 PM, Tom Lane t...@sss.pgh.pa.us wrote: The larger issue though is that even with both the above things fixed, the TAP tests would still be an expensive no-op on the majority of buildfarm members. AFAICT, I do not own a single machine on which the current TAP tests will consent to run, and in most cases that's after going out of my way to fetch CPAN modules that aren't in the vendor Perl installs. Peter's mostly been fixing the portability issues by disabling tests, which I guess is better than no fix at all, but it leaves darn little useful functionality. I agree, emphatically. Honestly, if we can't get these tests running everywhere with reasonable effort, we should just rip them out. We've gone to a lot of trouble in general to make sure that our source code can be ported even to systems that arguably nobody uses any more, and instrumental to that effort is keeping the system requirements to install and test PostgreSQL minimal. At this point, I wouldn't mind moving the goalposts from C89 to C89 + a bunch of C99 features that are available on all the platforms we have buildfarm coverage for, and I wouldn't mind requiring perl to compile and install, full stop. But this patch has gone much further than that: you need a new-enough version of perl, and a new-enough version of a bunch of modules that aren't installed by default, and maybe not even in the vendor install, and the documentation on how to make it work is an embarrassment: http://www.postgresql.org/docs/devel/static/regress-tap.html Beyond all that, I have serious doubts about whether, even if we eventually get these tests mostly working in most places, they will actually catch any bugs. For example, consider dropdb. The TAP tests cover the following points:
- When run with --help or --version, it should exit with code 0, print something on stdout, and print nothing on stderr.
- When run with --not-a-valid-option, it should exit with a non-zero exit code, print something on stderr, and print nothing on stdout.
- "dropdb foobar1" should cause "statement: DROP DATABASE foobar1" to show up in the postgresql log file.
- "dropdb nonexistent" should exit non-zero.
These are certainly good things to test, but I'd argue that once you've verified that they are working, they're unlikely to get broken again in the future. I'm generally supportive of the idea that we need more automated tests, but these tests seem pretty low-value to me, even if they were running everywhere, and even more so if they aren't. Yeah. I think at the very least they should be removed from the check-world and installcheck-world targets until this is sorted out. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] What exactly is our CRC algorithm?
On 10/09/2014 12:13 AM, Andres Freund wrote: On 2014-10-08 22:13:46 +0300, Heikki Linnakangas wrote: As far as I can tell, PostgreSQL's so-called CRC algorithm doesn't correspond to any bit-by-bit CRC variant and polynomial. My math skills are not strong enough to determine what the consequences of that are. It might still be a decent checksum. Or not. I couldn't tell if the good error detection properties of the normal CRC-32 polynomial apply to our algorithm or not. Additional interesting datapoints are that hstore and ltree contain the same tables - but properly use the reflected computation. Thoughts? It clearly seems like a bad idea to continue with this - I don't think anybody here knows which guarantees this gives us. The question is how can we move away from this. There's unfortunately two places that embed PGC32 that are likely to prove problematic when fixing the algorithm: pg_trgm and tsgist both seem to include crc's in their logic in a persistent way. I think we should provide INIT/COMP/FIN_PG32 using the current algorithm for these. Agreed, it's not worth breaking pg_upgrade for this. If we're switching to a saner computation, we should imo also switch to a better polynomial - CRC-32C has better error detection capabilities than CRC32 and is available in hardware. As we're paying the price of breaking compat anyway... Agreed. Arguably we could also say that given that there's been little evident problems with the borked computation we could also switch to a much faster hash instead of continuing to use crc... I don't feel like taking the leap. Once we switch to slice-by-4/8 and/or use a hardware instruction when available, CRC is fast enough. I came up with the attached patches. They do three things: 1. Get rid of the 64-bit CRC code. It's not used for anything, and hasn't been for years, so it doesn't seem worth spending any effort to fix it. 2. Switch to CRC-32C (Castagnoli) for WAL and other places that don't need to remain compatible across major versions. 3. Use the same lookup table for hstore and ltree, as used for the legacy "almost CRC-32" algorithm. The tables are identical, so might as well. Any objections? - Heikki

From 4fe6472976f2efc22035e802067a70d3cca61d79 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas heikki.linnakan...@iki.fi
Date: Mon, 27 Oct 2014 12:01:19 +0200
Subject: [PATCH 1/2] Remove support for 64-bit CRC.

It hasn't been used for anything in backend code for a long time.
---
 src/include/utils/pg_crc.h        |  96 -
 src/include/utils/pg_crc_tables.h | 416 --
 2 files changed, 512 deletions(-)

diff --git a/src/include/utils/pg_crc.h b/src/include/utils/pg_crc.h
index 375c405..f43f4aa 100644
--- a/src/include/utils/pg_crc.h
+++ b/src/include/utils/pg_crc.h
@@ -10,9 +10,6 @@
  * We use a normal (not reflected, in Williams' terms) CRC, using initial
  * all-ones register contents and a final bit inversion.
  *
- * The 64-bit variant is not used as of PostgreSQL 8.1, but we retain the
- * code for possible future use.
- *
  *
  * Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -56,97 +53,4 @@ do { \
 /* Constant table for CRC calculation */
 extern CRCDLLIMPORT const uint32 pg_crc32_table[];
-
-#ifdef PROVIDE_64BIT_CRC
-
-/*
- * If we use a 64-bit integer type, then a 64-bit CRC looks just like the
- * usual sort of implementation.  However, we can also fake it with two
- * 32-bit registers.  Experience has shown that the two-32-bit-registers code
- * is as fast as, or even much faster than, the 64-bit code on all but true
- * 64-bit machines.  We use SIZEOF_VOID_P to check the native word width.
- */
-
-#if SIZEOF_VOID_P < 8
-
-/*
- * crc0 represents the LSBs of the 64-bit value, crc1 the MSBs.  Note that
- * with crc0 placed first, the output of 32-bit and 64-bit implementations
- * will be bit-compatible only on little-endian architectures.  If it were
- * important to make the two possible implementations bit-compatible on
- * all machines, we could do a configure test to decide how to order the
- * two fields, but it seems not worth the trouble.
- */
-typedef struct pg_crc64
-{
-	uint32		crc0;
-	uint32		crc1;
-} pg_crc64;
-
-/* Initialize a CRC accumulator */
-#define INIT_CRC64(crc) ((crc).crc0 = 0xffffffff, (crc).crc1 = 0xffffffff)
-
-/* Finish a CRC calculation */
-#define FIN_CRC64(crc)	((crc).crc0 ^= 0xffffffff, (crc).crc1 ^= 0xffffffff)
-
-/* Accumulate some (more) bytes into a CRC */
-#define COMP_CRC64(crc, data, len)	\
-do { \
-	uint32		__crc0 = (crc).crc0; \
-	uint32		__crc1 = (crc).crc1; \
-	unsigned char *__data = (unsigned char *) (data); \
-	uint32		__len = (len); \
-\
-	while (__len-- > 0) \
-	{ \
-		int			__tab_index = ((int) (__crc1 >> 24) ^ *__data++) & 0xFF; \
-		__crc1 = pg_crc64_table1[__tab_index] ^ ((__crc1 << 8) | (__crc0 >> 24)); \
-		__crc0 =
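To make the reflected-versus-normal distinction in this thread concrete: CRC-32C is conventionally computed in reflected form, i.e. the register shifts right and each input byte enters at the low end, whereas the macros being removed above index the table with the register's high byte and shift left. A self-contained, byte-at-a-time sketch of reflected CRC-32C (the standard textbook form, not PostgreSQL's actual macros; a slice-by-4/8 or hardware implementation computes the same function):

	#include <stddef.h>
	#include <stdint.h>

	static uint32_t crc32c_table[256];

	/* Build the lookup table for the reflected CRC-32C polynomial 0x82F63B78. */
	static void
	crc32c_init(void)
	{
		for (uint32_t i = 0; i < 256; i++)
		{
			uint32_t	c = i;

			for (int k = 0; k < 8; k++)
				c = (c & 1) ? (c >> 1) ^ 0x82F63B78 : (c >> 1);
			crc32c_table[i] = c;
		}
	}

	/* All-ones initialization, table-driven accumulation, final inversion. */
	static uint32_t
	crc32c(const void *data, size_t len)
	{
		const unsigned char *p = data;
		uint32_t	crc = 0xffffffff;

		while (len-- > 0)
			crc = crc32c_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
		return crc ^ 0xffffffff;
	}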
Re: [HACKERS] BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)
On 10/27/2014 02:12 PM, Fujii Masao wrote: On Fri, Oct 24, 2014 at 10:05 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 10/23/2014 11:09 AM, Heikki Linnakangas wrote: At least for master, we should consider changing the way the archiving works so that we only archive WAL that was generated in the same server. I.e. we should never try to archive WAL files belonging to another timeline. I just remembered that we discussed a different problem related to this some time ago, at http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyot...@lab.ntt.co.jp. The conclusion of that was that at promotion, we should not archive the last, partial, segment from the old timeline. So, this is what I came up with for master. Does anyone see a problem with it? What about the problem that I raised upthread? This is, the patch prevents the last, partial, WAL file of the old timeline from being archived. So we can never PITR the database to the point that the last, partial WAL file has. A partial WAL file is never archived in the master server to begin with, so if it's ever used in archive recovery, the administrator must have performed some manual action to copy the partial WAL file from the original server. When he does that, he can also copy it manually to the archive, or whatever he wants to do with it. Note that the same applies to any complete, but not-yet archived WAL files. But we've never had any mechanism in place to archive those in the new instance, after PITR. Isn't this problem? For example, please imagine the following scenario. 1. The important data was deleted but no one noticed that. This deletion was logged in last, partial WAL file. 2. The server crashed and DBA started an archive recovery from old backup. 3. After recovery, all WAL files of the old timeline were recycled. 4. Finally DBA noticed the loss of important data and tried to do PITR to the point where the data was deleted. HOWEVER, the WAL file containing that deletion operation no longer exists. So DBA will never be able to recover that important data I think you're missing a step above: 1.5: The administrator copies the last, partial WAL file (and any complete but not yet-archived files) to the new server's pg_xlog directory. Without that, it won't be available for PITR anyway, and the new server won't see it or try to archive it, no matter what. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
On 10/27/2014 03:46 PM, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 10/27/2014 03:21 PM, Tomas Vondra wrote: Thinking about this a bit more, do we really need a full checkpoint? That is a checkpoint of all the databases in the cluster? Why checkpointing the source database is not enough? A full checkpoint ensures that you always begin recovery *after* the DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during crash recovery (except when you crash before the transaction commits, in which case it doesn't matter if the new database's directory is borked). Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log all the copied files instead of doing a cp -r-equivalent directory copy. That would fix a number of existing replay hazards as well as making it safe to do what Tomas wants. In the small scale this would cause more I/O (2 copies of the template database's data) but in production situations we might well come out ahead by avoiding a forced checkpoint of the rest of the cluster. Also I guess we could skip WAL-logging if WAL archiving is off, similarly to the existing optimization for CREATE INDEX etc. That would be a nasty surprise for anyone who's using CREATE DATABASE as a fast way to clone a large database. But I would be OK with that, at least if we can skip the WAL-logging with wal_level=minimal. - Heikki -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
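For illustration, a sketch of the WAL-everything approach under discussion, modeled on what copy_relation_data() in tablecmds.c already does for ALTER TABLE ... SET TABLESPACE; the function name is made up and this is only an outline of the idea, not the eventual CREATE DATABASE patch (smgrnblocks, smgrread, smgrextend and log_newpage are existing backend functions):

	/* Copy one relation fork block by block, WAL-logging each page. */
	static void
	copy_fork_logged(SMgrRelation src, SMgrRelation dst,
					 ForkNumber forknum, bool use_wal)
	{
		char	   *buf = palloc(BLCKSZ);
		Page		page = (Page) buf;
		BlockNumber nblocks = smgrnblocks(src, forknum);
		BlockNumber blkno;

		for (blkno = 0; blkno < nblocks; blkno++)
		{
			smgrread(src, forknum, blkno, buf);

			/* WAL-log the page instead of relying on a directory-level copy */
			if (use_wal)
				log_newpage(&dst->smgr_rnode.node, forknum, blkno, page, false);

			/* skipFsync = true: durability comes from WAL replay or a later sync */
			smgrextend(dst, forknum, blkno, buf, true);
		}
		pfree(buf);
	}

With use_wal tied to wal_level, as with the existing CREATE INDEX optimization, wal_level=minimal could still skip the logging and fsync the copied files instead.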
Re: [HACKERS] TAP test breakage on MacOS X
Heikki Linnakangas hlinnakan...@vmware.com writes: On 10/27/2014 05:41 PM, Robert Haas wrote: Beyond all that, I have serious doubts about whether, even if we eventually get these tests mostly working in most places, whether they will actually catch any bugs. The existing tests are not very useful, but it certainly would be nice to have some infrastructure like TAP to write useful tests. Yeah. I think this is an if you build it they will come situation. The existing tests are really only proof-of-concept; if we can get the infrastructure to the point of being widely usable, we can certainly use it to write many more-useful tests. So I'm not for giving up and ripping it out in the near future. But I do think there's a lot of test-infrastructure work left here to get to the point of being widely usable, and I'm not sure if anyone is committed to making that happen. But let's stop talking in generalities and try to quantify where we think the bar needs to be set. Is it okay to depend on non-core CPAN modules at all? How old versions of Perl, Test::More, etc do we need it to work with? For my own purposes, I'd be satisfied if it ran on RHEL6 and reasonably recent OS X. For RHEL6, I'd not want to have to install any CPAN modules other than the versions supplied by Red Hat (which means no Test::More subplans). This platform is only four years old, if you can't cope with that then you're definitely too bleeding-edge to be a portable test infrastructure. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TAP test breakage on MacOS X
Andrew Dunstan and...@dunslane.net writes: Yeah. I think at the very least they should be removed from the check-world and installcheck-world targets until this is sorted out. +1 for doing that in the 9.4 branch; I'm surprised we've not already heard bitching from packagers about how far we've moved the goalposts for what's required to get make check-world to work. As I said in my other reply, I'm not giving up on this (yet) so I'd be happy with leaving it in check-world in the master branch. But it's not ready for prime time, and 9.4 needs to ship soon. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On 27 October 2014 15:55, Robert Haas robertmh...@gmail.com wrote: Commenting on one aspect of a patch doesn't imply agreement with other aspects of the patch. Please don't put words into my mouth. I haven't reviewed this patch in detail; I've only commented on specific aspects of it as they have arisen in discussion. I may or may not someday review it in detail, but not before I'm fairly confident that the known issues raised by other community members have been addressed as thoroughly as possible. +1 -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Mon, Oct 27, 2014 at 9:43 AM, Simon Riggs si...@2ndquadrant.com wrote: On 27 October 2014 15:55, Robert Haas robertmh...@gmail.com wrote: Commenting on one aspect of a patch doesn't imply agreement with other aspects of the patch. Please don't put words into my mouth. I haven't reviewed this patch in detail; I've only commented on specific aspects of it as they have arisen in discussion. I may or may not someday review it in detail, but not before I'm fairly confident that the known issues raised by other community members have been addressed as thoroughly as possible. +1 I wasn't putting words in anyone's mouth; I *don't* think that it's true that Robert thinks patches 0001-* and 0002-* are perfectly fine, and implied as much myself. I just think the *strongly* worded disapproval of the user-visible interface of 0003-* was odd; it was way out of proportion to its immediate importance to getting this patch on track. AFAICT it was the *only* feedback that I didn't act on with V1.3 (Robert's complaint about how inference happens during parse analysis was a response to V1.3). I'm not always going to be able to act on every item of feedback immediately, or I'll have my own ideas about how to handle certain things. I don't think that's all that big of a deal, since I've acted on almost all feedback. I think by far the biggest problem here is the lack of attention to the design from others. I did a lot of copy-editing to the Wiki page yesterday. There are actually few clear open items now: https://wiki.postgresql.org/wiki/UPSERT#Open_Items Some previous open items have been moved to here: https://wiki.postgresql.org/wiki/UPSERT#Miscellaneous_odd_properties_of_proposed_ON_CONFLICT_patch This is basically a section describing things that have not been controversial or in need of adjusting, and may well never be, but I wish we'd talk about because they're in some way novel or counter-intuitive. This is the kind of things I'd like us to discuss more. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On 27 October 2014 17:44, Peter Geoghegan p...@heroku.com wrote: I did a lot of copy-editing to the Wiki page yesterday. There are actually few clear open items now: https://wiki.postgresql.org/wiki/UPSERT#Open_Items I've read this page. Please do these things, both of which have been requested multiple times... 1. Take the specific docs that relate to the patch and put them in one place, so that everybody can read and understand and agree the behaviour of the patch. So that someone reading that can see *exactly* what is being proposed, not read through pages of other unchanged material hoping to catch the few points that really differ. 2. Replace CONFLICTING() with what I have asked for in earlier posts. The request is not repeated again, to avoid confusion. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] jsonb generator functions
Hi 2014-10-27 15:33 GMT+01:00 Andrew Dunstan and...@dunslane.net: On 10/15/2014 03:54 PM, Andrew Dunstan wrote: I checked the code, and I have only two small objections - the name jsonb_object_two_arg is not good - maybe json_object_keys_values? It's consistent with the existing json_object_two_arg. In all cases I think I kept the names the same except for changing json to jsonb. Note that these _two_arg functions are not visible at the SQL level - they are only visible in the C code. I'm happy to be guided by others in changing or keeping these names. Next: there are no tests for the to_jsonb function. Oh, my bad. I'll add some. Thanks for the review. Here is a new patch that includes documentation and addresses all these issues, except that I didn't change the name of jsonb_object_two_arg to keep it consistent with the name of json_object_two_arg. I'm happy to change both if people feel it matters. I checked the last patch, jsonbmissingfunc5.patch, and I have no objections:
1. This jsonb API is consistent with the current JSON API, so we surely want this functionality
2. The implementation is clean, without side effects and without impact on current code
3. Patching and compilation work without any issues or warnings
4. The source code respects PostgreSQL coding rules
5. All regression tests pass without problems
6. The documentation builds without problems
7. The patch contains the necessary regression tests
8. The patch contains the necessary documentation for the new functions
The patch is ready for committers. Thank you for the patch Regards Pavel cheers andrew
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Mon, Oct 27, 2014 at 11:12 AM, Simon Riggs si...@2ndquadrant.com wrote: 1. Take the specific docs that relate to the patch and put them in one place, so that everybody can read and understand and agree the behaviour of the patch. So that someone reading that can see *exactly* what is being proposed, not read through pages of other unchanged material hoping to catch the few points that really differ. I'm afraid I don't follow. I have links to the user-visible documentation (v1.3) on the Wiki: https://wiki.postgresql.org/wiki/UPSERT#Documentation The documentation is complete. I link to every interesting page from the documentation directly from the Wiki, too. Of course, I also describe the issues in more detail for those with an interest in the implementation on the Wiki page itself (and list open issues). I have isolation tests that illustrate the new facets of visibility for READ COMMITTED, too. How, specifically, have I failed to do what you ask here? If you want to see exactly what has changed, in a differential fashion, well, that's what a diff is for. I'm not aware of any existing way of rendering to html for readability, while highlighting what is new. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] TODO request: log_long_transaction
Hackers, I just realized that there is one thing we can't log currently: transactions which last more than #ms. This is valuable diagnostic information when looking for issues like causes of bloat and deadlocks. I'd like it to be on the TODO list because it seems like part of a good GSOC project or first-time contribution. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] TODO request: log_long_transaction
On 27 October 2014 19:21, Josh Berkus j...@agliodbs.com wrote: Hackers, I just realized that there is one thing we can't log currently: transactions which last more than #ms. This is valuable diagnostic information when looking for issues like causes of bloat and deadlocks. I'd like it to be on the TODO list because it seems like part of a good GSOC project or first-time contribution. So effectively, log_min_duration_transaction? Sounds useful. Thom
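A minimal sketch of how such a GUC could be implemented, by analogy with log_min_duration_statement — the GUC name and the idea of calling this from the commit/abort path are assumptions, not existing code (GetCurrentTransactionStartTimestamp, TimestampDifference and TimestampDifferenceExceeds are existing backend functions):

	#include "postgres.h"

	#include "access/xact.h"
	#include "utils/timestamp.h"

	/* hypothetical GUC: -1 disables, 0 logs all, otherwise a threshold in ms */
	int			log_min_duration_transaction = -1;

	static void
	maybe_log_long_transaction(void)
	{
		TimestampTz start = GetCurrentTransactionStartTimestamp();
		TimestampTz now = GetCurrentTimestamp();

		if (log_min_duration_transaction >= 0 &&
			TimestampDifferenceExceeds(start, now, log_min_duration_transaction))
		{
			long		secs;
			int			usecs;

			TimestampDifference(start, now, &secs, &usecs);
			ereport(LOG,
					(errmsg("duration of transaction: %ld.%03d ms",
							secs * 1000L + usecs / 1000, usecs % 1000)));
		}
	}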
[HACKERS] Reducing the cost of sinval messaging
I happened to be looking at sinvaladt.c and noticed the loop added in commit b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4:

	/*
	 * Now that the maxMsgNum change is globally visible, we give everyone
	 * a swift kick to make sure they read the newly added messages.
	 * Releasing SInvalWriteLock will enforce a full memory barrier, so
	 * these (unlocked) changes will be committed to memory before we exit
	 * the function.
	 */
	for (i = 0; i < segP->lastBackend; i++)
	{
		ProcState  *stateP = &segP->procState[i];

		stateP->hasMessages = true;
	}

This seems fairly awful when there are a large number of backends. Why could we not remove the hasMessages flags again, and change the unlocked test

	if (!stateP->hasMessages)
		return 0;

into

	if (stateP->nextMsgNum == segP->maxMsgNum &&
		!stateP->resetState)
		return 0;

If we are looking at stale shared state, this test might produce a false positive, but I don't see why it's any less safe than testing a hasMessages boolean. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing the cost of sinval messaging
On Mon, Oct 27, 2014 at 3:31 PM, Tom Lane t...@sss.pgh.pa.us wrote: I happened to be looking at sinvaladt.c and noticed the loop added in commit b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4:

	/*
	 * Now that the maxMsgNum change is globally visible, we give everyone
	 * a swift kick to make sure they read the newly added messages.
	 * Releasing SInvalWriteLock will enforce a full memory barrier, so
	 * these (unlocked) changes will be committed to memory before we exit
	 * the function.
	 */
	for (i = 0; i < segP->lastBackend; i++)
	{
		ProcState  *stateP = &segP->procState[i];

		stateP->hasMessages = true;
	}

This seems fairly awful when there are a large number of backends. We did quite a bit of testing at the time and found that it was a clear improvement over what we had been doing previously. Of course, that's not to say we can't or shouldn't improve it further. Why could we not remove the hasMessages flags again, and change the unlocked test

	if (!stateP->hasMessages)
		return 0;

into

	if (stateP->nextMsgNum == segP->maxMsgNum &&
		!stateP->resetState)
		return 0;

If we are looking at stale shared state, this test might produce a false positive, but I don't see why it's any less safe than testing a hasMessages boolean. It was discussed at the time: http://www.postgresql.org/message-id/CA+TgmoY3Q56ZR6i8h+iGhXCw6rCZyvdWJ3RQT=PMVev4-=+n...@mail.gmail.com http://www.postgresql.org/message-id/13077.1311702...@sss.pgh.pa.us -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Mon, Oct 27, 2014 at 1:44 PM, Peter Geoghegan p...@heroku.com wrote: I think by far the biggest problem here is the lack of attention to the design from others. I find that attitude incredible. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Mon, Oct 27, 2014 at 1:22 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Oct 27, 2014 at 1:44 PM, Peter Geoghegan p...@heroku.com wrote: I think by far the biggest problem here is the lack of attention to the design from others. I find that attitude incredible. What I mean is that that's the thing that clearly needs scrutiny the most. That isn't an attitude - it's a technical opinion. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing the cost of sinval messaging
Robert Haas robertmh...@gmail.com writes: On Mon, Oct 27, 2014 at 3:31 PM, Tom Lane t...@sss.pgh.pa.us wrote: Why could we not remove the hasMessages flags again, and change the unlocked test

	if (!stateP->hasMessages)
		return 0;

into

	if (stateP->nextMsgNum == segP->maxMsgNum &&
		!stateP->resetState)
		return 0;

If we are looking at stale shared state, this test might produce a false positive, but I don't see why it's any less safe than testing a hasMessages boolean. It was discussed at the time: http://www.postgresql.org/message-id/CA+TgmoY3Q56ZR6i8h+iGhXCw6rCZyvdWJ3RQT=PMVev4-=+n...@mail.gmail.com http://www.postgresql.org/message-id/13077.1311702...@sss.pgh.pa.us Neither of those messages seem to me to bear on this point. AFAICS, the unlocked hasMessages test has a race condition, which the comment just above it argues isn't a problem in practice. What I propose above would have an absolutely indistinguishable race condition, which again wouldn't be a problem in practice if you believe the comment's argument. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] alter user/role CURRENT_USER
All, I just ran through the patch giving it a good once over, some items to address/consider/discuss: * Trailing whitespace. * Why are you making changes in foreigncmds.c? These seem like unnecessary changes. I see that you are trying to consolidate but this refactor seems potentially out of scope. * To the above point, is ResolveRoleId really necessary? Also, I'm not sure I agree with passing in the tuple, I don't see any reason to pull that lookup into this function. Also, couldn't you simply just add a short circuit in get_role_oid to check for current_user and return GetUserId() and similar for session_user? * In gram.y, is there really a need to have a separate RoleId_or_curruser? Why not:

-RoleId:		NonReservedWord			{ $$ = $1; };
+RoleId:		CURRENT_USER			{ $$ = "current_user"; }
+			| USER					{ $$ = "current_user"; }
+			| CURRENT_ROLE			{ $$ = "current_user"; }
+			| SESSION_USER			{ $$ = "session_user"; }
+			| NonReservedWord		{ $$ = $1; }
+			;

This would certainly reduce the number of changes to the grammar but might also help with covering the cases mentioned by Rushabh? Also are there cases when we don't want CURRENT_USER to be considered a RoleId? * The following seems like an unnecessary change:

-			| RoleId		{ $$ = (strcmp($1, "public") == 0) ? NULL : $1; }
+			RoleId			{ $$ = (strcmp($1, "public") == 0) ?
+									NULL : $1; }

* Why is htup.h included in dbcommands.h? * What's up with the following in user.c, did you replace tab with spaces?

-	HeapTuple	roletuple;
+	HeapTuple	roletuple;

-- Not working alter default privileges for role current_user grant SELECT ON TABLES TO current_user ; -- Not working grant user1 TO current_user; Agreed. The latter of the two seems like an interesting case to me though. But they both kind of jump out at me as potential for security concerns (but perhaps not given the role id checks, etc). At any rate, I'd still expect these to behave accordingly with notification/error messages when appropriate. There might be a few more syntaxes like this. I think this can be covered inherently by the grammar changes recommended above (if such changes are appropriate). Though, I'd also recommend investigating which commands are affected via the grammar (or RoleId) and then making sure to update the regression tests and the documentation accordingly. I understand that the patch is slightly getting bigger and more complex than what was originally proposed. IMHO, I think this patch has become more complex than is required. Before going back to more review on the latest patch I would like to understand the requirement of this new feature. Also would like others to comment on where/how we should restrict this feature? I think this is a fair request. I believe I can see the potential convenience of these changes, however, I'm uncertain of the necessity of them and whether or not it opens any security concerns (again, perhaps not, but I think it is worth asking). Also, how would this affect items like auditing? -Adam -- Adam Brightwell - adam.brightw...@crunchydatasolutions.com Database Engineer - www.crunchydatasolutions.com
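For what it's worth, the short circuit suggested above could be as simple as the following sketch; get_role_oid() is the real function declared in utils/acl.h, but this variant body — and the fall-through helper it calls — are purely illustrative:

	Oid
	get_role_oid(const char *rolname, bool missing_ok)
	{
		/* Map the special names onto the current session state. */
		if (strcmp(rolname, "current_user") == 0)
			return GetUserId();
		if (strcmp(rolname, "session_user") == 0)
			return GetSessionUserId();

		/* Otherwise do the normal syscache lookup (hypothetical helper). */
		return get_role_oid_real(rolname, missing_ok);
	}

Whether the special names should be smuggled through as strings like this, or represented explicitly in the parse tree, is of course part of what needs to be decided here.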
Re: [HACKERS] Reducing the cost of sinval messaging
On Mon, Oct 27, 2014 at 4:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: On Mon, Oct 27, 2014 at 3:31 PM, Tom Lane t...@sss.pgh.pa.us wrote: Why could we not remove the hasMessages flags again, and change the unlocked test

	if (!stateP->hasMessages)
		return 0;

into

	if (stateP->nextMsgNum == segP->maxMsgNum &&
		!stateP->resetState)
		return 0;

If we are looking at stale shared state, this test might produce a false positive, but I don't see why it's any less safe than testing a hasMessages boolean. It was discussed at the time: http://www.postgresql.org/message-id/CA+TgmoY3Q56ZR6i8h+iGhXCw6rCZyvdWJ3RQT=PMVev4-=+n...@mail.gmail.com http://www.postgresql.org/message-id/13077.1311702...@sss.pgh.pa.us Neither of those messages seem to me to bear on this point. AFAICS, the unlocked hasMessages test has a race condition, which the comment just above it argues isn't a problem in practice. What I propose above would have an absolutely indistinguishable race condition, which again wouldn't be a problem in practice if you believe the comment's argument. Yes, the read from hasMessages has a race condition, which is in fact not a problem in practice if you believe the comment's argument. However, your proposed formulation has a second race condition, which is discussed in those emails, and which is precisely why we rejected the design you're now proposing three years ago. That second race condition is that the read from stateP->nextMsgNum doesn't have to happen at the same time as the read from stateP->resetState. No CPU or compiler barrier separates them, so either can happen before the other. SICleanupQueue updates both values, so you have the problem first described precisely by Noah: # (In theory, you could # hit a case where the load of resetState gives an ancient "false" just as the # counters wrap to match.) While it practically seems very unlikely that this would ever really happen, some guy named Tom Lane told us we should worry about it, so I did. That led to this new design: http://www.postgresql.org/message-id/ca+tgmobefjvyqdjvbb6tss996s8zkvyrzttpx0chyo3hve5...@mail.gmail.com ...which is what ended up getting committed. Personally, I think trying to further optimize this is most likely a waste of time. I'm quite sure that there's a whole lot of stuff in the sinval code that could be further optimized, but it executes so infrequently that I believe it doesn't matter. Noah beat on this pretty hard at the time and found that the O(N) loop you're complaining about - in his testing - never went above 0.26% of CPU time and reduced overall CPU usage even in a sinval-intensive workload with 50 clients per core: http://www.postgresql.org/message-id/20110729030514.gb30...@tornado.leadboat.com If you've got some evidence that this is a real problem as opposed to a hypothetical one, we could certainly reopen the debate about whether the possibility that the loads from stateP->nextMsgNum and stateP->resetState will be widely enough spaced in time for a quarter-million messages to be queued up in the meantime is sufficiently unlikely to ignore. However, absent evidence of a tangible performance problem, I'd be quite disinclined to throw the guaranteed safety of the current approach out the window. We could do some testing where we crank that 262,144 value down to progressively lower numbers until the race condition actually manifests, just to see how far we are from disaster. 
But presumably the goalposts there shift every time we get a new generation of hardware; and I don't much like coding on the basis of "eh, seems unlikely the CPU will ever delay a read by *that* much". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
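For reference, a minimal sketch of the writer side of the committed flag-based design, with the memory-ordering requirement made explicit; in the real sinvaladt.c the barrier is supplied by releasing SInvalWriteLock rather than by an explicit pg_write_barrier(), so this illustrates the invariant, not the actual code:

	/* Publish new messages, then kick every reader. */
	static void
	publish_messages(SISeg *segP, int newMaxMsgNum)
	{
		int			i;

		segP->maxMsgNum = newMaxMsgNum;

		/* queue update must be globally visible before any flag is set */
		pg_write_barrier();

		for (i = 0; i < segP->lastBackend; i++)
			segP->procState[i].hasMessages = true;
	}

The reader's unlocked load of hasMessages can then produce only a harmless false negative, which the next kick corrects — the property the single-flag design buys over comparing two separately updated fields.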
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
On 27.10.2014 17:24, Heikki Linnakangas wrote: On 10/27/2014 03:46 PM, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 10/27/2014 03:21 PM, Tomas Vondra wrote: Thinking about this a bit more, do we really need a full checkpoint? That is a checkpoint of all the databases in the cluster? Why checkpointing the source database is not enough? A full checkpoint ensures that you always begin recovery *after* the DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during crash recovery (except when you crash before the transaction commits, in which case it doesn't matter if the new database's directory is borked). Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log all the copied files instead of doing a cp -r-equivalent directory copy. That would fix a number of existing replay hazards as well as making it safe to do what Tomas wants. In the small scale this would cause more I/O (2 copies of the template database's data) but in production situations we might well come out ahead by avoiding a forced checkpoint of the rest of the cluster. Also I guess we could skip WAL-logging if WAL archiving is off, similarly to the existing optimization for CREATE INDEX etc. That would be a nasty surprise for anyone who's using CREATE DATABASE as a fast way to clone a large database. But I would be OK with that, at least if we can skip the WAL-logging with wal_level=minimal. That's true. Sadly, I can't think of a solution that would address both use cases at the same time :-( The only thing I can think of is having two CREATE DATABASE flavors. One keeping the current approach (suitable for fast cloning) and one with the WAL logging (minimizing the CREATE DATABASE duration the impact on other backends). It will probably make the code significantly more complex, which is not exactly desirable, I guess. Also, if we keep the current code (even if only as a special case) it won't eliminate the existing replay hazards (which was one of the Tom's arguments for biting the bullet). I'm also thinking that for wal_level=archive and large databases, this won't really eliminate the checkpoint as it will likely generate enough WAL to hit checkpoint_segments and trigger a checkpoint anyway. No? That being said, our CREATE DATABASE docs currently say this Although it is possible to copy a database other than template1 by specifying its name as the template, this is not (yet) intended as a general-purpose COPY DATABASE facility. The principal limitation is that no other sessions can be connected to the template database while it is being copied. CREATE DATABASE will fail if any other connection exists when it starts; otherwise, new connections to the template database are locked out until CREATE DATABASE completes. See Section 21.3 for more information. I think that this limitation pretty much means no one should use CREATE DATABASE for cloning live databases in production environment (because of the locking). It also seems to me the general-purpose COPY DATABASE described in the docs is what we're describing in this thread. regards Tomas -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions
All, Attached is a patch with minor updates/corrections. -Adam -- Adam Brightwell - adam.brightw...@crunchydatasolutions.com Database Engineer - www.crunchydatasolutions.com diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile new file mode 100644 index b257b02..8cdc5cb *** a/src/backend/catalog/Makefile --- b/src/backend/catalog/Makefile *** POSTGRES_BKI_SRCS = $(addprefix $(top_sr *** 41,47 pg_foreign_data_wrapper.h pg_foreign_server.h pg_user_mapping.h \ pg_foreign_table.h pg_rowsecurity.h \ pg_default_acl.h pg_seclabel.h pg_shseclabel.h pg_collation.h pg_range.h \ ! toasting.h indexing.h \ ) # location of Catalog.pm --- 41,47 pg_foreign_data_wrapper.h pg_foreign_server.h pg_user_mapping.h \ pg_foreign_table.h pg_rowsecurity.h \ pg_default_acl.h pg_seclabel.h pg_shseclabel.h pg_collation.h pg_range.h \ ! pg_diralias.h toasting.h indexing.h \ ) # location of Catalog.pm diff --git a/src/backend/catalog/aclchk.c b/src/backend/catalog/aclchk.c new file mode 100644 index d30612c..3717bf5 *** a/src/backend/catalog/aclchk.c --- b/src/backend/catalog/aclchk.c *** *** 30,35 --- 30,36 #include catalog/pg_collation.h #include catalog/pg_conversion.h #include catalog/pg_database.h + #include catalog/pg_diralias.h #include catalog/pg_default_acl.h #include catalog/pg_event_trigger.h #include catalog/pg_extension.h *** *** 48,53 --- 49,55 #include catalog/pg_ts_config.h #include catalog/pg_ts_dict.h #include commands/dbcommands.h + #include commands/diralias.h #include commands/proclang.h #include commands/tablespace.h #include foreign/foreign.h *** ExecGrant_Type(InternalGrant *istmt) *** 3183,3188 --- 3185,3374 heap_close(relation, RowExclusiveLock); } + /* + * ExecuteGrantDirAliasStmt + * handles the execution of the GRANT/REVOKE ON DIRALIAS command. + * + * stmt - the GrantDirAliasStmt that describes the directory aliases and + *permissions to be granted/revoked. + */ + void + ExecuteGrantDirAliasStmt(GrantDirAliasStmt *stmt) + { + Relation pg_diralias_rel; + Oidgrantor; + List *grantee_ids = NIL; + AclMode permissions; + ListCell *item; + + /* Must be superuser to grant directory alias permissions */ + if (!superuser()) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg(must be superuser to grant directory alias permissions))); + + /* + * Grantor is optional. If it is not provided then set it to the current + * user. + */ + if (stmt-grantor) + grantor = get_role_oid(stmt-grantor, false); + else + grantor = GetUserId(); + + /* Convert grantee names to oids */ + foreach(item, stmt-grantees) + { + PrivGrantee *grantee = (PrivGrantee *) lfirst(item); + + if (grantee-rolname == NULL) + grantee_ids = lappend_oid(grantee_ids, ACL_ID_PUBLIC); + else + { + Oid roleid = get_role_oid(grantee-rolname, false); + grantee_ids = lappend_oid(grantee_ids, roleid); + } + } + + permissions = ACL_NO_RIGHTS; + + /* If ALL was provided then set permissions to ACL_ALL_RIGHTS_DIRALIAS */ + if (stmt-permissions == NIL) + permissions = ACL_ALL_RIGHTS_DIRALIAS; + else + { + /* Condense all permissions */ + foreach(item, stmt-permissions) + { + AccessPriv *priv = (AccessPriv *) lfirst(item); + permissions |= string_to_privilege(priv-priv_name); + } + } + + + /* + * Though it shouldn't be possible to provide permissions other than READ + * and WRITE, check to make sure no others have been set. If they have, + * then warn the user and correct the permissions. 
+ */ + if (permissions !((AclMode) ACL_ALL_RIGHTS_DIRALIAS)) + { + ereport(WARNING, + (errcode(ERRCODE_INVALID_GRANT_OPERATION), + errmsg(directory aliases only support READ and WRITE permissions))); + + permissions = ACL_ALL_RIGHTS_DIRALIAS; + } + + pg_diralias_rel = heap_open(DirAliasRelationId, RowExclusiveLock); + + /* Grant/Revoke permissions on directory aliases */ + foreach(item, stmt-directories) + { + Datum values[Natts_pg_diralias]; + bool replaces[Natts_pg_diralias]; + bool nulls[Natts_pg_diralias]; + ScanKeyData skey[1]; + HeapScanDesc scandesc; + HeapTuple tuple; + HeapTuple new_tuple; + Datum datum; + Oidowner_id; + Acl *dir_acl; + Acl *new_acl; + bool is_null; + intnum_old_members; + intnum_new_members; + Oid *old_members; + Oid *new_members; + Oiddiralias_id; + char *name; + + name = strVal(lfirst(item)); + + ScanKeyInit(skey[0], + Anum_pg_diralias_dirname, + BTEqualStrategyNumber, F_NAMEEQ, + CStringGetDatum(name)); + + scandesc = heap_beginscan_catalog(pg_diralias_rel, 1, skey); + + tuple = heap_getnext(scandesc, ForwardScanDirection); + + if (!HeapTupleIsValid(tuple)) + elog(ERROR, could not find
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Tomas Vondra t...@fuzzy.cz writes: That being said, our CREATE DATABASE docs currently say this Although it is possible to copy a database other than template1 by specifying its name as the template, this is not (yet) intended as a general-purpose COPY DATABASE facility. The principal limitation is that no other sessions can be connected to the template database while it is being copied. CREATE DATABASE will fail if any other connection exists when it starts; otherwise, new connections to the template database are locked out until CREATE DATABASE completes. See Section 21.3 for more information. I think that this limitation pretty much means no one should use CREATE DATABASE for cloning live databases in production environment (because of the locking). Good point. But the other side of that coin is that if somebody *was* doing this, the locking would mean that slowing it down would be even more painful than you might think. Still, personally I'm willing to accept that downside, given that we've pretty much always had the above caveat in the docs. It also seems to me the general-purpose COPY DATABASE described in the docs is what we're describing in this thread. Well, no, because the restriction "nobody can be connected to the source database" would still apply; relaxing that would be enormously more complicated, and probably fragile, than what we're talking about here. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pgcrypto: PGP signatures
On Mon, Oct 20, 2014 at 3:32 PM, Marko Tiikkaja ma...@joh.to wrote: Hi, Here's the rebased patch -- as promised -- in a v7. Hi Marko, Using the same script as for the memory leak, I am getting seg faults using this patch.

24425 2014-10-27 15:42:11.819 PDT LOG: server process (PID 24452) was terminated by signal 11: Segmentation fault
24425 2014-10-27 15:42:11.819 PDT DETAIL: Failed process was running: select pgp_sym_decrypt_verify(dearmor($1),$2,dearmor($3),'debug=1')

gdb backtrace:

#0  pfree (pointer=0x0) at mcxt.c:749
#1  0x7f617d7973e7 in pgp_sym_decrypt_verify_text (fcinfo=0x1d7f1f0) at pgp-pgsql.c:1047
#2  0x0059f185 in ExecMakeFunctionResultNoSets (fcache=0x1d7f180, econtext=0x1d7fb00, isNull=0x7fff02e902bf, isDone=<value optimized out>) at execQual.c:1992
#3  0x0059ae0c in ExecEvalExprSwitchContext (expression=<value optimized out>, econtext=<value optimized out>, isNull=<value optimized out>, isDone=<value optimized out>) at execQual.c:4320

Cheers, Jeff
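The backtrace points at pfree() being reached with a NULL pointer: unlike free(3), pfree() does not accept NULL. Whatever the actual fix in pgp-pgsql.c turns out to be, the failing cleanup path presumably needs a guard of this shape (illustrative only, not the patch's code):

	/* pfree(NULL) is not allowed, so guard pointers that may be unset. */
	static void
	pfree_if_set(void *ptr)
	{
		if (ptr != NULL)
			pfree(ptr);
	}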
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
On 10/27/2014 05:58 PM, Tomas Vondra wrote: On 27.10.2014 17:24, Heikki Linnakangas wrote: On 10/27/2014 03:46 PM, Tom Lane wrote: Heikki Linnakangas hlinnakan...@vmware.com writes: On 10/27/2014 03:21 PM, Tomas Vondra wrote: Thinking about this a bit more, do we really need a full checkpoint? That is a checkpoint of all the databases in the cluster? Why checkpointing the source database is not enough? A full checkpoint ensures that you always begin recovery *after* the DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during crash recovery (except when you crash before the transaction commits, in which case it doesn't matter if the new database's directory is borked). Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log all the copied files instead of doing a cp -r-equivalent directory copy. That would fix a number of existing replay hazards as well as making it safe to do what Tomas wants. In the small scale this would cause more I/O (2 copies of the template database's data) but in production situations we might well come out ahead by avoiding a forced checkpoint of the rest of the cluster. Also I guess we could skip WAL-logging if WAL archiving is off, similarly to the existing optimization for CREATE INDEX etc. That would be a nasty surprise for anyone who's using CREATE DATABASE as a fast way to clone a large database. But I would be OK with that, at least if we can skip the WAL-logging with wal_level=minimal. That's true. Sadly, I can't think of a solution that would address both use cases at the same time :-( The only thing I can think of is having two CREATE DATABASE flavors. One keeping the current approach (suitable for fast cloning) and one with the WAL logging (minimizing the CREATE DATABASE duration the impact on other backends). It will probably make the code significantly more complex, which is not exactly desirable, I guess. Also, if we keep the current code (even if only as a special case) it won't eliminate the existing replay hazards (which was one of the Tom's arguments for biting the bullet). I'm also thinking that for wal_level=archive and large databases, this won't really eliminate the checkpoint as it will likely generate enough WAL to hit checkpoint_segments and trigger a checkpoint anyway. No? That being said, our CREATE DATABASE docs currently say this Although it is possible to copy a database other than template1 by specifying its name as the template, this is not (yet) intended as a general-purpose COPY DATABASE facility. The principal limitation is that no other sessions can be connected to the template database while it is being copied. CREATE DATABASE will fail if any other connection exists when it starts; otherwise, new connections to the template database are locked out until CREATE DATABASE completes. See Section 21.3 for more information. I think that this limitation pretty much means no one should use CREATE DATABASE for cloning live databases in production environment (because of the locking). It also seems to me the general-purpose COPY DATABASE described in the docs is what we're describing in this thread. Notwithstanding what the docs say, I have seen CREATE DATABASE used plenty of times, and quite effectively, to clone databases. I don't think making it do twice the IO in the general case is going to go down well. 
cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
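For concreteness, the WAL-log-everything redesign Tom describes already exists in miniature: it is essentially what ALTER TABLE ... SET TABLESPACE does in copy_relation_data() in src/backend/commands/tablecmds.c. A rough sketch of the per-fork copy loop under that assumption (abridged: the src/dst smgr handles, fork iteration, use_wal setup, and error handling are elided, and this is not committed code):

char		buf[BLCKSZ];
Page		page = (Page) buf;
BlockNumber	nblocks = smgrnblocks(src, forkNum);
BlockNumber	blkno;

for (blkno = 0; blkno < nblocks; blkno++)
{
	smgrread(src, forkNum, blkno, buf);

	/* emit a full-page WAL record; skippable when wal_level = minimal */
	if (use_wal)
		log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false);

	/* skipFsync = true: durability comes from flushing WAL at commit */
	smgrextend(dst, forkNum, blkno, buf, true);
}

The skipFsync flag in the last call is where the trade-off lives: each page is written twice (its image in WAL plus the file itself), but nothing except the WAL needs an immediate fsync and no cluster-wide checkpoint is forced.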
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote: On 10/27/2014 05:58 PM, Tomas Vondra wrote: On 27.10.2014 17:24, Heikki Linnakangas wrote:

I'm also thinking that for wal_level=archive and large databases, this won't really eliminate the checkpoint, as it will likely generate enough WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?

That being said, our CREATE DATABASE docs currently say this:

Although it is possible to copy a database other than template1 by specifying its name as the template, this is not (yet) intended as a general-purpose COPY DATABASE facility. The principal limitation is that no other sessions can be connected to the template database while it is being copied. CREATE DATABASE will fail if any other connection exists when it starts; otherwise, new connections to the template database are locked out until CREATE DATABASE completes. See Section 21.3 for more information.

I think that this limitation pretty much means no one should use CREATE DATABASE for cloning live databases in a production environment (because of the locking). It also seems to me the general-purpose COPY DATABASE described in the docs is what we're describing in this thread.

Notwithstanding what the docs say, I have seen CREATE DATABASE used plenty of times, and quite effectively, to clone databases. I don't think making it do twice the IO in the general case is going to go down well.

I think they're actually more likely to be happy that we wouldn't need to do an immediate checkpoint anymore. The performance penalty from that is likely to be much more severe than the actual IO.

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions
I think the way this should work is that if you create a DIRALIAS, then the COPY command should refer to it by logical name, e.g.,

CREATE DIRALIAS dumpster AS '/tmp/trash';
COPY mytable TO dumpster;

If you squint a bit, this is the same as a tablespace. Maybe those two concepts could be combined.

On the other hand, we already have file_fdw, which does something very similar.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
On 10/27/2014 07:01 PM, Andres Freund wrote: On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote: On 10/27/2014 05:58 PM, Tomas Vondra wrote: On 27.10.2014 17:24, Heikki Linnakangas wrote:

I'm also thinking that for wal_level=archive and large databases, this won't really eliminate the checkpoint, as it will likely generate enough WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?

That being said, our CREATE DATABASE docs currently say this:

Although it is possible to copy a database other than template1 by specifying its name as the template, this is not (yet) intended as a general-purpose COPY DATABASE facility. The principal limitation is that no other sessions can be connected to the template database while it is being copied. CREATE DATABASE will fail if any other connection exists when it starts; otherwise, new connections to the template database are locked out until CREATE DATABASE completes. See Section 21.3 for more information.

I think that this limitation pretty much means no one should use CREATE DATABASE for cloning live databases in a production environment (because of the locking). It also seems to me the general-purpose COPY DATABASE described in the docs is what we're describing in this thread.

Notwithstanding what the docs say, I have seen CREATE DATABASE used plenty of times, and quite effectively, to clone databases. I don't think making it do twice the IO in the general case is going to go down well.

I think they're actually more likely to be happy that we wouldn't need to do an immediate checkpoint anymore. The performance penalty from that is likely to be much more severe than the actual IO.

At the very least that needs to be benchmarked.

cheers

andrew

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions
On 10/27/14 7:27 AM, Stephen Frost wrote: * Peter Eisentraut (pete...@gmx.net) wrote: On 10/16/14 12:01 PM, Stephen Frost wrote: This started out as a request for a non-superuser to be able to review the log files without needing access to the server.

I think that can be done with a security-definer function.

Of course it can be. We could replace the entire authorization system with security definer functions too.

I don't think that is correct. It's easy to do something with security definer functions if it's single purpose, with a single argument, like "load this file into this table, let these users do it". It's not easy to do it with functions if you have many parameters, like in a general SELECT statement. So I would like to see at least three wildly different use cases for this before believing that a security definer function isn't appropriate.

I don't view this as an argument against this feature, particularly as we know other systems have it, users have asked for it multiple times, and large PG deployments have had to hack around our lack of it.

What other systems have it? Do you have links to their documentation?

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Andres Freund and...@2ndquadrant.com writes: On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote: Notwithstanding what the docs say, I have seen CREATE DATABASE used plenty of times, and quite effectively, to clone databases. I don't think making it do twice the IO in the general case is going to go down well.

I think they're actually more likely to be happy that we wouldn't need to do an immediate checkpoint anymore. The performance penalty from that is likely to be much more severe than the actual IO.

Note that currently, CREATE DATABASE requires fsync'ing each file written into the new database. With the proposed new implementation, we'd write out that data to the kernel *but not have to fsync it*. Instead, we'd fsync just the WAL. At least on spinning rust, that could be a considerable win, for exactly the same reasons that we don't fsync anything but WAL for ordinary transaction commits (ie, way fewer seeks). Maybe not by enough to counteract doubling the write volume, but I think we'd need some benchmarks before concluding that it's completely horrid.

regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
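To make the contrast concrete, the two durability paths side by side (real backend function names, but arguments abridged and the pairing is an illustration, not committed code):

/* Today: createdb() copies the template directory, and every file it
 * writes gets its own fsync - many small, scattered flushes. */
copydir(srcpath, dstpath, false);

/* Proposed: data pages reach the kernel unsynced; one mostly-sequential
 * flush of the WAL at commit provides durability, exactly as for
 * ordinary transaction commits. */
XLogFlush(XactLastRecEnd);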
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On 27 October 2014 20:24, Peter Geoghegan p...@heroku.com wrote: On Mon, Oct 27, 2014 at 1:22 PM, Robert Haas robertmh...@gmail.com wrote: On Mon, Oct 27, 2014 at 1:44 PM, Peter Geoghegan p...@heroku.com wrote: I think by far the biggest problem here is the lack of attention to the design from others. I find that attitude incredible. What I mean is that that's the thing that clearly needs scrutiny the most. That isn't an attitude - it's a technical opinion. Let's see if we can link these two thoughts. 1. You think the biggest problem is the lack of attention to the design. 2. I keep asking you to put the docs in a readable form. If you can't understand the link between those two things, I am at a loss. Good luck with the patch. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions
* Peter Eisentraut (pete...@gmx.net) wrote: On 10/27/14 7:27 AM, Stephen Frost wrote: * Peter Eisentraut (pete...@gmx.net) wrote: On 10/16/14 12:01 PM, Stephen Frost wrote: This started out as a request for a non-superuser to be able to review the log files without needing access to the server.

I think that can be done with a security-definer function.

Of course it can be. We could replace the entire authorization system with security definer functions too.

I don't think that is correct.

Of course it is - you simply have to move all the logic into the function.

It's easy to do something with security definer functions if it's single purpose, with a single argument, like "load this file into this table, let these users do it".

The files won't be consistently named and there may be cases to make ad-hoc runs or test runs. No, it isn't as simple as always being a single, specific filename, and when you consider that there needs to be intelligence about the actual path being specified and making sure that there can't be '..' and similar, it gets to be a pretty ugly situation to make our users have to deal with.

It's not easy to do it with functions if you have many parameters, like in a general SELECT statement.

You could define SRFs for every table.

So I would like to see at least three wildly different use cases for this before believing that a security definer function isn't appropriate.

I'm not following this - there are probably 100s of use-cases for this, but they're all variations on 'read and/or write data server-side instead of through a front-end connection', which is what the purpose of the feature is.. I do see this as being useful for COPY, Large Object, and the file_fdw...

I don't view this as an argument against this feature, particularly as we know other systems have it, users have asked for it multiple times, and large PG deployments have had to hack around our lack of it.

What other systems have it? Do you have links to their documentation?

MySQL: http://dev.mysql.com/doc/refman/5.1/en/privileges-provided.html#priv_file (note they provide a way to limit access also, via secure_file_priv)

Oracle: http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_5007.htm http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_9013.htm#i2125999

SQL Server: http://msdn.microsoft.com/en-us/library/ms175915.aspx (Note: they can actually run as the user connected instead of the SQL DB server, if Windows authentication is used, which is basically doing Kerberos proxying unless I'm mistaken; it's unclear how the security is maintained if it's a SQL server logon user..)

DB2: http://www-01.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.admin.dm.doc/doc/c0004589.html?cp=SSEPGG_9.7.0

etc...

Thanks, Stephen

signature.asc Description: Digital signature
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
On 28.10.2014 00:06, Andrew Dunstan wrote: On 10/27/2014 07:01 PM, Andres Freund wrote: On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote: On 10/27/2014 05:58 PM, Tomas Vondra wrote: On 27.10.2014 17:24, Heikki Linnakangas wrote:

I'm also thinking that for wal_level=archive and large databases, this won't really eliminate the checkpoint, as it will likely generate enough WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?

That being said, our CREATE DATABASE docs currently say this:

Although it is possible to copy a database other than template1 by specifying its name as the template, this is not (yet) intended as a general-purpose COPY DATABASE facility. The principal limitation is that no other sessions can be connected to the template database while it is being copied. CREATE DATABASE will fail if any other connection exists when it starts; otherwise, new connections to the template database are locked out until CREATE DATABASE completes. See Section 21.3 for more information.

I think that this limitation pretty much means no one should use CREATE DATABASE for cloning live databases in a production environment (because of the locking). It also seems to me the general-purpose COPY DATABASE described in the docs is what we're describing in this thread.

Notwithstanding what the docs say, I have seen CREATE DATABASE used plenty of times, and quite effectively, to clone databases. I don't think making it do twice the IO in the general case is going to go down well.

I think they're actually more likely to be happy that we wouldn't need to do an immediate checkpoint anymore. The performance penalty from that is likely to be much more severe than the actual IO.

At the very least that needs to be benchmarked.

The question is what workload we are going to benchmark. It's unlikely that one of the approaches will be a clear winner in all cases, so we'll have to decide which ones are more common / important, or somehow combine both approaches (and thus not get some of the WAL-only benefits). I'm pretty sure we'll see about three main cases:

(1) read-mostly workloads

Current approach wins, because the checkpoint is cheap and the WAL-based approach results in 2x the I/O. The difference is proportional to template database size. For small databases it's negligible; for large databases it's more significant.

(2) write-heavy workloads / small template database

WAL-based approach wins, because it does not require an explicit checkpoint, and for small databases the I/O generated by WAL-logging everything is lower than a checkpoint (which is more random). This is the case for our PostgreSQL clusters.

(3) write-heavy workloads / large template database

Current approach wins, for two reasons: (a) for large databases the WAL-logging overhead may generate much more I/O than a checkpoint, and (b) it may generate so many WAL segments it eventually triggers a checkpoint anyway (even repeatedly).

The exact boundary between the cases really depends on multiple things:

(a) shared_buffers size (the larger, the more expensive the checkpoint)
(b) read-write activity (more writes = more expensive checkpoint)
(c) hardware (especially how well it handles random I/O)

Not sure how to decide which case is more important, and I agree that there are people using CREATE DATABASE to clone databases - maybe not in production, but e.g. for testing purposes (still, it'd be rather unfortunate to make it considerably slower for them). Not sure how to balance this :-/ So maybe we shouldn't cling to the WAL-logging approach too much.
Maybe Heikki's idea from upthread would work: abandon the full checkpoint and instead assume that once the transaction commits, all the files were fsynced OK. Of course, this will do nothing about the replay hazards.

regards

Tomas

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
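Rough numbers behind case (3) above, assuming 8 kB pages and the default checkpoint_segments = 3 (roughly 48 MB of WAL written between triggered checkpoints): WAL-logging a 10 GB template database emits on the order of 10 GB of full-page records, enough to trip checkpoint_segments a couple of hundred times during the copy. So for large templates the WAL-based flavor would trade one forced checkpoint for many triggered ones, which supports not clinging to it as the only mode.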
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
On Mon, Oct 27, 2014 at 4:34 PM, Simon Riggs si...@2ndquadrant.com wrote: Let's see if we can link these two thoughts. 1. You think the biggest problem is the lack of attention to the design. 2. I keep asking you to put the docs in a readable form. If you can't understand the link between those two things, I am at a loss.

You've read the docs. Please be clearer. In what sense are they not readable? The main description of the feature appears on the INSERT reference page: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/sql-insert.html

These paragraphs were added (there is more on the INSERT reference page than mentioned here, but this is the main part):

The optional ON CONFLICT clause specifies a path to take as an alternative to raising a uniqueness violation error. ON CONFLICT IGNORE simply avoids inserting any individual row when it is determined that a uniqueness violation error would otherwise need to be raised. ON CONFLICT UPDATE has the system take an UPDATE path in respect of such rows instead. ON CONFLICT UPDATE guarantees an atomic INSERT or UPDATE outcome. While rows may be updated, the top-level statement is still an INSERT, which is significant for the purposes of statement-level triggers and the rules system. Note that in the event of an ON CONFLICT path being taken, RETURNING returns no value in respect of any not-inserted rows.

ON CONFLICT UPDATE optionally accepts a WHERE clause condition. When provided, the statement only proceeds with updating if the condition is satisfied. Otherwise, unlike a conventional UPDATE, the row is still locked for update. Note that the condition is evaluated last, after a conflict has been identified as a candidate to update.

ON CONFLICT UPDATE accepts EXCLUDED expressions in both its targetlist and WHERE clause. This allows expressions (in particular, assignments) to reference rows originally proposed for insertion. Note that the effects of all per-row BEFORE INSERT triggers are carried forward. This is particularly useful for multi-insert ON CONFLICT UPDATE statements; when inserting or updating multiple rows, constants need only appear once.

There are several restrictions on the ON CONFLICT UPDATE clause that do not apply to UPDATE statements. Subqueries may not appear in either the UPDATE targetlist, nor its WHERE clause (although simple multi-assignment expressions are supported). WHERE CURRENT OF cannot be used. In general, only columns in the target table, and excluded values originally proposed for insertion may be referenced. Operators and functions may be used freely, though.

INSERT with an ON CONFLICT UPDATE clause is a deterministic statement. You cannot UPDATE any single row more than once. Writing such an INSERT statement will raise a cardinality violation error. Rows proposed for insertion should not duplicate each other in terms of attributes constrained by the conflict-arbitrating unique index. Note that the ordinary rules for unique indexes with regard to NULL apply analogously to whether or not an arbitrating unique index indicates if the alternative path should be taken, so multiple NULL values across a set of rows proposed for insertion are acceptable; the statement will always insert (assuming there is no unrelated error). Note that merely locking a row (by having it not satisfy the WHERE clause condition) does not count towards whether or not the row has been affected multiple times (and whether or not a cardinality violation error is raised).
ON CONFLICT UPDATE requires a column specification, which is used to infer an index to limit pre-checking for duplicates to. ON CONFLICT IGNORE makes this optional. Failure to anticipate and prevent would-be uniqueness violations originating in some other unique index than the single unique index that was anticipated as the sole source of would-be uniqueness violations can result in failing to insert a row (taking the IGNORE path) when a uniqueness violation may have actually been appropriate; omitting the specification indicates a total indifference to where any would-be uniqueness violation could occur. Note that the ON CONFLICT UPDATE assignment may result in a uniqueness violation, just as with a conventional UPDATE.

The rules for unique index inference are straightforward. Columns and/or expressions specified must match all the columns/expressions of some existing unique index on table_name. The order of the columns/expressions in the index definition, or whether or not the index definition specified NULLS FIRST or NULLS LAST, or the internal sort order of each column (whether DESC or ASC were specified) are all irrelevant. However, partial unique indexes are not supported as arbiters of whether an alternative ON CONFLICT path should be taken. Deferred unique constraints are also not accepted.

You must have INSERT privilege on a table in order to insert into it, as well as UPDATE privilege if and only if ON CONFLICT UPDATE is specified. If a
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Tomas Vondra t...@fuzzy.cz writes: So maybe we shouldn't cling to the WAL-logging approach too much. Maybe Heikki's idea from upthread would work: abandon the full checkpoint and instead assume that once the transaction commits, all the files were fsynced OK. Of course, this will do nothing about the replay hazards.

Well, I'm not insisting on any particular method of getting there, but if we're going to touch this area at all then I think "fix the replay hazards" should be a non-negotiable requirement. We'd never have accepted such hazards if CREATE DATABASE were being introduced for the first time; it's only like this because nobody felt like rewriting a Berkeley-era kluge.

regards, tom lane

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Function array_agg(array)
Super! I tested the last version and I have no objections.

1. We want to have this feature - it has been on our ToDo list for a long time.
2. The proposal and design of multidimensional aggregation are clean and nobody has objections here.
3. There is zero impact on the current implementation. For performance reasons it uses a new array-optimized aggregator - 30% faster for this purpose than the current (scalar) array aggregator.
4. The code is clean and respects PostgreSQL coding rules.
5. There are documentation changes and the necessary regression tests.
6. Patching and compilation are clean, without warnings.

This patch is ready for commit.

Thank you for the patch.

Thank you for the thorough review process.

Regards,
-- Ali Akbar
Re: [HACKERS] alter user/role CURRENT_USER
On Fri, Oct 24, 2014 at 11:29 AM, Kyotaro HORIGUCHI horiguchi.kyot...@lab.ntt.co.jp wrote: - 0001-ALTER-ROLE-CURRENT_USER_v2.patch - the patch.

+RoleId:	CURRENT_USER	{ $$ = "current_user"; }
+	| USER		{ $$ = "current_user"; }
+	| CURRENT_ROLE	{ $$ = "current_user"; }
+	| SESSION_USER	{ $$ = "session_user"; }

This is kind of ugly, and it means you can't distinguish between a CURRENT_USER keyword and a quoted user name "current_user". It's a legitimate user name, so the behavior of the following now changes:

CREATE ROLE "current_user";
ALTER ROLE "current_user" SET work_mem = '10MB';

There ought to be a better way to represent this than using magic string values.

It accepts CURRENT_USER/USER/CURRENT_ROLE/SESSION_USER.

On a stylistic note, do we really want to support the alternative spellings of CURRENT_USER, like CURRENT_ROLE and USER? The SQL standard is well-hated for inventing more keywords than necessary. There is no precedent for using/allowing these aliases in PostgreSQL DDL commands, and ALTER USER/ROLE aren't SQL standard anyway. So maybe we should stick with just accepting one canonical syntax instead. While the word USER may reasonably arise from an editing mistake, ALTER USER USER does not seem like sane syntax that should be accepted.

From your original email: - Modify syntax of ALTER USER so as to be an alias of ALTER ROLE. - Added ALL syntax as user name to ALTER USER.

But should ALTER USER ALL and ALTER ROLE ALL really do the same thing? A user is a role with the LOGIN option. Every user is a role, but not every role is a user. I suspect that ambiguity was why ALTER USER ALL wasn't originally implemented.

Regards, Marti

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
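One possible shape for the "better way" Marti asks for, sketched as a parse node instead of a magic string (the names here are illustrative, not from the patch under review):

/*
 * Sketch only: carry the kind of role reference through the parser as
 * a node, so CURRENT_USER-the-keyword and "current_user"-the-name stay
 * distinguishable until execution time, when the keyword forms are
 * resolved via GetUserId() / GetSessionUserId().
 */
typedef enum RoleSpecType
{
	ROLESPEC_CSTRING,		/* role name given as a string */
	ROLESPEC_CURRENT_USER,	/* CURRENT_USER / USER / CURRENT_ROLE */
	ROLESPEC_SESSION_USER	/* SESSION_USER */
} RoleSpecType;

typedef struct RoleSpec
{
	NodeTag		type;
	RoleSpecType roletype;	/* what kind of reference this is */
	char	   *rolename;	/* filled in only for ROLESPEC_CSTRING */
} RoleSpec;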
Re: [HACKERS] Reducing lock strength of adding foreign keys
On Mon, Oct 27, 2014 at 08:24:15AM -0400, Robert Haas wrote: On Sat, Oct 25, 2014 at 2:00 PM, Noah Misch n...@leadboat.com wrote: http://www.postgresql.org/message-id/ca+tgmoy4glsxzk0tao29-ljtcuj0sl1xwcwq51xb-hfysgi...@mail.gmail.com http://www.postgresql.org/message-id/20893.1393892...@sss.pgh.pa.us http://www.postgresql.org/message-id/20140306224340.ga3551...@tornado.leadboat.com

As far as triggers are concerned, the issue of skew between the transaction snapshot and what the ruleutils.c snapshots do seems to be the principal issue. Commit e5550d5fec66aa74caad1f79b79826ec64898688 changed pg_get_constraintdef() to use an MVCC snapshot rather than a current MVCC snapshot; if that change is safe, I am not aware of any reason why we couldn't change pg_get_triggerdef() similarly.

pg_get_triggerdef() is fine as-is with concurrent CREATE TRIGGER. The pg_get_constraintdef() change arose to ensure a consistent result when concurrent ALTER TABLE VALIDATE CONSTRAINT mutates a constraint definition. (Reducing the lock level of DROP TRIGGER or ALTER TRIGGER, however, would create the analogous problem for pg_get_triggerdef().)

Maybe so, but I'd favor changing it anyway and getting it over with. The current situation seems to have little to recommend it; moreover, it would be nice, if it's possible and safe, to weaken the lock levels for all three of those commands at the same time. Do you see any hazards for ALTER or DROP that do not exist for CREATE?

ALTER TRIGGER is not bad; like you say, change pg_get_triggerdef_worker() the way commit e5550d5 changed pg_get_constraintdef_worker(). DROP TRIGGER is more difficult. pg_trigger.tgqual of a dropped trigger may reference other dropped objects, which calls for equipping get_rule_expr() to use the transaction snapshot. That implicates quite a bit of ruleutils.c code.

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
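For reference, changing pg_get_triggerdef_worker() the way commit e5550d5 changed pg_get_constraintdef_worker() would look roughly like this sketch (abridged: the tuple-validity check and the deparsing itself are elided):

/* Fetch the pg_trigger row under the transaction snapshot instead of
 * a current catalog snapshot, so a concurrent ALTER TRIGGER cannot
 * mutate the definition mid-deparse. */
Snapshot	snapshot = RegisterSnapshot(GetTransactionSnapshot());
Relation	tgrel = heap_open(TriggerRelationId, AccessShareLock);
ScanKeyData	skey[1];
SysScanDesc	tgscan;
HeapTuple	ht_trig;

ScanKeyInit(&skey[0],
			ObjectIdAttributeNumber,
			BTEqualStrategyNumber, F_OIDEQ,
			ObjectIdGetDatum(trigid));

tgscan = systable_beginscan(tgrel, TriggerOidIndexId, true,
							snapshot, 1, skey);
ht_trig = systable_getnext(tgscan);

/* ... deparse from ht_trig as today ... */

systable_endscan(tgscan);
heap_close(tgrel, AccessShareLock);
UnregisterSnapshot(snapshot);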
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Tomas Vondra wrote: I mean, when we use database A as a template, why do we need to checkpoint B, C, D and F too? (Apologies if this is somehow obvious, I'm way out of my comfort zone in this part of the code.)

IIUC, you have to checkpoint the whole cluster because it is not possible to checkpoint individual databases. There is only a single WAL stream, and while it could have source database markers, I don't believe it does, so there is no way to have separate checkpoint locations recorded for different databases. Adding such seems to introduce a lot of book-keeping and both replay and file size overhead for little practical gain.

David J.

-- View this message in context: http://postgresql.1045698.n5.nabble.com/proposal-CREATE-DATABASE-vs-partial-CHECKPOINT-tp5824343p5824522.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] INSERT ... ON CONFLICT {UPDATE | IGNORE}
---
 doc/src/sgml/ddl.sgml                 |  23 +++
 doc/src/sgml/indices.sgml             |  11 +-
 doc/src/sgml/mvcc.sgml                |  43 --
 doc/src/sgml/plpgsql.sgml             |  20 ++-
 doc/src/sgml/postgres-fdw.sgml        |   8 ++
 doc/src/sgml/ref/create_index.sgml    |   7 +-
 doc/src/sgml/ref/create_rule.sgml     |   6 +-
 doc/src/sgml/ref/create_table.sgml    |   5 +-
 doc/src/sgml/ref/create_trigger.sgml  |   5 +-
 doc/src/sgml/ref/create_view.sgml     |  36 -
 doc/src/sgml/ref/insert.sgml          | 258 --
 doc/src/sgml/ref/set_constraints.sgml |   6 +-
 doc/src/sgml/trigger.sgml             |  46 --
 13 files changed, 426 insertions(+), 48 deletions(-)

#1 internal documentation stats:

 doc/src/sgml/indexam.sgml        | 133 ---
 src/backend/access/nbtree/README |  90 +-
 src/backend/executor/README      |  35 +++
 3 files changed, 247 insertions(+), 11 deletions(-)

#2 internal documentation stats:

---
 src/backend/executor/README | 49 +
 1 file changed, 49 insertions(+)

Just to put that in context, here are the documentation changes from the original LATERAL commit:

 doc/src/sgml/keywords.sgml   |   2 +-
 doc/src/sgml/queries.sgml    |  83 +-
 doc/src/sgml/ref/select.sgml | 102 +---

Commit 0ef0b30 added data-modifying CTE docs (docs only). That looks like:

 doc/src/sgml/queries.sgml    | 177 +---
 doc/src/sgml/ref/select.sgml |  49 ++---
 2 files changed, 214 insertions(+), 12 deletions(-)

Have I not provided a total of 4 isolation tests illustrating interesting concurrency/visibility interactions? That's a lot of isolation tests. Here is the tests commit stat:

 31 files changed, 1159 insertions(+), 8 deletions(-)

And to put the tests in context, here are the stats from the original Hot Standby commit:

 src/test/regress/expected/hs_standby_allowed.out    | 215 ++
 src/test/regress/expected/hs_standby_check.out      |  20 +++
 src/test/regress/expected/hs_standby_disallowed.out | 137 +
 src/test/regress/expected/hs_standby_functions.out  |  40 +
 src/test/regress/pg_regress.c                       |  33 ++--
 src/test/regress/sql/hs_primary_extremes.sql        |  74 +
 src/test/regress/sql/hs_primary_setup.sql           |  25 +++
 src/test/regress/sql/hs_standby_allowed.sql         | 121 +++
 src/test/regress/sql/hs_standby_check.sql           |  16 ++
 src/test/regress/sql/hs_standby_disallowed.sql      | 105 +
 src/test/regress/sql/hs_standby_functions.sql       |  24 +++
 src/test/regress/standby_schedule                   |  21 +++

So (at least as measured by raw lines of tests), this feature is better tested than the original Hot Standby commit, and by a wide margin. Tests also serve as an explanatory tool for the feature (in particular, isolation tests can be used in this way).

-- Peter Geoghegan

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing the cost of sinval messaging
Robert Haas robertmh...@gmail.com writes: On Mon, Oct 27, 2014 at 4:27 PM, Tom Lane t...@sss.pgh.pa.us wrote: Neither of those messages seem to me to bear on this point. AFAICS, the unlocked hasMessages test has a race condition, which the comment just above it argues isn't a problem in practice. What I propose above would have an absolutely indistinguishable race condition, which again wouldn't be a problem in practice if you believe the comment's argument.

Yes, the read from hasMessages has a race condition, which is in fact not a problem in practice if you believe the comment's argument. However, your proposed formulation has a second race condition, which is discussed in those emails, and which is precisely why we rejected the design you're now proposing three years ago. That second race condition is that the read from stateP->nextMsgNum doesn't have to happen at the same time as the read from stateP->resetState. No CPU or compiler barrier separates them, so either can happen before the other. SICleanupQueue updates both values, so you have the problem first described precisely by Noah:

# (In theory, you could
# hit a case where the load of resetState gives an ancient false just as the
# counters wrap to match.)

While it practically seems very unlikely that this would ever really happen, some guy named Tom Lane told us we should worry about it, so I did.

That argument is nonsense. I complained about a lack of close analysis, but with close analysis I think this is perfectly safe; or at least no less safe than what's there now, with its not terribly bulletproof assumption that a suitable memory barrier has happened recently.

You'll grant that reads and writes of the integer msgNum variables are atomic, I trust (if not, we've got lots of bugs elsewhere). So what we have to worry about does not include totally-corrupted values, but only stale values, including possibly out-of-order reads. Ignoring the possibility of MSGNUMWRAPAROUND wraparound for a moment, only our own process ever changes nextMsgNum, so we should certainly see a current value of that. So the argument for a hazard is that maxMsgNum could have been advanced 2^32 times since our process last read messages, and then we come along and see that value, but we *don't* see our resetState flag set even though that happened 2^32 - MAXNUMMESSAGES messages ago, and was certainly flushed into shared memory shortly thereafter. That's nonsense. It's especially nonsense if you assume, as the existing code already does, that the reading process recently executed a read barrier.

Now, as to what happens when MSGNUMWRAPAROUND wraparound occurs: that code decrements all msgNum variables whether they belong to hopelessly-behind backends or not. That's why the time until a possible chance match is 2^32 messages and not something else. However, since the proposed new unlocked test might come along and see a partially-complete wraparound update, it could see either nextMsgNum or maxMsgNum offset by MSGNUMWRAPAROUND compared to the other. What would most likely happen is that this would make for a false decision that they're unequal, resulting in a useless trip through the locked part of SIGetDataEntries, which is harmless.
If our backend had fallen behind by exactly 2^32-MSGNUMWRAPAROUND messages or 2^32+MSGNUMWRAPAROUND messages, we could make a false conclusion that nextMsgNum and maxMsgNum are equal --- but, again, our resetState must have become set a billion or so sinval messages back, and it's just not credible that we're not going to see that flag set. So I still say that the current code is wasting a lot of cycles for no actual gain in safety.

Personally, I think trying to further optimize this is most likely a waste of time. I'm quite sure that there's a whole lot of stuff in the sinval code that could be further optimized, but it executes so infrequently that I believe it doesn't matter.

Maybe not today, but it's not hard to imagine usage patterns where this would result in O(N^2) total work in the sinval code.

If you've got some evidence that this is a real problem as opposed to a hypothetical one, we could certainly reopen the debate about whether ignoring the possibility that the loads from stateP->nextMsgNum and stateP->resetState will be widely-enough spaced in time for a quarter-million messages to be queued up in the meantime is sufficiently unlikely to ignore.

It's a billion messages, not a quarter million...

However, absent evidence of a tangible performance problem, I'd be quite disinclined to throw the guaranteed safety of the current approach out the window.

I fail to find any guaranteed safety in the current approach. It's dependent on the assumption of a recently-executed read barrier in the reading backend, which assumption is sufficient for what I'm proposing as well.

Another thought here is that the whole thing becomes moot if we stick a read barrier into the unlocked test. We
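Concretely, the read-barrier variant of the unlocked test would look something like this sketch (SIGetDataEntries() names from sinvaladt.c; the barrier placement is the proposal under discussion, not committed code):

/* Proposed replacement for the current hasMessages fast path. */
pg_read_barrier();		/* pairs with the writer's lock release */

if (stateP->nextMsgNum == segP->maxMsgNum &&
	!stateP->resetState)
	return 0;			/* caught up: skip taking SInvalReadLock */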
Re: [HACKERS] proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Tom Lane wrote: Tomas Vondra writes: So maybe we shouldn't cling to the WAL-logging approach too much. Maybe Heikki's idea from upthread would work: abandon the full checkpoint and instead assume that once the transaction commits, all the files were fsynced OK. Of course, this will do nothing about the replay hazards.

Well, I'm not insisting on any particular method of getting there, but if we're going to touch this area at all then I think "fix the replay hazards" should be a non-negotiable requirement. We'd never have accepted such hazards if CREATE DATABASE were being introduced for the first time; it's only like this because nobody felt like rewriting a Berkeley-era kluge.

Maybe adding a ToDo item, "Fix replay hazards in CREATE DATABASE", and listing the hazards explicitly would be a good start. Not sure if WAL or CREATE would be more appropriate, but WAL seems like a better fit.

To the topic at hand: would CREATE DATABASE name WITH LOGGED = true work? As with UNLOGGED tables, giving the user the choice of WAL/fsync/checkpoint behavior seems reasonable. As Tomas said, there are a couple of diametrically opposed use-cases here, and it seems like we've already implemented the more difficult one.

Addressing the replay hazards in the existing implementation could be considered separately. What I'm not sure of is whether the logging version in fact already addresses them and, in the interest of safe and sane defaults, whether we should implement it and make it the default, having the user specify logged = false to get the old behavior, with warnings emitted as to the replay hazards that could occur should the database crash during the create. Is it possible to emit a warning before the command completes (or, rather, begins)?

David J.

-- View this message in context: http://postgresql.1045698.n5.nabble.com/proposal-CREATE-DATABASE-vs-partial-CHECKPOINT-tp5824343p5824526.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Directory/File Access Permissions for COPY and Generic File Access Functions
* Peter Eisentraut (pete...@gmx.net) wrote: I think the way this should work is that if you create a DIRALIAS, then the COPY command should refer to it by logical name, e.g., CREATE DIRALIAS dumpster AS '/tmp/trash'; COPY mytable TO dumpster; You'd have to be able to specify the filename also. I'm not against the idea of using the 'diralias' alias name this way, just saying it isn't quite as simple as the above. If you squint a bit, this is the same as a tablespace. Maybe those two concepts could be combined. CREATE TABLESPACE is something else which could be supported with diralias, though it'd have to be an independently grantable capability and it'd be a bad idea to let a user create tablespaces in a directory and then also be able to copy from/to files there (backend crashes, etc). This exact capability is more-or-less what RDS has had to hack on to PG for their environment, as I understand it, in case you're looking for a use-case. On the other hand, we already have file_fdw, which does something very similar. It's really not at all the same.. Perhaps we'll get there some day, but we're a very long way away from file_fdw having the ability to replace normal tablespaces... Thanks! Stephen signature.asc Description: Digital signature
Re: [HACKERS] alter user/role CURRENT_USER
Marti Raudsepp wrote: On Fri, Oct 24, 2014 at 11:29 AM, Kyotaro HORIGUCHI wrote:

But should ALTER USER ALL and ALTER ROLE ALL really do the same thing? A user is a role with the LOGIN option. Every user is a role, but not every role is a user. I suspect that ambiguity was why ALTER USER ALL wasn't originally implemented.

There is no system-level distinction here - only a semantic one, and only during the issuance of a CREATE without specifying an explicit value for LOGIN/NOLOGIN. The documentation states user and group are aliases for ROLE, not subsets that require or disallow login privileges.

I am OK with not making them true aliases in order to minimize user confusion w.r.t. the semantics implied by user and group, but then I'd submit we simply note those two forms as deprecated in favor of role, and that all new role-based functionality will only be attached to role-based commands.

Since all of user, current_user and session_user have special syntactic consideration in SQL [1], I'd be generally in favor of trying to keep that dynamic intact while implementing this feature. And for the same reason I would not allow current_role as a keyword. We didn't add a current_role function but instead chose to rely on the standard keywords even when we introduced ROLE.

I'm not clear on the keyword confusion since, while "current_user" is a valid literal, it requires quotation, while the token CURRENT_USER does not.

1. http://www.postgresql.org/docs/9.4/static/functions-info.html

David J.

-- View this message in context: http://postgresql.1045698.n5.nabble.com/alter-user-role-CURRENT-USER-tp5822520p5824528.html Sent from the PostgreSQL - hackers mailing list archive at Nabble.com. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reducing the cost of sinval messaging
On Mon, Oct 27, 2014 at 8:54 PM, Tom Lane t...@sss.pgh.pa.us wrote: That argument is nonsense. I complained about a lack of close analysis, but with close analysis I think this is perfectly safe; or at least no less safe than what's there now, with its not terribly bulletproof assumption that a suitable memory barrier has happened recently.

I don't think there's any reason to think that assumption is any less than clad in iron, if not tungsten. The entire sinval mechanism relies for its correctness on heavyweight locking: at transaction commit, we queue up sinval messages before releasing locks; and we check for sinval messages - at least - after acquiring locks. For this to be correct, the lock held when queueing up a sinval message must be strong enough to prevent backends who might need to see that message from acquiring a lock of their own. In that way, by the time a process manages to acquire a lock, it can be confident that the invalidation messages it needs to see have already been queued. Since a heavyweight lock acquisition or release is a memory barrier many times over, it's fair to say that by the time the process queueing the invalidation messages releases locks, the messages it wrote must be committed to memory. And conversely, when a process has just acquired a lock, any invalidations committed to memory by another backend will be visible to it by the time the lock acquisition has completed. The only way this isn't a sufficient interlock is if the lock acquired by the second backend doesn't conflict with the lock held by the first one. But if that's the case, the whole system is broken far beyond the ability of a memory barrier to add or detract.

You'll grant that reads and writes of the integer msgNum variables are atomic, I trust (if not, we've got lots of bugs elsewhere). So what we have to worry about does not include totally-corrupted values, but only stale values, including possibly out-of-order reads.

Check and check.

Ignoring the possibility of MSGNUMWRAPAROUND wraparound for a moment, only our own process ever changes nextMsgNum, so we should certainly see a current value of that.

OK...

So the argument for a hazard is that maxMsgNum could have been advanced 2^32 times since our process last read messages, and then we come along and see that value, but we *don't* see our resetState flag set even though that happened 2^32 - MAXNUMMESSAGES messages ago, and was certainly flushed into shared memory shortly thereafter. That's nonsense. It's especially nonsense if you assume, as the existing code already does, that the reading process recently executed a read barrier.

I'm not sure I follow this part.

Now, as to what happens when MSGNUMWRAPAROUND wraparound occurs: that code decrements all msgNum variables whether they belong to hopelessly-behind backends or not. That's why the time until a possible chance match is 2^32 messages and not something else. However, since the proposed new unlocked test might come along and see a partially-complete wraparound update, it could see either nextMsgNum or maxMsgNum offset by MSGNUMWRAPAROUND compared to the other.

I definitely agree with that last sentence, which I think is the key point, but I don't understand how 2^32 comes into it. It seems to me that a chance match is a possibility every MSGNUMWRAPAROUND messages, precisely because one might be offset by that amount compared to the other.
If our backend had fallen behind by exactly 2^32-MSGNUMWRAPAROUND messages or 2^32+MSGNUMWRAPAROUND messages, we could make a false conclusion that nextMsgNum and maxMsgNum are equal --- but, again, our resetState must have become set a billion or so sinval messages back, and it's just not credible that we're not going to see that flag set.

I'd like a more precise argument than "not credible". How many sinval messages back does it have to be before we consider it safe? Why that number and not twice as many or half as many?

So I still say that the current code is wasting a lot of cycles for no actual gain in safety.

Personally, I think trying to further optimize this is most likely a waste of time. I'm quite sure that there's a whole lot of stuff in the sinval code that could be further optimized, but it executes so infrequently that I believe it doesn't matter.

Maybe not today, but it's not hard to imagine usage patterns where this would result in O(N^2) total work in the sinval code.

To all of this I say: show me the profile or usage pattern where this actually matters. We worked hard when this patch was in development and under review to find a usage pattern where this was a significant impact and were unable to do so.

Another thought here is that the whole thing becomes moot if we stick a read barrier into the unlocked test. We did not have that option during the last go-round with this code, but I'm inclined to think we ought to revisit sinvaladt.c with that tool in our belts anyway;
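For reference, the constants behind the "billion or so" figure, as defined in src/backend/storage/ipc/sinvaladt.c, plus the offset scenario at issue (the scenario values are illustrative only):

#define MAXNUMMESSAGES 4096
#define MSGNUMWRAPAROUND (MAXNUMMESSAGES * 262144)	/* = 2^30 */

/*
 * SICleanupQueue()'s wraparound pass subtracts MSGNUMWRAPAROUND from
 * every msgNum.  An unlocked reader racing that pass can observe one
 * counter already decremented and the other not, i.e. the two loads
 * may differ by 2^30.  Usually that false inequality just costs one
 * harmless trip through the locked path; the debated hazard is a
 * stale pair that happens to compare equal by accident.
 */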