Re: [HACKERS] GUC for cleanup indexes threshold.

2017-03-22 Thread Masahiko Sawada
On Wed, Mar 15, 2017 at 7:51 AM, Peter Geoghegan  wrote:
> On Tue, Mar 14, 2017 at 3:10 PM, Peter Geoghegan  wrote:
>> We already have BTPageOpaqueData.btpo, a union whose contained type
>> varies based on the page being dead. We could just do the same with
>> some other field in that struct, and then store epoch there. Clearly
>> nobody really cares about most data that remains on the page. Index
>> scans just need to be able to land on it to determine that it's dead,
>> and VACUUM needs to be able to determine whether or not there could
>> possibly be such an index scan at the time it considers recycling.
>
> ISTM that we need all of the fields within BTPageOpaqueData even for
> dead pages, actually. The left links and right links still need to be
> sane, and the flag bits are needed. Plus, the field that stores an XID
> already is clearly necessary. Even if they weren't needed, it would
> probably still be a good idea to keep them around for forensic
> purposes. However, the page header field pd_prune_xid is currently
> unused for indexes, and is the same width as CheckPoint.nextXidEpoch
> (the extra thing we might want to store -- the epoch).
>
> Maybe you could store the epoch within that field when B-Tree VACUUM
> deletes a page, and then compare that within _bt_page_recyclable(). It
> would come before the existing XID comparison in that function. One
> nice thing about this idea is that pd_prune_xid will be all-zero for
> index pages from the current format, so there is no need to take
> special care to make sure that databases that have undergone
> pg_upgrade don't break.
>
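The suggested epoch-then-XID comparison could be sketched roughly as follows. This is a standalone toy, not actual nbtree code; the names `page_epoch`, `page_xact`, `cur_epoch`, and `oldest_xmin` are made up here, and the real code would use TransactionIdPrecedes() (modulo-2^32 arithmetic) rather than plain `<`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Toy model of the suggested check: a deleted page records both the XID
 * epoch (in the otherwise-unused pd_prune_xid, per the suggestion above)
 * and its deletion XID (btpo.xact).  The epoch comparison comes before
 * the XID comparison.  Pages written before the change carry epoch 0 and
 * fall through to the plain XID test, so pg_upgrade'd clusters keep
 * working without special care.
 */
static bool
page_recyclable(uint32_t page_epoch, TransactionId page_xact,
                uint32_t cur_epoch, TransactionId oldest_xmin)
{
    if (page_epoch != 0 && page_epoch != cur_epoch)
        return page_epoch < cur_epoch;  /* deleted in an older epoch */
    return page_xact < oldest_xmin;     /* existing recyclability test */
}
```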

Thank you for the suggestion!
If we store the epoch within that union field, I think we will not be
able to use BTPageOpaqueData.btpo.xact at the same time. Since comparing
btpo.xact is still necessary to determine whether a page is recyclable,
we cannot store the epoch in the same union field. And if we add a new
field to BTPageOpaqueData instead, it would break on-disk compatibility.
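The point about the union can be seen with a toy struct. This is a simplified sketch, not the real BTPageOpaqueData layout; the field names and widths are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Simplified sketch: btpo is a union, so btpo.xact (set when the page is
 * deleted) and anything else stored in the union occupy the same bytes.
 * Storing an epoch there would clobber the deletion XID that
 * _bt_page_recyclable() still needs to compare.
 */
typedef struct
{
    uint32_t btpo_prev;          /* placeholder left link */
    uint32_t btpo_next;          /* placeholder right link */
    union
    {
        uint32_t      level;     /* tree level, while the page is live */
        TransactionId xact;      /* deletion XID, once the page is dead */
    } btpo;
    uint16_t btpo_flags;
} ToyBTPageOpaque;
```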

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] dblink module printing unnamed connection (with commit acaf7ccb94)

2017-03-22 Thread Rushabh Lathia
Hi All,

The dblink contrib module has started showing "unnamed" as the connection name.

Consider the below test:

postgres=# CREATE ROLE alice NOSUPERUSER NOCREATEDB NOCREATEROLE LOGIN
PASSWORD 'wonderland';
CREATE ROLE
postgres=# GRANT EXECUTE ON FUNCTION dblink_connect_u(text,text) to alice;
GRANT
postgres=# \c postgres alice
You are now connected to database "postgres" as user "alice".
postgres=> SELECT dblink_connect_u('sm_conn_30','dbname=postgres user=alice
password=wonderland');
 dblink_connect_u
--
 OK
(1 row)

postgres=> SELECT * FROM dblink_send_query('sm_conn_30','SELECT
pg_stat_reset()') as dgr;
 dgr
-
   1
(1 row)

postgres=> SELECT * FROM dblink_get_result('sm_conn_30') AS dgr(curr_user
boolean);
ERROR:  permission denied for function pg_stat_reset
CONTEXT:  Error occurred on dblink connection named "*unnamed*": could not
execute query.

This started with below commit:

commit acaf7ccb94a3916ea83712671a3563f0eb595558
Author: Peter Eisentraut 
Date:   Sun Dec 25 12:00:00 2016 -0500

dblink: Replace some macros by static functions

Also remove some unused code and the no longer useful dblink.h file.

Reviewed-by: Tsunakawa, Takayuki 


Before this commit, the macro used to assign the conname local variable.
I quickly worked on a fix; the attached patch assigns the conname local
variable so that the error context shows the correct connection name.


Regards,
Rushabh Lathia
www.EnterpriseDB.com
diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index c1e9089..e42cec4 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -679,7 +679,7 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async)
 	PG_TRY();
 	{
 		char	   *sql = NULL;
-		char	   *conname = NULL;
+		char	   *conname = text_to_cstring(PG_GETARG_TEXT_PP(0));
 		bool		fail = true;	/* default to backward compatible */
 
 		if (!is_async)
@@ -687,7 +687,7 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async)
 			if (PG_NARGS() == 3)
 			{
 /* text,text,bool */
-dblink_get_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)), &conn, &conname, &freeconn);
+dblink_get_conn(conname, &conn, &conname, &freeconn);
 sql = text_to_cstring(PG_GETARG_TEXT_PP(1));
 fail = PG_GETARG_BOOL(2);
 			}
@@ -702,7 +702,7 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async)
 }
 else
 {
-	dblink_get_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)), &conn, &conname, &freeconn);
+	dblink_get_conn(conname, &conn, &conname, &freeconn);
 	sql = text_to_cstring(PG_GETARG_TEXT_PP(1));
 }
 			}
@@ -722,13 +722,13 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async)
 			if (PG_NARGS() == 2)
 			{
 /* text,bool */
-conn = dblink_get_named_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)));
+conn = dblink_get_named_conn(conname);
 fail = PG_GETARG_BOOL(1);
 			}
 			else if (PG_NARGS() == 1)
 			{
 /* text */
-conn = dblink_get_named_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)));
+conn = dblink_get_named_conn(conname);
 			}
 			else
 /* shouldn't happen */
@@ -1390,7 +1390,8 @@ dblink_exec(PG_FUNCTION_ARGS)
 		if (PG_NARGS() == 3)
 		{
 			/* must be text,text,bool */
-			dblink_get_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)), &conn, &conname, &freeconn);
+			conname = text_to_cstring(PG_GETARG_TEXT_PP(0));
+			dblink_get_conn(conname, &conn, &conname, &freeconn);
 			sql = text_to_cstring(PG_GETARG_TEXT_PP(1));
 			fail = PG_GETARG_BOOL(2);
 		}
@@ -1405,7 +1406,8 @@ dblink_exec(PG_FUNCTION_ARGS)
 			}
 			else
 			{
-dblink_get_conn(text_to_cstring(PG_GETARG_TEXT_PP(0)), &conn, &conname, &freeconn);
+conname = text_to_cstring(PG_GETARG_TEXT_PP(0));
+dblink_get_conn(conname, &conn, &conname, &freeconn);
 sql = text_to_cstring(PG_GETARG_TEXT_PP(1));
 			}
 		}



Re: [WIP] RE: [HACKERS] DECLARE STATEMENT setting up a connection in ECPG

2017-03-22 Thread Haribabu Kommi
On Wed, Mar 22, 2017 at 7:57 PM, Ideriha, Takeshi <
ideriha.take...@jp.fujitsu.com> wrote:

> Hi, thank you very much for reviewing.
> Attached is v6 patch.
>
> >There was a minor conflict in applying 004_declareXX patch.
>
> I rebased 004_declareStmt_test_v6.patch.
>
> >Some comments in 001_declareStmt_preproc_v5.patch:
>
> >+  if (INFORMIX_MODE)
> >+  {
> >+  if (pg_strcasecmp(stmt+strlen("close "), "database") == 0)
> >+  {
> >+  if (connection)
> >+  mmerror(PARSE_ERROR, ET_ERROR, "AT option not allowed in CLOSE DATABASE statement");
> >+
> >+  fprintf(base_yyout, "{ ECPGdisconnect(__LINE__, \"CURRENT\");");
> >+  whenever_action(2);
> >+  free(stmt);
> >+  break;
> >+  }
> >+  }
>
> >The same code block is present in the stmtClosePortalStmt condition to
> >verify the close statement. Because of the above code addition, the code
> >present in stmtClosePortalStmt is dead code. So remove the code present
> >in stmtClosePortalStmt or try to reuse it.
>
> I removed almost all of the stmtClosePortalStmt code
> and just left the interface, which actually does nothing, so as not to
> affect other code.
>
> But I'm not sure whether this implementation is good or not.
> Maybe I should completely remove the stmtClosePortalStmt code
> and rename ClosePortalStmtCLOSEcursor_name to stmtClosePortalStmt to make
> the code easier to understand.
>
> >+static void
> >+output_cursor_name(char *str)
>
> >This function needs some comments to explain the code flow for better
> understanding.
>
> I add some explanation that output_cursor_name() filters escape sequences.
>
> >+/*
> >+ * Translate the EXEC SQL DECLARE STATEMENT into ECPGdeclare function
> >+ */
>
> >How about using Transform instead of Translate? (similar references also)
>
> I changed two comments with the word Translate.
>

Thanks for the updated patch. It looks good to me.
I have some comments on the doc patch.

+
+  
+   The third option is to declare a sql identifier linked to
+   the connection, for example:
+
+EXEC SQL AT connection-name DECLARE
statement-name STATEMENT;
+EXEC SQL PREPARE statement-name FROM
:dyn-string;
+
+   Once you link a sql identifier to a connection, you execute a dynamic SQL
+   without AT clause.
+  

The above code block should be moved below the following code location,
and providing some example usage of it in the example code would be
better.


EXEC SQL SET CONNECTION connection-name;



+
+ DECLARE CURSOR with a SQL statement identifier can be written before PREPARE.
+

Why do we need to provide details about DECLARE CURSOR here?

+
+ AT clause cannot be used with the SQL statement which has been identified by DECLARE STATEMENT
+

I didn't clearly understand the limitation here. If you can provide an
example, that would be good.

The test patch looks good to me.

Regards,
Hari Babu
Fujitsu Australia


Re: [HACKERS] Logical decoding on standby

2017-03-22 Thread Craig Ringer
On 23 March 2017 at 12:41, Andres Freund  wrote:
> On 2017-03-23 12:14:02 +0800, Craig Ringer wrote:
>> On 23 March 2017 at 09:39, Andres Freund  wrote:
>> > I still think decoding-on-standby is simply not the right approach as
>> > the basic/first HA approach for logical rep.  It's a nice later-on
>> > feature.  But that's an irrelevant aside.
>>
>> I don't really agree that it's irrelevant.
>
> I'm not sure we have enough time for either getting some parts of your
> patch in, or for figuring out long term goals. But we definitely don't
> have time for both.

Fair.



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Hash support for grouping sets

2017-03-22 Thread Andrew Gierth
> "Andres" == Andres Freund  writes:

 >> -   Assert(newphase == 0 || newphase == aggstate->current_phase + 1);
 >> +   Assert(newphase <= 1 || newphase == aggstate->current_phase + 1);

 Andres> I think this somewhere in the file header needs an expanded
 Andres> explanation about what these "special" phases mean.

I could move most of the block comment for ExecInitAgg up into the file
header; would that be better?

 Andres> I've not yet read up on the thread, or the whole patch, but an
 Andres> explanation of the memory management for all those tables would
 Andres> be good somewhere (maybe I've just not read that far).

I can expand on that.

 Andres> "mixed types" sounds a bit weird.

changing to "if there are both types present"

 >> +   values = palloc((1 + max_weight) * sizeof(double));
 >> +   sets = palloc((1 + max_weight) * sizeof(Bitmapset *));

 Andres> We usually cast the result of palloc.

A rough count in the backend finds ~400 palloc calls without casts versus
~1350 with, so this doesn't seem to have been consistently enforced.

 >> +static void
 >> +_outGroupingSetData(StringInfo str, const GroupingSetData *node)
 >> +{
 >> +   WRITE_NODE_TYPE("GSDATA");
 >> +
 >> +   WRITE_NODE_FIELD(set);
 >> +   WRITE_FLOAT_FIELD(numGroups, "%.0f");
 >> +}

 Andres> .0f?

Copied from every other node that uses "%.0f" to write out a numGroups
value.

 >> +static grouping_sets_data *
 >> +preprocess_grouping_sets(PlannerInfo *root)
 >> +{

 Andres> It'd not hurt to add a comment as to what this function is
 Andres> actually doing.

Sure.

 Andres> I hope there's tests for both these branches.

A number of tests for both unsortable and unhashable cases are in the
regression test.

 >> +   /*
 >> +* Is it hashable? We pretend empty sets are hashable even though we
 >> +* actually force them not to be hashed later. But don't bother if
 >> +* there's nothing but empty sets.
 >> +*/

 Andres> Why?

If there's no non-empty grouping set then there's nothing to hash (the
hash code assumes at least one column). I'll put that in the comment.

 >> @@ -3214,6 +3407,11 @@ get_number_of_groups(PlannerInfo *root,
 >> * estimate_hashagg_tablesize
 >> * estimate the number of bytes that a hash aggregate hashtable will
 >> * require based on the agg_costs, path width and dNumGroups.
 >> + *
 >> + * XXX this may be over-estimating the size now that hashagg knows to omit
 >> + * unneeded columns from the hashtable. Also for mixed-mode grouping sets,
 >> + * grouping columns not in the hashed set are counted here even though
 >> + * hashagg won't store them. Is this a problem?
 >> */

 Andres> Hm. This seems like it could move us to use sorting, even if
 Andres> not actually warranted.

This is largely a pre-existing issue that this patch doesn't try to
fix.  That would be an issue for a separate patch, I think.  I merely
added the comment to point it out.

 Andres> What's the newline after the comment about?

Old style habits die hard.

 >> +   RelOptInfo *grouped_rel,
 >> +   Path *path,
 >> +   bool is_sorted,
 >> +   bool can_hash,
 >> +   PathTarget *target,
 >> +   grouping_sets_data *gd,
 >> +   const AggClauseCosts *agg_costs,
 >> +   double dNumGroups)
 >> +{
 >> +   Query  *parse = root->parse;
 >> +
 >> +   /*
 >> +* If we're not being offered sorted input, then only consider plans that
 >> +* can be done entirely by hashing.
 >> +*
 >> +* We can hash everything if it looks like it'll fit in work_mem. But if
 >> +* the input is actually sorted despite not being advertised as such, we
 >> +* prefer to make use of that in order to use less memory.
 >> +* If none of the grouping sets are sortable, then ignore the work_mem
 >> +* limit and generate a path anyway, since otherwise we'll just fail.
 >> +*/
 >> +   if (!is_sorted)
 >> +   {
 >> +   List   *new_rollups = NIL;
 >> +   RollupData *unhashable_rollup = NULL;
 >> +   List   *sets_data;
 >> +   List   *empty_sets_data = NIL;
 >> +   List   *empty_sets = NIL;
 >> +   ListCell   *lc;
 >> +   ListCell   *l_start = list_head(gd->rollups);
 >> +   AggStrategy strat = AGG_HASHED;
 >> +   Size    hashsize;
 >> +   double  exclude_groups = 0.0;
 >> +
 >> +   Assert(can_hash);
 >> +
 >> +   if (pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
 >> +   {
 >> +   unhashable_rollup = lfirst(l_start);

 

Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-03-22 Thread Andrew Borodin
2017-03-22 22:48 GMT+05:00 Teodor Sigaev :
> hasEmptyChild? and hasNonEmptyChild (BTW, isAnyNonempy has missed 't')

Yes, I think this naming is good. It's clear what's in common in these
flags and what's different.

> And if the whole posting tree is empty,then we could mark root page as leaf
> and remove all other pages in tree without any locking. Although, it could
> be a task for separate patch.

From the performance point of view, this is a very good idea, for both
VACUUM performance and scan performance. But doing so, we risk leaving
some garbage pages behind in case of a crash, and I do not see how to
avoid that without unlinking pages one by one. I agree that leaving
this trick for a separate patch is quite reasonable.

Best regards, Andrey Borodin.




Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-22 Thread Amit Kapila
On Wed, Mar 22, 2017 at 4:06 PM, Amit Kapila  wrote:
> On Tue, Mar 21, 2017 at 6:47 PM, Pavan Deolasee
>  wrote:
>>
>>>
>>
>> Please find attached rebased patches.
>>
>
> Few comments on 0005_warm_updates_v18.patch:
>

Few more comments on 0005_warm_updates_v18.patch:
1.
@@ -234,6 +241,25 @@ index_beginscan(Relation heapRelation,
  scan->heapRelation = heapRelation;
  scan->xs_snapshot = snapshot;

+ /*
+ * If the index supports recheck, make sure that index tuple is saved
+ * during index scans. Also build and cache IndexInfo which is used by
+ * amrecheck routine.
+ *
+ * XXX Ideally, we should look at all indexes on the table and check if
+ * WARM is at all supported on the base table. If WARM is not supported
+ * then we don't need to do any recheck. RelationGetIndexAttrBitmap() does
+ * do that and sets rd_supportswarm after looking at all indexes. But we
+ * don't know if the function was called earlier in the session when we're
+ * here. We can't call it now because there exists a risk of causing
+ * deadlock.
+ */
+ if (indexRelation->rd_amroutine->amrecheck)
+ {
+ scan->xs_want_itup = true;
+ scan->indexInfo = BuildIndexInfo(indexRelation);
+ }
+
+

Don't we need to do this rechecking during parallel scans?  Also what
about bitmap heap scans?

2.
+++ b/src/backend/access/nbtree/nbtinsert.c
-
 typedef struct

The above change is not required.

3.
+void
+_bt_clear_items(Page page, OffsetNumber *clearitemnos, uint16 nclearitems)
+void
+_hash_clear_items(Page page, OffsetNumber *clearitemnos,
+   uint16 nclearitems)

Both of the above functions look exactly the same; isn't it better to
have a single function like page_clear_items?  If you want separation
for different index types, then we can have one common function which
can be called from the different index types.
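The suggested shared helper could look something like this toy sketch, where a "page" is modeled as a plain array of item ids. This is not real index code; a real implementation would presumably operate on a Page and its line pointers (e.g. via PageIndexMultiDelete) instead:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t OffsetNumber;  /* 1-based item offsets, as in PostgreSQL */

/*
 * Toy model of a single page_clear_items() helper that both the btree
 * and hash variants quoted above could call.  Items at the given
 * 1-based offsets are dropped and the rest compacted.  Returns the
 * new item count.
 */
static int
page_clear_items(uint32_t *items, int nitems,
                 const OffsetNumber *clearitemnos, uint16_t nclearitems)
{
    int dst = 0;

    for (int src = 0; src < nitems; src++)
    {
        bool clear = false;

        for (uint16_t i = 0; i < nclearitems; i++)
        {
            if (clearitemnos[i] == (OffsetNumber) (src + 1))
            {
                clear = true;
                break;
            }
        }
        if (!clear)
            items[dst++] = items[src];  /* keep this item, compacting */
    }
    return dst;
}
```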

4.
- if (callback(htup, callback_state))
+ flags = ItemPointerGetFlags(&itup->t_tid);
+ is_warm = ((flags & BTREE_INDEX_WARM_POINTER) != 0);
+
+ if (is_warm)
+ stats->num_warm_pointers++;
+ else
+ stats->num_clear_pointers++;
+
+ result = callback(htup, is_warm, callback_state);
+ if (result == IBDCR_DELETE)
+ {
+ if (is_warm)
+ stats->warm_pointers_removed++;
+ else
+ stats->clear_pointers_removed++;

The patch looks to be inconsistent in collecting stats for btree and
hash.  I don't see the above stats getting updated in the hash index code.

5.
+btrecheck(Relation indexRel, IndexInfo *indexInfo, IndexTuple indexTuple,
+ Relation heapRel, HeapTuple heapTuple)
{
..
+ if (!datumIsEqual(values[i - 1], indxvalue, att->attbyval,
+ att->attlen))
..
}

Will this work if the index is using non-default collation?

6.
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -390,83 +390,9 @@ btree_xlog_vacuum(XLogReaderState *record)
-#ifdef UNUSED
  xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);

  /*
- * This section of code is thought to be no longer needed, after analysis
- * of the calling paths. It is retained to allow the code to be reinstated
- * if a flaw is revealed in that thinking.
- *
..

Why does this patch need to remove the above code under #ifdef UNUSED?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] scram and \password

2017-03-22 Thread Michael Paquier
On Wed, Mar 22, 2017 at 8:54 PM, Heikki Linnakangas  wrote:
> On 03/17/2017 05:38 AM, Michael Paquier wrote:
>>
>> Regression tests are proving to be useful here (it would be nice to
>> get those committed first!). I am noticing that this patch breaks
>> connection for users with cleartext or md5-hashed verifier when
>> "password" is used in pg_hba.conf.
>
> Are you sure? It works for me.

Hm... All the password tests are still broken for me on macOS, but
work on Linux:
not ok 4 - authentication success for method password, role scram_role

#   Failed test 'authentication success for method password, role scram_role'
#   at t/001_password.pl line 39.
#  got: '2'
# expected: '0'
not ok 5 - authentication success for method password, role md5_role

#   Failed test 'authentication success for method password, role md5_role'
#   at t/001_password.pl line 39.
#  got: '2'
# expected: '0'
not ok 6 - authentication success for method password, role plain_role

[... dig ... dig ...]

And after a lookup the failure is here:
-   result = get_role_password(port->user_name, &shadow_pass, logdetail);
+   shadow_pass = get_role_password(port->user_name, logdetail);
if (result == STATUS_OK)
result is never set in this code path, so that may crash.

And you need to do something like that, which makes the tests pass here:
@@ -721,6 +721,7 @@ CheckPasswordAuth(Port *port, char **logdetail)
return STATUS_EOF;  /* client wouldn't send password */

shadow_pass = get_role_password(port->user_name, logdetail);
+   result = shadow_pass != NULL ? STATUS_OK : STATUS_ERROR;
if (result == STATUS_OK)
result = plain_crypt_verify(port->user_name, shadow_pass, passwd,
logdetail);

> Here's a slightly updated patch that includes required changes to the test
> case (now that those have been committed), and some re-wording in the docs,
> per Joe's suggestion. All the tests pass here.

+   verifier = scram_build_verifier(username, shadow_pass, 0);
+
+   (void) parse_scram_verifier(verifier, &state->salt, &state->iterations,
+   state->StoredKey, state->ServerKey);
+   pfree(verifier);
Not directly a problem of this patch, but scram_build_verifier can return NULL.

>> -# Most users use SCRAM authentication, but some users use older clients
>> -# that don't support SCRAM authentication, and need to be able to log
>> -# in using MD5 authentication. Such users are put in the @md5users
>> -# group, everyone else must use SCRAM.
>> +# Require SCRAM authentication for most users, but make an exception
>> +# for user 'mike', who uses an older client that doesn't support SCRAM
>> +# authentication.
>>  #
>>  # TYPE  DATABASEUSERADDRESS METHOD
-host    all     @md5users       .example.com    md5
+host    all     mike            .example.com    md5
>> Why not still using @md5users?
>
> The old example didn't make much sense, now that md5 means "md5 or scram".
> Could've still used @md5users, but I think this is more clear. The old
> explanation was wrong or at least misleading anyway, because @md5users
> doesn't refer to a group, but a flat file that lists roles.

This patch introduces, for the first time, a non-generic user name in
pg_hba.conf. Keeping in mind that users could just copy-paste what is
in the docs to make their own file, the approach of using an @ marker
looks more generic to me. But I won't insist on this point any more.

I like the move of removing the status error codes from
get_role_password(). With this support grid, only users with
MD5-hashed verifiers cannot log in when the matching hba entry uses
scram.
-- 
Michael




Re: [HACKERS] Logical decoding on standby

2017-03-22 Thread Andres Freund
On 2017-03-23 12:14:02 +0800, Craig Ringer wrote:
> On 23 March 2017 at 09:39, Andres Freund  wrote:
> > I still think decoding-on-standby is simply not the right approach as
> > the basic/first HA approach for logical rep.  It's a nice later-on
> > feature.  But that's an irrelevant aside.
> 
> I don't really agree that it's irrelevant.

I'm not sure we have enough time for either getting some parts of your
patch in, or for figuring out long term goals. But we definitely don't
have time for both.

- Andres




Re: [HACKERS] Enabling parallelism for queries coming from SQL or other PL functions

2017-03-22 Thread Rafia Sabih
On Thu, Mar 23, 2017 at 5:23 AM, Dilip Kumar  wrote:
> On Wed, Mar 22, 2017 at 10:33 PM, Robert Haas  wrote:
>> So couldn't we actually make this test !fcache->returnsSet || !es->lazyEval?
>> That would let us allow parallel execution for all non-set-returning
>> functions, and also for set-returning functions that end up with
>> es->lazyEval set to false.
>
> Yes, this is the right thing to do, although we may not enable
> parallelism for any more queries by adding "|| !es->lazyEval", because
> SELECTs are always marked with es->lazyEval=true (and so far we have
> parallelism only for SELECT). But since we are passing the parameter to
> ExecutorRun as execute_once, !fcache->returnsSet || !es->lazyEval
> is the correct condition and future proof.
>
Agree, done.
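The condition being discussed is a simple boolean, which can be sketched standalone. This is an illustration of the logic only, not the actual executor code; `can_execute_once`, `returns_set`, and `lazy_eval` are names made up here:

```c
#include <stdbool.h>

/*
 * A query arising from a SQL function can be run to completion in a
 * single ExecutorRun call unless the function returns a set AND the
 * execution state is lazily evaluated (rows fetched incrementally).
 * The resulting boolean is what gets passed as execute_once.
 */
static bool
can_execute_once(bool returns_set, bool lazy_eval)
{
    return !returns_set || !lazy_eval;
}
```

Since plain SELECTs are always marked lazyEval=true, only the set-returning case currently changes the outcome, which matches the "future proof" remark above.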

-- 
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/


execute-once-v3.patch
Description: Binary data


pl_parallel_opt_support_v3.patch
Description: Binary data



Re: [HACKERS] pageinspect and hash indexes

2017-03-22 Thread Ashutosh Sharma
>> I think it is not just happening for freed overflow but also for newly
>> allocated bucket page. It would be good if we could mark  freed
>> overflow page as UNUSED page rather than just initialising it's header
>> portion and leaving the page type in special area as it is. Attached
>> is the patch '0001-mark_freed_ovflpage_as_UNUSED_pagetype.patch' that
>> marks a freed overflow page as an unused page.
>>
>
>   _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
> +
> + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
> +
> + ovflopaque->hasho_prevblkno = InvalidBlockNumber;
> + ovflopaque->hasho_nextblkno = InvalidBlockNumber;
> + ovflopaque->hasho_bucket = -1;
> + ovflopaque->hasho_flag = LH_UNUSED_PAGE;
> + ovflopaque->hasho_page_id = HASHO_PAGE_ID;
> +
>
> You need similar initialization in replay routine as well.

I will do that and share the patch shortly. Thanks.

>
>> Also, I have now changed pageinspect module to handle unused and empty
>> hash index page. Attached is the patch
>> (0002-allow_pageinspect_handle_UNUSED_OR_EMPTY_hash_pages.patch) for
>> the same.
>>
>
> 1.
> @@ -70,6 +70,17 @@ verify_hash_page(bytea *raw_page, int flags)
>   pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
> +
> + /* Check if it is an unused hash page. */
> + if (pageopaque->hasho_flag == LH_UNUSED_PAGE)
> + return page;
>
>
> I don't see we need to have a separate check for unused page, why
> can't it be checked with other page types in below check.
>
> if (pagetype != LH_OVERFLOW_PAGE && pagetype != LH_BUCKET_PAGE &&
> pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE)

That is because an UNUSED page is as good as an empty page, except that
an empty page doesn't have any pagetype. If we add a condition for
checking an UNUSED page in the above if check, it will never show an
unused page as an unused page; rather, it will always show it as an
empty page. To avoid this, at the start of the verify_hash_page function
itself, if we recognise the page as an UNUSED page we return immediately.
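The check ordering being described can be sketched as a standalone toy. The flag values below mirror my reading of hash.h and should be treated as illustrative; `classify_hash_page` and `PageVerdict` are names invented for this sketch:

```c
#include <stdint.h>

/* Illustrative flag values, patterned after src/include/access/hash.h. */
#define LH_UNUSED_PAGE    (0)
#define LH_OVERFLOW_PAGE  (1 << 0)
#define LH_BUCKET_PAGE    (1 << 1)
#define LH_BITMAP_PAGE    (1 << 2)
#define LH_META_PAGE      (1 << 3)
#define LH_PAGE_TYPE \
    (LH_OVERFLOW_PAGE | LH_BUCKET_PAGE | LH_BITMAP_PAGE | LH_META_PAGE)

typedef enum { PAGE_UNUSED, PAGE_KNOWN_TYPE, PAGE_INVALID } PageVerdict;

/*
 * An unused page's flag word is exactly LH_UNUSED_PAGE (zero), so it
 * would never match any of the type bits tested later -- it needs its
 * own early return, which is the point of the change described above.
 */
static PageVerdict
classify_hash_page(uint16_t hasho_flag)
{
    uint16_t pagetype;

    if (hasho_flag == LH_UNUSED_PAGE)
        return PAGE_UNUSED;

    pagetype = hasho_flag & LH_PAGE_TYPE;
    if (pagetype != LH_OVERFLOW_PAGE && pagetype != LH_BUCKET_PAGE &&
        pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE)
        return PAGE_INVALID;

    return PAGE_KNOWN_TYPE;
}
```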

>
> 2.
> + /* Check if it is an empty hash page. */
> + if (PageIsEmpty(page))
> + ereport(ERROR,
> + (errcode(ERRCODE_INDEX_CORRUPTED),
> + errmsg("index table contains empty page")));
>
>
> Do we want to give a separate message for EMPTY and NEW pages?  Isn't
> it better that the same error message can be given for both of them as
> from user perspective there is not much difference between both the
> messages?

I think we should show separate messages because they are two different
types of pages, in the sense that one is initialised whereas the other
is completely zero.

--
With Regards,
Ashutosh Sharma
EnterpriseDB:http://www.enterprisedb.com




Re: [HACKERS] Logical decoding on standby

2017-03-22 Thread Craig Ringer
On 23 March 2017 at 00:13, Simon Riggs  wrote:
> On 22 March 2017 at 08:53, Craig Ringer  wrote:
>
>> I'm splitting up the rest of the decoding on standby patch set with
>> the goal of getting minimal functionality for creating and managing
>> slots on standbys in, so we can maintain slots on standbys and use
>> them when the standby is promoted to master.
>>
>> The first, to send catalog_xmin separately to the global xmin on
>> hot_standby_feedback and store it in the upstream physical slot's
>> catalog_xmin, is attached.
>>
>> These are extracted directly from the logical decoding on standby
>> patch, with comments by Petr and Andres made re the relevant code
>> addressed.
>
> I've reduced your two patches back to one with a smaller blast radius.
>
> I'll commit this tomorrow morning, barring objections.

Thanks. I was tempted to refactor GetOldestXmin to use flags myself,
but thought it might be at higher risk of objections. Since Eiji Seki
has shown that there are other uses for excluding particular things
from GetOldestXmin, and that's committed now, it's nice to have the
impact of this patch reduced.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-03-22 Thread Ashutosh Bapat
>
>> I don't think we will get away by supporting just scan paths, since
>> the inner side of lateral join can be any paths not just scan path. Or
>> you are suggesting that we disable partition-wise lateral join and
>> support reparameterization of only scan paths?
>
> I think if you can do a straight-up partitionwise nested loop between
> two tables A and B, that's pretty bad.

Ok.

> But if there are more complex
> cases that involve parameterizing entire join trees which aren't
> covered, that's less bad.  Parallel query almost entirely punts on
> LATERAL right now, and nobody's complained yet.  I'm sure that'll need
> to get fixed someday, but not today.
>
Ok.

I am suggesting this possibility in case we run out of time to review
and commit reparameterize_path_by_child() entirely. If we can do that,
I will be happy.

In case we have to include a stripped-down version of
reparameterize_path_by_child(), which I am fine with too, we will need
to disable LATERAL joins, so that we don't end up with the error "could
not devise a query plan for the given query".

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company





Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2017-03-22 Thread Craig Ringer
On 23 March 2017 at 11:25, Craig Ringer  wrote:

> Amended patch attached, with added TAP test included.

Managed to omit it, sigh.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


011_crash_recovery.pl
Description: Perl program



Re: [HACKERS] Logical decoding on standby

2017-03-22 Thread Craig Ringer
On 23 March 2017 at 09:39, Andres Freund  wrote:

> We can't just assume that snapbuild is going to work correctly when its
> prerequisite - pinned xmin horizon - isn't working.

Makes sense.

>> What do _you_ see as the minimum acceptable way to achieve the ability
>> for a logical decoding client to follow failover of an upstream to a
>> physical standby? In the end, you're one of the main people whose view
>> carries weight in this area, and I don't want to develop yet another
>
> I think your approach here wasn't that bad? There's a lot of cleaning
> up/shoring up needed, and we probably need a smarter feedback system.  I
> don't think anybody here has objected to the fundamental approach?

That's useful, thanks.

I'm not arguing that the patch as it stands is ready, and appreciate
the input re the general design.


> I still think decoding-on-standby is simply not the right approach as
> the basic/first HA approach for logical rep.  It's a nice later-on
> feature.  But that's an irrelevant aside.

I don't really agree that it's irrelevant.

Right now Pg has no HA capability for logical decoding clients. We've
now added logical replication, but it has no way to provide for
upstream node failure and ensure a consistent switch-over, whether to
a logical or physical replica. Since real world servers fail or need
maintenance, this is kind of a problem for practical production use.

Because of transaction serialization for commit-time order replay,
logical replication experiences saw-tooth replication lag, where large
or long xacts such as batch jobs effectively stall all later xacts
until they are fully replicated. We cannot currently start replicating
a big xact until it commits on the upstream, so that lag can easily be
~2x the runtime on the upstream.

So while you can do sync rep on a logical standby, it tends to result
in big delays on COMMITs relative to physical rep, even if apps are
careful to keep transactions small. When the app DR planning people
come and ask you what the max data loss window / max sync rep lag is,
you have to say "dunno? depends on what else was running on the
server at the time."

AFAICT, changing those things will require the ability to begin
streaming reorder buffers for big xacts before commit, which as the
logical decoding on 2PC thread shows is not exactly trivial. We'll
also need to be able to apply them concurrently with other xacts on
the other end. Those are both big and complex things IMO, and I'll be
surprised if we can do either in Pg11 given that AFAIK nobody has even
started work on either of them or has a detailed plan.

Presuming we get some kind of failover to logical replica upstreams
into Pg11, it'll have significant limitations relative to what we can
deliver to users by using physical replication. Especially when it
comes to bounded-time lag for HA, sync rep, etc. And I haven't seen a
design for it, though Petr and I have discussed some with regards to
pglogical.

That's why I think we need to do HA on the physical side first. It's
going to take a long time to get equivalent functionality for
logical-rep-based upstreams, and when it arrives we'll still have to
teach management tools and other non-logical-rep logical decoding
clients about the new way of doing things.  Whereas supporting logical
downstreams in physical HA setups requires only relatively minor
changes and gets us all the physical HA features _now_.

That's why we pursued failover slots - as a simple, minimal solution
to allowing logical decoding clients to inter-operate with Pg in a
physical HA configuration. TBH, I still think we should just add them.
Sure, they don't help us achieve decoding on standby, but they're a
lot simpler and they help Pg's behaviour with slots match user
expectations for how the rest of Pg behaves, i.e. if it's on the
master it'll be on the replica too.  And as you've said, decoding on
standby is a nice-to-have, whereas I think some kind of HA support is
rather more important.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Hash support for grouping sets

2017-03-22 Thread Mark Dilger

> On Mar 22, 2017, at 8:09 AM, Mark Dilger  wrote:
> 
> 
>> On Mar 22, 2017, at 7:55 AM, Andrew Gierth  
>> wrote:
>> 
>> [snip]
>> 
>> This thread seems to have gone quiet - is it time for me to just go
>> ahead and commit the thing anyway? Anyone else want to weigh in?
> 
> Sorry for the delay.  I'll try to look at this today.  Others are most welcome
> to weigh in.

You define DiscreteKnapsack to take integer weights and double values,
and perform the usual dynamic programming algorithm to solve it.  But
in the only place you call it, you pass NULL for the values, indicating
that all the values are equal.  I'm confused why, in this degenerate case,
you bother with the DP at all.  Can't you just load the knapsack from
lightest to heaviest after sorting, reducing the runtime complexity to
O(n log n)?  It's been a long day; sorry if I am missing something obvious.

I'm guessing you intend to extend the code at some later date to have
different values for items.  Is that right?
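To illustrate the question, here is a standalone sketch (nothing to do
with the patch's actual DiscreteKnapsack code) showing that, with equal
item values, a greedy lightest-first load matches the DP answer while
only costing a sort:

```c
#include <stdlib.h>
#include <string.h>

/* 0/1 knapsack specialized to equal item values: dp[w] holds the max
 * number of items achievable within weight budget w. */
static int
knapsack_equal_values_dp(const int *weights, int n, int capacity)
{
    int *dp = calloc(capacity + 1, sizeof(int));
    int best;

    for (int i = 0; i < n; i++)
        for (int w = capacity; w >= weights[i]; w--)
            if (dp[w - weights[i]] + 1 > dp[w])
                dp[w] = dp[w - weights[i]] + 1;
    best = dp[capacity];
    free(dp);
    return best;
}

static int
cmp_int(const void *a, const void *b)
{
    int ia = *(const int *) a;
    int ib = *(const int *) b;

    return (ia > ib) - (ia < ib);
}

/* Greedy: sort by weight ascending and take items while they fit.
 * With equal values this maximizes the item count, in O(n log n). */
static int
knapsack_equal_values_greedy(const int *weights, int n, int capacity)
{
    int *sorted = malloc(n * sizeof(int));
    int count = 0, used = 0;

    memcpy(sorted, weights, n * sizeof(int));
    qsort(sorted, n, sizeof(int), cmp_int);
    for (int i = 0; i < n; i++)
    {
        if (used + sorted[i] > capacity)
            break;
        used += sorted[i];
        count++;
    }
    free(sorted);
    return count;
}
```

Of course, if the values later become non-uniform, the greedy approach
stops being optimal and the DP earns its keep.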

I reapplied your patch and reran the regression tests on my laptop and they
all pass.

mark





Re: [HACKERS] segfault in hot standby for hash indexes

2017-03-22 Thread Amit Kapila
On Thu, Mar 23, 2017 at 8:43 AM, Amit Kapila  wrote:
> On Wed, Mar 22, 2017 at 3:39 PM, Ashutosh Sharma  
> wrote:
>> Hi,
>>
>> On Wed, Mar 22, 2017 at 8:41 AM, Amit Kapila  wrote:
>>
>> To fix this, I think we should pass 'REGBUF_KEEP_DATA' while
>> registering the buffer. Something like this,
>>
>> -   XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
>> +   XLogRegisterBuffer(0, buf, REGBUF_STANDARD |
>> REGBUF_KEEP_DATA);
>>
>> Attached is the patch that fixes this issue.
>>
>
> I think this will work, but I am not sure there is merit in deviating
> from what btree does to handle this case.  One thing I find slightly
> awkward in hash_xlog_vacuum_get_latestRemovedXid() is that you are
> using the number of tuples registered as part of the fixed data
> (xl_hash_vacuum_one_page) to traverse the data registered as buf data.
> I think it will be better if we register the offsets in the fixed part
> of the data as well, as we do in the btree case.
>
>

Also, another small point in this regard: do we need two separate
variables to track the number of deleted items in the code below?  I
think one variable is sufficient.

_hash_vacuum_one_page()
{
..
deletable[ndeletable++] = offnum;
tuples_removed += 1;
..
}
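Something like this, roughly (hypothetical names, just to illustrate
the consolidation): the count of collected offsets already equals the
number of removed tuples.

```c
#include <stddef.h>

typedef unsigned short OffsetNumber;

/*
 * Sketch: collect deletable offsets for a page.  The return value,
 * ndeletable, is exactly the number of tuples removed, so a separate
 * tuples_removed counter would be redundant.
 */
static int
collect_deletable_offsets(const int *item_is_dead, int maxoff,
                          OffsetNumber *deletable)
{
    int ndeletable = 0;

    for (int off = 1; off <= maxoff; off++)
    {
        if (item_is_dead[off])
            deletable[ndeletable++] = (OffsetNumber) off;
    }
    return ndeletable;          /* doubles as tuples_removed */
}
```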



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Hash support for grouping sets

2017-03-22 Thread Andrew Gierth
> "Andres" == Andres Freund  writes:

 Andres> Changes to advance_aggregates() are, in my experience, quite
 Andres> likely to have performance effects.  This needs some
 Andres> performance tests.
 [...]
 Andres> Looks like it could all be noise, but it seems worthwhile to
 Andres> look into some.

Trying to sort out signal from noise when dealing with performance
impacts of no more than a few percent is _extremely hard_ these days.

Remember this, from a couple of years back?  http://tinyurl.com/op9qg8a

That's a graph of performance results from tests where the only change
being made was adding padding bytes to a function that was never called
during the test. Performance differences on the order of 3% were being
introduced by doing nothing more than changing the actual memory
locations of functions.

My latest round of tests on this patch shows a similar effect. Here's
one representative test timing (a select count(*) with no grouping sets
involved):

 master: 5727ms
 gsets_hash: 5949ms
 gsets_hash + 500 padding bytes: 5680ms

Since the effect of padding is going to vary over time as other,
unrelated, patches happen, I think I can safely claim that the
performance of the patch at least overlaps the noise range of the
performance of current code.  To get a more definitive result, it would
be necessary to run at least some dozens of tests, with different
padding sizes, and determine whether the average changes detectably
between the two versions.  I will go ahead and do this, out of sheer
curiosity if nothing else, but the preliminary results suggest there's
probably nothing worth holding up the patch for.

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2017-03-22 Thread Craig Ringer
On 23 March 2017 at 01:41, Robert Haas  wrote:
> On Wed, Mar 22, 2017 at 3:13 AM, Simon Riggs  wrote:
>>> Changes made per discussion.
>>
>> This looks good to me also. Does what we need it to do.
>>
>> I was a little worried by possible performance of new lock, but its
>> not intended to be run frequently, so its OK.
>
> Agreed.
>
> Reviewing 0002:
>
> +if (!TransactionIdIsValid(xid))
> +{
> +LWLockRelease(XidGenLock);
> +ereport(ERROR,
> +(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("transaction ID " UINT64_FORMAT " is an invalid xid",
> +xid_with_epoch)));
> +}
>
> It's unnecessary to release LWLockRelease() before throwing an error,
> and it's also wrong because we haven't acquired XidGenLock in this
> code path.  But maybe it would be better to just remove this entirely
> and instead have TransactionIdInRecentPast return false for
> InvalidTransactionId.  Then we'd avoid adding a translatable message
> for a case that is basically harmless to allow.

Agreed, that's better.

> +if (TransactionIdIsCurrentTransactionId(xid))
> +status = gettext_noop("in progress");
> +else if (TransactionIdDidCommit(xid))
> +status = gettext_noop("committed");
> +else if (TransactionIdDidAbort(xid))
> +status = gettext_noop("aborted");
> +else
> +
> +/*
> + * can't test TransactionIdIsInProgress here or we race with
> + * concurrent commit/abort. There's no point anyway, since it
> + * might then commit/abort just after we check.
> + */
> +status = gettext_noop("in progress");
>
> I am not sure this is going to do the right thing for transactions
> that are aborted by a crash without managing to write an abort record.

You're right. It'll report in-progress, since the clog entry will
still be 0. We don't appear to explicitly update the clog during
recovery.

Funny, I got so focused on the clog access safety stuff in the end
that I missed the bigger picture.

Given that part of the point of this is xact status after crash,
that's kind of important. I've added a TAP test for this, as part of a
new test set, "011_crash_recovery.pl". The recovery tests don't really
have much on crash behaviour at the moment so might as well start it
and add more as we go. It doesn't make sense to add a whole new top
level TAP test just for txid_status.

(I *love* the TAP framework now, it took 10 mins to write a test
that's trivial to repeat in 5 seconds per run for this).

> It seems that the first check will say the transaction isn't in
> progress, and the second and third checks will say it isn't either
> committed or aborted since, if I am reading this correctly, they just
> read clog directly.

Right. They're not concerned with subxacts or multixacts.

> Compare HeapTupleSatisfiesMVCC, which assumes
> that a not-in-progress, not-committed transaction must necessarily
> have aborted.

Right.

We don't have a HeapTuple here, of course, so we can't test flags.
Most importantly this means we can't handle multixacts. If we ever did
decide to expose multixacts as bigint epoch-qualified xids for general
users, we'd probably reserve the high bit of the epoch the bigint xid
representation for the equivalent of HEAP_XMAX_IS_MULTI, and maybe
it's worth reserving that bit now and ERROR'ing if we find it set. But
right now we never produce a bigint xid with a multixact.
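For concreteness, the bigint representation at the SQL level packs the
wraparound epoch into the high 32 bits. A sketch of that packing, with
a purely hypothetical flag illustrating the reserved-top-bit idea
floated above:

```c
#include <stdint.h>

/*
 * Epoch-qualified transaction ID as exposed by txid_current():
 * (epoch << 32) | xid.  The flag below is purely hypothetical,
 * illustrating the idea of one day reserving the top bit of the epoch
 * half as a HEAP_XMAX_IS_MULTI-style multixact marker.
 */
#define BIGINT_XID_IS_MULTI     (UINT64_C(1) << 63)

static uint64_t
make_bigint_xid(uint32_t epoch, uint32_t xid)
{
    return ((uint64_t) epoch << 32) | xid;
}

static uint32_t
bigint_xid_epoch(uint64_t txid)
{
    return (uint32_t) (txid >> 32);
}

static uint32_t
bigint_xid_xid(uint64_t txid)
{
    return (uint32_t) txid;
}
```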

I wonder if a note in the docs warning not to cast xid to bigint is
worth it. Probably not. You have to cast xid::text::bigint which is
kind of a hint, plus it doesn't make sense to test txid_status() on
tuples' xmin/xmax. I guess you might use it with xids from
pg_stat_activity, but there's not much point when you can just look at
pg_stat_activity again to see if the proc of interest is still there
and has the same xid.

Anyway...

> I think your comment is pointing to a real issue,
> though.  It seems like what might be needed is to add one more check.
> Before where you have the "else" clause, check if the XID is old, e.g.
> precedes our snapshot's xmin.  If so, it must be committed or aborted
> and, since it didn't commit, it aborted.  If not, it must've changed
> from in progress to not-in-progress just as we were in the midst
> checking, so labeling it as in progress is fine.

That seems clear and sensible.

XidInMVCCSnapshot simply tests TransactionIdPrecedes(xid,
snapshot->xmin), as does HeapTupleSatisfiesHistoricMVCC. It seems
reasonable enough to just examine the snapshot xmin directly in
txid_status too; it's already done in replication/logical/snapbuild.c
and lmgr/predicate.c.

We're running in an SQL-called function so we'll have a good snapshot
and GetSnapshotData will have run, so GetActiveSnapshot()->xmin should
be sufficient.
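For clarity, here's a standalone sketch of the resulting ladder, with
the clog lookups modelled as plain booleans and the current-xact check
omitted (hypothetical helper names; the real comparison is
TransactionIdPrecedes' modulo-2^32 test):

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Circular comparison in the style of TransactionIdPrecedes(): true if
 * id1 logically precedes id2, modulo the 2^32 xid wraparound.  (The
 * real function also special-cases permanent xids.) */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* Sketch of the status ladder; 'committed'/'aborted' stand in for the
 * direct clog lookups. */
static const char *
sketch_txid_status(TransactionId xid, bool committed, bool aborted,
                   TransactionId snapshot_xmin)
{
    if (committed)
        return "committed";
    if (aborted)
        return "aborted";

    /*
     * Neither clog state is set.  If xid precedes our snapshot's xmin
     * it cannot still be running, so it must have died without writing
     * an abort record (e.g. in a crash): report it as aborted.
     */
    if (xid_precedes(xid, snapshot_xmin))
        return "aborted";

    return "in progress";
}
```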

Amended patch attached, with added TAP 

Re: [HACKERS] Radix tree for character conversion

2017-03-22 Thread Kyotaro HORIGUCHI
At Tue, 21 Mar 2017 13:10:48 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
 wrote in 
<20170321.131048.150321071.horiguchi.kyot...@lab.ntt.co.jp>
> At Fri, 17 Mar 2017 13:03:35 +0200, Heikki Linnakangas  
> wrote in <01efd334-b839-0450-1b63-f2dea9326...@iki.fi>
> > On 03/17/2017 07:19 AM, Kyotaro HORIGUCHI wrote:
> > > I would like to use convert() function. It can be a large
> > > PL/PgSQL function or a series of "SELECT convert(...)"s. The
> > > latter is doable on-the-fly (by not generating/storing the whole
> > > script).
> > >
> > > | -- Test for SJIS->UTF-8 conversion
> > > | ...
> > > | SELECT convert('\', 'SJIS', 'UTF-8'); -- results in error
> > > | ...
> > > | SELECT convert('\897e', 'SJIS', 'UTF-8');
> > 
> > Makes sense.
> > 
> > >> You could then run those SQL statements against old and new server
> > >> version, and verify that you get the same results.
> > >
> > > Including the result files in the repository will make this easy
> > > but unacceptably bloats. Put mb/Unicode/README.sanity_check?
> > 
> > Yeah, a README with instructions on how to do sounds good. No need to
> > include the results in the repository, you can run the script against
> > an older version when you need something to compare with.
> 
> Ok, I'll write a small script to generate a set of "conversion
> dump" and try to write README.sanity_check describing how to use
> it.

I found that there's no way to identify the character domain of a
conversion at the SQL interface. Unconditionally feeding everything from 0 to
0x as a bytea string yields an overly bloated result containing
many bogus lines.  (If \x40 is a character, convert() also
accepts \x4040, \x404040 and \x40404040.)

One more annoyance is the fact that mappings and conversion
procedures are not in one-to-one correspondence.  The
correspondence is hidden in the conversion_procs/*.c files, so we
would have to either extract it from them or provide it as external
knowledge.  Neither seems good.

Finally, it seems that I have no choice but to resurrect
map_checker.  Exactly the same program no longer works, but
map_dumper.c, with almost the same structure, will.

If no one objects to adding map_dumper.c and
gen_mapdumper_header.pl (tentative names, of course), I'll make a
patch to do that.

Any suggestions?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] segfault in hot standby for hash indexes

2017-03-22 Thread Amit Kapila
On Wed, Mar 22, 2017 at 3:39 PM, Ashutosh Sharma  wrote:
> Hi,
>
> On Wed, Mar 22, 2017 at 8:41 AM, Amit Kapila  wrote:
>
> To fix this, I think we should pass 'REGBUF_KEEP_DATA' while
> registering the buffer. Something like this,
>
> -   XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
> +   XLogRegisterBuffer(0, buf, REGBUF_STANDARD |
> REGBUF_KEEP_DATA);
>
> Attached is the patch that fixes this issue.
>

I think this will work, but I am not sure there is merit in deviating
from what btree does to handle this case.  One thing I find slightly
awkward in hash_xlog_vacuum_get_latestRemovedXid() is that you are
using the number of tuples registered as part of the fixed data
(xl_hash_vacuum_one_page) to traverse the data registered as buf data.
I think it will be better if we register the offsets in the fixed part
of the data as well, as we do in the btree case.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] pageinspect and hash indexes

2017-03-22 Thread Amit Kapila
On Mon, Mar 20, 2017 at 6:53 PM, Ashutosh Sharma  wrote:
>
> I think it is not just happening for a freed overflow page but also for a
> newly allocated bucket page. It would be good if we could mark a freed
> overflow page as an UNUSED page rather than just initialising its header
> portion and leaving the page type in the special area as it is. Attached
> is the patch '0001-mark_freed_ovflpage_as_UNUSED_pagetype.patch' that
> marks a freed overflow page as an unused page.
>

  _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
+
+ ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
+
+ ovflopaque->hasho_prevblkno = InvalidBlockNumber;
+ ovflopaque->hasho_nextblkno = InvalidBlockNumber;
+ ovflopaque->hasho_bucket = -1;
+ ovflopaque->hasho_flag = LH_UNUSED_PAGE;
+ ovflopaque->hasho_page_id = HASHO_PAGE_ID;
+

You need similar initialization in replay routine as well.

> Also, I have now changed pageinspect module to handle unused and empty
> hash index page. Attached is the patch
> (0002-allow_pageinspect_handle_UNUSED_OR_EMPTY_hash_pages.patch) for
> the same.
>

1.
@@ -70,6 +70,17 @@ verify_hash_page(bytea *raw_page, int flags)
  pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
+
+ /* Check if it is an unused hash page. */
+ if (pageopaque->hasho_flag == LH_UNUSED_PAGE)
+ return page;


I don't see why we need a separate check for the unused page; why
can't it be checked along with the other page types in the check below?

if (pagetype != LH_OVERFLOW_PAGE && pagetype != LH_BUCKET_PAGE &&
pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE)

2.
+ /* Check if it is an empty hash page. */
+ if (PageIsEmpty(page))
+ ereport(ERROR,
+ (errcode(ERRCODE_INDEX_CORRUPTED),
+ errmsg("index table contains empty page")));


Do we want to give separate messages for EMPTY and NEW pages?  Isn't
it better to give the same error message for both, since from a user's
perspective there is not much difference between them?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] bug/oversight in TestLib.pm and PostgresNode.pm

2017-03-22 Thread Michael Paquier
On Thu, Mar 23, 2017 at 12:51 AM, Erik Rijkers  wrote:
> While trying to test pgbench's stderr (looking for 'creating tables' in the
> output of the initialisation step), I ran into these two bugs (or perhaps
> better, 'oversights').

+   if (defined $expected_stderr) {
+   like($stderr, $expected_stderr, "$test_name: stderr matches");
+   }
+   else {
is($stderr, '', "$test_name: no stderr");
-   like($stdout, $expected_stdout, "$test_name: matches");
+   }
To simplify that, you could as well set the expected output to be an
empty string and just use like() instead of is(), saving this if/else.

> But especially the omission of command_fails_like() in PostgresNode.pm feels
> like a bug.

+=item $node->command_fails_like(...) - TestLib::command_fails_like
with our PGPORT
+
+See command_ok(...)
+
+=cut
+
+sub command_fails_like
+{
+   my $self = shift;
+
+   local $ENV{PGPORT} = $self->port;
+
+   TestLib::command_fails_like(@_);
+}
Most likely a case where this is needed has not shown up yet, so +1 to
removing this inconsistency across the modules.
-- 
Michael




Re: [HACKERS] Measuring replay lag

2017-03-22 Thread Thomas Munro
On Wed, Mar 15, 2017 at 8:15 PM, Ian Barwick
 wrote:
>> 2.  Recognise when the last reported write/flush/apply LSN from the
>> standby == end of WAL on the sending server, and show lag times of
>> 00:00:00 in all three columns.  I consider this entirely bogus: it's
>> not an actual measurement that was ever made, and on an active system
>> it would flip-flop between real measurements and the artificial
>> 00:00:00.  I do not like this.
>
> I agree with this; while initially I was expecting to see 00:00:00,
> SQL NULL is definitely correct here. Anyone writing tools etc. which need to
> report an actual interval can convert this to 00:00:00 easily enough .

Right.

Another point here is that if someone really wants to see "estimated
time until caught up to current end of WAL" where 00:00:00 makes sense
when fully replayed, then it is already possible to compute that using
information that is published in pg_stat_replication in 9.6.

An external tool or a plpgsql function could do something like:
observe replay_location twice with a sleep in between to estimate the
current rate of replay, then divide the current distance between
replay location and end-of-WAL by the replay rate to estimate the time
of arrival.  I think the result would behave a bit like the infamous
Windows file transfer dialogue ("3 seconds, not 7 months, no 4
seconds, no INFINITY, oh wait 0 seconds, you have arrived at your
destination!") due to the lumpiness of replay, though perhaps that
could be corrected by applying the right kind of smoothing to the
rate.  I thought about something like that but figured it would be
better to stick to measuring facts about the past rather than making
predictions about the future.  That's on top of the more serious
problem for the syncrep delay measurement use case, where I started
this journey, that many systems would show zero whenever you query
them because they often catch up in between writes, even though sync
rep is not free.
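A sketch of that estimator's arithmetic (the inputs are assumptions
here; real LSNs would come from two samplings of pg_stat_replication,
and the smoothing needed to avoid the lumpiness is left out):

```c
#include <stdint.h>

/*
 * Observe the standby's replay LSN twice, 'interval_seconds' apart,
 * derive a crude replay rate, and divide the remaining WAL distance
 * by it to estimate time-to-catch-up.
 */
static double
estimate_catchup_seconds(uint64_t replay_lsn_t0, uint64_t replay_lsn_t1,
                         double interval_seconds, uint64_t end_of_wal_lsn)
{
    double bytes_per_sec =
        (double) (replay_lsn_t1 - replay_lsn_t0) / interval_seconds;

    if (bytes_per_sec <= 0.0)
        return -1.0;            /* no progress observed: ETA unknown */

    return (double) (end_of_wal_lsn - replay_lsn_t1) / bytes_per_sec;
}
```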

>> 5.  The new proposal:  Show only true measured write/flush/apply data,
>> as in 1, but with a time limit.  To avoid the scenario where we show
>> the same times during prolonged periods of idleness, clear the numbers
>> like in option 3 after a period of idleness.  This way we avoid the
>> dreaded flickering/flip-flopping.  A natural time to do that is when
>> wal_receiver_status_interval expires on idle systems and defaults to
>> 10 seconds.
>>
>> Done using approach 5 in the attached version.  Do you think this is a
>> good compromise?  No bogus numbers, only true measured
>> write/flush/apply times, but a time limit on 'stale' lag information.
>
> This makes sense to me. I'd also add that while on production servers
> it's likely there'll be enough activity to keep the columns updated,
> on a quiescent test/development systems seeing a stale value looks plain
> wrong (and will cause no end of questions from people asking why lag
> is still showing when their system isn't doing anything).

Cool, and now Simon has agreed to this too.

> I suggest the documentation of these columns needs to be extended to mention
> that they will be NULL if no lag was measured recently, and to explain
> the circumstances in which the numbers are cleared.

Done in the v7.  Thanks for the feedback!

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2017-03-22 Thread Craig Ringer
On 23 March 2017 at 02:08, Simon Riggs  wrote:

> And of course, we might return "subcommitted" also, which could
> technically also be an abort in some cases, so we'd need to do a wait
> loop on that.

Users generally don't see subxact IDs, so it wasn't something I was
overly concerned about. Most notably, txid_current() doesn't report
them.

Users who want to know the commit status of an xact that had a commit
in-flight when they lost access to it due to server crash, network
loss, etc. aren't going to care about subxacts. If you lose your
connection after a RELEASE SAVEPOINT, what you care about is the fate
of the outer xact, not the state of the individual subxacts.

They're visible in heap tuples, but you can only see uncommitted heap
tuples from your own top-level xid. For anything else, if you can see
the xid in xmin you know it committed. There isn't really any reason
you'd be looking up tuple xids with txid_status anyway, and if you did
you'd have to pay attention to mess like multixacts in xmax ... but
you can't tell from the xid alone if it's a multixact or not, so this
doesn't make sense with xids that could be multis anyway.

If we're going to handle subxacts specially, we should probably report
them as "sub-committed" if we find that the current xid is a committed
subxact member of an outer xact that is still in-progress. But IMO
it's pretty pointless since you won't be dealing with subxact IDs in
the application anyway.

> Which makes me think it would be confusing to say "in progress" for
> when it is our current xid, since the user might wait until it is
> complete and then wait forever. Prefer it if it said "in progress -
> current transaction"

I'm fine with "current transaction". Though I think it's kind of a
moot point as I don't see any reason for an application to ever be
doing the equivalent of

txid_status(txid_current())

in the first place. But it's harmless enough.



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Logical decoding on standby

2017-03-22 Thread Andres Freund
On 2017-03-23 09:14:07 +0800, Craig Ringer wrote:
> On 23 March 2017 at 07:31, Andres Freund  wrote:
> > On 2017-03-23 06:55:53 +0800, Craig Ringer wrote:
> 
> >> I was thinking that by disallowing snapshot use and output plugin
> >> invocation we'd avoid the need to support cancellation on recovery
> >> conflicts, etc, simplifying things considerably.
> >
> > That seems like it'd end up being pretty hacky - the likelihood that
> > we'd run into snapbuild error cross-checks seems very high.
> 
> TBH I'm not following this. But I haven't touched snapbuild much yet,
> Petr's done much more with snapbuild than I have.

We can't just assume that snapbuild is going to work correctly when its
prerequisite - a pinned xmin horizon - isn't working.


> We're not going to have robust logical replication that's suitable for
> HA and failover use on high load systems until 2020 or so, with Pg 12.
> We'll need concurrent decoding and apply, which nobody's even started
> on AFAIK, we'll need sequence replication, and more.

These seem largely unrelated to the topic at hand (nor do I agree on
all of them).


> So I'd really, really like to get some kind of HA picture other than
> "none" in for logical decoding based systems. If it's imperfect, it's
> still something.

I still think decoding-on-standby is simply not the right approach as
the basic/first HA approach for logical rep.  It's a nice later-on
feature.  But that's an irrelevant aside.

I don't understand why you're making a "fundamental" argument here - I'm
not arguing against the goals of the patch at all. I want as much stuff
committed as we can get, in good shape.


> What do _you_ see as the minimum acceptable way to achieve the ability
> for a logical decoding client to follow failover of an upstream to a
> physical standby? In the end, you're one of the main people whose view
> carries weight in this area, and I don't want to develop yet another

I think your approach here wasn't that bad? There's a lot of cleaning
up/shoring up needed, and we probably need a smarter feedback system.  I
don't think anybody here has objected to the fundamental approach?


Greetings,

Andres Freund




Re: [HACKERS] UPDATE of partition key

2017-03-22 Thread Amit Langote
Hi Amit,

Thanks for the updated patch.

On 2017/03/23 3:09, Amit Khandekar wrote:
> Attached is v2 patch which implements the above optimization.

Would it be better to have at least some new tests?  Also, there are a few
places in the documentation mentioning that such updates cause an error,
which will need to be updated.  Perhaps also add some explanatory notes
about the mechanism (delete+insert), trigger behavior, caveats, etc.
There were some points discussed upthread that could be mentioned in the
documentation.

@@ -633,6 +634,9 @@ ExecDelete(ItemPointer tupleid,
 HeapUpdateFailureData hufd;
 TupleTableSlot *slot = NULL;

+if (already_deleted)
+*already_deleted = false;
+

concurrently_deleted?

@@ -962,7 +969,7 @@ ExecUpdate(ItemPointer tupleid,
 }
 else
 {
-LockTupleMode lockmode;
+LockTupleMode   lockmode;

Useless hunk.

+if (!mtstate->mt_partition_dispatch_info)
+{

Perhaps the if (pointer == NULL) style is better.

+/* root table RT index is at the head of partitioned_rels */
+if (node->partitioned_rels)
+{
+Index   root_rti;
+Oid root_oid;
+
+root_rti = linitial_int(node->partitioned_rels);
+root_oid = getrelid(root_rti, estate->es_range_table);
+root_rel = heap_open(root_oid, NoLock); /* locked by
InitPlan */
+}
+else
+root_rel = mtstate->resultRelInfo->ri_RelationDesc;

Some explanatory comments here might be good, for example, explain in what
situations node->partitioned_rels would not have been set and/or vice versa.

> Now, for
> UPDATE, ExecSetupPartitionTupleRouting() will be called only if row
> movement is needed.
> 
> We have to open an extra relation for the root partition, and keep it
> opened and its handle stored in
> mt_partition_dispatch_info[0]->reldesc. So ExecEndModifyTable() closes
> this if it is different from node->resultRelInfo->ri_RelationDesc. If
> it is same as node->resultRelInfo, it should not be closed because it
> gets closed as part of ExecEndPlan().

I guess you're referring to the following hunk.  Some comments:

@@ -2154,10 +2221,19 @@ ExecEndModifyTable(ModifyTableState *node)
  * Close all the partitioned tables, leaf partitions, and their indices
  *
  * Remember node->mt_partition_dispatch_info[0] corresponds to the root
- * partitioned table, which we must not try to close, because it is the
- * main target table of the query that will be closed by ExecEndPlan().
- * Also, tupslot is NULL for the root partitioned table.
+ * partitioned table, which should not be closed if it is the main target
+ * table of the query, which will be closed by ExecEndPlan().

The last part could be written as: because it will be closed by ExecEndPlan().

 Also, tupslot
+ * is NULL for the root partitioned table.
  */
+if (node->mt_num_dispatch > 0)
+{
+Relationroot_partition;

root_relation?

+
+root_partition = node->mt_partition_dispatch_info[0]->reldesc;
+if (root_partition != node->resultRelInfo->ri_RelationDesc)
+heap_close(root_partition, NoLock);
+}

It might be a good idea to Assert inside the if block above that
node->operation != CMD_INSERT.  Perhaps, also reflect that in the comment
above so that it's clearer.

I will set the patch to Waiting on Author.

Thanks,
Amit






Re: [HACKERS] [PATCH] Use $ parameters as replacement characters for pg_stat_statements

2017-03-22 Thread Lukas Fittl
On Sat, Mar 18, 2017 at 12:46 PM, Tom Lane  wrote:
>
> So it turns out this discussion just reinvented the alternative that
> Lukas had in his 0002 proposal.  Are there any remaining objections
> to proceeding with that approach?
>

Thanks for reviewing - updated patch attached, comments below.


> In a quick read of the patch, I note that it falsifies and fails to
> update the header comment for generate_normalized_query:
>
>  * *query_len_p contains the input string length, and is updated with
>  * the result string length (which cannot be longer) on exit.
>
> It doesn't look like the calling code depends (any more?) on the
> assumption that the string doesn't get longer, but it would be good
> to double-check that.
>

Fixed.


> I'd just add, say, "10 * clocations_count" to the allocation length.
> We can afford to waste a few bytes on this transient copy of the query
> text.
>

I've changed this, although I've kept this somewhat dynamic, to avoid
having an accidental overrun in edge cases:

1 + floor(log10(jstate->clocations_count +
jstate->highest_extern_param_id));


> The documentation is overly vague about what the parameter symbols are,
> in particular failing to explain that their numbers start from after
> the last $n occurring in the original query text.
>

Updated the docs to clarify this.


> It seems like a good idea to have a regression test case demonstrating
> that, too.
>

I already had a separate patch that adds more regression tests,
including one for prepared statements, in the initial email.

Merged together now into one patch to keep it simple, and added another
constant value to the prepared statement test case. I've also included
a commit message in the patch file, if helpful.

Thanks,
Lukas

-- 
Lukas Fittl


0002_pgss_mask_with_incrementing_params.v2.patch
Description: Binary data



Re: [HACKERS] Logical decoding on standby

2017-03-22 Thread Craig Ringer
On 23 March 2017 at 07:31, Andres Freund  wrote:
> On 2017-03-23 06:55:53 +0800, Craig Ringer wrote:

>> I was thinking that by disallowing snapshot use and output plugin
>> invocation we'd avoid the need to support cancellation on recovery
>> conflicts, etc, simplifying things considerably.
>
> That seems like it'd end up being pretty hacky - the likelihood that
> we'd run into snapbuild error cross-checks seems very high.

TBH I'm not following this. But I haven't touched snapbuild much yet,
Petr's done much more with snapbuild than I have.

We're not going to have robust logical replication that's suitable for
HA and failover use on high load systems until 2020 or so, with Pg 12.
We'll need concurrent decoding and apply, which nobody's even started
on AFAIK, we'll need sequence replication, and more.

So I'd really, really like to get some kind of HA picture other than
"none" in for logical decoding based systems. If it's imperfect, it's
still something.

I wish we'd just proceeded with failover slots. They were blocked in
favour of decoding on standby, and HA is possible if we have decoding
on standby with some more work by the application. But now we'll have
neither. If we'd just done failover slots we could've had logical
replication able to follow failover in Pg 10.

What do _you_ see as the minimum acceptable way to achieve the ability
for a logical decoding client to follow failover of an upstream to a
physical standby? In the end, you're one of the main people whose view
carries weight in this area, and I don't want to develop yet another
approach only to have it knocked back once the work is done.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Logical decoding on standby

2017-03-22 Thread Craig Ringer
On 23 March 2017 at 00:17, Andres Freund  wrote:
> On 2017-03-22 15:59:42 +, Simon Riggs wrote:
>> On 22 March 2017 at 13:06, Andres Freund  wrote:
>>
>> >> The parts I think are important for Pg10 are:
>> >
>> >> * Ability to create logical slots on replicas
>> >
>> > Doesn't this also imply recovery conflicts on DROP DATABASE?
>>
>> Not needed until the slot is in use, which is a later patch.
>
> Hm? We need to drop slots, if they can exist / be created, on a standby,
> and they're on a dropped database.  Otherwise we'll reserve resources,
> while anyone connecting to the slot will likely just receive errors
> because the database doesn't exist anymore.  It's also one of the
> patches that can quite easily be developed / reviewed, because there
> really isn't anything complicated about it.  Most of the code is already
> in Craig's patch, it just needs some adjustments.

Right, I'm not too concerned about doing that, and it's next on my
TODO as I clean up the split patch series.

>> >> * Ability to advance (via feedback or via SQL function) - no need to
>> >> actually decode and call output plugins at all
>> >
>> > That pretty much requires decoding, otherwise you really don't know how
>> > much WAL you have to retain.
>>
>> Knowing how much WAL to retain is key.
>>
>> Why would decoding tell you how much WAL to retain?
>
> Because decoding already has the necessary logic?  (You need to retain
> enough WAL to restart decoding for all in-progress transactions etc).

Indeed; after all, standby status updates from the decoding client
only contain the *flushed* LSN. The downstream doesn't know the
restartpoint LSN, it must be tracked by the upstream.

It's also necessary to maintain our catalog_xmin correctly on the
standby so we can send it via hot_standby_feedback to a physical
replication slot used on the master, ensuring the master doesn't
remove catalog tuples we may still need.

> I don't know what you're on about with that statement.  I've spent a
> good chunk of time looking at the 0003 patch, even though it's large
> and contains a lot of different things.  I suggested splitting things up.
> I even suggested what to move earlier after Craig agreed with that
> sentiment, in the mail you're replying to, because it seems
> independently doable.

I really appreciate the review, as I'm all too aware of how time
consuming it can be.

From my PoV, the difficulty I'm in is that this patch series has
languished for most of the Pg 10 release cycle with no real input from
stakeholders in the logical decoding area, so while the review is
important, the fact that it's now means that it pretty comprehensively
blocks the patch for Pg 10. I asked on list for input on structure
(if/how to split it up) literally months ago, for example.

I've been struggling to get some kind of support for logical decoding
on standbys for most of two release cycles, and there are people
climbing the walls wanting it. I'm trying to make sure it's done
right, but I can't do that alone, and it's hard to progress when I
don't know what will be expected until it's too late to do anything
about it.

I guess all we can do at this point is get the foundations in place
and carry on for Pg 11, where the presence of in-core logical
replication will offer a lever to actually push this in. In the mean
time I'll have to continue carrying the out-of-tree failover slots
patch for people who use logical decoding and want it to be reliable.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Partitioned tables and relfilenode

2017-03-22 Thread Amit Langote
> On Tue, Mar 21, 2017 at 10:37 AM, Robert Haas  wrote:
>> On Tue, Mar 21, 2017 at 9:49 AM, Robert Haas  wrote:
>>> On Tue, Mar 21, 2017 at 5:05 AM, Amit Langote
>>>  wrote:
 Attached updated patches.
>>>
>>> Committed 0001 after removing a comma.
>>
>> Regarding 0002,

Thanks for the review.

>> I notice this surprising behavior:
>>
>> rhaas=# create table foo (a int, b text) partition by list (a);
>> CREATE TABLE
>> rhaas=# select relfilenode from pg_class where oid = 'foo'::regclass;
>>  relfilenode
>> -
>>16385
>> (1 row)
>>
>> Why do we end up with a relfilenode if there's no storage?  I would
>> have expected to see 0 there.

Hmm, I see that happening even with views (and other relkinds that do not
have storage):

create table foo (a int);
create view foov as select * from foo;
select relfilenode from pg_class where relname = 'foov';
-[ RECORD 1 ]--
relfilenode | 24606

create type typ as (a int);
select relfilenode from pg_class where relname = 'typ';
-[ RECORD 1 ]--
relfilenode | 24610


After staring at the relevant code, fixing things so that relfilenode is 0
for relations without storage seems to be slightly involved. In the header
comment of relmapper.c, there is a line:

Mapped catalogs have zero in their pg_class.relfilenode entries.

which is kind of significant for the relcache initialization code.
RelationInitPhysicalAddr() called when building a relcache entry
initializes the rd_rel.relfilenode from pg_class.relfilenode if it's valid
and from relation mapper if invalid (i.e. 0).  So when relcache entries
for the mapped catalogs are being built, pg_class.relfilenode being
InvalidOid is a signal to get the same from the relmapper.  Since the same
code path is exercised in other cases, all relations supposedly without
storage would wrongly be looked up in the relmapper.  It may be possible
to fix that, but will require non-trivial rearrangement of relevant code.

>> Other than that, there's not much to see here.  I'm a little worried
>> that you might have missed some case that can result in an access to
>> the file, but I can't find one.  Stuff I tried:
>>
>> VACUUM
>> VACUUM FULL
>> CLUSTER
>> ANALYZE
>> CREATE INDEX
>> ALTER TABLE .. ALTER COLUMN .. TYPE
>> TRUNCATE
>>
>> It would be good to go through and make sure all of those - and any
>> others you can think of - are represented in the regression tests.

We now have tests that cover commands that previously would try to access
file for partitioned tables (most of those you listed above).  The commit
yesterday to eliminate scan nodes is also covered by tests.

Speaking of - do you think the following error message is reasonable when
a partitioned table is passed to pg_prewarm():

select pg_prewarm('p');
ERROR:  fork "main" does not exist for this relation

It's the same if a view name is passed.

>> Also, a documentation update is probably in order to explain that
>> we're not going to accept heap_reloptions on the parent because the
>> parent has no storage; such options must be set on the child in order
>> to have effect.

Hmm, actually I am no longer sure if we should completely get rid of the
reloptions.  Some options like fillfactor don't make sense, but we might
want to use the values for autovacuum_* options when we will improve
autovacuum's handling of partitioned tables.  Maybe even parallel_workers
could be useful to specify for parent tables.  So, I took out reloptions.c
changes from the patch for now and documented that specifying the storage
options for partitioned parents is ineffective currently.  Does that make
sense?

Thanks,
Amit
>From aed0685b6bfb99a301e674221fb393cc4e660461 Mon Sep 17 00:00:00 2001
From: amit 
Date: Tue, 24 Jan 2017 14:22:34 +0900
Subject: [PATCH] Do not allocate storage for partitioned tables.

Currently, it is not possible to insert any data into a partitioned
table.  So they're empty at all times, which means it is wasteful to
allocate relfilenode and related storage objects.

Also documented the fact that specifying storage options for partitioned
tables is ineffective at the moment.
---
 doc/src/sgml/ref/create_table.sgml |  6 ++
 src/backend/catalog/heap.c | 17 ++---
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/doc/src/sgml/ref/create_table.sgml b/doc/src/sgml/ref/create_table.sgml
index 9ed25c05da..7ac55b17fa 100644
--- a/doc/src/sgml/ref/create_table.sgml
+++ b/doc/src/sgml/ref/create_table.sgml
@@ -1040,6 +1040,12 @@ CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [ IF NOT EXI
 will use the table's parameter value.

 
+   
+Note that specifying the following parameters for partitioned tables has
+no effect at the moment.  You must specify them for individual leaf
+partitions if necessary.
+   
+

 

diff --git a/src/backend/catalog/heap.c 

Re: [HACKERS] Measuring replay lag

2017-03-22 Thread Thomas Munro
On Thu, Mar 23, 2017 at 12:12 AM, Simon Riggs  wrote:
> On 22 March 2017 at 11:03, Thomas Munro  wrote:
>
>> Hah.  Apologies for the delay -- I will post a patch with
>> documentation as requested within 24 hours.
>
> Thanks very much. I'll reserve time to commit it tomorrow, all else being 
> good.

Thanks!  Please find attached v7, which includes a note we can point
at when someone asks why it doesn't show 00:00:00, as requested.

-- 
Thomas Munro
http://www.enterprisedb.com


replication-lag-v7.patch
Description: Binary data



Re: [HACKERS] Parallel Append implementation

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 4:49 AM, Amit Khandekar  wrote:
> Attached is the updated patch that handles the changes for all the
> comments except the cost changes part. Details about the specific
> changes are after the cost-related points discussed below.
>
> For non-partial paths, I was checking following 3 options :
>
> Option 1. Just take the sum of total non-partial child costs and
> divide it by number of workers. It seems to be getting close to the
> actual cost.

If the costs for all children are about equal, then that works fine.
But when they are very unequal, then it's highly misleading.

> Option 2. Calculate exact cost by an algorithm which I mentioned
> before, which is pasted below for reference :
> Per-subpath cost : 20 16 10 8 3 1, with 3 workers.
> After 10 time units (this is minimum of first 3 i.e. 20, 16, 10), the
> times remaining are :
> 10  6  0 8 3 1
> After 6 units (minimum of 10, 6, 8), the times remaining are :
> 4  0  0 2 3 1
> After 2 units (minimum of 4, 2, 3), the times remaining are :
>  2  0  0 0 1 1
> After 1 units (minimum of 2, 1, 1), the times remaining are :
>  1  0  0 0 0 0
> After 1 units (minimum of 1, 0 , 0), the times remaining are :
>  0  0  0 0 0 0
> Now add up the above time chunks : 10 + 6 + 2 + 1 + 1 = 20

This gives the same answer as what I was proposing, but I believe it's
more complicated to compute.  The way my proposal would work in this
case is that we would start with an array C[3] (since there are three
workers), with all entries 0.  Logically C[i] represents the amount of
work to be performed by worker i.  We add each path in turn to the
worker whose array entry is currently smallest; in the case of a tie,
just pick the first such entry.

So in your example we do this:

C[0] += 20;
C[1] += 16;
C[2] += 10;
/* C[2] is smaller than C[0] or C[1] at this point, so we add the next
path to C[2] */
C[2] += 8;
/* after the previous line, C[1] is now the smallest, so add to that
entry next */
C[1] += 3;
/* now we've got C[0] = 20, C[1] = 19, C[2] = 18, so add to C[2] */
C[2] += 1;
/* final result: C[0] = 20, C[1] = 19, C[2] = 19 */

Now we just take the highest entry that appears in any array, which in
this case is C[0], as the total cost.

Comments on this latest version:

In my previous review, I said that you should "define and document a
new builtin tranche ID"; you did the first but not the second.  See
the table in monitoring.sgml.

Definition of exec_append_goto_next_plan should have a line break
after the return type, per usual PostgreSQL style rules.

- * initialize to scan first subplan
+ * In case it's a sequential Append, initialize to scan first subplan.

This comment is confusing because the code is executed whether it's
parallel or not.  I think it might be better to write something like
"initialize to scan first subplan (but note that we'll override this
later in the case of a parallel append)"

 /*
+ * Check if we are already finished plans from parallel append. This
+ * can happen if all the subplans are finished when this worker
+ * has not even started returning tuples.
+ */
+if (node->as_padesc && node->as_whichplan == PA_INVALID_PLAN)
+return ExecClearTuple(node->ps.ps_ResultTupleSlot);

There seems to be no reason why this couldn't be hoisted out of the
loop.  Actually, I think Ashutosh pointed this out before, but I
didn't understand at that time what his point was.  Looking back, I
see that he also pointed out that the as_padesc test isn't necessary,
which is also true.

+if (node->as_padesc)
+node->as_padesc->pa_finished[node->as_whichplan] = true;

I think you should move this logic inside exec_append_parallel_next.
That would avoid testing node->pa_desc an extra time for non-parallel
append.  I note that the comment doesn't explain why it's safe to do
this without taking the lock.  I think we could consider doing it with
the lock held, but it probably is safe, because we're only setting it
from false to true.  If someone else does the same thing, that won't
hurt anything, and if someone else fails to see our update, then the
worst-case scenario is that they'll try to execute a plan that has no
chance of returning any more rows.  That's not so bad.  Actually,
looking further, you do have a comment explaining that, but it's in
exec_append_parallel_next() where the value is used, rather than here.

+memset(padesc->pa_finished, 0, sizeof(bool) * node->as_nplans);
+
+shm_toc_insert(pcxt->toc, node->ps.plan->plan_node_id, padesc);
+node->as_padesc = padesc;

Putting the shm_toc_insert call after we fully initialize the
structure seems better than putting it after we've done some of the
initialization and before we've done the rest.

+/* Choose the optimal subplan to be executed. */

I think the word "first" would be more accurate than "optimal".  We
can only hope to pick the optimal one, but whichever one we pick 

Re: [HACKERS] Hash support for grouping sets

2017-03-22 Thread Andres Freund
Hi,

> +/*
> + * Switch to phase "newphase", which must either be 0 or 1 (to reset) or
>   * current_phase + 1. Juggle the tuplesorts accordingly.
>   */
>  static void
>  initialize_phase(AggState *aggstate, int newphase)
>  {
> - Assert(newphase == 0 || newphase == aggstate->current_phase + 1);
> + Assert(newphase <= 1 || newphase == aggstate->current_phase + 1);

I think this somewhere in the file header needs an expanded explanation
about what these "special" phases mean.


> @@ -838,17 +897,21 @@ advance_transition_function(AggState *aggstate,
>  /*
>   * Advance each aggregate transition state for one input tuple.  The input
>   * tuple has been stored in tmpcontext->ecxt_outertuple, so that it is
> - * accessible to ExecEvalExpr.  pergroup is the array of per-group structs to
> - * use (this might be in a hashtable entry).
> + * accessible to ExecEvalExpr.
> + *
> + * We have two sets of transition states to handle: one for sorted aggregation
> + * and one for hashed; we do them both here, to avoid multiple evaluation of
> + * the inputs.
>   *
>   * When called, CurrentMemoryContext should be the per-query context.
>   */

Changes to advance_aggregates() are, in my experience, quite likely to
have performance effects.  This needs some performance tests.

I scheduled a small TPCH comparison run:

master q01 min: 28924.766 dev-hash min: 28893.923 [diff -0.11]
master q02 min: 2448.135 dev-hash min: 2430.589 [diff -0.72]
master q03 min: 14596.292 dev-hash min: 14421.054 [diff -1.22]
master q04 min: 1999.684 dev-hash min: 1958.727 [diff -2.09]
master q05 min: 10638.148 dev-hash min: 10753.075 [diff +1.07]
master q06 min: 3679.955 dev-hash min: 3649.114 [diff -0.85]
master q07 min: 10097.272 dev-hash min: 10358.089 [diff +2.52]
master q08 min: 3103.698 dev-hash min: 3078.108 [diff -0.83]
master q09 min: 13034.87 dev-hash min: 13049.402 [diff +0.11]
master q10 min: 11702.966 dev-hash min: 11535.05 [diff -1.46]
master q11 min: 577.114 dev-hash min: 586.767 [diff +1.65]
master q12 min: 9087.413 dev-hash min: 9272.874 [diff +2.00]
master q13 min: 16353.813 dev-hash min: 16473.882 [diff +0.73]
master q14 min: 1564.321 dev-hash min: 1552.31 [diff -0.77]
master q15 min: 3958.565 dev-hash min: 3946.728 [diff -0.30]
master q16 min: 3793.345 dev-hash min: 3784.996 [diff -0.22]
master q17 min: 828.286 dev-hash min: 834.929 [diff +0.80]
master q18 min: 13473.084 dev-hash min: 13777.327 [diff +2.21]
master q19 min: 654.241 dev-hash min: 673.863 [diff +2.91]
master q20 min: 2906.811 dev-hash min: 2882.793 [diff -0.83]
master q22 min: 1226.397 dev-hash min: 1262.045 [diff +2.82]

master total min: 154649.176 dev-hash min: 155175.645 [diff +0.34]

Looks like it could all be noise, but it seems worthwhile to look into
some of these.



>   * GROUP BY columns.  The per-group data is allocated in lookup_hash_entry(),
>   * for each entry.
>   *
> - * The hash table always lives in the aggcontext memory context.
> + * We have a separate hashtable and associated perhash data structure for each
> + * grouping set for which we're doing hashing.
> + *
> + * The hash tables always live in the hashcontext's per-tuple memory context
> + * (there is only one of these for all tables together, since they are all
> + * reset at the same time).
>   */

I've not yet read up on the thread, or the whole patch, but an
explanation of the memory management for all those tables would be good
somewhere (maybe I've just not read that far).


> @@ -2388,7 +2597,36 @@ agg_retrieve_hash_table(AggState *aggstate)
>   * ExecInitAgg
>   *
>   *   Creates the run-time information for the agg node produced by the
> - *   planner and initializes its outer subtree
> + *   planner and initializes its outer subtree.
> + *
> + *   What we get from the planner is actually one "real" Agg node which is part
> + *   of the plan tree proper, but which optionally has an additional list of Agg
> + *   nodes hung off the side via the "chain" field.  This is because an Agg node
> + *   happens to be a convenient representation of all the data we need for
> + *   grouping sets.
> + *
> + *   For many purposes, we treat the "real" node as if it were just the first
> + *   node in the chain.  The chain must be ordered such that hashed entries come
> + *   before sorted/plain entries; the real node is marked AGG_MIXED if there are
> + *   mixed types (in which case the real node describes one of the hashed

"mixed types" sounds a bit weird.


> + *   We collect all hashed nodes into a single "phase", numbered 0, and create a
> + *   sorted phase (numbered 1..n) for each AGG_SORTED or AGG_PLAIN node.  Phase
> + *   0 is allocated even if there are no hashes, but remains unused in that
> + *   case.

Ah, here's the explanation I'd been looking for ;)



> +Bitmapset *
> +DiscreteKnapsack(int max_weight, int num_items,
> +  int *item_weights, double *item_values)
> +{
> + MemoryContext local_ctx = 

Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Stephen Frost
David,

* David Steele (da...@pgmasters.net) wrote:
> On 3/22/17 3:45 PM, Robert Haas wrote:
> >On Wed, Mar 22, 2017 at 3:24 PM, David Steele  wrote:
> >>>One of the reasons to go with the LSN is that we would actually be
> >>>maintaining what happens when the WAL files are 16MB in size.
> >>>
> >>>David's initial expectation was this for 64MB WAL files:
> >>>
> >>>00010040
> >>>00010080
> >>>000100C0
> >>>00010001
> >>
> >>
> >>This is the 1GB sequence, actually, but idea would be the same for 64MB
> >>files.
> >
> >Wait, really?  I thought you abandoned this approach because there's
> >then no principled way to handle WAL segments of less than the default
> >size.
> 
> I did say that, but I thought I had hit on a compromise.

Strikes me as one, at least.

> But, as I originally pointed out the hex characters in the filename
> are not aligned correctly for > 8 bits (< 16MB segments) and using
> different alignments just made it less consistent.
> 
> It would be OK if we were willing to drop the 1,2,4,8 segment sizes
> because then the alignment would make sense and not change the
> current 16MB sequence.

For my 2c, at least, it seems extremely unlikely that people are using
smaller-than-16MB segments.  Also, we don't have to actually drop
support for those sizes, just handle the numbering differently, if we
feel like they're useful enough to keep - in particular I was thinking we
could make the filename one digit longer, or shift the numbers up one
position, but my general feeling is that it wouldn't ever be an
exercised use-case and therefore we should just drop support for them.

Perhaps I'm being overly paranoid, but I share David's concern about
non-standard/non-default WAL sizes being a serious risk due to lack of
exposure for those code paths, which is another reason that we should
change the default if we feel it's valuable to have a large WAL segment,
not just create this option which users can set at initdb time but which
we very rarely actually test to ensure it's working.

With any of these we need to have some buildfarm systems which are at
*least* running our regression tests against the different options, if
we would consider telling users to use them.

> Even then, there are some interesting side effects.  For 1GB
> segments the "0001000100C0" segment would include LSNs
> 1/C000 through 1/.  This is correct but is not an
> obvious filename to LSN mapping, at least for LSNs that appear later
> in the segment.

That doesn't seem unreasonable to me.  If we're going to use the
starting LSN of the segment then it's going to skip when you start
varying the size of the segment.  Even keeping the current scheme we end
up with skipping; it's just different skipping that goes
"1, 2, 3, skip to the next!" where how high the count goes depends on
the size.  With this approach, we're consistently skipping the same
amount, which is exactly the size divided by 16MB, always.

I do also like Peter's suggestion of using a separator between the
components of the WAL filename, but that would change the naming for
everyone, which is a concern I can understand us wishing to avoid.

From a user-experience point of view, keeping the mapping from the WAL
filename to the starting LSN is quite nice, even if this change might
complicate the backend code a bit.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] WIP: [[Parallel] Shared] Hash

2017-03-22 Thread Peter Geoghegan
On Wed, Mar 22, 2017 at 3:17 AM, Thomas Munro
 wrote:
>> If I follow the new code correctly, then it doesn't matter that you've
>> unlink()'d to take care of the more obvious resource management chore.
>> You can still have a reference leak like this, if I'm not mistaken,
>> because you still have backend local state (local VfdCache) that is
>> left totally decoupled with the new "shadow resource manager" for
>> shared BufFiles.
>
> You're right.  The attached version fixes these problems.  The
> BufFiles created or opened in this new way now participate in both of
> our leak-detection and clean-up schemes:  the one in resowner.c
> (because I'm now explicitly registering with it as I had failed to do
> before) and the one in CleanupTempFiles (because FD_CLOSE_AT_EOXACT is
> set, which I already had in the previous version for the creator, but
> not the opener of such a file).  I tested by commenting out my
> explicit BufFileClose calls to check that resowner.c starts
> complaining, and then by commenting out the resowner registration too
> to check that CleanupTempFiles starts complaining.

I took a quick look at your V9 today. This is by no means a
comprehensive review of 0008-hj-shared-buf-file-v9.patch, but it's
what I can manage right now.

The main change you made is well represented by the following part of
the patch, where you decouple close at eoXact with delete at eoXact,
with the intention of doing one but not the other for BufFiles that
are shared:

>  /* these are the assigned bits in fdstate below: */
> -#define FD_TEMPORARY   (1 << 0)/* T = delete when closed */
> -#define FD_XACT_TEMPORARY  (1 << 1)/* T = delete at eoXact */
> +#define FD_DELETE_AT_CLOSE (1 << 0)/* T = delete when closed */
> +#define FD_CLOSE_AT_EOXACT (1 << 1)/* T = close at eoXact */
> +#define FD_TEMP_FILE_LIMIT (1 << 2)/* T = respect temp_file_limit */

So, shared BufFile fd.c segments within backend local resource manager
do not have FD_DELETE_AT_CLOSE set, because you mean to do that part
yourself by means of generic shared cleanup through dynamic shared
memory segment callback. So far so good.

However, I notice that the place that this happens instead,
PathNameDelete(), does not repeat the fd.c step of doing a final
stat(), and using the stats for a pgstat_report_tempfile(). So, a new
pgstat_report_tempfile() call is simply missing. However, the more
fundamental issue in my mind is: How can you fix that? Where would it
go if you had it?

If you do the obvious thing of just placing that before the new
unlink() within PathNameDelete(), on the theory that that needs parity
with the fd.c stuff, that has non-obvious implications. Does the
pgstat_report_tempfile() call need to happen when going through this
path, for example?:

> +/*
> + * Destroy a shared BufFile early.  Files are normally cleaned up
> + * automatically when all participants detach, but it might be useful to
> + * reclaim disk space sooner than that.  The caller asserts that no backends
> + * will attempt to read from this file again and that only one backend will
> + * destroy it.
> + */
> +void
> +SharedBufFileDestroy(SharedBufFileSet *set, int partition, int participant)
> +{

The theory with the unlink()'ing function PathNameDelete(), I
gather, is that it doesn't matter if it happens to be called more than
once, say from a worker and then in an error handling path in the
leader or whatever. Do I have that right?

Obviously the concern I have about that is that any stat() call you
might add for the benefit of a new pgstat_report_tempfile() call,
needed to keep parity with fd.c, now has a risk of double counting in
error paths, if I'm not mistaken. We do need to do that accounting in
the event of error, just as we do when there is no error, at least if
current stats collector behavior is to be preserved. How can you
determine which duplicate call here is the duplicate? In other words,
how can you figure out which one is not supposed to
pgstat_report_tempfile()? If the size of temp files in each worker is
unknowable to the implementation in error paths, does it not follow
that it's unknowable to the user that queries pg_stat_database?

Now, I don't imagine that this should stump you. Maybe I'm wrong about
that possibility (that you cannot have exactly once
unlink()/stat()/whatever), or maybe I'm right and you can fix it while
preserving existing behavior, for example by relying on unlink()
reliably failing when called a second time, no matter how tight any
race was. What exact semantics does unlink() have with concurrency, as
far as the link itself goes?

If I'm not wrong about the general possibility, then maybe the
existing behavior doesn't need to be preserved in error paths, which
are after all exceptional -- it's not as if the statistics collector
is currently highly reliable. It's not obvious that you are
deliberately accepting of any of these risks or costs, though, which I
think needs to 

Re: [HACKERS] Enabling parallelism for queries coming from SQL or other PL functions

2017-03-22 Thread Dilip Kumar
On Wed, Mar 22, 2017 at 10:33 PM, Robert Haas  wrote:
> So couldn't we actually make this test !fcache->returnsSet || !es->lazyEval?
> That would let us allow parallel execution for all non-set-returning
> functions, and also for set-returning functions that end up with
> es->lazyEval set to false.

Yes, this is the right thing to do, although we may not enable
parallelism for any more queries by adding "|| !es->lazyEval", because
SELECTs are always marked es->lazyEval=true (and so far we have
parallelism only for SELECT).  But since we are passing this parameter
to ExecutorRun as execute_once, !fcache->returnsSet || !es->lazyEval
is the correct test and is future-proof.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical decoding on standby

2017-03-22 Thread Andres Freund
On 2017-03-23 06:55:53 +0800, Craig Ringer wrote:
> On 22 March 2017 at 21:06, Andres Freund  wrote:
> > Hi,
> >
> > On 2017-03-21 09:05:26 +0800, Craig Ringer wrote:
> >> > 0002 should be doable as a whole this release, I have severe doubts that
> >> > 0003 as a whole has a chance for 10 - the code is in quite a raw shape,
> >> > there's a significant number of open ends.  I'd suggest breaking off bits
> >> > that are independently useful, and work on getting those committed.
> >>
> >> That would be my preference too.
> >
> >
> >> The parts I think are important for Pg10 are:
> >
> >> * Ability to create logical slots on replicas
> >
> > Doesn't this also imply recovery conflicts on DROP DATABASE?  Besides,
> > allowing to drop all slots using a database upon DROP DATABASE, is a
> > useful thing on its own.
> 
> Definitely beneficial, otherwise recovery will stop until you drop
> slots, which isn't ideal.

s/isn't ideal/not acceptable/ ;)


> >> * Ability to advance (via feedback or via SQL function) - no need to
> >> actually decode and call output plugins at all
> >
> > That pretty much requires decoding, otherwise you really don't know how
> > much WAL you have to retain.
> 
> Yes, and to update restart_lsn and catalog_xmin correctly.

> I was thinking that by disallowing snapshot use and output plugin
> invocation we'd avoid the need to support cancellation on recovery
> conflicts, etc, simplifying things considerably.

That seems like it'd end up being pretty hacky - the likelihood that
we'd run into snapbuild error cross-checks seems very high.

Greetings,

Andres Freund




Re: [HACKERS] Logical decoding on standby

2017-03-22 Thread Craig Ringer
On 22 March 2017 at 21:06, Andres Freund  wrote:
> Hi,
>
> On 2017-03-21 09:05:26 +0800, Craig Ringer wrote:
>> > 0002 should be doable as a whole this release, I have severe doubts that
>> > 0003 as a whole has a chance for 10 - the code is in quite a raw shape,
> >> > there's a significant number of open ends.  I'd suggest breaking off bits
>> > that are independently useful, and work on getting those committed.
>>
>> That would be my preference too.
>
>
>> The parts I think are important for Pg10 are:
>
>> * Ability to create logical slots on replicas
>
> Doesn't this also imply recovery conflicts on DROP DATABASE?  Besides,
> allowing to drop all slots using a database upon DROP DATABASE, is a
> useful thing on its own.

Definitely beneficial, otherwise recovery will stop until you drop
slots, which isn't ideal.

>> * Ability to advance (via feedback or via SQL function) - no need to
>> actually decode and call output plugins at all
>
> That pretty much requires decoding, otherwise you really don't know how
> much WAL you have to retain.

Yes, and to update restart_lsn and catalog_xmin correctly.

I was thinking that by disallowing snapshot use and output plugin
invocation we'd avoid the need to support cancellation on recovery
conflicts, etc, simplifying things considerably.

>> * Ability to drop logical slots on replicas
>
> That shouldn't actually require any changes, no?

It doesn't, it works as-is. I have NFI why I wrote that.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] exposing wait events for non-backends (was: Tracking wait event for latches)

2017-03-22 Thread Michael Paquier
On Thu, Mar 23, 2017 at 1:24 AM, Robert Haas  wrote:
> I mean, your argument boils down to "somebody might want to
> deliberately hide things from pg_stat_activity".  But that's not
> really a mode we support in general, and supporting it only for
> certain cases doesn't seem like something that this patch should be
> about.  We could add an option to BackgroundWorkerInitializeConnection
> and BackgroundWorkerInitializeConnectionByOid to suppress it, if it's
> something that somebody wants, but actually I'd be more inclined to
> think that everybody (who has a shared memory connection) should go
> into the machinery and then security-filtering should be left to some
> higher-level facility that can make policy decisions rather than being
> hard-coded in the individual modules.
>
> But I'm slightly confused as to how this even arises.  Background
> workers already show up in pg_stat_activity output, or at least I sure
> think they do.  So why does this patch need to make any change to that
> case at all?

When working on a couple of bgworkers some time ago, I recalled that
they only showed up in pg_stat_activity if they called
pgstat_report_activity(). Looking again just now, I was evidently
mistaken: they do indeed show up even if WaitLatch() or
pgstat_report_activity() is never used. Please disregard that
remark.
-- 
Michael




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread David Steele

Hi Robert,

On 3/22/17 3:45 PM, Robert Haas wrote:

On Wed, Mar 22, 2017 at 3:24 PM, David Steele  wrote:

One of the reasons to go with the LSN is that we would actually be
maintaining what happens when the WAL files are 16MB in size.

David's initial expectation was this for 64MB WAL files:

00010040
00010080
000100C0
00010001



This is the 1GB sequence, actually, but the idea would be the same for
64MB files.


Wait, really?  I thought you abandoned this approach because there's
then no principled way to handle WAL segments of less than the default
size.


I did say that, but I thought I had hit on a compromise.

But, as I originally pointed out, the hex characters in the filename are 
not aligned correctly for > 8 bits (< 16MB segments), and using different 
alignments just made it less consistent.


It would be OK if we were willing to drop the 1,2,4,8 segment sizes 
because then the alignment would make sense and not change the current 
16MB sequence.


Even then, there are some interesting side effects.  For 1GB segments 
the "0001000100C0" segment would include LSNs 1/C000 
through 1/.  This is correct but is not an obvious filename-to-LSN 
mapping, at least for LSNs that appear later in the segment.


--
-David
da...@pgmasters.net




Re: [HACKERS] PATCH: Make pg_stop_backup() archive wait optional

2017-03-22 Thread David Steele

On 3/22/17 4:42 PM, Peter Eisentraut wrote:

On 3/22/17 15:14, Stephen Frost wrote:

-SELECT * FROM pg_stop_backup(false);
+SELECT * FROM pg_stop_backup(false [, true ]);

I think that it's better to get rid of "[" and "]" from the above because
IMO this should be the command example that users actually can run.

Using the '[' and ']' are how all of the optional arguments are
specified in the documentation, see things like current_setting() in our
existing documentation:


In the synopsis, but not in concrete examples.


Wasn't quite sure what the protocol was in this case, and figured it was 
easier to subtract than to add.


--
-David
da...@pgmasters.net




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread David Steele

On 3/22/17 3:39 PM, Peter Eisentraut wrote:

On 3/22/17 15:37, Peter Eisentraut wrote:

If changing WAL sizes catches on, I do think we should keep thinking
about a new format for a future release,


I think that means that I'm skeptical about changing the default size
right now.


I think if we don't change the default size it's very unlikely I would 
use alternate WAL segment sizes or recommend that anyone else does, at 
least in v10.


I simply don't think it would get the level of testing required to be 
production worthy and I doubt that most tool writers would be quick to 
add support for a feature that very few people (if any) use.


--
-David
da...@pgmasters.net




Re: [HACKERS] [PATCH v1] Add and report the new "in_hot_standby" GUC pseudo-variable.

2017-03-22 Thread Elvis Pranskevichus
On Wednesday, March 22, 2017 4:28:18 PM EDT Robert Haas wrote:
> On Wed, Mar 22, 2017 at 4:00 PM, Peter Eisentraut 
> > I think we could use "in_recovery", which would be consistent with
> > existing naming.
> 
> True.

Ironically, that was the name I originally used.  I'll update the patch.
 
> (Jaime's question is also on point, I think.)

The main (and only) point of this patch is to avoid polling.  Since 
"in_recovery" is marked as GUC_REPORT, it will be sent to the client 
asynchronously in a ParamStatus message.  Other GUC_REPORT variables set 
a good precedent.

My argument is that Hot Standby is a major mode of operation, which 
significantly alters how connected clients work with the server, and 
this bit of knowledge is no less important than the other GUC_REPORT 
vars.

  Elvis





Re: [HACKERS] PATCH: Make pg_stop_backup() archive wait optional

2017-03-22 Thread Stephen Frost
* Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
> On 3/22/17 15:14, Stephen Frost wrote:
> >> -SELECT * FROM pg_stop_backup(false);
> >> +SELECT * FROM pg_stop_backup(false [, true ]);
> >>
> >> I think that it's better to get rid of "[" and "]" from the above because
> >> IMO this should be the command example that users actually can run.
> > Using the '[' and ']' are how all of the optional arguments are
> > specified in the documentation, see things like current_setting() in our
> > existing documentation:
> 
> In the synopsis, but not in concrete examples.

Oh, right, yes.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] PATCH: Make pg_stop_backup() archive wait optional

2017-03-22 Thread Peter Eisentraut
On 3/22/17 15:14, Stephen Frost wrote:
>> -SELECT * FROM pg_stop_backup(false);
>> +SELECT * FROM pg_stop_backup(false [, true ]);
>>
>> I think that it's better to get rid of "[" and "]" from the above because
>> IMO this should be the command example that users actually can run.
> Using the '[' and ']' are how all of the optional arguments are
> specified in the documentation, see things like current_setting() in our
> existing documentation:

In the synopsis, but not in concrete examples.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH v1] Add and report the new "in_hot_standby" GUC pseudo-variable.

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 4:00 PM, Peter Eisentraut
 wrote:
> On 3/22/17 14:09, Robert Haas wrote:
>>> The opposite means primary.  I can flip the GUC name to "is_primary", if
>>> that's clearer.
>> Hmm, I don't find that clearer.  "hot standby" has a very specific
>> meaning; "primary" isn't vague, but I would say it's less specific.
>
> The problem I have is that there is already a GUC named "hot_standby",
> which determines whether an instance is in hot (as opposed to warm?)
> mode if it is a standby.  This is proposing to add a setting named
> "in_hot_standby" which says nothing about the hotness, but something
> about the standbyness.  Note that these are all in the same namespace.

Good point.

> I think we could use "in_recovery", which would be consistent with
> existing naming.

True.

(Jaime's question is also on point, I think.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Stephen Frost
* Peter Eisentraut (peter.eisentr...@2ndquadrant.com) wrote:
> The question is, which property is more useful to preserve: matching
> LSN, or having a mostly consecutive numbering.
> 
> Actually, I would really really like to have both, but if I had to pick
> one, I'd lean 55% toward consecutive numbering.

> For the issue at hand, I think it's fine to proceed with the naming
> schema that the existing compile-time option gives you.

What I don't particularly like about that is that it's *not* actually
consecutive, you end up with this:

00010001
00010002
00010003
00010001

Which is part of what I don't particularly like about this approach.

> In fact, that would flush out some of the tools that look directly at
> the file names and interpret them, thus preserving the option to move to
> a more radically different format.

This doesn't make a lot of sense to me.  If we get people to change to
using larger WAL segments and the tools are modified to understand the
pseudo-consecutive format, do we then change the format on them again
in another release or two?  I'm generally a fan of not feeling too bad
about breaking backwards compatibility, but it seems pretty rough even
to me to do so immediately.

This is exactly why I think it'd be better to work out a good naming
scheme now that actually makes sense and that we'll be able to stick
with for a while instead of rushing to get this ability in now, when
we'll have people actually starting to use it and then try to change it.

> If changing WAL sizes catches on, I do think we should keep thinking
> about a new format for a future release, because debugging will
> otherwise become a bit wild.  I'm thinking something like
> 
> {integer timeline}_{integer seq number}_{hex lsn}
> 
> might address various interests.

Right, I'd rather not have debugging WAL files become a bit wild.

If we can't work out a sensible approach to naming that we expect to
last us for at least a couple of releases for different sizes of WAL
files, then I don't think we should rush to encourage users to use
different sizes of WAL files.

Thanks!

Stephen




Re: [HACKERS] [PATCH v1] Add and report the new "in_hot_standby" GUC pseudo-variable.

2017-03-22 Thread Peter Eisentraut
On 3/22/17 14:09, Robert Haas wrote:
>> The opposite means primary.  I can flip the GUC name to "is_primary", if
>> that's clearer.
> Hmm, I don't find that clearer.  "hot standby" has a very specific
> meaning; "primary" isn't vague, but I would say it's less specific.

The problem I have is that there is already a GUC named "hot_standby",
which determines whether an instance is in hot (as opposed to warm?)
mode if it is a standby.  This is proposing to add a setting named
"in_hot_standby" which says nothing about the hotness, but something
about the standbyness.  Note that these are all in the same namespace.

I think we could use "in_recovery", which would be consistent with
existing naming.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 8:46 AM, Ashutosh Bapat
 wrote:
> Here's set of updated patches rebased on
> 1148e22a82edc96172fc78855da392b6f0015c88.
>
> I have fixed all the issues reported till now.

I don't understand why patch 0001 ends up changing every existing test
for RELOPT_JOINREL anywhere in the source tree to use IS_JOIN_REL(),
yet none of the existing tests for RELOPT_OTHER_MEMBER_REL end up
getting changed to use IS_OTHER_REL().  That's very surprising.  Some
of those tests are essentially checking for something that is going to
have a scan plan rather than a join or upper plan, and those tests
probably don't need to be modified; for example, the test in
set_rel_consider_parallel() is obviously of this type. But others are
testing whether we've got some kind of child rel, and those seem like
they might need work.  Going through a few specific examples:

- generate_join_implied_equalities_for_ecs() assumes that any child
rel is an other member rel.
- generate_join_implied_equalities_broken() assumes that any child rel
is an other member rel.
- generate_implied_equalities_for_column() set is_child_rel on the
assumption that only an other member rel can be a child rel.
- eclass_useful_for_merging() assumes that the only kind of child rel
is an other member rel.
- find_childrel_appendrelinfo() assumes that any child rel is an other
member rel.
- find_childrel_top_parent() and find_childrel_parents() assume that
children must be other member rels and their parents must be baserels.
- adjust_appendrel_attrs_multilevel() assumes that children must be
other member rels and their parents must be baserels.

It's possible that, for various reasons, none of these code paths
would ever be reachable by a child join, but it doesn't look likely to
me.  And even if that's true, some comment updates are probably
needed, and maybe some renaming of functions too.

In postgres_fdw, get_useful_ecs_for_relation() assumes that any child
rel is an other member rel.  I'm not sure if we're hoping that
partitionwise join will work with postgres_fdw's join pushdown out of
the chute, but clearly this would need to be adjusted to have any
chance of being right.

Some that seem OK:

- set_rel_consider_parallel() is fine.
- set_append_rel_size() is only going to be called for baserels or
their children, so it's fine.
- relation_excluded_by_constraints() is only intended to be called on
baserels or their children, so it's fine.
- check_index_predicates() is only intended to be called on baserels
or their children, so it's fine.
- query_planner() loops over baserels and their children, so it's fine.

Perhaps we could introduce an IS_BASEREL_OR_CHILD() test that could be
used in some of these places, just for symmetry.   The point is that
there are really three questions here: (1) is it some kind of baserel
(parent or child)? (2) is it some kind of joinrel (parent or child)?
and (3) is it some kind of child (baserel or join)?  Right now, both
#2 and #3 are tested by just comparing against
RELOPT_OTHER_MEMBER_REL, but they become different tests as soon as we
add child joinrels.  The goal of 0001, IMV, ought to be to try to
figure out which of #1, #2, and #3 is being checked in each case and
make that clear via use of an appropriate macro.  (If is-other-baserel
is the real test, then fine, but I bet that's a rare case.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 3:24 PM, David Steele  wrote:
>> One of the reasons to go with the LSN is that we would actually be
>> maintaining what happens when the WAL files are 16MB in size.
>>
>> David's initial expectation was this for 64MB WAL files:
>>
>> 00010040
>> 00010080
>> 000100C0
>> 00010001
>
>
> This is the 1GB sequence, actually, but the idea would be the same for 64MB
> files.

Wait, really?  I thought you abandoned this approach because there's
then no principled way to handle WAL segments of less than the default
size.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Peter Eisentraut
On 3/22/17 15:37, Peter Eisentraut wrote:
> If changing WAL sizes catches on, I do think we should keep thinking
> about a new format for a future release,

I think that means that I'm skeptical about changing the default size
right now.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Peter Eisentraut
On 3/22/17 15:09, Stephen Frost wrote:
> David's initial expectation was this for 64MB WAL files:
> 
> 00010040
> 00010080
> 000100C0
> 00010001
> 
> Which both matches the LSN *and* keeps the file names the same when
> they're 16MB.  This is what David's looking at writing a patch for and
> is what I think we should be considering.  This avoids breaking
> compatibility for people who choose to continue using 16MB (assuming
> we switch the default to 64MB, which I am still hopeful we will do).

The question is, which property is more useful to preserve: matching
LSN, or having a mostly consecutive numbering.

Actually, I would really really like to have both, but if I had to pick
one, I'd lean 55% toward consecutive numbering.

For the issue at hand, I think it's fine to proceed with the naming
schema that the existing compile-time option gives you.

In fact, that would flush out some of the tools that look directly at
the file names and interpret them, thus preserving the option to move to
a more radically different format.

If changing WAL sizes catches on, I do think we should keep thinking
about a new format for a future release, because debugging will
otherwise become a bit wild.  I'm thinking something like

{integer timeline}_{integer seq number}_{hex lsn}

might address various interests.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Stephen Frost
* David Steele (da...@pgmasters.net) wrote:
> On 3/22/17 3:09 PM, Stephen Frost wrote:
> >* Robert Haas (robertmh...@gmail.com) wrote:
> >>On Wed, Mar 22, 2017 at 1:49 PM, Stephen Frost  wrote:
> >>>Then perhaps we do need to be thinking of moving this to PG11 instead of
> >>>exposing an option that users will start to use which will result in WAL
> >>>naming that'll be confusing and inconsistent.  I certainly don't think
> >>>it's a good idea to move forward exposing an option with a naming scheme
> >>>that's agreed to be bad.
> >>
> >
> >One of the reasons to go with the LSN is that we would actually be
> >maintaining what happens when the WAL files are 16MB in size.
> >
> >David's initial expectation was this for 64MB WAL files:
> >
> >00010040
> >00010080
> >000100C0
> >00010001
> 
> This is the 1GB sequence, actually, but the idea would be the same for
> 64MB files.

Ah, right, sorry.

Thanks!

Stephen




Re: [HACKERS] Report the number of skipped frozen pages by manual VACUUM

2017-03-22 Thread Fujii Masao
On Wed, Mar 15, 2017 at 7:51 PM, Masahiko Sawada  wrote:
> On Wed, Mar 15, 2017 at 1:09 PM, Yugo Nagata  wrote:
>> On Fri, 10 Mar 2017 20:08:42 +0900
>> Masahiko Sawada  wrote:
>>
>>> On Fri, Mar 10, 2017 at 3:58 PM, Jim Nasby  wrote:
>>> > On 3/6/17 8:34 PM, Masahiko Sawada wrote:
>>> >>
>>> >> I don't think it can say "1 frozen pages" because the number of
>>> >> skipped pages according to visibility map is always more than 32
>>> >> (SKIP_PAGES_THRESHOLD).
>>> >
>>> >
>>> > That's just an artifact of how the VM currently works. I'm not a fan of
>>> > cross dependencies like that unless there's a pretty good reason.
>>>
>>> Thank you for the comment.
>>> Agreed. Attached fixed version patch.
>>
>> It seems that it says "frozen pages" if pinskipped_pages is not zero,
>> rather than about frozenskipped_pages.
>>
>> How about writing as below?
>>
>> appendStringInfo(, ngettext("Skipped %u page due to buffer pins, "
>> "Skipped %u pages due to buffer pins, ",
>> vacrelstats->pinskipped_pages),
>>  vacrelstats->pinskipped_pages);
>> appendStringInfo(, ngettext("%u frozen page.\n", "%u frozen pages.\n",
>> vacrelstats->frozenskipped_pages),
>>  vacrelstats->frozenskipped_pages);
>>
>
> Yeah, you're right. I've attached the updated version patch
> incorporated your comment and fixing documentation.

The patch looks very simple and good to me.
Barring objection, I will commit this.

Regards,

-- 
Fujii Masao




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread David Steele

On 3/22/17 3:09 PM, Stephen Frost wrote:

* Robert Haas (robertmh...@gmail.com) wrote:

On Wed, Mar 22, 2017 at 1:49 PM, Stephen Frost  wrote:

Then perhaps we do need to be thinking of moving this to PG11 instead of
exposing an option that users will start to use which will result in WAL
naming that'll be confusing and inconsistent.  I certainly don't think
it's a good idea to move forward exposing an option with a naming scheme
that's agreed to be bad.




One of the reasons to go with the LSN is that we would actually be
maintaining what happens when the WAL files are 16MB in size.

David's initial expectation was this for 64MB WAL files:

00010040
00010080
000100C0
00010001


This is the 1GB sequence, actually, but the idea would be the same for 
64MB files.


--
-David
da...@pgmasters.net




Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-22 Thread Mithun Cy
On Thu, Mar 23, 2017 at 12:19 AM, Pavan Deolasee
 wrote:
>
>
> On Wed, Mar 22, 2017 at 4:53 PM, Mithun Cy 
> wrote:
> Ok, no problem. I did some tests on AWS i2.xlarge instance (4 vCPU, 30GB
> RAM, attached SSD) and results are shown below. But I think it is important
> to get independent validation from your side too, just to ensure I am not
> making any mistake in measurement. I've attached naively put together
> scripts which I used to run the benchmark. If you find them useful, please
> adjust the paths and run on your machine.

Looking at your postgresql.conf, JFYI: I have synchronous_commit = off,
but it is on in your run (for logged tables); the rest remains the
same. Once I get the machine, probably tomorrow morning, I will run the
same tests on v19.
-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] PATCH: Make pg_stop_backup() archive wait optional

2017-03-22 Thread Stephen Frost
Fujii,

* Fujii Masao (masao.fu...@gmail.com) wrote:
> On Thu, Mar 23, 2017 at 12:37 AM, Stephen Frost  wrote:
> > * David Steele (da...@pgmasters.net) wrote:
> >> On 3/21/17 2:34 PM, Fujii Masao wrote:
> >> >The patch basically looks good to me, but one comment is;
> >> >backup.sgml (at least the description for "Making a non-exclusive
> >> >low level backup) seems to need to be updated.
> >>
> >> Agreed.  Added in the attached patch and rebased on 8027556.
> 
> Thanks for updating the patch!
> 
> -SELECT * FROM pg_stop_backup(false);
> +SELECT * FROM pg_stop_backup(false [, true ]);
> 
> I think that it's better to get rid of "[" and "]" from the above because
> IMO this should be the command example that users actually can run.

Using the '[' and ']' are how all of the optional arguments are
specified in the documentation, see things like current_setting() in our
existing documentation:

https://www.postgresql.org/docs/9.6/static/functions-admin.html

> + If the backup process monitors the WAL archiving process independently,
> + the second parameter (which defaults to true) can be set to false to
> + prevent pg_stop_backup from blocking until all WAL is
> + archived.  Instead, the function will return as soon as the stop backup
> + record is written to the WAL.  This option must be used with caution:
> + if WAL archiving is not monitored correctly then the result might be a
> + useless backup.
> 
> You added this descriptions into the step #4 in the non-exclusive
> backup procedure.. But since the step #5 already explains how
> pg_stop_backup has to do with WAL archiving, I think that it's better
> to update (or add something like the above descriptions into)
> the step #5. Thought?

That seems pretty reasonable to me.

> + If the backup process monitors the WAL archiving process independently,
> 
> Can we explain "monitor the WAL archiving process" part a bit more
> explicitly? For example, "monitor and ensure that all WAL segment files
> required for the backup are successfully archived".

Sure, makes sense.  I'll add some language along those lines.

> > I've started looking at this.  Seems pretty straight-forward and will
> > try to get it committed later today.
> 
> Thanks!

My apologies if you had intended to look at committing this, I just
noticed that it hadn't been 'claimed' yet in the CF app and did so to
move forward with it.  I didn't mean to step on anyone's 'toes'.

Thanks!

Stephen




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Stephen Frost
Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Mar 22, 2017 at 1:49 PM, Stephen Frost  wrote:
> > Then perhaps we do need to be thinking of moving this to PG11 instead of
> > exposing an option that users will start to use which will result in WAL
> > naming that'll be confusing and inconsistent.  I certainly don't think
> > it's a good idea to move forward exposing an option with a naming scheme
> > that's agreed to be bad.
> 
> I'm not sure there is any such agreement.  I agree that the naming
> scheme for WAL files probably isn't the greatest and that David's
> proposal is probably better, but we've had that naming scheme for many
> years, and I don't accept that making a previously-configure-time
> option initdb-time means that it's suddenly necessary to break
> everything for people who continue to use a 16MB WAL size.

Apologies, I completely forgot to bring up how the discussion has
evolved regarding the 16MB case even though we had moved past it in my
head.  Let me try to set that right here.

One of the reasons to go with the LSN is that we would actually be
maintaining what happens when the WAL files are 16MB in size.

David's initial expectation was this for 64MB WAL files:

00010040
00010080
000100C0
00010001

Which both matches the LSN *and* keeps the file names the same when
they're 16MB.  This is what David's looking at writing a patch for and
is what I think we should be considering.  This avoids breaking
compatibility for people who choose to continue using 16MB (assuming
we switch the default to 64MB, which I am still hopeful we will do).

David had offered up another idea which would change the WAL naming for
all sizes, but he and I chatted about it and it seemed like it'd make
more sense to maintain the 16MB filenames and then to use the LSN for
other sizes also in the same manner.

Regardless of which approach we end up using, I do think we need a
formal function for converting WAL file names into LSNs and
documentation included which spells out exactly how that's done.  This
is obviously important to backup tools which need to make sure that
there aren't any gaps in the WAL stream and also need to figure out
where the LSN returned by pg_start_backup() is.  We have a function for
the latter already, but no documentation explaining how it works, which
I believe we should add, since tool authors need to implement this in
their own code and can't always assume access to a PG server is available.
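As a sketch of the arithmetic such documentation would formalize, here is
the file-name-to-LSN conversion for the default 16MB segment size
(illustrative code, not an official API; the helper names are mine):

```python
WAL_SEG_SIZE = 16 * 1024 * 1024  # default 16MB segment size

def walfile_start_lsn(fname, seg_size=WAL_SEG_SIZE):
    """Split the 24-hex-digit WAL file name into its three fields and
    compute the LSN at which the segment starts (default layout only)."""
    tli = int(fname[0:8], 16)    # timeline ID
    log = int(fname[8:16], 16)   # high 32 bits of the LSN
    seg = int(fname[16:24], 16)  # segment number within that 4GB "log"
    lsn = (log << 32) + seg * seg_size
    return tli, lsn

def format_lsn(lsn):
    """Render an LSN in the usual X/X text form."""
    return "%X/%X" % (lsn >> 32, lsn & 0xFFFFFFFF)

tli, lsn = walfile_start_lsn("000000010000000000000001")
print(tli, format_lsn(lsn))  # 1 0/1000000
```

This only holds for 16MB segments; the whole point of the discussion is
that larger segment sizes break this simple correspondence.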

Thanks!

Stephen




Re: [HACKERS] [GSoC] Push-based query executor discussion

2017-03-22 Thread Oleg Bartunov
On Wed, Mar 22, 2017 at 8:04 PM, Arseny Sher  wrote:

> > While I admire your fearlessness, I think the chances of you being
> > able to bring a project of this type to a successful conclusion are
> > remote.  Here is what I said about this topic previously:
> >
> > http://postgr.es/m/CA+Tgmoa=kzHJ+TwxyQ+vKu21nk3prkRjSdbhjubN7qvc8UKuG
> g...@mail.gmail.com
>
> Well, as I said, I don't pretend that I will support full functionality:
> >> instead, we should decide which part of this work (if any) is
> >> going to be done in the course of GSoC. Probably, all TPC-H queries
> >> with and without index support is a good initial target, but this
> >> needs to be discussed.
>
> I think that successful completion of this project should be a clear
> and justified answer to the question "Is this idea good enough to
> work on merging it into master?", not the production-ready patches
> themselves. Nevertheless, of course, the project success criterion must
> be reasonably formalized -- e.g. implement nodes X with features Y, etc.
>

How many GSoC slots and possible students do we have?

Should we reject this interesting project, which is based on several years
of research work by an academic group at the institute? Maybe it would be
better to help him reformulate the scope of the project and let him work.
I don't know whether the results of a GSoC project have to be committed,
but as a research project it would certainly be useful for the community.




Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-22 Thread Pavan Deolasee
On Wed, Mar 22, 2017 at 4:53 PM, Mithun Cy 
wrote:

> On Wed, Mar 22, 2017 at 3:44 PM, Pavan Deolasee
>  wrote:
> >
> > This looks quite weird to me. Obviously these numbers are completely
> > non-comparable. Even the time for VACUUM FULL goes up with every run.
> >
> > May be we can blame it on AWS instance completely, but the pattern in
> your
> > tests looks very similar where the number slowly and steadily keeps going
> > up. If you do complete retest but run v18/v19 first and then run master,
> may
> > be we'll see a complete opposite picture?
> >
>
> For those tests I ran them in the order (patch v18, then master), and
> both the time numbers were the same.


Hmm, interesting.


> One different thing
> I did was I was deleting the data directory between tests and creating
> the database from scratch. Unfortunately the machine I tested this is
> not available. I will test same with v19 once I get the machine and
> report you back.


Ok, no problem. I did some tests on AWS i2.xlarge instance (4 vCPU, 30GB
RAM, attached SSD) and results are shown below. But I think it is important
to get independent validation from your side too, just to ensure I am not
making any mistake in measurement. I've attached naively put together
scripts which I used to run the benchmark. If you find them useful, please
adjust the paths and run on your machine.

I reverted back to UNLOGGED table because with WAL the results looked very
weird (as posted earlier) even when I was taking a CHECKPOINT before each
set and had set max_wal_size and checkpoint_timeout high enough to avoid
any checkpoint during the run. Anyways, that's a matter of separate
investigation and not related to this patch.

I did two kinds of tests.
a) update last column of the index
b) update second column of the index

v19 does considerably better than even master for the last column update
case and pretty much inline for the second column update test. The reason
is very clear because v19 determines early in the cycle that the buffer is
already full and there is very little chance of doing a HOT update on the
page. In that case, it does not check any columns for modification. The
master on the other hand will scan through all 9 columns (for last column
update case) and incur the same kind of overhead of doing wasteful work.
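The heuristic being described can be sketched roughly like this
(hypothetical names; the real patch works on heap pages and attribute
bitmapsets, not Python sets):

```python
def attrs_worth_checking(page_free_space, new_tuple_size, indexed_attrs):
    """Decide which indexed columns to compare for an UPDATE.

    If the page is already too full for a same-page (HOT) update,
    comparing indexed columns is wasted work: the new tuple will land on
    another page and every index needs a new entry anyway.
    """
    if page_free_space < new_tuple_size:
        return frozenset()          # skip all column comparisons
    return frozenset(indexed_attrs)  # worth checking for modification
```

Under this sketch, the "last column update" win comes from returning the
empty set early instead of scanning all nine columns.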

The first, second, and fourth columns show response time in ms, and the
third and fifth columns show the percentage difference over master. (I
hope the table looks fine, trying some text-table generator tool :-).
Apologies if it looks messed up.)



+-----------+------------+--------+------------+--------+
|                  Second column update                  |
+-----------+------------+--------+------------+--------+
|  Master   |    v18     | % diff |    v19     | % diff |
+-----------+------------+--------+------------+--------+
| 96657.681 | 108122.868 | 11.86% | 96873.642  | 0.22%  |
| 98546.35  | 110021.27  | 11.64% | 97569.187  | -0.99% |
| 99297.231 | 110550.054 | 11.33% | 100241.261 | 0.95%  |
| 97196.556 | 110235.808 | 13.42% | 97216.207  | 0.02%  |
| 99072.432 | 110477.327 | 11.51% | 97950.687  | -1.13% |
| 96730.056 | 109217.939 | 12.91% | 96929.617  | 0.21%  |
+-----------+------------+--------+------------+--------+


+------------+------------+--------+------------+--------+
|                   Last column update                   |
+------------+------------+--------+------------+--------+
|   Master   |    v18     | % diff |    v19     | % diff |
+------------+------------+--------+------------+--------+
| 112545.537 | 116563.608 | 3.57%  | 103067.276 | -8.42% |
| 110169.224 | 115753.991 | 5.07%  | 104411.44  | -5.23% |
| 112280.874 | 116437.11  | 3.70%  | 104868.98  | -6.60% |
| 113171.556 | 116813.262 | 3.22%  | 103907.012 | -8.19% |
| 110721.881 | 117442.709 | 6.07%  | 104124.131 | -5.96% |
| 112138.601 | 115834.549 | 3.30%  | 104858.624 | -6.49% |
+------------+------------+--------+------------+--------+


Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


interest_attrs_tests.tar.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Possible regression with gather merge.

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 3:39 AM, Rushabh Lathia
 wrote:
> Looking at the explain analyze output of both plans, it's clear that GM
> takes longer as it uses an external merge sort on disk, whereas the
> other plan performs a top-N heapsort. The normal sort path can use the
> limit as a bound, but for GM that's not possible.
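The difference between a bounded (top-N) sort and an unbounded one can be
illustrated outside the planner; heapq.nsmallest keeps only the limit
rows in memory, much like a top-N heapsort, while the unbounded path must
order everything before the limit applies:

```python
import heapq

rows = list(range(100000, 0, -1))  # worst case: reverse-sorted input

# Top-N heapsort analogue: retains at most 10 rows at a time.
top_n = heapq.nsmallest(10, rows)

# Unbounded sort (what a plan without a bound effectively does):
# sort all rows, then take the limit.
full = sorted(rows)[:10]

print(top_n == full)  # True
```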

Right, good catch.  Committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] PATCH: Make pg_stop_backup() archive wait optional

2017-03-22 Thread Fujii Masao
On Thu, Mar 23, 2017 at 12:37 AM, Stephen Frost  wrote:
> David, all,
>
> * David Steele (da...@pgmasters.net) wrote:
>> On 3/21/17 2:34 PM, Fujii Masao wrote:
>> >The patch basically looks good to me, but one comment is;
>> >backup.sgml (at least the description for "Making a non-exclusive
>> >low level backup) seems to need to be updated.
>>
>> Agreed.  Added in the attached patch and rebased on 8027556.

Thanks for updating the patch!

-SELECT * FROM pg_stop_backup(false);
+SELECT * FROM pg_stop_backup(false [, true ]);

I think that it's better to get rid of "[" and "]" from the above because
IMO this should be the command example that users actually can run.

+ If the backup process monitors the WAL archiving process independently,
+ the second parameter (which defaults to true) can be set to false to
+ prevent pg_stop_backup from blocking until all WAL is
+ archived.  Instead, the function will return as soon as the stop backup
+ record is written to the WAL.  This option must be used with caution:
+ if WAL archiving is not monitored correctly then the result might be a
+ useless backup.

You added these descriptions to step #4 of the non-exclusive
backup procedure. But since step #5 already explains how
pg_stop_backup interacts with WAL archiving, I think that it's better
to update step #5 (or add something like the above descriptions to it).
Thoughts?

+ If the backup process monitors the WAL archiving process independently,

Can we explain "monitor the WAL archiving process" part a bit more
explicitly? For example, "monitor and ensure that all WAL segment files
required for the backup are successfully archived".

> I've started looking at this.  Seems pretty straight-forward and will
> try to get it committed later today.

Thanks!

Regards,

-- 
Fujii Masao




Re: [HACKERS] [PATCH v1] Add and report the new "in_hot_standby" GUC pseudo-variable.

2017-03-22 Thread Elvis Pranskevichus
On Wednesday, March 22, 2017 2:17:27 PM EDT Jaime Casanova wrote:
> On 18 March 2017 at 14:01, Elvis Pranskevichus  wrote:
> > On Saturday, March 18, 2017 3:33:16 AM EDT Michael Paquier wrote:
> >> Why adding a good chunk of code instead of using
> >> pg_is_in_recovery(),
> >> which switches to false once a server exits recovery?
> > 
> > That requires polling the database continuously, which may not be
> > possible or desirable.
> > 
> > My main motivation here is to gain the ability to manage a pool of
> > connections in asyncpg efficiently.  A part of the connection
> > release
> > protocol is "UNLISTEN *;", which the server in Hot Standby would
> > fail to process.  Polling the database for pg_is_in_recovery() is
> > not feasible in this case, unfortunately.
> 
> Sorry, I still don't understand the motivation for this.
> At some point you're going to poll for the value of the GUC in
> pg_settings, no? Or how are you going to learn the current value of
> the GUC in a way that differs from just polling pg_is_in_recovery()?

It is marked as GUC_REPORT and is reported by the server on 
connection and on every change:

https://www.postgresql.org/docs/current/static/protocol-flow.html#PROTOCOL-ASYNC
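For reference, a ParameterStatus message is just the 'S' type byte, an
int32 length (which includes itself), and two NUL-terminated strings, so
what the server would send for the proposed GUC can be sketched like this
(illustrative encoding only, built client-side):

```python
import struct

def parameter_status(name, value):
    """Build a ParameterStatus ('S') protocol message: int32 length
    covering itself plus two NUL-terminated strings (name, value)."""
    body = name.encode() + b"\x00" + value.encode() + b"\x00"
    return b"S" + struct.pack("!I", len(body) + 4) + body

msg = parameter_status("in_hot_standby", "off")
```

A client that watches for this message learns about promotion as soon as
it happens, with no polling.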

Elvis




Re: [HACKERS] [PATCH v1] Add and report the new "in_hot_standby" GUC pseudo-variable.

2017-03-22 Thread Jaime Casanova
On 18 March 2017 at 14:01, Elvis Pranskevichus  wrote:
> On Saturday, March 18, 2017 3:33:16 AM EDT Michael Paquier wrote:
>>
>> Why adding a good chunk of code instead of using pg_is_in_recovery(),
>> which switches to false once a server exits recovery?
>
> That requires polling the database continuously, which may not be
> possible or desirable.
>
> My main motivation here is to gain the ability to manage a pool of
> connections in asyncpg efficiently.  A part of the connection release
> protocol is "UNLISTEN *;", which the server in Hot Standby would fail to
> process.  Polling the database for pg_is_in_recovery() is not feasible
> in this case, unfortunately.
>

Sorry, I still don't understand the motivation for this.
At some point you're going to poll for the value of the GUC in pg_settings, no?
Or how are you going to learn the current value of the GUC in a way that
differs from just polling pg_is_in_recovery()?

-- 
Jaime Casanova  www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 2:08 PM, Simon Riggs  wrote:
> On 22 March 2017 at 17:41, Robert Haas  wrote:
>> +if (TransactionIdIsCurrentTransactionId(xid))
>> +status = gettext_noop("in progress");
>> +else if (TransactionIdDidCommit(xid))
>> +status = gettext_noop("committed");
>> +else if (TransactionIdDidAbort(xid))
>> +status = gettext_noop("aborted");
>> +else
>> +
>> +/*
>> + * can't test TransactionIdIsInProgress here or we race with
>> + * concurrent commit/abort. There's no point anyway, since it
>> + * might then commit/abort just after we check.
>> + */
>> +status = gettext_noop("in progress");
>>
>> I am not sure this is going to do the right thing for transactions
>> that are aborted by a crash without managing to write an abort record.
>
> Yes, perhaps we should report that state as "aborted - incomplete".
>
> And of course, we might return "subcommitted" also, which could
> technically also be an abort in some cases, so we'd need to do a wait
> loop on that.

I actually don't think those are things we should expose to users.
They're internal implementation details.  The user had better not care
whether an abort was the type of abort that wrote an abort record or
the type that didn't.

> Which makes me think it would be confusing to say "in progress" when
> it is our current xid, since the user might wait until it is complete
> and then wait forever. I'd prefer it if it said "in progress - current
> transaction".

Hmm, or just "current transaction", maybe?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] UPDATE of partition key

2017-03-22 Thread Amit Khandekar
On 17 March 2017 at 16:07, Amit Khandekar  wrote:
> On 6 March 2017 at 15:11, Amit Langote  wrote:
>>
 But that starts to sound less attractive when one realizes that
 that will occur for every row that wants to move.
>>>
>>> If we manage to call ExecSetupPartitionTupleRouting() during execution
>>> phase only once for the very first time we find the update requires
>>> row movement, then we can re-use the info.
>>
>> That might work, too.  But I guess we're going with initialization in
>> ExecInitModifyTable().
>
> I am more worried about this: even the UPDATEs that do not involve row
> movement would do the expensive setup. So do it only once when we find
> that we need to move the row. Something like this :
> ExecUpdate()
> {
> 
> if (resultRelInfo->ri_PartitionCheck &&
>   !ExecPartitionCheck(resultRelInfo, slot, estate))
> {
>   bool  already_deleted;
>
>   ExecDelete(tupleid, oldtuple, planSlot, epqstate, estate,
>  &already_deleted, canSetTag);
>
>   if (already_deleted)
> return NULL;
>   else
>   {
> /* If we haven't already built the state for INSERT
>  * tuple routing, build it now */
> if (!mtstate->mt_partition_dispatch_info)
> {
>   ExecSetupPartitionTupleRouting(
> mtstate->resultRelInfo->ri_RelationDesc,
> &mtstate->mt_partition_dispatch_info,
> &mtstate->mt_partitions,
> &mtstate->mt_partition_tupconv_maps,
> &mtstate->mt_partition_tuple_slot,
> &mtstate->mt_num_dispatch,
> &mtstate->mt_num_partitions);
> }
>
> return ExecInsert(mtstate, slot, planSlot, NULL,
>   ONCONFLICT_NONE, estate, false);
>   }
> }
> ...
> }

Attached is a v2 patch which implements the above optimization. Now, for
UPDATE, ExecSetupPartitionTupleRouting() will be called only if row
movement is needed.
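The lazy, build-once shape of that setup can be sketched as follows
(illustrative names only, not the actual executor API):

```python
class ModifyTableState:
    """Minimal sketch: defer an expensive setup until first needed."""

    def __init__(self, setup_routing):
        self._setup_routing = setup_routing
        self.partition_dispatch_info = None  # not built yet

    def ensure_routing(self):
        # Pay the expensive tuple-routing setup only for the first row
        # that actually moves between partitions; plain UPDATEs that
        # never move a row pay nothing.
        if self.partition_dispatch_info is None:
            self.partition_dispatch_info = self._setup_routing()
        return self.partition_dispatch_info
```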

We have to open an extra relation for the root partition, and keep it
open with its handle stored in
mt_partition_dispatch_info[0]->reldesc. So ExecEndModifyTable() closes
it if it is different from node->resultRelInfo->ri_RelationDesc. If
it is the same as node->resultRelInfo, it should not be closed, because
it gets closed as part of ExecEndPlan().


update-partition-key_v2.patch
Description: Binary data



Re: [HACKERS] [PATCH v1] Add and report the new "in_hot_standby" GUC pseudo-variable.

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 8:25 AM, Elvis Pranskevichus  wrote:
> On Tuesday, March 21, 2017 11:50:38 PM EDT Peter Eisentraut wrote:
>> On 3/17/17 13:56, Elvis Pranskevichus wrote:
>> > Currently, clients wishing to know when the server exits hot standby
>> > have to resort to polling, which is often suboptimal.
>> >
>> > This adds the new "in_hot_standby" GUC variable that is reported via
>> > a ParameterStatus message.
>>
>> The terminology chosen here is not very clear.  What is the opposite
>> of "in hot standby"?  Warm standby?  Cold standby?  Not standby at
>> all? Promoted to primary (writable)?
>
> The opposite means primary.  I can flip the GUC name to "is_primary", if
> that's clearer.

Hmm, I don't find that clearer.  "hot standby" has a very specific
meaning; "primary" isn't vague, but I would say it's less specific.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2017-03-22 Thread Simon Riggs
On 22 March 2017 at 17:41, Robert Haas  wrote:

> +if (TransactionIdIsCurrentTransactionId(xid))
> +status = gettext_noop("in progress");
> +else if (TransactionIdDidCommit(xid))
> +status = gettext_noop("committed");
> +else if (TransactionIdDidAbort(xid))
> +status = gettext_noop("aborted");
> +else
> +
> +/*
> + * can't test TransactionIdIsInProgress here or we race with
> + * concurrent commit/abort. There's no point anyway, since it
> + * might then commit/abort just after we check.
> + */
> +status = gettext_noop("in progress");
>
> I am not sure this is going to do the right thing for transactions
> that are aborted by a crash without managing to write an abort record.

Yes, perhaps we should report that state as "aborted - incomplete".

And of course, we might return "subcommitted" also, which could
technically also be an abort in some cases, so we'd need to do a wait
loop on that.

Which makes me think it would be confusing to say "in progress" when it
is our current xid, since the user might wait until it is complete and
then wait forever. I'd prefer it if it said "in progress - current
transaction".

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 1:49 PM, Stephen Frost  wrote:
> * Robert Haas (robertmh...@gmail.com) wrote:
>> On Wed, Mar 22, 2017 at 1:22 PM, Stephen Frost  wrote:
>> > To put this in another light, had this issue been brought up post
>> > feature-freeze, your definition would mean that we would only have the
>> > option to either revert the patch entirely or to live with the poor
>> > naming scheme.
>>
>> Yeah, and I absolutely agree with that.  In fact, I think it's
>> *already* past the time when we should be considering the changes you
>> want.
>
> Then perhaps we do need to be thinking of moving this to PG11 instead of
> exposing an option that users will start to use which will result in WAL
> naming that'll be confusing and inconsistent.  I certainly don't think
> it's a good idea to move forward exposing an option with a naming scheme
> that's agreed to be bad.

I'm not sure there is any such agreement.  I agree that the naming
scheme for WAL files probably isn't the greatest and that David's
proposal is probably better, but we've had that naming scheme for many
years, and I don't accept that making a previously-configure-time
option initdb-time means that it's suddenly necessary to break
everything for people who continue to use a 16MB WAL size.  I really
think that is very unlikely to be a majority position, no matter how
firmly you and David hold to it.   It is possible that a majority of
people will agree that such a change should be made, but it seems very
remote that a majority of people will agree that it has to (or even
should be) the same commit that improves the configurability.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread David G. Johnston
On Wed, Mar 22, 2017 at 9:51 AM, Stephen Frost  wrote:

> Robert,
>
> * Robert Haas (robertmh...@gmail.com) wrote:
> > On Wed, Mar 22, 2017 at 12:22 PM, Stephen Frost 
> wrote:
> > > While I understand that you'd like to separate the concerns between
> > > changing the renaming scheme and changing the default and enabling this
> > > option, I don't agree that they can or should be independently
> > > considered.
> >
> > Well, I don't understand what you think is going to happen here.  Neither
> > Beena nor any other contributor you don't employ is obliged to write a
> > patch for those changes because you'd like them to get made, and Peter
> and
> > I have already voted against including them.  If you or David want to
> write
> > a patch for those changes, post it for discussion, and try to get
> consensus
> > to commit it, that's of course your right.  But the patch will be more
> than
> > three weeks after the feature freeze deadline and will have two committer
> > votes against it from the outset.
>
> This would clearly be an adjustment to the submitted patch, which
> happens regularly during the review and commit process and is part of
> the commitfest process, so I don't agree that holding it to new-feature
> level is correct when it's a change which is entirely relevant to this
> new feature, and one which a committer is asking to be included as part
> of the change.  Nor do I feel particularly bad about asking for feature
> authors to be prepared to rework parts of their feature based on
> feedback during the commitfest process.
>

Maybe it can be fit in as part of the overall patch set, but wouldn't
placing it either:

First, changing the naming behavior and using the existing configure-time
knob to test it out

or

Second, committing the existing patch relying on the existing behavior and
then implementing the rename changes using the new initdb-time knob to
test it out

in a series make reasoning about and discussing the change considerably
easier?


> I would have liked to have realized this oddity with the WAL naming
> scheme for not-16MB-WALs earlier too, but it's unfortunately not within
> my abilities to change that.  That does not mean that we shouldn't be
> cognizant of the impact that this new feature will have in exposing this
> naming scheme, one which there seems to be agreement is bad, to users.
>

> That said, David is taking a look at it to try and be helpful.
>
> Vote-counting seems a bit premature given that there hasn't been any
> particularly clear asking for votes.  Additionally, I believe Peter also
> seemed concerned that the existing naming scheme which, if used with,
> say, 64MB segments, wouldn't match LSNs either, in this post:
> 9795723f-b4dd-f9e9-62e4-ddaf6cd88...@2ndquadrant.com


While my DBA skills aren't that great, I would think that having a system
that relies upon the DBA learning how to mentally map between LSN IDs and
WAL files is a failure in UX in the first place.  The hacker-DBA might get
a kick out of being able to operate efficiently with that knowledge and
level of skill, but the typical DBA would rather have something like "pg_wal
--lsn " that they can rely upon.  I would think tool builders would
likewise rather rely on us to tell them where to look instead of
matching up two strings.

This kinda reminds me of the discussion regarding our version number
change.  We are going to expose broken tools that rely on the decimal
version number instead of the official integer one.  Here we expose tools
that rely on the equivalence between LSN and WAL filenames when using 16MB
WAL files.  What I haven't seen defined here is how those tools should be
behaving - i.e., what is our supported API.  If we lack an official way of
working with these values then maybe we shouldn't give users an initdb-time
way to change the WAL file size.

For the uninformed like myself an actual concrete example of an actual
program that would be affected would be helpful.

David J.


Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Stephen Frost
Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Mar 22, 2017 at 1:22 PM, Stephen Frost  wrote:
> > To put this in another light, had this issue been brought up post
> > feature-freeze, your definition would mean that we would only have the
> > option to either revert the patch entirely or to live with the poor
> > naming scheme.
> 
> Yeah, and I absolutely agree with that.  In fact, I think it's
> *already* past the time when we should be considering the changes you
> want.

Then perhaps we do need to be thinking of moving this to PG11 instead of
exposing an option that users will start to use which will result in WAL
naming that'll be confusing and inconsistent.  I certainly don't think
it's a good idea to move forward exposing an option with a naming scheme
that's agreed to be bad.

Thanks!

Stephen




Re: [HACKERS] Review: GIN non-intrusive vacuum of posting tree

2017-03-22 Thread Teodor Sigaev

No, the second conditional code will be called for any subtree which
contains a totally empty subtree. That check (!isRoot) covers the case
when the entire posting tree should be erased: we cannot just quit out
of the recursive cleanup; we have to make a scan here, starting from
the root.

Oh, I see


Probably the variable isChildHasVoid has a somewhat confusing name. This
flag indicates that some subtree:
1. Had empty pages
2. Did not bother deleting them, because there is a chance that it is
part of a bigger empty subtree.
Maybe it'd be better to call the variable "someChildIsVoidSubtree".


hasEmptyChild? and hasNonEmptyChild (BTW, isAnyNonempy is missing a 't')

And if the whole posting tree is empty, then we could mark the root page
as a leaf and remove all the other pages in the tree without any locking.
Although, that could be a task for a separate patch.
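The deferred deletion of empty subtrees that this flag tracks can be
sketched abstractly (illustrative tree structure only; no locking or GIN
page specifics):

```python
class Node:
    """Toy posting-tree node: a leaf is empty or not; internal nodes
    hold children."""
    def __init__(self, children=None, empty=False):
        self.children = children or []
        self.empty = empty

    @property
    def is_leaf(self):
        return not self.children

def cleanup(node, is_root=False):
    """Return True when the whole subtree is empty.  Empty children are
    deleted only once a non-empty sibling proves the parent survives;
    otherwise deletion is deferred to an ancestor (or the root scan)."""
    if node.is_leaf:
        return node.empty
    flags = [cleanup(c) for c in node.children]
    if all(flags) and not is_root:
        return True  # defer: may be part of a larger empty subtree
    node.children = [c for f, c in zip(flags, node.children) if not f]
    return False
```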



--
Teodor Sigaev   E-mail: teo...@sigaev.ru
   WWW: http://www.sigaev.ru/




Re: [HACKERS] Re: [COMMITTERS] pgsql: Fix and simplify check for whether we're running as Windows serv

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 10:13 AM, Alvaro Herrera
 wrote:
> Heikki Linnakangas wrote:
>> I did some archeology, and found CheckTokenMembership() in MinGW's w32api
>> packages version 3.14
>> (https://sourceforge.net/projects/mingw/files/MinGW/Base/w32api/w32api-3.14/,
>> in include/winbase.h). According to the timestamps on that download page,
>> that was released in 2009. That was the oldest version I could find, so it
>> might go even further back.
>>
>> Dave, do you know exactly what version of MinGW narwhal is running? And how
>> difficult is it to upgrade to something slightly more modern? Ease of
>> upgrade is another good data point on how far we need to support old
>> versions.
>
> Given that this was backpatched and that it broke narwhal in all
> branches, I think the solution needs to make narwhal work again without
> requiring it to upgrade; so we should acquire CheckTokenMembership via
> dynloading just like we do the other functions.  If we want to require a
> newer mingw version in pg10, that's acceptable, but it should be a
> separate patch.

+1 for not moving the minimum system requirements in the back-branches.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 1:22 PM, Stephen Frost  wrote:
> To put this in another light, had this issue been brought up post
> feature-freeze, your definition would mean that we would only have the
> option to either revert the patch entirely or to live with the poor
> naming scheme.

Yeah, and I absolutely agree with that.  In fact, I think it's
*already* past the time when we should be considering the changes you
want.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 3:13 AM, Simon Riggs  wrote:
>> Changes made per discussion.
>
> This looks good to me also. Does what we need it to do.
>
> I was a little worried by the possible performance of the new lock, but
> it's not intended to be run frequently, so it's OK.

Agreed.

Reviewing 0002:

+if (!TransactionIdIsValid(xid))
+{
+LWLockRelease(XidGenLock);
+ereport(ERROR,
+(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("transaction ID " UINT64_FORMAT " is an invalid xid",
+xid_with_epoch)));
+}

It's unnecessary to release LWLockRelease() before throwing an error,
and it's also wrong because we haven't acquired XidGenLock in this
code path.  But maybe it would be better to just remove this entirely
and instead have TransactionIdInRecentPast return false for
InvalidTransactionId.  Then we'd avoid adding a translatable message
for a case that is basically harmless to allow.

+if (TransactionIdIsCurrentTransactionId(xid))
+status = gettext_noop("in progress");
+else if (TransactionIdDidCommit(xid))
+status = gettext_noop("committed");
+else if (TransactionIdDidAbort(xid))
+status = gettext_noop("aborted");
+else
+
+/*
+ * can't test TransactionIdIsInProgress here or we race with
+ * concurrent commit/abort. There's no point anyway, since it
+ * might then commit/abort just after we check.
+ */
+status = gettext_noop("in progress");

I am not sure this is going to do the right thing for transactions
that are aborted by a crash without managing to write an abort record.
It seems that the first check will say the transaction isn't in
progress, and the second and third checks will say it isn't either
committed or aborted since, if I am reading this correctly, they just
read clog directly.  Compare HeapTupleSatisfiesMVCC, which assumes
that a not-in-progress, not-committed transaction must necessarily
have aborted.  I think your comment is pointing to a real issue,
though.  It seems like what might be needed is to add one more check.
Before where you have the "else" clause, check if the XID is old, e.g.
precedes our snapshot's xmin.  If so, it must be committed or aborted
and, since it didn't commit, it aborted.  If not, it must've changed
from in progress to not-in-progress just as we were in the midst of
checking, so labeling it as in progress is fine.
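The decision ladder described above can be modeled in miniature. This is only a sketch: the boolean arguments stand in for TransactionIdIsCurrentTransactionId(), TransactionIdDidCommit(), and TransactionIdDidAbort(), and the plain `<` comparison ignores XID wraparound, which real code must handle with TransactionIdPrecedes().

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef enum { TXN_IN_PROGRESS, TXN_COMMITTED, TXN_ABORTED } TxnStatus;

/*
 * Sketch of the proposed txid_status() decision ladder.  The boolean
 * arguments stand in for the proc/clog checks; xid < snapshot_xmin
 * stands in for TransactionIdPrecedes() and ignores wraparound.
 */
TxnStatus
txid_status_check(bool is_current, bool clog_committed, bool clog_aborted,
                  TransactionId xid, TransactionId snapshot_xmin)
{
    if (is_current)
        return TXN_IN_PROGRESS;
    if (clog_committed)
        return TXN_COMMITTED;
    if (clog_aborted)
        return TXN_ABORTED;

    /*
     * Neither committed nor aborted according to clog.  If the xid
     * precedes our snapshot's xmin it cannot still be running, so it must
     * have been aborted by a crash that never wrote an abort record.
     * Otherwise it finished while we were checking, and "in progress" is
     * a fine answer.
     */
    if (xid < snapshot_xmin)
        return TXN_ABORTED;
    return TXN_IN_PROGRESS;
}
```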

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: Faster Expression Processing v4

2017-03-22 Thread Andres Freund
On 2017-03-22 13:15:41 -0400, Tom Lane wrote:
> BTW, I'm fairly concerned by what you did in nodeTidscan.c, ie delaying
> compile of the TID expressions until TidListCreate.  I think that probably
> fails for cases involving, eg, subplans in the expressions; we need
> subplans to get linked to the parent node, and this way won't do it till
> (maybe) too late.  Barring objection I'm going to rearrange that so that
> we still do the compile part at executor start.

No objection here.




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Stephen Frost
Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Mar 22, 2017 at 12:51 PM, Stephen Frost  wrote:
> > This would clearly be an adjustment to the submitted patch, which
> > happens regularly during the review and commit process and is part of
> > the commitfest process, so I don't agree that holding it to new-feature
> > level is correct when it's a change which is entirely relevant to this
> > new feature, and one which a committer is asking to be included as part
> > of the change.  Nor do I feel particularly bad about asking for feature
> > authors to be prepared to rework parts of their feature based on
> > feedback during the commitfest process.
> 
> Obviously, reworking the patch is an expected part of the CommitFest
> process.  However, I disagree that what you're asking for can in any
> way be fairly characterized that way.  You're not trying to make it do
> the thing that it does better or differently.  You're trying to make
> it do a second thing.

I don't agree with the particularly narrow definition you're using in
this case to say that adding an option to initdb to change how big WAL
files are, which will also change how they're named (even though this
patch doesn't *specifically* do anything with the naming because there
was a configure-time switch which existed before) means that asking for
the WAL files names, which are already being changed, to be changed in a
different way, is really outside the scope and a new feature.

To put this in another light, had this issue been brought up post
feature-freeze, your definition would mean that we would only have the
option to either revert the patch entirely or to live with the poor
naming scheme.  For my 2c, in such a case, I would have voted to make
the change even post feature-freeze unless we were very close to
release as it's not really a new 'feature'.

Thankfully, that isn't the case here and we do have time to consider
changing it without having to worry about having a post feature-freeze
discussion about it.

Thanks!

Stephen




Re: [HACKERS] patch proposal

2017-03-22 Thread David Steele

On 3/21/17 8:45 PM, Venkata B Nagothi wrote:

On Tue, Mar 21, 2017 at 8:46 AM, David Steele wrote:

[...] minimum recovery position (which is the end backup, which again means
applying WALs generated between start and the end backup) all the way
through to the recovery target position.


minRecoveryPoint is only used when recovering a backup that was made 
from a standby.  For backups taken on the master, the backup end WAL 
record is used.



The best start position to check with would the position shown up in the
pg_control file, which is way much better compared to the current
postgresql behaviour.


I don't agree, for the reasons given previously.


The tests pass (or rather fail as expected) because they are written
using values before the start of the backup.  It's actually the end
of the backup (where the database becomes consistent on recovery)
that defines where PITR can start and this distinction definitely
matters for very long backups.  A better test would be to start the
backup, get a time/lsn/xid after writing some data, and then make
sure that time/lsn/xid is flagged as invalid on recovery.

The current behaviour is that, if the XID shown by pg_controldata
is, say, 1400 and the specified recovery_target_xid is, say, 200, then
postgresql just completes the recovery and promotes at the immediate
consistent recovery target, which means the DBA has to start the
restoration and recovery process all over again to promote the database
at the intended position.


I'm not saying that the current behavior is good, only that the proposed 
behavior is incorrect as far as I can tell.



It is also problematic to assume that transaction IDs commit in
order. If checkPoint.latestCompletedXid contains 5 then a recovery
to xid 4 may or may not be successful depending on the commit
order.  However, it appears in this case the patch would disallow
recovery to 4.

If txid 4 is the recovery target xid, then the appropriate backup
taken prior to txid 4 must be used, or an alternative recovery
target like recovery_target_timestamp must be used to proceed to the
respective recovery target xid.


I'm not sure I follow you here, but I do know that using the last 
committed xid says little about which xids precede or follow it.



I think this logic would need to be baked into recoveryStopsAfter()
and/or recoveryStopsBefore() in order to work.  It's not clear to me
what that would look like, however, or if the xid check is even
practical.


The above two functions would determine if the recovery process has to
stop at a particular position or not and the patch has nothing to do
with it.

To summarise, this patch only determines if the WAL replay has to start
at all. In other words, if the specified recovery target falls prior to
the restored database position, then, the WAL replay should not start,
which prevent the database from getting promoted at an unintended
recovery target.


I understand what you are trying to do.  My point is that the 
information you are working from (whatever checkpoint happens to be in 
pg_control when recovery starts) is not sufficient for the task.  This 
method is inaccurate and would disallow valid recovery scenarios.
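To put the point above in one line of logic: it is the consistency position (the backup end location, for backups taken on the master), not the checkpoint in pg_control, that bounds which recovery targets are reachable. A minimal sketch, assuming a target expressed as an LSN; xid and time targets cannot be validated this way, since commits are not ordered by xid:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * A recovery target is only reachable if it lies at or after the point
 * where the restored backup becomes consistent; recovery must replay at
 * least up to consistency before it is allowed to stop anywhere.
 */
bool
recovery_target_reachable(XLogRecPtr consistency_lsn, XLogRecPtr target_lsn)
{
    return target_lsn >= consistency_lsn;
}
```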


I suggest doing a thorough read-through of StartupXLOG(), particularly 
the section where the backup_label is loaded.


Thanks,
--
-David
da...@pgmasters.net




Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 9:59 AM, Ashutosh Bapat wrote:
>> In my testing last week, reparameterize_path_by_child() was essential
>> for nested loops to work properly, even without LATERAL.  Without it,
>> the parameterized path ends up containing vars that reference the
>> parent varno instead of the child varno.  That confused later planner
>> stages so that those Vars did not get replaced with Param during
>> replace_nestloop_params(), eventually resulting in a crash at
>> execution time.
>
> I half-described the solution. Sorry. Along with disabling
> partition-wise lateral joins, we have to disable nested loop
> child-joins where inner child is parameterized by the parent of the
> outer one. We will still have nestloop join between parents where
> inner relation is parameterized by the outer and every child of inner
> is parameterized by the outer. But we won't create nest loop joins
> where inner child is parameterized by the outer child, where we
> require reparameterize_path_by_child(). We will lose this optimization
> only till we get reparameterize_path_by_child() committed. Basically,
> in try_nestloop_path() (in the patch 0009), if
> (PATH_PARAM_BY_PARENT(inner_path, outer_path->parent)), give up
> creating nest loop path. That shouldn't create any problems.
>
> Did you experiment with this change in try_nestloop_path()? Can you
> please share the testcase? I will take a look at it.

I didn't save the test case.  It was basically just forcing a
partitionwise nestloop join between two equipartitioned tables, with
the calls to adjust_appendrel_attrs() ripped out of
reparameterize_path_by_child(), just to see what would break.

>> Based on that experiment, I think we could consider
>> having reparameterize_path_by_child() handle only scan paths as
>> reparameterize_path() does, and just give up on plans like this:
>>
>> Append
>> -> Left Join
>>-> Scan on a
>>-> Inner Join
>>   -> Index Scan on b
>>   -> Index Scan on c
>> [repeat for each partition]
>>
>
> I am assuming that a, b and c are partitions of A, B and C resp. which
> are being joined and both or one of the scans on b and c are
> parameteried by a or scan of c is parameterized by b.

Yes.

> I don't think we will get away by supporting just scan paths, since
> the inner side of lateral join can be any paths not just scan path. Or
> you are suggesting that we disable partition-wise lateral join and
> support reparameterization of only scan paths?

I think if you can do a straight-up partitionwise nested loop between
two tables A and B, that's pretty bad.  But if there are more complex
cases that involve parameterizing entire join trees which aren't
covered, that's less bad.  Parallel query almost entirely punts on
LATERAL right now, and nobody's complained yet.  I'm sure that'll need
to get fixed someday, but not today.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: Faster Expression Processing v4

2017-03-22 Thread Tom Lane
BTW, I'm fairly concerned by what you did in nodeTidscan.c, ie delaying
compile of the TID expressions until TidListCreate.  I think that probably
fails for cases involving, eg, subplans in the expressions; we need
subplans to get linked to the parent node, and this way won't do it till
(maybe) too late.  Barring objection I'm going to rearrange that so that
we still do the compile part at executor start.

regards, tom lane




Re: [HACKERS] RADIUS fallback servers

2017-03-22 Thread Magnus Hagander
On Mon, Mar 6, 2017 at 8:14 PM, Adam Brightwell <adam.brightw...@crunchydata.com> wrote:

> On Mon, Mar 6, 2017 at 10:24 AM, Adam Brightwell
>  wrote:
> >>> I wonder if removing the complexity of maintaining two separate lists
> >>> for the server and port would be a better/less complex approach.  For
> >>> instance, why not go with a list of typical 'host:port' strings for
> >>> 'radiusservers'?  If no port is specified, then simply use the default
> >>> for that specific host. Therefore, we would not have to worry about
> >>> keeping the two lists in sync. Thoughts?
> >>
> >>
> >> If we do that we should do it for all the parameters, no? So not just
> >> host:port, but something like host:port:secret:identifier? Mixing the
> two
> >> ways of doing it would be quite confusing I think.
> >>
> >> And I wonder if that format wouldn't get even more confusing if you for
> >> example want to use default ports, but non-default secrets.
> >
> > Yes, I agree. Such a format would be more confusing and I certainly
> > wouldn't be in favor of it.
> >
> >> I can see how it would probably be easier in some of the simple cases,
> but I
> >> wonder if it wouldn't make it worse in a lot of other cases.
> >
> > Ultimately, I think that it would be better off in a separate
> > configuration file. Something to the effect of each line representing
> > a server, something like:
> >
> > '   '
> >
> > With 'radiusservers' simply being the path to that file and
> > 'radiusserver', etc. would remain as is. Where only one or the other
> > could be provided, but not both. Though, that's perhaps would be
> > beyond the scope of this patch.
>

I think it is. If we want to go there I don't think we should do that as a
RADIUS thing -- we should rethink it within the context of *all* the
different authentication methods we have.
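For what it's worth, the per-entry parsing for the 'host:port' format discussed above is straightforward. Here is a hypothetical helper, not taken from the patch, that falls back to the default RADIUS port when no port is given; IPv6 literals would need additional bracket handling:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RADIUS_DEFAULT_PORT 1812

/*
 * Parse one "host[:port]" entry, writing the host part into the caller's
 * buffer and returning the port (the RADIUS default when absent).
 * Hypothetical helper for illustration only.
 */
int
parse_radius_entry(const char *entry, char *host, size_t hostlen)
{
    const char *colon = strrchr(entry, ':');

    if (colon == NULL)
    {
        snprintf(host, hostlen, "%s", entry);
        return RADIUS_DEFAULT_PORT;
    }

    size_t n = (size_t) (colon - entry);
    if (n >= hostlen)
        n = hostlen - 1;
    memcpy(host, entry, n);
    host[n] = '\0';
    return atoi(colon + 1);
}
```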



> >
> > At any rate, I'm going to continue moving forward with testing this
> patch as is.
>
> I have run through testing this patch against a small set of RADIUS
> servers.  This testing included both single server and multiple server
> configurations. All seems to work as expected.
>

Seems I forgot about this one.

Thanks for the review -- I've applied the patch.

I'll look into the timeout thing as a separate patch later.


-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 12:51 PM, Stephen Frost  wrote:
> This would clearly be an adjustment to the submitted patch, which
> happens regularly during the review and commit process and is part of
> the commitfest process, so I don't agree that holding it to new-feature
> level is correct when it's a change which is entirely relevant to this
> new feature, and one which a committer is asking to be included as part
> of the change.  Nor do I feel particularly bad about asking for feature
> authors to be prepared to rework parts of their feature based on
> feedback during the commitfest process.

Obviously, reworking the patch is an expected part of the CommitFest
process.  However, I disagree that what you're asking for can in any
way be fairly characterized that way.  You're not trying to make it do
the thing that it does better or differently.  You're trying to make
it do a second thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [GSoC] Push-based query executor discussion

2017-03-22 Thread Arseny Sher
> While I admire your fearlessness, I think the chances of you being
> able to bring a project of this type to a successful conclusion are
> remote.  Here is what I said about this topic previously:
>
> http://postgr.es/m/CA+Tgmoa=kzhj+twxyq+vku21nk3prkrjsdbhjubn7qvc8uk...@mail.gmail.com

Well, as I said, I don't pretend that I will support full functionality:
>> instead, we should decide which part of this work (if any) is
>> going to be done in the course of GSoC. Probably, all TPC-H queries
>> with and without index support is a good initial target, but this
>> needs to be discussed.

I think that successful completion of this project should be a clear
and justified answer to the question "Is this idea is good enough to
work on merging it into the master?", not the production-ready patches
themselves. Nevertheless, of course project success criterion must be
reasonably formalized -- e.g. implement nodes X with features Y, etc.




Re: [HACKERS] Enabling parallelism for queries coming from SQL or other PL functions

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 3:00 AM, Rafia Sabih wrote:

> On Wed, Mar 22, 2017 at 12:55 AM, Robert Haas 
> wrote:
> > On Tue, Mar 21, 2017 at 6:36 AM, Dilip Kumar 
> wrote:
> >> How about taking the decision of execute_once based on
> >> fcache->returnsSet instead of based on lazyEval?
> >>
> >> change
> >> + ExecutorRun(es->qd, ForwardScanDirection, count, !es->lazyEval);
> >> to
> >> + ExecutorRun(es->qd, ForwardScanDirection, count, !fcache->returnsSet);
> >>
> >> IMHO, Robert have the same thing in mind?
> >
> > Yeah, something like that.
> >
> Done in execute-once-v2.patch
>

So, let's see here.  We are safe so long as we're sure that, when
postquel_getnext() returns, postquel_end() will be called next without
iterating the loop in fmgr_sql().  That will definitely be true if
fcache->returnsSet is true.  It will also be true if postquel_getnext
returns true, which will be true whenever count == 0, which will be true
whenever es->lazyEval is false.

So couldn't we actually make this test !fcache->returnsSet ||
!es->lazyEval?  That would let us allow parallel execution for all
non-set-returning functions, and also for set-returning functions that end
up with es->lazyEval set to false.
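The condition proposed above reduces to a single boolean expression. A sketch, where the argument names mirror fcache->returnsSet and es->lazyEval but this is only a model of the rule, not the patch itself:

```c
#include <stdbool.h>

/*
 * Parallel execution is safe when a single ExecutorRun() call will run
 * the plan to completion: always for non-set-returning SQL functions,
 * and for set-returning ones whose lazyEval flag is false (count == 0).
 */
bool
sql_function_execute_once(bool returns_set, bool lazy_eval)
{
    return !returns_set || !lazy_eval;
}
```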

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Stephen Frost
Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
> On Wed, Mar 22, 2017 at 12:22 PM, Stephen Frost  wrote:
> > While I understand that you'd like to separate the concerns between
> > changing the renaming scheme and changing the default and enabling this
> > option, I don't agree that they can or should be independently
> > considered.
> 
> Well, I don't understand what you think is going to happen here.  Neither
> Beena nor any other contributor you don't employ is obliged to write a
> patch for those changes because you'd like them to get made, and Peter and
> I have already voted against including them.  If you or David want to write
> a patch for those changes, post it for discussion, and try to get consensus
> to commit it, that's of course your right.  But the patch will be more than
> three weeks after the feature freeze deadline and will have two committer
> votes against it from the outset.

This would clearly be an adjustment to the submitted patch, which
happens regularly during the review and commit process and is part of
the commitfest process, so I don't agree that holding it to new-feature
level is correct when it's a change which is entirely relevant to this
new feature, and one which a committer is asking to be included as part
of the change.  Nor do I feel particularly bad about asking for feature
authors to be prepared to rework parts of their feature based on
feedback during the commitfest process.

I would have liked to have realized this oddity with the WAL naming
scheme for not-16MB-WALs earlier too, but it's unfortunately not within
my abilities to change that.  That does not mean that we shouldn't be
cognizant of the impact that this new feature will have in exposing this
naming scheme, one which there seems to be agreement is bad, to users.

That said, David is taking a look at it to try and be helpful.

Vote-counting seems a bit premature given that there hasn't been any
particularly clear asking for votes.  Additionally, I believe Peter also
seemed concerned that the existing naming scheme which, if used with,
say, 64MB segments, wouldn't match LSNs either, in this post:
9795723f-b4dd-f9e9-62e4-ddaf6cd88...@2ndquadrant.com
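The mismatch is visible from how WAL file names are derived: the segment number is the LSN divided by the segment size, split into two hex fields after the timeline. A sketch of that computation, modeled loosely on the XLogFileName macro; only with 16MB segments (256 per 4GB "xlogid") does the low field track the LSN's upper hex digits:

```c
#include <stdio.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * Build a WAL file name (timeline, high, low hex fields) for a given LSN
 * and segment size.  The same LSN maps to differently numbered files
 * depending on the segment size, which is the naming concern here.
 */
void
wal_file_name(char *buf, size_t buflen, unsigned timeline,
              XLogRecPtr lsn, uint64_t seg_size)
{
    uint64_t segs_per_id = UINT64_C(0x100000000) / seg_size;
    uint64_t segno = lsn / seg_size;

    snprintf(buf, buflen, "%08X%08X%08X", timeline,
             (unsigned) (segno / segs_per_id),
             (unsigned) (segno % segs_per_id));
}
```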

Thanks!

Stephen




Re: [HACKERS] Proposal: GetOldestXminExtend for ignoring arbitrary vacuum flags

2017-03-22 Thread Simon Riggs
On 22 March 2017 at 03:42, Haribabu Kommi  wrote:
>
>
> On Wed, Mar 22, 2017 at 1:53 PM, Seki, Eiji 
> wrote:
>>
>>
>> Thank you for your review, again.
>>
>> I think your proposals are better, so I reflected them.
>
>
> Thanks for the updated patch. Patch looks good to me.
> I marked it as "ready for committer".

Looks good. I'll double check and commit this.

> While reviewing this patch, I found that PGXACT->vacuumFlags
> variable name needs a rename because with the addition of
> PROC_IN_LOGICAL_DECODING flag "vacuumFlags" doesn't
> only use it for vacuum operation. I feel this variable can be renamed
> as just "flags", but anyway that is a different patch.

Good point. Should be an open item.

-- 
Simon Riggs    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 12:22 PM, Stephen Frost  wrote:
> While I understand that you'd like to separate the concerns between
> changing the renaming scheme and changing the default and enabling this
> option, I don't agree that they can or should be independently
> considered.

Well, I don't understand what you think is going to happen here.  Neither
Beena nor any other contributor you don't employ is obliged to write a
patch for those changes because you'd like them to get made, and Peter and
I have already voted against including them.  If you or David want to write
a patch for those changes, post it for discussion, and try to get consensus
to commit it, that's of course your right.  But the patch will be more than
three weeks after the feature freeze deadline and will have two committer
votes against it from the outset.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Logical decoding on standby

2017-03-22 Thread Andres Freund
On 2017-03-22 15:59:42 +, Simon Riggs wrote:
> On 22 March 2017 at 13:06, Andres Freund  wrote:
> 
> >> The parts I think are important for Pg10 are:
> >
> >> * Ability to create logical slots on replicas
> >
> > Doesn't this also imply recovery conflicts on DROP DATABASE?
> 
> Not needed until the slot is in use, which is a later patch.

Hm? We need to drop slots, if they can exist / be created, on a standby,
and they're on a dropped database.  Otherwise we'll reserve resources,
while anyone connecting to the slot will likely just receive errors
because the database doesn't exist anymore.  It's also one of the
patches that can quite easily be developed / reviewed, because there
really isn't anything complicated about it.  Most of the code is already
in Craig's patch, it just needs some adjustments.


> > Besides,
> > allowing to drop all slots using a database upon DROP DATABASE, is a
> > useful thing on its own.
> 
> Sure but that's a separate feature unrelated to this patch and we're
> running out of time.

Hm? The patch implemented it.


> >> * Ability to advance (via feedback or via SQL function) - no need to
> >> actually decode and call output plugins at all
> >
> > That pretty much requires decoding, otherwise you really don't know how
> > much WAL you have to retain.
> 
> Knowing how much WAL to retain is key.
> 
> Why would decoding tell you how much WAL to retain?

Because decoding already has the necessary logic?  (You need to retain
enough WAL to restart decoding for all in-progress transactions etc).
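A toy model of the retention rule: the WAL that must be kept is bounded by the start of the oldest in-progress transaction, not by the position the client has confirmed. The names below are invented for illustration and are not the slot code:

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/*
 * WAL strictly before the returned LSN could be removed.  To be able to
 * restart decoding we must keep everything from the start of the oldest
 * transaction still in progress, even if the client has already
 * confirmed consuming a later position.
 */
XLogRecPtr
decoding_restart_lsn(const XLogRecPtr *xact_start_lsns, size_t nxacts,
                     XLogRecPtr confirmed_lsn)
{
    XLogRecPtr restart = confirmed_lsn;

    for (size_t i = 0; i < nxacts; i++)
        if (xact_start_lsns[i] < restart)
            restart = xact_start_lsns[i];
    return restart;
}
```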


> We tried to implement this automatically from the master, which was
> rejected. So the only other way is manually. We need one or the other.

I think the current approach is roughly the right way - but that doesn't
make the patch ready.



Greetings,

Andres Freund




Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)

2017-03-22 Thread Peter Geoghegan
On Wed, Mar 22, 2017 at 5:44 AM, Robert Haas  wrote:
>> Actually, that's quite possible with the design I came up with.
>
> I don't think it is.  What sequence of calls do the APIs you've
> proposed would accomplish that goal?  I don't see anything in this
> patch set that would permit anything other than a handoff from the
> worker to the leader.  There seems to be no way for the ref count to
> be more than 1 (or 2?).

See my remarks on this below.

>> The
>> restriction that Thomas can't live with as I've left things is that
>> you have to know the number of BufFiles ahead of time. I'm pretty sure
>> that that's all it is. (I do sympathize with the fact that that isn't
>> very helpful to him, though.)
>
> I feel like there's some cognitive dissonance here.  On the one hand,
> you're saying we should use your design.

No, I'm not. I'm saying that my design is complete on its own terms,
and has some important properties that a mechanism like this ought to
have. I think I've been pretty clear on my general uncertainty about
the broader question.

> On the other hand, you are
> admitting that in at least one key respect, it won't meet Thomas's
> requirements.  On the third hand, you just said that you weren't
> arguing for two mechanisms for sharing a BufFile across cooperating
> parallel processes.  I don't see how you can hold all three of those
> positions simultaneously.

I respect your position as the person that completely owns parallelism
here. You are correct when you say that there has to be some overlap
between the requirements for the mechanisms used by each patch --
there just *has* to be. As I said, I only know very approximately how
much overlap that is or should be, even at this late date, and I am
unfortunately not in a position to spend more time on it to find out.
C'est la vie.

I know that I have no chance of convincing you to adopt my design
here, and you are right not to accept the design, because there is a
bigger picture. And, because it's just too late now. My efforts to get
ahead of that, and anticipate and provide for Thomas' requirements
have failed. I admit that. But, you are asserting that my patch has
specific technical defects that it does not have.

I structured things this way for a reason. You are not required to
agree with me in full to see that I might have had a point. I've
described it as a trade-off already. I think that it will be of
practical value to you to see that trade-off. This insight is what
allowed me to immediately zero in on resource leak bugs in Thomas'
revision of the patch from yesterday.

>> It is true that a worker resowner can unlink() the files
>> mid-unification, in the same manner as with conventional temp files,
>> and not decrement its refcount in shared memory, or care at all in any
>> special way. This is okay because the leader (in the case of parallel
>> tuplesort) will realize that it should not "turn out the lights",
>> finding that remaining reference when it calls BufFileClose() in
>> registered callback, as it alone must. It doesn't matter that the
>> unlink() may have already occurred, or may be just about to occur,
>> because we are only operating on already-opened files, and never on
>> the link itself (we don't have to stat() the file link for example,
>> which is naturally only a task for the unlink()'ing backend anyway).
>> You might say that the worker only blows away the link itself, not the
>> file proper, since it may still be open in leader (say).
>
> Well, that sounds like it's counting on fd.c not to close the file
> descriptor at an inconvenient point in time and reopen it later, which
> is not guaranteed.

It's true that in an error path, if the FD of the file we just opened
gets swapped out, that could happen. That seems virtually impossible,
and in any case the consequence is no worse than a confusing LOG
message. But, yes, that's a weakness.

>> This is the only way to avoid the unlink()-ENOENT-ignore hack, AFAICT,
>> since only the worker itself can reliably know how many segments it
>> has opened at every single instant in time. Because it's the owner!
>
> Above, you said that your design would allow for a group of processes
> to share access to a file, with the last one that abandons it "turning
> out the lights".  But here, you are referring to it as having one
> owner - the "only the worker itself" can know the number of segments.
> Those things are exact opposites of each other.

You misunderstood.

Under your analogy, the worker needs to wait for someone else to enter
the room before leaving, because otherwise, as an "environmentally
conscious" worker, it would be compelled to turn the lights out before
anyone else ever got to do anything with its files. But once someone
else is in the room, the worker is free to leave without turning out
the lights. I could provide a mechanism for the leader, or whatever
the other backend is, to do another hand off. You're right that that
is left 

Re: [HACKERS] exposing wait events for non-backends (was: Tracking wait event for latches)

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 12:20 PM, Robert Haas  wrote:
> On Wed, Mar 22, 2017 at 1:31 AM, Michael Paquier
>  wrote:
>> Okay, switched as ready for committer. One note for the committer
>> though: keeping the calls of pgstat_bestart() out of
>> BackgroundWorkerInitializeConnection() and
>> BackgroundWorkerInitializeConnectionByOid() keeps users the
>> possibility to not have backends connected to the database show up in
>> pg_stat_activity. This may matter for some users (cloud deployment for
>> example). I am as well in favor in keeping the work of those routines
>> minimal, without touching at pgstat.
>
> I think that's just inviting bugs of omission, in both core and
> extension code.  I think it'd be much better to do this in a
> centralized place.

I mean, your argument boils down to "somebody might want to
deliberately hide things from pg_stat_activity".  But that's not
really a mode we support in general, and supporting it only for
certain cases doesn't seem like something that this patch should be
about.  We could add an option to BackgroundWorkerInitializeConnection
and BackgroundWorkerInitializeConnectionByOid to suppress it, if it's
something that somebody wants, but actually I'd be more inclined to
think that everybody (who has a shared memory connection) should go
into the machinery and then security-filtering should be left to some
higher-level facility that can make policy decisions rather than being
hard-coded in the individual modules.

But I'm slightly confused as to how this even arises.  Background
workers already show up in pg_stat_activity output, or at least I sure
think they do.  So why does this patch need to make any change to that
case at all?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Stephen Frost
Robert,

* Robert Haas (robertmh...@gmail.com) wrote:
> On the topic of whether to also change the default, I'm not sure what
> is best and will defer to others.  On the topic of whether to whack
> around the file naming scheme, -1 from me.  This patch was posted
> three months ago and nobody suggested that course of action until this
> week.  Even though it is on a related topic, it is a conceptually
> separate change that is previously undiscussed and on which we do not
> have agreement.  Making those changes just before feature freeze isn't
> fair to the patch authors or people who may not have time to pay
> attention to this thread right this minute.

While I understand that you'd like to separate the concerns between
changing the renaming scheme and changing the default and enabling this
option, I don't agree that they can or should be independently
considered.

This is, in my view, basically the only opportunity we will have to
change the naming scheme because once we make it an initdb option, while
I don't think very many people will use it, there will be people who
will and the various tool authors will also have to adjust to handle
those cases.  Chances are good that we will even see people start to
recommend using that initdb option, leading to more people using a
different default, at which point we simply are not going to be able to
consider changing the naming scheme.

Therefore, I would much rather we take this opportunity to change the
naming scheme and the default at the same time to be more sane, because
if we have this patch as-is in PG10, we won't be able to do so in the
future without a great deal more pain.

I'm willing to forgo the ability to change the WAL size with just a
server restart for PG10 because that's something which can clearly be
added later without any concerns about backwards-compatibility, but the
same is not true regarding the naming scheme.

Thanks!

Stephen


signature.asc
Description: Digital signature


Re: [HACKERS] [COMMITTERS] pgsql: Sync pg_dump and pg_dumpall output

2017-03-22 Thread Andrew Dunstan


On 03/22/2017 12:10 PM, Stephen Frost wrote:
>> Still I agree that we should have tests for both cases.
> Perhaps, though if I recall correctly, we've basically had zero calls
> for fsync() until this.  If we don't feel like we need to test that in
> the backend then it seems a bit silly to feel like we need it for
> pg_dump's regression coverage.
>
> That said, perhaps the right answer is to have a couple tests for both
> the backend and for pg_dump which do exercise the fsync-enabled paths.
>
>


+1

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services





Re: [HACKERS] exposing wait events for non-backends (was: Tracking wait event for latches)

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 1:31 AM, Michael Paquier
 wrote:
> Okay, switched as ready for committer. One note for the committer
> though: keeping the calls of pgstat_bestart() out of
> BackgroundWorkerInitializeConnection() and
> BackgroundWorkerInitializeConnectionByOid() keeps users the
> possibility to not have backends connected to the database show up in
> pg_stat_activity. This may matter for some users (cloud deployment for
> example). I am as well in favor in keeping the work of those routines
> minimal, without touching at pgstat.

I think that's just inviting bugs of omission, in both core and
extension code.  I think it'd be much better to do this in a
centralized place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Logical decoding on standby

2017-03-22 Thread Simon Riggs
On 22 March 2017 at 08:53, Craig Ringer  wrote:

> I'm splitting up the rest of the decoding on standby patch set with
> the goal of getting minimal functionality for creating and managing
> slots on standbys in, so we can maintain slots on standbys and use
> them when the standby is promoted to master.
>
> The first, to send catalog_xmin separately to the global xmin on
> hot_standby_feedback and store it in the upstream physical slot's
> catalog_xmin, is attached.
>
> These are extracted directly from the logical decoding on standby
> patch, with comments by Petr and Andres made re the relevant code
> addressed.

I've reduced your two patches back to one with a smaller blast radius.

I'll commit this tomorrow morning, barring objections.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Report-catalog_xmin-separately-to-xmin-in-hot-standb.v2.patch
Description: Binary data



Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Robert Haas
On Wed, Mar 22, 2017 at 8:46 AM, Stephen Frost  wrote:
>> I was definitely initially in favor of
>> raising the value, but I got cold feet, a bit, when Alvaro pointed out
>> that going to 64MB would require a substantial increase in
>> min_wal_size.
>
> The performance concern of having 3 segments is a red herring here if
> we're talking about a default install because the default for
> max_wal_size is 1G, not 192MB.  I do think increasing the default WAL
> size would be valuable to do even if it does mean a default install will
> take up a bit more space.

min_wal_size isn't the same thing as max_wal_size.

> I didn't see much discussion of it, but if this is really a concern then
> couldn't we set the default to be 2 segments worth instead of 3 also?
> That would mean an increase from 80MB to 128MB in the default install if
> you never touch more than 128MB during a checkpoint.

Not sure.  Need testing.

>> I'm a little worried that this whole question of changing the file
>> naming scheme is a diversion which will result in torpedoing any
>> chance of getting some kind of improvement here for v11.  I don't
>> think the patch is all that far from being committable but it's not
>> going to get there if we start redesigning the world around it.
>
> It's not my intent to 'torpedo' this patch but I'm pretty disappointed
> that we're introducing yet another initdb-time option with, as far as I
> can tell, no option to change it after the cluster has started (without
> some serious hackery), and continuing to have a poor default, which is
> what most users will end up with.
>
> I really don't like these kinds of options.  I'd much rather have a
> reasonable default that covers most cases and is less likely to be a
> problem for most systems than have a minimal setting that's impossible
> to change after you've got your data in the system.  As much as I'd like
> everyone to talk to me before doing an initdb, that's pretty rare and
> instead we end up having to break the bad news that they should have
> known better and done the right thing at initdb time and, no, sorry,
> there's no answer today but to dump out all of the data and load it into
> a new cluster which was set up with the right initdb settings.

Well, right now, the alternative is to recompile the server, so I
think being able to make the change at initdb time is pretty [ insert
a word of your choice here ] great by comparison.  Now, I completely
agree that initdb-time configurability is inferior to server-restart
configurability which is obviously inferior to on-the-fly
reconfigurability, but we are not going to get either of those latter
two things in v10, so I think we should take the one we can get, which
is clearly better than what we've got now.   In the future, if
somebody is willing to put in the time and energy to allow this to be
changed via a pg_resetxlog-like procedure, or even on the fly by some
ALTER SYSTEM command, we can consider those changes then, but
everything this patch does will still be necessary.

On the topic of whether to also change the default, I'm not sure what
is best and will defer to others.  On the topic of whether to whack
around the file naming scheme, -1 from me.  This patch was posted
three months ago and nobody suggested that course of action until this
week.  Even though it is on a related topic, it is a conceptually
separate change that is previously undiscussed and on which we do not
have agreement.  Making those changes just before feature freeze isn't
fair to the patch authors or people who may not have time to pay
attention to this thread right this minute.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] increasing the default WAL segment size

2017-03-22 Thread Kuntal Ghosh
On Wed, Mar 22, 2017 at 3:14 PM, Beena Emerson  wrote:
> PFA an updated patch which fixes a minor bug I found. It only increases the
> string size in pretty_wal_size function.
> The 01-add-XLogSegmentOffset-macro.patch has also been rebased.
Thanks for the updated versions. Here is a partial review of the patch:

In pg_standby.c and pg_waldump.c,
+ XLogPageHeader hdr = (XLogPageHeader) buf;
+ XLogLongPageHeader NewLongPage = (XLogLongPageHeader) hdr;
+
+ XLogSegSize = NewLongPage->xlp_seg_size;
It waits until the file is at least XLOG_BLCKSZ, then gets the
expected final size from XLogPageHeader. This looks really clean
compared to the previous approach.
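Outside the tree, the approach can be sketched in a few lines. The struct layout below is a simplified stand-in, not the real definitions, which live in access/xlog_internal.h and carry magic-number and CRC checks:

```c
#include <stdint.h>

/* Simplified stand-ins for the on-disk WAL page headers; the real
 * XLogPageHeaderData/XLogLongPageHeaderData have additional fields
 * and are validated against xlp_magic before use. */
typedef struct
{
	uint16_t	xlp_magic;
	uint16_t	xlp_info;
	uint32_t	xlp_tli;
	uint64_t	xlp_pageaddr;
	uint32_t	xlp_rem_len;
} MockXLogPageHeaderData;

typedef struct
{
	MockXLogPageHeaderData std;
	uint64_t	xlp_sysid;
	uint32_t	xlp_seg_size;	/* segment size recorded at segment start */
	uint32_t	xlp_xlog_blcksz;
} MockXLogLongPageHeaderData;

/*
 * Once at least XLOG_BLCKSZ bytes of the segment's first page have been
 * read into buf, the expected final segment size can be read straight
 * out of the long page header, mirroring what the patch does in
 * pg_standby and pg_waldump.
 */
static uint32_t
seg_size_from_first_page(const char *buf)
{
	const MockXLogLongPageHeaderData *longhdr =
		(const MockXLogLongPageHeaderData *) buf;

	return longhdr->xlp_seg_size;
}
```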

+ * Verify that the min and max wal_size meet the minimum requirements.
Better to write min_wal_size and max_wal_size.

+ errmsg("Insufficient value for \"min_wal_size\"")));
"min_wal_size %d is too low" may be? Use lower case for error
messages. Same for max_wal_size.

+ /* Set the XLogSegSize */
+ XLogSegSize = ControlFile->xlog_seg_size;
+
A call to IsValidXLogSegSize() will be good after this, no?

+ /* Update variables using XLogSegSize */
+ check_wal_size();
The method name looks somewhat misleading compared to the comment for
it, doesn't it?

+ * allocating space and reading ControlFile.
s/and/for

+ {"TB", GUC_UNIT_MB, 1024 * 1024},
+ {"GB", GUC_UNIT_MB, 1024},
+ {"MB", GUC_UNIT_MB, 1},
+ {"kB", GUC_UNIT_MB, -1024},
@@ -2235,10 +2231,10 @@ static struct config_int ConfigureNamesInt[] =
  {"min_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
  gettext_noop("Sets the minimum size to shrink the WAL to."),
  NULL,
- GUC_UNIT_XSEGS
+ GUC_UNIT_MB
  },
- &min_wal_size,
- 5, 2, INT_MAX,
+ &min_wal_size_mb,
+ DEFAULT_MIN_WAL_SEGS * 16, 2, INT_MAX,
  NULL, NULL, NULL
  },

@@ -2246,10 +2242,10 @@ static struct config_int ConfigureNamesInt[] =
  {"max_wal_size", PGC_SIGHUP, WAL_CHECKPOINTS,
  gettext_noop("Sets the WAL size that triggers a checkpoint."),
  NULL,
- GUC_UNIT_XSEGS
+ GUC_UNIT_MB
  },
- &max_wal_size,
- 64, 2, INT_MAX,
+ &max_wal_size_mb,
+ DEFAULT_MAX_WAL_SEGS * 16, 2, INT_MAX,
  NULL, assign_max_wal_size, NULL
  },
This patch introduces a new guc_unit having values in MB for
max_wal_size and min_wal_size. I'm not sure about the upper limit
which is set to INT_MAX for 32-bit systems as well. Is it needed to
define something like MAX_MEGABYTES similar to MAX_KILOBYTES?
It is worth mentioning that GUC_UNIT_KB can't be used in this case
since MAX_KILOBYTES is INT_MAX / 1024(<2GB) on 32-bit systems. That's
not a sufficient value for min_wal_size/max_wal_size.
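The kB-unit ceiling is easy to verify with a couple of lines; MAX_KILOBYTES is INT_MAX / 1024, as defined in guc.h:

```c
#include <limits.h>

/* MAX_KILOBYTES as defined in src/include/utils/guc.h */
#define MAX_KILOBYTES (INT_MAX / 1024)

/* Largest value a GUC_UNIT_KB setting can represent, in bytes. */
static long long
max_kb_guc_bytes(void)
{
	return (long long) MAX_KILOBYTES * 1024;
}
```

On a platform with 32-bit int this comes out just under 2 GB, so a min_wal_size/max_wal_size of 2 GB or more cannot be expressed in kB units, which is why a MB-based unit (and possibly a matching MAX_MEGABYTES cap) is needed.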

While testing with pg_waldump, I got the following error.
bin/pg_waldump -p master/pg_wal/ -s 0/0100
Floating point exception (core dumped)
Stack:
#0  0x004039d6 in ReadPageInternal ()
#1  0x00404c84 in XLogFindNextRecord ()
#2  0x00401e08 in main ()
I think that the problem is in following code:
/* parse files as start/end boundaries, extract path if not specified */
if (optind < argc)
{

+ if (!RetrieveXLogSegSize(full_path))
...
}
In this case, RetrieveXLogSegSize is conditionally called. So, if the
condition is false, XLogSegSize will not be initialized.
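With XLogSegSize left at zero, the first modulo or division by it crashes with the floating point exception seen above. A minimal sketch of the kind of guard that avoids this is below; the exact bounds are assumptions here, and the real check is the patch's IsValidXLogSegSize():

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical validity check mirroring what IsValidXLogSegSize() should
 * guarantee before XLogSegSize is used as a divisor: nonzero, a power of
 * two, and within assumed 1MB..1GB bounds.
 */
static bool
seg_size_is_valid(uint32_t seg_size)
{
	return seg_size >= (1024 * 1024) &&
		   seg_size <= (1024 * 1024 * 1024) &&
		   (seg_size & (seg_size - 1)) == 0;
}
```

Calling such a check (or unconditionally calling RetrieveXLogSegSize) before XLogFindNextRecord would turn the core dump into a clean error message.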

I'm yet to review pg_basebackup module and I'll try to finish that ASAP.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [COMMITTERS] pgsql: Sync pg_dump and pg_dumpall output

2017-03-22 Thread Stephen Frost
Andrew,

* Andrew Dunstan (andrew.duns...@2ndquadrant.com) wrote:
> On 03/22/2017 11:39 AM, Stephen Frost wrote:
> > * Andrew Dunstan (and...@dunslane.net) wrote:
> >> Sync pg_dump and pg_dumpall output
> > This probably should have adjusted all callers of pg_dump in the
> > regression tests to use the --no-sync option, otherwise we'll end up
> > spending possibly a good bit of time calling fsync() during the
> > regression tests unnecessarily.
> 
> All of them? The impact is not likely to be huge in most cases
> (possibly different on Windows). On crake, the bin-check stage actually
> took less time after the change than before, so I suspect that the
> impact will be pretty small.

Well, perhaps not all, but I'd think --no-sync would be better to have
in most cases.  We turn off fsync for all of the TAP tests and all
initdb calls, so it seems like we should here too.  Perhaps it won't be
much overhead on an unloaded system, but not all of the buildfarm
animals seem to be on unloaded systems, nor are they particularly fast
in general or equipped with fast I/O.

> Still I agree that we should have tests for both cases.

Perhaps, though if I recall correctly, we've basically had zero calls
for fsync() until this.  If we don't feel like we need to test that in
the backend then it seems a bit silly to feel like we need it for
pg_dump's regression coverage.

That said, perhaps the right answer is to have a couple tests for both
the backend and for pg_dump which do exercise the fsync-enabled paths.

Thanks!

Stephen




Re: [HACKERS] PATCH: recursive json_populate_record()

2017-03-22 Thread Nikita Glukhov

On 22.03.2017 00:26, David Steele wrote:


On 3/21/17 2:31 PM, Andrew Dunstan wrote:

On 03/21/2017 01:37 PM, David Steele wrote:

>>

This thread has been idle for months since Tom's review.

The submission has been marked "Returned with Feedback". Please feel
free to resubmit to a future commitfest.


Please revive. I am planning to look at this later this week.


Revived in "Waiting on Author" state.



Here is updated v05 version of this patch:
  * rebased to the latest master
  * one patch is completely removed because it is unnecessary now
  * added some macros for JsValue fields access
  * added new JsObject structure for passing json/jsonb objects
  * refactoring patch is not yet simplified (not broken into a 
step-by-step sequence)


Also, I should note that the json branch of this code is not as optimal as
it could be:

there can be repeated parsing passes over nested json objects/arrays
instead of a single parsing pass.

--
Nikita Glukhov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
index bf2c91f..da29dab 100644
--- a/src/backend/utils/adt/jsonfuncs.c
+++ b/src/backend/utils/adt/jsonfuncs.c
@@ -109,6 +109,7 @@ typedef struct JhashState
 	HTAB	   *hash;
 	char	   *saved_scalar;
 	char	   *save_json_start;
+	JsonTokenType saved_token_type;
 } JHashState;
 
 /* hashtable element */
@@ -116,26 +117,49 @@ typedef struct JsonHashEntry
 {
 	char		fname[NAMEDATALEN];		/* hash key (MUST BE FIRST) */
 	char	   *val;
-	char	   *json;
-	bool		isnull;
+	JsonTokenType type;
 } JsonHashEntry;
 
-/* these two are stolen from hstore / record_out, used in populate_record* */
-typedef struct ColumnIOData
+/* structure to cache type I/O metadata needed for populate_scalar() */
+typedef struct ScalarIOData
 {
-	Oid			column_type;
-	Oid			typiofunc;
 	Oid			typioparam;
-	FmgrInfo	proc;
-} ColumnIOData;
+	FmgrInfo	typiofunc;
+} ScalarIOData;
+
+typedef struct ColumnIOData ColumnIOData;
+typedef struct RecordIOData RecordIOData;
 
-typedef struct RecordIOData
+/* structure to cache metadata needed for populate_composite() */
+typedef struct CompositeIOData
+{
+	/*
+	 * We use pointer to a RecordIOData here because variable-length
+	 * struct RecordIOData can't be used directly in ColumnIOData.io union
+	 */
+	RecordIOData   *record_io;	/* metadata cache for populate_record() */
+	TupleDesc		tupdesc;	/* cached tuple descriptor */
+} CompositeIOData;
+
+/* these two are stolen from hstore / record_out, used in populate_record* */
+
+/* structure to cache record metadata needed for populate_record_field() */
+struct ColumnIOData
+{
+	Oid			typid;		/* column type id */
+	int32		typmod;		/* column type modifier */
+	ScalarIOData scalar_io;	/* metadata cache for direct conversion
+			 * through input function */
+};
+
+/* structure to cache record metadata needed for populate_record() */
+struct RecordIOData
 {
 	Oid			record_type;
 	int32		record_typmod;
 	int			ncolumns;
 	ColumnIOData columns[FLEXIBLE_ARRAY_MEMBER];
-} RecordIOData;
+};
 
 /* state for populate_recordset */
 typedef struct PopulateRecordsetState
@@ -145,10 +169,11 @@ typedef struct PopulateRecordsetState
 	HTAB	   *json_hash;
 	char	   *saved_scalar;
 	char	   *save_json_start;
+	JsonTokenType saved_token_type;
 	Tuplestorestate *tuple_store;
 	TupleDesc	ret_tdesc;
 	HeapTupleHeader rec;
-	RecordIOData *my_extra;
+	RecordIOData **my_extra;
 	MemoryContext fn_mcxt;		/* used to stash IO funcs */
 } PopulateRecordsetState;
 
@@ -160,6 +185,55 @@ typedef struct StripnullState
 	bool		skip_next_null;
 } StripnullState;
 
+/* structure for generalized json/jsonb value passing */
+typedef struct JsValue
+{
+	bool is_json;/* json/jsonb */
+	union
+	{
+		struct
+		{
+			char   *str;		/* json string */
+			int		len;		/* json string length or -1 if null-terminated */
+			JsonTokenType type;	/* json type */
+		} json;	/* json value */
+
+		JsonbValue *jsonb;		/* jsonb value */
+	} val;
+} JsValue;
+
+typedef struct JsObject
+{
+	bool		is_json;		/* json/jsonb */
+	union
+	{
+		HTAB		   *json_hash;
+		JsonbContainer *jsonb_cont;
+	} val;
+} JsObject;
+
+#define JsValueIsNull(jsv) \
+	((jsv)->is_json ?  \
+		(!(jsv)->val.json.str || (jsv)->val.json.type == JSON_TOKEN_NULL) : \
+		(!(jsv)->val.jsonb || (jsv)->val.jsonb->type == jbvNull))
+
+#define JsValueIsString(jsv) \
+	((jsv)->is_json ? (jsv)->val.json.type == JSON_TOKEN_STRING \
+		: ((jsv)->val.jsonb && (jsv)->val.jsonb->type == jbvString))
+
+#define JsObjectSize(jso) \
+	((jso)->is_json \
+		? hash_get_num_entries((jso)->val.json_hash) \
+		: !(jso)->val.jsonb_cont || JsonContainerSize((jso)->val.jsonb_cont))
+
+#define JsObjectIsEmpty(jso) (JsObjectSize(jso) == 0)
+
+#define JsObjectFree(jso) do { \
+		if ((jso)->is_json) \
+			hash_destroy((jso)->val.json_hash); \
+	} while (0)
+
+
 /* semantic action functions for json_object_keys */
 static void okeys_object_field_start(void 

Re: [HACKERS] [COMMITTERS] pgsql: Sync pg_dump and pg_dumpall output

2017-03-22 Thread Andrew Dunstan


On 03/22/2017 11:39 AM, Stephen Frost wrote:
> Andrew,
>
> * Andrew Dunstan (and...@dunslane.net) wrote:
>> Sync pg_dump and pg_dumpall output
> This probably should have adjusted all callers of pg_dump in the
> regression tests to use the --no-sync option, otherwise we'll end up
> spending possibly a good bit of time calling fsync() during the
> regression tests unnecessarily.
>


All of them? The impact is not likely to be huge in most cases
(possibly different on Windows). On crake, the bin-check stage actually
took less time after the change than before, so I suspect that the
impact will be pretty small.

Still I agree that we should have tests for both cases.

Michael, do you want to look at that? If not, I'll take a look but it
will probably be next week before I get to it.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




