Re: [HACKERS] Proposal for changes to recovery.conf API

2016-08-31 Thread Michael Paquier
On Thu, Sep 1, 2016 at 4:01 AM, Abhijit Menon-Sen  wrote:
> At 2016-08-31 17:15:59 +0100, si...@2ndquadrant.com wrote:
>>
>> * Recovery parameters would now be part of the main postgresql.conf
>> infrastructure
>> Any parameters set in $DATADIR/recovery.conf will be read after the
>> main parameter file, similar to the way that postgresql.conf.auto is
>> read.
>> (Abhijit)
>>
>> * Parameters
>> All of the parameters formerly set in recovery.conf can be set in
>> postgresql.conf using RELOAD
>> These parameters will have no defaults in postgresql.conf.sample
>> Setting them has no effect during normal running, or once recovery ends.
>>  https://www.postgresql.org/docs/devel/static/archive-recovery-settings.html
>>  https://www.postgresql.org/docs/devel/static/recovery-target-settings.html
>>  https://www.postgresql.org/docs/devel/static/standby-settings.html
>> (Abhijit)
>
> I've attached a WIP patch for the above (recovery_guc_v20160831.patch).
> This was based on the unite_recoveryconf_postgresqlconf_v3.patch posted
> by Fujii Masao.
>
> Unfortunately, some parts conflict with the patch that Simon just posted
> (e.g., his patch removes trigger_file altogether, whereas mine converts
> it into a GUC along the lines of the original patch). Rather than trying
> to untangle that right now, I'm posting what I have as-is, and I'll post
> an updated version tomorrow.

-   else if (recoveryTarget == RECOVERY_TARGET_XID)
-   ereport(LOG,
-   (errmsg("starting point-in-time recovery to XID %u",
-   recoveryTargetXid)));
The user loses information if those log messages are removed.

+   {"recovery_min_apply_delay", PGC_SIGHUP, WAL_RECOVERY_TARGET,
+   gettext_noop("Sets the minimum delay to apply changes
during recovery."),
+   NULL,
+   GUC_UNIT_MS
+   },
What's the point of having them all as SIGHUP? The startup process
does not reload GUC parameters with ProcessConfigFile(); it works on
recoveryWakeupLatch instead. I'd rather leave them as PGC_POSTMASTER to
begin with, as that's the safest approach, and then have a second patch
that calls ProcessConfigFile() in the startup process and switches
some of them to PGC_SIGHUP. Recovery targets should not be SIGHUP, but
recovery_min_apply_delay definitely qualifies, as do
archive_cleanup_command and restore_command.
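
Roughly what I have in mind, reusing the entry quoted above (illustrative
only; the variable, boot value, range and hooks below are placeholders,
not taken from the patch):

    {
        {"recovery_min_apply_delay", PGC_POSTMASTER, WAL_RECOVERY_TARGET,
            gettext_noop("Sets the minimum delay to apply changes during recovery."),
            NULL,
            GUC_UNIT_MS
        },
        &recovery_min_apply_delay,
        0, 0, INT_MAX,
        NULL, NULL, NULL
    },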

+static void
+assign_recovery_target_name(const char *newval, void *extra)
+{
+   if (recovery_target_xid != InvalidTransactionId)
+   recovery_target = RECOVERY_TARGET_XID;
+   else if (newval[0])
+   recovery_target = RECOVERY_TARGET_NAME;
+   else if (recovery_target_time != 0)
+   recovery_target = RECOVERY_TARGET_TIME;
+   else
+   recovery_target = RECOVERY_TARGET_UNSET;
+}
It's brittle to put that in the GUC machinery... The recovery target
type should be determined only once, at the beginning of the startup
process if archive recovery is wanted, and set up entirely within the
startup process. If multiple recovery_target_* parameters are set, we
should define the target type in order of priority, instead of the
last-one-wins behavior that is currently present.
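
Something along these lines, done once in the startup process (a rough
sketch only, and the priority order itself is open for discussion):

    /* Resolve the target type once, by priority, not in GUC assign hooks. */
    if (recovery_target_xid != InvalidTransactionId)
        recovery_target = RECOVERY_TARGET_XID;
    else if (recovery_target_time != 0)
        recovery_target = RECOVERY_TARGET_TIME;
    else if (recovery_target_name != NULL && recovery_target_name[0] != '\0')
        recovery_target = RECOVERY_TARGET_NAME;
    else
        recovery_target = RECOVERY_TARGET_UNSET;
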
-- 
Michael




[HACKERS] Add support for restrictive RLS policies

2016-08-31 Thread Stephen Frost
Greetings,

As outlined in the commit message, this adds support for restrictive RLS
policies.  We've had this in the backend since 9.5, but they were only
available via hooks and therefore extensions.  This adds support for
configuring them through regular DDL commands.  These policies are,
essentially, "AND"ed together instead of "OR"ed.

Includes updates to the catalog, grammar, psql, pg_dump, and regression
tests.  Documentation will be added soon, but until then it would be
great to get feedback on the grammar, catalog, and code changes.
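
To make the semantics concrete, something like this with the grammar from
the attached patch (table and column names are made up; the syntax itself
is exactly what I'd like feedback on):

    -- Permissive policies are OR'ed together: a row is visible if any one allows it.
    CREATE POLICY p_own ON accounts
        USING (owner = current_user);

    -- A restrictive policy is AND'ed on top: rows must additionally satisfy it.
    CREATE RESTRICT POLICY p_active ON accounts
        USING (active);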

Thanks!

Stephen
From f4195e9c109d8323266419e487eed2b4cbaafdef Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Thu, 1 Sep 2016 02:11:30 -0400
Subject: [PATCH] Add support for restrictive RLS policies

We have had support for restrictive RLS policies since 9.5, but they
were only available through extensions which use the appropriate hooks.
This adds support into the grammar, catalog, psql and pg_dump for
restrictive RLS policies, thus reducing the cases where an extension is
necessary.
---
 src/backend/commands/policy.c |   9 ++
 src/backend/nodes/copyfuncs.c |   1 +
 src/backend/nodes/equalfuncs.c|   1 +
 src/backend/parser/gram.y |  15 +++
 src/backend/rewrite/rowsecurity.c |   7 +-
 src/bin/pg_dump/pg_dump.c |  39 --
 src/bin/pg_dump/pg_dump.h |   1 +
 src/bin/psql/describe.c   | 109 
 src/bin/psql/tab-complete.c   |   1 +
 src/include/catalog/pg_policy.h   |  16 ++-
 src/include/nodes/parsenodes.h|   1 +
 src/include/rewrite/rowsecurity.h |   1 +
 src/test/regress/expected/rowsecurity.out | 207 ++
 src/test/regress/sql/rowsecurity.sql  |  22 +++-
 14 files changed, 332 insertions(+), 98 deletions(-)

diff --git a/src/backend/commands/policy.c b/src/backend/commands/policy.c
index d694cf8..70e22c1 100644
--- a/src/backend/commands/policy.c
+++ b/src/backend/commands/policy.c
@@ -235,6 +235,7 @@ RelationBuildRowSecurity(Relation relation)
 		{
 			Datum		value_datum;
 			char		cmd_value;
+			bool		permissive_value;
 			Datum		roles_datum;
 			char	   *qual_value;
 			Expr	   *qual_expr;
@@ -257,6 +258,12 @@ RelationBuildRowSecurity(Relation relation)
 			Assert(!isnull);
 			cmd_value = DatumGetChar(value_datum);
 
+			/* Get policy permissive or restrictive */
+			value_datum = heap_getattr(tuple, Anum_pg_policy_polpermissive,
+	   RelationGetDescr(catalog), &isnull);
+			Assert(!isnull);
+			permissive_value = DatumGetBool(value_datum);
+
 			/* Get policy name */
 			value_datum = heap_getattr(tuple, Anum_pg_policy_polname,
 	   RelationGetDescr(catalog), &isnull);
@@ -298,6 +305,7 @@ RelationBuildRowSecurity(Relation relation)
 			policy = palloc0(sizeof(RowSecurityPolicy));
 			policy->policy_name = pstrdup(policy_name_value);
 			policy->polcmd = cmd_value;
+			policy->permissive = permissive_value;
 			policy->roles = DatumGetArrayTypePCopy(roles_datum);
 			policy->qual = copyObject(qual_expr);
 			policy->with_check_qual = copyObject(with_check_qual);
@@ -796,6 +804,7 @@ CreatePolicy(CreatePolicyStmt *stmt)
 	values[Anum_pg_policy_polname - 1] = DirectFunctionCall1(namein,
 		 CStringGetDatum(stmt->policy_name));
 	values[Anum_pg_policy_polcmd - 1] = CharGetDatum(polcmd);
+	values[Anum_pg_policy_polpermissive - 1] = BoolGetDatum(stmt->permissive);
 	values[Anum_pg_policy_polroles - 1] = PointerGetDatum(role_ids);
 
 	/* Add qual if present. */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 1877fb4..4fc9525 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -4150,6 +4150,7 @@ _copyCreatePolicyStmt(const CreatePolicyStmt *from)
 	COPY_STRING_FIELD(policy_name);
 	COPY_NODE_FIELD(table);
 	COPY_STRING_FIELD(cmd_name);
+	COPY_SCALAR_FIELD(permissive);
 	COPY_NODE_FIELD(roles);
 	COPY_NODE_FIELD(qual);
 	COPY_NODE_FIELD(with_check);
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 448e1a9..3e4e15b 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -2122,6 +2122,7 @@ _equalCreatePolicyStmt(const CreatePolicyStmt *a, const CreatePolicyStmt *b)
 	COMPARE_STRING_FIELD(policy_name);
 	COMPARE_NODE_FIELD(table);
 	COMPARE_STRING_FIELD(cmd_name);
+	COMPARE_SCALAR_FIELD(permissive);
 	COMPARE_NODE_FIELD(roles);
 	COMPARE_NODE_FIELD(qual);
 	COMPARE_NODE_FIELD(with_check);
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index cb5cfc4..a79a1e6 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -4633,11 +4633,26 @@ CreatePolicyStmt:
 	n->policy_name = $3;
 	n->table = $5;
 	n->cmd_name = $6;
+	n->permissive = true;
 	n->roles = $7;
 	n->qual = $8;
 	n->with_check = $9;
 	$$ = (Node *) n;
 }
+			| CREATE RESTRICT POLICY name ON qualified_name RowSecurit

Re: [HACKERS] Logical Replication WIP

2016-08-31 Thread Erik Rijkers

On 2016-09-01 01:04, Erik Rijkers wrote:

On 2016-08-31 22:51, Petr Jelinek wrote:

Here are some small changes to  logical-replication.sgml


...  and other .sgml files.

Erik Rijkers
--- doc/src/sgml/ref/alter_publication.sgml.orig	2016-09-01 07:20:03.280295807 +0200
+++ doc/src/sgml/ref/alter_publication.sgml	2016-09-01 07:34:43.448625900 +0200
@@ -42,15 +42,15 @@
 
   
The first variant of this command listed in the synopsis can change
-   all of the publication attributes that can be specified in
+   all of the publication attributes specified in
.
Attributes not mentioned in the command retain their previous settings.
Database superusers can change any of these settings for any role.
   
 
   
-   The other variants this command deal with table membership in the
-   publication. The FOR TABLE subscommand will replace
+   The other variants of this command deal with table membership in the
+   publication. The FOR TABLE subcommand will replace
the current list of tables in the publication with the specified one.
The FOR ALL TABLES variant will mark the publication
as one that replicates all tables in the database (including the future
--- doc/src/sgml/ref/drop_publication.sgml.orig	2016-09-01 07:43:26.599160896 +0200
+++ doc/src/sgml/ref/drop_publication.sgml	2016-09-01 07:44:16.051797277 +0200
@@ -21,7 +21,7 @@
 
  
 
-DROP PUBLCATION name [, ...]
+DROP PUBLICATION name [, ...]
 
  
 
@@ -29,7 +29,7 @@
   Description
 
   
-   DROP PUBLCATION removes publications from the database.
+   DROP PUBLICATION removes publications from the database.
   
 
   



Re: [HACKERS] Proposal for changes to recovery.conf API

2016-08-31 Thread Michael Banck
On Wed, Aug 31, 2016 at 05:15:59PM +0100, Simon Riggs wrote:
> Recover mode starts the server as a standby server which
> expects to receive changes from a primary/master server using physical
> streaming replication or is used for performing a recovery from
> backup. 

I understand where this is coming from, but wouldn't "standby",
"replication" or something be more generally understood than "recover"?
Streaming replication got bolted on to recovery, but that seems like an
implementation detail by now.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer




Re: [HACKERS] Proposal for changes to recovery.conf API

2016-08-31 Thread Michael Paquier
On Thu, Sep 1, 2016 at 1:15 AM, Simon Riggs  wrote:
> This is a summary of proposed changes to the recovery.conf API for
> v10. These are based in part on earlier discussions, and represent a
> minimal modification to current usage.
>
> Proposed changes (with reference to patches from Abhijit Menon-Sen and myself)
>
> * pg_ctl start -M (normal|recover|continue)
> pg_ctl can now start the server directly as a standby, similar to the
> way pg_ctl promote works. The internal implementation is also similar,
> with pg_ctl writing a "recovery.trigger" file that initiates recovery
> in the same way recovery.conf used to do. It is still possible to
> manually add a file called "recovery.trigger" and have that work if
> users desire that mechanism.
> Different startup methods can be selected with the -M
> option. Normal mode starts the server for read/write,
> overriding any previous use in Recover mode.
> Recover mode starts the server as a standby server which
> expects to receive changes from a primary/master server using physical
> streaming replication or is used for
> performing a recovery from backup. Continue mode is
> the default and will startup the server in whatever mode it was in at
> last proper shutdown, or as modified by any trigger files present.
> (Patch: recovery_startup_r10_api.v1b.patch)

So you basically just set ArchiveRecoveryRequested based on the
presence of recovery.trigger instead of recovery.conf. And the
interface of pg_ctl:
- recover mode => creates recovery.trigger
- continue mode => does nothing
- normal mode => removes recovery.trigger
That looks like a sound design.
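
So from the user's point of view it would look roughly like this
(illustrative commands only, per the description above):

    pg_ctl start -M recover    # writes recovery.trigger, start as standby/archive recovery
    pg_ctl start -M normal     # removes recovery.trigger, start read/write
    pg_ctl start -M continue   # default: start in whatever mode the last shutdown left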

> * Recovery parameters would now be part of the main postgresql.conf
> infrastructure
> Any parameters set in $DATADIR/recovery.conf will be read after the
> main parameter file, similar to the way that postgresql.conf.auto is
> read.
> (Abhijit)
> * pg_basebackup -R will continue to generate a parameter file called
> recovery.conf as it does now, but will also create a file named
> recovery.trigger.
> (This part is WIP; patch doesn't include implementation for tar format yet)

Or we could simply drop this dependency on recovery.conf altogether.
I see no need to keep it, or to recommend its use, if recovery is
triggered by recovery.trigger. We could instead add
include_if_exists 'recovery.conf' at the bottom of postgresql.conf and
let the existing infrastructure do the rest, which would simplify the patch.

> * Parameters
> All of the parameters formerly set in recovery.conf can be set in
> postgresql.conf using RELOAD
> These parameters will have no defaults in postgresql.conf.sample
> Setting them has no effect during normal running, or once recovery ends.
>  https://www.postgresql.org/docs/devel/static/archive-recovery-settings.html
>  https://www.postgresql.org/docs/devel/static/recovery-target-settings.html
>  https://www.postgresql.org/docs/devel/static/standby-settings.html
> (Abhijit)

Hm. I think that what would make sense here is a new GUC category,
meaning that once recovery is launched the new parameters are not
taken into account again. Even updating primary_conninfo would
require a restart of the WAL receiver, so we could just make them
PGC_POSTMASTER and be done with it.

> Related cleanup
> * Promotion signal file is now called "promote.trigger" rather than
> just "promote"

If it's not strictly necessary, I don't see a need for this renaming.

> * Remove user configurable "trigger_file" mechanism - use
> "promote.trigger" for all cases

Ugh. I am -1 on that. There are likely backup tools and users that
rely on this option, particularly to be able to trigger promotion
using a file that is on a different partition than PGDATA.

> * Remove Fallback promote mechanism, so all promotion is now "fast" in xlog.c

No problem with that, though others surely have other opinions. That
could be addressed as a separate patch.
-- 
Michael




Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2016-08-31 Thread Craig Ringer
On 1 September 2016 at 13:08, Craig Ringer  wrote:
> On 29 August 2016 at 15:53, Craig Ringer  wrote:
>
>> Said better approach attached in revised series. Thanks.
>
> Here's another minor update to the txid_status() and
> txid_convert_if_recent() patches. The only change is moving
> get_xid_in_recent_past from src/backend/utils/adt/txid.c to
> src/backend/access/transam/xlog.c to permit its use by other code.
> Specifically, I think it'll be needed for logical decoding on standby.

Ahem, wrong patches. Attached correctly now.
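
For anyone skimming, the intended usage is roughly this (a sketch only;
see the patch below for the actual behavior):

    BEGIN;
    SELECT txid_current();           -- client stores this 64-bit value
    -- ... work ...
    COMMIT;                          -- connection may be lost right here

    -- after reconnecting, check what happened to that transaction:
    SELECT txid_status(:saved_xid);  -- 'committed', 'aborted', 'in progress', or NULL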



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 1e81a5a7c8a450925919903ab68b70926af9d69d Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Fri, 19 Aug 2016 14:44:15 +0800
Subject: [PATCH 1/3] Introduce txid_status(bigint) to get status of an xact

If an application is disconnected while a COMMIT request is in flight,
the backend crashes mid-commit, etc., then an application may not be
sure whether a commit completed successfully or was rolled back. While
two-phase commit solves this, it does so at considerable overhead, so
introduce a lighter alternative.

txid_status(bigint) lets an application determine the status of a
commit based on an xid-with-epoch as returned by txid_current() or
similar. Status may be committed, aborted, in-progress (including
prepared xacts), or null if the xact is too old for its commit status
to still be retained because it has passed the wrap-around epoch
boundary.

Applications must call txid_current() in their transactions to make
much use of this since PostgreSQL does not automatically report an xid
to the client when one is assigned.
---
 doc/src/sgml/func.sgml | 31 +
 src/backend/access/transam/clog.c  | 23 -
 src/backend/access/transam/xlog.c  | 62 ++
 src/backend/utils/adt/txid.c   | 58 
 src/include/access/clog.h  | 23 +
 src/include/access/xlog.h  |  2 ++
 src/include/catalog/pg_proc.h  |  2 ++
 src/include/utils/builtins.h   |  1 +
 src/test/regress/expected/txid.out | 68 ++
 src/test/regress/sql/txid.sql  | 38 +
 10 files changed, 285 insertions(+), 23 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5148095..d8b086f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17143,6 +17143,10 @@ SELECT collation for ('foo' COLLATE "de_DE");
 txid_visible_in_snapshot

 
+   
+txid_status
+   
+

 The functions shown in 
 provide server transaction information in an exportable form.  The main
@@ -17193,6 +17197,11 @@ SELECT collation for ('foo' COLLATE "de_DE");
boolean
is transaction ID visible in snapshot? (do not use with subtransaction ids)
   
+  
+   txid_status(bigint)
+   txid_status
+   report the status of the given xact - committed, aborted, in progress, or NULL if the xid is too old
+  
  
 

@@ -17263,6 +17272,28 @@ SELECT collation for ('foo' COLLATE "de_DE");

 

+txid_status(bigint) reports the commit status of a recent
+transaction.  Applications may use it to determine whether a transaction
+committed or aborted when the application and/or database server crashed or
+lost connection while a COMMIT command was in progress.
+The status of a transaction will be reported as one of:
+
+ 'in progress'
+ 'committed'
+ 'aborted'
+ NULL if xid too old
+
+PostgreSQL discards the commit status transactions after no references to
+the transaction survive in other active transactions, tables, replication
+slots, etc. This means that the status of older transactions cannot be
+determined.  txid_status(bigint) returns NULL if a
+transaction is too old to look up.  Prepared transactions are reported as
+in progress; applications must check pg_prepared_xacts if they
+need to determine if the xid is a prepared transaction.
+   
+
+   
 The functions shown in 
 provide information about transactions that have been already committed.
 These functions mainly provide information about when the transactions
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 2634476..1a6e26d 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -41,29 +41,6 @@
 #include "miscadmin.h"
 #include "pg_trace.h"
 
-/*
- * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
- * everywhere else in Postgres.
- *
- * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
- * CLOG page numbering also wraps around at 0xFFFFFFFF/CLOG_XACTS_PER_PAGE,
- * and CLOG segment numbering at
- * 0xFFFFFFFF/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
- * explicit notice of th

Re: [HACKERS] WAL consistency check facility

2016-08-31 Thread Amit Kapila
On Thu, Sep 1, 2016 at 9:43 AM, Peter Geoghegan  wrote:
> On Wed, Aug 31, 2016 at 8:26 PM, Amit Kapila  wrote:
>>> As far as I am understanding things, we are aiming at something that
>>> could be used on production systems.
>>>
>>
>> I don't think you can enable it by default in production systems.
>> Enabling it will lead to significant performance drop as it writes the
>> whole page after each record for most type of RMGR ID's.
>>
>>> And, honestly, any people
>>> enabling it would just do it for all RMGRs because that's a
>>> no-brainer.
>>
>> Agreed, but remember enabling it for all is not free.
>
> I have sympathy for the idea that this should be as low overhead as
> possible, even if that means adding complexity to the interface --
> within reason. I would like to hear a practical example of where this
> RMGR id interface could be put to good use, when starting with little
> initial information about a problem.
>

One example that comes to mind is a problem that reproduces only under
high concurrency or some stress test. Now assume the problem is with an
index; enabling the check for all rmgrs could reduce the probability of
hitting the problem because of its performance impact.  The second
advantage, which I have already listed, is that it helps future
development, like the work I am doing now for hash indexes (making them
WAL-logged).

> And, ideally, we'd also have some
> indication of how big a difference that would make, it terms of
> measurable performance impact.
>

Yes, that's a valid point. I think we can do some tests to see the difference.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Declarative partitioning - another take

2016-08-31 Thread Amit Langote
On 2016/08/25 16:15, Ashutosh Bapat wrote:
> On Thu, Aug 25, 2016 at 12:22 PM, Amit Langote wrote:
>> b)
>> when accumulating append subpaths, do not flatten a subpath that is itself
>> an append when ((AppendPath *) subpath)->path.parent is a RelOptInfo with
>> non-NULL partitioning info.Is the latter somehow necessary for
>> pairwise-join considerations?
> 
> I don't think you need to do anything in the path creation code for this.
> As is it flattens all AppendPath hierarchies whether for partitioning or
> inheritance or subqueries. We should leave it as it is.

I thought it would be convenient for pairwise join code to work with the
hierarchy intact even within the AppendPath tree.  If it turns out to be
so, maybe that patch can take care of it.

>> I think I can manage to squeeze in (a) in the next version patch and will
>> also start working on (b), mainly the part about RelOptInfo getting some
>> partitioning info.
> 
> I am fine with b, where you would include some partitioning information in
> RelOptInfo. But you don't need to do what you said in (b) above.
> 
> In a private conversation Robert Haas suggested a way slightly different
> than what my patch for partition-wise join does. He suggested that the
> partitioning schemes i.e strategy, number of partitions and bounds of the
> partitioned elations involved in the query should be stored in PlannerInfo
> in the form of a list. Each partitioning scheme is annotated with the
> relids of the partitioned relations. RelOptInfo of the partitioned relation
> will point to the partitioning scheme in PlannerInfo. Along-with that each
> RelOptInfo will need to store partition keys for corresponding relation.
> This simplifies matching the partitioning schemes of the joining relations.
> Also it reduces the number of copies of partition bounds floating around as
> we expect that a query will involve multiple partitioned tables following
> similar partitioning schemes. May be you want to consider this idea while
> working on (b).

So IIUC, a partitioned relation's (baserel or joinrel) RelOptInfo has only
the information about partition keys.  These will be matched against query
restriction quals to prune away any unneeded partitions, which happens
individually for each such parent baserel (within set_append_rel_size(), I
suppose).  Further, two joining relations are eligible to be considered
for pairwise joining if they have identical partition keys and the query's
equi-join quals match them.  The resulting joinrel will have the same
partition key (as either joining relation) and will have as many
partitions as there are in the intersection of the sets of partitions of
the joining rels (the intersection proceeds by matching partition bounds).

"Partition scheme" structs go into a PlannerInfo list member, one
corresponding to each partitioned relation - baserel or joinrel, right?
As you say, each such struct has the following pieces of information:
strategy, num_partitions, bounds (and other auxiliary info).  Before
make_one_rel() starts, the list has one for each partitioned baserel.
After make_one_rel() has formed baserel pathlists and before
make_rel_from_joinlist() is called, are the partition scheme structs of
processed baserels marked with some information about the pruning activity
that occurred so far?  Then as we build successively higher levels of
joinrels, new entries will be made for those joinrels for which we added
pairwise join paths, with relids matching the corresponding joinrels.
Does that make sense?
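
In other words, something roughly like this hanging off PlannerInfo, with
each partitioned RelOptInfo pointing at its entry and additionally storing
its own partition keys (field names made up, just to check my understanding):

    typedef struct PartitionScheme
    {
        Relids      relids;     /* partitioned rels (base or join) using this scheme */
        char        strategy;   /* partitioning strategy */
        int         nparts;     /* number of partitions */
        List       *bounds;     /* partition bounds */
    } PartitionScheme;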

Thanks,
Amit






Re: [HACKERS] [PATCH] Transaction traceability - txid_status(bigint)

2016-08-31 Thread Craig Ringer
On 29 August 2016 at 15:53, Craig Ringer  wrote:

> Said better approach attached in revised series. Thanks.

Here's another minor update to the txid_status() and
txid_convert_if_recent() patches. The only change is moving
get_xid_in_recent_past from src/backend/utils/adt/txid.c to
src/backend/access/transam/xlog.c to permit its use by other code.
Specifically, I think it'll be needed for logical decoding on standby.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 44f1ddfcf43bb83ce1478c0290fbbaa916fdc22e Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Fri, 19 Aug 2016 14:44:15 +0800
Subject: [PATCH 1/3] Introduce txid_status(bigint) to get status of an xact

If an application is disconnected while a COMMIT request is in flight,
the backend crashes mid-commit, etc., then an application may not be
sure whether a commit completed successfully or was rolled back. While
two-phase commit solves this, it does so at considerable overhead, so
introduce a lighter alternative.

txid_status(bigint) lets an application determine the status of a
commit based on an xid-with-epoch as returned by txid_current() or
similar. Status may be committed, aborted, in-progress (including
prepared xacts), or null if the xact is too old for its commit status
to still be retained because it has passed the wrap-around epoch
boundary.

Applications must call txid_current() in their transactions to make
much use of this since PostgreSQL does not automatically report an xid
to the client when one is assigned.
---
 doc/src/sgml/func.sgml | 31 +
 src/backend/access/transam/clog.c  | 23 -
 src/backend/access/transam/xlog.c  | 61 ++
 src/backend/utils/adt/txid.c   | 58 
 src/include/access/clog.h  | 23 +
 src/include/access/xlog.h  |  2 ++
 src/include/catalog/pg_proc.h  |  2 ++
 src/include/utils/builtins.h   |  1 +
 src/test/regress/expected/txid.out | 68 ++
 src/test/regress/sql/txid.sql  | 38 +
 10 files changed, 284 insertions(+), 23 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5148095..d8b086f 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -17143,6 +17143,10 @@ SELECT collation for ('foo' COLLATE "de_DE");
 txid_visible_in_snapshot

 
+   
+txid_status
+   
+

 The functions shown in 
 provide server transaction information in an exportable form.  The main
@@ -17193,6 +17197,11 @@ SELECT collation for ('foo' COLLATE "de_DE");
boolean
is transaction ID visible in snapshot? (do not use with subtransaction ids)
   
+  
+   txid_status(bigint)
+   txid_status
+   report the status of the given xact - committed, aborted, in progress, or NULL if the xid is too old
+  
  
 

@@ -17263,6 +17272,28 @@ SELECT collation for ('foo' COLLATE "de_DE");

 

+txid_status(bigint) reports the commit status of a recent
+transaction.  Applications may use it to determine whether a transaction
+committed or aborted when the application and/or database server crashed or
+lost connection while a COMMIT command was in progress.
+The status of a transaction will be reported as one of:
+
+ 'in progress'
+ 'committed'
+ 'aborted'
+ NULL if xid too old
+
+PostgreSQL discards the commit status transactions after no references to
+the transaction survive in other active transactions, tables, replication
+slots, etc. This means that the status of older transactions cannot be
+determined.  txid_status(bigint) returns NULL if a
+transaction is too old to look up.  Prepared transactions are reported as
+in progress; applications must check pg_prepared_xacts if they
+need to determine if the xid is a prepared transaction.
+   
+
+   
 The functions shown in 
 provide information about transactions that have been already committed.
 These functions mainly provide information about when the transactions
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 2634476..1a6e26d 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -41,29 +41,6 @@
 #include "miscadmin.h"
 #include "pg_trace.h"
 
-/*
- * Defines for CLOG page sizes.  A page is the same BLCKSZ as is used
- * everywhere else in Postgres.
- *
- * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
- * CLOG page numbering also wraps around at 0xFFFFFFFF/CLOG_XACTS_PER_PAGE,
- * and CLOG segment numbering at
- * 0xFFFFFFFF/CLOG_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
- * explicit notice of that fact in this module, except when comparing segment
- * and page numbers in TruncateCLOG (see CLOGPagePrecedes)

[HACKERS] Re: [COMMITTERS] pgsql: Add putenv support for msvcrt from Visual Studio 2013

2016-08-31 Thread Michael Paquier
On Wed, Apr 27, 2016 at 2:39 AM, Christian Ullrich  wrote:
> * Christian Ullrich wrote:
>
>> wrong even without considering the debug/release split. If we load a
>> compiled extension built with a CRT we have not seen yet, _after_ the
>> first call to pgwin32_putenv(), that module's CRT's view of its
>> environment will be frozen because we will never attempt to update it.
>
>
> Four patches attached:
>
> master --- 0001 --- 0002 --- 0003
>  \
>   `- 0004
>
> 0001 is just some various fixes to set the stage.
>
> 0002 fixes this "load race" by not remembering a negative result anymore.
> However, ...

From 0001, which does not apply anymore on HEAD because of the
integration with MS2015:
if (rtmodules[i].putenvFunc == NULL)
{
-   CloseHandle(rtmodules[i].hmodule);
rtmodules[i].hmodule = INVALID_HANDLE_VALUE;
continue;
}
Nice catch. This portion is a bug and should be backpatched. As far as
I can read from MS docs, GetModuleHandle() retrieves an existing
handle so there is no need to free it. Or that would fail.

And actually, looking at those patches, isn't it dangerous practice to
be able to load multiple versions of the same DLL routines into the
same address space? I personally have very bad memories of that; in
particular, loading multiple versions of msvcr into the same address
space can cause unwanted crashes and process corruption. In short, I
mean this: https://en.wikipedia.org/wiki/DLL_Hell.

So, shouldn't we first make the DLL list more restrictive depending on
the value of _MSC_VER? I mean something like this:
#if _MSC_VER >= 1900
{"ucrtbase",NULL,   NULL},
#elif _MSC_VER >= 1800
{"msvcr120",NULL,   NULL},
#elif etc, etc.
[...]
#endif

This would require modules to be built with the same MSVC version as
the core server, but that's better than a plain crash if loaded
DLLs corrupt the stack. Am I missing something?
-- 
Michael




[HACKERS] pg_recvlogical --endpos

2016-08-31 Thread Craig Ringer
Hi all

Here's a rebased version of my pg_recvlogical --endpos patch from the
9.5 series, updated to incorporate Álvaro's changes.

This will be mainly useful for recovery tests as we start adding more
logical decoding testing.

See original post for more detail:

https://www.postgresql.org/message-id/CAMsr+YHBm3mUtXb2_RD=qsnupdt0dr8k-+gtbbgprdyuzfm...@mail.gmail.com
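
A typical test invocation would then be something like (illustrative only):

    pg_recvlogical -d postgres --slot=test_slot --start -f - --endpos '0/3000000'

i.e. stream until the given LSN is reached, then exit with status 0, which
is easy to drive from a TAP test without having to parse the output.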

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 464e487e6235e1f86d06aa82a4a57716ff188579 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 1 Sep 2016 12:37:40 +0800
Subject: [PATCH] Add an optional --endpos LSN argument to pg_reclogical
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

pg_recvlogical usually just runs until cancelled or until the upstream
server disconnects. For some purposes, especially testing, it's useful
to be able to stop receiving at a specified LSN without having
to parse the output and deal with buffering issues, etc.

Add a --endpos parameter that takes the LSN at which no further
messages should be written and receiving should stop.

Craig Ringer, Álvaro Herrera
---
 doc/src/sgml/ref/pg_recvlogical.sgml   |  35 +
 src/bin/pg_basebackup/pg_recvlogical.c | 137 +
 2 files changed, 157 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/ref/pg_recvlogical.sgml b/doc/src/sgml/ref/pg_recvlogical.sgml
index b35881f..00829a7 100644
--- a/doc/src/sgml/ref/pg_recvlogical.sgml
+++ b/doc/src/sgml/ref/pg_recvlogical.sgml
@@ -155,6 +155,41 @@ PostgreSQL documentation
  
 
  
+  -E lsn
+  --endpos=lsn
+  
+   
+In --start mode, automatically stop replication
+and exit with normal exit status 0 when receiving reaches the
+specified LSN.  If specified when not in --start
+mode, an error is raised.
+   
+
+   
+Note the following points:
+
+  
+  
+   If there's a record with LSN exactly equal to lsn,
+   the record will not be output.  If you want to receive up to and
+   including a given LSN, specify LSN + 1 as the desired stop point.
+  
+ 
+ 
+  
+   The --endpos option is not aware of transaction
+   boundaries and may truncate output partway through a transaction.
+   Any partially output transaction will not be consumed and will be
+   replayed again when the slot is next read from. Individual messages
+   are never truncated.
+  
+ 
+
+   
+  
+ 
+
+ 
   --if-not-exists
   

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 4c6cf70..5108222 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -37,6 +37,7 @@ static int	noloop = 0;
 static int	standby_message_timeout = 10 * 1000;		/* 10 sec = default */
 static int	fsync_interval = 10 * 1000; /* 10 sec = default */
 static XLogRecPtr startpos = InvalidXLogRecPtr;
+static XLogRecPtr endpos = InvalidXLogRecPtr;
 static bool do_create_slot = false;
 static bool slot_exists_ok = false;
 static bool do_start_slot = false;
@@ -60,6 +61,9 @@ static XLogRecPtr output_fsync_lsn = InvalidXLogRecPtr;
 static void usage(void);
 static void StreamLogicalLog(void);
 static void disconnect_and_exit(int code);
+static bool flushAndSendFeedback(PGconn *conn, TimestampTz *now);
+static void prepareToTerminate(PGconn *conn, XLogRecPtr endpos,
+   bool keepalive, XLogRecPtr lsn);
 
 static void
 usage(void)
@@ -78,6 +82,7 @@ usage(void)
 			 " time between fsyncs to the output file (default: %d)\n"), (fsync_interval / 1000));
 	printf(_("  --if-not-existsdo not error if slot already exists when creating a slot\n"));
 	printf(_("  -I, --startpos=LSN where in an existing slot should the streaming start\n"));
+	printf(_("  -E, --endpos=LSN   exit upon receiving the specified LSN\n"));
 	printf(_("  -n, --no-loop  do not loop on connection lost\n"));
 	printf(_("  -o, --option=NAME[=VALUE]\n"
 			 " pass option NAME with optional value VALUE to the\n"
@@ -278,6 +283,7 @@ StreamLogicalLog(void)
 		int			bytes_written;
 		int64		now;
 		int			hdr_len;
+		XLogRecPtr	cur_record_lsn = InvalidXLogRecPtr;
 
 		if (copybuf != NULL)
 		{
@@ -451,6 +457,7 @@ StreamLogicalLog(void)
 			int			pos;
 			bool		replyRequested;
 			XLogRecPtr	walEnd;
+			bool		endposReached = false;
 
 			/*
 			 * Parse the keepalive message, enclosed in the CopyData message.
@@ -473,18 +480,32 @@ StreamLogicalLog(void)
 			}
 			replyRequested = copybuf[pos];
 
-			/* If the server requested an immediate reply, send one. */
-			if (replyRequested)
+			if (endpos != InvalidXLogRecPtr && walEnd >= endpos)
 			{
-/* fsync data, so we send a rec

Re: [HACKERS] WAL consistency check facility

2016-08-31 Thread Peter Geoghegan
On Wed, Aug 31, 2016 at 8:26 PM, Amit Kapila  wrote:
>> As far as I am understanding things, we are aiming at something that
>> could be used on production systems.
>>
>
> I don't think you can enable it by default in production systems.
> Enabling it will lead to significant performance drop as it writes the
> whole page after each record for most type of RMGR ID's.
>
>> And, honestly, any people
>> enabling it would just do it for all RMGRs because that's a
>> no-brainer.
>
> Agreed, but remember enabling it for all is not free.

I have sympathy for the idea that this should be as low overhead as
possible, even if that means adding complexity to the interface --
within reason. I would like to hear a practical example of where this
RMGR id interface could be put to good use, when starting with little
initial information about a problem. And, ideally, we'd also have some
indication of how big a difference that would make, in terms of
measurable performance impact.

-- 
Peter Geoghegan




[HACKERS] [PATCH] Logical decoding timeline following take II

2016-08-31 Thread Craig Ringer
Hi all

Attached is a rebased and updated logical decoding timeline following
patch for 10.0.

This is a pre-requisite for the pending work on logical decoding on
standby servers and simplified failover of logical decoding.

Restating the commit message:
__

Follow timeline switches in logical decoding

When decoding from a logical slot, it's necessary for xlog reading to
be able to read xlog from historical (i.e. not current) timelines.
Otherwise decoding fails after failover to a physical replica because
the oldest still-needed archives are in the historical timeline.

Supporting logical decoding timeline following is a pre-requisite for
logical decoding on physical standby servers. It also makes it
possible to promote a replica with logical slots to a master and
replay from those slots, allowing logical decoding applications to
follow physical failover.

Logical slots cannot actually be created on a replica without use of
the low-level C slot management APIs so this is mostly foundation work
for subsequent changes to enable logical decoding on standbys.

This commit includes a module in src/test/modules with functions to
manipulate the slots (which is not otherwise possible in SQL code) in
order to enable testing, and a new test in src/test/recovery to ensure
that the behavior is as expected.

Note that an earlier version of logical decoding timeline following
was committed to 9.5 as 24c5f1a103ce, 3a3b309041b0, 82c83b337202, and
f07d18b6e94d. It was then reverted by c1543a81a7a8 just after 9.5
feature freeze when issues were discovered too late to safely fix them
in the 9.5 release cycle.

The prior approach failed to consider that a record could be split
across pages that are on different segments, where the new segment
contains the start of a new timeline. In that case the old segment
might be missing or renamed with a .partial suffix.

This patch reworks the logic to be page-based and in the process
simplifies how the last timeline for a segment is looked up.

Slot timeline following only works in a backend. Frontend support can
be added separately, where it could be useful for pg_xlogdump etc. once
support for timeline.c, List, etc is added for frontend code.
__


I'm hoping to find time to refactor timeline following so that we
avoid passing timeline information around the xlogreader using
globals, but that'd be a separate change that can be made after this.

I've omitted the --endpos changes for pg_recvlogical, which again can
be added separately.

The test harness code will become unnecessary when proper support for
logical failover or logical decoding on standby is added, so I'm not
really sure it should be committed.

Prior threads:

* 
https://www.postgresql.org/message-id/camsr+yg_1fu_-l8qwsk6okft4jt8dpory2rhxdymy0b5zfk...@mail.gmail.com

* 
https://www.postgresql.org/message-id/CAMsr+YH-C1-X_+s=2nzapnr0wwqja-rumvhsyyzansn93mu...@mail.gmail.com

* http://www.postgresql.org/message-id/20160503165812.GA29604@alvherre.pgsql


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
From 10d5f4652689fbcaf5b14fe6bd991c98dbf60e00 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Thu, 1 Sep 2016 10:16:55 +0800
Subject: [PATCH] Follow timeline switches in logical decoding

When decoding from a logical slot, it's necessary for xlog reading to
be able to read xlog from historical (i.e. not current) timelines.
Otherwise decoding fails after failover to a physical replica because
the oldest still-needed archives are in the historical timeline.

Supporting logical decoding timeline following is a pre-requisite for
logical decoding on physical standby servers. It also makes it
possible to promote a replica with logical slots to a master and
replay from those slots, allowing logical decoding applications to
follow physical failover.

Logical slots cannot actually be created on a replica without use of
the low-level C slot management APIs so this is mostly foundation work
for subsequent changes to enable logical decoding on standbys.

This commit includes a module in src/test/modules with functions to
manipulate the slots (which is not otherwise possible in SQL code) in
order to enable testing, and a new test in src/test/recovery to ensure
that the behavior is as expected.

Note that an earlier version of logical decoding timeline following
was committed to 9.5 as 24c5f1a103ce, 3a3b309041b0, 82c83b337202, and
f07d18b6e94d. It was then reverted by c1543a81a7a8 just after 9.5
feature freeze when issues were discovered too late to safely fix them
in the 9.5 release cycle.

The prior approach failed to consider that a record could be split
across pages that are on different segments, where the new segment
contains the start of a new timeline. In that case the old segment
might be missing or renamed with a .partial suffix.

This patch reworks the logic to be page-based and in the process
simplifies how the last timeline for a segment is 

Re: [HACKERS] WAL consistency check facility

2016-08-31 Thread Amit Kapila
On Thu, Sep 1, 2016 at 8:02 AM, Amit Kapila  wrote:
> On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs  wrote:
>> On 27 August 2016 at 12:09, Kuntal Ghosh  wrote:
>>
> * wal_consistency_mask = 511  /* Enable consistency check mask bit*/

 What does this mean? (No docs)
>>>
>>> I was using this parameter as a masking integer to indicate the
>>> operations(rmgr list) for which we need this feature to be enabled.
>>> Since, this could be confusing, I've changed it accordingly so that it
>>> accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)
>>
>> Why would we want that?
>>
>
> It would be easier to test and develop the various modules separately.
> As an example, if we develop a new AM which needs WAL facility or
> adding WAL capability to an existing system (say Hash Index), we can
> just test that module, rather than whole system.  I think it can help
> us in narrowing down the problem, if we have facility to enable it at
> RMGR ID level.  Having said that, I think this must have the facility
> to enable it for all the RMGR ID's (say ALL) and probably that should
> be default.
>

Oops, I think having an option to specify 'ALL' is good, but that
shouldn't be the default, because it could have serious performance
implications.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WAL consistency check facility

2016-08-31 Thread Amit Kapila
On Thu, Sep 1, 2016 at 8:30 AM, Michael Paquier
 wrote:
> On Thu, Sep 1, 2016 at 11:32 AM, Amit Kapila  wrote:
>> On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs  wrote:
>>> On 27 August 2016 at 12:09, Kuntal Ghosh  wrote:
>>>
>> * wal_consistency_mask = 511  /* Enable consistency check mask bit*/
>
> What does this mean? (No docs)

 I was using this parameter as a masking integer to indicate the
 operations(rmgr list) for which we need this feature to be enabled.
 Since, this could be confusing, I've changed it accordingly so that it
 accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)
>>>
>>> Why would we want that?
>>
>> It would be easier to test and develop the various modules separately.
>> As an example, if we develop a new AM which needs WAL facility or
>> adding WAL capability to an existing system (say Hash Index), we can
>> just test that module, rather than whole system.  I think it can help
>> us in narrowing down the problem, if we have facility to enable it at
>> RMGR ID level.  Having said that, I think this must have the facility
>> to enable it for all the RMGR ID's (say ALL) and probably that should
>> be default.
>
> As far as I am understanding things, we are aiming at something that
> could be used on production systems.
>

I don't think you can enable it by default in production systems.
Enabling it will lead to a significant performance drop, as it writes
the whole page after each record for most types of RMGR IDs.

> And, honestly, any people
> enabling it would just do it for all RMGRs because that's a
> no-brainer.

Agreed, but remember enabling it for all is not free.

> If we are designing something for testing purposes
> instead, something is wrong with this patch then.
>

What is wrong?

> Doing filtering at RMGR level for testing and development purposes
> will be done by somebody who has the skills to filter out which
> records he should look at.
>

Right, but then again, many of our GUC parameters need a good level of
understanding to set correctly.  For example, do you think it is easy
for a user to set a value for "replacement_sort_tuples" without reading
the description or understanding what it means?  This might not be the
best example, but there are other parameters which do require a deeper
understanding of the system.  The main thing is that default values for
such parameters should be chosen carefully so that they represent the
most common usage.

> Or he'll bump into an existing bump. So I'd
> rather keep this thing simple.
>

It seems to me that having an option of 'ALL' would make it easier for
users to set it.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] WAL consistency check facility

2016-08-31 Thread Michael Paquier
On Thu, Sep 1, 2016 at 11:32 AM, Amit Kapila  wrote:
> On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs  wrote:
>> On 27 August 2016 at 12:09, Kuntal Ghosh  wrote:
>>
> * wal_consistency_mask = 511  /* Enable consistency check mask bit*/

 What does this mean? (No docs)
>>>
>>> I was using this parameter as a masking integer to indicate the
>>> operations(rmgr list) for which we need this feature to be enabled.
>>> Since, this could be confusing, I've changed it accordingly so that it
>>> accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)
>>
>> Why would we want that?
>
> It would be easier to test and develop the various modules separately.
> As an example, if we develop a new AM which needs WAL facility or
> adding WAL capability to an existing system (say Hash Index), we can
> just test that module, rather than whole system.  I think it can help
> us in narrowing down the problem, if we have facility to enable it at
> RMGR ID level.  Having said that, I think this must have the facility
> to enable it for all the RMGR ID's (say ALL) and probably that should
> be default.

As far as I am understanding things, we are aiming at something that
could be used on production systems. And, honestly, any people
enabling it would just do it for all RMGRs because that's a
no-brainer. If we are designing something for testing purposes
instead, something is wrong with this patch then.

Doing filtering at RMGR level for testing and development purposes
will be done by somebody who has the skills to filter out which
records he should look at. Or he'll bump into an existing bug. So I'd
rather keep this thing simple.
-- 
Michael




Re: [HACKERS] Missing checks when malloc returns NULL...

2016-08-31 Thread Michael Paquier
On Thu, Sep 1, 2016 at 12:14 AM, Tom Lane  wrote:
> I still think it'd be better to fix that as attached, because it
> represents a net reduction not net addition of code, and it provides
> a defense against future repetitions of the same omission.  If only
> 4 out of 11 existing calls were properly checked --- some of them
> adjacent to calls with checks --- that should tell us that we *will*
> have more instances of the same bug if we don't fix it centrally.
>
> I also note that your patch missed checks for two ShmemAlloc calls in
> InitShmemAllocation and ShmemInitStruct.  Admittedly, since those are
> the very first such calls, it's highly unlikely they'd fail; but I think
> this exercise is not about dismissing failures as improbable.  Almost
> all of these failures are improbable, given that we precalculate the
> shmem space requirement.

OK, that looks fine to me after review.

Also, we could take one extra step forward and introduce
ShmemAllocExtended, which takes two flags, as per the attached:
- SHMEM_ALLOC_ZERO, which zeroes the allocated memory
- SHMEM_ALLOC_NO_OOM, which returns NULL instead of failing
Or we could just put a call to MemSet directly in ShmemAlloc(), but
I'd rather keep the base routines extensible.
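
To be explicit, the interface in the attached is roughly this (the flag
bit values are illustrative):

    #define SHMEM_ALLOC_ZERO    0x01    /* zero the returned memory */
    #define SHMEM_ALLOC_NO_OOM  0x02    /* return NULL on failure instead of ERROR */

    extern void *ShmemAllocExtended(Size size, int flags);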

What do you think about the attached? One other possibility would be
to take your patch but use MemSet unconditionally, as this should not
cause any noticeable overhead.
-- 
Michael


malloc-nulls-v8.patch



Re: [HACKERS] WAL consistency check facility

2016-08-31 Thread Amit Kapila
On Wed, Aug 31, 2016 at 7:02 PM, Simon Riggs  wrote:
> On 27 August 2016 at 12:09, Kuntal Ghosh  wrote:
>
 * wal_consistency_mask = 511  /* Enable consistency check mask bit*/
>>>
>>> What does this mean? (No docs)
>>
>> I was using this parameter as a masking integer to indicate the
>> operations(rmgr list) for which we need this feature to be enabled.
>> Since, this could be confusing, I've changed it accordingly so that it
>> accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)
>
> Why would we want that?
>

It would be easier to test and develop the various modules separately.
As an example, if we develop a new AM which needs WAL facility or
adding WAL capability to an existing system (say Hash Index), we can
just test that module, rather than whole system.  I think it can help
us in narrowing down the problem, if we have facility to enable it at
RMGR ID level.  Having said that, I think this must have the facility
to enable it for all the RMGR ID's (say ALL) and probably that should
be default.
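
For illustration, whatever the parameter ends up being called (both the
name and the value syntax are still under discussion in this thread):

    wal_consistency = 'Hash'    # check only the module being developed
    wal_consistency = 'ALL'     # check every rmgr; much more expensive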

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Replace use malloc() & friend by memory contexts for plperl and pltcl

2016-08-31 Thread Michael Paquier
On Thu, Sep 1, 2016 at 8:57 AM, Tom Lane  wrote:
> I wrote:
>> Michael Paquier  writes:
>>> Attached are a set of patches to replace those memory system calls by
>>> proper memory contexts:
>>> - 0001 does the cleanup work for pltcl
>>> - 0002 does this cleanup for plperl
>
>> Off to look at 0002 next.
>
> Pushed 0002 as well.  The main thing missing there was to use a PG_TRY
> block to replace the bulky-and-yet-incomplete assorted invocations of
> free_plperl_function.

Thanks. That's neat!
-- 
Michael




[HACKERS] restoration after crash slowness, any way to improve?

2016-08-31 Thread Joshua D. Drake

-hackers,

So this is more of a spitballing thread than anything. As I understand
it, if we have a long-running transaction or a large number of WAL
segments and we crash, we then have to replay those logs on restart up
to the last known good transaction. No problem.


I recently ran a very long transaction. I was building up a large number
of rows in a two-column table to test index performance. I ended up
having to kill the connection, and thus the transaction, after I realized
I had an extra zero in my generate_series(). (Side note: amazing the
difference a single zero can make ;)). When the server came back up, I
watched the machine and it was averaging anywhere from 60MB/s to 97MB/s
on writes. That in itself isn't bad for a single thread and a single SSD,
doing what it is doing.


However, since I know this machine can do well over 400MB/s when using
multiple connections, I can't help but wonder if there is anything we can
do to make restoration more efficient without sacrificing the purpose of
what it is doing.


Can we have multiple readers pull transaction logs into shared_buffers
(during recovery only), sort the good transactions, and then hand them off
to the walwriter or bgwriter to apply to the pages?


Anyway, like I said, I'm just spitballing, and I thought I would start the thread.

Sincerely,

JD

--
Command Prompt, Inc.  http://the.postgres.company/
+1-503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Everyone appreciates your honesty, until you are honest with them.




Re: [HACKERS] Replace use malloc() & friend by memory contexts for plperl and pltcl

2016-08-31 Thread Tom Lane
I wrote:
> Michael Paquier  writes:
>> Attached are a set of patches to replace those memory system calls by
>> proper memory contexts:
>> - 0001 does the cleanup work for pltcl
>> - 0002 does this cleanup for plperl

> Off to look at 0002 next.

Pushed 0002 as well.  The main thing missing there was to use a PG_TRY
block to replace the bulky-and-yet-incomplete assorted invocations of
free_plperl_function.

regards, tom lane




Re: [HACKERS] Logical Replication WIP

2016-08-31 Thread Erik Rijkers

On 2016-08-31 22:51, Petr Jelinek wrote:

Hi,

and one more version with bug fixes, improved code docs and couple
more tests, some general cleanup and also rebased on current master
for the start of CF.


Clear, well-written docs, thanks.

Here are some small changes to  logical-replication.sgml

Erik Rijkers
--- doc/src/sgml/logical-replication.sgml.orig	2016-09-01 00:04:11.484729675 +0200
+++ doc/src/sgml/logical-replication.sgml	2016-09-01 00:54:07.817799915 +0200
@@ -22,11 +22,11 @@
 cascading replication or more complex configurations.
   
   
-Logical replication typically starts with snapshot of the data on
-the publisher database. Once that is done, the changes on publisher
-are sent to subscriber as they occur in real-time. The subscriber
-applies the data in same order as the publisher so that the
-transactional consistency is guaranteed for the publications within
+Logical replication typically starts with a snapshot of the data on
+the publisher database. Once that is done, the changes on the publisher
+are sent to the subscriber as they occur in real-time. The subscriber
+applies the data in the same order as the publisher so that
+transactional consistency is guaranteed for publications within a
 single subscription. This method of data replication is sometimes
 referred to as transactional replication.
   
@@ -54,23 +54,23 @@
 
 
   
-Replicating between different major versions of the PostgreSQL
+Replicating between different major versions of PostgreSQL
   
 
 
   
-Giving access to the replicated data to different groups of
+Giving access to replicated data to different groups of
 users.
   
 
 
   
-Sharing subset of the database between multiple databases.
+Sharing a subset of the database between multiple databases.
   
 
   
   
-The subscriber database behaves in a same way as any other
+The subscriber database behaves in the same way as any other
 PostgreSQL instance and can be used as a publisher for other
 databases by defining its own publications. When the subscriber is
 treated as read-only by application, there will be no conflicts from
@@ -83,9 +83,9 @@
   Publication
   
 A publication object can be defined on any physical
-replication master, the node where publication is deined is referred to
-as publisher. Only superusers or members of
-REPLICATION role can define publication. A publication is
+replication master. The node where a publication is defined is referred
+to as publisher. Only superusers or members of the
+REPLICATION role can define a publication. A publication is
 a set of changes generated from a group of tables, and might also be
 described as a change set or replication set.
 Each publication exists in only one database.
@@ -93,7 +93,7 @@
   
 Publications are different from table schema and do not affect
 how the table is accessed. Each table can be added to multiple
-publications if needed.  Publications may currenly only contain
+publications if needed.  Publications may currently only contain
 tables. Objects must be added explicitly, except when a publication
 is created for ALL TABLES. There is no default name for
 a publication which specifies all tables.
@@ -103,8 +103,8 @@
 any combination of INSERT, UPDATE,
 DELETE and TRUNCATE in a similar
 way to the way triggers are fired by particular event types. Only
-tables with REPLICA IDENTITY index can be added to
-publication which replicate UPDATE and DELETE
+tables with a REPLICA IDENTITY index can be added to a
+publication which replicates UPDATE and DELETE
 operation.
   
   
@@ -129,20 +129,20 @@
 
   Subscription
   
-A subscription is the downstream side of the logical
+A subscription is the downstream side of logical
 replication. The node where subscription is defined is referred to as
-a subscriber. Subscription defines the connection to
+the subscriber. Subscription defines the connection to
 another database and set of publications (one or more) to which it
 wants to be subscribed.
   
   
-The subscriber database behaves in a same way as any other
+The subscriber database behaves in the same way as any other
 PostgreSQL instance and can be used as a publisher for other
 databases by defining its own publications.
   
   
 A subscriber may have multiple subscriptions if desired. It is
-possible to define multiple subscriptions between single
+possible to define multiple subscriptions between a single
 publisher-subscriber pair, in which case extra care must be taken
 to ensure that the subscribed publication objects don't overlap.
   
@@ -153,17 +153,17 @@
 of pre-existing table data.
   
   
-Subscriptions are not dumped by pg_dump by default,

Re: [HACKERS] Notice lock waits

2016-08-31 Thread Jeff Janes
On Tue, Aug 9, 2016 at 5:17 PM, Jim Nasby  wrote:

> On 8/5/16 12:00 PM, Jeff Janes wrote:
>
>> So I created a new guc, notice_lock_waits, which acts like
>> log_lock_waits but sends the message as NOTICE so it will show up on
>> interactive connections like psql.
>>
>
> I would strongly prefer that this accept a log level instead of being
> hard-coded to NOTICE. The reason is that I find the NOTICE chatter from
> many DDL commands to be completely worthless (looking at you %TYPE),


Perhaps we should do something about those notices?  In 9.3 we removed ones
about adding implicit unique indexes to implement primary keys, and I think
that that was a pretty good call.



> so I normally set client_min_messages to WARNING in DDL scripts. I can
> work on that patch; would it essentially be a matter of changing
> notice_lock_waits to int lock_wait_level?


How would it be turned off?  Is there an error level which would work for
that?  And what levels would non-superusers be allowed to set it to?

And, I'd be happy if you were to work on a patch to implement it.

Cheers,

Jeff
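
(For illustration only, here is a rough sketch of what an enum-valued GUC
along those lines could look like; the name lock_wait_level, the "off"
value, and the PGC_SUSET choice are all assumptions, not an agreed design.)

/* possible values; "off" is how it would be turned off entirely */
static const struct config_enum_entry lock_wait_level_options[] = {
	{"off", 0, false},
	{"notice", NOTICE, false},
	{"warning", WARNING, false},
	{"log", LOG, false},
	{NULL, 0, false}
};

static int	lock_wait_level = 0;	/* elevel to use, or 0 for "off" */

/* candidate entry for ConfigureNamesEnum[] in guc.c */
{
	{"lock_wait_level", PGC_SUSET, LOGGING_WHEN,
		gettext_noop("Message level used to report waiting longer than "
					 "deadlock_timeout for a lock."),
		NULL
	},
	&lock_wait_level,
	0,
	lock_wait_level_options,
	NULL, NULL, NULL
},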


Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-31 Thread Corey Huinker
On Wed, Aug 31, 2016 at 6:07 PM, Andres Freund  wrote:

>
> In my experience pg_attribute is usually the worst affected. Many temp
> tables won't even have stats entries...
>
>
Mine too. One database currently has a 400GB pg_attribute table, because we
chew through temp tables like popcorn.


Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-31 Thread Peter Geoghegan
On Wed, Aug 31, 2016 at 3:08 PM, Andres Freund  wrote:
> On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan  wrote:
>
>>In other painfully pedantic news, I should point out that
>>sizeof(size_t) isn't necessarily word size (the most generic
>>definition of word size for the architecture), contrary to my reading
>of the 0002-* patch comments. I'm mostly thinking about x86_64
>>here, of course.
>
> Uh?

Sorry, I really should have not said anything. It is true that x86_64
word size is sometimes reported as 16 and/or 32 bits [1], because of
legacy issues.

[1] 
https://en.wikipedia.org/wiki/Word_(computer_architecture)#Table_of_word_sizes
-- 
Peter Geoghegan




Re: [HACKERS] autonomous transactions

2016-08-31 Thread Vik Fearing
On 08/31/2016 03:09 PM, Joel Jacobson wrote:
> On Wed, Aug 31, 2016 at 6:41 AM, Jaime Casanova
>  wrote:
>>
>> On 30 August 2016 at 23:10, Joel Jacobson  wrote:
>>>
>>> There should be a way to within the session and/or txn permanently
>>> block autonomous transactions.
>>>
>>
>> This will defeat one of the use cases of autonomous transactions: auditing
> 
> My idea on how to deal with this would be to mark the function to be
> "AUTONOMOUS" similar to how a function is marked to be "PARALLEL
> SAFE",
> and to throw an error if a caller that has blocked autonomous
> transactions tries to call a function that is marked to be autonomous.
> 
> That way none of the code that needs to be audited would ever get executed.

Part of what people want this for is to audit what people *try* to do.
We can already audit what they've actually done.

With your solution, we still wouldn't know when an unauthorized attempt
to do something happened.
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-31 Thread Andres Freund


On August 31, 2016 3:06:23 PM PDT, Peter Geoghegan  wrote:

>In other painfully pedantic news, I should point out that
>sizeof(size_t) isn't necessarily word size (the most generic
>definition of word size for the architecture), contrary to my reading
>of the 0002-* patch comments. I'm mostly thinking about x86_64
>here, of course.

Uh?
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-31 Thread Andres Freund


On August 31, 2016 3:00:15 PM PDT, Tomas Vondra  
wrote:
>
>
>On 08/31/2016 11:43 PM, Andres Freund wrote:
>> On 2016-08-31 23:40:46 +0200, Tomas Vondra wrote:
>>> It's an improvement (and it's pretty much exactly what I proposed
>>> upthread). But it does not solve the problems with pg_statistic for
>>> example (each backend needs it's own statistics. So we'd either
>bloat
>>> the pg_statistic (if we manage to solve the problem that the table
>has
>>> the same oid in all backends), or we would need in-memory tuples
>(just
>>> like discussed in the thread so far).
>> 
>> Creating a session private version of pg_statistic would be fairly
>> simple.
>
>Sure. I'm just saying it's not as simple as overriding relpath.
>
>ISTM we only need the pg_statistics (as other catalogs are connected to
>the pg_class entry), which does not have the dependency issues. Or do
>we
>need other catalogs?

In my experience pg_attribute is usually the worst affected. Many temp tables 
won't even have stats entries...

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-31 Thread Peter Geoghegan
On Wed, Aug 31, 2016 at 2:37 PM, Andres Freund  wrote:
>> This looks good.
>
> Thanks for looking!

No problem.

In other painfully pedantic news, I should point out that
sizeof(size_t) isn't necessarily word size (the most generic
definition of word size for the architecture), contrary to my reading
of the 0002-* patch comments. I'm mostly thinking about x86_64
here, of course.


-- 
Peter Geoghegan




Re: [HACKERS] WAL consistency check facility

2016-08-31 Thread Michael Paquier
On Wed, Aug 31, 2016 at 10:32 PM, Simon Riggs  wrote:
> On 27 August 2016 at 12:09, Kuntal Ghosh  wrote:
>
 * wal_consistency_mask = 511  /* Enable consistency check mask bit*/
>>>
>>> What does this mean? (No docs)
>>
>> I was using this parameter as a masking integer to indicate the
>> operations(rmgr list) for which we need this feature to be enabled.
>> Since, this could be confusing, I've changed it accordingly so that it
>> accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)
>
> Why would we want that?

I am still in for just an on/off switch instead of this complication.
An all-or-nothing feature is what we are looking at here. Still a list
is an improvement compared to a bitmap.

 1. Add support for other Resource Managers.
>>>
>>> We probably need to have a discussion as to why you think this should
>>> be Rmgr dependent?
>>> Code comments would help there.
>>>
>>> If it does, then you should probably do this by extending RmgrTable
>>> with an rm_check, so you can call it like this...
>>>
>>> RmgrTable[record->xl_rmid].rm_check
>>
>> +1.
>> I'm modifying it accordingly. I'm calling this function after
>> RmgrTable[record->xl_rmid].rm_redo.
>>
 5. Generalize the page type identification technique.
>>>
>>> Why not do this first?
>>>
>>
>> At present, I'm using special page size and page ID to identify page
>> type. But, I've noticed some cases where the entire page is
>> initialized to zero (Ex: hash_xlog_squeeze_page). RmgrID and info bit
>> can help us to identify those pages.
>
> I'd prefer a solution that was not dependent upon RmgrID at all.

So you'd rather identify the page types by looking at pd_special? That
seems worse to me but..
-- 
Michael
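
(For context, a minimal sketch of how the rm_check hook suggested above might
be invoked from the redo loop; the enabling condition is an assumption,
however the on/off switch ends up being spelled:)

	/* replay the record as today */
	RmgrTable[record->xl_rmid].rm_redo(record);

	/* then, if enabled, verify the resulting pages against the WAL copy */
	if (wal_consistency_checking &&
		RmgrTable[record->xl_rmid].rm_check != NULL)
		RmgrTable[record->xl_rmid].rm_check(record);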




Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-31 Thread Tomas Vondra


On 08/31/2016 11:43 PM, Andres Freund wrote:
> On 2016-08-31 23:40:46 +0200, Tomas Vondra wrote:
>> It's an improvement (and it's pretty much exactly what I proposed
>> upthread). But it does not solve the problems with pg_statistic for
>> example (each backend needs it's own statistics. So we'd either bloat
>> the pg_statistic (if we manage to solve the problem that the table has
>> the same oid in all backends), or we would need in-memory tuples (just
>> like discussed in the thread so far).
> 
> Creating a session private version of pg_statistic would be fairly
> simple.

Sure. I'm just saying it's not as simple as overriding relpath.

ISTM we only need the pg_statistics (as other catalogs are connected to
the pg_class entry), which does not have the dependency issues. Or do we
need other catalogs?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] pg_basebackup stream xlog to tar

2016-08-31 Thread Magnus Hagander
Attached patch adds support for -X stream to work with .tar and .tar.gz
file formats.

If tar mode is specified, a separate pg_xlog.tar (or .tar.gz) file is
created and the data is streamed into it. Regular mode does not (should not)
see any changes in how it works.

The implementation creates a "walmethod" for directory and one for tar,
which is basically a set of function pointers that we pass around as part
of the StreamCtl structure. All calls that modify the files are sent through
the current method, using the normal open/read/write calls as now for
directories, and the more complicated method for tar and tar.gz.
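
(Purely to illustrate the shape of such an interface -- the member names and
signatures below are guesses for illustration, not the patch's actual
definitions:)

typedef void *Walfile;			/* opaque handle owned by the method */

typedef struct WalMethod
{
	Walfile		(*open_file) (const char *pathname, size_t pad_to_size);
	ssize_t		(*write) (Walfile f, const void *buf, size_t count);
	off_t		(*get_current_pos) (Walfile f);	/* avoids re-lseeking */
	int			(*close_file) (Walfile f);
	int			(*unlink_file) (const char *pathname);	/* cleanup on failure */
	bool		(*finish) (void);	/* e.g. write the tar trailer */
} WalMethod;

All WAL writing in the receive loop then goes through the active method,
e.g. stream->walmethod->write(walfile, buf, len).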

The tar method doesn't support all things that are required for
pg_receivexlog, but I don't think it makes any real sense to have support
for pg_receivexlog in tar mode. But it does support all the things that
pg_basebackup needs.

Some smaller pieces of functionality like unlinking files on failure and
padding files have been moved into the walmethod because they have to be
implemented differently (we cannot pre-pad a compressed file -- the size
will depend on the compression ratio anyway -- for example).

AFAICT we never actually documented that -X stream doesn't work with tar in
the manpage of current versions. Only in the error message. We might want
to fix that in backbranches.

In passing this also fixes an XXX comment about not re-lseeking on the WAL
file all the time -- the walmethod now tracks the current position in the
file in a variable.

Finally, to make this work, the print_tar_number() function is now exported
from port/tar.c along with the other ones already exported from there.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
diff --git a/doc/src/sgml/ref/pg_basebackup.sgml b/doc/src/sgml/ref/pg_basebackup.sgml
index 03615da..981d201 100644
--- a/doc/src/sgml/ref/pg_basebackup.sgml
+++ b/doc/src/sgml/ref/pg_basebackup.sgml
@@ -180,7 +180,8 @@ PostgreSQL documentation
 target directory, the tar contents will be written to
 standard output, suitable for piping to for example
 gzip. This is only possible if
-the cluster has no additional tablespaces.
+the cluster has no additional tablespaces and transaction
+log streaming is not used.


  
@@ -323,6 +324,10 @@ PostgreSQL documentation
  If the log has been rotated when it's time to transfer it, the
  backup will fail and be unusable.

+   
+The transaction log files will be written to
+ the base.tar file.
+   
   
  
 
@@ -339,6 +344,9 @@ PostgreSQL documentation
  client can keep up with transaction log received, using this mode
  requires no extra transaction logs to be saved on the master.

+   The transactionn log files are written to a separate file
+called pg_xlog.tar.
+   
   
  
 
diff --git a/src/bin/pg_basebackup/Makefile b/src/bin/pg_basebackup/Makefile
index fa1ce8b..52ac9e9 100644
--- a/src/bin/pg_basebackup/Makefile
+++ b/src/bin/pg_basebackup/Makefile
@@ -19,7 +19,7 @@ include $(top_builddir)/src/Makefile.global
 override CPPFLAGS := -I$(libpq_srcdir) $(CPPFLAGS)
 LDFLAGS += -L$(top_builddir)/src/fe_utils -lpgfeutils -lpq
 
-OBJS=receivelog.o streamutil.o $(WIN32RES)
+OBJS=receivelog.o streamutil.o walmethods.o $(WIN32RES)
 
 all: pg_basebackup pg_receivexlog pg_recvlogical
 
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 351a420..58c0821 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -365,7 +365,7 @@ typedef struct
 {
 	PGconn	   *bgconn;
 	XLogRecPtr	startptr;
-	char		xlogdir[MAXPGPATH];
+	char		xlog[MAXPGPATH];	/* directory or tarfile depending on mode */
 	char	   *sysidentifier;
 	int			timeline;
 } logstreamer_param;
@@ -383,9 +383,13 @@ LogStreamerMain(logstreamer_param *param)
 	stream.standby_message_timeout = standby_message_timeout;
 	stream.synchronous = false;
 	stream.mark_done = true;
-	stream.basedir = param->xlogdir;
 	stream.partial_suffix = NULL;
 
+	if (format == 'p')
+		stream.walmethod = CreateWalDirectoryMethod(param->xlog);
+	else
+		stream.walmethod = CreateWalTarMethod(param->xlog, compresslevel);
+
 	if (!ReceiveXlogStream(param->bgconn, &stream))
 
 		/*
@@ -395,6 +399,14 @@ LogStreamerMain(logstreamer_param *param)
 		 */
 		return 1;
 
+	if (!stream.walmethod->finish())
+	{
+		fprintf(stderr,
+_("%s: could not finish writing WAL files: %s\n"),
+progname, strerror(errno));
+		return 1;
+	}
+
 	PQfinish(param->bgconn);
 	return 0;
 }
@@ -445,22 +457,25 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 		/* Error message already written in GetConnection() */
 		exit(1);
 
-	snprintf(param->xlogdir, sizeof(param->xlogdir), "%s/pg_xlog", b

Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-31 Thread Andres Freund
On 2016-08-31 23:40:46 +0200, Tomas Vondra wrote:
> It's an improvement (and it's pretty much exactly what I proposed
> upthread). But it does not solve the problems with pg_statistic for
> example (each backend needs it's own statistics. So we'd either bloat
> the pg_statistic (if we manage to solve the problem that the table has
> the same oid in all backends), or we would need in-memory tuples (just
> like discussed in the thread so far).

Creating a session private version of pg_statistic would be fairly
simple.




Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-31 Thread Tomas Vondra


On 08/31/2016 09:20 PM, Vik Fearing wrote:
> On 08/24/2016 06:16 PM, Robert Haas wrote:
>> On Tue, Aug 23, 2016 at 6:11 PM, Tomas Vondra
>>  wrote:
>>> Could someone please explain how the unlogged tables are supposed to fix the
>>> catalog bloat problem, as stated in the initial patch submission? We'd still
>>> need to insert/delete the catalog rows when creating/dropping the temporary
>>> tables, causing the bloat. Or is there something I'm missing?
>>
>> No, not really.  Jim just asked if the idea of partitioning the
>> columns was completely dead in the water, and I said, no, you could
>> theoretically salvage it.  Whether that does you much good is another
>> question.
>>
>> IMV, the point here is that you MUST have globally visible dependency
>> entries for this to work sanely.  If they're not in a catalog, they
>> have to be someplace else, and backend-private memory isn't good
>> enough, because that's not globally visible.  Until we've got a
>> strategy for that problem, this whole effort is going nowhere - even
>> though in other respects it may be a terrific idea.
> 
> Why not just have a regular-looking table, with a "global temporary"
> relpersistence (I don't care which letter it gets) and when a backend
> tries to access it, it uses its own private relfilenode instead of
> whatever is in pg_class, creating one if necessary.  That way the
> structure of the table is fixed, with all the dependencies and whatnot,
> but the content is private to each backend.  What's wrong with this idea?
> 

It's an improvement (and it's pretty much exactly what I proposed
upthread). But it does not solve the problems with pg_statistic for
example (each backend needs it's own statistics. So we'd either bloat
the pg_statistic (if we manage to solve the problem that the table has
the same oid in all backends), or we would need in-memory tuples (just
like discussed in the thread so far).



-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-31 Thread Andres Freund
On 2016-08-31 14:09:47 -0700, Peter Geoghegan wrote:
> On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund  wrote:
> > Rebased version attached. A review would be welcome. Plan to push this
> > forward otherwise in the not too far away future.
> 
> This looks good.

Thanks for looking!


> The only thing that stuck out to any degree is that we don't grow the
> "reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
> function geometrically. In other words, we don't separately keep track
> of the array size and allocated size, and grow the allocated size
> using the doubling strategy that you see in many places.

I can't really see that being worth the code here. The cost of
open()/lseek()'ing etc. is going to dwarf the cost of this.


Regards,

Andres




Re: [HACKERS] [PATCH] Reload SSL certificates on SIGHUP

2016-08-31 Thread Peter Geoghegan
On Sun, Nov 22, 2015 at 7:29 PM, Andreas Karlsson  wrote:
> Sorry for dropping this patch, but now I have started looking at it again.

Any chance of picking this up again soon, Andreas? I think it's an
important project. I would like to review it.


-- 
Peter Geoghegan




Re: [HACKERS] Replace use malloc() & friend by memory contexts for plperl and pltcl

2016-08-31 Thread Tom Lane
Michael Paquier  writes:
> Cleanup $subject has been raised a couple of times, like one year ago here:
> https://www.postgresql.org/message-id/cab7npqrxvq+q66ufzd9wa5uaftyn4wauadbjxkfrync96kf...@mail.gmail.com
> And more recently here while working on the NULL checks for malloc():
> https://www.postgresql.org/message-id/CAB7nPqR7ozfCqc6C=2+ctCcnuehTZ-vTDQ8MMhY=bqyet0h...@mail.gmail.com

> Attached are a set of patches to replace those memory system calls by
> proper memory contexts:
> - 0001 does the cleanup work for pltcl
> - 0002 does this cleanup for plperl

I looked at 0001.  It seemed to me that it wasn't that useful to add a
context unless we did something about actually freeing it; otherwise
we're just increasing the amount of memory leaked after a function
redefinition.  However, it proved pretty easy to shoehorn in a refcounting
mechanism like plpgsql has, so I did that and pushed it.

Off to look at 0002 next.

regards, tom lane




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-31 Thread Peter Geoghegan
On Wed, Aug 31, 2016 at 2:09 PM, Peter Geoghegan  wrote:
> The only thing that stuck out to any degree is that we don't grow the
> "reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
> function geometrically.

That new function looks like this:

> +static void
> +_fdvec_resize(SMgrRelation reln,
> + ForkNumber forknum,
> + int nseg)
>  {
> -   return (MdfdVec *) MemoryContextAlloc(MdCxt, sizeof(MdfdVec));
> +   if (nseg == 0)
> +   {
> +   if (reln->md_num_open_segs[forknum] > 0)
> +   {
> +   pfree(reln->md_seg_fds[forknum]);
> +   reln->md_seg_fds[forknum] = NULL;
> +   }
> +   }
> +   else if (reln->md_num_open_segs[forknum] == 0)
> +   {
> +   reln->md_seg_fds[forknum] =
> +   MemoryContextAlloc(MdCxt, sizeof(MdfdVec) * nseg);
> +   }
> +   else
> +   {
> +   reln->md_seg_fds[forknum] =
> +   repalloc(reln->md_seg_fds[forknum],
> +sizeof(MdfdVec) * nseg);
> +   }
> +
> +   reln->md_num_open_segs[forknum] = nseg;
>  }

Another tiny tweak that you might consider occurs to me here: It
couldn't hurt to "Assert(reln->md_seg_fds[forknum] == NULL)" within
the "else if (reln->md_num_open_segs[forknum] == 0)" path here, prior
to the MemoryContextAlloc(). If it's worth setting it to NULL when
there are 0 segs (i.e., "reln->md_seg_fds[forknum] = NULL"), then
perhaps it's worth checking that things are found that way later.
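
(Concretely, the branch with the suggested assertion would look roughly like
this:)

	else if (reln->md_num_open_segs[forknum] == 0)
	{
		/* the nseg == 0 path above should have left this NULL */
		Assert(reln->md_seg_fds[forknum] == NULL);
		reln->md_seg_fds[forknum] =
			MemoryContextAlloc(MdCxt, sizeof(MdfdVec) * nseg);
	}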

I guess that this is at odds with remarks in my last mail about
considering geometric allocations for the "reln->md_seg_fds[forknum]"
array. Both feedback items are more or less just things that bothered
me ever so slightly, which I don't necessarily expect you to act on,
but assume you'd like to hear.

-- 
Peter Geoghegan




Re: [HACKERS] _mdfd_getseg can be expensive

2016-08-31 Thread Peter Geoghegan
On Thu, Aug 18, 2016 at 5:26 PM, Andres Freund  wrote:
> Rebased version attached. A review would be welcome. Plan to push this
> forward otherwise in the not too far away future.

This looks good.

The only thing that stuck out to any degree is that we don't grow the
"reln->md_seg_fds[forknum]" array within the new _fdvec_resize()
function geometrically. In other words, we don't separately keep track
of the array size and allocated size, and grow the allocated size
using the doubling strategy that you see in many places.

Now, I freely admit that that probably doesn't matter, given what that
array tracks. I'm just pointing out that that aspect did give me
pause. The struct MdfdVec is small enough that that *might* be
worthwhile.
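
(A minimal sketch of the doubling strategy referred to here, assuming a
hypothetical md_seg_capacity[] field kept next to md_num_open_segs[]; this is
not what the patch does:)

	if (nseg > reln->md_seg_capacity[forknum])
	{
		int		newcap = Max(nseg, reln->md_seg_capacity[forknum] * 2);

		if (reln->md_seg_fds[forknum] == NULL)
			reln->md_seg_fds[forknum] =
				MemoryContextAlloc(MdCxt, sizeof(MdfdVec) * newcap);
		else
			reln->md_seg_fds[forknum] =
				repalloc(reln->md_seg_fds[forknum], sizeof(MdfdVec) * newcap);
		reln->md_seg_capacity[forknum] = newcap;
	}
	reln->md_num_open_segs[forknum] = nseg;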

-- 
Peter Geoghegan




Re: [HACKERS] delta relations in AFTER triggers

2016-08-31 Thread Kevin Grittner
I have merged in the changes since v4 (a year and a half ago) and
cured all bit-rot I found, to get the attached v5 which runs `make
check world` without problem -- including the tests added for this
feature.

I did remove the contrib code that David Fetter wrote to
demonstrate the correctness and performance of the tuplestores as
created during the transaction, and how to use them directly from C
code, before any API code was written.  If we want that to be
committed, it should be considered separately after the main
feature is in.

Thanks to Thomas Munro, who took a look at v4, pointed out a bug
(which is fixed in v5), and suggested a way forward for using the
parameters.  Initial attempts to get that working were not
successful, but I think it is fundamentally the right course,
should we reach a consensus to go that way.

On Thu, Jul 7, 2016 at 5:07 PM, Robert Haas  wrote:
> On Fri, May 13, 2016 at 2:37 PM, Kevin Grittner  wrote:
>> On Fri, May 13, 2016 at 1:02 PM, David Fetter  wrote:
>>> On Thu, Jan 22, 2015 at 02:41:42PM +, Kevin Grittner wrote:
>>
 [ideas on how to pass around references to ephemeral relations]
>>>
>>> [almost 17 months later]
>>>
>>> It seems like now is getting close to the time to get this into
>>> master.  The patch might have suffered from some bit rot, but the
>>> design so far seems sound.
>>>
>>> What say?
>>
>> I had a talk with Tom in Brussels about this.  As I understood it,
>> he wasn't too keen on the suggestion by Heikki (vaguely
>> sorta-endorsed by Robert) of passing ephemeral named relations such
>> as these tuplestores around in the structures currently used for
>> parameter values.  He intuitively foresaw the types of problems I
>> had run into trying to use the same structure to pass a relation
>> (with structure and rows and columns of data) as is used to pass,
>> say, an integer.
>
> See, the thing that disappoints me about this is that I think we were
> pretty closed to having the ParamListInfo-based approach working.

Maybe, but the thing I would like to do before proceeding down that
road is to confirm that we have a consensus that such a course is
better than what is on the branch which is currently working.  If
that's the consensus here, I'll work on that for the next CF.  If
not, there may not be a lot left to do before commit.  (Notably, we
may want to provide a way to free a tuplestore pointer when done
with it, but that's not too much work.)  Let me describe the API I
have working.

There are 11 function prototypes modified under src/include, in all
cases to add a Tsrcache parameter:
1 createas.h
3 explain.h
1 prepare.h
1 analyze.h
2 tcopprot.h
3 utility.h

There are three new function prototypes in SPI.  NOTE: This does
*not* mean that SPI is required to use named tuplestores in
queries, just that there are convenience functions for any queries
being run through SPI, so that any code using SPI (including any
PLs that do) will find assigning a name to a tuplestore and
referencing that within a query about as easy as falling off a log.
A tuplestore is registered to the current SPI context and not
visible outside that context.  This results in a Tsrcache being
passed to the functions mentioned above when that context is
active, just as any non-SPI code could do.
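
(As a usage sketch only -- SPI_register_tuplestore() below is a placeholder
name for one of those convenience functions, not necessarily what the patch
calls it -- the idea is roughly:)

static void
count_delta_rows(Tuplestorestate *newrows, TupleDesc tupdesc)
{
	SPI_connect();

	/* register under a name; visible only within this SPI context */
	SPI_register_tuplestore("new_rows", newrows, tupdesc);

	SPI_execute("SELECT count(*) FROM new_rows", true, 0);

	SPI_finish();
}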

> The thing I liked about that approach is that we already know that any
> place where you can provide parameters for a query, there will also be
> an opportunity to provide a ParamListInfo.  And I think that
> parameterizing a query by an ephemeral table is conceptually similar
> to parameterizing it by a scalar value.  If we invent a new mechanism,
> it's got to reach all of those same places in the code.

Do you see someplace that the patch missed?

> One other comment that I would make here is that I think that it's
> important, however we pass the data, that the scope of the tuplestores
> is firmly lexical and not dynamic.  We need to make sure that we don't
> set some sort of context via global variables that might get used for
> some query other than the one to which it's intended to apply.

Is this based on anything actually in the patch?


For this CF, the main patch attached is a working version of the
feature that people can test, review documentation, etc.  Any API
changes are not expected to change these visible behaviors, so any
feedback on usability or documentation can be directly useful
regardless of the API discussion.

I have also attached a smaller patch which applies on top of the
main one which rips out the Tsrcache API to get to a "no API"
version that compiles cleanly and runs fine as long as you don't
try to use the feature, in which case it will not recognize the
tuplestore names and will give this message: "executor could not
find named tuplestore  \"%s\"".  There may be a little bit left to
rip out when adding a parameter-based API, but this is basically
where we're moving from if we go that way.  I include it both to
help isolate the API we're discussing an

Re: [HACKERS] ICU integration

2016-08-31 Thread Doug Doole
Hi all. I’m new to the PostgreSQL code and the mailing list, but I’ve had a lot 
of experience with using ICU in a different database product. So while I’m not 
up to speed on the code yet, I can offer some insights on using ICU.

> On Aug 30, 2016, at 9:12 PM, Peter Eisentraut 
>  wrote:
>> How stable are the ICU locales? Most importantly, does ICU offer any
>> way to "pin" a locale version, so we can say "we want de_DE as it was
>> in ICU 4.6" and get consistent behaviour when the user sets up a
>> replica on some other system with ICU 4.8? Even if the German
>> government has changed its mind (again) about some details of the
>> language and 4.8 knows about the changes but 4.4 doesn’t?

ICU explicitly does not provide stability in their locales and collations. We 
pushed them hard to provide this, but between changes to the CLDR data and 
changes to the ICU code it just wasn’t feasible for them to provide version to 
version stability.

What they do offer is a compile option when building ICU to version all their 
APIs. So instead of calling icu_foo() you’d call icu_foo46(). (Or something 
like this - it’s been a few years since I actually worked with the ICU code.) 
This ultimately allows you to load multiple versions of the ICU library into a 
single program and provide stability by calling the appropriate version of the 
library. (Unfortunately, the OS - at least my Linux box - only provides the 
generic version of ICU and not the version annotated APIs, which means a 
separate compile of ICU is needed.)
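
(A minimal sketch, assuming ICU's usual symbol-renaming scheme: with renaming
enabled, ucol_open below is really a macro for a version-suffixed symbol such
as ucol_open_46, which is what allows two ICU versions to be linked into one
binary side by side.)

#include <unicode/ucol.h>

static UCollator *
open_collator_for(const char *locale)
{
	UErrorCode	status = U_ZERO_ERROR;
	UCollator  *coll = ucol_open(locale, &status);	/* ucol_open_NN underneath */

	if (U_FAILURE(status))
		return NULL;
	return coll;				/* caller calls ucol_close(coll) when done */
}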

The catch with this is that it means you likely want to expose the version 
information. In another note it was suggested to use something like fr_FR%icu. 
If you want to pin it to a specific version of ICU, you’ll likely need 
something like fr_FR%icu46. (There’s nothing wrong with supporting fr_FR%icu to 
give users an easy way of saying “give me the latest and greatest”, but you’d 
probably want to harden it to a specific ICU version internally.)

> I forgot to mention this, but the patch adds a collversion column that
> stores the collation version (provided by ICU).  And then when you
> upgrade ICU to something incompatible you get
> 
> +   if (numversion != collform->collversion)
> +   ereport(WARNING,
> +   (errmsg("ICU collator version mismatch"),
> +errdetail("The database was created using
> version 0x%08X, the library provides version 0x%08X.",
> +  (uint32) collform->collversion,
> (uint32) numversion),
> +errhint("Rebuild affected indexes, or build
> PostgreSQL with the right version of ICU.")));
> 
> So you still need to manage this carefully, but at least you have a
> chance to learn about it.

Indexes are the obvious place where collation comes into play, and are 
relatively easy to address. But consider all the places where string 
comparisons can be done. For example, check constraints and referential 
constraints can depend on string comparisons. If the collation rules change 
because of a new version of ICU, the database can become inconsistent and will 
need a lot more work than an index rebuild.

> Suggestions for refining this are welcome.
> 
> -- 
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Doug Doole
Salesforce





Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2016-08-31 Thread Bruce Momjian
On Wed, Aug 31, 2016 at 04:03:29PM -0400, Bruce Momjian wrote:
> Why not just remember the tid of chains converted from WARM to HOT, then
> use "amrecheck" on an index entry matching that tid to see if the index
> matches one of the entries in the chain.  (It will match all of them or
> none of them, because they are all red.)  I don't see a point in
coloring the index entries as red, as later you would have to convert to
> blue in the WARM-to-HOT conversion, and a vacuum crash could lead to
> inconsistencies.  Consider that you can just call "amrecheck" on the few
> chains that have converted from WARM to HOT.  I believe this is more
> crash-safe too.  However, if you have converted WARM to HOT in the heap,
> but crash during the index entry removal, you could potentially have
> duplicates in the index later, which is bad.

I think Pavan had the "crash durin the index entry removal" fixed via:

> During the second heap scan, we fix WARM chain by clearing HEAP_WARM_TUPLE 
> flag
> and also reset Red flag to Blue.

so the marking from WARM to HOT only happens after the index has been cleaned.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +




Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2016-08-31 Thread Bruce Momjian
On Wed, Aug 31, 2016 at 10:15:33PM +0530, Pavan Deolasee wrote:
> Instead, what I would like to propose and the patch currently implements is to
> restrict WARM update to once per chain. So the first non-HOT update to a tuple
> or a HOT chain can be a WARM update. The chain can further be HOT updated any
> number of times. But it can no further be WARM updated. This might look too
> restrictive, but it can still bring down the number of regular updates by
> almost 50%. Further, if we devise a strategy to convert a WARM chain back to
> HOT chain, it can again be WARM updated. (This part is currently not
> implemented). A good side effect of this simple strategy is that we know there
> can maximum two index entries pointing to any given WARM chain.

I like the simplified approach, as long as it doesn't block further
improvements.

> Headline TPS numbers:
> 
> Master:
> 
> transaction type: update.sql
> scaling factor: 700
> query mode: simple
> number of clients: 16
> number of threads: 8
> duration: 57600 s
> number of transactions actually processed: 65552986
> latency average: 14.059 ms
> tps = 1138.072117 (including connections establishing)
> tps = 1138.072156 (excluding connections establishing)
> 
> 
> WARM:
> 
> transaction type: update.sql
> scaling factor: 700
> query mode: simple
> number of clients: 16
> number of threads: 8
> duration: 57600 s
> number of transactions actually processed: 116168454
> latency average: 7.933 ms
> tps = 2016.812924 (including connections establishing)
> tps = 2016.812997 (excluding connections establishing)

These are very impressive results.

> Converting WARM chains back to HOT chains (VACUUM ?)
> -
> 
> The current implementation of WARM allows only one WARM update per chain. This
> simplifies the design and addresses certain issues around duplicate scans. But
> this also implies that the benefit of WARM will be no more than 50%, which is
> still significant, but if we could return WARM chains back to normal status, 
> we
> could do far more WARM updates.
> 
> A distinct property of a WARM chain is that at least one index has more than
> one live index entries pointing to the root of the chain. In other words, if 
> we
> can remove duplicate entry from every index or conclusively prove that there
> are no duplicate index entries for the root line pointer, the chain can again
> be marked as HOT.

I had not thought of how to convert from WARM to HOT yet.

> Here is one idea, but more thoughts/suggestions are most welcome. 
> 
> A WARM chain has two parts, separated by the tuple that caused WARM update. 
> All
> tuples in each part has matching index keys, but certain index keys may not
> match between these two parts. Lets say we mark heap tuples in each part with 
> a
> special Red-Blue flag. The same flag is replicated in the index tuples. For
> example, when new rows are inserted in a table, they are marked with Blue flag
> and the index entries associated with those rows are also marked with Blue
> flag. When a row is WARM updated, the new version is marked with Red flag and
> the new index entry created by the update is also marked with Red flag.
> 
> 
> Heap chain: lp  [1] [2] [3] [4]
>   [, ]B -> [, ]B -> [, ]R -> [, ]R
> 
> Index1: ()B points to 1 (satisfies only tuples marked with B)
> ()R points to 1 (satisfies only tuples marked with R)
> 
> Index2: ()B points to 1 (satisfies both B and R tuples)
> 
> 
> It's clear that for indexes with Red and Blue pointers, a heap tuple with Blue
> flag will be reachable from Blue pointer and that with Red flag will be
> reachable from Red pointer. But for indexes which did not create a new entry,
> both Blue and Red tuples will be reachable from Blue pointer (there is no Red
> pointer in such indexes). So, as a side note, matching Red and Blue flags is
> not enough from index scan perspective.
> 
> During first heap scan of VACUUM, we look for tuples with HEAP_WARM_TUPLE set.
> If all live tuples in the chain are either marked with Blue flag or Red flag
> (but no mix of Red and Blue), then the chain is a candidate for HOT 
> conversion.

Uh, if the chain is all blue, then there are no WARM entries, so it is
already a HOT chain and there is nothing to do, right?

> We remember the root line pointer and Red-Blue flag of the WARM chain in a
> separate array.
> 
> If we have a Red WARM chain, then our goal is to remove Blue pointers and vice
> versa. But there is a catch. For Index2 above, there is only Blue pointer
> and that must not be removed. IOW we should remove Blue pointer iff a Red
> pointer exists. Since index vacuum may visit Red and Blue pointers in any
> order, I think we will need another index pass to remove dead
> index pointers. So in the first index pass we check which WARM candidates have
> 2 index pointers. In the second pass, we remove the dead pointer and reset Red
> flag is the sur

Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-31 Thread Vik Fearing
On 08/24/2016 06:16 PM, Robert Haas wrote:
> On Tue, Aug 23, 2016 at 6:11 PM, Tomas Vondra
>  wrote:
>> Could someone please explain how the unlogged tables are supposed to fix the
>> catalog bloat problem, as stated in the initial patch submission? We'd still
>> need to insert/delete the catalog rows when creating/dropping the temporary
>> tables, causing the bloat. Or is there something I'm missing?
> 
> No, not really.  Jim just asked if the idea of partitioning the
> columns was completely dead in the water, and I said, no, you could
> theoretically salvage it.  Whether that does you much good is another
> question.
> 
> IMV, the point here is that you MUST have globally visible dependency
> entries for this to work sanely.  If they're not in a catalog, they
> have to be someplace else, and backend-private memory isn't good
> enough, because that's not globally visible.  Until we've got a
> strategy for that problem, this whole effort is going nowhere - even
> though in other respects it may be a terrific idea.

Why not just have a regular-looking table, with a "global temporary"
relpersistence (I don't care which letter it gets) and when a backend
tries to access it, it uses its own private relfilenode instead of
whatever is in pg_class, creating one if necessary.  That way the
structure of the table is fixed, with all the dependencies and whatnot,
but the content is private to each backend.  What's wrong with this idea?
-- 
Vik Fearing  +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
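
(A very rough sketch of that idea, with made-up names --
RELPERSISTENCE_GLOBAL_TEMP and the two filenode helpers do not exist -- just
to show where the backend-private relfilenode would be substituted:)

static Oid
RelationGetPrivateFilenode(Relation rel)
{
	if (rel->rd_rel->relpersistence == RELPERSISTENCE_GLOBAL_TEMP)
	{
		Oid		filenode = BackendPrivateFilenodeLookup(RelationGetRelid(rel));

		if (!OidIsValid(filenode))
			filenode = BackendPrivateFilenodeCreate(rel);	/* first use */
		return filenode;
	}

	/* ordinary relations keep using what pg_class says */
	return rel->rd_rel->relfilenode;
}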




Re: [HACKERS] Proposal for changes to recovery.conf API

2016-08-31 Thread Abhijit Menon-Sen
At 2016-08-31 17:15:59 +0100, si...@2ndquadrant.com wrote:
>
> * Recovery parameters would now be part of the main postgresql.conf
> infrastructure
> Any parameters set in $DATADIR/recovery.conf will be read after the
> main parameter file, similar to the way that postgresql.conf.auto is
> read.
> (Abhijit)
> 
> * Parameters
> All of the parameters formerly set in recovery.conf can be set in
> postgresql.conf using RELOAD
> These parameters will have no defaults in postgresql.conf.sample
> Setting them has no effect during normal running, or once recovery ends.
>  https://www.postgresql.org/docs/devel/static/archive-recovery-settings.html
>  https://www.postgresql.org/docs/devel/static/recovery-target-settings.html
>  https://www.postgresql.org/docs/devel/static/standby-settings.html
> (Abhijit)

I've attached a WIP patch for the above (recovery_guc_v20160831.patch).
This was based on the unite_recoveryconf_postgresqlconf_v3.patch posted
by Fujii Masao.

Unfortunately, some parts conflict with the patch that Simon just posted
(e.g., his patch removes trigger_file altogether, whereas mine converts
it into a GUC along the lines of the original patch). Rather than trying
to untangle that right now, I'm posting what I have as-is, and I'll post
an updated version tomorrow.

-- Abhijit
commit 999c0c2632f0d4c20d20b9ac30cd258305f74bac
Author: Abhijit Menon-Sen 
Date:   Wed Aug 31 22:18:07 2016 +0530

Convert recovery.conf settings into GUCs

Based on unite_recoveryconf_postgresqlconf_v3.patch by Fujii Masao.

diff --git a/contrib/pg_standby/pg_standby.c b/contrib/pg_standby/pg_standby.c
index e4136f9..8fcb85c 100644
--- a/contrib/pg_standby/pg_standby.c
+++ b/contrib/pg_standby/pg_standby.c
@@ -520,7 +520,7 @@ usage(void)
 	printf("  -w MAXWAITTIME max seconds to wait for a file (0=no limit) (default=0)\n");
 	printf("  -?, --help show this help, then exit\n");
 	printf("\n"
-		   "Main intended use as restore_command in recovery.conf:\n"
+		   "Main intended use as restore_command in postgresql.conf:\n"
 		   "  restore_command = 'pg_standby [OPTION]... ARCHIVELOCATION %%f %%p %%r'\n"
 		   "e.g.\n"
 	"  restore_command = 'pg_standby /mnt/server/archiverdir %%f %%p %%r'\n");
diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 0f09d82..f43e41e 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1174,8 +1174,15 @@ SELECT pg_stop_backup();


 
- Create a recovery command file recovery.conf in the cluster
- data directory (see ). You might
+ Set up recovery parameters in postgresql.conf (see
+  and
+ ).
+
+   
+   
+
+ Create a recovery trigger file recovery.trigger
+ in the cluster data directory. You might
  also want to temporarily modify pg_hba.conf to prevent
  ordinary users from connecting until you are sure the recovery was successful.
 
@@ -1187,7 +1194,7 @@ SELECT pg_stop_backup();
  recovery be terminated because of an external error, the server can
  simply be restarted and it will continue recovery.  Upon completion
  of the recovery process, the server will rename
- recovery.conf to recovery.done (to prevent
+ recovery.trigger to recovery.done (to prevent
  accidentally re-entering recovery mode later) and then
  commence normal database operations.
 
@@ -1203,12 +1210,11 @@ SELECT pg_stop_backup();

 

-The key part of all this is to set up a recovery configuration file that
-describes how you want to recover and how far the recovery should
-run.  You can use recovery.conf.sample (normally
-located in the installation's share/ directory) as a
-prototype.  The one thing that you absolutely must specify in
-recovery.conf is the restore_command,
+The key part of all this is to set up recovery parameters that
+specify how you want to recover and how far the recovery should
+run. The one thing that you absolutely must specify in
+postgresql.conf to recover from the backup is
+the restore_command,
 which tells PostgreSQL how to retrieve archived
 WAL file segments.  Like the archive_command, this is
 a shell command string.  It can contain %f, which is
@@ -1270,7 +1276,7 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'

 If you want to recover to some previous point in time (say, right before
 the junior DBA dropped your main transaction table), just specify the
-required stopping point in recovery.conf.  You can specify
+required stopping point in postgresql.conf.  You can specify
 the stop point, known as the recovery target, either by
 date/time, named restore point or by completion of a specific transaction
 ID.  As of this writing only the date/time and named restore point options
@@ -1367,8 +1373,9 @@ restore_command = 'cp /mnt/server/archivedir/%f %p'
 The default behavior of recovery is to recover along the same timeline
 that was current 

Re: [HACKERS] autonomous transactions

2016-08-31 Thread Serge Rielau

> On Aug 31, 2016, at 6:46 AM, Greg Stark  wrote:
> 
> Using a background worker mean that the autonomous transaction can't
> access any state from the process memory. Parameters in plpgsql are a
> symptom of this but I suspect there will be others. What happens if a
> statement timeout occurs during an autonomous transaction? What
> happens if you use a pl language in the autonomous transaction and if
> it tries to use non-transactional information such as prepared
> statements?
> 
+1 on this.
The proposed solution loosely matches what was done in DB2 9.7, and it runs into 
the same complexity. Passing local variables or session-level variables back and 
forth became a source of grief.

At SFDC PG we have taken a different tack:
1. Gather up all the transaction state that is scattered across global 
variables into a struct
2. backup/restore transaction state when an autonomous transaction is invoked.

This allows full access to all non-transactional state.

The downside is that full access also includes uncommitted DDL (shared 
recache). 
So we had to restrict DDL in the parent transaction prior to the spawning of 
the child.
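
(To make the shape of that concrete, a sketch only -- the struct members and
the Save/Restore helpers are placeholders, not actual code:)

typedef struct SavedXactState
{
	TransactionId	xid;		/* illustrative members only */
	CommandId		cid;
	Snapshot		snapshot;
	ResourceOwner	resowner;
} SavedXactState;

static void
RunAutonomousTransaction(void (*body) (void))
{
	SavedXactState	saved;

	SaveTransactionGlobals(&saved);		/* placeholder: copy globals aside */
	StartTransactionCommand();
	body();
	CommitTransactionCommand();
	RestoreTransactionGlobals(&saved);	/* placeholder: put globals back */
}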

If there is interest in exploring this kind of solution as an alternative I can 
elaborate.

Cheers
Serge
 





Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Andres Freund
On 2016-08-31 14:23:41 -0400, Tom Lane wrote:
> Andres Freund  writes:
> > On 2016-08-31 13:59:48 -0400, Tom Lane wrote:
> >> You are ignoring the performance costs associated with eating 100x more
> >> shared buffer space than necessary.
> 
> > I doubt that's measurable in any real-world scenario. You seldom have
> > hundreds of thousands of sequences that you all select from at a high
> > rate.
> 
> If there are only a few sequences in the database, cross-sequence
> contention is not going to be a big issue anyway.

Isn't that *precisely* when it's going to matter? If you have 5 active
tables & sequences where the latter previously used independent locks,
and they'd now be contending on a single lock.  If you have hundreds of
thousands of active sequences, I doubt individual page locks would
become a point of contention.

Andres




Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Tom Lane
Andres Freund  writes:
> On 2016-08-31 13:59:48 -0400, Tom Lane wrote:
>> You are ignoring the performance costs associated with eating 100x more
>> shared buffer space than necessary.

> I doubt that's measurable in any real-world scenario. You seldom have
> hundreds of thousands of sequences that you all select from at a high
> rate.

If there are only a few sequences in the database, cross-sequence
contention is not going to be a big issue anyway.  I think most of
the point of making this change at all is to make things work better
when you do have a boatload of sequences.

Also, we could probably afford to add enough dummy padding to the sequence
tuples so that they're each exactly 64 bytes, thereby having only one
or two active counters per cache line.

regards, tom lane
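
(As a plain-C illustration of that padding idea -- the field list is made up,
not the actual sequence tuple layout:)

#include <stdbool.h>
#include <stdint.h>

#define SEQ_SLOT_SIZE 64		/* one cache line per sequence */

typedef struct SequenceSlot
{
	int64_t		last_value;
	int64_t		log_cnt;
	bool		is_called;
	char		padding[SEQ_SLOT_SIZE
						- 2 * sizeof(int64_t) - sizeof(bool)];
} SequenceSlot;

/* at most one or two active counters ever share a cache line */
_Static_assert(sizeof(SequenceSlot) == SEQ_SLOT_SIZE,
			   "sequence slot must be exactly one cache line");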




Re: [HACKERS] [Patch] Temporary tables that do not bloat pg_catalog (a.k.a fast temp tables)

2016-08-31 Thread Corey Huinker
On Wed, Aug 24, 2016 at 12:39 PM, Andres Freund  wrote:

>
>
> On August 24, 2016 9:32:48 AM PDT, Tomas Vondra <
> tomas.von...@2ndquadrant.com> wrote:
> >
> >
> >On 08/24/2016 12:20 AM, Andres Freund wrote:
> >> On 2016-08-23 19:18:04 -0300, Claudio Freire wrote:
> >>> On Tue, Aug 23, 2016 at 7:11 PM, Tomas Vondra
> >>>  wrote:
>  Could someone please explain how the unlogged tables are supposed
> >to fix the
>  catalog bloat problem, as stated in the initial patch submission?
> >We'd still
>  need to insert/delete the catalog rows when creating/dropping the
> >temporary
>  tables, causing the bloat. Or is there something I'm missing?
> >>
> >> Beats me.
> >>
> >
> >Are you puzzled just like me, or are you puzzled why I'm puzzled?
>
> Like you. I don't think this addresses the problem to a significant enough
> degree to care.
>
> Andres
> --
> Sent from my Android device with K-9 Mail. Please excuse my brevity.
>
>
>

Ok, here's a wild idea, and it probably depends on having native
partitioning implemented.

Propagate relpersistence, or a boolean flag on (relpersistence = 't') from
pg_class into the child pg_attribute records.

Partition the tables pg_class and pg_attribute first by relpersistence, and
then by oid.

The partitions holding data on persistent objects would basically stay
as-is, but the partition wouldn't have much activity and no temp-table
churn.

The temporary ones, however, would fall into essentially a rotating set of
partitions. Pick enough partitions such that the active transactions only
cover some of the partitions. The rest can be safely truncated by vacuum.

It would mitigate the bloat, existing dictionary queries would still work,
but the additional lookup cost might not be worth it.


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2016-08-31 Thread Pavan Deolasee
On Wed, Aug 31, 2016 at 10:38 PM, Claudio Freire 
wrote:

>
> On Wed, Aug 31, 2016 at 1:45 PM, Pavan Deolasee 
> wrote:
>
>> We discussed a few ideas to address the "Duplicate Scan" problem. For
>> example, we can teach Index AMs to discard any duplicate (key, CTID) insert
>> requests. Or we could guarantee uniqueness by either only allowing updates
>> in one lexical order. While the former is a more complete solution to avoid
>> duplicate entries, searching through large number of keys for non-unique
>> indexes could be a drag on performance. The latter approach may not be
>> sufficient for many workloads. Also tracking increment/decrement for many
>> indexes will be non-trivial.
>>
>> There is another problem with allowing many index entries pointing to the
>> same WARM chain. It will be non-trivial to know how many index entries are
>> currently pointing to the WARM chain and index/heap vacuum will throw up
>> more challenges.
>>
>> Instead, what I would like to propose and the patch currently implements
>> is to restrict WARM update to once per chain. So the first non-HOT update
>> to a tuple or a HOT chain can be a WARM update. The chain can further be
>> HOT updated any number of times. But it can no further be WARM updated.
>> This might look too restrictive, but it can still bring down the number of
>> regular updates by almost 50%. Further, if we devise a strategy to convert
>> a WARM chain back to HOT chain, it can again be WARM updated. (This part is
>> currently not implemented). A good side effect of this simple strategy is
>> that we know there can maximum two index entries pointing to any given WARM
>> chain.
>>
>
> We should probably think about coordinating with my btree patch.
>
> From the description above, the strategy is quite readily "upgradable" to
> one in which the indexam discards duplicate (key,ctid) pairs and that would
> remove the limitation of only one WARM update... right?
>

Yes, we should be able to add further optimisations along the lines you're
working on, but what I like about the current approach is that a) it reduces
the complexity of the patch and b) having thought about cleaning up WARM
chains, limiting the number of index entries per root chain to a small number
will simplify that aspect too.
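
To make the "discard duplicate (key, ctid)" part concrete, a rough sketch of
the check it implies in an index AM's insert path is below. This is
illustrative only, not code from either patch; index_key_equal() is a made-up
stand-in for the AM's real key comparison (what _bt_compare does for btree).

/*
 * Sketch only: treat an index insert as a no-op if an identical
 * (key, ctid) pair is already present at the insertion point.
 */
static bool
is_duplicate_entry(IndexTuple newtup, ItemPointer newctid,
                   IndexTuple oldtup, ItemPointer oldctid)
{
    return index_key_equal(newtup, oldtup) &&      /* hypothetical comparator */
           ItemPointerEquals(newctid, oldctid);
}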

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Andres Freund
On 2016-08-31 13:59:48 -0400, Tom Lane wrote:
> and...@anarazel.de (Andres Freund) writes:
> > On 2016-08-31 14:25:43 -0300, Alvaro Herrera wrote:
> >> Yes, sure, we're still improving even if we stick to one-seq-per-bufpage,
> >> but while we're at it, we could as well find a way to make it as best as
> >> we can.  And allowing multiple seqs per page seems a much better
> >> situation, so let's try to get there.
> 
> > It's not really that simple. Having independent sequence rows closer
> > together will have its own performance cost.
> 
> You are ignoring the performance costs associated with eating 100x more
> shared buffer space than necessary.

I doubt that's measurable in any real-world scenario. You seldom have
hundreds of thousands of sequences, all of which you select from at a high
rate.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Tom Lane
and...@anarazel.de (Andres Freund) writes:
> On 2016-08-31 14:25:43 -0300, Alvaro Herrera wrote:
>> Yes, sure, we're still improving even if we stick to one-seq-per-bufpage,
>> but while we're at it, we could as well find a way to make it as best as
>> we can.  And allowing multiple seqs per page seems a much better
>> situation, so let's try to get there.

> It's not really that simple. Having independent sequence rows closer
> together will have its own performance cost.

You are ignoring the performance costs associated with eating 100x more
shared buffer space than necessary.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Use static inline functions for float <-> Datum conversions.

2016-08-31 Thread Tom Lane
Heikki Linnakangas  writes:
> On 08/31/2016 08:27 PM, Tom Lane wrote:
>> We had a bunch of similar problems in the recent work on exact degree trig
>> functions, and eventually found that storing values into volatile float8
>> variables was the most reliable way to force rounding to the expected
>> storage width.  I propose to replace the above hack with declaring
>> newelemorder as volatile, and see if that makes things better.

> Sounds reasonable. None of this is supposed to be necessary, we're just 
> working around broken compilers, so whatever works.

> I'll prepare a patch for that tomorrow, unless you're on it already.

On it already.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ICU integration

2016-08-31 Thread Peter Geoghegan
On Tue, Aug 30, 2016 at 7:46 PM, Peter Eisentraut
 wrote:
> In initdb, I initialize the default collation set as before from the
> `locale -a` output, but also add all available ICU locales with a "%icu"
> appended (so "fr_FR%icu").  I suppose one could create a configuration
> option perhaps in initdb to change the default so that, say, "fr_FR"
> uses ICU and "fr_FR%posix" uses the old stuff.

I suspect that we'd be better off adding a mechanism for adding a new
collation after initdb runs, on a live production instance. Maybe that
part can come later.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Use static inline functions for float <-> Datum conversions.

2016-08-31 Thread Heikki Linnakangas

On 08/31/2016 08:27 PM, Tom Lane wrote:

Heikki Linnakangas  writes:

Use static inline functions for float <-> Datum conversions.


Hmm, it looks like narwhal for one is not happy with this:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=narwhal&dt=2016-08-31%2016%3A00%3A03

I suspect the failure traces to this bit in pg_enum.c:

/*
 * On some machines, newelemorder may be in a register that's
 * wider than float4.  We need to force it to be rounded to float4
 * precision before making the following comparisons, or we'll get
 * wrong results.  (Such behavior violates the C standard, but
 * fixing the compilers is out of our reach.)
 */
newelemorder = DatumGetFloat4(Float4GetDatum(newelemorder));

If you suppose that inlining those macros allows gcc to decide that the
assignment is a no-op, the observed failure would be explained.


Ugh.


We had a bunch of similar problems in the recent work on exact degree trig
functions, and eventually found that storing values into volatile float8
variables was the most reliable way to force rounding to the expected
storage width.  I propose to replace the above hack with declaring
newelemorder as volatile, and see if that makes things better.


Sounds reasonable. None of this is supposed to be necessary, we're just 
working around broken compilers, so whatever works.


I'll prepare a patch for that tomorrow, unless you're on it already.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ICU integration

2016-08-31 Thread Peter Geoghegan
On Tue, Aug 30, 2016 at 7:46 PM, Peter Eisentraut
 wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.

I'm really happy that you're working on this. This is more important
than is widely appreciated, and very important overall.

In a world where ICU becomes the de facto standard (i.e. is used on
major platforms by default), what remaining barriers are there to
documenting and enforcing the binary compatibility of replicas? I may
be mistaken, but offhand I can't think of any. Being able to describe
exactly what works and what doesn't is very important. After all,
failure to adhere to "the rules" today, such as they are, can leave
you with a subtly broken replica. I'd like to make that scenario
mechanically impossible, by locking everything down.

> I'm not sure how well it will work to replace all the bits of LIKE and
> regular expressions with ICU API calls.  One problem is that ICU likes
> to do case folding as a whole string, not by character.  I need to do
> more research about that.

My guess is that there are cultural reasons why it wants to operate on
a whole string, at least in some cases.

> Also note that ICU locales are encoding-independent and don't support a
> separate collcollate and collctype, so the existing catalog structure is
> not optimal.

That makes more sense to me, personally. ICU very explicitly decouples
technical issues (like the representation of strxfrm() keys, and, I
gather, encoding) from cultural issues (the actual user-visible
behaviors). This allows us to use strxfrm()-style binary keys in
indexes directly, since they're versioned independently from their
underlying collation; they can add a new optimization to strxfrm()-key
generation in the next ICU version, and we can detect that and require
a REINDEX, even when the collation version itself (the user-visible
behaviors) is unchanged. I'm getting ahead of myself here, but that
does seem very useful.
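
As a point of reference, both pieces (binary sort keys and the collator
version) are plain C calls in ICU. A minimal standalone sketch, not
PostgreSQL code, with the locale and buffer sizes chosen arbitrarily and
built with -licuuc -licui18n:

#include <stdio.h>
#include <unicode/ucol.h>
#include <unicode/ustring.h>

int
main(void)
{
    UErrorCode  status = U_ZERO_ERROR;
    UCollator  *coll = ucol_open("fr_FR", &status);
    UChar       text[32];
    uint8_t     key[64];
    int32_t     keylen;
    UVersionInfo ver;
    char        verstr[U_MAX_VERSION_STRING_LENGTH];

    if (U_FAILURE(status))
        return 1;

    u_uastrcpy(text, "cote");

    /* strxfrm()-style binary key: memcmp()-able, could be stored in an index */
    keylen = ucol_getSortKey(coll, text, -1, key, sizeof(key));
    printf("sort key is %d bytes\n", keylen);

    /* the version an index (or a REINDEX check) could be tied to */
    ucol_getVersion(coll, ver);
    u_versionToString(ver, verstr);
    printf("collator version: %s\n", verstr);

    ucol_close(coll);
    return 0;
}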

The Unicode collation algorithm [1] that ICU is directly based on
knows plenty about the requirements of indexing. It contains guidance
about equivalence vs. equality that we learned the hard way in commit
656beff5, for example.

> Where it gets really interesting is what to do with the database
> locales.  They just set the global process locale.  So in order to port
> that to ICU we'd need to check every implicit use of the process locale
> and tweak it.  We could add a datcollprovider column or something.  But
> we also rely on the datctype setting to validate the encoding of the
> database.  Maybe we wouldn't need that anymore, but it sounds risky.

Not sure about that.

Whatever we come up with here needs to mesh well with the existing
conventions around collation versioning that ICU has, in the context
of various operating system packages in particular. We can arrange it
so that in practice, an ICU upgrade doesn't often break your indexes
due to a collation rule change; ICU is happy to have multiple versions
of a collation at a time, and you'll probably retain the old collation
version in ICU.

Even if your old collation version isn't available in a new ICU
release (which I think is unlikely in practice), or you downgrade ICU,
it might be possible to give guidance on how to download a "Collation
Resource Bundle" [2][3] that *does* have the right collation version,
which presumably satisfies the requirement immediately.

Firebird already uses ICU. Maybe we have something to learn from them
here. In particular, where does it (by which I mean the ICU version
that Firebird links to) get its collations from in practice? I think
that the CLDR Data collations were at one time not even distributed
with ICU source. It might be a matter of individual OS packagers of
ICU deciding what exact CLDR data to use, which may or may not be of
any significant consequence in practice.

[1] http://unicode.org/reports/tr10
[2] http://site.icu-project.org/design/size/collation
[3] http://userguide.icu-project.org/icudata
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Use static inline functions for float <-> Datum conversions.

2016-08-31 Thread Tom Lane
Heikki Linnakangas  writes:
> Use static inline functions for float <-> Datum conversions.

Hmm, it looks like narwhal for one is not happy with this:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=narwhal&dt=2016-08-31%2016%3A00%3A03

I suspect the failure traces to this bit in pg_enum.c:

/*
 * On some machines, newelemorder may be in a register that's
 * wider than float4.  We need to force it to be rounded to float4
 * precision before making the following comparisons, or we'll get
 * wrong results.  (Such behavior violates the C standard, but
 * fixing the compilers is out of our reach.)
 */
newelemorder = DatumGetFloat4(Float4GetDatum(newelemorder));

If you suppose that inlining those macros allows gcc to decide that the
assignment is a no-op, the observed failure would be explained.

We had a bunch of similar problems in the recent work on exact degree trig
functions, and eventually found that storing values into volatile float8
variables was the most reliable way to force rounding to the expected
storage width.  I propose to replace the above hack with declaring
newelemorder as volatile, and see if that makes things better.
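
For anyone who hasn't run into this before, the effect the volatile store is
meant to force can be seen in a standalone toy program (nothing to do with
pg_enum.c itself); the point is that comparisons have to be made against the
value after it has been rounded to float width:

#include <stdio.h>

int
main(void)
{
    float   a = 1.1f;
    float   b = 2.2f;
    double  wide = (double) a + (double) b;     /* may carry excess precision */
    volatile float rounded;

    rounded = (float) wide;     /* the store forces rounding to float width */

    printf("wide    : %.17g\n", wide);
    printf("rounded : %.17g\n", (double) rounded);
    printf("equal?  : %s\n", wide == (double) rounded ? "yes" : "no");
    return 0;
}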

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Andres Freund
On 2016-08-31 14:25:43 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > Hi,
> > 
> > On 2016-08-31 12:53:30 -0400, Tom Lane wrote:
> > > Improving on the space wastage is exactly the point IMO.  If it's still
> > > going to be 8k per sequence on disk (*and* in shared buffers, remember),
> > > I'm not sure it's worth all the work to change things at all.
> > 
> > A separate file is a heck lot more heavyweight than another 8 kb in an
> > existing file.
> 
> Yes, sure, we're still improving even if we stick to one-seq-per-bufpage,
> but while we're at it, we could as well find a way to make it as best as
> we can.  And allowing multiple seqs per page seems a much better
> situation, so let's try to get there.

It's not really that simple. Having independent sequence rows closer
together will have its own performance cost. Suddenly independent
sequences will compete for the same page level lock, NUMA systems will
have to transfer the page/cacheline around even if it's independent
sequences being accessed in different backends, we'll have to take care
about cacheline padding...


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Alvaro Herrera
Andres Freund wrote:
> Hi,
> 
> On 2016-08-31 12:53:30 -0400, Tom Lane wrote:
> > Improving on the space wastage is exactly the point IMO.  If it's still
> > going to be 8k per sequence on disk (*and* in shared buffers, remember),
> > I'm not sure it's worth all the work to change things at all.
> 
> A separate file is a heck lot more heavyweight than another 8 kb in an
> existing file.

Yes, sure, we're still improving even if we stick to one-seq-per-bufpage,
but while we're at it, we could as well find a way to make it as best as
we can.  And allowing multiple seqs per page seems a much better
situation, so let's try to get there.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Andres Freund
Hi,

On 2016-08-31 12:53:30 -0400, Tom Lane wrote:
> Improving on the space wastage is exactly the point IMO.  If it's still
> going to be 8k per sequence on disk (*and* in shared buffers, remember),
> I'm not sure it's worth all the work to change things at all.

A separate file is a heck lot more heavyweight than another 8 kb in an
existing file.


> Another idea would be to have nominally per-sequence LWLocks (or
> spinlocks?) controlling nextval's nontransactional accesses to the catalog
> rows, but to map those down to some fixed number of locks in a way similar
> to the current fallback implementation for spinlocks, which maps them onto
> a fixed number of semaphores.  You'd trade off shared memory against
> contention while choosing the underlying number of locks.

If we could rely on spinlocks to actually be spinlocks, we could just
put the spinlock beside the actual state data... Not entirely pretty,
but probably pretty simple.
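
(For illustration only, the layout being alluded to would presumably look
something like the struct below; the name and fields are made up, and the
types are the usual backend ones.)

typedef struct SequenceSharedEntry
{
    Oid         seqoid;         /* sequence this slot belongs to */
    slock_t     mutex;          /* protects the fields below */
    int64       last_value;
    bool        is_called;
    /* padding out to a full cacheline would go here, per the NUMA concern
     * raised earlier in the thread */
} SequenceSharedEntry;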


- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2016-08-31 Thread Claudio Freire
On Wed, Aug 31, 2016 at 1:45 PM, Pavan Deolasee 
wrote:

> We discussed a few ideas to address the "Duplicate Scan" problem. For
> example, we can teach Index AMs to discard any duplicate (key, CTID) insert
> requests. Or we could guarantee uniqueness by either only allowing updates
> in one lexical order. While the former is a more complete solution to avoid
> duplicate entries, searching through large number of keys for non-unique
> indexes could be a drag on performance. The latter approach may not be
> sufficient for many workloads. Also tracking increment/decrement for many
> indexes will be non-trivial.
>
> There is another problem with allowing many index entries pointing to the
> same WARM chain. It will be non-trivial to know how many index entries are
> currently pointing to the WARM chain and index/heap vacuum will throw up
> more challenges.
>
> Instead, what I would like to propose and the patch currently implements
> is to restrict WARM update to once per chain. So the first non-HOT update
> to a tuple or a HOT chain can be a WARM update. The chain can further be
> HOT updated any number of times. But it can no further be WARM updated.
> This might look too restrictive, but it can still bring down the number of
> regular updates by almost 50%. Further, if we devise a strategy to convert
> a WARM chain back to HOT chain, it can again be WARM updated. (This part is
> currently not implemented). A good side effect of this simple strategy is
> that we know there can maximum two index entries pointing to any given WARM
> chain.
>

We should probably think about coordinating with my btree patch.

From the description above, the strategy is quite readily "upgradable" to
one in which the indexam discards duplicate (key,ctid) pairs and that would
remove the limitation of only one WARM update... right?


Re: [HACKERS] autonomous transactions

2016-08-31 Thread Greg Stark
On Wed, Aug 31, 2016 at 3:11 PM, Craig Ringer  wrote:
>
> I suspect that there'll be way too much code that relies on stashing
> xact-scoped stuff in globals for that to be viable. Caches alone.
> Peter will be able to explain more, I'm sure.
>
> We'd probably need a new transaction data object that everything
> xact-scoped hangs off, so we can pass it everywhere or swap it out of
> some global. The mechanical refactoring alone would be pretty scary,
> not to mention the complexity of actually identifying all the less
> obvious places that need changing.

Well this is the converse of the same problem. Today process state and
transaction are tied together. One way or another you're trying to
split that -- either by having two processes share state or by having
one process manage two transactions.

I suppose we already have the infrastructure for parallel query so
there's at least some shared problem space there.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Tom Lane
Andres Freund  writes:
> On 2016-08-31 12:56:45 -0300, Alvaro Herrera wrote:
>> I was thinking that nextval could grab a shared buffer lock and release
>> immediately, just to ensure no one holds exclusive buffer lock
>> concurrently (which would be used for things like dropping one seq tuple
>> from the page, when a sequence is dropped); then control access to each
>> sequence tuple using LockDatabaseObject.  This is a HW lock, heavier
>> than a buffer's LWLock, but it seems better than wasting a full 8kb for
>> each sequence.

> That's going to go be a *lot* slower, I don't think that's ok.  I've a
> hard time worrying about the space waste here; especially considering
> where we're coming from.

Improving on the space wastage is exactly the point IMO.  If it's still
going to be 8k per sequence on disk (*and* in shared buffers, remember),
I'm not sure it's worth all the work to change things at all.

We're already dealing with taking a heavyweight lock for each sequence
(the relation AccessShareLock).  I wonder whether it'd be possible to
repurpose that effort somehow.

Another idea would be to have nominally per-sequence LWLocks (or
spinlocks?) controlling nextval's nontransactional accesses to the catalog
rows, but to map those down to some fixed number of locks in a way similar
to the current fallback implementation for spinlocks, which maps them onto
a fixed number of semaphores.  You'd trade off shared memory against
contention while choosing the underlying number of locks.
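
(As a rough sketch of the mapping itself, with made-up names, an arbitrary
array size, and a placeholder lock type standing in for whatever LWLock or
spinlock variant would actually be used:)

#define NUM_SEQ_LOCKS 64        /* shared memory vs. contention trade-off */

static seq_lock_t sequence_locks[NUM_SEQ_LOCKS];    /* placeholder type */

static inline seq_lock_t *
lock_for_sequence(Oid seqoid)
{
    /* cheap multiplicative hash so neighbouring OIDs spread out */
    uint32      h = (uint32) seqoid * 2654435761u;

    return &sequence_locks[h % NUM_SEQ_LOCKS];
}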

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: psql \setfileref

2016-08-31 Thread Pavel Stehule
2016-08-31 18:24 GMT+02:00 Corey Huinker :

> On Wed, Aug 31, 2016 at 11:32 AM, Pavel Stehule 
> wrote:
>
>> Hi
>>
>> I propose a new type of psql variables - file references. The content of
>> file reference is specified by referenced file. It allows simple inserting
>> large data without necessity of manual escaping or using LO api.
>>
>> When I wrote the patch, I used parametrized queries for these data
>> instead escaped strings - the code is not much bigger, and the error
>> messages are much more friendly if query is not bloated by bigger content.
>> The text mode is used only - when escaping is not required, then content is
>> implicitly transformed to bytea. By default the content of file is bytea.
>> When use requires escaping, then he enforces text escaping - because it has
>> sense only for text type.
>>
>> postgres=# \setfileref a ~/test2.xml
>> postgres=# \setfileref b ~/avatar.gif
>> postgres=# insert into test values(convert_from(:a, 'latin2')::xml, :b);
>> -- xml is passed as bytea
>> postgres=# insert into test values(:'a', :b); -- xml is passed via
>> unknown text value
>>
>> The content of file reference variables is not persistent in memory.
>>
>> Comments, notes?
>>
>> Regards
>>
>> Pavel
>>
>>
>> --
>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-hackers
>>
>>
> Clearly jumping ahead on this one, but if the fileref is essentially a
> pipe to "cat /path/to/file.name", is there anything stopping us from
> setting pipes?
>
>
My interest is primarily in ways that COPY could use this.
>

I don't see a reason why it should not be possible - the current code can't
do it, but with 20 lines more, it should be possible

There is one disadvantage against copy - the content should be fully loaded
to memory, but any other functionality should be same.

Regards

Pavel


Re: [HACKERS] autonomous transactions

2016-08-31 Thread Simon Riggs
On 31 August 2016 at 14:09, Joel Jacobson  wrote:
> On Wed, Aug 31, 2016 at 6:41 AM, Jaime Casanova
>  wrote:
>>
>> On 30 August 2016 at 23:10, Joel Jacobson  wrote:
>> >
>> > There should be a way to within the session and/or txn permanently
>> > block autonomous transactions.
>> >
>>
>> This will defeat one of the use cases of autonomous transactions: auditing
>
> My idea on how to deal with this would be to mark the function to be
> "AUTONOMOUS" similar to how a function is marked to be "PARALLEL
> SAFE",
> and to throw an error if a caller that has blocked autonomous
> transactions tries to call a function that is marked to be autonomous.
>
> That way none of the code that needs to be audited would ever get executed.

Not sure I see why you would want to turn off execution for only some functions.

What happens if your function calls some other function with
side-effects? How would you roll that back? How would you mark
functions for the general case?

Functions with side effects can't be tested with simple unit tests;
that has nothing to do with autonomous transactions.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: psql \setfileref

2016-08-31 Thread Corey Huinker
On Wed, Aug 31, 2016 at 11:32 AM, Pavel Stehule 
wrote:

> Hi
>
> I propose a new type of psql variables - file references. The content of
> file reference is specified by referenced file. It allows simple inserting
> large data without necessity of manual escaping or using LO api.
>
> When I wrote the patch, I used parametrized queries for these data instead
> escaped strings - the code is not much bigger, and the error messages are
> much more friendly if query is not bloated by bigger content. The text mode
> is used only - when escaping is not required, then content is implicitly
> transformed to bytea. By default the content of file is bytea. When use
> requires escaping, then he enforces text escaping - because it has sense
> only for text type.
>
> postgres=# \setfileref a ~/test2.xml
> postgres=# \setfileref b ~/avatar.gif
> postgres=# insert into test values(convert_from(:a, 'latin2')::xml, :b);
> -- xml is passed as bytea
> postgres=# insert into test values(:'a', :b); -- xml is passed via unknown
> text value
>
> The content of file reference variables is not persistent in memory.
>
> Comments, notes?
>
> Regards
>
> Pavel
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>
Clearly jumping ahead on this one, but if the fileref is essentially a pipe
to "cat /path/to/file.name", is there anything stopping us from setting
pipes?
My interest is primarily in ways that COPY could use this.


Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Andres Freund
On 2016-08-31 12:56:45 -0300, Alvaro Herrera wrote:
> I was thinking that nextval could grab a shared buffer lock and release
> immediately, just to ensure no one holds exclusive buffer lock
> concurrently (which would be used for things like dropping one seq tuple
> from the page, when a sequence is dropped); then control access to each
> sequence tuple using LockDatabaseObject.  This is a HW lock, heavier
> than a buffer's LWLock, but it seems better than wasting a full 8kb for
> each sequence.

That's going to be a *lot* slower, I don't think that's ok.  I've a
hard time worrying about the space waste here; especially considering
where we're coming from.

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Proposal for changes to recovery.conf API

2016-08-31 Thread Simon Riggs
This is a summary of proposed changes to the recovery.conf API for
v10. These are based in part on earlier discussions, and represent a
minimal modification to current usage.

Proposed changes (with reference to patches from Abhijit Menon-Sen and myself)

* pg_ctl start -M (normal|recover|continue)
pg_ctl can now start the server directly as a standby, similar to the
way pg_ctl promote works. The internal implementation is also similar,
with pg_ctl writing a "recovery.trigger" file that initiates recovery
in the same way recovery.conf used to do. It is still possible to
manually add a file called "recovery.trigger" and have that work if
users desire that mechanism.
Different startup methods can be selected with the -M
option. Normal mode starts the server for read/write,
overriding any previous use in Recover mode.
Recover mode starts the server as a standby server which
expects to receive changes from a primary/master server using physical
streaming replication or is used for
performing a recovery from backup. Continue mode is
the default and will start up the server in whatever mode it was in at
last proper shutdown, or as modified by any trigger files present.
(Patch: recovery_startup_r10_api.v1b.patch)

* $DATADIR/recovery.conf no longer triggers recovery
(Patch: recovery_startup_r10_api.v1b.patch)

* Recovery parameters would now be part of the main postgresql.conf
infrastructure
Any parameters set in $DATADIR/recovery.conf will be read after the
main parameter file, similar to the way that postgresql.conf.auto is
read.
(Abhijit)

* pg_basebackup -R will continue to generate a parameter file called
recovery.conf as it does now, but will also create a file named
recovery.trigger.
(This part is WIP; patch doesn't include implementation for tar format yet)

* Parameters
All of the parameters formerly set in recovery.conf can be set in
postgresql.conf using RELOAD
These parameters will have no defaults in postgresql.conf.sample
Setting them has no effect during normal running, or once recovery ends.
 https://www.postgresql.org/docs/devel/static/archive-recovery-settings.html
 https://www.postgresql.org/docs/devel/static/recovery-target-settings.html
 https://www.postgresql.org/docs/devel/static/standby-settings.html
(Abhijit)

Related cleanup
* Promotion signal file is now called "promote.trigger" rather than
just "promote"
* Remove user configurable "trigger_file" mechanism - use
"promote.trigger" for all cases
* Remove Fallback promote mechanism, so all promotion is now "fast" in xlog.c
* Rename CheckForStandbyTrigger() to CheckForPromoteTrigger() for clarity
(Patch: recovery_startup_r10_api.v1b.patch)

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


recovery_startup_r10_api.v1b.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizing aggregates

2016-08-31 Thread Andres Freund
On 2016-08-31 19:07:00 +0300, Heikki Linnakangas wrote:
> On 08/31/2016 06:51 PM, Andres Freund wrote:
> > I've first combined the projection for all the aggregates, ordered set,
> > or not, into one projection. That got rid of a fair amount of overhead
> > when you have multiple aggregates.  I attached an, probably out of date,
> > WIP version of that patch.
> 
> A-ha, I also considered doing just that! I also considered a variant where
> we call ExecProject once for all non-ordered aggregates, and a separate
> ExecProject() for each ordered one. But just switching back to straight
> ExecEvalExprs seemed easier.

The issue is that this might, I think, end up iteratively deforming the
underlying tuple. The projection machinery takes care of that, if we do
it in one go.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizing aggregates

2016-08-31 Thread Heikki Linnakangas

On 08/31/2016 06:51 PM, Andres Freund wrote:

On 2016-08-31 17:47:18 +0300, Heikki Linnakangas wrote:

We actually used to call ExecEvalExpr() directly for each argument, but that
was changed by the patch that added support for ordered set aggregates. It
looks like that was a bad idea, from a performance point of view.


I complained about that as well
http://archives.postgresql.org/message-id/20160519175727.ymv2y5tye4qgcmqx%40alap3.anarazel.de


Ah, missed that!


I propose that we go back to calling ExecEvalExpr() directly, for
non-ordered aggregates, per the attached patch. That makes that example
query about 10% faster on my laptop, which is in line with the fact that
ExecProject() accounted for about 13% of the CPU time.


My approach is a bit different.

I've first combined the projection for all the aggregates, ordered set,
or not, into one projection. That got rid of a fair amount of overhead
when you have multiple aggregates.  I attached an, probably out of date,
WIP version of that patch.


A-ha, I also considered doing just that! I also considered a variant 
where we call ExecProject once for all non-ordered aggregates, and a 
separate ExecProject() for each ordered one. But just switching back to 
straight ExecEvalExprs seemed easier.



Secondly, I'm working on overhauling expression evaluation to be
faster. Even without the ExecProject overhead, the computations very
quickly become the bottleneck. During that I pretty much merged
ExecProject and ExecEvalExpr into one - they're really not that
different, and the distinction serves no purpose, except to increase the
number of function calls. The reason I'm working on getting rid of
targetlist SRFs is precisely that. A proof of concept of that is
attached to
http://archives.postgresql.org/message-id/20160714011850.bd5zhu35szle3n3c%40alap3.anarazel.de


Cool, yes, all that should help.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Alvaro Herrera
Andres Freund wrote:
> On 2016-08-31 11:23:27 -0400, Tom Lane wrote:
> > Another issue is what is the low-level interlock between nextvals
> > in different processes.  Right now it's the buffer lock on the
> > sequence's page.  With a scheme like this, if we just kept doing
> > that, we'd have a single lock covering probably O(100) different
> > sequences which might lead to contention problems.  We could probably
> > improve on that with some thought.
> 
> I was thinking of forcing the rows to be spread to exactly one page per
> sequence...

I was thinking that nextval could grab a shared buffer lock and release
immediately, just to ensure no one holds exclusive buffer lock
concurrently (which would be used for things like dropping one seq tuple
from the page, when a sequence is dropped); then control access to each
sequence tuple using LockDatabaseObject.  This is a HW lock, heavier
than a buffer's LWLock, but it seems better than wasting a full 8kb for
each sequence.
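
(Spelled out as a fragment, just to make the locking dance explicit; "buf"
and "seqoid" are assumed locals, and "SequenceCatalogId" is hypothetical,
standing for whatever catalog would end up holding the sequence tuples:)

/* fence against a concurrent exclusive page locker, then let go */
LockBuffer(buf, BUFFER_LOCK_SHARE);
LockBuffer(buf, BUFFER_LOCK_UNLOCK);

/* serialize per sequence with a heavyweight object lock instead */
LockDatabaseObject(SequenceCatalogId, seqoid, 0, ExclusiveLock);
/* ... read and advance the sequence tuple in place ... */
UnlockDatabaseObject(SequenceCatalogId, seqoid, 0, ExclusiveLock);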

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizing aggregates

2016-08-31 Thread Andres Freund
Hi,

On 2016-08-31 17:47:18 +0300, Heikki Linnakangas wrote:
> #     ..  .
> 
> #
> 25.70% 0.00%  postmaster  [unknown]  [k] 
> 14.23%  13.75%  postmaster  postgres   [.] ExecProject

> ExecProject stands out. I find that pretty surprising.
> 
> We're using ExecProject to extract the arguments from the input tuples, to
> pass to the aggregate transition functions. It looks like that's a pretty
> expensive way of doing it, for a typical aggregate that takes only one
> argument.
> 
> We actually used to call ExecEvalExpr() directly for each argument, but that
> was changed by the patch that added support for ordered set aggregates. It
> looks like that was a bad idea, from a performance point of view.

I complained about that as well
http://archives.postgresql.org/message-id/20160519175727.ymv2y5tye4qgcmqx%40alap3.anarazel.de


> I propose that we go back to calling ExecEvalExpr() directly, for
> non-ordered aggregates, per the attached patch. That makes that example
> query about 10% faster on my laptop, which is in line with the fact that
> ExecProject() accounted for about 13% of the CPU time.

My approach is a bit different.

I've first combined the projection for all the aggregates, ordered set,
or not, into one projection. That got rid of a fair amount of overhead
when you have multiple aggregates.  I attached an, probably out of date,
WIP version of that patch.

Secondly, I'm working on overhauling expression evaluation to be
faster. Even without the ExecProject overhead, the computations very
quickly become the bottleneck. During that I pretty much merged
ExecProject and ExecEvalExpr into one - they're really not that
different, and the distinction serves no purpose, except to increase the
number of function calls. The reason I'm working on getting rid of
targetlist SRFs is precisely that. A proof of concept of that is
attached to
http://archives.postgresql.org/message-id/20160714011850.bd5zhu35szle3n3c%40alap3.anarazel.de

Greetings,

Andres Freund
>From 384845bea72d28952d88e58e55f81aaa5addd930 Mon Sep 17 00:00:00 2001
From: Andres Freund 
Date: Tue, 12 Jul 2016 01:01:28 -0700
Subject: [PATCH] WIP: Only perform one projection in aggregation.

---
 src/backend/executor/nodeAgg.c | 112 -
 1 file changed, 88 insertions(+), 24 deletions(-)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index f655aec..4499d5f 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -210,6 +210,9 @@ typedef struct AggStatePerTransData
 	 */
 	int			numInputs;
 
+	/* offset of input columns in Aggstate->evalslot */
+	int			inputoff;
+
 	/*
 	 * Number of aggregated input columns to pass to the transfn.  This
 	 * includes the ORDER BY columns for ordered-set aggs, but not for plain
@@ -836,14 +839,20 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
 	int			setno = 0;
 	int			numGroupingSets = Max(aggstate->phase->numsets, 1);
 	int			numTrans = aggstate->numtrans;
+	TupleTableSlot *slot = aggstate->evalslot;
+	AggStatePerTrans pertrans;
 
-	for (transno = 0; transno < numTrans; transno++)
+	/* compute input for all aggregates */
+	if (aggstate->evalproj)
+		ExecProjectIntoSlot(aggstate->evalproj, aggstate->evalslot);
+
+	for (transno = 0, pertrans = aggstate->pertrans; transno < numTrans;
+		 transno++, pertrans++)
 	{
-		AggStatePerTrans pertrans = &aggstate->pertrans[transno];
 		ExprState  *filter = pertrans->aggfilter;
 		int			numTransInputs = pertrans->numTransInputs;
 		int			i;
-		TupleTableSlot *slot;
+		int			inputoff = pertrans->inputoff;
 
 		/* Skip anything FILTERed out */
 		if (filter)
@@ -857,13 +866,10 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
 continue;
 		}
 
-		/* Evaluate the current input expressions for this aggregate */
-		slot = ExecProject(pertrans->evalproj, NULL);
-
 		if (pertrans->numSortCols > 0)
 		{
 			/* DISTINCT and/or ORDER BY case */
-			Assert(slot->tts_nvalid == pertrans->numInputs);
+			Assert(slot->tts_nvalid >= pertrans->numInputs);
 
 			/*
 			 * If the transfn is strict, we want to check for nullity before
@@ -876,7 +882,7 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
 			{
 for (i = 0; i < numTransInputs; i++)
 {
-	if (slot->tts_isnull[i])
+	if (slot->tts_isnull[i + inputoff])
 		break;
 }
 if (i < numTransInputs)
@@ -888,10 +894,22 @@ advance_aggregates(AggState *aggstate, AggStatePerGroup pergroup)
 /* OK, put the tuple into the tuplesort object */
 if (pertrans->numInputs == 1)
 	tuplesort_putdatum(pertrans->sortstates[setno],
-	   slot->tts_values[0],
-	   slot->tts_isnull[0]);
+	   slot->tts_values[inputoff],
+	   slot->tts_isnull[inputoff]);
 else
-	tuplesort_puttupleslot(pertrans->sortstates[setno],

Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Andres Freund
On 2016-08-31 11:23:27 -0400, Tom Lane wrote:
> Another issue is what is the low-level interlock between nextvals
> in different processes.  Right now it's the buffer lock on the
> sequence's page.  With a scheme like this, if we just kept doing
> that, we'd have a single lock covering probably O(100) different
> sequences which might lead to contention problems.  We could probably
> improve on that with some thought.

I was thinking of forcing the rows to be spread to exactly one page per
sequence...

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] proposal: psql \setfileref

2016-08-31 Thread Pavel Stehule
Hi

I propose a new type of psql variable - file references. The content of a
file reference is specified by the referenced file. It allows large data to
be inserted simply, without the necessity of manual escaping or using the LO
API.

When I wrote the patch, I used parametrized queries for these data instead of
escaped strings - the code is not much bigger, and the error messages are
much more friendly if the query is not bloated by bigger content. Text mode
is used only on request - when escaping is not required, the content is
implicitly transformed to bytea. By default the content of the file is bytea.
When the user requires escaping, he enforces text escaping - because it makes
sense only for the text type.

postgres=# \setfileref a ~/test2.xml
postgres=# \setfileref b ~/avatar.gif
postgres=# insert into test values(convert_from(:a, 'latin2')::xml, :b); --
xml is passed as bytea
postgres=# insert into test values(:'a', :b); -- xml is passed via unknown
text value

The content of file reference variables is not persistent in memory.

Comments, notes?

Regards

Pavel
commit 077c71b1f8ae24ccf2f3723e1e4ca5bf05bca0d3
Author: Pavel Stehule 
Date:   Wed Aug 31 17:15:33 2016 +0200

initial

diff --git a/src/bin/psql/command.c b/src/bin/psql/command.c
index 4aaf657..3150510 100644
--- a/src/bin/psql/command.c
+++ b/src/bin/psql/command.c
@@ -1355,6 +1355,32 @@ exec_command(const char *cmd,
 		free(envval);
 	}
 
+	/* \setfileref - set variable by reference on file */
+	else if (strcmp(cmd, "setfileref") == 0)
+	{
+		char	   *name = psql_scan_slash_option(scan_state,
+  OT_NORMAL, NULL, false);
+
+		char	   *ref = psql_scan_slash_option(scan_state,
+		 OT_NORMAL, NULL, false);
+
+		success = false;
+
+		if (!name || !ref)
+		{
+			psql_error("\\%s: missing required argument\n", cmd);
+			success = false;
+		}
+		else
+		{
+			if (!SetFileRef(pset.vars, name, ref))
+			{
+psql_error("\\%s: error while setting variable\n", cmd);
+success = false;
+			}
+		}
+	}
+
 	/* \sf -- show a function's source code */
 	else if (strcmp(cmd, "sf") == 0 || strcmp(cmd, "sf+") == 0)
 	{
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 7399950..3c1db17 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -32,7 +32,6 @@ static bool ExecQueryUsingCursor(const char *query, double *elapsed_msec);
 static bool command_no_begin(const char *query);
 static bool is_select_command(const char *query);
 
-
 /*
  * openQueryOutputFile --- attempt to open a query output file
  *
@@ -108,6 +107,123 @@ setQFout(const char *fname)
 	return true;
 }
 
+void
+psql_reset_query_params(void)
+{
+	int			i;
+
+	for (i = 0; i < pset.nparams; i++)
+		if (pset.params[i] != NULL)
+		{
+			PQfreemem(pset.params[i]);
+			pset.params[i] = NULL;
+		}
+
+	pset.nparams = 0;
+}
+
+/*
+ * Load a content of the file_ref related file to query params buffer.
+ * When escaping is requested, then the text content is expected.
+ * Without escaping the bytea content is expected and related bytea
+ * escaping is processed.
+ */
+#define BYTEAOID		17
+#define UNKNOWNOID		0
+
+static char *
+get_file_ref_content(const char *value, bool escape, bool as_ident)
+{
+	PQExpBufferData		buffer;
+	FILE			   *fd = NULL;
+	char			   *fname;
+	char			   *escaped_value;
+	char		line[1024];
+	size_t		size;
+
+	fname = pstrdup(value);
+
+	expand_tilde(&fname);
+	if (!fname)
+	{
+		psql_error("missing valid path to file\n");
+		return NULL;
+	}
+
+	canonicalize_path(fname);
+
+	fd = fopen(fname, PG_BINARY_R);
+	if (!fd)
+	{
+		psql_error("%s: %s\n", fname, strerror(errno));
+		PQfreemem(fname);
+		return NULL;
+	}
+
+	/* can append another parameter */
+	if (pset.nparams >= MAX_BINARY_PARAMS)
+	{
+		psql_error("too many binary parameters\n");
+		PQfreemem(fname);
+		return NULL;
+	}
+
+	if (!pset.db)
+	{
+		psql_error("cannot escape without active connection\n");
+		PQfreemem(fname);
+		return NULL;
+	}
+
+	initPQExpBuffer(&buffer);
+
+	while ((size = fread(line, 1, sizeof(line), fd)) > 0)
+		appendBinaryPQExpBuffer(&buffer, line, size);
+
+	if (ferror(fd))
+	{
+		psql_error("%s: %s\n", fname, strerror(errno));
+		PQfreemem(fname);
+		termPQExpBuffer(&buffer);
+		return NULL;
+	}
+
+	if (escape)
+	{
+		if (as_ident)
+			escaped_value =
+PQescapeIdentifier(pset.db, buffer.data, buffer.len);
+		else
+			escaped_value =
+PQescapeLiteral(pset.db, buffer.data, buffer.len);
+		pset.paramTypes[pset.nparams] = UNKNOWNOID;
+	}
+	else
+	{
+		escaped_value = (char *)
+PQescapeByteaConn(pset.db,
+	(const unsigned char *) buffer.data, buffer.len, &size);
+		pset.paramTypes[pset.nparams] = BYTEAOID;
+	}
+
+	/* fname, buffer is not necessary longer */
+	PQfreemem(fname);
+	termPQExpBuffer(&buffer);
+
+	if (escaped_value == NULL)
+	{
+		const char *error = PQerrorMessage(pset.db);
+
+		psql_error("%s", error);
+		return NULL;
+	}
+
+	pset.params[pset.nparams] = escaped_value;
+
+	snprintf(line, sizeof(line) - 1, "$%d", ++pset.nparams);
+
+	return pstrdup(lin

Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Tom Lane
Craig Ringer  writes:
> On 31 August 2016 at 22:01, Tom Lane  wrote:
>> Personally, my big beef with the current approach to sequences is that
>> we eat a whole relation (including a whole relfilenode) per sequence.
>> I wish that we could reduce a sequence to just a single row in a
>> catalog, including the nontransactional state.

> It sounds like you're thinking of something like a normal(ish) heap
> tuple where we just overwrite some fields in-place without fiddling
> xmin/xmax and making a new row version. Right? Like we currently
> overwrite the lone Form_pg_sequence  on the 1-page sequence relations.

That would be what to do with the nontransactional state.  If I recall
previous discussions correctly, there's a stumbling block if you want
to treat ALTER SEQUENCE changes as transactional --- but maybe that
doesn't make sense anyway.  If we did want to try that, maybe we need
two auxiliary catalogs, one for the transactionally-updatable sequence
fields and one for the nontransactional fields.

> It feels intuitively pretty gross to effectively dirty-read and write
> a few fields of a tuple. But that's what we do all the time with
> xmin/xmax etc, it's not really that different.

True.  I think two rows would work around that, but maybe we don't
have to.

Another issue is what is the low-level interlock between nextvals
in different processes.  Right now it's the buffer lock on the
sequence's page.  With a scheme like this, if we just kept doing
that, we'd have a single lock covering probably O(100) different
sequences which might lead to contention problems.  We could probably
improve on that with some thought.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Missing checks when malloc returns NULL...

2016-08-31 Thread Tom Lane
Michael Paquier  writes:
> Which means that processes have an escape window when initializing
> shared memory by cleaning up the index if an entry cannot be found and
> then cannot be created properly. I don't think that it is a good idea
> to change that by forcing ShmemAlloc to fail. So I would tend to just
> have the patch attached and add those missing NULL-checks on all the
> existing ShmemAlloc() calls.

> Opinions?

I still think it'd be better to fix that as attached, because it
represents a net reduction not net addition of code, and it provides
a defense against future repetitions of the same omission.  If only
4 out of 11 existing calls were properly checked --- some of them
adjacent to calls with checks --- that should tell us that we *will*
have more instances of the same bug if we don't fix it centrally.

I also note that your patch missed checks for two ShmemAlloc calls in
InitShmemAllocation and ShmemInitStruct.  Admittedly, since those are
the very first such calls, it's highly unlikely they'd fail; but I think
this exercise is not about dismissing failures as improbable.  Almost
all of these failures are improbable, given that we precalculate the
shmem space requirement.

regards, tom lane

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 1efe020..cc3af2d 100644
*** a/src/backend/storage/ipc/shmem.c
--- b/src/backend/storage/ipc/shmem.c
*** InitShmemAllocation(void)
*** 163,177 
  /*
   * ShmemAlloc -- allocate max-aligned chunk from shared memory
   *
!  * Assumes ShmemLock and ShmemSegHdr are initialized.
   *
!  * Returns: real pointer to memory or NULL if we are out
!  *		of space.  Has to return a real pointer in order
!  *		to be compatible with malloc().
   */
  void *
  ShmemAlloc(Size size)
  {
  	Size		newStart;
  	Size		newFree;
  	void	   *newSpace;
--- 163,194 
  /*
   * ShmemAlloc -- allocate max-aligned chunk from shared memory
   *
!  * Throws error if request cannot be satisfied.
   *
!  * Assumes ShmemLock and ShmemSegHdr are initialized.
   */
  void *
  ShmemAlloc(Size size)
  {
+ 	void	   *newSpace;
+ 
+ 	newSpace = ShmemAllocNoError(size);
+ 	if (!newSpace)
+ 		ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+  errmsg("out of shared memory (%zu bytes requested)",
+ 		size)));
+ 	return newSpace;
+ }
+ 
+ /*
+  * ShmemAllocNoError -- allocate max-aligned chunk from shared memory
+  *
+  * As ShmemAlloc, but returns NULL if out of space, rather than erroring.
+  */
+ void *
+ ShmemAllocNoError(Size size)
+ {
  	Size		newStart;
  	Size		newFree;
  	void	   *newSpace;
*** ShmemAlloc(Size size)
*** 206,216 
  
  	SpinLockRelease(ShmemLock);
  
! 	if (!newSpace)
! 		ereport(WARNING,
! (errcode(ERRCODE_OUT_OF_MEMORY),
!  errmsg("out of shared memory")));
! 
  	Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
  
  	return newSpace;
--- 223,229 
  
  	SpinLockRelease(ShmemLock);
  
! 	/* note this assert is okay with newSpace == NULL */
  	Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
  
  	return newSpace;
*** ShmemInitHash(const char *name, /* table
*** 293,299 
  	 * The shared memory allocator must be specified too.
  	 */
  	infoP->dsize = infoP->max_dsize = hash_select_dirsize(max_size);
! 	infoP->alloc = ShmemAlloc;
  	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
  
  	/* look it up in the shmem index */
--- 306,312 
  	 * The shared memory allocator must be specified too.
  	 */
  	infoP->dsize = infoP->max_dsize = hash_select_dirsize(max_size);
! 	infoP->alloc = ShmemAllocNoError;
  	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
  
  	/* look it up in the shmem index */
*** ShmemInitStruct(const char *name, Size s
*** 364,375 
  			 */
  			Assert(shmemseghdr->index == NULL);
  			structPtr = ShmemAlloc(size);
- 			if (structPtr == NULL)
- ereport(ERROR,
- 		(errcode(ERRCODE_OUT_OF_MEMORY),
- 		 errmsg("not enough shared memory for data structure"
- " \"%s\" (%zu bytes requested)",
- name, size)));
  			shmemseghdr->index = structPtr;
  			*foundPtr = FALSE;
  		}
--- 377,382 
*** ShmemInitStruct(const char *name, Size s
*** 410,416 
  	else
  	{
  		/* It isn't in the table yet. allocate and initialize it */
! 		structPtr = ShmemAlloc(size);
  		if (structPtr == NULL)
  		{
  			/* out of memory; remove the failed ShmemIndex entry */
--- 417,423 
  	else
  	{
  		/* It isn't in the table yet. allocate and initialize it */
! 		structPtr = ShmemAllocNoError(size);
  		if (structPtr == NULL)
  		{
  			/* out of memory; remove the failed ShmemIndex entry */
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 7cdb355..4064b20 100644
*** a/src/backend/storage/lmgr/predicate.c
--- b/src/backend/storage/lmgr/predicate.c
*** InitPredicateLocks(void)
*** 1184,1195 
  		

Re: [HACKERS] [GENERAL] C++ port of Postgres

2016-08-31 Thread Peter Eisentraut
[trimmed cc list because of big attachments]

On 8/16/16 4:22 PM, Jim Nasby wrote:
> Joy, do you have an idea what a *minimally invasive* patch for C++ 
> support would look like? That's certainly the first step here.

I developed a minimally invasive patch for C++ support a few years ago
shortly after I wrote that blog post.  Since there appears to have been
some interest here now, I have updated that and split it up into logical
chunks.

So here you go.

To build this, you need to configure with g++ <= version 5.  (4.x works,
too.)  g++ version 6 does not work yet because of the issues described
in patch 0013.

Then you also need to edit src/Makefile.custom and set

COPT = -fpermissive -Wno-sign-compare -Wno-write-strings

The -W options are optional just to reduce some noise.  Cleaning up
those warnings can be a separate project that might also have some
benefit under C.

The -fpermissive option is a g++ specific option that reduces some
errors to warnings.  (So this won't work with clang or other compilers
at all at this point.)  In particular, C++ does not allow casting from
or to void pointers without a cast, but -fpermissive allows that.  The
step from this to "real" C++ would be adding a bunch of casts around
things like malloc and palloc and other places.  That would be mostly
busy work, so I have excluded that here.
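
(As a tiny example of what that busy work looks like, the first line in the
function below is valid C but an error under g++ unless -fpermissive
downgrades it; the second line, with the explicit cast, is what the full
conversion would add at every such call site:)

#include <stdlib.h>

void
example(void)
{
    int    *a = malloc(4 * sizeof(int));           /* OK in C, error in C++ */
    int    *b = (int *) malloc(4 * sizeof(int));   /* accepted by both */

    free(a);
    free(b);
}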

The patches are numbered approximately in increasing order of dubiosity.
 So 0001 is probably a straight bug fix, 0002 and 0003 are arguably
minor bug fixes as well.  The patches through 0012 can probably be
considered for committing in some form.  After that it gets a bit hackish.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
From 86dbd4a5a0ab3212cd340e1fa56f03e864aa8e1a Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 30 Aug 2016 12:00:00 -0400
Subject: [PATCH 01/27] Fix use of offsetof()

Using offsetof() with a run-time computed argument is not allowed in
either C or C++.  Apparently, gcc allows it, but g++ doesn't.
---
 contrib/bloom/blutils.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index debf4f4..b68a0d1 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -75,7 +75,7 @@ _PG_init(void)
 		bl_relopt_tab[i + 1].optname = MemoryContextStrdup(TopMemoryContext,
 		   buf);
 		bl_relopt_tab[i + 1].opttype = RELOPT_TYPE_INT;
-		bl_relopt_tab[i + 1].offset = offsetof(BloomOptions, bitSize[i]);
+		bl_relopt_tab[i + 1].offset = offsetof(BloomOptions, bitSize[0]) + sizeof(int) * i;
 	}
 }
 
-- 
2.9.3

From d3c7c1fbb346fd5c040a7a379d971db7b5129581 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 30 Aug 2016 12:00:00 -0400
Subject: [PATCH 02/27] Use return instead of exit() in configure

Using exit() requires stdlib.h, which is not included.  Use return
instead.  Also add return type for main().
---
 config/c-compiler.m4 |  4 +++-
 config/c-library.m4  |  4 +++-
 configure| 12 +---
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index a7f6773..7d901e1 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -71,8 +71,10 @@ int does_int64_work()
 return 0;
   return 1;
 }
+
+int
 main() {
-  exit(! does_int64_work());
+  return (! does_int64_work());
 }])],
 [Ac_cachevar=yes],
 [Ac_cachevar=no],
diff --git a/config/c-library.m4 b/config/c-library.m4
index 50d068d..56658b5 100644
--- a/config/c-library.m4
+++ b/config/c-library.m4
@@ -204,8 +204,10 @@ int does_int64_snprintf_work()
 return 0;			/* either multiply or snprintf is busted */
   return 1;
 }
+
+int
 main() {
-  exit(! does_int64_snprintf_work());
+  return (! does_int64_snprintf_work());
 }]])],
 [pgac_cv_snprintf_long_long_int_modifier=$pgac_modifier; break],
 [],
diff --git a/configure b/configure
index 45c8eef..36d9a54 100755
--- a/configure
+++ b/configure
@@ -13563,8 +13563,10 @@ int does_int64_work()
 return 0;
   return 1;
 }
+
+int
 main() {
-  exit(! does_int64_work());
+  return (! does_int64_work());
 }
 _ACEOF
 if ac_fn_c_try_run "$LINENO"; then :
@@ -13645,8 +13647,10 @@ int does_int64_work()
 return 0;
   return 1;
 }
+
+int
 main() {
-  exit(! does_int64_work());
+  return (! does_int64_work());
 }
 _ACEOF
 if ac_fn_c_try_run "$LINENO"; then :
@@ -13739,8 +13743,10 @@ int does_int64_snprintf_work()
 return 0;			/* either multiply or snprintf is busted */
   return 1;
 }
+
+int
 main() {
-  exit(! does_int64_snprintf_work());
+  return (! does_int64_snprintf_work());
 }
 _ACEOF
 if ac_fn_c_try_run "$LINENO"; then :
-- 
2.9.3

From 54ed07be5a29d4955a9485316e68da5c6896797b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Tue, 30 Aug 2016 12:00:00 -0400
Subject: [PATCH 03/27] Add missing include files to configure tests

atoi() needs stdlib.h
strcmp() needs string.h
---
 config/c-library.m4 |

[HACKERS] less expensive pg_buffercache on big shmem

2016-08-31 Thread Ivan Kartyshov

Hi hackers,

Recently I have finished my work on a patch for the pg_buffercache contrib
module, and I think it's time to share my results.



Introduction
============

I want to offer an implementation that allows the system workload to be
decreased by partially sacrificing data consistency (full snapshot
consistency). Sometimes we do not need full data consistency, for example
for quantitative rather than qualitative analysis of memory contents, or
when we want to detect insufficient memory resources or see how often a
relation is used.



Problem description
===

Currently, pg_buffercache v1.1 and prior takes an exclusive lock on
all shared buffers, which greatly affects system performance.
Usually we use pg_buffercache to find out why the DB is working slower than
expected or to examine what occupies the memory. So when we run
pg_buffercache on such a system, we make it even slower.



Implementation
==

The vanilla implementation contains a loop which collects statistics from
the whole of shared memory, acquiring, reading and releasing spinlocks one
by one, page by page, while holding an LWLock.


The v1.2 implementation contains a flexible loop which can collect shared
memory statistics using three different methods:
1) holding an LWLock on only one partition of shared memory at a time
(semiconsistent method),
2) without LWLocks (nonconsistent method),
3) or in the vanilla way (consistent method).

The aforementioned allow us to launch pg_buffercache in the three 
different ways.

Each of them has some advantages and some disadvantages:

Consistent:
+ 100% consistency of the shared memory snapshot
- Slows down the system by taking an exclusive lock on all of shared memory

Semiconsistent:
+ Faster than the consistent method
+ Mostly does not affect the system load
- Taking the snapshot is slow
Nonconsistent:
+ The fastest
+ Does not noticeably affect the system
- <3% loss of snapshot consistency

What works
==

Actually, it works well even under heavy load, but of course there might be
things I've overlooked.
VIEW pg_buffercache_cons
VIEW pg_buffercache_noncons
VIEW pg_buffercache_semicons

Examples from the docs, adapted to the new implementation:
SELECT c.relname, count(*) AS buffers FROM pg_buffercache_noncons b 
INNER JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND 
b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = 
current_database())) GROUP BY c.relname ORDER BY 2 DESC LIMIT 10;


SELECT c.relname, count(*) AS buffers FROM pg_buffercache_semicons b 
INNER JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND 
b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname = 
current_database())) GROUP BY c.relname ORDER BY 2 DESC LIMIT 10;
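
Since the new views keep the same columns as the classic pg_buffercache view
(the patch is backward compatible), quick low-impact summaries are also
possible, for example a rough count of dirty buffers taken through the
nonconsistent view:

SELECT isdirty, count(*)
FROM pg_buffercache_noncons
GROUP BY isdirty;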



Testing the implementation
==

How the implementation was tested:
1) Start the server
2) Generate load with pgbench:
pgbench -c 250 -s 1000 -T 200 -P1
3) Compare how much tps sags under load while running:
SELECT count(*) FROM pg_buffercache_cons;
SELECT count(*) FROM pg_buffercache_semicons;
SELECT count(*) FROM pg_buffercache_noncons;

The test was run on a server with the following parameters:
Model name:     Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz
CPU(s):         144
Socket(s):      4
shared_buffers: 200GB


Results of testing
==

Our DBA team obtained the following results:
Nonconsistent:
* 10% faster than the consistent method
* does not noticeably affect the system
* the maximum loss of accuracy was less than 3%* (in most situations this
is a permissible accuracy loss)


Semiconsistent:
* 5 times slower than nonconsistent
* affects the system less than the consistent method

Overall results:
Our clients were pleased with this implementation.
The implementation is backward compatible, so old pg_buffercache v1.1
queries will keep working.
Semiconsistent shows results approaching nonconsistent on SELECT-only
queries.


* these values were obtained from our DBA tests.

What can be better
===

It is unclear how to optimize the semiconsistent method to make it
faster and to reduce the transient effect that appears from time to time.



I will be glad to see your feedback!

---
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index 065d3d6..8813c50 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -4,7 +4,7 @@ MODULE_big = pg_buffercache
 OBJS = pg_buffercache_pages.o $(WIN32RES)
 
 EXTENSION = pg_buffercache
-DATA = pg_buffercache--1.1.sql pg_buffercache--1.0--1.1.sql pg_buffercache--unpackaged--1.0.sql
+DATA = pg_buffercache--1.2.sql pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql pg_buffercache--unpackaged--1.0.sql
 PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
 
 ifdef USE_PGXS
diff --git a/contrib/pg_buffercache/README b/contrib/pg_buffercache/

Re: [HACKERS] make async slave to wait for lsn to be replayed

2016-08-31 Thread Craig Ringer
On 31 August 2016 at 22:16, Ivan Kartyshov  wrote:

> Our clients who deal with 9.5 and use asynchronous master-slave replication,
> asked to make the wait-mechanism on the slave side to prevent the situation
> when slave handles query which needs data (LSN) that was received, flushed,
> but still not replayed.

I like the broad idea - I've wanted something like it for a while. BDR
has pg_xlog_wait_remote_receive() and pg_xlog_wait_remote_apply() for
use in tests for this reason, but they act on the *upstream* side,
waiting until the downstream has acked the data. Not as useful for
ensuring that apps connected to both master and one or more replicas
get a consistent view of data.

How do you get the commit LSN to watch for? Grab
pg_current_xlog_insert_location() just after the commit and figure
that replaying to that point guarantees you get the commit?
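
In other words, something like this (a sketch only; the table name is
hypothetical and the LSN shown is just whatever the master reports):

-- on the master
BEGIN;
INSERT INTO accounts (id, balance) VALUES (42, 100);  -- hypothetical table
COMMIT;
SELECT pg_current_xlog_insert_location();  -- say it returns '0/303EC60'

-- on the standby, before running the dependent read:
WAITLSN '0/303EC60', 10000;                -- with the proposed statement
-- or, lacking that, poll until replay catches up:
SELECT pg_last_xlog_replay_location() >= '0/303EC60'::pg_lsn AS caught_up;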

Some time ago[1] I raised the idea of reporting commit LSN on the wire
to clients. That didn't go anywhere due to compatibility and security
concerns. I think those were resolvable, but it wasn't enough of a
priority to push hard on at the time. A truly "right" solution has to
wait for a protocol bump, but I think good-enough solutions are
possible now. So you might want to read that thread.

It also mentions hesitations about exposing LSN to clients even more.
I think we're *way* past that now - we have replication origins and
replication slots relying on it, it's exposed in a pg_lsn datatype, a
bunch of views expose it, etc. But it might be reasonable to ask
"should the client instead be expected to wait for the confirmed
commit of a 64-bit epoch-extended xid, like that returned by
txid_current()?" . One advantage of using xid is that you can get it
while you're still in the xact, so there's no race between commit and
checking the lsn after commit.

Are you specifically trying to ensure "this commit has replayed on the
replica before we run queries on it" ? Or something else?

(Also, on a side note, Kevin mentioned that it may be possible to use
SSI data to achieve SERIALIZABLE read-only queries on replicas, where
they get the same protection from commit-order related anomalies as
queries on the master. You might want to look more deeply into that
too at some stage, if you're trying to ensure the app can do read only
queries on the master and expect fully consistent results).

[1] 
https://www.postgresql.org/message-id/flat/53E41EC1.5050603%402ndquadrant.com#53e41ec1.5050...@2ndquadrant.com

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autonomous transactions

2016-08-31 Thread Petr Jelinek

On 31/08/16 16:11, Craig Ringer wrote:

On 31 August 2016 at 21:46, Greg Stark  wrote:

On Wed, Aug 31, 2016 at 2:50 AM, Peter Eisentraut
 wrote:

- A API interface to open a "connection" to a background worker, run
queries, get results: AutonomousSessionStart(), AutonomousSessionEnd(),
AutonomousSessionExecute(), etc.  The communication happens using the
client/server protocol.


Peter, you mention "Oracle-style autonomous transaction blocks".

What are the semantics to be expected of those with regards to:

- Accessing objects exclusively locked by the outer xact or where the
requested lockmode conflicts with a lock held by the outer xact

- Visibility of data written by the outer xact



That would be my question as well.



Also, is it intended (outside the plpgsql interface) that the
autonomous xact can proceed concurrently/interleaved with a local
backend xact? i.e. the local backend xact isn't suspended and you're
allowed to do things on the local backend as well? If so, what
handling do you have in mind for deadlocks between the local backend
xact and the bgworker with the autonomous xact? I'd expect the local
backend to always win, killing the autonomous xact every time.



I would expect that in PLs it's handled by them; if you misuse this at the C
level, that's your problem?



I'm surprised by the background worker. I had envisioned autonomous
transactions being implemented by allowing a process to reserve a
second entry in PGPROC with the same pid. Or perhaps save its existing
information in a separate pgproc slot (or stack of slots) and
restoring it after the autonomous transaction commits.


I suspect that there'll be way too much code that relies on stashing
xact-scoped stuff in globals for that to be viable. Caches alone.
Peter will be able to explain more, I'm sure.



I can also see some advantages in the bgworker approach. For example it
could be used for a "fire and forget" type of interface in the future,
where you return as soon as you send the exec and don't care about waiting
for the result.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Optimizing aggregates

2016-08-31 Thread Heikki Linnakangas
I've been profiling simple aggregate queries, looking for any 
low-hanging fruit. For this query:



-- setup
create table floats as select g::float8 as a, g::float8 as b, g::float8 
as c from generate_series(1, 1000) g;

vacuum freeze floats;

-- query
select sum(a), sum(b+c) from floats;


perf report says:

# Children      Self  Command     Shared Object      Symbol
# ........  ........  ..........  .................  ..........................
#
25.70% 0.00%  postmaster  [unknown]  [k] 
14.23%13.75%  postmaster  postgres   [.] ExecProject
11.18%10.57%  postmaster  postgres   [.] slot_deform_tuple
 9.58% 9.04%  postmaster  postgres   [.] advance_aggregates
 8.96% 0.00%  postmaster  [unknown]  [.] 0x000298d4
 8.77% 8.42%  postmaster  postgres   [.] 
ExecMakeFunctionResultNoSets

 7.78% 0.00%  postmaster  [unknown]  [.] 0x01d38260
 6.63% 6.15%  postmaster  postgres   [.] 
advance_transition_function

 6.61% 0.00%  postmaster  [unknown]  [.] 0x01e99e40
 6.47% 0.00%  postmaster  libc-2.23.so   [.] __GI___libc_read
 6.24% 5.88%  postmaster  postgres   [.] heap_getnext
 4.62% 4.62%  postmaster  [kernel.kallsyms]  [k] 
copy_user_enhanced_fast_string

 3.91% 3.82%  postmaster  postgres   [.] slot_getsomeattrs
 3.29% 3.18%  postmaster  postgres   [.] slot_getattr
 3.06% 3.00%  postmaster  postgres   [.] ExecClearTuple
 2.59% 0.00%  postmaster  [unknown]  [.] 0x01e9a370
 2.57% 2.45%  postmaster  postgres   [.] ExecScan
 2.56% 2.37%  postmaster  postgres   [.] float8pl
 2.54% 2.43%  postmaster  postgres   [.] heapgetpage
 2.25% 2.17%  postmaster  postgres   [.] ExecAgg
 2.10% 1.96%  postmaster  postgres   [.] ExecStoreTuple
 2.00% 1.91%  postmaster  postgres   [.] ExecProcNode

ExecProject stands out. I find that pretty surprising.

We're using ExecProject to extract the arguments from the input tuples, 
to pass to the aggregate transition functions. It looks like that's a 
pretty expensive way of doing it, for a typical aggregate that takes 
only one argument.


We actually used to call ExecEvalExpr() directly for each argument, but 
that was changed by the patch that added support for ordered set 
aggregates. It looks like that was a bad idea, from a performance point 
of view.


I propose that we go back to calling ExecEvalExpr() directly, for 
non-ordered aggregates, per the attached patch. That makes that example 
query about 10% faster on my laptop, which is in line with the fact that 
ExecProject() accounted for about 13% of the CPU time.


Another idea is that maybe we should add a fast-path to ExecProject(), 
for these trivial cases.


- Heikki
From 106c5742fde2ec83576323db74a7249d7d85101f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas 
Date: Wed, 31 Aug 2016 17:27:52 +0300
Subject: [PATCH] Skip ExecProject for non-ordered aggregates.

When support for ordered aggregates was added, we started using
ExecProject to compute the arguments for the transition function. However,
it turns out that the old way of just calling ExecEvalExpr() directly for
each argument is somewhat faster. At least for typical aggregates that only
take one or two arguments. So go back to using ExecEvalExpr() for non-ordered
aggregates.

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index ce2fc28..e32b140 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -229,9 +229,8 @@ typedef struct AggStatePerTransData
 	/* Oid of state value's datatype */
 	Oid			aggtranstype;
 
-	/* ExprStates of the FILTER and argument expressions. */
+	/* ExprStates of the FILTER and direct-argument expressions. */
 	ExprState  *aggfilter;		/* state of FILTER expression, if any */
-	List	   *args;			/* states of aggregated-argument expressions */
 	List	   *aggdirectargs;	/* states of direct-argument expressions */
 
 	/*
@@ -288,17 +287,21 @@ typedef struct AggStatePerTransData
 transtypeByVal;
 
 	/*
-	 * Stuff for evaluation of inputs.  We used to just use ExecEvalExpr, but
-	 * with the addition of ORDER BY we now need at least a slot for passing
-	 * data to the sort object, which requires a tupledesc, so we might as
-	 * well go whole hog and use ExecProject too.
+	 * Stuff for evaluation of inputs.
+	 *
+	 * For non-ordered aggregates, we call ExecEvalExpr for each argument,
+	 * represented by the expression trees in transInputs. For ordered
+	 * aggregates, we need at least a slot for passing data to the sort
+	 * object, which requires a tupledesc, so we might as well go whole hog
+	 * and use ExecProject to evaluate the arguments, too.
 	 */
+	ExprState **transInputs;	/* express

Re: [HACKERS] some requests on auditing

2016-08-31 Thread Pavel Stehule
2016-08-31 16:00 GMT+02:00 David Steele :

> On 8/31/16 9:39 AM, David Steele wrote:
>
>> On 8/30/16 10:12 AM, Pavel Stehule wrote:
>>
>
> #3 is not likely without changes to logging in Postgres.  However, there
>> are plenty of tools for log analysis (e.g. ELK) that might help and a
>> Postgres extension that allows log messages to be directed elsewhere
>> (can't remember the name but Gabrielle or Simon would know).
>>
>
> Here's the extension I was thinking of:
>
> https://github.com/2ndquadrant-it/redislog
>
> This one is more general purpose:
>
> https://github.com/mpihlak/pg_logforward
>

Many thanks for this information - I'll check it.

Regards

Pavel


> --
> -David
> da...@pgmasters.net
>


Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Petr Jelinek

On 31/08/16 16:10, Tom Lane wrote:

I wrote:

Personally, my big beef with the current approach to sequences is that
we eat a whole relation (including a whole relfilenode) per sequence.
I wish that we could reduce a sequence to just a single row in a
catalog, including the nontransactional state.  Not sure how feasible
that is either, but accomplishing it would move the benefits of making
a change out of the "debatable whether it's worth it" category, IMO.


BTW, another thing to keep in mind here is the ideas that have been
kicked around in the past about alternative sequence implementations
managed through a "sequence AM API".  I dunno whether now is the time
to start creating that API abstraction, but let's at least consider
it if we're whacking the catalog representation around.



FWIW, if I were going to continue with the sequence AM API, the next patch
would have included splitting sequence metadata and sequence state into
separate catalogs, so from that point of view this actually seems like an
improvement (I didn't look at the code though).


As a side note, I don't plan to resurrect the seqam patch at least until 
we have reasonable built-in logical replication functionality.


--
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Craig Ringer
On 31 August 2016 at 22:01, Tom Lane  wrote:

> Personally, my big beef with the current approach to sequences is that
> we eat a whole relation (including a whole relfilenode) per sequence.
> I wish that we could reduce a sequence to just a single row in a
> catalog, including the nontransactional state.

I'd be happy to see incremental improvement in this space as Peter has
suggested, though I certainly see the value of something like seqam
too.

It sounds like you're thinking of something like a normal(ish) heap
tuple where we just overwrite some fields in-place without fiddling
xmin/xmax and making a new row version. Right? Like we currently
overwrite the lone Form_pg_sequence  on the 1-page sequence relations.

I initially thought that TRUNCATE ... RESTART IDENTITY would be
somewhat of a problem with this. We effectively have a temporary
"timeline" fork in the sequence value where it's provisionally
restarted and we start using values from the restarted sequence within
the xact that restarted it. But actually, it'd fit pretty well.
TRUNCATE ... RESTART IDENTITY would write a new row version with a new
xmin, and set xmax on the old sequence row. nextval(...) within the
truncating xact would update the new row's non-transactional fields
when it allocated new sequence chunks. On commit, everyone starts
using the new row due to normal transactional visibility rules. On
rollback everyone ignores it like they would any other dead tuple from
an aborted xact and uses the old tuple's nontransactional fields. It
Just Works(TM).

nextval(...) takes AccessShareLock on a sequence relation. TRUNCATE
... RESTART IDENTITY takes AccessExclusiveLock. So we can never have
nextval(...) advancing the "old" timeline in other xacts at the same
time as we consume values on the restarted sequence inside the xact
that did the restarting. We still need the new "timeline" though,
because we have to retain the old value for rollback.

It feels intuitively pretty gross to effectively dirty-read and write
a few fields of a tuple. But that's what we do all the time with
xmin/xmax etc, it's not really that different.

It'd certainly make sequence decoding easier too. A LOT easier.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] make async slave to wait for lsn to be replayed

2016-08-31 Thread Ivan Kartyshov

Hi hackers,

A few days ago I finished my work on the WAITLSN utility statement, so
I'd like to share it.



Introduction
============

Our clients who run 9.5 and use asynchronous master-slave replication
asked us to provide a wait mechanism on the slave side, to prevent the
situation where the slave handles a query that needs data (an LSN) that
was received and flushed, but not yet replayed.



Problem description
===

The implementation:
Must handle the wait mechanism using pg_sleep() in order not to load the system
Must avoid race conditions if different backends want to wait for
different LSNs

Must not take a snapshot of the DB, to avoid trouble with sudden minXID changes
Must have an optional timeout parameter in case LSN traffic has stalled.
Must release on postmaster death or interrupts.


Implementation
==

To avoid trouble with snapshots, WAITLSN was implemented as a utility
statement; this allows us to circumvent the snapshot-taking mechanism.

We tried different variants, and the most effective way was to use latches.
To handle interprocess interaction, all latches are stored in shared
memory, and to cope with race conditions each latch is protected by a
spinlock.

The timeout is an optional parameter; it is given in milliseconds.


What works
==

Actually, it works well even with significant timeout or wait period 
values, but of course there might be things I've overlooked.


How to use it
==

WAITLSN 'LSN' [, timeout in ms];

# Wait until LSN 0/303EC60 is replayed, or 10 seconds have passed.
WAITLSN '0/303EC60', 10000;

# Or the same without a timeout.
WAITLSN '0/303EC60';

Note: WAITLSN will release on postmaster death or interrupt events if
they occur before the LSN is reached or the timeout expires.


Testing the implementation
==

The implementation was tested with the testgres and unittest Python modules.

How to test this implementation:
Start the master server
Create a table test, insert tuple 1
Set up asynchronous slave replication (9.5: wal_level = standby; 9.6 or
higher: wal_level = replica)

Slave: START TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT * FROM test;
Master: delete the tuple + vacuum + get the new LSN
Slave: WAITLSN 'newLSN', 6;
WAITLSN finishes with FALSE: "LSN is not reached"
Slave: COMMIT;
WAITLSN 'newLSN', 6;
WAITLSN finishes successfully (without a NOTICE message)

WAITLSN waits for the LSN as expected, and releases on postmaster death,
interrupts or timeout.


Your feedback is welcome!


---
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 77667bd..72c5390 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -172,6 +172,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/ref/waitlsn.sgml b/doc/src/sgml/ref/waitlsn.sgml
new file mode 100644
index 000..6a8bdca
--- /dev/null
+++ b/doc/src/sgml/ref/waitlsn.sgml
@@ -0,0 +1,108 @@
+
+
+
+ 
+  WAITLSN
+ 
+
+ 
+  WAITLSN
+  7
+  SQL - Language Statements
+ 
+
+ 
+  WAITLSN
+  wait until the target LSN has been replayed
+ 
+
+ 
+
+WAITLSN 'LSN' [ , delay ]
+
+ 
+
+ 
+  Description
+
+  
+   The WAITLSN statement waits until the target LSN has
+   been replayed, with an optional delay (in milliseconds; infinite by
+   default) to wait for the LSN to be replayed.
+  
+
+  
+   WAITLSN provides a simple
+   interprocess LSN wait mechanism for backends on a slave
+   in a master-slave replication scheme on a PostgreSQL database.
+  
+ 
+
+ 
+  Parameters
+
+  
+   
+LSN
+
+ 
+  Target log sequence number to wait for.
+ 
+
+   
+   
+delay
+
+ 
+  Time in milliseconds to wait for the LSN to be replayed.
+ 
+
+   
+  
+ 
+
+ 
+  Notes
+
+  
+   The delay time to wait for the LSN to be replayed must be an integer.
+   By default it is infinite. Waiting can be interrupted using Ctrl+C, or
+   by postmaster death.
+  
+ 
+
+ 
+  Examples
+
+  
+   Configure and execute a waitlsn from
+   psql:
+
+
+WAITLSN '0/3F07A6B1', 1;
+NOTICE:  LSN is not reached. Try to make bigger delay.
+WAITLSN
+
+WAITLSN '0/3F07A611';
+WAITLSN
+
+WAITLSN '0/3F0FF791', 50;
+^CCancel request sent
+NOTICE:  LSN is not reached. Try to make bigger delay.
+ERROR:  canceling statement due to user request
+
+
+ 
+
+ 
+  Compatibility
+
+  
+   There is no WAITLSN statement in the SQL
+   standard.
+  
+ 
+
diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
index 8acdff1..3733ad9 100644
--- a/doc/src/sgml/reference.sgml
+++ b/doc/src/sgml/reference.sgml
@@ -200,6 +200,7 @@
&update;
&vacuum;
&values;
+   &waitlsn;
 
  
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f13f9c1..609c83e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -39,6 +39,7 @@
 #inclu

Re: [HACKERS] INSERT .. SET syntax

2016-08-31 Thread Marko Tiikkaja

Hello hello,

Here's a rebased and updated patch for $SUBJECT for the September commit 
fest.
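
To give a feel for the new form, here's a quick sketch using the
distributors example from the INSERT reference page (the two statements
below are equivalent under the patch):

INSERT INTO distributors (did, dname)
    VALUES (10, 'Conrad International');

INSERT INTO distributors
    SET did   = 10,
        dname = 'Conrad International';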



.m
*** a/doc/src/sgml/ref/insert.sgml
--- b/doc/src/sgml/ref/insert.sgml
***
*** 22,33  PostgreSQL documentation
   
  
  [ WITH [ RECURSIVE ] with_query [, ...] ]
! INSERT INTO table_name [ AS alias ] [ ( column_name [, ...] ) ]
! { DEFAULT VALUES | VALUES ( { expression | DEFAULT } [, ...] ) [, ...] | query }
  [ ON CONFLICT [ conflict_target ] conflict_action ]
  [ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]
  
! where conflict_target can be one of:
  
  ( { index_column_name | ( index_expression ) } [ COLLATE collation ] [ opclass ] [, ...] ) [ WHERE index_predicate ]
  ON CONSTRAINT constraint_name
--- 22,42 
   
  
  [ WITH [ RECURSIVE ] with_query [, ...] ]
! INSERT INTO table_name [ AS alias ]
! {
! [ column_list ] VALUES ( { expression | DEFAULT } [, ...] ) [, ...] |
! [ column_list ] query |
! DEFAULT VALUES |
! SET column_name = { expression | DEFAULT } [, ...]
! }
  [ ON CONFLICT [ conflict_target ] conflict_action ]
  [ RETURNING * | output_expression [ [ AS ] output_name ] [, ...] ]
  
! where column_list is:
! 
! ( column_name [, ...] )
! 
! and conflict_target can be one of:
  
  ( { index_column_name | ( index_expression ) } [ COLLATE collation ] [ opclass ] [, ...] ) [ WHERE index_predicate ]
  ON CONSTRAINT constraint_name
***
*** 53,65  INSERT INTO table_name [ AS 
  

!The target column names can be listed in any order.  If no list of
!column names is given at all, the default is all the columns of the
!table in their declared order; or the first N column
!names, if there are only N columns supplied by the
!VALUES clause or query.  The values
!supplied by the VALUES clause or query are
!associated with the explicit or implicit column list left-to-right.

  

--- 62,87 

  

!The target column names in a column_list can be
!listed in any order.  If no column_list is given at
!all (and the SET syntax is not used), the default is all
!the columns of the table in their declared order; or the first
!N column names, if there are only N
!columns supplied by the VALUES clause or
!query.  The values supplied by the VALUES
!clause or query are associated with the explicit or
!implicit column list left-to-right.
!   
! 
!   
! Instead of a column_list and a VALUES
! clause, a SET clause similar to that of an
! UPDATE can be used instead.  The advantage of the
! SET clause is that instead of matching the elements in
! the two lists by ordinal position, the column name and the
! expression to assign to that column are visually next to each other.
! This can make long column assignment lists significantly more
! readable.

  

***
*** 690,702  INSERT INTO distributors (did, dname) VALUES (10, 'Conrad International')

 INSERT conforms to the SQL standard, except that
 the RETURNING clause is a
!PostgreSQL extension, as is the ability
!to use WITH with INSERT, and the ability to
!specify an alternative action with ON CONFLICT.
!Also, the case in
!which a column name list is omitted, but not all the columns are
!filled from the VALUES clause or query,
!is disallowed by the standard.

  

--- 712,724 

 INSERT conforms to the SQL standard, except that
 the RETURNING clause is a
!PostgreSQL extension, as is the
!SET clause when used instead of a VALUES clause, the
!ability to use WITH with INSERT, and the
!ability to specify an alternative action with ON
!CONFLICT.  Also, the case in which a column name list is omitted,
!but not all the columns are filled from the VALUES clause
!or query, is disallowed by the standard.

  

*** a/src/backend/parser/analyze.c
--- b/src/backend/parser/analyze.c
***
*** 467,474  transformInsertStmt(ParseState *pstate, InsertStmt *stmt)
  		stmt->onConflictClause->action == ONCONFLICT_UPDATE);
  
  	/*
! 	 * We have three cases to deal with: DEFAULT VALUES (selectStmt == NULL),
! 	 * VALUES list, or general SELECT input.  We special-case VALUES, both for
  	 * efficiency and so we can handle DEFAULT specifications.
  	 *
  	 * The grammar allows attaching ORDER BY, LIMIT, FOR UPDATE, or WITH to a
--- 467,475 
  		stmt->onConflictClause->action == ONCONFLICT_UPDATE);
  
  	/*
! 	 * We have four cases to deal with: DEFAULT VALUES (selectStmt == NULL and
! 	 * cols == NIL), SET syntax (selectStmt == NULL but cols != NIL), VALUES
! 	 * list, or general SELECT input.  We special-case VALUES, both for
  	 * efficiency and so we can handle DEFAULT specifications.
  	 *
  	 * The grammar allows attaching ORDER BY, LIMIT, FOR UPDATE, or WITH to a
***
*** 523,529  transfo

Re: [HACKERS] autonomous transactions

2016-08-31 Thread Craig Ringer
On 31 August 2016 at 21:46, Greg Stark  wrote:
> On Wed, Aug 31, 2016 at 2:50 AM, Peter Eisentraut
>  wrote:
>> - A API interface to open a "connection" to a background worker, run
>> queries, get results: AutonomousSessionStart(), AutonomousSessionEnd(),
>> AutonomousSessionExecute(), etc.  The communication happens using the
>> client/server protocol.

Peter, you mention "Oracle-style autonomous transaction blocks".

What are the semantics to be expected of those with regards to:

- Accessing objects exclusively locked by the outer xact or where the
requested lockmode conflicts with a lock held by the outer xact

- Visibility of data written by the outer xact

?

Also, is it intended (outside the plpgsql interface) that the
autonomous xact can proceed concurrently/interleaved with a local
backend xact? i.e. the local backend xact isn't suspended and you're
allowed to do things on the local backend as well? If so, what
handling do you have in mind for deadlocks between the local backend
xact and the bgworker with the autonomous xact? I'd expect the local
backend to always win, killing the autonomous xact every time.

> I'm surprised by the background worker. I had envisioned autonomous
> transactions being implemented by allowing a process to reserve a
> second entry in PGPROC with the same pid. Or perhaps save its existing
> information in a separate pgproc slot (or stack of slots) and
> restoring it after the autonomous transaction commits.

I suspect that there'll be way too much code that relies on stashing
xact-scoped stuff in globals for that to be viable. Caches alone.
Peter will be able to explain more, I'm sure.

We'd probably need a new transaction data object that everything
xact-scoped hangs off, so we can pass it everywhere or swap it out of
some global. The mechanical refactoring alone would be pretty scary,
not to mention the complexity of actually identifying all the less
obvious places that need changing.

Consider invalidation callbacks. They're always "fun", and so simple
to get right

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Tom Lane
I wrote:
> Personally, my big beef with the current approach to sequences is that
> we eat a whole relation (including a whole relfilenode) per sequence.
> I wish that we could reduce a sequence to just a single row in a
> catalog, including the nontransactional state.  Not sure how feasible
> that is either, but accomplishing it would move the benefits of making
> a change out of the "debatable whether it's worth it" category, IMO.

BTW, another thing to keep in mind here is the ideas that have been
kicked around in the past about alternative sequence implementations
managed through a "sequence AM API".  I dunno whether now is the time
to start creating that API abstraction, but let's at least consider
it if we're whacking the catalog representation around.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Leftover member in openssl part of Port struct

2016-08-31 Thread Daniel Gustafsson

> On 31 Aug 2016, at 15:12, Tom Lane  wrote:
> 
> Daniel Gustafsson  writes:
>> When SSL renegotiation was removed in 426746b9 the only consumer of the 
>> openssl
>> specific count member in the Port struct was removed, but the member was left
>> together with a few updates to it which are unused.  Attached patch removes 
>> the
>> leftovers which now serves no purpose unless I’m missing something.
> 
> Hm, well, we can't remove those case labels as control would then end
> up in the default case which throws an error.  

Doh, I’ll go stand in the corner.

> But otherwise this seems
> sound.  Without renegotiation the count is not very useful anyway
> since it's likely to overflow (at least if long is 32 bits).

Yeah. Thanks for applying.

cheers ./daniel

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Tom Lane
Craig Ringer  writes:
> On 31 August 2016 at 21:17, Peter Eisentraut
>  wrote:
>> I don't know if this is a net improvement.  Maybe this introduces as
>> many new issues as it removes.  But I figured I'll post this, so that at
>> least we can discuss it.

> This will change behaviour subtly.

Uh, not as subtly as all that, because "select * from sequence" will
now return a different set of columns, which will flat out break a
lot of clients that use that method to get sequence properties.

In previous discussions of this idea, we had speculated about turning
sequences into, effectively, views, so that we could still return all
the same columns --- only now some of them would be coming from a
catalog rather than the non-transactional per-sequence storage.
I'm not sure how feasible that is, but it's worth thinking about.

Personally, my big beef with the current approach to sequences is that
we eat a whole relation (including a whole relfilenode) per sequence.
I wish that we could reduce a sequence to just a single row in a
catalog, including the nontransactional state.  Not sure how feasible
that is either, but accomplishing it would move the benefits of making
a change out of the "debatable whether it's worth it" category, IMO.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] some requests on auditing

2016-08-31 Thread David Steele

On 8/31/16 9:39 AM, David Steele wrote:

On 8/30/16 10:12 AM, Pavel Stehule wrote:



#3 is not likely without changes to logging in Postgres.  However, there
are plenty of tools for log analysis (e.g. ELK) that might help and a
Postgres extension that allows log messages to be directed elsewhere
(can't remember the name but Gabrielle or Simon would know).


Here's the extension I was thinking of:

https://github.com/2ndquadrant-it/redislog

This one is more general purpose:

https://github.com/mpihlak/pg_logforward

--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autonomous transactions

2016-08-31 Thread Greg Stark
On Wed, Aug 31, 2016 at 2:50 AM, Peter Eisentraut
 wrote:
> - A API interface to open a "connection" to a background worker, run
> queries, get results: AutonomousSessionStart(), AutonomousSessionEnd(),
> AutonomousSessionExecute(), etc.  The communication happens using the
> client/server protocol.
>


I'm surprised by the background worker. I had envisioned autonomous
transactions being implemented by allowing a process to reserve a
second entry in PGPROC with the same pid. Or perhaps save its existing
information in a separate pgproc slot (or stack of slots) and
restoring it after the autonomous transaction commits.

Using a background worker means that the autonomous transaction can't
access any state from the process memory. Parameters in plpgsql are a
symptom of this but I suspect there will be others. What happens if a
statement timeout occurs during an autonomous transaction? What
happens if you use a pl language in the autonomous transaction and if
it tries to use non-transactional information such as prepared
statements?


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Use static inline functions for Float <-> Datum conversions

2016-08-31 Thread Tom Lane
Heikki Linnakangas  writes:
> On 08/31/2016 02:38 PM, Tom Lane wrote:
>> I wonder whether there is a compiler-dependent way of avoiding the union
>> trick ... or maybe gcc is already smart enough that it doesn't matter?

> It seems to compile into a single instruction, so it can't get any 
> better from a performance point of view.

Yeah, confirmed here.  On my not-real-new gcc (version 4.4.7, which
ships with RHEL6), these test functions:

Datum
compare_int8(PG_FUNCTION_ARGS)
{
int64   x = PG_GETARG_INT64(0);
int64   y = PG_GETARG_INT64(1);

PG_RETURN_BOOL(x < y);
}

Datum
compare_float8(PG_FUNCTION_ARGS)
{
double  x = PG_GETARG_FLOAT8(0);
double  y = PG_GETARG_FLOAT8(1);

PG_RETURN_BOOL(x < y);
}

compile into this (at -O2):

compare_int8:
.cfi_startproc
movq40(%rdi), %rax
cmpq%rax, 32(%rdi)
setl%al
movzbl  %al, %eax
ret
.cfi_endproc

compare_float8:
.cfi_startproc
movsd   40(%rdi), %xmm0
xorl%eax, %eax
ucomisd 32(%rdi), %xmm0
seta%al
ret
.cfi_endproc

(Not sure why the compiler does the widening of the comparison result
differently, but it doesn't look like it matters.)  Before this patch,
that looked like:

compare_float8:
.cfi_startproc
pushq   %rbx
.cfi_def_cfa_offset 16
.cfi_offset 3, -16
movq%rdi, %rbx
subq$16, %rsp
.cfi_def_cfa_offset 32
movq32(%rdi), %rdi
callDatumGetFloat8
movq40(%rbx), %rdi
movsd   %xmm0, 8(%rsp)
callDatumGetFloat8
xorl%eax, %eax
ucomisd 8(%rsp), %xmm0
seta%al
addq$16, %rsp
.cfi_def_cfa_offset 16
popq%rbx
.cfi_def_cfa_offset 8
ret
.cfi_endproc

Nice.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_sequence catalog

2016-08-31 Thread Craig Ringer
On 31 August 2016 at 21:17, Peter Eisentraut
 wrote:
> While I was hacking around sequence stuff, I felt the urge to look into
> an old peeve: That sequence metadata is not stored in a proper catalog.
> Right now in order to find out some metadata about sequences (min, max,
> increment, etc.), you need to look into the sequence.  That is like
> having to query a table in order to find out about its schema.
>
> There are also known issues with the current storage such as that we
> can't safely update the sequence name stored in the sequence when we
> rename, so we just don't.

... and don't have a comment warning the poor confused reader about
that, either. As I discovered when doing my first pass at sequence
decoding.

> I don't know if this is a net improvement.  Maybe this introduces as
> many new issues as it removes.  But I figured I'll post this, so that at
> least we can discuss it.

This will change behaviour subtly. Probably not in ways we care much
about, but I'd rather that be an informed decision than an implicit
"oops we didn't think about that" one.

Things stored in the Form_pg_sequence are affected immediately,
non-transactionally, by ALTER SEQUENCE. If you:

BEGIN;
ALTER SEQUENCE myseq INCREMENT BY 10;
ROLLBACK;

then the *next call to nextval* after the ALTER will step by 10. Well,
roughly, there's some slush there due to caching etc. Rolling back has
no effect. Yet other effects of ALTER SEQUENCE, notably RENAME, are
transactional and are subject to normal locking, visibility and
rollback rules.
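
A minimal psql illustration of that (assuming a default, uncached sequence
and the behaviour described above):

CREATE SEQUENCE myseq;
SELECT nextval('myseq');    -- returns 1

BEGIN;
ALTER SEQUENCE myseq INCREMENT BY 10;
ROLLBACK;

SELECT nextval('myseq');    -- returns 11: the new increment survived the ROLLBACK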

Even more fun, ALTER SEQUENCE ... RESTART is immediate and
non-transactional, but TRUNCATE ... RESTART IDENTITY *is*
transactional and takes effect only at commit. ALTER SEQUENCE writes a
new Form_pg_sequence with the new value to the existing relfilenode.
TRUNCATE instead updates pg_class for the sequence with a new
relfilenode and writes its changes to the new relfilenode. So even two
operations that seem the same are very different.

If I understand the proposed change correctly, this change will make
previously non-transactional ALTER SEQUENCE operations transactional
and subject to normal rules, since the relevant information is now in
a proper catalog.

Personally I think that's a big improvement. The current situation is
warts upon warts.

Prior proposals to move sequences away from a
one-relation-and-one-filenode-per-sequence model have fallen down in
early planning stages. This seems like a simpler, more sensible step
to take, and it won't block later cleanups of how we store sequences
if we decide to go there.

It'll also make it slightly easier to handle logical decoding of
sequence advances, since it'll be possible to send *just* the new
sequence value. Right now we can't tell if interval, etc, also got
changed, and have to send most of the Form_pg_sequence on  the wire
for every sequence advance, which is a little sucky. This change won't
solve the problem I outlined in the other thread though, that
sequences are transactional sometimes and not other times.

So +1 for the idea from me. It'll just need relnotes warning of the
subtle behaviour change.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] some requests on auditing

2016-08-31 Thread David Steele

On 8/30/16 10:12 AM, Pavel Stehule wrote:


I am working on a pgaudit customization for one of my customers.

There are a few requests:

1. flat format without complex types, without nesting - CSV is ideal.
2. all important attributes should be separated - it is not possible to
search in the original queries: table name, database name, role name, rights.
3. if possible - a separate log file
4. one statement can produce more than one row (flat format is required), but
its success/failure should be logged only once
5. any activity should be logged


You may want to take a look at pgaudit_analyze which I think addresses 
#1, #2, and #4:


https://github.com/pgaudit/pgaudit/tree/master/analyze

#3 is not likely without changes to logging in Postgres.  However, there 
are plenty of tools for log analysis (e.g. ELK) that might help and a 
Postgres extension that allows log messages to be directed elsewhere 
(can't remember the name but Gabrielle or Simon would know).


As for #5, which activities aren't being logged?

--
-David
da...@pgmasters.net


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WAL consistency check facility

2016-08-31 Thread Simon Riggs
On 27 August 2016 at 12:09, Kuntal Ghosh  wrote:

>>> * wal_consistency_mask = 511  /* Enable consistency check mask bit*/
>>
>> What does this mean? (No docs)
>
> I was using this parameter as a masking integer to indicate the
> operations (rmgr list) for which we need this feature to be enabled.
> Since this could be confusing, I've changed it accordingly so that it
> accepts a list of rmgrIDs. (suggested by Michael, Amit and Robert)

Why would we want that?


>>> 1. Add support for other Resource Managers.
>>
>> We probably need to have a discussion as to why you think this should
>> be Rmgr dependent?
>> Code comments would help there.
>>
>> If it does, then you should probably do this by extending RmgrTable
>> with an rm_check, so you can call it like this...
>>
>> RmgrTable[record->xl_rmid].rm_check
>
> +1.
> I'm modifying it accordingly. I'm calling this function after
> RmgrTable[record->xl_rmid].rm_redo.
>
>>> 5. Generalize the page type identification technique.
>>
>> Why not do this first?
>>
>
> At present, I'm using special page size and page ID to identify page
> type. But, I've noticed some cases where the entire page is
> initialized to zero (Ex: hash_xlog_squeeze_page). RmgrID and info bit
> can help us to identify those pages.

I'd prefer a solution that was not dependent upon RmgrID at all.

If there are various special cases that we need to cater for, ISTM
they would be flaws in the existing WAL implementation rather than
anything we would want to perpetuate. I hope we'll spend time fixing
them rather than add loads of weird code to work around the
imperfections.

Underdocumented special case code is going to be unbelievably
difficult to get right in the long term.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] autonomous transactions

2016-08-31 Thread Pavel Stehule
2016-08-31 15:09 GMT+02:00 Joel Jacobson :

> On Wed, Aug 31, 2016 at 6:41 AM, Jaime Casanova
>  wrote:
> >
> > On 30 August 2016 at 23:10, Joel Jacobson  wrote:
> > >
> > > There should be a way to within the session and/or txn permanently
> > > block autonomous transactions.
> > >
> >
> > This will defeat one of the use cases of autonomous transactions:
> auditing
>
> My idea on how to deal with this would be to mark the function to be
> "AUTONOMOUS" similar to how a function is marked to be "PARALLEL
> SAFE",
> and to throw an error if a caller that has blocked autonomous
> transactions tries to call a function that is marked to be autonomous.
>
> That way none of the code that needs to be audited would ever get executed.
>

I like this idea - it allows better (cleaner) snapshot isolation.
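
Just to make the idea concrete, a purely hypothetical sketch (nothing here
is implemented; the AUTONOMOUS marker and the blocking GUC are invented
names):

CREATE FUNCTION audit_insert(msg text) RETURNS void
LANGUAGE plpgsql
AUTONOMOUS    -- hypothetical marker, analogous to PARALLEL SAFE
AS $$
BEGIN
    INSERT INTO audit_log(message) VALUES (msg);   -- audit_log: hypothetical table
END;
$$;

SET block_autonomous_transactions = on;   -- hypothetical GUC
SELECT audit_insert('sensitive action');  -- would then raise an error instead of running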

Regards

Pavel


>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


Re: [HACKERS] [COMMITTERS] pgsql: Fix pg_receivexlog --synchronous

2016-08-31 Thread Simon Riggs
On 29 August 2016 at 12:34, Tom Lane  wrote:
> Simon Riggs  writes:
>> Fix pg_receivexlog --synchronous
>
> The buildfarm says you broke the 9.5 branch.
>
> In general, pushing inessential patches just a few hours before a wrap
> deadline is a dangerous business.  Pushing them without any testing
> is close to irresponsible.

Sorry about that everybody. Thanks to Alvaro for doing that in my absence.

I pushed to 9.5 because of a misunderstanding that the author was
saying to me they had also tested it for 9.5. It was not knowingly
untested, but the responsibility and the mistake were mine in not confirming
that with my own eyes before pushing.

-- 
Simon Riggshttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_sequence catalog

2016-08-31 Thread Peter Eisentraut
While I was hacking around sequence stuff, I felt the urge to look into
an old peeve: That sequence metadata is not stored in a proper catalog.
Right now in order to find out some metadata about sequences (min, max,
increment, etc.), you need to look into the sequence.  That is like
having to query a table in order to find out about its schema.

There are also known issues with the current storage such as that we
can't safely update the sequence name stored in the sequence when we
rename, so we just don't.

This patch introduces a new catalog pg_sequence that stores the sequence
metadata.  The sequences themselves now only store the counter and
supporting information.

I don't know if this is a net improvement.  Maybe this introduces as
many new issues as it removes.  But I figured I'll post this, so that at
least we can discuss it.
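
With the catalog in place, sequence metadata becomes visible through an
ordinary catalog query, for example (column names as used in the patch
below):

SELECT c.relname, s.seqstart, s.seqincrement, s.seqmin, s.seqmax, s.seqcycle
FROM pg_class c
     JOIN pg_sequence s ON s.seqrelid = c.oid
WHERE c.relkind = 'S';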

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/catalog/Makefile b/src/backend/catalog/Makefile
index 1ce7610..cbf0d79 100644
--- a/src/backend/catalog/Makefile
+++ b/src/backend/catalog/Makefile
@@ -41,7 +41,7 @@ POSTGRES_BKI_SRCS = $(addprefix $(top_srcdir)/src/include/catalog/,\
 	pg_foreign_data_wrapper.h pg_foreign_server.h pg_user_mapping.h \
 	pg_foreign_table.h pg_policy.h pg_replication_origin.h \
 	pg_default_acl.h pg_init_privs.h pg_seclabel.h pg_shseclabel.h \
-	pg_collation.h pg_range.h pg_transform.h \
+	pg_collation.h pg_range.h pg_transform.h pg_sequence.h \
 	toasting.h indexing.h \
 )
 
diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index 04d7840..8e1e1ac 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -66,6 +66,7 @@
 #include "commands/proclang.h"
 #include "commands/schemacmds.h"
 #include "commands/seclabel.h"
+#include "commands/sequence.h"
 #include "commands/trigger.h"
 #include "commands/typecmds.h"
 #include "nodes/nodeFuncs.h"
@@ -1155,6 +1156,8 @@ doDeletion(const ObjectAddress *object, int flags)
 	else
 		heap_drop_with_catalog(object->objectId);
 }
+if (relKind == RELKIND_SEQUENCE)
+	DeleteSequenceTuple(object->objectId);
 break;
 			}
 
diff --git a/src/backend/catalog/information_schema.sql b/src/backend/catalog/information_schema.sql
index 00550eb..182d2d0 100644
--- a/src/backend/catalog/information_schema.sql
+++ b/src/backend/catalog/information_schema.sql
@@ -1535,15 +1535,16 @@ CREATE VIEW sequences AS
CAST(64 AS cardinal_number) AS numeric_precision,
CAST(2 AS cardinal_number) AS numeric_precision_radix,
CAST(0 AS cardinal_number) AS numeric_scale,
-   CAST(p.start_value AS character_data) AS start_value,
-   CAST(p.minimum_value AS character_data) AS minimum_value,
-   CAST(p.maximum_value AS character_data) AS maximum_value,
-   CAST(p.increment AS character_data) AS increment,
-   CAST(CASE WHEN p.cycle_option THEN 'YES' ELSE 'NO' END AS yes_or_no) AS cycle_option
-FROM pg_namespace nc, pg_class c, LATERAL pg_sequence_parameters(c.oid) p
+   CAST(s.seqstart AS character_data) AS start_value,
+   CAST(s.seqmin AS character_data) AS minimum_value,
+   CAST(s.seqmax AS character_data) AS maximum_value,
+   CAST(s.seqincrement AS character_data) AS increment,
+   CAST(CASE WHEN s.seqcycle THEN 'YES' ELSE 'NO' END AS yes_or_no) AS cycle_option
+FROM pg_namespace nc, pg_class c, pg_sequence s
 WHERE c.relnamespace = nc.oid
   AND c.relkind = 'S'
   AND (NOT pg_is_other_temp_schema(nc.oid))
+  AND c.oid = s.seqrelid
   AND (pg_has_role(c.relowner, 'USAGE')
OR has_sequence_privilege(c.oid, 'SELECT, UPDATE, USAGE') );
 
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index c98f981..95bd172 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -22,8 +22,10 @@
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/dependency.h"
+#include "catalog/indexing.h"
 #include "catalog/namespace.h"
 #include "catalog/objectaccess.h"
+#include "catalog/pg_sequence.h"
 #include "catalog/pg_type.h"
 #include "commands/defrem.h"
 #include "commands/sequence.h"
@@ -74,7 +76,7 @@ typedef struct SeqTableData
 	int64		cached;			/* last value already cached for nextval */
 	/* if last != cached, we have not used up all the cached values */
 	int64		increment;		/* copy of sequence's increment field */
-	/* note that increment is zero until we first do read_seq_tuple() */
+	/* note that increment is zero until we first do nextval_internal() */
 } SeqTableData;
 
 typedef SeqTableData *SeqTable;
@@ -92,10 +94,11 @@ static int64 nextval_internal(Oid relid);
 static Relation open_share_lock(SeqTable seq);
 static void create_seq_hashtable(void);
 static void init_sequence(Oid relid, SeqTable *p_elm, 

Re: [HACKERS] Exclude schema during pg_restore

2016-08-31 Thread Michael Banck
Hi,

Am Mittwoch, den 31.08.2016, 07:59 -0300 schrieb Fabrízio de Royes
Mello:

> Please add it to the next open commitfest.

I had done so already: https://commitfest.postgresql.org/10/762/


Regards,

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


  1   2   >