Re: [HACKERS] pg_ctl promote wait

2016-02-28 Thread Michael Paquier
On Mon, Feb 29, 2016 at 4:28 PM, Michael Paquier wrote:
> I would suggest using
> $node_standby->poll_query_until('SELECT pg_is_in_recovery()') to
> validate the end of the test.

Meh. SELECT NOT pg_is_in_recovery(). This will wait until the query
returns true.
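
With the PostgresNode routines, that would look roughly like this
(an untested sketch; poll_query_until takes a database name and a query
and retries until the query returns true):

    # Wait for the promoted standby to actually leave recovery.
    $node_standby->poll_query_until('postgres',
        'SELECT NOT pg_is_in_recovery()')
      or die "timed out waiting for promotion";
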
-- 
Michael




Re: [HACKERS] pg_ctl promote wait

2016-02-28 Thread Michael Paquier
On Mon, Feb 29, 2016 at 10:30 AM, Peter Eisentraut wrote:
> On 2/19/16 3:09 PM, Tom Lane wrote:
>> I see no need for an additional mechanism.  Just watch pg_control until
>> you see DB_IN_PRODUCTION state there, then switch over to the same
>> connection probing that "pg_ctl start -w" uses.
>
> Here is a patch set around that idea.
>
> The subsequent discussion mentioned that there might still be a window
> between end of waiting and when read-write queries would be accepted.  I
> don't know how big that window would be in practice and would be
> interested in some testing and feedback.

Here is some input for 0001 (useful taken independently):
+$node_primary->append_conf(
+   "postgresql.conf", qq(
+wal_level = hot_standby
+max_wal_senders = 2
+wal_keep_segments = 20
+hot_standby = on
+)
+   );
That's more or less allows_streaming => 1 in $node_primary->init.

+$node_standby->append_conf(
+   "recovery.conf", qq(
+primary_conninfo='$connstr_primary'
+standby_mode=on
+recovery_target_timeline='latest'
+)
+   );
Here you could just use enable_streaming => 1 when calling init_from_backup.
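
In other words, something like this (a rough, untested sketch using the
option names mentioned above):

    # Let the node helpers generate the streaming configuration rather
    # than writing postgresql.conf and recovery.conf by hand.
    my $node_primary = get_new_node('primary');
    $node_primary->init(allows_streaming => 1);
    $node_primary->start;
    $node_primary->backup('my_backup');

    my $node_standby = get_new_node('standby');
    $node_standby->init_from_backup($node_primary, 'my_backup',
        enable_streaming => 1);
    $node_standby->start;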

+$node_standby->command_like(['psql', '-X', '-A', '-t', '-c', 'SELECT pg_is_in_recovery()'],
$node_standby->psql instead of a direct command? The result is
returned directly with the call to the routine.
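
i.e. something like (sketch):

    # psql() returns the query output directly, so the check reduces to:
    my $result = $node_standby->psql('postgres',
        'SELECT pg_is_in_recovery()');
    is($result, 't', 'standby is in recovery');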

+$node_standby->command_like(['psql', '-X', '-A', '-t', '-c', 'SELECT pg_is_in_recovery()'],
+   qr/^f$/,
+   'promoted standby is not in recovery');
Once again $node_standby->psql?

+sleep 3;  # needs a moment to react
+
+$node_standby->command_like(['psql', '-X', '-A', '-t', '-c', 'SELECT pg_is_in_recovery()'],
+   qr/^f$/,
+   'promoted standby is not in recovery');
sleep() is something we should try to avoid as much as possible in our
tests. On slow platforms (take hamster, for example), promote is surely
going to take longer than that to be processed by the standby node and
put it out of recovery. I would suggest using
$node_standby->poll_query_until('SELECT pg_is_in_recovery()') to
validate the end of the test.
-- 
Michael




Re: [HACKERS] Re: BUG #13685: Archiving while idle every archive_timeout with wal_level hot_standby

2016-02-28 Thread Michael Paquier
On Wed, Feb 24, 2016 at 2:40 PM, Michael Paquier wrote:
> This has the merit to be clear, thanks for the input. Whatever the
> approach taken at the end we have two candidates:
> - Extend XLogInsert() with an extra argument for flags (Andres)
> - Introduce XLogInsertExtended with this extra argument and leave
> XLogInsert() in peace (Robert and I).
> Actually, I lied, there was still something I could do for this
> thread: attached are two patches implementing both approaches as
> respectively a-1 and a-2. Patch b is the set of logs I used for the
> tests to show with a low checkpoint_timeout that checkpoints are
> getting correctly skipped on an idle system.
>
> Let's see what happens next.

FWIW, I just made a couple of 5-min pgbench runs on my laptop (scale
100, 24 clients using 4 threads) with checkpoint_timeout = 30s and a
small value of shared_buffers to minimize the work of each checkpoint,
and I am seeing similar spikes, with variations that amount to noise.
See for example the graph attached. Trying to put more
contention on the extra locks taken before entering the checkpoint
section to scan the progress LSN looks kind of difficult :) But that's
not really a surprise.
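
For reference, the two candidate shapes from upthread, sketched (the
flags argument is illustrative; names and types may differ in the actual
patches):

    /* a-1: extend the existing entry point with a flags argument */
    XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, uint8 flags);

    /* a-2: leave XLogInsert() alone and add an extended variant */
    XLogRecPtr XLogInsertExtended(RmgrId rmid, uint8 info, uint8 flags);
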
-- 
Michael



Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2016-02-28 Thread Dilip Kumar
On Mon, Feb 29, 2016 at 8:26 AM, Amit Kapila wrote:

> Have you tried reverting commits 6150a1b0 and ac1d794, which I think
> affect read-only performance and sometimes create variation in TPS across
> different runs? The second might have less impact, but the first one
> could impact performance. Is it possible for you to get perf data with
> and without the patch and share it with others?
>

I only reverted commit ac1d794 in my test. In my next run I will also
revert 6150a1b0 and test.


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Declarative partitioning

2016-02-28 Thread Amit Langote

Hi Ildar,

On 2016/02/29 7:14, Ildar Musin wrote:
> 16/02/16 07:46, Amit Langote wrote:
>> On 2016/02/16 11:41, Josh berkus wrote:
>>> We're not going to use CE for the new partitioning long-term, are we? This
>>> is just the first version, right?
>> Yes. My approach in previous versions of stuffing major planner changes in
>> with the syntax patch was not quite proper in retrospect. So, I thought
>> I'd propose any major planner (and executor) changes later.
>>
>> Thanks,
>> Amit
>>
> Hello Amit,
> 
> Thank you for your work. I'm currently working on an extension aimed at
> planner optimization for partitioned tables
> (https://github.com/postgrespro/pg_pathman). At the moment I have an
> implementation of binary search for range-partitioned tables with basic
> partitioning keys (date, timestamp, integers, etc.), and I'd like to try
> to combine your syntax and infrastructure with my binary search
> implementation. There will likely be changes in the range syntax and the
> partition cache structure based on the discussion, so I'm looking forward
> to your next patch.

Sure, thanks! I will look at your extension as well.

Thanks,
Amit






Re: [HACKERS] [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-28 Thread Michael Paquier
On Mon, Feb 29, 2016 at 7:40 AM, Michael Paquier wrote:
> The buildfarm does not have infrastructure to test that yet. I need
> to craft a patch for the client-side code and send it to Andrew. Will
> try to do so today.

For those interested, here is where things are going to happen:
https://github.com/PGBuildFarm/client-code/pull/7
(This patch could be refactored further; I am not sure how much Andrew
would like things to be deduplicated, so I went with the simplest
solution.)
-- 
Michael




Re: [HACKERS] exposing pg_controldata and pg_config as functions

2016-02-28 Thread Michael Paquier
On Mon, Feb 29, 2016 at 3:50 AM, Joe Conway wrote:
> If there are no other complaints or comments, I will commit the attached
> sometime this coming week (with the requisite catversion bump).

Thanks for the updated patch, I have nothing else to say on my side.
The new version has fixed the MSVC build, I just double-checked it.
-- 
Michael




Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2016-02-28 Thread Amit Kapila
On Sun, Feb 28, 2016 at 9:05 PM, Dilip Kumar wrote:
>
>
> On Sat, Feb 27, 2016 at 5:14 AM, Andres Freund wrote:
>>
>> > > > >> ./pgbench  -j$ -c$ -T300 -M prepared -S postgres
>> > > > >>
>> > > > >> Client  Base    Patch
>> > > > >> 1       17169   16454
>> > > > >> 8       108547  105559
>> > > > >> 32      241619  262818
>> > > > >> 64      206868  233606
>> > > > >> 128     137084  217013
>> > >
>> > > So, there's a small regression on low client counts. That's worth
>> > > addressing.
>> > >
>> >
>> > Interesting. I'll try to reproduce it.
>>
>> Any progress here?
>
>
> On a multi-socket machine with 8 sockets and 64 cores, I have seen more
> regression compared to my previous run on POWER8 with 2 sockets. For now
> I tested a read-only workload with 5-minute runs; when I get time, I will
> run longer and confirm again.
>

Have you tried reverting commits 6150a1b0 and ac1d794, which I think
affect read-only performance and sometimes create variation in TPS across
different runs? The second might have less impact, but the first one could
impact performance. Is it possible for you to get perf data with and
without the patch and share it with others?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [PATH] Correct negative/zero year in to_date/to_timestamp

2016-02-28 Thread Vitaly Burovoy
On 2/27/16, Robert Haas  wrote:
> On Tue, Feb 23, 2016 at 6:23 AM, Thomas Munro wrote:
>> This seems to be a messy topic.  The usage of "AD" and "BC" implies that
>> TO_DATE is using the anno domini system, which doesn't have a year 0,
>> but in the DATE type perhaps we are using the ISO 8601 model[2], where
>> 1 BC is represented as 0000, leading to the difference of one in all
>> years before 1 AD?
>
> Well, the way to figure that out, I think, is to look at the
> documentation.  I had a look at...
>
> http://www.postgresql.org/docs/9.5/static/functions-formatting.html
>
> ...which says...
>
> YYYY year (4 or more digits)
> IYYY ISO 8601 week-numbering year (4 or more digits)
>
> I don't really understand ISO 8601, but if IYYY is giving us an ISO
> 8601 thing, then presumably YYYY is not supposed to be giving us that.
>   The same page elsewhere refers to Gregorian dates, and other parts
> of the documentation seem to agree that's what we use.
>
> But having said that, this is kind of a weird situation.  We're
> talking about this:
>
> rhaas=# SELECT y || '-06-01', to_date (y || '-06-01', 'YYYY-MM-DD')
> FROM (VALUES (2), (1), (0), (-1), (-2)) t(y);
>  ?column? |to_date
> --+---
>  2-06-01  | 0002-06-01
>  1-06-01  | 0001-06-01
>  0-06-01  | 0001-06-01 BC
>  -1-06-01 | 0002-06-01 BC
>  -2-06-01 | 0003-06-01 BC
> (5 rows)
>
> Now, I would be tempted to argue that passing to_date('-1-06-01',
> 'YYYY-MM-DD') ought to do the same thing as to_date('pickle',
> 'YYYY-MM-DD'), i.e. throw an error.  There's all kinds of what seems to
> me to be shoddy error checking in this area:
>
> rhaas=# select to_date('-3', 'YYYY:MM&DD');
> to_date
> ---
>  0004-01-01 BC
> (1 row)
>
> It's pretty hard for me to swallow the argument that the input matches
> the provided format.
>
> However, I'm not sure we ought to tinker with the behavior in this
> area.  If YYYY-MM-DD is going to accept things that are not of the
> format YYYY-MM-DD, and I'd argue that -1-06-01 is not in that format,

It is not about format, it is about values.

> then I think it should probably keep doing the same things it's always
> done.  If you want to supply a BC date, why not do

Because it is a little inconvenient. If one value ("-2345") is passed,
another one ("2346 BC") comes back. Otherwise the programmer must check
for a negative value, and if so change the sign and add "BC" to the
format. Moreover, the programmer must keep in mind that the usual date
format "DD/MM/YYYY" is not enough, because sometimes there can be a
"BC" part.

> this:
>
> rhaas=# select to_date('0001-06-01 BC', 'YYYY-MM-DD BC');
> to_date
> ---
>  0001-06-01 BC
> (1 row)

Also because of:

postgres=# SELECT EXTRACT(year FROM to_date('-3', 'YYYY'));
 date_part
-----------
        -4
(1 row)

Note that the documentation[1] says "Keep in mind there is no 0 AD".
More examples by the link[2].

> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

[1]http://www.postgresql.org/docs/devel/static/functions-datetime.html
[2]http://www.postgresql.org/message-id/cakoswnn6kpfabmrshzyn8+2nhpyfvbdmphya5pqguny8lp2...@mail.gmail.com
-- 
Best regards,
Vitaly Burovoy


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Logical decoding support for sequence advances

2016-02-28 Thread Craig Ringer
On 17 December 2015 at 10:08, Craig Ringer wrote:

> On 15 December 2015 at 20:17, Andres Freund wrote:
>>
>>
>> I think this is quite the wrong approach. You're calling the logical
>> decoding callback directly from decode.c, circumventing
>> reorderbuffer.c. While you don't need the actual reordering, you *do*
>> need the snapshot integration.
>
>
> Yeah. I thought I could get away without it because I could use
> Form_pg_sequence.sequence_name to bypass any catalog lookups at all, but
> per upthread that field should be ignored and can't be used.
>
>
>> As you since noticed you can't just
>> look into the sequence tuple and be happy with that, you need to do
>> pg_class lookups et al. Furthermore callbacks can validly expect to have
>> a snapshot set.
>>
>
> Good point. Just because I don't need one doesn't mean others won't, and
> all the other callbacks do.
>
> I'll have a read over reorderbuffer.c and see if I can integrate this
> properly.
>
>
>> At the very least what you need to do is to SetupHistoricSnapshot(),
>> then lookup the actual identity of the sequence using
>> RelidByRelfilenode() and only then you can call the callback.
>>
>
> Yeah. That means it's safe for the callback to do a syscache lookup for
> the sequence name by oid.
>

I've revisited this patch in an attempt to get a corrected version ready
for the next CF.

Turns out it's not that simple.


Sequence advances aren't part of a transaction (though they always happen
during one) and proceed whether the xact that happened to trigger them
commits or rolls back. So it doesn't make sense to add them to a reorder
buffer for an xact. We want to send it immediately, not associate it with
some arbitrary xact that just happened to use the last value in the cache
that might not commit for ages.


But then when do we send it? If sent at the moment of decoding the advance
from WAL then it'll always be sent to the client between the commit of one
xact and the begin of another, because it'll be sent while we're reading
xlog and populating reorder buffers. For an advance of an already-created
sequence that's what we want, but CREATE SEQUENCE, ALTER SEQUENCE
etc also log RM_SEQ_ID XLOG_SEQ_LOG operations. The xact that created the
sequence isn't committed yet so sending the advance to the downstream will
lead to attempting to advance a sequence that doesn't yet exist. Similarly
ALTER SEQUENCE emits a new XLOG_SEQ_LOG with the updated Form_pg_sequence.
All fine for physical replication but rather more challenging for logical
replication since sometimes we want the sequence change to be associated
with a particular xact and sometimes not.


So the reorder buffer has to keep track of sequences. It must check to see
if a catalog change is a sequence creation and if so mark the sequence as
pending, keeping a copy of the Form_pg_sequence that's updated to the
latest version as the xact proceeds and writes updates to the sequence. On
commit the sequence advance is replayed at the end of the xact using the
snapshot of the newly committed xact; on rollback it's discarded since the
sequence never became visible to anyone else. We can safely assert that
that sequence will not be updated by any other xact until this one commits.
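
In rough terms, the per-session bookkeeping I have in mind (entirely
hypothetical names; nothing like this exists in reorderbuffer.c yet):

    /* One entry per sequence created by a still-uncommitted xact. */
    typedef struct UncommittedSequence
    {
        Oid         seq_relid;          /* the new sequence's pg_class OID */
        FormData_pg_sequence state;     /* latest state written by the xact */
    } UncommittedSequence;

    /*
     * On commit: replay 'state' at the end of the xact using the newly
     * committed xact's snapshot.  On abort: discard the entry, since the
     * sequence never became visible to anyone else.
     */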


In reorderbuffer.c, there's a test for relation->rd_rel->relkind ==
RELKIND_SEQUENCE that skips changes to sequence relations. We'll want to
replicate sequence catalog updates as DDL via event triggers and deparse so
that's OK, but I think we'll need to make a note of sequence creation here
to mark new sequences as uncommitted.


If we see a sequence change that isn't for an uncommitted newly-created
sequence we make an immediate call through the reorder buffer to send the
sequence update info to the client. That's OK even if it's something like
ALTER SEQUENCE ... INCREMENT 10 RENAME TO othername; . The ALTER ... RENAME
part gets sent with the xact that did it when/if it commits since it's part
of the pg_class tuple for the sequence; the INCREMENT gets sent immediately
since it's part of the Form_pg_sequence. That matches what happens locally,
and it means there's no need to keep track of every sequence, only new ones.


On commit or rollback of an xact that creates a sequence we pop the
sequence oid from the ordered list of uncommitted sequence oids that must
be kept by the decoding session.


If we land up decoding the same WAL again we might send sequence updates
that temporarily move a sequence backwards then advance it again when we
replay the newer updates. That's the same as hot standby and seems fine.
You obviously won't be able to safely get new values from sequences that're
replicated from an upstream on the downstream anyway - and I anticipate
that logical decoding receivers will probably want to use seqam etc later
to make those sequences read-only until a failover event.


Sound reasonable?



* keep track of the last-committed xact's snapshot in the decoding session


* decode relation create/drop for RELKIND

Re: [HACKERS] Proposal: SET ROLE hook

2016-02-28 Thread Joe Conway
On 01/07/2016 09:08 AM, Joe Conway wrote:
> On 01/06/2016 10:36 AM, Tom Lane wrote:
>> I think a design that was actually somewhat robust would require two
>> hooks, one at check_role and one at assign_role, wherein the first one
>> would do any potentially-failing work and package all required info into
>> a blob that could be passed through to the assign hook.

Attached.

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 903b3a6..9f92de3 100644
*** a/src/backend/commands/variable.c
--- b/src/backend/commands/variable.c
***
*** 32,37 
--- 32,41 
  #include "utils/timestamp.h"
  #include "mb/pg_wchar.h"
  
+ /* Hooks for plugins to get control in check_role() and assign_role() */
+ SetRoleCheck_hook_type SetRoleCheck_hook = NULL;
+ SetRoleAssign_hook_type SetRoleAssign_hook = NULL;
+ 
  /*
   * DATESTYLE
   */
*** check_role(char **newval, void **extra,
*** 900,905 
--- 904,912 
  	myextra->is_superuser = is_superuser;
  	*extra = (void *) myextra;
  
+ 	if (SetRoleCheck_hook)
+ 		(*SetRoleCheck_hook) (GetSessionUserId(), roleid, is_superuser);
+ 
  	return true;
  }
  
*** assign_role(const char *newval, void *ex
*** 908,913 
--- 915,927 
  {
  	role_auth_extra *myextra = (role_auth_extra *) extra;
  
+ 	/*
+ 	 * Any defined hooks must be able to execute in a failed
+ 	 * transaction to restore a prior value of the ROLE GUC variable.
+ 	 */
+ 	if (SetRoleAssign_hook)
+ 		(*SetRoleAssign_hook) (myextra->roleid, myextra->is_superuser);
+ 
  	SetCurrentRoleId(myextra->roleid, myextra->is_superuser);
  }
  
diff --git a/src/include/commands/variable.h b/src/include/commands/variable.h
index 8105951..f3870e9 100644
*** a/src/include/commands/variable.h
--- b/src/include/commands/variable.h
***
*** 12,17 
--- 12,22 
  
  #include "utils/guc.h"
  
+ /* Hooks for plugins to get control in check_role() and assign_role() */
+ typedef void (*SetRoleCheck_hook_type) (Oid, Oid, bool);
+ extern PGDLLIMPORT SetRoleCheck_hook_type SetRoleCheck_hook;
+ typedef void (*SetRoleAssign_hook_type) (Oid, bool);
+ extern PGDLLIMPORT SetRoleAssign_hook_type SetRoleAssign_hook;
  
  extern bool check_datestyle(char **newval, void **extra, GucSource source);
  extern void assign_datestyle(const char *newval, void *extra);




Re: [HACKERS] pg_ctl promote wait

2016-02-28 Thread Peter Eisentraut
On 2/19/16 3:09 PM, Tom Lane wrote:
> I see no need for an additional mechanism.  Just watch pg_control until
> you see DB_IN_PRODUCTION state there, then switch over to the same
> connection probing that "pg_ctl start -w" uses.

Here is a patch set around that idea.

The subsequent discussion mentioned that there might still be a window
between end of waiting and when read-write queries would be accepted.  I
don't know how big that window would be in practice and would be
interested in some testing and feedback.
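
(The core of the waiting logic, reduced to a sketch; the one-second poll
interval is illustrative and error/timeout handling is elided:)

    /*
     * Poll pg_control until the server reports that it has finished
     * recovery, then fall through to the same connection probing that
     * "pg_ctl start -w" uses.
     */
    while (get_control_dbstate() != DB_IN_PRODUCTION)
        pg_usleep(1000000);     /* 1 second between checks */
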
From cb5d4a63620636d4043d1a85acf7fcfdace73b1d Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 28 Feb 2016 20:21:54 -0500
Subject: [PATCH 1/3] pg_ctl: Add tests for promote action

---
 src/bin/pg_ctl/t/003_promote.pl | 62 +
 src/test/perl/TestLib.pm| 11 
 2 files changed, 73 insertions(+)
 create mode 100644 src/bin/pg_ctl/t/003_promote.pl

diff --git a/src/bin/pg_ctl/t/003_promote.pl b/src/bin/pg_ctl/t/003_promote.pl
new file mode 100644
index 000..64d72e0
--- /dev/null
+++ b/src/bin/pg_ctl/t/003_promote.pl
@@ -0,0 +1,62 @@
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 13;
+
+my $tempdir   = TestLib::tempdir;
+#my $tempdir_short = TestLib::tempdir_short;
+
+command_fails_like([ 'pg_ctl', 'promote', '-D', "$tempdir/nonexistent" ],
+   qr/directory .* does not exist/,
+   'pg_ctl promote with nonexistent directory');
+
+my $node_primary = get_new_node('primary');
+$node_primary->init;
+$node_primary->append_conf(
+	"postgresql.conf", qq(
+wal_level = hot_standby
+max_wal_senders = 2
+wal_keep_segments = 20
+hot_standby = on
+)
+	);
+
+command_fails_like([ 'pg_ctl', 'promote', '-D', $node_primary->data_dir ],
+   qr/PID file .* does not exist/,
+   'pg_ctl promote of not running instance fails');
+
+$node_primary->start;
+
+command_fails_like([ 'pg_ctl', 'promote', '-D', $node_primary->data_dir ],
+   qr/not in standby mode/,
+   'pg_ctl promote of primary instance fails');
+
+my $node_standby = get_new_node('standby');
+$node_primary->backup('my_backup');
+$node_standby->init_from_backup($node_primary, 'my_backup');
+my $connstr_primary = $node_primary->connstr('postgres');
+
+$node_standby->append_conf(
+	"recovery.conf", qq(
+primary_conninfo='$connstr_primary'
+standby_mode=on
+recovery_target_timeline='latest'
+)
+	);
+
+$node_standby->start;
+
+$node_standby->command_like(['psql', '-X', '-A', '-t', '-c', 'SELECT pg_is_in_recovery()'],
+			qr/^t$/,
+			'standby is in recovery');
+
+command_ok([ 'pg_ctl', 'promote', '-D', $node_standby->data_dir ],
+		   'pg_ctl promote of standby runs');
+
+sleep 3;  # needs a moment to react
+
+$node_standby->command_like(['psql', '-X', '-A', '-t', '-c', 'SELECT pg_is_in_recovery()'],
+			qr/^f$/,
+			'promoted standby is not in recovery');
diff --git a/src/test/perl/TestLib.pm b/src/test/perl/TestLib.pm
index 3d11cbb..dd275cf 100644
--- a/src/test/perl/TestLib.pm
+++ b/src/test/perl/TestLib.pm
@@ -34,6 +34,7 @@ our @EXPORT = qw(
   program_version_ok
   program_options_handling_ok
   command_like
+  command_fails_like
 
   $windows_os
 );
@@ -262,4 +263,14 @@ sub command_like
 	like($stdout, $expected_stdout, "$test_name: matches");
 }
 
+sub command_fails_like
+{
+	my ($cmd, $expected_stderr, $test_name) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>', \$stdout, '2>', \$stderr;
+	ok(!$result, "@$cmd exit code not 0");
+	like($stderr, $expected_stderr, "$test_name: matches");
+}
+
 1;
-- 
2.7.2

From 7c1b6e94f5e3beb9e558d2af7098940d4475fe11 Mon Sep 17 00:00:00 2001
From: Peter Eisentraut 
Date: Sun, 28 Feb 2016 20:21:54 -0500
Subject: [PATCH 2/3] pg_ctl: Detect current standby state from pg_control

pg_ctl used to determine whether a server was in standby mode by looking
for a recovery.conf file.  With this change, it instead looks into
pg_control, which is potentially more accurate.  There are also
occasional discussions about removing recovery.conf, so this removes one
dependency.
---
 src/bin/pg_ctl/pg_ctl.c | 60 ++---
 1 file changed, 57 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index bae6c22..c38c479 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -19,6 +19,7 @@
 
 #include "postgres_fe.h"
 
+#include "catalog/pg_control.h"
 #include "libpq-fe.h"
 #include "pqexpbuffer.h"
 
@@ -158,6 +159,8 @@ static bool postmaster_is_alive(pid_t pid);
 static void unlimit_core_size(void);
 #endif
 
+static DBState get_control_dbstate(void);
+
 
 #ifdef WIN32
 static void
@@ -1168,7 +1171,6 @@ do_promote(void)
 {
 	FILE	   *prmfile;
 	pgpid_t		pid;
-	struct stat statbuf;
 
 	pid = get_pgpid(false);
 
@@ -1187,8 +1189,7 @@ do_promote(void)
 		exit(1);
 	}
 
-	/* If recovery.conf doesn't exist, the server is not in standby mode 

[HACKERS] Improve error handling in pltcl

2016-02-28 Thread Jim Nasby

Per discussion in [1], this patch improves error reporting in pltcl.

pltcl_error_objects.patch applies on top of the pltcl_objects_2.patch 
referenced in [2].


pltcl_error_master.patch applies against current master.

[1] 
http://www.postgresql.org/message-id/20160223150401.2173d...@wagner.wagner.home

[2] http://www.postgresql.org/message-id/56cce7d2.9090...@bluetreble.com
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
diff --git a/doc/src/sgml/pltcl.sgml b/doc/src/sgml/pltcl.sgml
index d2175d5..d5c576d 100644
--- a/doc/src/sgml/pltcl.sgml
+++ b/doc/src/sgml/pltcl.sgml
@@ -775,6 +775,127 @@ CREATE EVENT TRIGGER tcl_a_snitch ON ddl_command_start 
EXECUTE PROCEDURE tclsnit
 

 
+   
+Error Handling in PL/Tcl
+
+
+ error handling
+ in PL/Tcl
+
+
+
+ All Tcl errors that are allowed to propagate back to the top level of the
+ interpreter, that is, errors not caught within the stored procedure
+ using the Tcl catch command, will raise a database
+ error.
+
+
+ Tcl code within or called from the stored procedure can choose to
+ raise a database error by invoking the elog
+ command provided by PL/Tcl or by generating an error using the Tcl
+ error command and not catching it with Tcl's
+ catch command.
+
+
+ Database errors that occur from the PL/Tcl stored procedure's
+ use of spi_exec, spi_prepare,
+ and spi_execp are also catchable by Tcl's
+ catch command.
+
+
+ Tcl provides an errorCode variable that can
+ represent additional information about the error in a form that
+ is easy for programs to interpret.  The contents are in Tcl list
+ format and the first word identifies the subsystem or
+ library responsible for the error and beyond that the contents are left
+ to the individual code or library.  For example if Tcl's
+ open command is asked to open a file that doesn't
+ exist, errorCode
+ might contain POSIX ENOENT {no such file or directory}
+ where the third element may vary by locale but the first and second
+ will not.
+
+
+ When spi_exec, spi_prepare
+ or spi_execp cause a database error to be raised,
+ that database error propagates back to Tcl as a Tcl error.
+ In this case errorCode is set to a list
+ where the first element is POSTGRES followed by a
+ copious decoding of the Postgres error structure.  Since fields in the
+ structure may or may not be present depending on the nature of the
+ error, how the function was invoked, etc, PL/Tcl has adopted the 
+ convention that subsequent elements of the errorCode
+ list are key-value pairs where the first value is the name of the
+ field and the second is its value.
+
+
+ Fields that may be present include message,
+ detail, detail_log,
+ hint, domain,
+ context_domain, context,
+ schema, table,
+ column, datatype,
+ constraint, cursor_position,
+ internalquery, internal_position,
+ filename, lineno and
+ funcname.
+
+
+ You might find it useful to load the results into an array. Code
+ for doing that might look like
+
+if {[lindex $errorCode 0] == "POSTGRES"} {
+array set errorRow [lrange $errorCode 1 end]
+}
+
+
+
+ In the example below we cause an error by attempting to
+ SELECT from a table that doesn't exist.
+
+select tcl_eval('spi_exec "select * from foo;"');
+
+
+
+ERROR:  relation "foo" does not exist
+
+
+
+
+ Now we examine the error code.  (The double-colons explicitly
+ reference errorCode as a global variable.)
+
+select tcl_eval('join $::errorCode "\n"');
+
+
+
+   tcl_eval
+---
+ POSTGRES +
+ message  +
+ relation "foo" does not exist+
+ domain   +
+ postgres-9.6 +
+ context_domain   +
+ postgres-9.6 +
+ cursorpos+
+ 0+
+ internalquery+
+ select * from foo;   +
+ internalpos  +
+ 15   +
+ filename +
+ parse_relation.c +
+ lineno   +
+ 1159 +
+ funcname +
+ parserOpenTable
+(1 row)
+
+
+
+   
+

Modules and the unknown Command

diff --git a/src/pl/tcl/expected/pltcl_setup.out 
b/src/pl/tcl/expected/pltcl_setup.out
index 4183c14..0a9f9f4 100644
--- a/src/pl/tcl/expected/pltcl_setup.out
+++ b/src/pl/tcl/expected/pltcl_setup.out
@@ -542,3 +542,44 @@ NOTICE:  tclsnitch: ddl_command_start DROP TABLE
 NOTICE:  tclsnitch: ddl_command_end DROP TABLE
 drop event trigger tcl_a_snitch;
 drop event trigger tcl_b_snitch;
+-- test erro

Re: [HACKERS] [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-28 Thread Michael Paquier
On Sun, Feb 28, 2016 at 11:22 PM, Craig Ringer wrote:
> On 27 February 2016 at 06:37, Michael Paquier wrote:
>>
>> On Sat, Feb 27, 2016 at 4:30 AM, Alvaro Herrera wrote:
>> > Craig Ringer wrote:
>> >> Should be committed ASAP IMO.
>> >
>> > Finally pushed it.  Let's see how it does in the buildfarm.  Now let's
>> > get going and add more tests, I know there's no shortage of people with
>> > test scripts waiting for this.
>> >
>> > Thanks, Michael, for the persistency, and thanks to all reviewers.  (I'm
>> > sorry we seem to have lost Amir Rohan in the process.  He was doing
>> > a great job.)
>>
>> Date of first message of this thread: Mon, 2 Dec 2013 15:40:41 +0900
>> Date of this message: Fri, 26 Feb 2016 16:30:08 -0300
>> This has been a long trip. Thanks a lot to all involved. Many people
>> have reviewed and helped out with this patch.
>
>
> Congratulations and thanks.
>
> I don't see any new buildfarm failures. The BinInstallCheck failure on
> Windows predates this and the isolationtest failure on OpenBSD is unrelated.

The buildfarm does not have infrastructure to test that yet. I need
to craft a patch for the client-side code and send it to Andrew. Will
try to do so today.
-- 
Michael




Re: [HACKERS] Declarative partitioning

2016-02-28 Thread Ildar Musin



On 16/02/16 07:46, Amit Langote wrote:

Hi Josh,

On 2016/02/16 11:41, Josh berkus wrote:

On 02/15/2016 04:28 PM, Amit Langote wrote:

Also, you won't see any optimizer and executor changes. Queries will still
use the same plans as existing inheritance-based partitioned tables,
although as I mentioned, constraint exclusion won't yet kick in. That will
be fixed very shortly.

We're not going to use CE for the new partitioning long-term, are we? This
is just the first version, right?

Yes. My approach in previous versions of stuffing major planner changes in
with the syntax patch was not quite proper in retrospect. So, I thought
I'd propose any major planner (and executor) changes later.

Thanks,
Amit


Hello Amit,

Thank you for your work. I'm currently working on an extension aimed at
planner optimization for partitioned tables
(https://github.com/postgrespro/pg_pathman). At the moment I have an
implementation of binary search for range-partitioned tables with basic
partitioning keys (date, timestamp, integers, etc.), and I'd like to try
to combine your syntax and infrastructure with my binary search
implementation. There will likely be changes in the range syntax and the
partition cache structure based on the discussion, so I'm looking forward
to your next patch.
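
(For the archives, the core of that search in miniature; a hand-written
sketch with simplified types, not code taken from the extension:)

    /*
     * bounds[] holds nbounds ascending boundary values; partition i covers
     * the half-open range [bounds[i], bounds[i+1]).  Returns the partition
     * index containing 'key', or -1 if the key is out of range.
     */
    static int
    find_range_partition(const int64 *bounds, int nbounds, int64 key)
    {
        int lo = 0, hi = nbounds;

        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;

            if (key < bounds[mid])
                hi = mid;
            else
                lo = mid + 1;
        }
        /* lo is now the index of the first bound greater than key */
        if (lo == 0 || lo == nbounds)
            return -1;          /* below the first or above the last bound */
        return lo - 1;
    }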


Ildar




Re: [HACKERS] WIP: Upper planner pathification

2016-02-28 Thread Andres Freund
Hi,

On 2016-02-28 15:03:28 -0500, Tom Lane wrote:
> Those with long memories will recall that I've been waving my arms about
> $SUBJECT for more than five years.  I started to work seriously on a patch
> last summer, and here is a version that I feel comfortable exposing to
> public scrutiny (which is not to call it "done"; more below).

Yay!


> So, where to go from here?  I'm acutely aware that we're hard up against
> the final 9.6 commitfest, and that we discourage major patches arriving
> so late in a devel cycle.  But I simply couldn't get this done any faster.
> I don't really want to hold it over for the 9.7 devel cycle.  It's been
> enough trouble maintaining this patch in the face of conflicting commits
> over the last year or so (it's probably still got bugs related to parallel
> query...), and there definitely are conflicting patches in the upcoming
> 'fest.  And the lack of this infrastructure is blocking progress on FDWs
> and some other things.
> 
> So I'd really like to get this into 9.6.  I'm happy to put it into the
> March commitfest if someone will volunteer to review it.

Hard. This is likely to cause/trigger a number of bugs, and we don't
have much time to let this mature. It's a change that we're unlikely to
be able to back out if we discover that it wasn't the right thing to
integrate shortly before the release.  On the other hand, this is a
major architectural step forward; one that unblocks a number of nice
features. There's also an argument to be made that integrating this now
is beneficial, because it'll cause less churn for patches being
developed while 9.6 is stabilizing.


Greetings,

Andres Freund




Re: [HACKERS] Proposal: Generic WAL logical messages

2016-02-28 Thread Andres Freund
Hi,

On 2016-02-28 22:44:12 +0100, Petr Jelinek wrote:
> On 27/02/16 01:05, Andres Freund wrote:
> >I'm not really convinced by RegisterStandbyMsgPrefix() et al. There's
> >not much documentation about what it actually is supposed to
>accomplish. Afaics you're basically forced to use
> >shared_preload_libraries with it right now?  Also, iterating through a
>linked list every time something is logged doesn't seem very satisfying?
> >
> 
> Well, my reasoning there was to stop multiple plugins from using the same prefix
> and for that you need some kind of registry. Making this a shared catalog
> seemed like huge overkill given the potentially transient nature of output
> plugins and this was the best I could come up with. And yes it requires you
> to load your plugin before you can log a message for it.

I think right now that's a solution that's worse than the problem.  I'm
inclined to define the problem away with something like "The prefix
should be unique across different users of the messaging facility. Using
the extension name often is a good choice.".
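
With the function from the patch, that convention is simply (sketch):

    -- Use the extension's own name as the prefix to avoid clashes.
    SELECT pg_logical_send_message(true, 'my_extension', 'some payload');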


> >>+void
> >>+logicalmsg_desc(StringInfo buf, XLogReaderState *record)
> >>+{
> >>+   char   *rec = XLogRecGetData(record);
> >>+   xl_logical_message *xlrec = (xl_logical_message *) rec;
> >>+
> >>+   appendStringInfo(buf, "%s message size %zu bytes",
> >>+xlrec->transactional ? "transactional" 
> >>: "nontransactional",
> >>+xlrec->message_size);
> >>+}
> >
>Shouldn't we check
>   uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
>   if (info == XLOG_LOGICAL_MESSAGE)
>here?
> >
> 
> I thought it's kinda pointless, but we seem to be doing it in other places
> so will add.

It leads to a segfault or something similar when adding further message
types, without a compiler warning. So it seems to be a good idea to be
slightly careful.


> >
> >>+void
> >>+ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid, XLogRecPtr 
> >>lsn,
> >>+ bool transactional, const 
> >>char *prefix, Size msg_sz,
> >>+ const char *msg)
> >>+{
> >>+   ReorderBufferTXN *txn = NULL;
> >>+
> >>+   if (transactional)
> >>+   {
> >>+   ReorderBufferChange *change;
> >>+
> >>+   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
> >>+
> >>+   Assert(xid != InvalidTransactionId);
> >>+   Assert(txn != NULL);
> >>+
> >>+   change = ReorderBufferGetChange(rb);
> >>+   change->action = REORDER_BUFFER_CHANGE_MESSAGE;
> >>+   change->data.msg.transactional = true;
> >>+   change->data.msg.prefix = pstrdup(prefix);
> >>+   change->data.msg.message_size = msg_sz;
> >>+   change->data.msg.message = palloc(msg_sz);
> >>+   memcpy(change->data.msg.message, msg, msg_sz);
> >>+
> >>+   ReorderBufferQueueChange(rb, xid, lsn, change);
> >>+   }
> >>+   else
> >>+   {
> >>+   rb->message(rb, txn, lsn, transactional, prefix, msg_sz, msg);
> >>+   }
> >>+}
> >
> >
> >This approach prohibts catalog access when processing a nontransaction
> >message as there's no snapshot set up.
> >
> 
> Hmm, I do see usefulness in having a snapshot, although I wonder if that does
> not kill the point of non-transactional messages.

I don't see how it would? It'd obviously have to be the catalog/historic
snapshot a transaction would have had if it started in that moment in
the original WAL stream?


> Question is then though which snapshot should the message see,
> base_snapshot of transaction?

Well, there'll not be a transaction, but something like snapbuild.c's
->snapshot ought to do the trick.


> That would mean we'd have to call SnapBuildProcessChange for
> non-transactional messages which we currently avoid. Alternatively we
> could probably invent lighter version of that interface that would
> just make sure builder->snapshot is valid and if not then build it

I think the latter is probably the direction we should go in.


> I am honestly not sure if that's a win or not.

I think it'll be confusing (bug inducing) if there's no snapshot for
non-transactional messages but there is for transactional ones, and it'll
severely limit the usefulness of the interface.


Andres




Re: [HACKERS] Proposal: Generic WAL logical messages

2016-02-28 Thread Petr Jelinek

Hi,

thanks for looking Andres,


On 27/02/16 01:05, Andres Freund wrote:

Hi,

I'm not really convinced by RegisterStandbyMsgPrefix() et al. There's
not much documentation about what it actually is supposed to
accomplish. Afaics you're basically forced to use
shared_preload_libraries with it right now?  Also, iterating through a
linked list every time something is logged doesn't seem very satisfying?



Well, my reasoning there was to stop multiple plugins from using the same
prefix and for that you need some kind of registry. Making this a shared 
catalog seemed like huge overkill given the potentially transient nature 
of output plugins and this was the best I could come up with. And yes it 
requires you to load your plugin before you can log a message for it.


I am not married to this solution so if you have better ideas or if you 
think the clashes are not read issue, we can change it.




On 2016-02-24 18:35:16 +0100, Petr Jelinek wrote:

+SET synchronous_commit = on;
+SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot', 
'test_decoding');
+ ?column?
+--
+ init
+(1 row)



+SELECT 'msg1' FROM pg_logical_send_message(true, 'test', 'msg1');
+ ?column?
+--
+ msg1
+(1 row)


Hm. Somehow 'sending' a message seems wrong here. Maybe 'emit'?


+  
+   
+
+ pg_logical_send_message
+
+pg_logical_send_message(transactional bool, 
prefix text, content 
text)
+   
+   
+void
+   
+   
+Write text logical decoding message. This can be used to pass generic
+messages to logical decoding plugins through WAL. The parameter
+transactional specifies if the message should
+be part of current transaction or if it should be written and decoded
+immediately. The prefix has to be a prefix which
+was registered by a plugin. The content is
+content of the message.
+   
+  


It's not decoded immediately, even if emitted non-transactionally.



Okay, immediately is somewhat misleading. How does "should be written 
immediately and decoded as soon as logical decoding reads the WAL 
record" sound ?



+
+ Generic Message Callback
+
+ 
+  The optional message_cb callback is called whenever
+  a logical decoding message has been decoded.
+
+typedef void (*LogicalDecodeMessageCB) (
+struct LogicalDecodingContext *,
+ReorderBufferTXN *txn,
+XLogRecPtr message_lsn,
+bool transactional,
+const char *prefix,
+Size message_size,
+const char *message
+);


We should at least document what txn is set to if not transactional.



Will do (it's NULL).


+void
+logicalmsg_desc(StringInfo buf, XLogReaderState *record)
+{
+   char   *rec = XLogRecGetData(record);
+   xl_logical_message *xlrec = (xl_logical_message *) rec;
+
+   appendStringInfo(buf, "%s message size %zu bytes",
+xlrec->transactional ? "transactional" : 
"nontransactional",
+xlrec->message_size);
+}


Shouldn't we check
   uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
   if (info == XLOG_LOGICAL_MESSAGE)
here?



I thought it's kinda pointless, but we seem to be doing it in other 
places so will add.





+void
+ReorderBufferQueueMessage(ReorderBuffer *rb, TransactionId xid, XLogRecPtr lsn,
+ bool transactional, const 
char *prefix, Size msg_sz,
+ const char *msg)
+{
+   ReorderBufferTXN *txn = NULL;
+
+   if (transactional)
+   {
+   ReorderBufferChange *change;
+
+   txn = ReorderBufferTXNByXid(rb, xid, true, NULL, lsn, true);
+
+   Assert(xid != InvalidTransactionId);
+   Assert(txn != NULL);
+
+   change = ReorderBufferGetChange(rb);
+   change->action = REORDER_BUFFER_CHANGE_MESSAGE;
+   change->data.msg.transactional = true;
+   change->data.msg.prefix = pstrdup(prefix);
+   change->data.msg.message_size = msg_sz;
+   change->data.msg.message = palloc(msg_sz);
+   memcpy(change->data.msg.message, msg, msg_sz);
+
+   ReorderBufferQueueChange(rb, xid, lsn, change);
+   }
+   else
+   {
+   rb->message(rb, txn, lsn, transactional, prefix, msg_sz, msg);
+   }
+}



This approach prohibts catalog access when processing a nontransaction
message as there's no snapshot set up.



Hmm, I do see usefulness in having a snapshot, although I wonder if that
does not kill the point of non-transactional messages. The question then
is which snapshot the message should see, the base_snapshot of the
transaction? That would mean we'd have to call SnapBuildProcessChange
for non-transactional messages, which we currently avoid. Alternatively
we could probably invent a lighter version of that interface that would
just make sure builder->snapshot is valid and if not then build it.

[HACKERS] WIP: Upper planner pathification

2016-02-28 Thread Tom Lane
Those with long memories will recall that I've been waving my arms about
$SUBJECT for more than five years.  I started to work seriously on a patch
last summer, and here is a version that I feel comfortable exposing to
public scrutiny (which is not to call it "done"; more below).

The basic point of this patch is to apply the generate-and-compare-Paths
paradigm to the planning steps after query_planner(), which only covers
scan and join processing (the FROM and WHERE parts of a query).  These
later steps deal with grouping/aggregation, window functions, SELECT
DISTINCT, ORDER BY, LockRows (SELECT FOR UPDATE), LIMIT/OFFSET, and
ModifyTable.  Also UNION/INTERSECT/EXCEPT.  Back in the bad old days we
had only one way to do any of that stuff, so there was no real problem
with the approach of converting query_planner's answer into a Plan and
then stacking more Plan nodes atop that.  Over time we grew other ways
to do those steps, and chose between those ways with ad-hoc code in
grouping_planner().  That was messy enough in itself, but it had other
disadvantages too: subquery_planner() had to choose and return a single
Plan, without regard to what the outer query might need.  (Well, we did
pass down a tuple_fraction parameter, but that is a pretty limited bit of
information.)

An even larger problem is that we had no way to handle addition of new
alternative plan types for these upper-planning steps without fundamental
hacking on grouping_planner().  An example is the code I added in commit
addc42c339208d6a and later (planagg.c and other places) for optimization
of MIN/MAX aggregates: that code had a positively incestuous relationship
with grouping_planner(), and was darn ugly in multiple other ways besides.
Of late, the main way this issue has surfaced is that we have no practical
way to plan pushdown of aggregates or updates on remote tables to the
responsible FDWs, because the FDWs cannot create Paths representing such
operations.

The present patch addresses this problem by inventing Path nodes to
represent every post-scan/join step, and changing the API of
grouping_planner() and subquery_planner() so that they return sets of
Paths rather than single Plans.  Creation of a Plan tree happens only
after control returns to the top level of standard_planner().  The Path
nodes for these post-scan/join steps are attached to "upper relation"
RelOptInfos that didn't exist before.  There are provisions for FDWs to
inject candidate Paths for these upper-level steps.  As proof of concept
for that, planagg.c has been revised to work by injecting a new Path
into the grouping/aggregation upper rel, rather than predetermining what
the answer will be.  This vastly decreases its coupling with both
grouping_planner and some other parts of the system such as equivclass.c
(though, the Law of Conservation of Cruft being what it is, I did have to
push some knowledge about planagg.c's work into setrefs.c).

I'm pretty pleased with the way this turned out.  grouping_planner() is
about half the length it was before, and much more straightforward IMO.
planagg.c no longer seems like a complete hack; it's a reasonable
prototype for injecting nontraditional implementation paths into
aggregation or other late planner stages, and grouping_planner() doesn't
need to know about it.
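
The injection pattern, reduced to its skeleton (a fragment; the
surrounding variables are elided, and the constructor name should be
taken as illustrative of the new machinery rather than a frozen API):

    /*
     * Fetch the "upper relation" for the grouping/aggregation step and
     * offer an alternative implementation path; add_path() keeps it only
     * if it survives the usual cost comparison against other candidates.
     */
    RelOptInfo *grouped_rel = fetch_upper_rel(root, UPPERREL_GROUP_AGG, NULL);

    add_path(grouped_rel, (Path *)
             create_minmaxagg_path(root, grouped_rel, target,
                                   mmaggregates, quals));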

The patch does add a lot of net new lines (and it's not done) but
most of the new code is very straightforward boilerplate.

The main thing that makes this WIP and not committable is that I've not
yet bothered to implement outfuncs.c code and some other debug support for
all the new path struct types.  A lot of the new function header comments
remain to be fleshed out too, and some more documentation needs to be
written.  But I think it's reviewable as-is; the other stuff would just
make it even longer but not more interesting.

There's a lot of future work to be done within this skeleton.  Notably,
I did not fix the UNION/INTERSECT/EXCEPT planning code to consider
multiple paths; it still only generates a single Path tree.  That code
needs to be rewritten from scratch, probably, and it seems like doing so
is a separate project.  I'd also like to do some more refactoring in
createplan.c: some code paths are still doing redundant cost estimation,
and I'm growing increasingly dissatisfied with the "use_physical_tlist"
hack.  But that seems like a separable issue as well.

So, where to go from here?  I'm acutely aware that we're hard up against
the final 9.6 commitfest, and that we discourage major patches arriving
so late in a devel cycle.  But I simply couldn't get this done any faster.
I don't really want to hold it over for the 9.7 devel cycle.  It's been
enough trouble maintaining this patch in the face of conflicting commits
over the last year or so (it's probably still got bugs related to parallel
query...), and there definitely are conflicting patches in the upcoming
'fest.  And the lack of this infrastructure is blocking progress on FDWs
and some other things.

So I'd really like to get this into 9.6.  I'm happy to put it into the
March commitfest if someone will volunteer to review it.

Re: [HACKERS] exposing pg_controldata and pg_config as functions

2016-02-28 Thread Joe Conway
On 02/28/2016 05:16 AM, Michael Paquier wrote:
> +Returns information about current controldata file state.
> s/controldata/control data?
> 
> +
> + 
> +  
> +   Column Name
> +   Data Type
> +  
> + 
> +
> Having a description of each field would be nice.

Might be nice, but the pg_controldata section of the docs does not have
any description either, and I don't feel compelled to improve on that
just for the sake of this patch. I'll put it on my TODO to improve both
at some point though.

> + * pg_controldata.c
> + * Expose select pg_controldata output, except via SQL functions
> I am not much getting the meaning of this sentence. What about the following:
> "Set of routines exposing the contents of the control data file in a
> set of SQL functions".

Ok, reworded to something similar.

> +   /* Populate the values and null arrays */
> +   values[0] = LSNGetDatum(ControlFile->checkPoint);
> +   nulls[0] = false;
> +
> +   values[1] = LSNGetDatum(ControlFile->prevCheckPoint);
> +   nulls[1] = false;
> Instead of setting each flag of nulls[] one by one, just calling once
> MemSet would be fine. Any method is fine though.

I prefer the current style and I believe it is more consistent
with what is done elsewhere. In any case this is not exactly a
performance critical path.

> +   /* get a copy of the control file */
> +   ControlFile = get_controlfile(DataDir, progname);
> Some whitespaces here. git diff is showing in red here.

fixed

> +   if (ControlFile->pg_control_version % 65536 == 0 &&
> +   ControlFile->pg_control_version / 65536 != 0)
> +   elog(ERROR, _("byte ordering mismatch"));
> You may want to put this check directly in get_controlfile(). it is
> repeated 4 times in the backend, and has an equivalent in
> pg_controldata.

makes sense - done


> our @pgcommonallfiles = qw(
> - config_info.c exec.c pg_lzcompress.c pgfnames.c psprintf.c
> + config_info.c controldata_utils.c exec.c pg_lzcompress.c pgfnames.c
>   relpath.c rmtree.c string.c username.c wait_error.c);
> "psprintf.c" has been removed from this list. This causes the build
> with MSVC to fail.

good catch -- fixed

If there are no other complaints or comments, I will commit the attached
sometime this coming week (with the requisite catversion bump).
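
(What the new interface looks like from SQL, per the docs below; a sketch
using column names from the pg_control_checkpoint table:)

    -- Cluster-wide control file data, now queryable without pg_controldata:
    SELECT checkpoint_location, redo_location, timeline_id, next_xid
    FROM pg_control_checkpoint();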

Thanks!

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index c0b94bc..4b5ee81 100644
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
*** SELECT collation for ('foo' COLLATE "de_
*** 16703,16708 
--- 16703,17064 
  
 
  
+
+ The functions shown in 
+ print information initialized during initdb, such
+ as the catalog version. They also show information about write-ahead
+ logging and checkpoint processing. This information is cluster-wide,
+ and not specific to any one database. They provide most of the same
+ information, from the same source, as
+ , although in a form better suited
+ to SQL functions.
+
+ 
+
+ Control Data Functions
+ 
+  
+   Name Return Type Description
+  
+ 
+  
+   
+
+ pg_control_checkpoint
+ pg_control_checkpoint()
+
+record
+
+ Returns information about current checkpoint state.
+
+   
+ 
+   
+
+ pg_control_system
+ pg_control_system()
+
+record
+
+ Returns information about current controldata file state.
+
+   
+ 
+   
+
+ pg_control_init
+ pg_control_init()
+
+record
+
+ Returns information about cluster initialization state.
+
+   
+ 
+   
+
+ pg_control_recovery
+ pg_control_recovery()
+
+record
+
+ Returns information about recovery state.
+
+   
+ 
+  
+ 
+
+ 
+
+ pg_control_checkpoint returns a record, shown in
+ 
+
+ 
+
+ pg_control_checkpoint Columns
+ 
+  
+   
+Column Name
+Data Type
+   
+  
+ 
+  
+ 
+   
+checkpoint_location
+pg_lsn
+   
+ 
+   
+prior_location
+pg_lsn
+   
+ 
+   
+redo_location
+pg_lsn
+   
+ 
+   
+redo_wal_file
+text
+   
+ 
+   
+timeline_id
+integer
+   
+ 
+   
+prev_timeline_id
+integer
+   
+ 
+   
+full_page_writes
+boolean
+   
+ 
+   
+next_xid
+text
+   
+ 
+   
+next_oid
+oid
+   
+ 
+   
+next_multixact_id
+xid
+   
+ 
+   
+next_multi_offset
+xid
+   
+ 
+   
+oldest_xid
+  

Re: [HACKERS] The plan for FDW-based sharding

2016-02-28 Thread Simon Riggs
On 27 February 2016 at 22:38, Kevin Grittner wrote:


> That could be part of a solution.  What I sketched out with the
> "apparent order of execution" ordering of the transactions
> (basically, commit order except when one SERIALIZABLE transaction
> needs to be dragged in front of another due to a read-write
> dependency) is possibly the simplest approach, but batching may
> well give better performance.
>
> > Collecting a list of transactions that must be applied before the current
> > one could be accumulated during SSI processing and added to the commit
> > record. But reordering the transaction apply is something we'd need to
> get
> > some real clear theory on before we considered it.
>
> Oh, there is a lot of very clear theory on it.  I even considered
> whether it might work at the physical level, but that seems fraught
> with potential land-mines due to the subtle ways in which we manage
> race conditions at the detail level.  It's one of those things that
> seems theoretically possible, but probably a really bad idea in
> practice.  For logical replication, though, there is a clear way to
> determine a reasonable order of applying changes that will never
> yield a serialization anomaly -- if we do that, we dodge the choice
> between using a "stale" safe snapshot or waiting an indeterminate
> length of time for a "fresh" safe snapshot -- at the cost of
> delaying logical replication itself at various points.
>

I think we're going to have practical difficulties with these concepts.

If an xid commits with inConflicts, those refer to transactions that may
not yet have been assigned xids. They may not be assigned xids for hours
or days, so it's hard to know whether they will eventually become write
transactions or not, making it a challenge even to know whether we should
delay. And even if we did know, delaying the apply of commits for hours to
allow us to reorder transactions clearly isn't practical in all cases,
more so if the impact is caused by one minor table that nobody much cares
about.

What I see as more practical is reducing the scope of "safe transactions"
down to "safe scopes", where particular tables or sets of tables are known
safe at particular times, so we know more about which things we can look at
safely.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Move PinBuffer and UnpinBuffer to atomics

2016-02-28 Thread Dilip Kumar
On Sat, Feb 27, 2016 at 5:14 AM, Andres Freund wrote:

> > > > >> ./pgbench  -j$ -c$ -T300 -M prepared -S postgres
> > > > >>
> > > > >> Client  Base    Patch
> > > > >> 1       17169   16454
> > > > >> 8       108547  105559
> > > > >> 32      241619  262818
> > > > >> 64      206868  233606
> > > > >> 128     137084  217013
> > >
> > > So, there's a small regression on low client counts. That's worth
> > > addressing.
> > >
> >
> > Interesting. I'll try to reproduce it.
>
> Any progress here?


On a multi-socket machine with 8 sockets and 64 cores, I have seen more
regression compared to my previous run on POWER8 with 2 sockets. For now I
tested a read-only workload with 5-minute runs; when I get time, I will
run longer and confirm again.

Shared Buffer= 8GB
Scale Factor=300

./pgbench  -j$ -c$ -T300 -M prepared -S postgres
client  base    patch
1       7057    5230
2       10043   9573
4       20140   18188


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Re: [COMMITTERS] pgsql: Respect TEMP_CONFIG when running contrib regression tests.

2016-02-28 Thread Andrew Dunstan



On 02/27/2016 01:24 PM, John Gorman wrote:


On Sat, Feb 27, 2016 at 9:25 AM, Robert Haas wrote:


On Sat, Feb 27, 2016 at 7:08 PM, Andrew Dunstan wrote:

> Perhaps what we need to do is modify pg_regress.c slightly to
allow more
> than one --temp-config argument. But that could be done later.

Well, I'm pretty interested in using --temp-config for parallelism
testing; I want to be able to run the whole regression test suite with
a given --temp-config.  I'm in agreement with this change but if it
doesn't play well with that need, I suppose I'll be writing that
pg_regress.c patch sooner rather than later.


Here is a patch to allow pg_regress to include several --temp-config 
files.





Thanks, wonderfully small patch. Applied.

cheers

andrew




Re: [HACKERS] [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-28 Thread Craig Ringer
On 27 February 2016 at 06:37, Michael Paquier 
wrote:

> On Sat, Feb 27, 2016 at 4:30 AM, Alvaro Herrera wrote:
> > Craig Ringer wrote:
> >> Should be committed ASAP IMO.
> >
> > Finally pushed it.  Let's see how it does in the buildfarm.  Now let's
> > get going and add more tests, I know there's no shortage of people with
> > test scripts waiting for this.
> >
> > Thanks, Michael, for the persistency, and thanks to all reviewers.  (I'm
> > sorry we seem to have lost Amir Rohan in the process.  He was doing
> > a great job.)
>
> Date of first message of this thread: Mon, 2 Dec 2013 15:40:41 +0900
> Date of this message: Fri, 26 Feb 2016 16:30:08 -0300
> This has been a long trip. Thanks a lot to all involved. Many people
> have reviewed and helped out with this patch.
>

Congratulations and thanks.

I don't see any new buildfarm failures. The BinInstallCheck failure on
Windows predates this and the isolationtest failure on OpenBSD is unrelated.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-28 Thread Michael Paquier
On Sun, Feb 28, 2016 at 10:41 PM, Michael Paquier
 wrote:
> On Sat, Feb 27, 2016 at 7:37 AM, Michael Paquier wrote:
>> Date of first message of this thread: Mon, 2 Dec 2013 15:40:41 +0900
>> Date of this message: Fri, 26 Feb 2016 16:30:08 -0300
>> This has been a long trip. Thanks a lot to all involved. Many people
>> have reviewed and helped out with this patch.
>
> I just had a closer look at what has been committed, and I found a
> couple of minor issues, addressed via the patch attached:
> 1) install-windows.sgml should use the <command> markup when mentioning
> bincheck and recoverycheck
> 2) src/test/recovery/.gitignore is listing /regress_log/ but that's
> not needed (this is a remnant of a previous version of the patch
> posted on this thread).
> 3) Header of 002_archiving.pl mentions that the tests are running on a
> warm standby, but that's a hot standby (found by Craig and reported on
> github on my account)
> 4) src/test/recovery/Makefile is missing a clean target, to remove tmp_check/.
> 5) src/tools/msvc/clean.bat is missing the same cleanup command for
> the same thing after running the tests.
> 6) Header of 004_timeline_switch.pl should perhaps mention that a
> cascading standby is used (idea of Craig, a good addition IMO)

7) src/test/README is not describing recovery/
8) This description in src/test/recovery/README is not exact, it
mentions a set of routines that are now part of PostgresNode.pm:
+This directory contains a test suite for recovery and replication,
+testing mainly the interactions of recovery.conf with cluster
+instances by providing a simple set of routines that can be used
+to define a custom cluster for a test, including backup, archiving,
+and streaming configuration.
I would suggest removing the last 4 lines and simplifying the paragraph.
9) I have no logical explanation for why I am only seeing all those
things now.
v2 is attached.
-- 
Michael


test-recovery-fixes-v2.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [REVIEW] In-core regression tests for replication, cascading, archiving, PITR, etc.

2016-02-28 Thread Michael Paquier
On Sat, Feb 27, 2016 at 7:37 AM, Michael Paquier wrote:
> Date of first message of this thread: Mon, 2 Dec 2013 15:40:41 +0900
> Date of this message: Fri, 26 Feb 2016 16:30:08 -0300
> This has been a long trip. Thanks a lot to all involved. Many people
> have reviewed and helped out with this patch.

I just had a closer look at what has been committed, and I found a
couple of minor issues, addressed via the patch attached:
1) install-windows.sgml should use the <command> markup when mentioning
bincheck and recoverycheck
2) src/test/recovery/.gitignore is listing /regress_log/ but that's
not needed (this is a remnant of a previous version of the patch
posted on this thread).
3) Header of 002_archiving.pl mentions that the tests are running on a
warm standby, but that's a hot standby (found by Craig and reported on
github on my account)
4) src/test/recovery/Makefile is missing a clean target, to remove tmp_check/.
5) src/tools/msvc/clean.bat is missing the same cleanup command for
the same thing after running the tests.
6) Header of 004_timeline_switch.pl should perhaps mention that a
cascading standby is used (idea of Craig, a good addition IMO)

Regards,
-- 
Michael


test-recovery-fixes.patch
Description: binary/octet-stream

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] exposing pg_controldata and pg_config as functions

2016-02-28 Thread Michael Paquier
On Sun, Feb 28, 2016 at 9:33 AM, Joe Conway  wrote:
> On 02/21/2016 05:30 AM, Michael Paquier wrote:
>> Looking again at this thread I guess that this is consensus, based on
>> the proposal from Josh and seeing no other ideas around. Another idea
>> would be to group all the fields that into a single function
>> pg_control_data().
>
> I think a single function would be ridiculously wide. I like the four
> separate functions better if we're going to do it this way at all.
>
>> +   
>> +pg_checkpoint_state
>> +   
>> +   
>> +pg_checkpoint_state returns a record containing
>> +checkpoint_location, prior_location, redo_location, redo_wal_file,
>> +timeline_id, prev_timeline_id, full_page_writes, next_xid, next_oid,
>> +next_multixact_id, next_multi_offset, oldest_xid, oldest_xid_dbid,
>> +oldest_active_xid, oldest_multi_xid, oldest_multi_dbid,
>> +oldest_commit_ts_xid, newest_commit_ts_xid, and checkpoint_time.
>> +   
>> This is bit unreadable. The only entry in the documentation that
>> adopts a similar style is pg_stat_file, and with six fields that feels
>> as being enough. I would suggest using a table instead with the type
>> of the field and its name.
>
> Ok, changed to your suggestion.
>
>
>> Regarding the naming of the functions, I think that it would be good
>> to get something consistent with the concept of those being "Control
>> Data functions" by having them share the same prefix, say pg_control_
>> - pg_control_checkpoint
>> - pg_control_init
>> - pg_control_system
>> - pg_control_recovery
>
> No issues -- changed.
>
>> +   snprintf (controldata_name, CONTROLDATANAME_LEN,
>> + "%s:", controldata[i].name);
>> Nitpick: extra space.
>
> I didn't understand this comment but it is moot now anyway...
>
>> +static const char *const controldata_names[] =
>> +{
>> +   gettext_noop("pg_control version number"),
>> +   gettext_noop("Catalog version number"),
>> +   gettext_noop("Database system identifier"),
>> Is this complication really necessary? Those identifiers are used only
>> in the frontend and the footprint of this patch on pg_controldata is
>> really large. What I think we should do is have in src/common the
>> following set of routines that work directly on ControlFileData:
>> - checkControlFile, to perform basic sanity checks on the control file
>> (CRC, see for example pg_rewind.c)
>> - getControlFile(dataDir), that simply returns a palloc'd
>> ControlFileData to the caller after looking at global/pg_control.
>> pg_rewind could perhaps make use of the one to check the control file
>> CRC, to fetch ControlFileData there is some parallel logic for the
>> source server if it is either remote or local so it would be better to
>> not use getControlFile in this case.
>
> I agree with the assessment that much of what had been moved based on
> the original pg_controldata() SRF no longer needs to move. This version
> only puts get_controlfile() into src/common, since that is the bit that
> is still shared. If checkControlFile() or something similar is useful
> for pg_rewind or some other extension, I'd say that should be a separate
> patch.
>
> Oh, and the entire thing is now rebased against a git pull from a few
> hours ago. I moved this to the upcoming commitfest too, although I think
> it is pretty well ready to go.

Thanks for the updated version.

+Returns information about current controldata file state.
s/controldata/control data?

+
+ 
+  
+   Column Name
+   Data Type
+  
+ 
+
Having a description of each field would be nice.

+ * pg_controldata.c
+ * Expose select pg_controldata output, except via SQL functions
I don't quite get the meaning of this sentence. What about the following:
"Set of routines exposing the contents of the control data file in a
set of SQL functions".

+   /* Populate the values and null arrays */
+   values[0] = LSNGetDatum(ControlFile->checkPoint);
+   nulls[0] = false;
+
+   values[1] = LSNGetDatum(ControlFile->prevCheckPoint);
+   nulls[1] = false;
Instead of setting each flag of nulls[] one by one, a single MemSet call
would be fine. Any method works, though.
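
As a hedged illustration of that suggestion (the array size and the
trailing columns are abbreviated):

	Datum		values[19];
	bool		nulls[19];

	/* clear every null flag at once instead of one by one */
	MemSet(nulls, false, sizeof(nulls));

	values[0] = LSNGetDatum(ControlFile->checkPoint);
	values[1] = LSNGetDatum(ControlFile->prevCheckPoint);
	/* ... remaining values[] assignments ... */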

+   /* get a copy of the control file */
+   ControlFile = get_controlfile(DataDir, progname);
There is some trailing whitespace here; git diff shows it in red.

+   if (ControlFile->pg_control_version % 65536 == 0 &&
+   ControlFile->pg_control_version / 65536 != 0)
+   elog(ERROR, _("byte ordering mismatch"));
You may want to put this check directly in get_controlfile(). It is
repeated 4 times in the backend, and has an equivalent in
pg_controldata.
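
A hedged sketch of what folding the check into the common routine could
look like (the read and CRC verification are elided, and a frontend build
would need printf/exit instead of elog):

ControlFileData *
get_controlfile(const char *DataDir, const char *progname)
{
	ControlFileData *ControlFile = palloc(sizeof(ControlFileData));

	/* ... open global/pg_control, read it, verify the CRC ... */

	/* do the byte-order sanity check once, for every caller */
	if (ControlFile->pg_control_version % 65536 == 0 &&
		ControlFile->pg_control_version / 65536 != 0)
		elog(ERROR, "byte ordering mismatch");

	return ControlFile;
}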

our @pgcommonallfiles = qw(
- config_info.c exec.c pg_lzcompress.c pgfnames.c psprintf.c
+ config_info.c controldata_utils.c exec.c pg_lzcompress.c pgfnames.c
  relpath.c rmtree.c string.c username.c wait_error.c);
"psprintf.c" has been removed from this list. This causes the build
with MSVC to fail.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] The plan for FDW-based sharding

2016-02-28 Thread Konstantin Knizhnik

On 02/27/2016 11:38 PM, Kevin Grittner wrote:


Is this an implementation of some particular formal technique?  If
so, do you have a reference to a paper on it?  I get the sense that
there has been a lot written about distributed transactions, and
that it would be a mistake to ignore it, but I have not (yet)
reviewed the literature for it.


The reference to the article is on our wiki page explaining our DTM:
https://wiki.postgresql.org/wiki/DTM

http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Relation cache invalidation on replica

2016-02-28 Thread Simon Riggs
On 27 February 2016 at 07:52, Konstantin Knizhnik  wrote:

> On 02/27/2016 04:16 AM, Simon Riggs wrote:
>
> On 27 February 2016 at 00:33, Simon Riggs  wrote:
>
>> On 27 February 2016 at 00:29, Andres Freund  wrote:
>>
>>> On 2016-02-26 18:05:55 +0300, Konstantin Knizhnik wrote:
>>> > The reason of the problem is that invalidation messages are not
>>> delivered to
>>> > replica after the end of concurrent create index.
>>> > Invalidation messages are included in xlog as part of transaction
>>> commit
>>> > record.
>>> > Concurrent index create is split into three transaction, last of which
>>> is
>>> > just performing inplace update of index tuple, marking it as valid and
>>> > invalidating cache. But as far as this transaction is not assigned
>>> XID, no
>>> > transaction record is created in WAL and send to replicas. As a result,
>>> > replica doesn't receive this invalidation messages.
>>>
>>> Ugh, that's a fairly ugly bug.
>>
>>
>> Looking now.
>>
>
> If the above is true, then the proposed fix wouldn't work either.
>
> No point in sending a cache invalidation message on the standby if you
> haven't also written WAL, since the catalog re-read would just see the old
> row.
>
> heap_inplace_update() does write WAL, which blows away the starting
> premise.
>
> So I'm not seeing this as an extant bug in an open source version of
> PostgreSQL, in my current understanding.
>
>
> In-place update really does create a record in WAL, and this is why the
> index is marked as valid at the replica.
> But invalidation messages are sent only with the transaction commit
> record, and such a record is not created in this case, because there is
> no assigned XID.
>
> This is a real bug which was originally observed by one of our customers
> with different versions of Postgres (the last one they tried was 9.5.1).
> Then we reproduced it locally and determined the cause of the problem.
> The repro scenario is very simple: you just need to create a large enough
> table (pgbench with scale factor 100 works well in my case) so that
> "create index concurrently" takes a substantial amount of time. If, while
> this statement is in progress, you execute some query at the replica which
> uses this index, then it will cache the state of the relation without the
> index. And afterwards, even when the index is actually constructed, it
> will never be used in this backend (but other backends at the replica
> will use it).
>

OK, so this is a fairly restricted bug. I was wondering how we'd missed it
for years. It wouldn't affect all backends, just those that accessed the
index before it was valid. New backends and restarts wouldn't be affected.


> I am not sure about the best way of fixing the problem.
> I have not tested Andres' proposal:
>
> if (nrels != 0 || nmsgs != 0 || RelcacheInitFileInval)
>
> to see if it actually fixes the problem.
> Assigning an XID in heap_inplace_update definitely should work.
> Is that better than forcing XID assignment in DefineIndex? I am not sure,
> because this problem seems to be related only to the concurrent case
> (but maybe I am wrong).
> At least not all in-place updates should cause catalog invalidation and
> so require XID assignment.
>

We have various proposals for fixing this, so on consideration here's what
I think we should do...

1. Ignore my first patch to always set an xid. Andres thought that this may
break something else, which could be true, so it is not worth the risk.

2. Apply Konstantin's patch to fix this situation for the specific case
only.

3. Take Andres' idea and build that in as protection. We currently check
that nrels != 0 and throw an ERROR. We should do the same thing if there is
an invalidation event, so that we catch such errors rather than ignoring
them and issuing the commit anyway. This will check that there are no other
cases in other code; a sketch follows.
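
To make point 3 concrete, here is a hedged sketch anchored on the existing
nrels check in RecordTransactionCommit(); the exact placement and message
wording are illustrative, not a finished patch:

	if (!markXidCommitted)
	{
		/* existing protection: file deletions require an xid */
		if (nrels != 0)
			elog(ERROR, "cannot commit a transaction that deleted files but has no xid");

		/* proposed addition: lost invalidations are caught the same way */
		if (nmsgs != 0 || RelcacheInitFileInval)
			elog(ERROR, "cannot commit a transaction that queued invalidation messages but has no xid");
	}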

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


cache_inval_has_commit_rec.v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATH] Correct negative/zero year in to_date/to_timestamp

2016-02-28 Thread Yury Zhuravlev

But I'm waiting for a discussion: what part should be changed?


I am for compliance with the standard (all ISO). In addition, Oracle uses
the "IYYY" format.
Standards help reduce liability. But I think someone like Tom Lane should
have the final say, because we would break backward compatibility.

Options "year"/"isoyear" may confuse the users.

Thanks.

--
Yury Zhuravlev
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: get oldest LSN - function

2016-02-28 Thread Michael Paquier
On Sun, Feb 28, 2016 at 4:40 PM, Kartyshov Ivan wrote:
> It will not satisfy our purposes and our administrators, for three reasons.
> 1) The DBA set me the task of getting the oldest number present in WAL,
> not the last

Yeah I got that.

> 2) Surely we can choose the oldest segment from the pg_ls_dir('pg_xlog')
> listing and calculate the first LSN by hand, but that is not something we
> want to do manually.

That's where I am not following. In any case, you are just one SQL
query away from the result. And actually, your patch is incorrect, I
think. If you care about the oldest record available, you should look
for the first LSN position of the oldest segment, no? What you are
calculating now is the oldest virtual LSN position in local pg_xlog.
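
For illustration, a hedged sketch of that calculation using the macros in
xlog_internal.h; the directory scan that yields the lowest-numbered WAL
file name is elided, and oldest_fname stands in for its result:

	TimeLineID	tli;
	XLogSegNo	oldest_segno;
	XLogRecPtr	oldest_lsn;

	/* parse the segment number out of the oldest segment file name */
	XLogFromFileName(oldest_fname, &tli, &oldest_segno);

	/* byte 0 of that segment is the oldest LSN still kept in pg_xlog */
	XLogSegNoOffsetToRecPtr(oldest_segno, 0, oldest_lsn);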
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers