Re: [HACKERS] Inaccurate results from numeric ln(), log(), exp() and pow()

2015-11-15 Thread Dean Rasheed
On 14 November 2015 at 20:05, Tom Lane  wrote:
> I committed this with that change and some other
> mostly-cosmetic revisions.

Thanks.


>  Notable non-cosmetic changes:
>
> * I forced all the computed rscales to be at least 0, via code like
> local_rscale = Max(local_rscale, NUMERIC_MIN_DISPLAY_SCALE);
>
> You had done that in some places but not all, with the result that
> functions like mul_var could be invoked with negative rscale and hence
> produce outputs with negative dscale.  This is not allowed according to
> the comments in the code, and while it might accidentally fail to fail,
> I do not trust the code to operate properly in that regime.  It might be
> worth revisiting the whole question of negative dscale with an eye to not
> calculating digits we don't need, but I think that's material for a
> completely separate patch.
>

Hmm, I wondered about that when deciding on the rscale for the sqrt()s
in the range reduction code for ln_var(). For very large inputs,
forcing the rscale to be non-negative can cause the sqrt() to compute
far more digits than are needed, but I take your point that it might
be playing with fire if the underlying functions aren't sure to handle
negative rscales properly.

In most normal cases it makes little difference to performance. For
ln(9.9e99) it's only around 5% slower than my original code, and for
ln(9.9e999) it is around 5 times slower, but still pretty fast (520
microseconds vs 128). Once you get to cases like ln(2.0^435411),
however, it is pretty bad (13 seconds vs 165ms). But that is a very
contrived worst-case example, and actually that performance is no
different than it was before the patch. I very much doubt anyone will
ever do such a computation, except out of academic interest.
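
For reference, those cases as runnable queries (a sketch; the timings
are the ones quoted above and are machine-dependent):

    SELECT ln(9.9e99::numeric);   -- ~5% slower than the unclamped rscale code
    SELECT ln(9.9e999::numeric);  -- ~520 microseconds vs 128
    SELECT ln(2.0 ^ 435411);      -- contrived worst case: ~13 seconds vs 165ms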


> * I moved some of the new regression test cases into numeric_big.sql.
> While they don't add more than a couple hundred msec on my dev machine,
> they're probably going to cost a lot more than that on the slower
> buildfarm machines, and I'm not sure that they add enough benefit to be
> worth running all the time.  This code really shouldn't suffer from many
> portability issues.
>
> (I am going to go run numeric_big manually on all my own buildfarm
> critters just to be sure, though.)
>
> regards, tom lane

OK, that seems fair enough.

Thanks.

Regards,
Dean




Re: [HACKERS] proposal: multiple psql option -c

2015-11-15 Thread Andrew Dunstan



On 11/15/2015 08:50 AM, Catalin Iacob wrote:

On Sun, Nov 15, 2015 at 1:27 AM, Andrew Dunstan  wrote:

That seems to me to get rid of the main motivation for this change, which is
to allow multiple such arguments, which together would act as if they were
all written to a file which was then invoked like -f file.

It seems to me the motivation is not "multiple command line arguments"
but sending multiple statements to the backend in one psql invocation,
without needing bash-specific here-docs or a temporary file for -f.
Most combinations of such multiple statements can already be sent via
-c, which sends them in one query; the backend executes them in one
transaction, but it rejects some combinations, like SELECT +
VACUUM.

I think the proposal was misled by the apparent similarity with -c
and said "if -c can't do SELECT + VACUUM, let's do a sort of repeated
-c and call that -C SELECT -C VACUUM". But why not do the same with -C
"SELECT 1; VACUUM", which works just like having a file with that
content works today, but is handier for scripts? Doesn't this solve the
exact need in this thread?

I'm arguing that sending multiple statements and executing each in one
transaction (the current proposed semantics of -C) is not like -c and
doesn't need to be "repeated -c"; it's exactly like reading stdin or a
file passed to -f, and it solves the original problem. But maybe I'm
missing something.





I suggest you review the original thread on this subject before a line 
was ever written. "multiple" (see subject line on this whole thread) is 
clearly what is being asked for. Making people turn that into a single 
argument is not what was envisaged. See for example Pavel's original 
example involving use of xargs where that's clearly not at all easy.


cheers

andrew




Re: [HACKERS] proposal: multiple psql option -c

2015-11-15 Thread Catalin Iacob
On Sun, Nov 15, 2015 at 1:27 AM, Andrew Dunstan  wrote:
> That seems to me to get rid of the main motivation for this change, which is
> to allow multiple such arguments, which together would act as if they were
> all written to a file which was then invoked like -f file.

It seems to me the motivation is not "multiple command line arguments"
but sending multiple statements to the backend in one psql invocation,
without needing bash-specific here-docs or a temporary file for -f.
Most combinations of such multiple statements can already be sent via
-c, which sends them in one query; the backend executes them in one
transaction, but it rejects some combinations, like SELECT +
VACUUM.
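
For concreteness, the rejection comes from the implicit transaction
block that a multi-statement query string runs in; the equivalent
failure in plain SQL looks like this:

    BEGIN;
    SELECT 1;
    VACUUM;  -- ERROR:  VACUUM cannot run inside a transaction block
    COMMIT;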

I think the proposal was misled by the apparent similarity with -c
and said "if -c can't do SELECT + VACUUM, let's do a sort of repeated
-c and call that -C SELECT -C VACUUM". But why not do the same with -C
"SELECT 1; VACUUM", which works just like having a file with that
content works today, but is handier for scripts? Doesn't this solve the
exact need in this thread?

I'm arguing that sending multiple statements and executing each in one
transaction (the current proposed semantics of -C) is not like -c and
doesn't need to be "repeated -c"; it's exactly like reading stdin or a
file passed to -f, and it solves the original problem. But maybe I'm
missing something.




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-15 Thread Robert Haas
On Sun, Nov 15, 2015 at 5:41 AM, Simon Riggs  wrote:
> Hmm, if that's where we're at, I'll summarize my thoughts.
>
> All of this discussion presupposes we are distributing/load balancing
> queries so that reads and writes might occur on different nodes.

Agreed.  I think that's a pretty common pattern, though certainly not
the only one.

> We need a good balancer. Any discussion of this that ignores the balancer
> component is only talking about half the solution. What we need to do is
> decide whether functionality should live in the balancer or the core.

I'm all in favor of having a load-balancer in core, but that seems
completely unrelated to the patch at hand.

> Your option (1) is viable, but only in certain cases. We could add support
> for some token/wait mechanism but as you say, this would require application
> changes not pooler changes.

Agreed.

> Your option (2) is wider but also worse in some ways. It can be implemented
> in a pooler.
>
> Your option (3) doesn't excite me much. You've got a load of stuff that
> really should happen in a pooler. And at its core we have synchronous_commit
> = apply but with a timeout rather than a wait. So anyway, consider me nudged
> to finish my patch to provide capability for that by 1 Jan.

I don't see how either option (2) or option (3) could be implemented
in a pooler.  How would that work?

To be frank, it's starting to seem to me like you are just trying to
block this patch so you can have time to develop your own version
instead.  I hope that's not the case, because it would be quite
unfair.  When Thomas originally posted the patch, you complained that
"This causes every writer to wait. What we want is to isolate the wait
only to people performing a write-read sequence, so I think it should
be readers that wait. Let's have that debate up front before we start
reviewing the patch."  Now, you seem to be saying that's OK, because
you want to post a patch to do exactly the same thing under the name
synchronous_commit=apply, but you want it to be your own patch,
leaving out the other stuff that Thomas has put into this one.

That could be the right thing to do, but how about we discuss it a
bit?  The timeout stuff that Thomas has added here is really useful.
Without that, if a machine goes down, we wait forever.  That's the
right thing to do if we're replicating to make sure transactions can
never be lost, but it's a bad idea if we're replicating for load
balancing.  In the load balancing case, you want to drop sync slaves
quickly to ensure the cluster remains available, and you need them to
know they are out of sync so that the load balancer doesn't get
confused.  That's exactly what is implemented here.

If you have an idea for a simpler implementation, great, but I think
we need something.  I don't see how it's going to work to make it
entirely the pooler's job to figure out whether the cluster is in sync
- it needs a push from the core server.  Here, that push is easy to
find: if a particular replica starts returning the "i'm out of sync"
error when you query it, then stop routing queries to that replica
until the error clears (which the pooler can find out by polling it
with some trivial query).  That's a great deal more useful than
synchronous_commit=apply without such a feature.

> On a related note, any further things like "GUC causal_reads_standby_names"
> should be implemented by Node Registry as a named group of nodes. We can
> have as many arbitrary groups of nodes as we want. If that sounds strange
> look back at exactly why GUCs are called GUCs.

I think a node registry is a good idea, and my impression from the
session in Vienna is quite a few other hackers do, too.  But I also
don't think it's remotely reasonable to make that a precondition for
accepting this patch.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-15 Thread Simon Riggs
On 12 November 2015 at 18:25, Thomas Munro wrote:


>  I don't want to get bogged down in details, while we're talking about the
> 30,000 foot view).
>

Hmm, if that's where we're at, I'll summarize my thoughts.

All of this discussion presupposes we are distributing/load balancing
queries so that reads and writes might occur on different nodes.

We need a good balancer. Any discussion of this that ignores the balancer
component is only talking about half the solution. What we need to do is
decide whether functionality should live in the balancer or the core.

Your option (1) is viable, but only in certain cases. We could add support
for some token/wait mechanism but as you say, this would require
application changes not pooler changes.

Your option (2) is wider but also worse in some ways. It can be implemented
in a pooler.

Your option (3) doesn't excite me much. You've got a load of stuff that
really should happen in a pooler. And at its core we have
synchronous_commit = apply but with a timeout rather than a wait. So
anyway, consider me nudged to finish my patch to provide capability for
that by 1 Jan.

On a related note, any further things like "GUC causal_reads_standby_names"
should be implemented by Node Registry as a named group of nodes. We can
have as many arbitrary groups of nodes as we want. If that sounds strange
look back at exactly why GUCs are called GUCs.

-- 
Simon Riggs   http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] pg_stat_statements query jumbling question

2015-11-15 Thread Michael Paquier
On Sun, Nov 15, 2015 at 1:34 AM, Tom Lane  wrote:
> Michael Paquier  writes:
>> On Sun, Nov 1, 2015 at 2:03 AM, Julien Rouhaud
>>  wrote:
>>> I'm also rather sceptical about this change.
>
>> Hm. Thinking a bit about this patch, it presents the advantage to be
>> able to track the same queries easily across systems even if those
>> tables have been created with a different OID.
>
> That argument would only hold if *every* use of OIDs in the jumbles
> were replaced by names --- not only tables, but types, operators,
> functions, etc.  I'm already concerned that the additional name
> lookups will cost a lot of performance, and I think adding so many
> more would probably be disastrous.

Yeah, I was thinking about a GUC switch to change from one mode to the
other last night before sleeping. Now, if there were a patch actually
implementing that, and proving that it has no performance impact, I
think it may be worth considering for integration. But we're far from
that.

>> Also, isn't this patch actually broken if we rename relations and/or
>> its namespace?
>
> Well, it would mean that the query would no longer be considered "the
> same".  You could argue either way whether that's good or bad.  But
> yeah, this approach will break one set of use-cases to enable another
> set.
>
> On the whole, I think my vote is to reject this patch.

Agreed. It's clear that the patch as-is is not enough.
-- 
Michael




Re: [HACKERS] eXtensible Transaction Manager API

2015-11-15 Thread Konstantin Knizhnik

I am sorry; it looks like I incorrectly interpreted Michael's comment:

Not always. If COMMIT PREPARED fails at some of the nodes but succeeds
on others, the transaction is already partially acknowledged as
committed in the cluster. Hence it makes more sense for the
coordinator to commit transactions on the remaining nodes. Before
issuing any COMMIT PREPARED queries, I guess that's fine to rollback
the transactions on all nodes though.

I completely agree that at the second stage of 2PC, once we have made
the decision to commit the transaction, there is no way back: we have
to complete the commit at all nodes. If a node reports that it has
successfully prepared a transaction and is ready to commit, that means
the node should be able to recover it in case of failure. In my Go code
example illustrating work with 2PC, I had, for simplicity, completely
excluded the logic of error handling.

Actually it should be something like this (now in C++ using pqxx):

nontransaction srcTxn(srcCon);
nontransaction dstTxn(dstCon);
string transComplete;
unsigned long csn;   // commit sequence number (CSN) assigned by the DTM
try {
    // exec()/execQuery() are printf-style helpers that format the SQL and
    // run it in the given transaction; execQuery() also returns the result.
    exec(srcTxn, "prepare transaction '%s'", gtid);
    exec(dstTxn, "prepare transaction '%s'", gtid);
    // At this moment all nodes have prepared the transaction, so in
    // principle we can make the decision to commit here...
    // But it is easier to wait until a CSN is assigned to the transaction
    // and it is delivered to all nodes.
    exec(srcTxn, "select dtm_begin_prepare('%s')", gtid);
    exec(dstTxn, "select dtm_begin_prepare('%s')", gtid);
    csn = execQuery(srcTxn, "select dtm_prepare('%s', 0)", gtid);
    csn = execQuery(dstTxn, "select dtm_prepare('%s', %lu)", gtid, csn);
    exec(srcTxn, "select dtm_end_prepare('%s', %lu)", gtid, csn);
    exec(dstTxn, "select dtm_end_prepare('%s', %lu)", gtid, csn);
    // If we reach this point, we make the decision to commit...
    transComplete = string("commit prepared '") + gtid + "'";
} catch (pqxx_exception const& x) {
    // ... if not, then roll back the prepared transaction
    transComplete = string("rollback prepared '") + gtid + "'";
}
pipeline srcPipe(srcTxn);
pipeline dstPipe(dstTxn);
srcPipe.insert(transComplete);
dstPipe.insert(transComplete);
srcPipe.complete();
dstPipe.complete();


On 11/15/2015 08:59 PM, Kevin Grittner wrote:

On Saturday, November 14, 2015 8:41 AM, Craig Ringer  
wrote:

On 13 November 2015 at 21:35, Michael Paquier  wrote:

On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas  wrote:

If the database is corrupted, there's no way to guarantee that
anything works as planned.  This is like criticizing somebody's
disaster recovery plan on the basis that it will be inadequate if the
entire planet earth is destroyed.

+1

Once all participating servers return "success" from the "prepare"
phase, the only sane way to proceed is to commit as much as
possible.  (I guess you still have some room to argue about what to
do after an error on the attempt to commit the first prepared
transaction, but even there you need to move forward if it might
have been due to a communications failure preventing a success
indication for a commit which was actually successful.)  The idea
of two phase commit is that failure to commit after a successful
prepare should be extremely rare.


As well as FS, OS, or network problems... To come back to the point, my
point is simply that I found surprising Konstantin's sentence upthread
saying that if commit fails on some of the nodes we should roll back the
prepared transaction on all nodes. In the example given, in the phase
after calling dtm_end_prepare, say we perform COMMIT PREPARED correctly
on node 1, but then fail it on node 2 because a meteor has hit a server;
it seems that we cannot roll back. Instead we had better roll in a
backup and make sure that the transaction gets committed. How would you
roll back the transaction already committed on node 1? But perhaps I
missed something...

Right.


The usual way this works in an XA-like model is:

In phase 1 (prepare transaction, in Pg's spelling), failure on
any node triggers a rollback on all nodes.

In phase 2 (commit prepared), failure on any node causes retries
until it succeeds, or until the admin intervenes - say, to remove
that node from operation. The global xact as a whole isn't
considered successful until it's committed on all nodes.

2PC and distributed commit is well studied, including the
problems. We don't have to think this up for ourselves. We don't
have to invent anything here. There's a lot of distributed
systems theory to work with - especially when dealing with well
studied relational DBs trying to maintain ACID semantics.

Right; it is silly not to build on decades of work on theory and
practice in this area.

What is making me nervous as I watch this thread is a bit of loose
talk about the I in ACID.  There have been some points where it
seemed to be implied that we 

Re: [HACKERS] psql: add \pset true/false

2015-11-15 Thread Jim Nasby

On 11/12/15 3:09 PM, Tom Lane wrote:

Peter Eisentraut  writes:

Plus we already have \pset numericlocale as a similar feature in psql.


But \pset numericlocale is also a crock.  It doesn't affect COPY output
for instance, and its ability to identify which data types it should apply
to is really shaky.  And it's virtually unused, as demonstrated by the
fact that serious bugs in it went undetected for many years (cf 4778a0bda,
77130fc14).  That's a really poor advertisement for the usefulness of the
proposed feature.


My $0.02 is that something that's type-specific belongs with the type, 
not with clients.


As to the argument about displaying a check or an X, why should that 
capability only exist for boolean types? For example, why not allow psql 
to convert a numeric value into a bar of varying size? I've frequently 
emulated that with something like SELECT repeat( '*', blah * 30 / 
max_of_blah ). I'm sure there are other examples people could think of.
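
Spelled out as a runnable query (a sketch; the table and column names
are made up):

    SELECT label,
           repeat('*', (blah * 30 / max(blah) OVER ())::int) AS bar
    FROM   some_table;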

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com




Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-11-15 Thread Robert Haas
On Fri, Nov 13, 2015 at 11:37 AM, Ildus Kurbangaliev wrote:
> Thanks! That's very inspiring.

Cool.  It was great work; I am very happy with how it turned out.

> I have some questions about next steps on other tranches.
> First of all, I think we can have two API calls, something like:
>
> 1) LWLockRequestTranche(char *tranche_name, int locks_count)
> 2) LWLockGetTranche(char *tranche_name)
>
> LWLockRequestTranche reserves an item in the main tranches array in
> shared memory, and allocates space for the name, the LWLockTranche, and
> the LWLocks.  Here is the first question: is it OK if this call comes
> from the *ShmemSize functions?  We keep those requests, calculate the
> required size in LWLockShmemSize (by moving this call to the end), and
> create all these tranches in CreateLWLocks.

I think what we should do about the buffer locks is polish up this
patch and get it applied:

http://www.postgresql.org/message-id/20150907175909.gd5...@alap3.anarazel.de

I think it needs to be adapted to use predefined constants for the
tranche IDs instead of hardcoded values, maybe based on the attached
tranche-constants.patch.  This approach could be extended for the
other stuff in the main tranche, and I think that would be cleaner
than inventing LWLockRequestTranche as you are proposing.  Just make
the tranche IDs constants, and then the rest fits nicely into the
framework we've already got.

Also, I think we should rip all the volatile qualifiers out of
bufmgr.c, using the second attached patch, devolatileize-bufmgr.patch.
The comment claiming that we need this because spinlocks aren't
compiler barriers is obsolete.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
commit bbf99acca380543c2943c79454ac07b97fd6f2f2
Author: Robert Haas 
Date:   Sun Nov 15 10:37:19 2015 -0500

tranche-ify buffer locks

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 6bca02c..43534d6 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -445,7 +445,7 @@ CreateLWLocks(void)
 
 		/* Initialize all LWLocks in main array */
 		for (id = 0, lock = MainLWLockArray; id < numLocks; id++, lock++)
-			LWLockInitialize(&lock->lock, 0);
+			LWLockInitialize(&lock->lock, LWTRANCHE_MAIN);
 
 		/*
 		 * Initialize the dynamic-allocation counters, which are stored just
@@ -457,7 +457,7 @@ CreateLWLocks(void)
 		LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * sizeof(int));
 		LWLockCounter[0] = NUM_FIXED_LWLOCKS;
 		LWLockCounter[1] = numLocks;
-		LWLockCounter[2] = 1;	/* 0 is the main array */
+		LWLockCounter[2] = LWTRANCHE_LAST_BUILTIN_ID + 1;
 	}
 
 	if (LWLockTrancheArray == NULL)
@@ -466,12 +466,13 @@ CreateLWLocks(void)
 		LWLockTrancheArray = (LWLockTranche **)
 			MemoryContextAlloc(TopMemoryContext,
 		  LWLockTranchesAllocated * sizeof(LWLockTranche *));
+		Assert(LWLockTranchesAllocated > LWTRANCHE_LAST_BUILTIN_ID);
 	}
 
 	MainLWLockTranche.name = "main";
 	MainLWLockTranche.array_base = MainLWLockArray;
 	MainLWLockTranche.array_stride = sizeof(LWLockPadded);
-	LWLockRegisterTranche(0, &MainLWLockTranche);
+	LWLockRegisterTranche(LWTRANCHE_MAIN, &MainLWLockTranche);
 }
 
 /*
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 4653e09..a5fbd20 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -177,6 +177,15 @@ extern void LWLockRegisterTranche(int tranche_id, LWLockTranche *tranche);
 extern void LWLockInitialize(LWLock *lock, int tranche_id);
 
 /*
+ * We reserve a few predefined tranche IDs.  These values will never be
+ * returned by LWLockNewTrancheId.
+ */
+#define LWTRANCHE_MAIN		0
+#define LWTRANCHE_BUFFER_CONTENT			1
+#define LWTRANCHE_BUFFER_IO_IN_PROGRESS		2
+#define LWTRANCHE_LAST_BUILTIN_ID			LWTRANCHE_BUFFER_IO_IN_PROGRESS
+
+/*
  * Prior to PostgreSQL 9.4, we used an enum type called LWLockId to refer
  * to LWLocks.  New code should instead use LWLock *.  However, for the
  * convenience of third-party code, we include the following typedef.
From 9e7f9219b5e752da46be0e26a0be074191ae8f62 Mon Sep 17 00:00:00 2001
From: Robert Haas 
Date: Sun, 15 Nov 2015 10:24:03 -0500
Subject: [PATCH 1/3] De-volatile-ize.

---
 src/backend/storage/buffer/bufmgr.c   | 95 +--
 src/backend/storage/buffer/freelist.c | 18 +++
 src/include/storage/buf_internals.h   | 11 ++--
 3 files changed, 59 insertions(+), 65 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8c0358e..63188a3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -92,11 +92,11 @@ int			effective_io_concurrency = 0;
 int			target_prefetch_pages = 0;
 
 /* local state for StartBufferIO and related functions */
-static volatile BufferDesc *InProgressBuf = NULL;
+static BufferDesc *InProgressBuf = NULL;
 static bool IsForInput;
 
 /* local state for 

Re: [HACKERS] check for interrupts in set_rtable_names

2015-11-15 Thread Tom Lane
Jeff Janes  writes:
> On Fri, Nov 13, 2015 at 3:13 PM, Tom Lane  wrote:
>> There's probably no reason not to do that, but I'd be much more interested
>> in eliminating the slowness to begin with ...

> I was thinking about that as well, but I don't think that would be
> back-patchable, at least not the way I was envisioning it.

> I was thinking of detecting bad cases (had to count to over 10 before
> finding a novel name, more than 10 times) and then switching from an
> object-local count, to a global count, for the numbers to add to the
> name.  But that would be a behavior change under some conditions.

I experimented with using a hash table to avoid the O(N^2) behavior.
This seems to work quite well, and I think it doesn't change the results
(leastwise, it does not break the regression tests).

It would be possible to do something similar to dodge the O(N^2) costs
in make_colname_unique/colname_is_unique; but it would be a lot messier,
and the tests I did suggest that it's fairly hard to get into a regime
where those functions are a huge part of the runtime anyway, because
we have limits on the numbers of columns.  So I'm inclined to leave that
alone.

regards, tom lane

diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 3388f7c..2b6de8f 100644
*** a/src/backend/utils/adt/ruleutils.c
--- b/src/backend/utils/adt/ruleutils.c
***
*** 55,60 
--- 55,61 
  #include "utils/array.h"
  #include "utils/builtins.h"
  #include "utils/fmgroids.h"
+ #include "utils/hsearch.h"
  #include "utils/lsyscache.h"
  #include "utils/rel.h"
  #include "utils/ruleutils.h"
*** typedef struct
*** 267,272 
--- 268,282 
  #define deparse_columns_fetch(rangetable_index, dpns) \
  	((deparse_columns *) list_nth((dpns)->rtable_columns, (rangetable_index)-1))
  
+ /*
+  * Entry in set_rtable_names' hash table
+  */
+ typedef struct
+ {
+ 	char		name[NAMEDATALEN];		/* Hash key --- must be first */
+ 	int			counter;		/* Largest addition used so far for name */
+ } NameHashEntry;
+ 
  
  /* --
   * Global data
*** static void print_function_rettype(Strin
*** 312,319 
  static void print_function_trftypes(StringInfo buf, HeapTuple proctup);
  static void set_rtable_names(deparse_namespace *dpns, List *parent_namespaces,
   Bitmapset *rels_used);
- static bool refname_is_unique(char *refname, deparse_namespace *dpns,
-   List *parent_namespaces);
  static void set_deparse_for_query(deparse_namespace *dpns, Query *query,
  	  List *parent_namespaces);
  static void set_simple_column_names(deparse_namespace *dpns);
--- 322,327 
*** static void
*** 2676,2690 
  set_rtable_names(deparse_namespace *dpns, List *parent_namespaces,
   Bitmapset *rels_used)
  {
  	ListCell   *lc;
- 	int			rtindex = 1;
  
  	dpns->rtable_names = NIL;
  	foreach(lc, dpns->rtable)
  	{
  		RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);
  		char	   *refname;
  
  		if (rels_used && !bms_is_member(rtindex, rels_used))
  		{
  			/* Ignore unreferenced RTE */
--- 2684,2744 
  set_rtable_names(deparse_namespace *dpns, List *parent_namespaces,
   Bitmapset *rels_used)
  {
+ 	HASHCTL		hash_ctl;
+ 	HTAB	   *names_hash;
+ 	NameHashEntry *hentry;
+ 	bool		found;
+ 	int			rtindex;
  	ListCell   *lc;
  
  	dpns->rtable_names = NIL;
+ 	/* nothing more to do if empty rtable */
+ 	if (dpns->rtable == NIL)
+ 		return;
+ 
+ 	/*
+ 	 * We use a hash table to hold known names, so that this process is O(N)
+ 	 * not O(N^2) for N names.
+ 	 */
+ 	MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ 	hash_ctl.keysize = NAMEDATALEN;
+ 	hash_ctl.entrysize = sizeof(NameHashEntry);
+ 	hash_ctl.hcxt = CurrentMemoryContext;
+ 	names_hash = hash_create("set_rtable_names names",
+ 			 list_length(dpns->rtable),
+ 			 &hash_ctl,
+ 			 HASH_ELEM | HASH_CONTEXT);
+ 	/* Preload the hash table with names appearing in parent_namespaces */
+ 	foreach(lc, parent_namespaces)
+ 	{
+ 		deparse_namespace *olddpns = (deparse_namespace *) lfirst(lc);
+ 		ListCell   *lc2;
+ 
+ 		foreach(lc2, olddpns->rtable_names)
+ 		{
+ 			char	   *oldname = (char *) lfirst(lc2);
+ 
+ 			if (oldname == NULL)
+ continue;
+ 			hentry = (NameHashEntry *) hash_search(names_hash,
+    oldname,
+    HASH_ENTER,
+    &found);
+ 			/* we do not complain about duplicate names in parent namespaces */
+ 			hentry->counter = 0;
+ 		}
+ 	}
+ 
+ 	/* Now we can scan the rtable */
+ 	rtindex = 1;
  	foreach(lc, dpns->rtable)
  	{
  		RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);
  		char	   *refname;
  
+ 		/* Just in case this takes an unreasonable amount of time ... */
+ 		CHECK_FOR_INTERRUPTS();
+ 
  		if (rels_used && !bms_is_member(rtindex, rels_used))
  		{
  			/* Ignore unreferenced RTE */
*** set_rtable_names(deparse_namespace *dpns
*** 2712,2767 
  		}
  
  		/*
! 		 * If 

Re: [HACKERS] Parallel Seq Scan

2015-11-15 Thread Gavin Flower

On 16/11/15 12:05, Robert Haas wrote:

On Fri, Nov 13, 2015 at 10:46 AM, Thom Brown  wrote:

And perhaps associated PIDs?

Yeah, that can be useful, if others also feel like it is important, I can
look into preparing a patch for the same.

Thanks.

Thom, what do you think the EXPLAIN output should look like,
specifically?  Or anyone else who feels like answering.

I don't think it would be very useful to repeat the entire EXPLAIN
output n times, once per worker.  That sounds like a loser.  But we
could add additional lines to the output for each node, like this:

  Parallel Seq Scan on foo  (cost=0.00..XXX rows=YYY width=ZZZ) (actual
time=AAA..BBB rows=CCC loops=1)
   Leader: actual time=AAA..BBB rows=CCC loops=1
   Worker 0: actual time=AAA..BBB rows=CCC loops=1
   Worker 1: actual time=AAA..BBB rows=CCC loops=1
   Worker 2: actual time=AAA..BBB rows=CCC loops=1

If "buffers" is specified, we could display the summary information
after the Parallel Seq Scan as normal and then display an additional
per-worker line after the "Leader" line and each "Worker N" line.  I
think displaying the worker index is more useful than displaying the
PID, especially if we think that a plan tree like this might ever get
executed multiple times with different PIDs on each pass.

Like?  Dislike?  Other ideas?


Possibly have an option to include the PID?

Consider altering the format field width of the Worker number (depending 
on the number of workers) so you don't get:

   Worker 9 ...
   Worker 10 ...
but something like
   Worker  9 ...
   Worker 10 ...



Cheers,
Gavin





Re: [HACKERS] proposal: multiple psql option -c

2015-11-15 Thread Peter Eisentraut
On 11/15/15 9:53 AM, Andrew Dunstan wrote:
> I suggest you review the original thread on this subject before a line
> was ever written. "multiple" (see subject line on this whole thread) is
> clearly what is being asked for. Making people turn that into a single
> argument is not what was envisaged. See for example Pavel's original
> example involving use of xargs where that's clearly not at all easy.

I can see (small) value in having a new option that is like -c but
interprets the string as a fully-featured script like -f.  (Small
because the same behavior can already be had with here strings in bash.)

The behavior should be exactly like -f, including all the behavior with
single-transaction and single-step modes or whatever.

But then I will point out that we currently don't handle multiple -f
options.





Re: [HACKERS] psql: add \pset true/false

2015-11-15 Thread Peter Eisentraut
On 11/12/15 4:09 PM, Tom Lane wrote:
> Peter Eisentraut  writes:
>> Plus we already have \pset numericlocale as a similar feature in psql.
> 
> But \pset numericlocale is also a crock.  It doesn't affect COPY output
> for instance, and its ability to identify which data types it should apply
> to is really shaky.  And it's virtually unused, as demonstrated by the
> fact that serious bugs in it went undetected for many years (cf 4778a0bda,
> 77130fc14).  That's a really poor advertisement for the usefulness of the
> proposed feature.

Just because people don't find much use for a feature doesn't mean it's
wrong in principle.

Clearly, people occasionally want to change how a data type displays in
psql.  We could implement this in the server.  But then we put a burden
on all non-psql clients to begin all their connections with a growing
number of RESET foo_display calls, and that's not very efficient.  I
think it's up to each client to figure out how to display stuff and to
customize it if necessary.




Re: [HACKERS] pglogical_output - a general purpose logical decoding output plugin

2015-11-15 Thread Craig Ringer
On 16 November 2015 at 09:57, Peter Eisentraut  wrote:

> On 11/2/15 7:17 AM, Craig Ringer wrote:
> > The output plugin is suitable for a number of uses. It's designed
> > primarily to supply a data stream to drive a logical replication
> > client running in another PostgreSQL instance, like the pglogical
> > client discussed at PGConf.EU 2015.
>
> So where is that client?
>

Not finished baking yet - in particular, the catalogs and UI are still in
flux. The time scale for getting that out is in the order of a few weeks.

The output plugin stands alone to a fair degree, especially with the json
output support. Comments would be greatly appreciated, especially from
people who're involved in replication, are currently using triggers to feed
data to external systems, etc.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Minor comment improvement to create_foreignscan_plan

2015-11-15 Thread Etsuro Fujita

On 2015/11/12 19:02, Etsuro Fujita wrote:

On 2015/11/10 3:53, Robert Haas wrote:

On Mon, Nov 9, 2015 at 5:34 AM, Etsuro Fujita wrote:

Here is a small patch to update a comment in create_foreignscan_plan;
add fdw_recheck_quals to the list of expressions that need the
replace_nestloop_params processing.  I should have updated the comment
when I proposed the patch for the fdw_recheck_quals.



OK, not a big deal, but thanks.  Committed.



Thanks!


Oops, I've found another one.  I think we should update a comment in 
postgresGetForeignPlan, too; add remote filtering expressions to the 
list of information needed to create a ForeignScan node.


Best regards,
Etsuro Fujita
*** a/contrib/postgres_fdw/postgres_fdw.c
--- b/contrib/postgres_fdw/postgres_fdw.c
***
*** 902,908  postgresGetForeignPlan(PlannerInfo *root,
  			 retrieved_attrs);
  
  	/*
! 	 * Create the ForeignScan node from target list, local filtering
  	 * expressions, remote parameter expressions, and FDW private information.
  	 *
  	 * Note that the remote parameter expressions are stored in the fdw_exprs
--- 902,908 
  			 retrieved_attrs);
  
  	/*
! 	 * Create the ForeignScan node from target list, remote/local filtering
  	 * expressions, remote parameter expressions, and FDW private information.
  	 *
  	 * Note that the remote parameter expressions are stored in the fdw_exprs



Re: [HACKERS] Proposing COPY .. WITH PERMISSIVE

2015-11-15 Thread Michael Paquier
On Mon, Nov 16, 2015 at 11:05 AM, Peter Eisentraut  wrote:
> On 9/2/15 4:19 PM, Stefan Kaltenbrunner wrote:
>> maybe - but having a fixed "default" is very different from baking a
>> classic unix permission concept of user/group/world^others into actual
>> DDL or into a COPY option. The proposed syntax might make some sense to
>> an admin used to a unix-style system, but it is likely utterly
>> incomprehensible to somebody who is used to (Windows-style) ACLs.
>>
>> I don't have a good answer on what else to do atm, but I don't think we
>> should embed traditional/historical unix permission models in our
>> grammar unless really really needed...
>
> The user can just create a directory with appropriate permissions and
> copy the files there.  I don't really see why COPY needs to know about
> all this.

In short, it seems that this patch is best rejected, and I am planning
to do so if there are no complaints.
-- 
Michael




Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-15 Thread Etsuro Fujita

On 2015/11/13 11:31, Kouhei Kaigai wrote:

On 2015/11/12 2:53, Robert Haas wrote:

  From my point of view I'm now
thinking this solution has two parts:

(1) Let foreign scans have inner and outer subplans.  For this
purpose, we only need one, but it's no more work to enable both, so we
may as well.  If we had some reason, we could add a list of subplans
of arbitrary length, but there doesn't seem to be an urgent need for
that.


I wrote:

But one thing I'm concerned about is enabling both inner and outer
plans, because I think that would make the planner postprocessing
complicated, depending on what the foreign scans do with the
inner/outer subplans.  Is it worth doing so?  Maybe I'm missing
something, though.



If you want to persuade another person who has a different opinion, you
need to explain why it was complicated, how complicated it was, and
what solution you tried at the time.  "Complicated" is a subjective
term. At least, we don't share your experience, so it is hard to
understand the degree of complexity.


I don't mean to object to that idea.  I'm just unfamiliar with it, so I 
wanted to know the reason, or the use cases.


Best regards,
Etsuro Fujita





Re: [HACKERS] Proposal: "Causal reads" mode for load balancing reads without stale data

2015-11-15 Thread Thomas Munro
On Sun, Nov 15, 2015 at 11:41 PM, Simon Riggs  wrote:

> On 12 November 2015 at 18:25, Thomas Munro wrote:
>
>
>>  I don't want to get bogged down in details, while we're talking about
>> the 30,000 foot view).
>>
>
> Hmm, if that's where we're at, I'll summarize my thoughts.
>
> All of this discussion presupposes we are distributing/load balancing
> queries so that reads and writes might occur on different nodes.
>
> We need a good balancer. Any discussion of this that ignores the balancer
> component is only talking about half the solution. What we need to do is
> decide whether functionality should live in the balancer or the core.
>
> Your option (1) is viable, but only in certain cases. We could add support
> for some token/wait mechanism but as you say, this would require
> application changes not pooler changes.
>
> Your option (2) is wider but also worse in some ways. It can be
> implemented in a pooler.
>
> Your option (3) doesn't excite me much. You've got a load of stuff that
> really should happen in a pooler. And at its core we have
> synchronous_commit = apply but with a timeout rather than a wait. So
> anyway, consider me nudged to finish my patch to provide capability for
> that by 1 Jan.
>

Just to be clear, this patch doesn't use a "timeout rather than a wait".
It always waits for the current set of available causal reads standbys to
apply the commit.  It's just that nodes get kicked out of that set pretty
soon if they don't keep up, a bit like a RAID controller dropping a failing
disk.  And it does so using a protocol that ensures that the dropped
standby starts raising the error, even if contact has been lost with it, so
the causal reads guarantee is maintained at all times for all clients.

On a related note, any further things like "GUC causal_reads_standby_names"
> should be implemented by Node Registry as a named group of nodes. We can
> have as many arbitrary groups of nodes as we want. If that sounds strange
> look back at exactly why GUCs are called GUCs.
>

Agreed, the application_name whitelist stuff is clunky.  I left it out of
the first version I posted, not wanting the focus of this proposal to be
side-tracked.  But as Ants Aasma pointed out, some users might need
something like that, so I posted a 2nd version that follows the established
example, again not wanting to distract with anything new in that area.  Of
course that would eventually be replaced/improved as part of a future node
topology management project.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] [PATCH] Refactoring of LWLock tranches

2015-11-15 Thread and...@anarazel.de
On 2015-11-15 16:24:24 -0500, Robert Haas wrote:
> I think it needs to be adapted to use predefined constants for the
> tranche IDs instead of hardcoded values, maybe based on the attached
> tranche-constants.patch.

Yea, that part is clearly not ok. Let me look at the patch.

> Also, I think we should rip all the volatile qualifiers out of
> bufmgr.c, using the second attached patch, devolatileize-bufmgr.patch.
> The comment claiming that we need this because spinlocks aren't
> compiler barriers is obsolete.

Same here.

> @@ -457,7 +457,7 @@ CreateLWLocks(void)
>   LWLockCounter = (int *) ((char *) MainLWLockArray - 3 * 
> sizeof(int));
>   LWLockCounter[0] = NUM_FIXED_LWLOCKS;
>   LWLockCounter[1] = numLocks;
> - LWLockCounter[2] = 1;   /* 0 is the main array */
> + LWLockCounter[2] = LWTRANCHE_LAST_BUILTIN_ID + 1;
>   }

Man, this LWLockCounter[0] stuff is just awful. Nothing that needs to
be fixed here, but it really should be fixed someday.

>  /*
> + * We reserve a few predefined tranche IDs.  These values will never be
> + * returned by LWLockNewTrancheId.
> + */
> +#define LWTRANCHE_MAIN   0
> +#define LWTRANCHE_BUFFER_CONTENT 1
> +#define LWTRANCHE_BUFFER_IO_IN_PROGRESS  2
> +#define LWTRANCHE_LAST_BUILTIN_ID
> LWTRANCHE_BUFFER_IO_IN_PROGRESS

Nitpick: I'm inclined to use an enum to avoid having to adjust the last
builtin id when adding a new builtin tranche.


Besides that minor thing I think this works for me. We might
independently want something making adding non-builtin tranches easier,
but that's really just independent.

> From 9e7f9219b5e752da46be0e26a0be074191ae8f62 Mon Sep 17 00:00:00 2001
> From: Robert Haas 
> Date: Sun, 15 Nov 2015 10:24:03 -0500
> Subject: [PATCH 1/3] De-volatile-ize.

I very strongly think this should be done. It's painful having to
cast-away volatile, and it de-optimizes a fair bit of performance
relevant code.

>  /* local state for StartBufferIO and related functions */
> -static volatile BufferDesc *InProgressBuf = NULL;
> +static BufferDesc *InProgressBuf = NULL;
>  static bool IsForInput;
>  
>  /* local state for LockBufferForCleanup */
> -static volatile BufferDesc *PinCountWaitBuf = NULL;
> +static BufferDesc *PinCountWaitBuf = NULL;

Hm. These could also be relevant for sigsetjmp et al. Haven't found
relevant code tho.

> - volatile BufferDesc *bufHdr;
> + BufferDesc *bufHdr;
>   Block   bufBlock;

Looks mis-indented now, similarly in a bunch of other places. Maybe
pg-indent afterwards?


>   /*
>* Header spinlock is enough to examine BM_DIRTY, see comment in
> @@ -1707,7 +1707,7 @@ BufferSync(int flags)
>   num_written = 0;
>   while (num_to_scan-- > 0)
>   {
> - volatile BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
> + BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
>

This case has some theoretical behavioural implications: As
bufHdr->flags is accessed without a spinlock the compiler might re-use
an older value. But that's ok: a) there's no old value it really could
use b) the whole race-condition exists anyway, and the comment in the
body explains why that's ok.

>  BlockNumber
>  BufferGetBlockNumber(Buffer buffer)
>  {
> - volatile BufferDesc *bufHdr;
> + BufferDesc *bufHdr;
>  
>   Assert(BufferIsPinned(buffer));
>  
> @@ -2346,7 +2346,7 @@ void
>  BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
>BlockNumber *blknum)
>  {
> - volatile BufferDesc *bufHdr;
> + BufferDesc *bufHdr;

>   /* Do the same checks as BufferGetBlockNumber. */
>   Assert(BufferIsPinned(buffer));
> @@ -2382,7 +2382,7 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, 
> ForkNumber *forknum,
>   * as the second parameter.  If not, pass NULL.
>   */
>  static void
> -FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
> +FlushBuffer(BufferDesc *buf, SMgrRelation reln)
>  {

>   XLogRecPtr  recptr;
>   ErrorContextCallback errcallback;
> @@ -2520,7 +2520,7 @@ RelationGetNumberOfBlocksInFork(Relation relation, 
> ForkNumber forkNum)
>  bool
>  BufferIsPermanent(Buffer buffer)
>  {
> - volatile BufferDesc *bufHdr;
> + BufferDesc *bufHdr;


These require that the buffer is pinned (a barrier implying
operation). The pin prevents the contents from changing in a relevant
manner, the barrier implied in the pin forces the core's view to be
non-stale.


> @@ -2613,7 +2613,7 @@ DropRelFileNodeBuffers(RelFileNodeBackend rnode, 
> ForkNumber forkNum,
>  
>   for (i = 0; i < NBuffers; i++)
>   {
> - volatile BufferDesc *bufHdr = GetBufferDescriptor(i);
> + BufferDesc *bufHdr = GetBufferDescriptor(i);

>   /*
>  

Re: [HACKERS] [PATCH] Skip ALTER x SET SCHEMA if the schema didn't change

2015-11-15 Thread Haribabu Kommi
On Thu, Nov 5, 2015 at 10:20 PM, Haribabu Kommi wrote:
> On Tue, Sep 29, 2015 at 12:17 AM, Marti Raudsepp  wrote:
>> Hi list
>>
>> The attached patch changes the behavior of multiple ALTER x SET SCHEMA
>> commands, to skip, rather than fail, when the old and new schema is
>> the same.
>>
>> The advantage is that it's now easier to write DDL scripts that are 
>> idempotent.
>>
>> This already matches the behavior of ALTER EXTENSION SET SCHEMA in
>> earlier versions, as well as many other SET-ish commands, e.g. ALTER
>> TABLE SET TABLESPACE, OWNER TO, CLUSTER ON, SET (storage_parameter...)
>> etc. I don't see why SET SCHEMA should be treated any differently.
>>
>> The code is written such that object_access_hook is still called for
>> each object.
>>
>> Regression tests included. I couldn't find any documentation that
>> needs changing.
>
> I went through the patch; following are my observations:
>
> The patch applied with hunks and compiled without warnings.
> Basic tests passed.
>
> In the AlterTableNamespaceInternal function, if a table or matview is
> the target of SET SCHEMA and the object contains any constraints, the
> constraints get updated with the new schema.
>
> In the AlterTypeNamespaceInternal function, InvokeObjectPostAlterHook
> doesn't get called if the type is a composite type, domain, or array
> type, because the function just returns at the top for those.

Most of the community members didn't find any problem with changing the
behavior, so I have attached an updated patch with the above two corrections.

Regards,
Hari Babu
Fujitsu Australia


0001-Skip-ALTER-x-SET-SCHEMA-if-the-schema-didn-t-change_v2.patch
Description: Binary data



Re: [HACKERS] [PATCH] SQL function to report log message

2015-11-15 Thread Peter Eisentraut
On 10/22/15 3:20 AM, dinesh kumar wrote:
> +  pg_report_log(loglevel text, message anyelement
> +                [, ishidestmt boolean] [, detail text]
> +                [, hint text] [, sqlstate text])
> +  returns void
> +  Report message or error.

I haven't seen this discussed before, but I don't find the name
pg_report_log particularly good.  Why not just pg_log?






Re: [HACKERS] Parallel Seq Scan

2015-11-15 Thread Robert Haas
On Fri, Nov 13, 2015 at 10:46 AM, Thom Brown  wrote:
>>> And perhaps associated PIDs?
>>
>> Yeah, that can be useful, if others also feel like it is important, I can
>> look into preparing a patch for the same.
>
> Thanks.

Thom, what do you think the EXPLAIN output should look like,
specifically?  Or anyone else who feels like answering.

I don't think it would be very useful to repeat the entire EXPLAIN
output n times, once per worker.  That sounds like a loser.  But we
could add additional lines to the output for each node, like this:

 Parallel Seq Scan on foo  (cost=0.00..XXX rows=YYY width=ZZZ) (actual
time=AAA..BBB rows=CCC loops=1)
  Leader: actual time=AAA..BBB rows=CCC loops=1
  Worker 0: actual time=AAA..BBB rows=CCC loops=1
  Worker 1: actual time=AAA..BBB rows=CCC loops=1
  Worker 2: actual time=AAA..BBB rows=CCC loops=1

If "buffers" is specified, we could display the summary information
after the Parallel Seq Scan as normal and then display an additional
per-worker line after the "Leader" line and each "Worker N" line.  I
think displaying the worker index is more useful than displaying the
PID, especially if we think that a plan tree like this might ever get
executed multiple times with different PIDs on each pass.

Like?  Dislike?  Other ideas?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] psql: add \pset true/false

2015-11-15 Thread Peter Eisentraut
On 11/15/15 3:20 PM, Jim Nasby wrote:
> As to the argument about displaying a check or an X, why should that
> capability only exist for boolean types? For example, why not allow psql
> to convert a numeric value into a bar of varying sizes? I've frequently
> emulated that with something like SELECT repeat( '*', blah * 30 /
> max_of_blah ). I'm sure there's other examples people could think of.

Well, why not?  The question there is only how many marginal features
you want to stuff into psql, not whether it's the right place to stuff them.




Re: [HACKERS] psql: add \pset true/false

2015-11-15 Thread Peter Geoghegan
On Thu, Nov 12, 2015 at 1:09 PM, Tom Lane  wrote:
> Peter Eisentraut  writes:
>> Plus we already have \pset numericlocale as a similar feature in psql.
>
> But \pset numericlocale is also a crock.  It doesn't affect COPY output
> for instance, and its ability to identify which data types it should apply
> to is really shaky.  And it's virtually unused, as demonstrated by the
> fact that serious bugs in it went undetected for many years (cf 4778a0bda,
> 77130fc14).  That's a really poor advertisement for the usefulness of the
> proposed feature.

FWIW, I didn't realize that we had "\pset numericlocale" until about a
year ago. I now use it all the time, and find it very useful.
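
For anyone who hasn't tried it (a sketch; the output depends on the
client's locale):

    \pset numericlocale on
    SELECT 1234567.89::numeric;  -- printed with locale digit grouping,
                                 -- e.g. 1,234,567.89 under en_US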

-- 
Peter Geoghegan




Re: [HACKERS] pglogical_output - a general purpose logical decoding output plugin

2015-11-15 Thread Peter Eisentraut
On 11/2/15 7:17 AM, Craig Ringer wrote:
> The output plugin is suitable for a number of uses. It's designed
> primarily to supply a data stream to drive a logical replication
> client running in another PostgreSQL instance, like the pglogical
> client discussed at PGConf.EU 2015.

So where is that client?




Re: [HACKERS] Proposing COPY .. WITH PERMISSIVE

2015-11-15 Thread Peter Eisentraut
On 9/2/15 4:19 PM, Stefan Kaltenbrunner wrote:
> maybe - but having a fixed "default" is very different from baking a
> classic unix permission concept of user/group/world^others into actual
> DDL or into a COPY option. The proposed syntax might make some sense to
> an admin used to a unix-style system, but it is likely utterly
> incomprehensible to somebody who is used to (Windows-style) ACLs.
> 
> I don't have a good answer on what else to do atm, but I don't think we
> should embed traditional/historical unix permission models in our
> grammar unless really really needed...

The user can just create a directory with appropriate permissions and
copy the files there.  I don't really see why COPY needs to know about
all this.





Re: [HACKERS] Parallel Seq Scan

2015-11-15 Thread Robert Haas
On Sun, Nov 15, 2015 at 1:12 AM, Amit Kapila  wrote:
> Thanks for the report.
>
> I think the main reason for the leak in workers is that one of the
> buffers used while sending tuples from worker to master (in function
> BuildRemapInfo) is not getting freed, and it is allocated for each
> tuple the worker sends back to the master.  I couldn't find any use of
> that buffer, so I think we can avoid the allocation altogether, or at
> least we need to free it.  The attached patch
> remove_unused_buf_allocation_v1.patch should fix the issue.

Oops.  Committed.

> Another thing I have noticed is that we need to build the remap info
> only when the target list contains record-type attrs, so ideally we
> should not even take this path when such attrs are not present.  The
> reason this happened was that the tuple descriptor stored in
> TQueueDestReceiver was not updated; the attached patch
> fix_initialization_tdesc_v1 fixes this issue.

I don't understand this part.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] [PATCH] SQL function to report log message

2015-11-15 Thread Craig Ringer
On 16 November 2015 at 09:50, Peter Eisentraut  wrote:
>
>
> I haven't seen this discussed before, but I don't find the name
> pg_report_log particularly good.  Why not jut pg_log?
>
>
Sounds like a better name to me. 'report' is noise that adds nothing useful.

I'd like to have this functionality.

I'd prefer to omit fields if explicitly assigned to NULL. You can always
use coalesce if you want the string 'NULL'; I agree with others here that
the vast majority of users will want the field just omitted.
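
For reference, a call following the signature quoted from the patch's
documentation upthread (a sketch of the proposed function; nothing here
is committed):

    SELECT pg_report_log('WARNING', 'low on disk space', true,
                         'only 5% of the tablespace remains',  -- detail
                         'consider adding storage',            -- hint
                         '01000');                             -- sqlstate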

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] proposal: PL/Pythonu - function ereport

2015-11-15 Thread Peter Eisentraut
I don't think it's right to reuse SPIError for this.  SPIError is
clearly meant to signal an error in the SPI calls.  Of course, we can't
stop users from raising whatever exception they want, but if we're going
to advertise that users can raise exceptions, then we should create
separate exception classes.

I suppose the proper way to set this up would be to create a base class
like plpy.Error and derive SPIError from that.
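
From the user's side that might look like this (a sketch assuming the
proposed plpy.Error base class existed; it is not an existing API):

    CREATE FUNCTION fail_loudly() RETURNS void LANGUAGE plpythonu AS $$
    # plpy.Error here is the proposed base class, not an existing one
    raise plpy.Error("something went wrong")
    $$;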





Re: [HACKERS] Making tab-complete.c easier to maintain

2015-11-15 Thread Thomas Munro
On Thu, Nov 12, 2015 at 5:16 PM, Kyotaro HORIGUCHI <
horiguchi.kyot...@lab.ntt.co.jp> wrote:

> Hello. How about regular expressions?
>
> I've been thinking about a better mechanism for tab-completion
> these days, since I found some bugs in it.
>
> At Fri, 23 Oct 2015 14:50:58 -0300, Alvaro Herrera <
> alvhe...@2ndquadrant.com> wrote in <20151023175058.GA3391@alvherre.pgsql>
> > Jeff Janes wrote:
> >
> > > For the bigger picture, I don't think we should not apply this patch
> simply
> > > because there is something even better we might theoretically do at
> some
> > > point in the future.
> >
> > Agreed.
>
> Auto-generating from the grammar would be the ultimate solution, but
> I don't think it will become available. Still, I found that the
> word-splitting-then-match-word-by-word approach is
> terribly unmaintainable and poorly capable. So, how about
> regular expressions?
>
> I tried to use pg_regex in the frontend and found that it is easily
> doable. As a proof of concept, the two patches attached to
> this message make those changes.
>
> 1. 0001-Allow-regex-module-to-be-used-outside-server.patch
>
>   This small change makes it possible to use pg_regex in the
>   frontend.
>
> 2. 0002-Replace-previous-matching-rule-with-regexps.patch
>
>   Simply replaces existing matching rules almost one-by-one with
>   regular expression matches.
>
> I made these patches so as not to change the behavior, except
> for inevitable differences.
>
> We would have far more powerful matching capability using regular
> expressions, and it makes tab-complete.c look simpler. On the
> other hand, the regular expressions - which are stashed away in a new
> file by this patch - are a chunk of complexity and (also) error
> prone. For all that, I think this is better than the current
> situation in terms of maintainability and capability.
>
> This may look stupid because the rules really were replaced stupidly,
> and of course this can be made more sane/effective/maintainable by
> refactoring. But before tackling that, I'm not at all confident that
> this is really an alternative with a *gigantic* improvement.
>
> Any opinions?
>

It's an interesting idea to use regular expressions, but it's a shame to
move the patterns so far away from the actions they trigger.
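
For instance (using POSIX <regex.h> purely for illustration -- the
patch uses pg_regex), keeping each pattern right beside its action
could look something like this:

    #include <regex.h>
    #include <stdbool.h>

    static bool
    line_matches(const char *pattern, const char *buf)
    {
        regex_t     re;
        bool        match;

        /* compiled on the fly for clarity; a real patch would precompile */
        if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
            return false;
        match = (regexec(&re, buf, 0, NULL, 0) == 0);
        regfree(&re);
        return match;
    }

    /* ...then, in psql_completion(), pattern and action stay together: */
    if (line_matches("^[ \t]*ALTER[ \t]+TABLE[ \t]+[^ \t]+[ \t]*$", text))
        /* complete with the list of ALTER TABLE actions here */ ;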

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] pam auth - add rhost item

2015-11-15 Thread Peter Eisentraut
On 10/13/15 4:12 PM, kolo hhmow wrote:
> Yes, sorry. I was in hurry when I posted this message.
> I dont understand whay in CheckPAMAuth function only PAM_USER item is
> adding to pam information before authenticate?
> Wheter it would be a problem to set additional pam information like
> PAM_RHOST which is very useful because we can use this item to restrict
> access to this ip address.

Your implementation uses NI_NUMERICHOST, but the documentation of
pam_set_item speaks of a "hostname".  Which is correct?
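
To spell out the difference being asked about (an illustrative sketch,
not the submitted patch):

    char        hostinfo[NI_MAXHOST];

    /*
     * With NI_NUMERICHOST this yields e.g. "192.0.2.7"; without the
     * flag, getnameinfo() attempts a reverse DNS lookup and may return
     * "client.example.com" instead -- which is what the pam_set_item
     * documentation's "hostname" wording suggests.
     */
    getnameinfo((struct sockaddr *) &port->raddr.addr, port->raddr.salen,
                hostinfo, sizeof(hostinfo), NULL, 0,
                NI_NUMERICHOST);

    retval = pam_set_item(pamh, PAM_RHOST, hostinfo);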



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_hba_lookup function to get all matching pg_hba.conf entries

2015-11-15 Thread Peter Eisentraut
On 7/21/15 5:15 AM, Haribabu Kommi wrote:
> With the output of this view, an administrator can easily identify
> the lines that match the given criteria without going through the
> file.

How is this useful?  I could see the use if you want to debug cases
where user foo on host bar says they can't connect, but you can't
impersonate them to verify it.  But then all you need is a function with
a scalar result, not a result set.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-15 Thread Etsuro Fujita

On 2015/11/13 13:44, Kyotaro HORIGUCHI wrote:

I wrote:

What I think is, I see zero evidence that there is a good use-case for
an FDW to do something other than an ExecProcNode in the callback
routine, as I said below, so I don't see the need to add such a routine
when it would place a perhaps not large, but not small, burden on FDW
authors for writing such a routine.


KaiGai-san wrote:

It is quite natural, because we cannot predict what kind of extension
will be implemented on the FDW interface. You might know the initial
version of PG-Strom was implemented on FDW (about 4 years ago...). If I
had continued to stick with FDW, it would have become an FDW driver with
its own join engine.



From the standpoint of interface design, if we do not admit flexibility
of implementation unless the community sees a working example, a
reasonable tactic *for an extension author* is to follow the interface
restriction even if it is not the best approach from his standpoint.
It does not mean the approach of the majority is also best for the
minority. It just demands a compromise from the minority.



Or try to open the way to introduce the feature he/she wants.


I think the biggest difference between KaiGai-san's patch and mine is
that KaiGai-san's patch introduces a callback routine to allow an FDW
author not only to execute a secondary plan but to do something else
instead of executing the plan, if he/she wants to do so.  His approach
would provide that flexibility, but IMHO major FDWs that would implement
join pushdown, such as postgres_fdw, wouldn't utilize it; they would
probably just execute the secondary plan in the routine.  Furthermore,
to execute the plan, his approach requires that an FDW author add code
not only for creating the plan but for initializing/executing/ending it
in his/her FDW, while in my approach he/she only has to add code for
the plan creation; so his approach would impose a greater development
burden on such major FDWs' authors than mine.  I think the flexibility
would be a good thing, but I also think it's important not to burden
FDW authors.  Maybe I'm missing something, though.


Best regards,
Etsuro Fujita



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] proposal: PL/Pythonu - function ereport

2015-11-15 Thread Pavel Stehule
2015-11-16 5:20 GMT+01:00 Peter Eisentraut :

> I don't think it's right to reuse SPIError for this.  SPIError is
> clearly meant to signal an error in the SPI calls.  Of course, we can't
> stop users from raising whatever exception they want, but if we're going
> to advertise that users can raise exceptions, then we should create
> separate exception classes.
>
> I suppose the proper way to set this up would be to create a base class
> like plpy.Error and derive SPIError from that.
>

Do you have any ideas about the name of this class?

Regards

Pavel


Re: [HACKERS] [PATCH] SQL function to report log message

2015-11-15 Thread dinesh kumar
Hi,

On Mon, Nov 16, 2015 at 2:50 AM, Peter Eisentraut  wrote:

> On 10/22/15 3:20 AM, dinesh kumar wrote:
> > + pg_report_log(loglevel text, message anyelement
> > +               [, ishidestmt boolean] [, detail text]
> > +               [, hint text] [, sqlstate text])
> > + returns: void
> > + Report message or error.
>
> I haven't seen this discussed before, but I don't find the name
> pg_report_log particularly good.  Why not just pg_log?
>
>
Thanks for your comments.

Sorry for my very late response here.

I'm sure pg_log is good, but I don't see it as more easily
understandable. What I mean is, if we see "pg_log" as the function name,
we can't tell what the function is going to do just by reading its name.
For example, we have "pg_write_file": by reading the function name
itself, we can tell that this function is for writing contents to a
given file.

So, shall we rename pg_report_log to pg_write_log or pg_ereport, or
would you suggest another name?


-- 

Regards,
Dinesh
manojadinesh.blogspot.com


Re: [HACKERS] Foreign join pushdown vs EvalPlanQual

2015-11-15 Thread Kouhei Kaigai
> On 2015/11/13 13:44, Kyotaro HORIGUCHI wrote:
> 
> I wrote:
> >>> What I think is, I see zero evidence that there is a good use-case
> >>> for an FDW to do something other than an ExecProcNode in the
> >>> callback routine, as I said below, so I don't see the need to add
> >>> such a routine when it would place a perhaps not large, but not
> >>> small, burden on FDW authors for writing such a routine.
> 
> KaiGai-san wrote:
> >> It is quite natural, because we cannot predict what kind of extension
> >> will be implemented on the FDW interface. You might know the initial
> >> version of PG-Strom was implemented on FDW (about 4 years ago...). If I
> >> had continued to stick with FDW, it would have become an FDW driver
> >> with its own join engine.
> 
> >> From the standpoint of interface design, if we do not admit
> >> flexibility of implementation unless the community sees a working
> >> example, a reasonable tactic *for an extension author* is to follow
> >> the interface restriction even if it is not the best approach from
> >> his standpoint. It does not mean the approach of the majority is also
> >> best for the minority. It just demands a compromise from the minority.
> 
> > Or try to open the way to introduce the feature he/she wants.
> 
> I think the biggest difference between KaiGai-san's patch and mine is
> that KaiGai-san's patch introduces a callback routine to allow an FDW
> author not only to execute a secondary plan but to do something else
> instead of executing the plan, if he/she wants to do so.  His approach
> would provide that flexibility, but IMHO major FDWs that would implement
> join pushdown, such as postgres_fdw, wouldn't utilize it; they would
> probably just execute the secondary plan in the routine.
>
Yes, my approach never denies that.

> Furthermore, to execute the plan, his approach requires that an FDW
> author add code not only for creating the plan but for initializing
>
Pick up the plan from fdw_plans, then call ExecInitNode().

> executing
>
Pick up the plan-state from fdw_ps, then call ExecProcNode().

> ending it
>
Also, call ExecEndNode() on the plan-state.

> in his/her FDW, while in my approach he/she only has to add code for
> the plan creation; so his approach would impose a greater development
> burden on such major FDWs' authors than mine.
>
It looks to me like the additional development burden is three lines.

Both of our approaches need to construct an alternative local plan,
likely an unparameterized nest-loop, at planning time; that should be
supported by a utility function in core.
So, one more additional line will eventually be needed.
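
To spell it out, the per-FDW code I'm talking about is roughly this
(a rough sketch; fdw_plans/fdw_ps follow my patch's proposed fields,
while the executor entry points are the existing ones):

    /* BeginForeignScan: initialize the stashed alternative sub-plan */
    Plan       *subplan = (Plan *) linitial(fsplan->fdw_plans);

    node->fdw_state = ExecInitNode(subplan, node->ss.ps.state, eflags);

    /* recheck callback: run it to produce the substitute tuple */
    TupleTableSlot *slot = ExecProcNode((PlanState *) node->fdw_state);

    /* EndForeignScan: shut it down */
    ExecEndNode((PlanState *) node->fdw_state);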

> I think the flexibility
> would be a good thing, but I also think it's important not to burden FDW
> authors.  Maybe I'm missing something, though.
>
The actual pain is that people cannot design/implement their module as
they want. I've repeatedly pointed out that an FDW driver can have its
own join implementation, and people may want to use their own logic
rather than a local plan. At least, if PG-Strom still ran on FDW, I
would *want* to reuse its CPU-fallback routine instead of the
alternative sub-plan.

Could you explain why the above sequence (a few additional lines) is an
unacceptable burden that justifies eliminating flexibility for the
minority?
If you can implement the "common part" for the majority, we can
implement the same stuff as utility functions to be called from the
callbacks.

My questions are:
* How many lines do you expect the additional burden to be?
* Why does it justify eliminating flexibility from the interface?
* Why can't we implement the common part as utility routines that
  can be called from the callbacks?

Please don't hesitate to point out a flaw in my proposal if you have
noticed something significant that we missed.
However, at this moment, it does not seem to me that your concern is
something significant.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] COPY (INSERT/UPDATE/DELETE .. RETURNING ..)

2015-11-15 Thread Michael Paquier
On Sun, Nov 1, 2015 at 2:49 AM, Marko Tiikkaja  wrote:
> Attached is a patch for being able to do $SUBJECT without a CTE.  The
> reasons this is better than a CTE version are:
>
>   1) It's not obvious why a CTE version works but a plain one doesn't
>   2) This one has less overhead (I measured a ~12% improvement on a
> not-too-unreasonable test case)

Would you mind sharing this test case as well as numbers?

> With regard to RULEs, similar restrictions apply as the ones on
> data-modifying statements in WITH.

OK, so with this patch BeginCopy holds a lightweight copy of the logic
of RewriteQuery@rewriteHandler.c that rewrites queries with rules, and
the patch extends the error handling after calling
pg_analyze_and_rewrite.  This looks like the right approach.

INSERT/UPDATE/DELETE ... RETURNING is a Postgres extension, not part of
the SQL spec, so my first thought is that there is no reason to reject
this feature based on the grammar allowed.
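
As a concrete example of what this would allow with the patch applied
(sketched over libpq; table and connection setup assumed):

    PGresult   *res = PQexec(conn,
        "COPY (INSERT INTO t (id) VALUES (1), (2) RETURNING id) TO STDOUT");

    if (PQresultStatus(res) == PGRES_COPY_OUT)
    {
        char       *row;
        int         len;

        while ((len = PQgetCopyData(conn, &row, 0)) > 0)
        {
            fwrite(row, 1, len, stdout);    /* one RETURNING row per line */
            PQfreemem(row);
        }
        PQclear(res);
        res = PQgetResult(conn);            /* collect final command status */
    }
    PQclear(res);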

Here are some comments about the patch.

Nice patch to begin with.

Regression tests are broken when copyselect is run in parallel because
test3 is a table used there as well. A simple way to fix this is to
rename the table test3 to something else or to use a schema dedicated
to this test; I just renamed it to copydml_test in the attached patch.

-   | COPY select_with_parens TO opt_program copy_file_name opt_with copy_options
+   | COPY '(' PreparableStmt ')' TO opt_program copy_file_name opt_with copy_options
This does not look right: PreparableStmt is used for statements that are
part of PREPARE queries, so any modifications happening there would
impact COPY with this implementation. I think we had better add a new
rule, query_with_parens. Please note that the attached patch has not
changed that.
-- 
Michael
diff --git a/doc/src/sgml/ref/copy.sgml b/doc/src/sgml/ref/copy.sgml
index 2850b47..07e2f45 100644
--- a/doc/src/sgml/ref/copy.sgml
+++ b/doc/src/sgml/ref/copy.sgml
@@ -112,10 +112,17 @@ COPY { table_name [ ( query
 
  
-  A <xref linkend="sql-select"> or
-  <xref linkend="sql-values"> command
-  whose results are to be copied.
-  Note that parentheses are required around the query.
+  A <xref linkend="sql-select">, <xref linkend="sql-values">,
+  <xref linkend="sql-insert">, <xref linkend="sql-update"> or
+  <xref linkend="sql-delete"> command whose results are to be
+  copied.  Note that parentheses are required around the query.
+ </para>
+ <para>
+  For <command>INSERT</command>, <command>UPDATE</command> and
+  <command>DELETE</command> queries a RETURNING clause must be provided,
+  and the target relation must not have a conditional rule, nor
+  an <literal>ALSO</literal> rule, nor an <literal>INSTEAD</literal> rule
+  that expands to multiple statements.
  </para>
 

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 443b9e7..8bed38c 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -201,7 +201,7 @@ typedef struct CopyStateData
 	int			raw_buf_len;	/* total # of bytes stored */
 } CopyStateData;
 
-/* DestReceiver for COPY (SELECT) TO */
+/* DestReceiver for COPY (query) TO */
 typedef struct
 {
 	DestReceiver pub;			/* publicly-known function pointers */
@@ -772,7 +772,8 @@ CopyLoadRawBuf(CopyState cstate)
  *
  * Either unload or reload contents of table <relation>, depending on <from>.
  * (<from> = TRUE means we are inserting into the table.)  In the "TO" case
- * we also support copying the output of an arbitrary SELECT query.
+ * we also support copying the output of an arbitrary SELECT, INSERT, UPDATE
+ * or DELETE query.
  *
  * If <pipe> is false, transfer is between the table and the file named
  * <filename>.  Otherwise, transfer is between the table and our regular
@@ -1374,11 +1375,11 @@ BeginCopy(bool is_from,
 		Assert(!is_from);
 		cstate->rel = NULL;
 
-		/* Don't allow COPY w/ OIDs from a select */
+		/* Don't allow COPY w/ OIDs from a query */
 		if (cstate->oids)
 			ereport(ERROR,
 					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-					 errmsg("COPY (SELECT) WITH OIDS is not supported")));
+					 errmsg("COPY (query) WITH OIDS is not supported")));
 
 		/*
 		 * Run parse analysis and rewrite.  Note this also acquires sufficient
@@ -1393,9 +1394,36 @@ BeginCopy(bool is_from,
 		rewritten = pg_analyze_and_rewrite((Node *) copyObject(raw_query),
 		   queryString, NULL, 0);
 
-		/* We don't expect more or less than one result query */
-		if (list_length(rewritten) != 1)
-			elog(ERROR, "unexpected rewrite result");
+		/* check that we got back something we can work with */
+		if (rewritten == NIL)
+		{
+			ereport(ERROR,
+					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+					 errmsg("DO INSTEAD NOTHING rules are not supported for COPY")));
+		}
+		else if (list_length(rewritten) > 1)
+		{
+			ListCell *lc;
+
+			/* examine queries to determine which error message to issue */
+			foreach(lc, rewritten)
+			{
+				Query	   *q = (Query *) lfirst(lc);
+
+				if (q->querySource == QSRC_QUAL_INSTEAD_RULE)
+					ereport(ERROR,
+							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+							 errmsg("conditional DO INSTEAD rules are not supported for COPY")));
+				if (q->querySource == QSRC_NON_INSTEAD_RULE)
+	

Re: [HACKERS] pg_hba_lookup function to get all matching pg_hba.conf entries

2015-11-15 Thread Haribabu Kommi
On Mon, Nov 16, 2015 at 2:30 PM, Peter Eisentraut  wrote:
> On 7/21/15 5:15 AM, Haribabu Kommi wrote:
>> With the output of this view, an administrator can easily identify
>> the lines that match the given criteria without going through the
>> file.
>
> How is this useful?  I could see the use if you want to debug cases
> where user foo on host bar says they can't connect, but you can't
> impersonate them to verify it.  But then all you need is a function
> with a scalar result, not a result set.

Do you mean the function should return true or false based on the
connection status for the provided arguments?

I also feel that a plain function result is more difficult to understand
than a view.

Regards,
Hari Babu
Fujitsu Australia


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] eXtensible Transaction Manager API

2015-11-15 Thread Kevin Grittner
On Saturday, November 14, 2015 8:41 AM, Craig Ringer  
wrote:
> On 13 November 2015 at 21:35, Michael Paquier  
> wrote:
>> On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas  wrote:

>>> If the database is corrupted, there's no way to guarantee that
>>> anything works as planned.  This is like saying that criticizing
>>> somebody's disaster recovery plan on the basis that it will be
>>> inadequate if the entire planet earth is destroyed.

+1

Once all participating servers return "success" from the "prepare"
phase, the only sane way to proceed is to commit as much as
possible.  (I guess you still have some room to argue about what to
do after an error on the attempt to commit the first prepared
transaction, but even there you need to move forward if it might
have been due to a communications failure preventing a success
indication for a commit which was actually successful.)  The idea
of two phase commit is that failure to commit after a successful
prepare should be extremely rare.

>> There could just as well be FS, OS, or network problems... To come
>> back to the point: I simply found it surprising that Konstantin said
>> upthread that if commit fails on some of the nodes we should roll
>> back the prepared transaction on all nodes. In the example given, in
>> the phase after calling dtm_end_prepare, say we perform COMMIT
>> PREPARED correctly on node 1, but then fail it on node 2 because a
>> meteor has hit the server; it seems that we cannot roll back, and
>> instead we had better roll in a backup and make sure that the
>> transaction gets committed. How would you roll back a transaction
>> already committed on node 1? But perhaps I missed something...

Right.

> The usual way this works in an XA-like model is:
>
> In phase 1 (prepare transaction, in Pg's spelling), failure on
> any node triggers a rollback on all nodes.
>
> In phase 2 (commit prepared), failure on any node causes retries
> until it succeeds, or until the admin intervenes - say, to remove
> that node from operation. The global xact as a whole isn't
> considered successful until it's committed on all nodes.
>
> 2PC and distributed commit is well studied, including the
> problems. We don't have to think this up for ourselves. We don't
> have to invent anything here. There's a lot of distributed
> systems theory to work with - especially when dealing with well
> studied relational DBs trying to maintain ACID semantics.

Right; it is silly not to build on decades of work on theory and
practice in this area.
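
To put Craig's phase-2 rule in concrete terms, the coordinator-side
loop amounts to something like this (a hedged sketch over libpq; a real
implementation must also distinguish "GID is gone because it already
committed" from transient failures rather than retrying blindly):

    #include <libpq-fe.h>
    #include <stdio.h>
    #include <unistd.h>

    static void
    commit_prepared_until_done(PGconn *conn, const char *gid)
    {
        char        cmd[80];

        snprintf(cmd, sizeof(cmd), "COMMIT PREPARED '%s'", gid);
        for (;;)
        {
            PGresult       *res = PQexec(conn, cmd);
            ExecStatusType  status = PQresultStatus(res);

            PQclear(res);
            if (status == PGRES_COMMAND_OK)
                return;             /* committed on this node */
            sleep(1);               /* e.g. transient network failure */
        }
    }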

What is making me nervous as I watch this thread is a bit of loose
talk about the I in ACID.  There have been some points where it
seemed to be implied that we had sufficient transaction isolation
if we could get all the participating nodes using the same
snapshot.  I think I've seen some hint that there is an intention
to use distributed strict two-phase locking with a global lock
manager to achieve serializable behavior.  The former is vulnerable
to some well-known serialization anomalies and the latter was
dropped from PostgreSQL when MVCC was implemented because of its
horrible concurrency and performance characteristics, which is not
going to be less of an issue in a distributed system using that
technique.  It may be possible to implement the Serializable
Snapshot Isolation (SSI) technique across multiple cooperating
nodes, but it's not obvious exactly how that would be implemented,
and I have seen no discussion of that.

If we're going to talk about maintaining ACID semantics in this
environment, I think we need to explicitly state what level of
isolation we intend to provide, and how we intend to do that.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers