Re: [HACKERS] Patch: Allow SQL-language functions to reference parameters by parameter name

2012-01-30 Thread Hitoshi Harada
On Sun, Jan 29, 2012 at 1:08 AM, Matthew Draper matt...@trebex.net wrote:
 On 25/01/12 18:37, Hitoshi Harada wrote:
 I'm still not sure whether to just revise (almost) all the SQL function
 examples to use parameter names, and declare them the right choice; as
 it's currently written, named parameters still seem rather second-class.

 Agreed.

 I'll try a more comprehensive revision of the examples.

 The patch seems ok, except an example I've just found.

 db1=# create function t(a int, t t) returns int as $$ select t.a $$
 language sql;
 CREATE FUNCTION
 db1=# select t(0, row(1, 2)::t);
  t
 ---
  1
 (1 row)

 Should we throw an error in such ambiguity? Or did you make it happen
 intentionally? If latter, we should also mention the rule in the
 manual.


 I did consider it, and felt it was the most consistent:

 # select t.x, t, z from (select 1) t(x), (select 2) z(t);
  x | t |  z
 ---+---+-----
  1 | 2 | (2)
 (1 row)


 I haven't yet managed to find the above behaviour described in the
 documentation either, though. To me, it feels like an obscure corner
 case, whose description would leave the rules seeming more complicated
 than they generally are.

Makes sense to me. I marked this as Ready for committer.
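
For the archives: given the resolution shown in the example above (t.a was
taken as the composite parameter t, not as function_name.parameter), the
ambiguity can also be sidestepped by choosing distinct parameter names or by
using positional references. A quick sketch, reusing the composite type t,
with purely illustrative names:

create function t2(a int, rec t) returns int as $$
    select rec.a      -- unambiguous: "rec" can only be the parameter
$$ language sql;

create function t3(a int, t t) returns int as $$
    select $2.a       -- positional reference avoids the name clash
$$ language sql;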

Thanks,
-- 
Hitoshi Harada



Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-01-30 Thread Kyotaro HORIGUCHI
Thank you for the comments; this is a revised version of the patch.

The gain in performance is more than expected. The measurement script
now runs the query via dblink ten times for more stable measurements,
so the figures are about ten times longer than the previous ones.

                       sec    % to Original
Original             : 31.5   100.0%
RowProcessor patch   : 31.3    99.4%
dblink patch         : 24.6    78.1%

The RowProcessor patch alone shows no loss (or a very small gain), and
the full patch gives us a 22% gain on the benchmark (*1).


The modifications are listed below.


- PGresAttValue is no longer used for this mechanism; PGrowValue has
  been added instead. PGresAttValue has been put back into
  libpq-int.h.

- pqAddTuple() is restored to its original form, and a new function
  pqAddRow() is added for use as the RowProcessor. (The previous
  pqAddTuple implementation had buggily mixed the two usages of
  PGresAttValue.)

- PQgetRowProcessorParam has been dropped. The contextual parameter
  is now passed as one of the parameters of RowProcessor().

- RowProcessor() returns int (as a bool; is that the libpq convention?)
  instead of void *. (Actually, void * had already become useless
  as of the previous patch.)

- PQsetRowProcessorErrMes() is changed to do strdup internally.

- The callers of RowProcessor() no longer set null_field to
  PGrowValue.value. Plus, the PGrowValue[] which RowProcessor()
  receives has nfields + 1 elements, to allow a rough size
  estimate via cols->value[nfields].value - cols->value[0].value -
  something.  The something here is 4 * nfields for protocol 3 and
  4 * (non-null fields) for protocol 2. I fear that this applies
  only to textual transfer usage...

- PQregisterRowProcessor() sets the default handler when given
  NULL. (pg_conn|pg_result).rowProcessor cannot be NULL for its
  lifetime.

- initStoreInfo() and storeHandler() have been provided with
  malloc error handling.


And more..

- getAnotherTuple()@fe-protocol2.c has not been tested at all.

- The uniform column sizes in the test data prevent realloc from
  ever being exercised in dblink... More testing should be done.
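
For reviewers who want to see the intended shape of the API, below is a
rough sketch of a caller. The RowProcessor callback signature is only my
reading of the descriptions above (and of the functions exported in the
diff), so treat it as illustration rather than as part of the patch:

#include <libpq-fe.h>

/* Hypothetical example callback; PGrowValue and RowProcessor are the
 * types this patch introduces. */
static int
my_row_processor(PGresult *res, PGrowValue *cols, void *param)
{
    int         nfields = PQnfields(res);

    /* cols[] has nfields + 1 elements, so a rough per-row size estimate
     * is available from the value pointers alone (protocol 3, text). */
    size_t      rough_size = (size_t) (cols[nfields].value
                                       - cols[0].value
                                       - 4 * nfields);

    /* ... copy the field data into storage owned by the caller ... */
    (void) rough_size;
    (void) param;

    return 1;                   /* non-zero = success, keep receiving rows */
}

void
install_row_processor(PGconn *conn, void *my_state)
{
    /* Passing NULL instead would restore the default handler (pqAddRow). */
    PQregisterRowProcessor(conn, my_row_processor, my_state);
}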


 regards,

=
(*1) The benchmark is done as follows,

==test.sql
select dblink_connect('c', 'host=localhost dbname=test');
select * from dblink('c', 'select a,c from foo limit 200') as (a text, b
bytea) limit 1;
...(repeat 9 times more)
select dblink_disconnect('c');
==

$ for i in $(seq 1 10); do time psql test -f t.sql; done

The environment is
  CentOS 6.2 on VirtualBox on Core i7 965 3.2GHz
  # of processor  1
  Allocated mem   2GB
  
Test DB schema is
   Column | Type  | Modifiers 
  --------+-------+-----------
   a      | text  | 
   b      | text  | 
   c      | bytea | 
  Indexes:
      "foo_a_bt" btree (a)
      "foo_c_bt" btree (c)

test=# select count(*),
   min(length(a)) as a_min, max(length(a)) as a_max,
   min(length(c)) as c_min, max(length(c)) as c_max from foo;

  count  | a_min | a_max | c_min | c_max 
---------+-------+-------+-------+-------
     200 |    29 |    29 |    29 |    29
(1 row)

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/interfaces/libpq/exports.txt b/src/interfaces/libpq/exports.txt
index 1af8df6..5ed083c 100644
--- a/src/interfaces/libpq/exports.txt
+++ b/src/interfaces/libpq/exports.txt
@@ -160,3 +160,5 @@ PQconnectStartParams      157
 PQping                    158
 PQpingParams              159
 PQlibVersion              160
+PQregisterRowProcessor    161
+PQsetRowProcessorErrMes   162
diff --git a/src/interfaces/libpq/fe-connect.c b/src/interfaces/libpq/fe-connect.c
index d454538..4fe2f41 100644
--- a/src/interfaces/libpq/fe-connect.c
+++ b/src/interfaces/libpq/fe-connect.c
@@ -2692,6 +2692,8 @@ makeEmptyPGconn(void)
 	conn->allow_ssl_try = true;
 	conn->wait_ssl_try = false;
 #endif
+	conn->rowProcessor = pqAddRow;
+	conn->rowProcessorParam = NULL;
 
 	/*
 	 * We try to send at least 8K at a time, which is the usual size of pipe
@@ -5076,3 +5078,10 @@ PQregisterThreadLock(pgthreadlock_t newhandler)
 
 	return prev;
 }
+
+void
+PQregisterRowProcessor(PGconn *conn, RowProcessor func, void *param)
+{
+	conn->rowProcessor = (func ? func : pqAddRow);
+	conn->rowProcessorParam = param;
+}
diff --git a/src/interfaces/libpq/fe-exec.c b/src/interfaces/libpq/fe-exec.c
index b743566..82914fd 100644
--- a/src/interfaces/libpq/fe-exec.c
+++ b/src/interfaces/libpq/fe-exec.c
@@ -66,6 +66,7 @@ static PGresult *PQexecFinish(PGconn *conn);
 static int PQsendDescribe(PGconn *conn, char desc_type,
 			   const char *desc_target);
 static int	check_field_number(const PGresult *res, int field_num);
+static int	pqAddTuple(PGresult *res, PGresAttValue *tup);
 
 
 /* 
@@ -160,6 +161,9 @@ PQmakeEmptyPGresult(PGconn *conn, ExecStatusType status)
 	result->curBlock = NULL;
 	result->curOffset = 0;
 	result->spaceLeft = 0;
+	result->rowProcessor = pqAddRow;
+	result->rowProcessorParam = NULL;
+	result->rowProcessorErrMes = NULL;
 
 	if (conn)
 	{
@@ -194,6 +198,10 @@ PQmakeEmptyPGresult(PGconn 

Re: [HACKERS] Hot standby off of hot standby?

2012-01-30 Thread Simon Riggs
On Mon, Jan 30, 2012 at 4:36 AM, Igor Schtein ischt...@gmail.com wrote:

 Is it possible to use a standby instance as a master/primary for another
 standby in Postgres 9.0?  In other words, does PG 9.0 support cascading
 standby configuration?

No, but 9.2 will support that feature, known as cascading replication.
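
If you want to experiment with the 9.2 development code, the cascaded
standby is configured like any other standby, except that primary_conninfo
points at the first standby rather than at the primary. A rough sketch
(host name and values purely illustrative):

  # recovery.conf on the second-tier standby
  standby_mode = 'on'
  primary_conninfo = 'host=first-standby.example.com port=5432 user=replication'

The first standby also needs max_wal_senders > 0 and a replication entry in
its pg_hba.conf, just as a primary would.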

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Remembering bug #6123

2012-01-30 Thread Simon Riggs
On Thu, Jan 12, 2012 at 4:30 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Tom Lane t...@sss.pgh.pa.us wrote:
 Also, what's the point of testing update_ctid?  I don't see that
 it matters whether the outdate was a delete or an update.

 The update_ctid code was a carry-over from my old, slightly
 different approach, which I failed to change as I should have.  I'll
 fix that along with the other.

 Actually, on reflection there might be a reason for checking
 update_ctid, with a view to allowing harmless cases.  I see
 these cases:

 * UPDATE finds a trigger already updated the row: must throw error
 since we can't apply the update.

 * UPDATE finds a trigger already deleted the row: arguably, we could
 let the deletion stand and ignore the update action.

 * DELETE finds a trigger already updated the row: must throw error
 since we can't apply the delete.

 * DELETE finds a trigger already deleted the row: arguably, there's
 no reason to complain.

 Don't know if that was your reasoning as well.  But if it is, then again
 the comment needs to cover that.


Just been reading this thread a little.

ISTM that seeing a SelfUpdated row on a cursor when we didn't use
WHERE CURRENT OF really ought to raise some kind of exception
condition like a WARNING or a NOTICE. That situation seems most likely
to be a bug or at least an unintended consequence.

As Tom points out, the multiple flavours of weirdness that can result
even if we do fix this are not things we should do silently. I think
we should do the best job we can without throwing an error, but we
must make some attempt to let the developer know we did that for them
so they can investigate.
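
For anyone who wants to reproduce the case being discussed, here is a
minimal sketch of the "DELETE finds a trigger already updated the row"
scenario (table, function and trigger names are made up):

  create table r (id int primary key, n int);
  insert into r values (1, 0);

  create function r_bump() returns trigger language plpgsql as $$
  begin
      update r set n = n + 1 where id = old.id;  -- outdates the target row
      return old;
  end $$;

  create trigger r_del before delete on r
      for each row execute procedure r_bump();

  delete from r where id = 1;   -- the DELETE now sees a self-updated row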

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables

2012-01-30 Thread Ashutosh Bapat
Thanks Tom for giving a stronger case. I found the problem while looking
at inherited tables, and didn't think beyond inherited tables. Since
inherited tables are expanded when the subquery planner is invoked, I thought
the problem would occur only in EXPLAIN output, as we won't generate queries
that can be used elsewhere after/during planning.

So, as I understand it, we have two problems here:
1. Prefixing the schema name to the fake aliases if there is another RTE with
the same name. There may not be a relation with that name (the fake alias
name given) in the schema chosen as the prefix.
2. Fake aliases themselves can be conflicting.

If I understand correctly, if we solve the second problem, the first problem
will not occur. Is that correct?

On Sat, Jan 28, 2012 at 8:08 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 Robert Haas robertmh...@gmail.com writes:
  It's a feature, not a bug, that we schema-qualify names when VERBOSE
  is specified.  That was done on purpose for the benefit of external
  tools that might need this information to disambiguate which object is
  being referenced.

  Table *aliases*, of course, should not be schema-qualified, but I
  don't think that's what we're doing.  You could make it more clear by
  including an alias in the query, like this:

  explain verbose select * into table ramp from road hwy where name ~
 '.*Ramp';

 I think you are both focusing on the wrong thing.  There is a lot of
 squishiness in what EXPLAIN prints out, since SQL notation is not always
 well suited to what an execution plan actually does.  But this code has
 a hard and fast requirement that it dump view definitions correctly,
 else pg_dump doesn't work.  And after looking at this I think Ashutosh
 has in fact found a bug.  Consider this example:

 regression=# create schema s1;
 CREATE SCHEMA
 regression=# create schema s2;
 CREATE SCHEMA
 regression=# create table s1.t1 (f1 int);
 CREATE TABLE
 regression=# create table s2.t1 (f1 int);
 CREATE TABLE
 regression=# create view v1 as
 regression-#   select * from s1.t1 where exists (
 regression(# select 1 from s2.t1 where s2.t1.f1 = s1.t1.f1
 regression(#   );
 CREATE VIEW
 regression=# \d+ v1
                  View "public.v1"
  Column |  Type   | Modifiers | Storage | Description 
 --------+---------+-----------+---------+-------------
  f1     | integer |           | plain   | 
 View definition:
  SELECT t1.f1
   FROM s1.t1
  WHERE (EXISTS ( SELECT 1
   FROM s2.t1
  WHERE t1.f1 = s1.t1.f1));

 regression=# alter table s2.t1 rename to tx;
 ALTER TABLE
 regression=# \d+ v1
                  View "public.v1"
  Column |  Type   | Modifiers | Storage | Description 
 --------+---------+-----------+---------+-------------
  f1     | integer |           | plain   | 
 View definition:
  SELECT t1.f1
   FROM s1.t1
  WHERE (EXISTS ( SELECT 1
   FROM s2.tx t1
  WHERE t1.f1 = s1.t1.f1));

 Both of the above displays of the view are formally correct, in that the
 variables will be taken to refer to the correct upper or lower RTE.
 But let's change that back and rename the other table:

 regression=# alter table s2.tx rename to t1;
 ALTER TABLE
 regression=# alter table s1.t1 rename to tx;
 ALTER TABLE
 regression=# \d+ v1
                  View "public.v1"
  Column |  Type   | Modifiers | Storage | Description 
 --------+---------+-----------+---------+-------------
  f1     | integer |           | plain   | 
 View definition:
  SELECT t1.f1
   FROM s1.tx t1
  WHERE (EXISTS ( SELECT 1
   FROM s2.t1
  WHERE t1.f1 = s1.t1.f1));

 This is just plain wrong, as you'll see if you try to execute that
 query:

 regression=# SELECT t1.f1
 regression-#FROM s1.tx t1
 regression-#   WHERE (EXISTS ( SELECT 1
 regression(#FROM s2.t1
 regression(#   WHERE t1.f1 = s1.t1.f1));
 ERROR:  invalid reference to FROM-clause entry for table "t1"
 LINE 5:   WHERE t1.f1 = s1.t1.f1));
^
 HINT:  There is an entry for table "t1", but it cannot be referenced
 from this part of the query.

 (The HINT is a bit confused here, but the query is certainly invalid.)

 So what we have here is a potential failure to dump and reload view
 definitions, which is a lot more critical in my book than whether
 EXPLAIN's output is confusing.

 If we stick with the existing rule for attaching a fake alias to renamed
 RTEs, I think that Ashutosh's patch or something like it is probably
 appropriate, because the variable-printing code ought to be in step with
 the RTE-printing code.  Unfortunately, I think the hack to attach a fake
 alias to renamed RTEs creates some issues of its own.  Consider

select * from s1.t1
  where exists (select 1 from s2.t2 t1 where t1.f1 = s1.t1.f1);

 If s1.t1 is now renamed to s1.tx, it is still possible to express
 the same semantics:

select * from s1.tx
  where exists (select 1 from s2.t2 t1 where t1.f1 = s1.tx.f1);

 But when we attach a fake alias, it's broken:



Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-01-30 Thread horiguchi . kyotaro
I'm sorry.

 Thank you for the comments; this is a revised version of the patch.

The malloc error handling in dblink.c of the patch is broken. It
is leaking memory and breaking control flow.

I'll re-send a properly fixed patch for dblink.c later.

# This severe back pain should have made me stupid :-p

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




[HACKERS] Hot standby off of hot standby?

2012-01-30 Thread Igor Schtein
Hi,

Is it possible to use a standby instance as a master/primary for another
standby in Postgres 9.0?  In other words, does PG 9.0 support cascading
standby configuration?

Thanks,
Igor


Re: [HACKERS] JSON for PG 9.2

2012-01-30 Thread Abhijit Menon-Sen
At 2012-01-27 09:47:05 +0530, a...@toroid.org wrote:

 I've started reviewing this patch, but it'll take me a bit longer to go
 through json.c properly.

OK, I finished reading json.c. I don't have an answer to the detoasting
question in the XXX comment, but the code looks fine.

Aside: is query_to_json really necessary? It seems rather ugly and
easily avoidable using row_to_json.

-- ams



Re: [HACKERS] Group commit, revised

2012-01-30 Thread Heikki Linnakangas

On 27.01.2012 15:38, Robert Haas wrote:

On Fri, Jan 27, 2012 at 8:35 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Yeah, we have to be careful with any overhead in there, it can be a hot
spot. I wouldn't expect any measurable difference from the above, though.
Could I ask you to rerun the pgbench tests you did recently with this patch?
Or can you think of a better test for this?


I can't do so immediately, because I'm waiting for Nate Boley to tell
me I can go ahead and start testing on that machine again.  But I will
do it once I get the word.


I committed this. I ran pgbench test on an 8-core box and didn't see any 
slowdown. It would still be good if you get a chance to rerun the bigger 
test, but I feel confident that there's no measurable slowdown.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Multithread Query Planner

2012-01-30 Thread Robert Haas
On Fri, Jan 27, 2012 at 2:56 PM, Pierre C li...@peufeu.com wrote:
 Right.  I think it makes more sense to try to get parallelism working
 first with the infrastructure we have.  Converting to use threading,
 if we ever do it at all, should be something we view as a later
 performance optimization.  But I suspect we won't want to do it
 anyway; I think there will be easier ways to get where we want to be.

 Multithreading got fashionable with the arrival of the Dual-core CPU a few
 years ago. However, multithreading as it is used currently has a huge
 problem : usually, threads share all of their memory. This opens the door to
 an infinite number of hard to find bugs, and more importantly, defeats the
 purpose.

 Re-entrant palloc() is nonsense. Suppose you can make a reentrant palloc()
 which scales OK at 2 threads thanks to a cleverly placed atomic instruction.
 How is it going to scale on 64 cores ? On HP's new 1000-core ARM server with
 non-uniform memory access ? Probably it would suck very very badly... not to
 mention the horror of multithreaded exception-safe deallocation when 1
 thread among many blows up on an error...

There are academic papers out there on how to build a thread-safe,
highly concurrent memory allocator.  You seem to be assuming that
everyone doing allocations would need to compete for access to a
single freelist, or something like that, which is simply not true.  A
lot of effort and study has been put into figuring out how to get past
bottlenecks in this area, because there is a lot of multi-threaded
code out there that needs to surmount these problems.  I don't believe
that the problem is that it can't be done, but rather that we haven't
done it.

 For the ultimate in parallelism, ask a FPGA guy. Is he using shared memory
 to wire together his 12000 DSP blocks ? Nope, he's using isolated Processes
 which share nothing and communicate through FIFOs and hardware message
 passing. Like shell pipes, basically. Or Erlang.

I'm not sure we can conclude much from this example.  The programming
style of people using FPGAs is probably governed by the nature of the
interface and the type of computation they are doing rather than
anything else.

 Good parallelism = reduce shared state and communicate through data/message
 channels.

 Shared-everything multithreading is going to be in a lot of trouble on
 future many-core machines. Incidentally, Postgres, with its Processes,
 sharing only what is needed, has a good head start...

 With more and more cores coming, you guys are going to have to fight to
 reduce the quantity of shared state between processes, not augment it by
 using shared memory threads !...

I do agree that it's important to reduce shared state.  We've seen
some optimizations this release cycle that work precisely because they
cut down on the rate at which cache lines must be passed between
cores, and it's pretty clear that we need to go farther in that
direction.  On the other hand, I think it's a mistake to confuse the
programming model with the amount of shared state.  In a
multi-threaded programming model there is likely to be a lot more
memory that is technically shared in the sense that any thread could
technically access it.  But if the application is coded in such a way
that actual sharing is minimal, then it's not necessarily any worse
than a process model as far as concurrency is concerned.  Threading
provides a couple of key advantages which, with our process model, we
can't get: it avoids the cost of a copy-on-write operation every time
a child is forked, and it allows arbitrary amounts of memory rather
than being limited to a single shared memory segment that must be
sized in advance.  The major disadvantage is really with robustness,
not performance, I think: in a threaded environment, with a shared
address space, the consequences of a random memory stomp will be less
predictable.

 Say you want to parallelize sorting.
 Sorting is a black-box with one input data pipe and one output data pipe.
 Data pipes are good for parallelism, just like FIFOs. FPGA guys love black
 boxes with FIFOs between them.

 Say you manage to send tuples through a FIFO like zeromq. Now you can even
 run the sort on another machine and allow it to use all the RAM if you like.
 Now split the black box in two black boxes (qsort and merge), instantiate as
 many qsort boxes as necessary, and connect that together with pipes. Run
 some boxes on some of this machine's cores, some other boxes on another
 machine, etc. That would be very flexible (and scalable).

 Of course the black box has a small backdoor : some comparison functions can
 access shared state, which is basically *the* issue (not reentrant stuff,
 which you do not need).

Well, you do need reentrant stuff, if the comparator does anything
non-trivial.  It's easy to imagine that comparing strings or dates or
whatever is a trivial operation that's done without allocating any
memory or throwing any errors, but it's not really 

Re: [HACKERS] Configuring Postgres to Add A New Source File

2012-01-30 Thread Robert Haas
On Fri, Jan 27, 2012 at 6:37 PM, Tareq Aljabban
tareq.aljab...@gmail.com wrote:
 Indeed, I'm a beginner in Make, but I read a few tutorials and was able to
 do what I wanted outside of PG using a simple make file.
 Now, when moving to PG, I found the Make structure much more complicated and
 didn't know where to add my configuration.
 I'm looking only for this file to run in PG (the required functionality is
 done already). My code will be a part of the backend, so I want to keep it
 there.

Uh, no it won't.  The compile command you showed had it creating a
standalone executable, hdfs_test.

You're likely to find that linking a JVM into the backend
significantly decreases the overall robustness of the system.  For
example, ^C isn't going to be able to abort out of processing a query
that's off doing something inside the JVM.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Simulating Clog Contention

2012-01-30 Thread Robert Haas
On Sat, Jan 28, 2012 at 3:32 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 I think that even in normal (non-initialization) usage, this message
 should be suppressed when the provided scale factor
 is equal to the pgbench_branches table count.

 The attached patch does just that.  There is probably no reason to
 warn people that we are doing what they told us to, but not for the
 reason they think.

In my opinion, a more sensible approach than anything we're doing
right now would be to outright *reject* options that will only be
ignored.  If -s isn't supported except with -i, then trying to specify
-s without -i should just error out at the options-parsing stage,
before we even try to connect to the database.  It's not very helpful
to accept options and then ignore them, and we have many instances of
that right now: initialization-only switches are accepted and ignored
when not initializing, and run-only switches are accepted and ignored
with initializing.
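
To make that concrete, the split is roughly this (scale, client count and
database name are illustrative):

  $ pgbench -i -s 40 bench            # initialization: -s takes effect
  $ pgbench -c 10 -T 60 bench         # run: scale comes from pgbench_branches
  $ pgbench -s 40 -c 10 -T 60 bench   # run: -s is accepted but ignored today;
                                      # under my proposal this would error out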

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-30 Thread Robert Haas
On Fri, Jan 27, 2012 at 4:07 PM, Dan Scales sca...@vmware.com wrote:
 The advantage of putting the checksum calculation in smgrwrite() (or 
 mdwrite()) is that it catches a bunch of page writes that don't go through 
 the buffer pool (see calls to smgrwrite() in nbtree.c, nbtsort.c, spginsert.c)

Maybe we should have some sort of wrapper function, then.  I really
dislike the idea that the smgr layer knows anything about the page
format, and if md has to know that's even worse.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Group commit, revised

2012-01-30 Thread Simon Riggs
On Mon, Jan 30, 2012 at 2:57 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 I committed this. I ran pgbench test on an 8-core box and didn't see any
 slowdown. It would still be good if you get a chance to rerun the bigger
 test, but I feel confident that there's no measurable slowdown.

I asked clearly and specifically for you to hold back committing
anything. Not sure why you would ignore that and commit without
actually asking myself or Peter. On a point of principle alone, I
think you should revert. Working together is difficult if
communication channels are openly ignored and disregarded.

Peter and I have been working on a new version that seems likely to
improve performance over your suggestions. We should be showing
something soon.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Group commit, revised

2012-01-30 Thread Heikki Linnakangas

On 30.01.2012 17:18, Simon Riggs wrote:

On Mon, Jan 30, 2012 at 2:57 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:


I committed this. I ran pgbench test on an 8-core box and didn't see any
slowdown. It would still be good if you get a chance to rerun the bigger
test, but I feel confident that there's no measurable slowdown.


I asked clearly and specifically for you to hold back committing
anything. Not sure why you would ignore that and commit without
actually asking myself or Peter. On a point of principle alone, I
think you should revert. Working together is difficult if
communication channels are openly ignored and disregarded.


You must be referring to this:

http://archives.postgresql.org/pgsql-hackers/2012-01/msg01406.php

What I committed in the end was quite different from the version that
mail was in reply to, too. If you have a specific objection to the patch as
committed, please let me know.



Peter and I have been working on a new version that seems likely to
improve performance over your suggestions. We should be showing
something soon.


Please post those ideas, and let's discuss them. If it's something 
simple, maybe we can still sneak them into this release. Otherwise, 
let's focus on the existing patches that are pending review or commit.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] JSON for PG 9.2

2012-01-30 Thread Andrew Dunstan



On 01/30/2012 09:54 AM, Abhijit Menon-Sen wrote:

At 2012-01-27 09:47:05 +0530, a...@toroid.org wrote:

I've started reviewing this patch, but it'll take me a bit longer to go
through json.c properly.

OK, I finished reading json.c. I don't have an answer to the detoasting
question in the XXX comment, but the code looks fine.



Looking at somewhat analogous code in xml.c, it doesn't seem to be done
there. So maybe we don't need to worry about it.




Aside: is query_to_json really necessary? It seems rather ugly and
easily avoidable using row_to_json.



I started with this, again by analogy with query_to_xml(). But I agree 
it's a bit ugly. If we're not going to do it, then we definitely need to 
look at caching the output funcs in the function info. A closer 
approximation is actually:


   SELECT array_to_json(array_agg(q))
   FROM ( your query here ) q;


But then I'd want the ability to break that up a bit with line feeds, so 
we'd need to adjust the interface slightly. (Hint: don't try the above 
with select * from pg_class.)
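
For example, with a small result set (a LIMIT keeps the pg_class case from
the hint above manageable; the column list is arbitrary):

   SELECT array_to_json(array_agg(q))
     FROM (SELECT relname, relkind FROM pg_class LIMIT 3) q;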



I'll wait on further comments, but I can probably turn these changes 
around very quickly once we're agreed.



Cheers

andrew




Re: [HACKERS] Simulating Clog Contention

2012-01-30 Thread Jeff Janes
On Mon, Jan 30, 2012 at 7:24 AM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Jan 28, 2012 at 3:32 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 I think that even in normal (non-initialization) usage, this message
 should be suppressed when the provided scale factor
 is equal to the pgbench_branches table count.

 The attached patch does just that.  There is probably no reason to
 warn people that we are doing what they told us to, but not for the
 reason they think.

 In my opinion, a more sensible approach than anything we're doing
 right now would be to outright *reject* options that will only be
 ignored.  If -s isn't supported except with -i, then trying to specify
 -s without -i should just error out at the options-parsing stage,
 before we even try to connect to the database.  It's not very helpful
 to accept options and then ignore them, and we have many instances of
 that right now: initialization-only switches are accepted and ignored
 when not initializing, and run-only switches are accepted and ignored
 with initializing.

I like the ability to say, effectively, "I think I previously did
the initialization with -s 40; if I actually didn't, then scream at me,
and if I did, then go ahead and do the pgbench run I just asked for."
But, since it does unconditionally report the scale actually used and
I just have to have the discipline to go look at the result, I can see
where this is perhaps overkill.   In my own (non-PG-related) code,
when I have tasks that have to be run in multiple phases that can get
out of sync if I am not careful, I like to be able to specify the
flags even in the unused invocation, so that the code can verify I
am being consistent.  Code is better at that than I am.


I'm not sure I know what all would be incompatible with what.  I could
start drawing that matrix up once the API stabilizes, but I think you
are still planning on whacking this -I option around a bit.

Cheers,

Jeff



Re: [HACKERS] Simulating Clog Contention

2012-01-30 Thread Jeff Janes
On Thu, Jan 26, 2012 at 6:18 AM, Abhijit Menon-Sen a...@toroid.org wrote:
 At 2012-01-12 12:31:20 +, si...@2ndquadrant.com wrote:

 The following patch adds a pgbench option -I to load data using
 INSERTs

 This is just to confirm that the patch applies and builds and works
 fine (though of course it does take a long time… pity there doesn't
 seem to be any easy way to add progress indication like -i has).

I was thinking the opposite.  That -i should only print progress
indication when -d is given.  Or at least knock an order of magnitude
or two off of how often it does so.

Cheers,

Jeff



Re: [HACKERS] Group commit, revised

2012-01-30 Thread Jeff Janes
On Sun, Jan 29, 2012 at 1:20 PM, Greg Smith g...@2ndquadrant.com wrote:
 On 01/28/2012 07:48 PM, Jeff Janes wrote:


 I haven't inspected that deep fall off at 30 clients for the patch.
 By way of reference, if I turn off synchronous commit, I get
 tps=1245.8 which is 100% CPU limited.  This sets a theoretical upper
 bound on what could be achieved by the best possible group committing
 method.


 This sort of thing is why I suspect that to completely isolate some results,
 we're going to need a moderately high end server--with lots of
 cores--combined with an intentionally mismatched slow drive.  It's too easy
 to get pgbench and/or PostgreSQL to choke on something other than I/O when
 using smaller core counts.  I don't think I have anything where the floor is
 24 TPS per client though.  Hmmm...I think I can connect an IDE drive to my
 MythTV box and format it with ext4.  Thanks for the test idea.

 One thing you could try on this system is using the -N option ("Do not update
 pgbench_tellers and pgbench_branches").  That eliminates a lot of the
 contention that might be pulling down your higher core count tests, while
 still giving a completely valid test of whether the group commit mechanism
 works.  Not sure whether that will push up the top-end usefully for you,
 worth a try if you have time to test again.

Adding the -N did eliminate the fall-off at 30 clients for
group_commit patch.  But, I still want to explore why the fall off
occurs when I get a chance.  I know why the curve would stop going up
without using -N (with -s of 40 and -c of 30, many connections will be
waiting on row locks for updates to pgbench_branches) but that should
cause a leveling off, not a collapse.

Other than the lack of drop off at 30 clients, -N didn't meaningfully
change anything.  Everyone got slightly faster except at -c1.


 If the group_commit patch goes in, would we then rip out commit_delay
 and commit_siblings?


 The main reason those are still hanging around at all are to allow pushing
 on the latency vs. throughput trade-off on really busy systems.  The use
 case is that you expect, say, 10 clients to constantly be committing at a
 high rate.  So if there's just one committing so far, assume it's the
 leading edge of a wave and pause a bit for the rest to come in.  I don't
 think the cases where this is useful behavior--people both want it and the
 current mechanism provides it--are very common in the real world.
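
(For reference, the two knobs being discussed look like this in
postgresql.conf; the values are purely illustrative:)

  commit_delay = 50        # microseconds to wait before flushing WAL
  commit_siblings = 5      # only wait if at least this many other
                           # transactions are open at commit time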

The tests I did are exactly the environment where commit_delay might
be expected to help.  And it did help, just not all that much.  One
of the problems is that while it does wait for those others to come in
and then flushes them in one fsync, often the others never
get woken up promptly to realize that they have already been
flushed.  They continue to block.

The group_commit patch, on the other hand, accomplishes exactly what
commit_delay was intended to accomplish but doesn't do a very good job
of.

With the -N option, I also used commit_delay on top of group_commit,
and the difference between the two look like it was within the margin
of error.  So commit_delay did not obviously cause further
improvement.


 It can be
 useful for throughput oriented benchmarks though, which is why I'd say it
 hasn't been killed off yet.

 We'll have to see whether the final form this makes sense in will usefully
 replace that sort of thing.  I'd certainly be in favor of nuking
 commit_delay and commit_siblings with a better option; it would be nice if
 we don't eliminate this tuning option in the process though.

But I'm pretty sure that group_commit has stolen that thunder.
Obviously a few benchmarks on one system isn't enough to prove that,
though.

The only use case I see left for commit_delay is where it is set on a
per-connection basis rather than system-wide.  Once you start a fsync,
everyone who missed the bus is locked out until the next one.  So
low-priority connections can set commit_delay so as not to trigger the
bus to leave before the high priority process gets on.  But that seems
like a pretty tenuous use case with better ways to do it.

Cheers,

Jeff



Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables

2012-01-30 Thread Tom Lane
Ashutosh Bapat ashutosh.ba...@enterprisedb.com writes:
 So, as I understand it, we have two problems here:
 1. Prefixing the schema name to the fake aliases if there is another RTE with
 the same name. There may not be a relation with that name (the fake alias
 name given) in the schema chosen as the prefix.
 2. Fake aliases themselves can be conflicting.

Well, the issue is more that a fake alias might unintentionally collide
with a regular alias elsewhere in the query.  There's no guard against
that in the current behavior, and ISTM there needs to be.

The one possibly-simplifying thing about this whole issue is that we
needn't cater for references that couldn't have been made in the
original query.  For instance, if the inner and outer queries both have
explicit aliases tx, it's impossible for the inner query to have
referred to any columns of the outer tx --- so we don't have to try to
make it possible in the dumped form.

 If I understand correctly, if we solve the second problem, the first problem
 will not occur. Is that correct?

I don't believe that the problem has anything to do with the names of
other tables that might happen to exist in the database.  It's a matter
of what RTE names/aliases are exposed for variable references in
different parts of the query.

regards, tom lane



Re: [HACKERS] patch for parallel pg_dump

2012-01-30 Thread Robert Haas
On Sun, Jan 29, 2012 at 12:00 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Joachim Wieland j...@mcknight.de writes:
 I know that you took back some of your comments, but I'm with you
 here. Archive is allocated as an ArchiveHandle and then casted back to
 Archive*, so you always know that an Archive is an ArchiveHandle. I'm
 all for getting rid of Archive and just using ArchiveHandle throughout
 pg_dump which would get rid of these useless casts.

 I'd like to see a more thoroughgoing look at the basic structure of
 pg_dump.  Everybody who's ever looked at that code has found it
 confusing, with the possible exception of the original author who is
 long gone from the project anyway.  I don't know exactly what would make
 it better, but the useless distinction between Archive and ArchiveHandle
 seems like a minor annoyance, not the core disease.

 Not that there'd be anything wrong with starting with that.

After some study, I'm reluctant to completely abandon the
Archive/ArchiveHandle distinction because it seems to me that it does
serve some purpose: right now, nothing in pg_dump.c - where all the
code to actually dump stuff lives - knows anything about what's inside
the ArchiveHandle, just the Archive.  So having two data structures
really does serve a purpose: if it's part of the Archive, you need it
in order to query the system catalogs and generate SQL.  If it isn't,
you don't.  Considering how much more crap there is inside an
ArchiveHandle than an Archive, I don't think we should lightly abandon
that distinction.

Now, that having been said, there are probably better ways of making
that distinction than what we have here; Archive and ArchiveHandle
could be better named, perhaps, and we could have pointers from one
structure to the other instead of magically embedding them inside each
other.  All the function pointers that are part of the ArchiveHandle
could be separated out into a static, constant structure with a name
like ArchiveMethod, and we could keep a pointer to the appropriate
ArchiveMethod inside each ArchiveHandle instead of copying all the
pointers into it.  I think that'd be a lot less confusing.
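
As a rough illustration of that last idea (the struct and member names here
are invented for the sketch, and Archive refers to pg_dump's existing public
struct, so this is not a drop-in definition):

  typedef struct _archiveHandle ArchiveHandle;

  typedef struct ArchiveMethod
  {
      void (*WriteData) (ArchiveHandle *AH, const void *data, size_t len);
      void (*CloseArchive) (ArchiveHandle *AH);
      /* ... one member per callback currently copied into ArchiveHandle ... */
  } ArchiveMethod;

  struct _archiveHandle
  {
      Archive              public_part;  /* the part pg_dump.c gets to see */
      const ArchiveMethod *method;       /* one shared, static table per format */
      /* ... remaining private state (connections, buffers, etc.) ... */
  };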

But the immediate problem is that pg_dump.c is heavily reliant on
global variables, which isn't going to fly if we want this code to use
threads on Windows (or anywhere else).  It's also bad style.  So I
suggest that we refactor pg_dump.c to get rid of g_conn and g_fout.
Getting rid of g_fout should be fairly easy: in many places, we're
already passing Archive *fout as a parameter.  If we pass it as a
parameter to the functions that need it but aren't yet getting it as a
parameter, then it can cease to exist as a global variable and become
local to main().

Getting rid of g_conn is a little harder.  Joachim's patch takes the
view that we can safely overload the existing ArchiveHandle.connection
member.  Currently, that member is the connection to which we are
doing a direct to database restore; what he's proposing to do is also
use it to store the connection from which are doing the dump.  But it
seems possible that we might someday want to dump from one database
and restore into another database at the same time, so maybe we ought
to play it safe and use different variables for those things.  So I'm
inclined to think we could just add a PGconn *remote_connection
argument to the Archive structure (to go with all of the similarly
named remote fields, all of which describe the DB to be dumped) and
then that would be available everywhere that the Archive itself is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Simulating Clog Contention

2012-01-30 Thread Robert Haas
On Mon, Jan 30, 2012 at 10:53 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Thu, Jan 26, 2012 at 6:18 AM, Abhijit Menon-Sen a...@toroid.org wrote:
 At 2012-01-12 12:31:20 +, si...@2ndquadrant.com wrote:

 The following patch adds a pgbench option -I to load data using
 INSERTs

 This is just to confirm that the patch applies and builds and works
 fine (though of course it does take a long time… pity there doesn't
 seem to be any easy way to add progress indication like -i has).

 I was thinking the opposite.  That -i should only print progress
 indication when -d is given.  Or at least knock an order of magnitude
 or two off of how often it does so.

I'd be all in favor of having -i emit progress reports 10x less
often; even on a laptop, the current reports are very chatty.  But I
think 100x less often might be taking it too far.

Either way, if we're going to have an option for inserts, they should
produce the same progress reports that COPY does - though possibly
more often, since I'm guessing it's likely to be way slower.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Simulating Clog Contention

2012-01-30 Thread Robert Haas
On Mon, Jan 30, 2012 at 10:48 AM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Mon, Jan 30, 2012 at 7:24 AM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Jan 28, 2012 at 3:32 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 I think that even in normal (non-initialization) usage, this message
 should be suppressed when the provided scale factor
 is equal to the pgbench_branches table count.

 The attached patch does just that.  There is probably no reason to
 warn people that we are doing what they told us to, but not for the
 reason they think.

 In my opinion, a more sensible approach than anything we're doing
 right now would be to outright *reject* options that will only be
 ignored.  If -s isn't supported except with -i, then trying to specify
 -s without -i should just error out at the options-parsing stage,
 before we even try to connect to the database.  It's not very helpful
 to accept options and then ignore them, and we have many instances of
 that right now: initialization-only switches are accepted and ignored
 when not initializing, and run-only switches are accepted and ignored
 with initializing.

 I like the ability to say, effectively, "I think I previously did
 the initialization with -s 40; if I actually didn't, then scream at me,
 and if I did, then go ahead and do the pgbench run I just asked for."
 But, since it does unconditionally report the scale actually used and
 I just have to have the discipline to go look at the result, I can see
 where this is perhaps overkill.   In my own (non-PG-related) code,
 when I have tasks that have to be run in multiple phases that can get
 out of sync if I am not careful, I like to be able to specify the
 flags even in the unused invocation, so that the code can verify I
 am being consistent.  Code is better at that than I am.

I guess my point is - if we're gonna have it do that, then shouldn't
we actually *error out* if the scales don't match, not just print a
warning and then charge ahead?

 I'm not sure I know what all would be incompatible with what.  I could
 start drawing that matrix up once the API stabilizes, but I think you
 are still planning on whacking this -I option around a bit.

It's mostly that everything in the initialization options section of
pgbench --help is incompatible with everything in the benchmarking
options section, and the other way around.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [GENERAL] Why extract( ... from timestamp ) is not immutable?

2012-01-30 Thread hubert depesz lubaczewski
On Wed, Jan 25, 2012 at 11:30:49AM -0500, Tom Lane wrote:
 hubert depesz lubaczewski dep...@depesz.com writes:
  anyway - the point is that in \df date_part(, timestamp) says it's
  immutable, while it is not.
 
 Hmm, you're right.  I thought we'd fixed that way back when, but
 obviously not.  Or maybe the current behavior of the epoch case
 postdates that.

is there a chance something will happen with/about it?

Preferably I would like extract( epoch from timestamp ) to be really
immutable, i.e. (in my opinion) it should treat the incoming data as UTC
for the epoch calculation.
Alternatively, perhaps epoch extraction could be moved to a specialized
function, which would have swapped mutability:

get_epoch(timestamptz) would be immutable
while
get_epoch(timestamp) would be stable
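
A user-level sketch of that alternative (the names are just the ones
proposed above; the timestamptz version can honestly be immutable, because
the epoch of a timestamptz does not depend on the TimeZone setting):

create function get_epoch(timestamptz) returns double precision
    language sql immutable
    as $$ select extract(epoch from $1) $$;

create function get_epoch(timestamp) returns double precision
    language sql stable         -- result depends on the TimeZone setting
    as $$ select extract(epoch from $1) $$;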

Best regards,

depesz

-- 
The best thing about modern society is how easy it is to avoid contact with it.
 http://depesz.com/



Re: [HACKERS] Hot standby off of hot standby?

2012-01-30 Thread Josh Berkus
On 1/29/12 8:36 PM, Igor Schtein wrote:
 Hi,
 
 Is it possible to use a standby instance as a master/primary for another
 standby in Postgres 9.0?  In other words, does PG 9.0 support cascading
 standby configuration?

No, that's a 9.2 feature in development.

If you can build PostgreSQL from source and apply patches, we could use
your help testing it!

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-01-30 Thread Marko Kreen
On Mon, Jan 30, 2012 at 06:06:57PM +0900, Kyotaro HORIGUCHI wrote:
 The gain in performance is more than expected. The measurement script
 now runs the query via dblink ten times for more stable measurements,
 so the figures are about ten times longer than the previous ones.
 
                        sec    % to Original
 Original             : 31.5   100.0%
 RowProcessor patch   : 31.3    99.4%
 dblink patch         : 24.6    78.1%
 
 The RowProcessor patch alone shows no loss (or a very small gain), and
 the full patch gives us a 22% gain on the benchmark (*1).

Excellent!

 - The callers of RowProcessor() no longer set null_field to
   PGrowValue.value. Plus, the PGrowValue[] which RowProcessor()
   receives has nfields + 1 elements, to allow a rough size
   estimate via cols->value[nfields].value - cols->value[0].value -
   something.  The something here is 4 * nfields for protocol 3 and
   4 * (non-null fields) for protocol 2. I fear that this applies
   only to textual transfer usage...

Exact estimate is not important here.  And the (nfields + 1) element
feels like a bit too much magic, considering that most users probably
do not need it.  Without it, the logic would be:

 total = last.value - first.value + ((last.len > 0) ? last.len : 0)

which isn't too complex.  So I think we can remove it.


= Problems =

* Remove the dubious memcpy() in pqAddRow()

* I think the dynamic arrays in getAnotherTuple() are not portable enough;
  please do proper allocation for the array.  I guess in PQsetResultAttrs()?


= Minor notes =

These can be argued either way, if you don't like some
suggestion, you can drop it.

* Move PQregisterRowProcessor() into fe-exec.c, then we can make
  pqAddRow static.

* Should PQclear() free the RowProcessor error msg?  It seems
  it should not get outside of getAnotherTuple(), but
  that's not certain.  Perhaps it would be clearer to free
  it here too.

* Remove the part of comment in getAnotherTuple():
   * Buffer content may be shifted on reloading additional
   * data. So we must set all pointers on every scan.

  It's confusing why it needs to clarify that, as there
  is nobody expecting it.

* PGrowValue documentation should mention that ->value pointer
  is always valid.

* dblink: Perhaps some of those mallocs() could be replaced
  with pallocs() or even StringInfo, which already does
  the realloc dance?  I'm not familiar with dblink, and
  various struct lifetimes there, so I don't know if that
  actually makes sense or not.
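
  (For reference, the StringInfo idiom I mean is roughly the usual
  server-side pattern below; the wrapper function is just illustrative.)

      #include "postgres.h"
      #include "lib/stringinfo.h"

      static char *
      collect_field(const char *value, int len)
      {
          StringInfoData buf;

          initStringInfo(&buf);                     /* palloc's an initial buffer */
          appendBinaryStringInfo(&buf, value, len); /* grows it as needed */
          return buf.data;                          /* palloc'd; free with pfree() */
      }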


It seems this patch is getting ReadyForCommitter soon...

-- 
marko




Re: [HACKERS] [v9.2] Add GUC sepgsql.client_label

2012-01-30 Thread Robert Haas
On Sun, Jan 29, 2012 at 7:27 AM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 2012/1/28 Kohei KaiGai kai...@kaigai.gr.jp:
 2012/1/26 Robert Haas robertmh...@gmail.com:
 On Thu, Jan 26, 2012 at 2:07 PM, Kohei KaiGai kai...@kaigai.gr.jp wrote:
 2012/1/26 Robert Haas robertmh...@gmail.com:
 I'm wondering if a function would be a better fit than a GUC.  I don't
 think you can really restrict the ability to revert a GUC change -
 i.e. if someone does a SET and then a RESET, you pretty much have to
 allow that.  I think.  But if you expose a function then it can work
 however you like.

 One benefit to use GUC is that we can utilize existing mechanism to
 revert a value set within a transaction block on error.
 If we implement same functionality with functions, XactCallback allows
 sepgsql to get control on appropriate timing?

 Not sure, but I thought the use case was to set this at connection
 startup time and then hand the connection off to a client.  What keeps
 the client from just issuing RESET?

 In the use case of Joshua, the original security label has no privileges
 to reference any tables, and it cannot transition to any domain without
 the credential within IC-cards. Thus, RESET is harmless.

 However, I also assume one other possible use-case: the originator
 has privileges to transition to 10 different domains, but is not allowed
 to revert. In this case, RESET without permission checks is harmful.
 (This scenario would suit a multi-category model.)

 Apart from this issue, I found a problem in the implementation using GUC
 options. It makes the backend crash if it tries to reset sepgsql.client_label
 without permission within an error handler, because of double errors.

 I found an idea to avoid this scenario.

 The problematic case is resetting the GUC variable because of a transaction
 rollback combined with a permission violation; however, we don't intend to
 apply permission checks there, since the value was never committed.
 Thus, I considered checking the state of the transaction using
 IsTransactionState() in the assign_hook of the GUC variable.

 Unlike a function-based implementation, this can use the existing
 infrastructure to make the client_label transaction-aware.

 It shall be implemented as:

 void
 sepgsql_assign_client_label(const char *newval, void *extra)
 {
    if (!IsTransactionState())
        return;

    /* check whether valid security context */

    /* check permission to switch the current context */
 }

 How about such an implementation?

Well, I think the idea that GUC changes can be reverted at any time is
fairly deeply ingrained in the GUC machinery.  So I think that's a
hack: it might happen to work today (or at least on the cases we can
think of to test), but I wouldn't be a bit surprised if there are
other cases where it fails, either now or in the future.  I don't see
any good reason to assume that unrevertable changes are OK even inside
a transaction context.  For example, if someone applies a context to a
function with ALTER FUNCTION, they're going to expect that the altered
context remains in effect while the function is running and gets
reverted on exit.  If you throw an error when the system tries to
revert the change at function-exit time, it might not crash the
server, but it's certainly going to violate the principle of least
astonishment.

Of course, we already have a trusted procedure mechanism which is
intended to support temporary changes to the effective security label,
so you might say, hey, people shouldn't do that.  But they might.  And
I wouldn't like to bet that's the only case that could be problematic
either.  What about a setting in postgresql.conf?  You could end up
being asked to set the security label to some other value before
you've initialized it.  What about SET LOCAL?  It's not OK to fail to
revert that at transaction exit.

Hence my suggestion of a function: if you use functions, you can
implement whatever semantics you want.
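
Something along these lines, for instance (the function name and label are
purely illustrative, not an existing API):

  SELECT sepgsql_set_client_label('user_u:user_r:user_webapp_t:s0');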

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [COMMITTERS] pgsql: Make group commit more effective.

2012-01-30 Thread Robert Haas
On Mon, Jan 30, 2012 at 9:55 AM, Heikki Linnakangas
heikki.linnakan...@iki.fi wrote:
 Make group commit more effective.

 When a backend needs to flush the WAL, and someone else is already flushing
 the WAL, wait until it releases the WALInsertLock and check if we still need
 to do the flush or if the other backend already did the work for us, before
 acquiring WALInsertLock. This helps group commit, because when the WAL flush
 finishes, all the backends that were waiting for it can be woken up in one
 go, and they can all concurrently observe that they're done, rather than
 waking them up one by one in a cascading fashion.

 This is based on a new LWLock function, LWLockWaitUntilFree(), which has
 peculiar semantics. If the lock is immediately free, it grabs the lock and
 returns true. If it's not free, it waits until it is released, but then
 returns false without grabbing the lock. This is used in XLogFlush(), so
 that when the lock is acquired, the backend flushes the WAL, but if it's
 not, the backend first checks the current flush location before retrying.

 Original patch and benchmarking by Peter Geoghegan and Simon Riggs, although
 this patch as committed ended up being very different from that.

Either this patch, or something else committed this morning, is
causing make check to hang or run extremely slowly for me.  I think
it's this patch, because I attached to a backend and stopped it a few
times, and all the backtraces look like this:

#0  0x7fff8a545b22 in semop ()
#1  0x0001001ff8df in PGSemaphoreLock (sema=0x103d7de70,
interruptOK=0 '\0') at pg_sema.c:418
#2  0x00010024d7dd in LWLockWaitUntilFree (lockid=<value
temporarily unavailable, due to optimizations>, mode=<value
temporarily unavailable, due to optimizations>) at lwlock.c:666
#3  0x00010005d3b3 in XLogFlush (record=<value temporarily
unavailable, due to optimizations>) at xlog.c:2148
#4  0x0001000506bb in CommitTransaction () at xact.c:1113
#5  0x000100050b35 in CommitTransactionCommand () at xact.c:2613
#6  0x00010025a403 in finish_xact_command () at postgres.c:2388
#7  0x00010025d525 in exec_simple_query (query_string=0x101055638
CREATE INDEX wowidx ON test_tsvector USING gin (a);) at
postgres.c:1052
#8  0x00010025dfc1 in PostgresMain (argc=2, argv=value
temporarily unavailable, due to optimizations, username=value
temporarily unavailable, due to optimizations) at postgres.c:3881
#9  0x00010020c258 in ServerLoop () at postmaster.c:3587
#10 0x00010020d167 in PostmasterMain (argc=6, argv=0x100d08f40) at
postmaster.c:1110
#11 0x00010019e745 in main (argc=6, argv=0x100d08f40) at main.c:199

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Patch pg_is_in_backup()

2012-01-30 Thread Gilles Darold
Hello,

Yesterday I was facing a little issue with a piece of backup software
(NetBackup) that does not report errors when a post-backup script is run.
The problem is that this script executes pg_stop_backup(), and if there's
any failure PostgreSQL keeps running in on-line backup mode. So the backup
is not complete, and the next one fails as well because the call to
pg_start_backup() will fail.

After some time searching for a Pg system administration function like
pg_is_in_recovery(), let's say pg_is_in_backup(), I couldn't find one.
The minor patch attached here adds this administrative function so that it
can be used alongside the other backup control functions. Outside the
documentation it is a very small patch, because the function is only a
wrapper around the internal BackupInProgress() function, just as
pg_is_in_recovery() calls RecoveryInProgress().

I hope it can be included, as it could save time for sysadmins who want to
monitor the on-line backup process and do not have SQL knowledge. That
said, it's always possible to check whether the backup_label file exists
under PGDATA, but that has to be done on the host with postgres
privileges. The other way is to create a plpgsql function with SECURITY
DEFINER that has the same effect as the patch above, except that it
requires some SQL knowledge. I give it here for web searchers, thanks to
Guillaume and Marc.

CREATE OR REPLACE FUNCTION pg_is_in_backup ( )
RETURNS BOOLEAN AS $$
DECLARE is_exists BOOLEAN;
BEGIN
-- Set a secure search_path: trusted schemas, then 'pg_temp'.
PERFORM pg_catalog.set_config('search_path', 'pg_temp', true);
SELECT ((pg_stat_file('backup_label')).modification IS NOT NULL)
INTO is_exists;
RETURN is_exists;
EXCEPTION
WHEN undefined_file THEN
  RETURN false;
END
$$ LANGUAGE plpgsql SECURITY DEFINER;
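
With either the wrapper above or the patched built-in in place (this
assumes one of them exists), a monitoring check reduces to a single query,
for example:

SELECT CASE WHEN pg_is_in_backup()
            THEN 'WARNING: server is still in on-line backup mode'
            ELSE 'OK'
       END AS backup_mode_check;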

Regards,

-- 
Gilles Darold
http://dalibo.com - http://dalibo.org

diff -ru postgresql/doc/src/sgml/func.sgml postgresql-is_in_backup-patch/doc/src/sgml/func.sgml
--- postgresql/doc/src/sgml/func.sgml	2012-01-26 23:01:31.956613398 +0100
+++ postgresql-is_in_backup-patch/doc/src/sgml/func.sgml	2012-01-26 23:14:29.468613019 +0100
@@ -14371,6 +14371,9 @@
     <primary>pg_current_xlog_location</primary>
    </indexterm>
    <indexterm>
+    <primary>pg_is_in_backup</primary>
+   </indexterm>
+   <indexterm>
     <primary>pg_start_backup</primary>
    </indexterm>
    <indexterm>
@@ -14424,6 +14427,14 @@
      </row>
      <row>
       <entry>
+        <literal><function>pg_is_in_backup()</function></literal>
+        </entry>
+       <entry><type>bool</type></entry>
+       <entry>True if an on-line backup is still in progress.
+       </entry>
+      </row>
+      <row>
+       <entry>
        <literal><function>pg_start_backup(<parameter>label</> <type>text</> <optional>, <parameter>fast</> <type>boolean</> </optional>)</function></literal>
        </entry>
       <entry><type>text</type></entry>
diff -ru postgresql/src/backend/access/transam/xlogfuncs.c postgresql-is_in_backup-patch/src/backend/access/transam/xlogfuncs.c
--- postgresql/src/backend/access/transam/xlogfuncs.c	2012-01-26 23:01:32.032613398 +0100
+++ postgresql-is_in_backup-patch/src/backend/access/transam/xlogfuncs.c	2012-01-26 23:16:04.100612972 +0100
@@ -465,3 +465,12 @@
 {
 	PG_RETURN_BOOL(RecoveryInProgress());
 }
+
+/*
+ * Returns true if the server is currently in on-line backup mode (a global state).
+ */
+Datum
+pg_is_in_backup(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_BOOL(BackupInProgress());
+}
diff -ru postgresql/src/include/access/xlog_internal.h postgresql-is_in_backup-patch/src/include/access/xlog_internal.h
--- postgresql/src/include/access/xlog_internal.h	2012-01-26 23:01:32.332613398 +0100
+++ postgresql-is_in_backup-patch/src/include/access/xlog_internal.h	2012-01-26 23:22:51.492612774 +0100
@@ -281,5 +281,6 @@
 extern Datum pg_xlog_replay_pause(PG_FUNCTION_ARGS);
 extern Datum pg_xlog_replay_resume(PG_FUNCTION_ARGS);
 extern Datum pg_is_xlog_replay_paused(PG_FUNCTION_ARGS);
+extern Datum pg_is_in_backup(PG_FUNCTION_ARGS);
 
 #endif   /* XLOG_INTERNAL_H */
diff -ru postgresql/src/include/catalog/pg_proc.h postgresql-is_in_backup-patch/src/include/catalog/pg_proc.h
--- postgresql/src/include/catalog/pg_proc.h	2012-01-26 23:01:32.340613398 +0100
+++ postgresql-is_in_backup-patch/src/include/catalog/pg_proc.h	2012-01-27 09:26:09.918736657 +0100
@@ -2899,6 +2899,8 @@
 DESCR("prepare for taking an online backup");
 DATA(insert OID = 2173 ( pg_stop_backup			PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 25 "" _null_ _null_ _null_ _null_ pg_stop_backup _null_ _null_ _null_ ));
 DESCR("finish taking an online backup");
+DATA(insert OID = 3882 (  pg_is_in_backup		PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 16 "" _null_ _null_ _null_ _null_ pg_is_in_backup _null_ _null_ _null_ ));
+DESCR("true if server is in on-line backup");
 DATA(insert OID = 2848 ( pg_switch_xlog			PGNSP PGUID 12 1 0 0 0 f f f t f v 0 0 25 "" _null_ _null_ _null_ _null_ pg_switch_xlog _null_ _null_ _null_ ));
 DESCR("switch to new xlog file");
 DATA(insert OID = 3098 ( pg_create_restore_point	PGNSP PGUID 12 1 0 0 

Re: [HACKERS] [GENERAL] Why extract( ... from timestamp ) is not immutable?

2012-01-30 Thread Josh Berkus

 preferably I would see extract( epoch from timestamp ) to be really
 immutable, i.e. (in my opinion) it should treat incoming data as UTC
 - for epoch calculation.
 Alternatively - perhaps epoch extraction should be moved to specialized
 function, which would have swapped mutability:

We can't have functions which are immutable or not depending on their
inputs.  That way lies madness.

 get_epoch(timestamptz) would be immutable
 while
 get_epoch(timestamp) would be stable

Well, to_epoch, in order to be consistent with other conversion functions.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Why extract( ... from timestamp ) is not immutable?

2012-01-30 Thread hubert depesz lubaczewski
On Mon, Jan 30, 2012 at 10:35:21AM -0800, Josh Berkus wrote:
 
  preferably I would see extract( epoch from timestamp ) to be really
  immutable, i.e. (in my opinion) it should treat incoming data as UTC
  - for epoch calculation.
  Alternatively - perhaps epoch extraction should be moved to specialized
  function, which would have swapped mutability:
 
 We can't have functions which are immutable or not depending on their
 inputs.  That way lies madness.

but this is exactly what's happening now.
extract( ... from timestamp ) is marked as immutable, while in some cases
(namely when you extract the epoch) it should only be stable, because the
function's return value changes with the session's TimeZone setting.
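
On the releases being discussed here the dependency is easy to see; the
expected outputs below assume the pre-fix behaviour described above:

SET timezone = 'UTC';
SELECT extract(epoch FROM timestamp '2012-01-30 00:00:00');
-- 1327881600
SET timezone = 'America/New_York';
SELECT extract(epoch FROM timestamp '2012-01-30 00:00:00');
-- 1327899600, i.e. shifted by the UTC offset of the session's TimeZone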

Best regards,

depesz

-- 
The best thing about modern society is how easy it is to avoid contact with it.
 http://depesz.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-30 Thread Simon Riggs
On Mon, Jan 30, 2012 at 3:32 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 30.01.2012 17:18, Simon Riggs wrote:

 I asked clearly and specifically for you to hold back committing
 anything. Not sure why you would ignore that and commit without
 actually asking myself or Peter. On a point of principle alone, I
 think you should revert. Working together is difficult if
 communication channels are openly ignored and disregarded.


 You must be referring to this:

 http://archives.postgresql.org/pgsql-hackers/2012-01/msg01406.php

 What I committed in the end was quite different from the version that was in
 reply to, too. If you have a specific objection to the patch as committed,
 please let me know.

I said, "There is much yet to discuss, so please don't think about
committing anything yet."

There's not really any way you could misinterpret those words.


 Peter and I have been working on a new version that seems likely to
 improve performance over your suggestions. We should be showing
 something soon.


 Please post those ideas, and let's discuss them. If it's something simple,
 maybe we can still sneak them into this release. Otherwise, let's focus on
 the existing patches that are pending review or commit.

If you really did want to discuss it, it would have taken you 5
minutes to check whether there was consensus on the patch before
committing it. Your actions betray the opposite of a willingness to
discuss anything.

Yes, I'd like to discuss ideas, not just ram home a half-discussed and
half-finished patch that happens to do things the way you personally
prefer, overriding all inputs.

Especially when you know we're working on another version.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-30 Thread Heikki Linnakangas

On 30.01.2012 21:55, Simon Riggs wrote:

On Mon, Jan 30, 2012 at 3:32 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

On 30.01.2012 17:18, Simon Riggs wrote:



Peter and I have been working on a new version that seems likely to
improve performance over your suggestions. We should be showing
something soon.


Please post those ideas, and let's discuss them. If it's something simple,
maybe we can still sneak them into this release. Otherwise, let's focus on
the existing patches that are pending review or commit.


If you really did want to discuss it, it would have taken you 5
minutes to check whether there was consensus on the patch before
committing it. Your actions betray the opposite of a willingness to
discuss anything.

Yes, I'd like to discuss ideas, not just ram home a half-discussed and
half-finished patch that happens to do things the way you personally
prefer, overriding all inputs.

Especially when you know we're working on another version.


Sorry. So, what's the approach you're working on?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] CLOG contention, part 2

2012-01-30 Thread Robert Haas
On Fri, Jan 27, 2012 at 8:21 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 On Fri, Jan 27, 2012 at 3:16 PM, Merlin Moncure mmonc...@gmail.com wrote:
 On Fri, Jan 27, 2012 at 4:05 PM, Jeff Janes jeff.ja...@gmail.com wrote:
 Also, I think the general approach is wrong.  The only reason to have
 these pages in shared memory is that we can control access to them to
 prevent write/write and read/write corruption.  Since these pages are
 never written, they don't need to be in shared memory.   Just read
 each page into backend-local memory as it is needed, either
 palloc/pfree each time or using a single reserved block for the
 lifetime of the session.  Let the kernel worry about caching them so
 that the above mentioned reads are cheap.

 right -- exactly.  but why stop at one page?

 If you have more than one, you need code to decide which one to evict
 (just free) every time you need a new one.  And every process needs to
 be running this code, while the kernel is still going to need to make its
 own decisions for the entire system.  It seems simpler to just let the
 kernel do the job for everyone.  Are you worried that a read syscall
 is going to be slow even when the data is presumably cached in the OS?

I think that would be a very legitimate worry.  You're talking about
copying 8kB of data because you need two bits.  Even if the
user/kernel mode context switch is lightning-fast, that's a lot of
extra data copying.
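
For scale, the arithmetic behind the two bits (constants mirroring
src/backend/access/transam/clog.c; standalone sketch only):

#include <stdio.h>

#define BLCKSZ              8192
#define CLOG_BITS_PER_XACT  2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)

int
main(void)
{
    unsigned int xid = 123456789;
    unsigned int page   = xid / CLOG_XACTS_PER_PAGE;
    unsigned int byteno = (xid % CLOG_XACTS_PER_PAGE) / CLOG_XACTS_PER_BYTE;
    unsigned int bshift = (xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT;

    /* one 8 kB page holds the status of 32768 transactions */
    printf("xid %u: clog page %u, byte %u, bit shift %u\n",
           xid, page, byteno, bshift);
    return 0;
}

Reading one such status through a read() syscall means copying the whole
8192-byte page to get at a 2-bit value.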

In a previous commit, 33aaa139e6302e81b4fbf2570be20188bb974c4f, we
increased the number of CLOG buffers from 8 to 32 (except in very
low-memory configurations).  The main reason that change shows a win on
Nate Boley's 32-core test machine appears to be that it avoids the
scenario where, say, 12 people simultaneously want to read 12 different
CLOG pages, so 4 of them have to wait for a buffer to become available
before they can even think about starting a read.  The really bad
latency spikes were happening not because the I/O took a long time, but
because it couldn't be started immediately.  Those spikes are now gone
as a result of that commit.  You can probably get them back with enough
cores, but you'll likely hit a lot of other, more serious problems first.

I assume that if there's any purpose to further optimization here,
it's either because the overall miss rate of the cache is too large,
or because the remaining locking costs are too high.  Unfortunately I
haven't yet had time to look at this patch and understand what it
does, or machine cycles available to benchmark it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Make group commit more effective.

2012-01-30 Thread Heikki Linnakangas

On 30.01.2012 20:27, Robert Haas wrote:

Either this patch, or something else committed this morning, is
causing make check to hang or run extremely slowly for me.  I think
it's this patch, because I attached to a backend and stopped it a few
times, and all the backtraces look like this:


Yeah, sure looks like it's the group commit commit. It works for me, and 
staring at the code, I have no idea what could be causing it. The 
buildfarm seems happy too, so this is pretty mysterious.


I did find one bug, see attached, but AFAICS it should only cause 
unnecessary wakeups in some corner cases, which is harmless.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
commit 73b479198d9e7e1cbdb2da705a0cd9226b0d8265
Author: Heikki Linnakangas heikki.linnakan...@iki.fi
Date:   Mon Jan 30 20:56:35 2012 +0200

Fix thinko in the group commit patch.

When releasing a lwlock, if the first waiter in the queue waited for a
shared lock, and all the rest were of the new wait-until-free kind, the
releaseOK flag would not be cleared. That's harmless AFAICS, but it was
not intended.

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index bee35b8..eb9ff95 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -768,9 +768,9 @@ LWLockRelease(LWLockId lockid)
 			while (proc->lwWaitLink != NULL &&
 				   proc->lwWaitLink->lwWaitMode != LW_EXCLUSIVE)
 			{
-				proc = proc->lwWaitLink;
 				if (proc->lwWaitMode != LW_WAIT_UNTIL_FREE)
 					releaseOK = false;
+				proc = proc->lwWaitLink;
 			}
 			}
 			/* proc is now the last PGPROC to be released */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Make group commit more effective.

2012-01-30 Thread Heikki Linnakangas

On 30.01.2012 22:50, Heikki Linnakangas wrote:

On 30.01.2012 20:27, Robert Haas wrote:

Either this patch, or something else committed this morning, is
causing make check to hang or run extremely slowly for me. I think
it's this patch, because I attached to a backend and stopped it a few
times, and all the backtraces look like this:


Yeah, sure looks like it's the group commit commit. It works for me, and
staring at the code, I have no idea what could be causing it. The
buildfarm seems happy too, so this is pretty mysterious.


And just after sending that, I succeeded in reproducing this. I had to 
lower wal_buffers to a small value to make it happen. I'm debugging this 
now...


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch pg_is_in_backup()

2012-01-30 Thread Euler Taveira de Oliveira
On 30-01-2012 15:33, Gilles Darold wrote:
 Yesterday I was facing a little issue with a piece of backup software
 (NetBackup) that does not report errors when a post-backup script is run.
 The problem is that this script executes pg_stop_backup(), and if there's
 any failure PostgreSQL keeps running in on-line backup mode. So the backup
 is not complete, and the next one fails as well because the call to
 pg_start_backup() will fail.
 
I use something similar to your pl/PgSQL version and was about to propose
something along these lines to core. +1 to include it. Please, add your patch
to the next CF.


-- 
   Euler Taveira de Oliveira - Timbira   http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Make group commit more effective.

2012-01-30 Thread Heikki Linnakangas

On 30.01.2012 23:06, Heikki Linnakangas wrote:

On 30.01.2012 22:50, Heikki Linnakangas wrote:

On 30.01.2012 20:27, Robert Haas wrote:

Either this patch, or something else committed this morning, is
causing make check to hang or run extremely slowly for me. I think
it's this patch, because I attached to a backend and stopped it a few
times, and all the backtraces look like this:


Yeah, sure looks like it's the group commit commit. It works for me, and
staring at the code, I have no idea what could be causing it. The
buildfarm seems happy too, so this is pretty mysterious.


And just after sending that, I succeeded to reproduce this. I had to
lower wal_buffers to a small value to make it happen. I'm debugging this
now..


It was a bug in the LWLockRelease code, after all. Fixed. Unfortunately 
this added a couple more instructions to that critical codepath, but I 
think it should still go without notice. Let me know if this doesn't fix 
the hang on your laptop.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-30 Thread Simon Riggs
On Mon, Jan 30, 2012 at 8:04 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 So, what's the approach you're working on?

I had a few days' leave at the end of last week, so there was no time to
fully discuss the next steps with the patch. That's why you were asked
not to commit anything.

You've suggested there was no reason to have the WALwriter be involved,
which isn't the case, and made other comments about latches that weren't
correct either.

The plan here is to allow the WAL flush and the clog update to occur
concurrently, which allows the clog contention and update time to be
completely hidden behind the wait for the WAL flush. That is only
possible if we have the WALwriter involved, since we need two processes
to be actively working at the same time.

It's a relatively minor change and uses code that is already committed
and working, not some just-invented low-level stuff that might not work
right. You might then ask: why the delay? Simply because my absence has
prevented moving forward. We'll have a patch tomorrow.

The theory behind this is clear, but needs some explanation.

There are 5 actions that need to occur at commit
1) insert WAL record
2) optionally flush WAL record
3) mark the clog AND set LSN from (1) if we skipped (2)
4) optionally wait for sync rep
5) remove the proc from the procarray

Dependencies between those actions are these
Step (3) must always happen before (5) otherwise we get race
conditions in visibility checking.
Step (4) must always happen before (5) otherwise we also get race
conditions in disaster cases.
Step (1) must always happen before (2) if it happens
Step (1) must always happen before (3) if we skipped (2)

Notice that step (2) and step (3) are actually independent of each other.

So an improved design for commit is to
2) request flush up to LSN, but don't wait
3) mark the clog and set LSN
4) wait for LSN once, either for walwriter or walsender to release us

This is free of race conditions as long as step (3) marks each clog
page with a valid LSN, just as we would do for asynchronous commit.

Marking the clog with an LSN ensures that we issue XLogFlush(LSN) on
the clog page before it is written, so we always get WAL flushed to
the desired LSN before the clog mark appears on disk.

Does this cause any other behaviour? No, because the LSN marked on the
clog is always flushed by the time we hit step (5), so there is no
delay in any hint bit setting, or any other effect.

So step (2) requests the flush, which is then performed by the WALwriter.
The backend then performs (3) while the flush takes place, and then waits
at step (4) to be woken.

We only wait once in step 4, rather than waiting for flush at step (2)
and then waiting again at step (4).

So we use the existing code path for TransactionIdAsyncCommitTree()
yet we wait at step (4) using the SyncRep code.

Step 5 happens last, as always.

There are two benefits to this approach
* The clog update happens for free since it is hidden behind a
longer running task
* The implementation uses already tested and robust code for SyncRep
and AsyncCommit

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] foreign key locks, 2nd attempt

2012-01-30 Thread Noah Misch
On Tue, Jan 24, 2012 at 03:47:16PM -0300, Alvaro Herrera wrote:
 The biggest item remaining is the point you raised about multixactid
 wraparound.  This is closely related to multixact truncation and the way
 checkpoints are to be handled.  If we think that MultiXactId wraparound
 is possible, and we need to involve autovacuum to keep it at bay, then I

To prove it possible, we need to prove there exists some sequence of operations
consuming N xids and M > N multixactids.  Have N transactions key-lock N-1
rows apiece, then have each of them key-lock one of the rows locked by each
other transaction.  This consumes N xids and N(N-1) multixactids.  I believe
you could construct a workload with N! multixact usage, too.

Existence proofs are one thing, real workloads another.  My unsubstantiated
guess is that multixactid use will not overtake xid use in bulk on a workload
not specifically tailored to do so.  So, I think it's enough to notice it,
refuse to assign a new multixactid, and tell the user to clear the condition
with a VACUUM FREEZE of all databases.  Other opinions would indeed be useful.

 think the only way to make that work is to add another column to
 pg_class so that each table's oldest multixact is tracked, same as we do
 with relfrozenxid for Xids.  If we do that, I think we can do away with
 most of the MultiXactTruncate junk I added -- things would become a lot
 simpler.  The cost would be bloating pg_class a bit more.  Are we okay
 with paying that cost?  I asked this question some months ago and I
 decided that I would not add the column, but I am starting to lean the
 other way.  I would like some opinions on this.

That's not the only way; autovacuum could just accelerate normal freezing to
advance the multixactid horizon indirectly.  Your proposal could make it far
more efficient, though.

 You asked two questions about WAL-logging locks: one was about the level
 of detail we log for each lock we grab; the other was about
 heap_xlog_update logging the right info.  AFAICS, the main thing that
 makes detailed WAL logging necessary is hot standbys.  That is, the
 standby must have all the locking info so that concurrent transactions
 are similarly locked as in the master ... or am I wrong in that?  (ISTM
 this means I need to fix heap_xlog_update so that it faithfully
 registers the lock info we're storing, not just the current Xid).

Standby servers do not need accurate tuple locks, because it's not possible to
wait on a tuple lock during recovery.  (By the way, the question about log
detail was just from my own curiosity.  We don't especially need an answer to
move forward with this patch, unless you want one.)

 * Columns that are part of the key
   Noah thinks the set of columns should only consider those actually 
 referenced
   by keys, not those that *could* be referenced.

Well, do you disagree?  To me it's low-hanging fruit, because it isolates the
UPDATE-time overhead of this patch to FK-referenced tables rather than all
tables having a PK or PK-like index (often just all tables).

   Also, in a table without columns, are all columns part of the key, or is the
   key the empty set?  I changed HeapSatisfiesHOTUpdate but that seems 
 arbitrary.

It does seem arbitrary.  What led you to switch in a later version?

Thanks,
nm

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for parallel pg_dump

2012-01-30 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 But the immediate problem is that pg_dump.c is heavily reliant on
 global variables, which isn't going to fly if we want this code to use
 threads on Windows (or anywhere else).  It's also bad style.  So I
 suggest that we refactor pg_dump.c to get rid of g_conn and g_fout.

I've looked at that a bit in the past and decided that the notational
overhead would be too much.  OTOH, if we're going to be forced into it
in order to support parallel pg_dump, we might as well do it first in a
separate patch.

 ...  But it
 seems possible that we might someday want to dump from one database
 and restore into another database at the same time, so maybe we ought
 to play it safe and use different variables for those things.  So I'm
 inclined to think we could just add a PGconn *remote_connection
 argument to the Archive structure (to go with all of the similarly
 named remote fields, all of which describe the DB to be dumped) and
 then that would be available everywhere that the Archive itself is.

I always thought that the "remote" terminology was singularly unhelpful.
It implies there's a local connection floating around somewhere, which
of course there is not, and it does nothing to remind you of whether the
connection leads to a database being dumped or a database being restored
into.  If we are going to have two fields, could we name them something
less opaque, perhaps src_connection and dest_connection?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Why extract( ... from timestamp ) is not immutable?

2012-01-30 Thread Tom Lane
hubert depesz lubaczewski dep...@depesz.com writes:
 On Mon, Jan 30, 2012 at 10:35:21AM -0800, Josh Berkus wrote:
 We can't have functions which are immutable or not depending on their
 inputs.  That way lies madness.

 but this is exactly what's happening now.

Well, the current marking is clearly incorrect.  What to do about that
is a bit less clear --- should we downgrade the marking, or change the
function's behavior so that it really is immutable?

I haven't formed an opinion on that myself, other than to think that
it's something that requires more than a moment's thought.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-01-30 Thread Kyotaro HORIGUCHI
This is the fixed version of dblink.c for the row processor.

 i'll re-send the properly fixed patch for dblink.c later.

- malloc error in initStoreInfo throws ERRCODE_OUT_OF_MEMORY. (new error)

- storeHandler() now returns FALSE on malloc failure. Garbage
  cleanup is done in dblink_fetch() or dblink_record_internal().
  The behavior of dblink reporting this error as 'unknown
  error/could not execute query' in the user session is the same
  as before.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/contrib/dblink/dblink.c b/contrib/dblink/dblink.c
index 36a8e3e..7a82ea1 100644
--- a/contrib/dblink/dblink.c
+++ b/contrib/dblink/dblink.c
@@ -63,11 +63,23 @@ typedef struct remoteConn
 	bool		newXactForCursor;		/* Opened a transaction for a cursor */
 } remoteConn;
 
+typedef struct storeInfo
+{
+	Tuplestorestate *tuplestore;
+	int nattrs;
+	MemoryContext oldcontext;
+	AttInMetadata *attinmeta;
+	char** valbuf;
+	int *valbuflen;
+	bool error_occurred;
+	bool nummismatch;
+	ErrorData *edata;
+} storeInfo;
+
 /*
  * Internal declarations
  */
 static Datum dblink_record_internal(FunctionCallInfo fcinfo, bool is_async);
-static void materializeResult(FunctionCallInfo fcinfo, PGresult *res);
 static remoteConn *getConnectionByName(const char *name);
 static HTAB *createConnHash(void);
 static void createNewConnection(const char *name, remoteConn *rconn);
@@ -90,6 +102,10 @@ static char *escape_param_str(const char *from);
 static void validate_pkattnums(Relation rel,
    int2vector *pkattnums_arg, int32 pknumatts_arg,
    int **pkattnums, int *pknumatts);
+static void initStoreInfo(storeInfo *sinfo, FunctionCallInfo fcinfo);
+static void finishStoreInfo(storeInfo *sinfo);
+static int storeHandler(PGresult *res, void *param, PGrowValue *columns);
+
 
 /* Global */
 static remoteConn *pconn = NULL;
@@ -503,6 +519,7 @@ dblink_fetch(PG_FUNCTION_ARGS)
 	char	   *curname = NULL;
 	int			howmany = 0;
 	bool		fail = true;	/* default to backward compatible */
+	storeInfo   storeinfo;
 
 	DBLINK_INIT;
 
@@ -559,15 +576,36 @@ dblink_fetch(PG_FUNCTION_ARGS)
 	appendStringInfo(&buf, "FETCH %d FROM %s", howmany, curname);
 
 	/*
+	 * Result is stored into storeinfo.tuplestore instead of
+	 * res->result returned by PQexec below
+	 */
+	initStoreInfo(&storeinfo, fcinfo);
+	PQregisterRowProcessor(conn, storeHandler, &storeinfo);
+
+	/*
 	 * Try to execute the query.  Note that since libpq uses malloc, the
 	 * PGresult will be long-lived even though we are still in a short-lived
 	 * memory context.
 	 */
 	res = PQexec(conn, buf.data);
+	finishStoreInfo(&storeinfo);
+
 	if (!res ||
 		(PQresultStatus(res) != PGRES_COMMAND_OK &&
 		 PQresultStatus(res) != PGRES_TUPLES_OK))
 	{
+		/* finishStoreInfo saves the fields referred to below. */
+		if (storeinfo.nummismatch)
+		{
+			/* This is only for backward compatibility */
+			ereport(ERROR,
+	(errcode(ERRCODE_DATATYPE_MISMATCH),
+	 errmsg("remote query result rowtype does not match "
+			"the specified FROM clause rowtype")));
+		}
+		else if (storeinfo.edata)
+			ReThrowError(storeinfo.edata);
+
 		dblink_res_error(conname, res, "could not fetch from cursor", fail);
 		return (Datum) 0;
 	}
@@ -579,8 +617,8 @@ dblink_fetch(PG_FUNCTION_ARGS)
 (errcode(ERRCODE_INVALID_CURSOR_NAME),
  errmsg("cursor \"%s\" does not exist", curname)));
 	}
+	PQclear(res);
 
-	materializeResult(fcinfo, res);
 	return (Datum) 0;
 }
 
@@ -640,6 +678,7 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async)
 	remoteConn *rconn = NULL;
 	bool		fail = true;	/* default to backward compatible */
 	bool		freeconn = false;
+	storeInfo   storeinfo;
 
 	/* check to see if caller supports us returning a tuplestore */
 	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
@@ -715,164 +754,217 @@ dblink_record_internal(FunctionCallInfo fcinfo, bool is_async)
 	rsinfo->setResult = NULL;
 	rsinfo->setDesc = NULL;
 
+
+	/*
+	 * Result is stored into storeinfo.tuplestore instead of
+	 * res->result returned by PQexec/PQgetResult below
+	 */
+	initStoreInfo(&storeinfo, fcinfo);
+	PQregisterRowProcessor(conn, storeHandler, &storeinfo);
+
 	/* synchronous query, or async result retrieval */
 	if (!is_async)
 		res = PQexec(conn, sql);
 	else
-	{
 		res = PQgetResult(conn);
-		/* NULL means we're all done with the async results */
-		if (!res)
-			return (Datum) 0;
-	}
 
-	/* if needed, close the connection to the database and cleanup */
-	if (freeconn)
-		PQfinish(conn);
+	finishStoreInfo(&storeinfo);
 
-	if (!res ||
-		(PQresultStatus(res) != PGRES_COMMAND_OK 
-		(PQresultStatus(res) != PGRES_COMMAND_OK &&
+	/* NULL res from async get means we're all done with the results */
+	if (res || !is_async)
 	{
-		dblink_res_error(conname, res, "could not execute query", fail);
-		return (Datum) 0;
+		if (freeconn)
+			PQfinish(conn);
+
+		if (!res ||
+			(PQresultStatus(res) != PGRES_COMMAND_OK &&
+			 PQresultStatus(res) != PGRES_TUPLES_OK))
+		{
+			/* finishStoreInfo saves the fields referred 

Re: [HACKERS] [GENERAL] Why extract( ... from timestamp ) is not immutable?

2012-01-30 Thread Josh Berkus
On 1/30/12 5:41 PM, Tom Lane wrote:
 hubert depesz lubaczewski dep...@depesz.com writes:
 On Mon, Jan 30, 2012 at 10:35:21AM -0800, Josh Berkus wrote:
 We can't have functions which are immutable or not depending on their
 inputs.  That way lies madness.
 
 but this is exactly what's happening now.
 
 Well, the current marking is clearly incorrect.  What to do about that
 is a bit less clear --- should we downgrade the marking, or change the
 function's behavior so that it really is immutable?

AFAIK, the only case which is NOT immutable is extract(epoch FROM
timestamp without time zone), no?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables

2012-01-30 Thread Ashutosh Bapat
I don't believe that the problem has anything to do with the names of

 other tables that might happen to exist in the database.  It's a matter
 of what RTE names/aliases are exposed for variable references in
 different parts of the query.


Names of other tables come into the picture when we schema-qualify the fake
aliases in the generated query. See the examples in the first post.


regards, tom lane




-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Enterprise Postgres Company


Re: [HACKERS] [GENERAL] pg_dump -s dumps data?!

2012-01-30 Thread Tom Lane
[ Note: please follow-up to pgsql-hackers not pgsql-general; I think
  this discussion needs to move there ]

hubert depesz lubaczewski dep...@depesz.com writes:
 On Mon, Jan 30, 2012 at 11:30:51AM -0500, Tom Lane wrote:
 That is way too vague for my taste, as you have not shown the pg_dump
 options you're using, for example.

 i tried to explain that the options don't matter, but here we go. full
 example:
 [ example showing pg_dump's odd behavior for extension config tables ]

[ traces through that with gdb... ]

As I suspected, the behavioral change from 9.1 to HEAD is not
intentional.  It is an artifact of commit
7b070e896ca835318c90b02c830a5c4844413b64, which is almost, but not
quite, entirely broken.  I won't enumerate its shortcomings here,
because they're not really relevant, but it does seem appropriate to
discuss exactly what we think *should* happen for tables created inside
extensions.

The original design intention for non-config tables was, per the manual:

Ordinarily, if a table is part of an extension, neither the
table's definition nor its content will be dumped by pg_dump.

the assumption being that both the definition and the content would be
re-loaded by executing the extension's SQL script.  The purpose of
pg_extension_config_dump() is to allow part or all of the data in the
table to be treated as user data and thus dumped; this is assumed to
be data not supplied by the extension script but by subsequent user
insertions.
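
For readers who haven't used the mechanism: the marking is done from inside
the extension's script (the table and column names here are invented):

CREATE TABLE my_ext_config (
    key            text PRIMARY KEY,
    value          text,
    standard_entry boolean NOT NULL DEFAULT false
);
-- dump every row of the table as user data:
SELECT pg_catalog.pg_extension_config_dump('my_ext_config', '');
-- or dump only the rows added after the script ran:
-- SELECT pg_catalog.pg_extension_config_dump('my_ext_config',
--        'WHERE NOT standard_entry');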

I don't recall that we thought very hard about what should happen when
pg_dump switches are used to produce a selective dump, but ISTM
reasonable that if it's user data then it should be dumped only if
data in a regular user table would be.  So I agree it's pretty broken
that pg_dump -t foo will dump data belonging to a config table not
selected by the -t switch.  I think this should be changed in both HEAD
and 9.1 (note that HEAD will presumably return to the 9.1 behavior once
that --exclude-table-data patch gets fixed).

What's not apparent to me is whether there's an argument for doing more
than that.  It strikes me that the current design is not very friendly
towards the idea of an extension that creates a table that's meant
solely to hold user data --- you'd have to mark it as "config", which
seems a bit unfortunate terminology for that case.  Is it important to
do something about that, and if so what?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [GENERAL] Why extract( ... from timestamp ) is not immutable?

2012-01-30 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 On 1/30/12 5:41 PM, Tom Lane wrote:
 Well, the current marking is clearly incorrect.  What to do about that
 is a bit less clear --- should we downgrade the marking, or change the
 function's behavior so that it really is immutable?

 AFAIK, the only case which is NOT immutable is extract(epoch FROM
 timestamp without time zone), no?

That's the only one we currently know is not immutable.  But before we
make any decisions, I think it'd be a good idea to scrutinize all the
other cases too, because obviously this area has gotten some careless
hacking (*) done on it in the past.

regards, tom lane

(*) I have a nasty feeling that the carelessness was mine.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch for parallel pg_dump

2012-01-30 Thread Joachim Wieland
On Mon, Jan 30, 2012 at 12:20 PM, Robert Haas robertmh...@gmail.com wrote:
 But the immediate problem is that pg_dump.c is heavily reliant on
 global variables, which isn't going to fly if we want this code to use
 threads on Windows (or anywhere else).  It's also bad style.

Technically, since most of pg_dump.c dumps the catalog and since this
isn't done in parallel but only in the master process, most functions
need not be changed for the parallel restore. Only those that are
called from the worker threads need to be changed; this has been done
in e.g. dumpBlobs(), the change that you quoted upthread.

 But it
 seems possible that we might someday want to dump from one database
 and restore into another database at the same time, so maybe we ought
 to play it safe and use different variables for those things.

Actually I've tried that but in the end concluded that it's best to
have at most one database connection in an ArchiveHandle if you don't
want to do a lot more refactoring. Besides the normal connection
parameters like host, port, ... there's also std_strings, encoding,
savedPassword, currUser/currSchema, lo_buf, remoteVersion ... for which
it wouldn't be obvious where they belonged.

Speaking about refactoring, I'm happy to also throw in the idea to
make the dump and restore more symmetrical than they are now. I kinda
disliked RestoreOptions* being a member of the ArchiveHandle without
having something similar for the dump. Ideally I'd say there should be
a DumpOptions and a RestoreOptions field (with a struct Connection
being part of them containing all the different connection
parameters). They could be a union if you wanted to allow only one
connection, or not if you want more than one.
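
Concretely, the layout being described might look roughly like this
(illustrative only -- these are not pg_dump's existing structs, and the
field names are invented):

typedef struct Connection
{
    char       *host;
    char       *port;
    char       *dbname;
    char       *username;
    int         encoding;
    int         remoteVersion;
} Connection;

typedef struct DumpOptions
{
    Connection  conn;           /* the database being dumped */
    int         schemaOnly;
    int         dataOnly;
} DumpOptions;

typedef struct RestoreOptions  /* pg_dump already has a struct of this name;
                                * it is reused here purely for the sketch */
{
    Connection  conn;           /* the database being restored into */
    int         createDB;
    int         singleTxn;
} RestoreOptions;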

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Speed dblink using alternate libpq tuple storage

2012-01-30 Thread Marko Kreen
On Tue, Jan 31, 2012 at 4:59 AM, Kyotaro HORIGUCHI
horiguchi.kyot...@oss.ntt.co.jp wrote:
 This is fixed version of dblink.c for row processor.

 i'll re-send the properly fixed patch for dblink.c later.

 - malloc error in initStoreInfo throws ERRCODE_OUT_OF_MEMORY. (new error)

 - storeHandler() now returns FALSE on malloc failure. Garbage
  cleanup is done in dblink_fetch() or dblink_record_internal().
  The behavior that this dblink displays this error as 'unkown
  error/could not execute query' on the user session is as it did
  before.

Another thing: if realloc() fails, the old pointer stays valid.
So the return value needs to be assigned to a temporary variable
first, before overwriting the old pointer.
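
That is the usual pattern, sketched standalone:

#include <stdlib.h>

/* Keep the old buffer until realloc() is known to have succeeded,
 * so nothing is lost (or leaked) when the allocation fails. */
int
grow_buffer(char **buf, size_t *buflen, size_t needed)
{
    if (needed > *buflen)
    {
        char   *tmp = realloc(*buf, needed);

        if (tmp == NULL)
            return 0;           /* *buf is still valid and still owned */
        *buf = tmp;
        *buflen = needed;
    }
    return 1;
}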

And it seems malloc() is preferable to palloc(), to avoid
exceptions inside the row processor.  Although the latter
could be made to work, it might be unnecessary
complexity.  (Store the current PGresult into remoteConn?)

-- 
marko

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Group commit, revised

2012-01-30 Thread Heikki Linnakangas

On 31.01.2012 01:35, Simon Riggs wrote:

The plan here is to allow WAL flush and clog updates to occur
concurrently. Which allows the clog contention and update time to be
completely hidden behind the wait for the WAL flush. That is only
possible if we have the WALwriter involved since we need two processes
to be actively involved.

...

The theory behind this is clear, but needs some explanation.

There are 5 actions that need to occur at commit
1) insert WAL record
2) optionally flush WAL record
3) mark the clog AND set LSN from (1) if we skipped (2)
4) optionally wait for sync rep
5) remove the proc from the procarray


 ...


Notice that step (2) and step (3) are actually independent of each other.

So an improved design for commit is to
2) request flush up to LSN, but don't wait
3) mark the clog and set LSN
4) wait for LSN once, either for walwriter or walsender to release us


That seems like a pretty marginal gain. If you're bound by the speed of 
fsyncs, this will reduce the latency by the time it takes to mark the 
clog, which is tiny in comparison to all the other stuff that needs to 
happen, like flushing the WAL. And that's ignoring any additional 
overhead caused by the signaling between processes. If you're bound by 
CPU capacity, this doesn't help at all because it just moves the work 
around.


Anyway, this is quite different from the original goal and patch for 
group commit, so can we please leave this for 9.3 and move on with the 
review of the pending 9.2 patches?


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers