Re: [HACKERS] automating CF submissions (was xlog location arithmetic)

2012-01-15 Thread Magnus Hagander
On Sun, Jan 15, 2012 at 05:44, Greg Smith g...@2ndquadrant.com wrote:
 On 01/14/2012 10:49 PM, Gurjeet Singh wrote:

 So let's make it easy for the patch submitter to start the process. I
 propose that we have a page in the CF application where people can
 upload/attach the patch, and the app posts the patch to -hackers and uses
 the post URL to create the CF entry.


 That would be nice, but there's at least two serious problems with it, which
 I would guess are both unsolvable without adding an unsupportable amount of
 work to the current PostgreSQL web team.  First, it is technically risky for
 a web application hosted on postgresql.org to be e-mailing this list.  There
 are some things in the infrastructure that do that already--I believe the
 pgsql-committers list being driven from commits is the busiest such bot.  But
 all of the ones that currently exist are either moderated, have a limited
 number of approved submitters, or both.

It's not really a problem from that perspective, as long as it
requires the user to be logged in. The mail would be sent from the
user's account, with that address as the sender, and thus be subject to
the same moderation rules as the rest of the list posts.


 If it were possible for a bot to create a postgresql.org community account,
 then trigger an e-mail to pgsql-hackers just by filling out a web form, I'd
 give it maybe six months before it has to be turned off for a bit--because
 there will be thousands of messages queued up once the first bored spammer figures

Said bot can already use the bug report form *without* having to sign
up for an account.

Or said bot could submit news or events, which trigger an email to at
least some lists, yet that hasn't been done.

It's supposedly not easy for a bot to sign up for a community account,
since it requires you to have access to the email address it's
registered on. If that doesn't work, it's a bug and needs to be fixed
regardless.


 that out.  Securing web to e-mail gateways is a giant headache, and everyone
 working on the PostgreSQL infrastructure who might work on that is already
 overloaded with community volunteer work.  There's an element of zero-sum

We've already solved that problem for other situations, and given how
the infrastructure is built, that's fairly easy to replicate to
another node.

I think the bigger problem is who'll write it. AFAIK, the CF app
*itself* is even more constrained on people and time (Robert Haas
only) than the infrastructure, and that's a bigger problem. There are
already a bunch of things that are a lot simpler than this that have
been pending on that one for well over half a year.

 Second, e-mail provides some level of validation that patches being
 submitted are coming from the person they claim.  We currently reject
 patches that are only shared with the community on the web, via places like
 github.  The process around this mailing list tries to make it clear that
 sending patches here is a code submission under the PostgreSQL license.  And
 e-mail nowadays keeps increasing the number of checks that confirm it's
 coming from the person it claims sent it.  I can go check into the DKIM
 credentials your Gmail message to the list contained if I'd like, to help
 confirm it really came from your account.  E-mail headers are certainly not

I think DKIM was a bad example, because AFAIK our lists mangle the DKIM
signatures and thus actually show them as *invalid* for at least the
majority of messages...

 One unicorn I would like to have here would give the CF app a database of
 recent e-mails to pgsql-hackers.  I log in to the CF app, click on "Add
 recent submission", and anything matching my e-mail address appears with a
 checkbox next to it.  Click on the patch submissions, and then something
 like you described would happen.  That would save me so much of the annoying
 work of looking up message IDs.

That would be neat.

And FWIW, I'd find it a lot more useful for the CF app to have the
ability to post *reviews* in it, that would end up being properly
threaded. The way it is now, half the reviewers create a *new* thread
to post their reviews on, making it a PITA to keep track of those
patches on the list at all, which somewhat defeats the whole idea
of mail being the primary way to track them. Not saying it's critical,
but I'd put it a lot higher on the list than being able to post the
initial patch.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



[HACKERS] Vacuum rate limit in KBps

2012-01-15 Thread Greg Smith
So far the reaction I've gotten from my recent submission to make 
autovacuum log its read/write rates in MB/s has been rather positive.  I've 
been surprised at the unprecedented (to me at least) amount of 
backporting onto big production systems it's gotten.  There is a whole 
lot of pent-up frustration among larger installs over not having good 
visibility into how changing the cost-based vacuum parameters turns into 
real-world units.


That got me thinking:  if MB/s is what everyone wants to monitor, can we 
provide a UI to set these parameters that way too?  The attached patch 
is a bit rough still, but it does that.  The key was recognizing that 
the cost delay plus cost limit can be converted into an upper limit on 
cost units per second, presuming the writes themselves are free.  If you 
then also assume the worst case--that everything will end up dirty--by 
throwing in the block size, too,  you compute a maximum rate in MB/s.  
That represents the fastest you can possibly write.


If you then turn that equation around, making the maximum write rate the 
input, for any given cost delay and dirty page cost you can solve for 
the cost limit--the parameter in fictitious units everyone hates.  It 
works like this, with the computation internals logged every time they 
run for now:


#vacuum_cost_rate_limit = 4000  # maximum write rate in kilobytes/second
LOG:  cost limit=200 based on rate limit=4000 KB/s delay=20 dirty cost=20
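
For reference, the arithmetic behind that derivation is nothing more than
this (a back-of-the-envelope sketch, not the patch's code; it assumes the
default 8 kB block size and that every page processed is dirtied, i.e. is
charged vacuum_cost_page_dirty):

#include <stdio.h>

int
main(void)
{
	double	rate_limit_kbps = 4000;	/* vacuum_cost_rate_limit */
	double	delay_ms = 20;			/* vacuum_cost_delay */
	double	dirty_cost = 20;		/* vacuum_cost_page_dirty */
	double	block_kb = 8;			/* BLCKSZ / 1024 */

	/* pages/s at the requested rate, times cost per page, spread over
	 * the 1000/delay_ms sleep cycles that happen each second */
	double	cost_limit = rate_limit_kbps / block_kb * dirty_cost * delay_ms / 1000.0;

	printf("cost limit = %.0f\n", cost_limit);	/* prints 200 for these inputs */
	return 0;
}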

That's the same cost limit that was there before, except now it's 
derived from that maximum write rate figure.  vacuum_cost_limit is gone 
as a GUC, replaced with this new vacuum_cost_rate_limit.  Internally, 
the cost limit hasn't gone anywhere though.  All of the entry 
points into vacuum and autovacuum derive an internal-only 
VacuumCostLimit as part of any setup or rebalance operation.  But 
there's no change to the underlying cost management code; the cost limit is 
budgeted and accounted for in exactly the same way as it always was.


Why is this set in kilobytes/second rather than using something based on 
a memory unit?  That decision was made after noting these values can 
also be set in relation options.  Making relation options aware of 
memory unit math seemed ambitious relative to its usefulness, and it's 
not like KB/s is hard to work with in this context.


OK, I lied; technically this is set in kibibytes per second right now.  
I ran out of energy before confirming that's consistent with all the 
similar GUC settings; I'll put on my pedantic hat later to check that.


One nice thing that falls out of this is that the *vacuum_cost_delay 
settings essentially turn into a boolean.  If the delay is 0, cost 
limits are off; set it to any other value, and the rate you get is 
driven almost entirely by vacuum_cost_rate_limit (the disclaimer is mainly 
because issues like sleep time accuracy are possible).  You can see 
that at work in these examples:


LOG:  cost limit=200 based on rate limit=4000 KB/s delay=20 dirty cost=20
LOG:  cost limit=100 based on rate limit=4000 KB/s delay=10 dirty cost=20

LOG:  cost limit=200 based on rate limit=4000 KB/s delay=20 dirty cost=20
LOG:  cost limit=100 based on rate limit=2000 KB/s delay=20 dirty cost=20

Halve the delay to 10, and the cost limit drops in half too, to keep the 
same I/O rate.  Halve the rate limit instead, and the cost limit halves 
with it.  Most sites will never need to change the delay figure from 
20ms; they can just focus on tuning the more human-readable rate limit 
figure instead.  The main reason I thought of for still keeping the delay 
around as an integer is sites trying to minimize power use; they might 
increase it from the normally used 20ms.  I'm not as worried about 
postgresql.conf settings bloat when it supports a valid edge use case, so long 
as most sites find the setting unnecessary to tune.  And the autovacuum 
side of the cost delay should fall into that category with this change.


Here's a full autovacuum log example.  This shows how close to the KBps 
rate the server actually got, along with the autovacuum cost balancing 
working the same old way (this is after running the boring 
autovac-big.sql test case attached here too):


2012-01-15 02:10:51.905 EST: LOG:  cost limit=200 based on rate 
limit=4000 KB/s delay=20 dirty cost=20
2012-01-15 02:10:51.906 EST: DEBUG:  autovac_balance_cost(pid=13054 
db=16384, rel=16444, cost_rate_limit=4000, cost_limit=200, 
cost_limit_base=200, cost_delay=20)


2012-01-15 02:11:05.127 EST: DEBUG:  t: removed 499 row versions 
in 22124 pages
2012-01-15 02:11:05.127 EST: DEBUG:  t: found 499 removable, 
501 nonremovable row versions in 44248 out of 44248 pages
2012-01-15 02:11:05.127 EST: DETAIL:  0 dead row versions cannot be 
removed yet.

There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.27s/0.97u sec elapsed 131.73 sec.
2012-01-15 02:11:05.127 EST: LOG:  automatic vacuum of table 
gsmith.public.t: index scans: 0

pages: 0 

Re: [HACKERS] Our poll() based WaitLatch implementation is broken

2012-01-15 Thread Peter Geoghegan
On 15 January 2012 07:26, Peter Geoghegan pe...@2ndquadrant.com wrote:
 Build Postgres master, on Linux or another platform that will use the
 poll() implementation rather than the older select(). Send the
 Postmaster SIGKILL. Observe that the WAL Writer lives on, representing
 a denial of service as it stays attached to shared memory, busy
 waiting (evident from the fact that it quickly leaks memory).

Correction: The CPU usage % approaches 100.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] automating CF submissions (was xlog location arithmetic)

2012-01-15 Thread Greg Smith

On 01/15/2012 03:17 AM, Magnus Hagander wrote:

And FWIW, I'd find it a lot more useful for the CF app to have the
ability to post *reviews* in it, that would end up being properly
threaded.


Next you'll be saying we should have some sort of web application to 
help with the whole review process, show the work on an integrated sort 
of Review Board or something.  What crazy talk.


My contribution toward patch review ease for this week is that peg is 
quite a bit smarter about referring to the correct part of the origin 
git repo now, when you've been working on a branch for a while and then 
create a new one:  https://github.com/gregs1104/peg


Last week's was that I confirmed that on a Mac using Homebrew for 
package management, after "brew install postgresql" to get the 
dependencies in, you can then use peg to set up PostgreSQL in your home 
directory for patch testing or development.  It works fine out of the box; 
you just won't have things like all the PLs installed.


Yes, I am aware I'm going at this bottom-up.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] automating CF submissions (was xlog location arithmetic)

2012-01-15 Thread Magnus Hagander
On Sun, Jan 15, 2012 at 09:37, Greg Smith g...@2ndquadrant.com wrote:
 On 01/15/2012 03:17 AM, Magnus Hagander wrote:

 And FWIW, I'd find it a lot more useful for the CF app to have the
 ability to post *reviews* in it, that would end up being properly
 threaded.


 Next you'll be saying we should have some sort of web application to help
 with the whole review process, show the work on an integrated sort of Review
 Board or something.  What crazy talk.

Well, it's early in the morning for a Sunday; I blame that.


 My contribution toward patch review ease for this week is that peg is quite
 a bit smarter about referring to the correct part of the origin git repo
 now, when you've been working on a branch for a while then create a new one:
  https://github.com/gregs1104/peg

Being able to refer to a git branch is one of those things that has
been on the todo list for the CF app since PGCon last year...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



[HACKERS] pgstat documentation tables

2012-01-15 Thread Magnus Hagander
Right now we have a single table on
http://www.postgresql.org/docs/devel/static/monitoring-stats.html#MONITORING-STATS-VIEWS
that lists all our statistics views. It makes for difficult-to-parse
descriptions like "One row only, showing cluster-wide statistics from
the background writer: number of scheduled checkpoints, requested
checkpoints, buffers written by checkpoints and cleaning scans, and
the number of times the background writer stopped a cleaning scan
because it had written too many buffers. Also includes statistics
about the shared buffer pool, including buffers written by backends
(that is, not by the background writer), how many times those backends
had to execute their own fsync calls (normally the background writer
handles those even when the backend does its own write), total buffers
allocated, and time of last statistics reset."

I'd like to turn that into one table for each view, with two columns,
one being the name and the other being the description. That'll also
make it possible to expand on the descriptions without making it
completely unreadable, should we want to.

Comments? Objections?

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] IDLE in transaction introspection

2012-01-15 Thread Greg Smith

On 01/12/2012 11:57 AM, Scott Mead wrote:
Pretty delayed, but please find the attached patch that addresses all 
the issues discussed.


The docs on this v4 look like they suffered a patch ordering problem here.  
In v3, you added a whole table describing pg_stat_activity in more detail 
than before.  v4 actually tries to remove those new docs, a change which 
won't even apply since they don't exist upstream.


My guess is you committed v3 somewhere, then applied the code changes for 
v4 but not the documentation ones.  It's easy to do that and end up 
with a patch that removes a bunch of docs the previous patch added.  I 
have to be careful to always do something like "git diff origin/master" 
to avoid this class of problem; until I got into that habit I did this 
sort of thing regularly.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] pgstat documentation tables

2012-01-15 Thread Greg Smith

On 01/15/2012 03:51 AM, Magnus Hagander wrote:

I'd like to turn that into one table for each view, with two columns,
one being the name and the other being the description. That'll also
make it possible to expand on the descriptions without making it
completely unreadable, should we want to.


Scott Mead's pg_stat_activity refactoring patch was already forced 
toward doing something similar for its documentation to be readable 
(maybe exactly the same as you were thinking; I can't tell because my 
SGML-fu is weak).  I feel doing this for every view in there is 
inevitable; it's just a matter of when to bite on the problem.  This whole 
section is barely usable in its current form.  I strongly hope the 
pg_stat_activity patch goes in, which will be close to a compelling 
event to motivate fixing the rest of them in a similar way.  I would 
advocate leading with that one until it seems right, then adopting the 
same style of reorg for the rest until they're all converted.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




[HACKERS] Review of: pg_stat_statements with query tree normalization

2012-01-15 Thread Daniel Farina
I've *finally* gotten around to reviewing this patch.

My first step was to de-bitrot it very slightly.  More on that in a moment.

After that, I tried using it.  Installation worked nicely -- I did
CREATE EXTENSION and then tried reading from pg_stat_statements.  I
was then given an error message to add pg_stat_statements into
shared_preload_libraries, and so I did.  Given its use of shared
memory, there is not much other choice, but it's neat that I got it
running in just a few minutes without reading any documentation.  I
ran a non-trivial but not-enormous application against the
pg_stat_statements-enabled database, and the results were immediate
and helpful, with queries correctly normalized and statistics apparently
accrued.

'make check' also worked fine.  I ran with assertions and debugging enabled.

This query I suspect will quickly become a new crowd-pleaser:

  SELECT * FROM pg_stat_statements ORDER BY total_time DESC LIMIT 20;

The output looks like this (fixed width looks best):

userid  | 16385
dbid| 16386
query   | SELECT * FROM lock_head($1, $2);
calls   | 12843
total_time  | 19.795586
rows| 8825
shared_blks_hit | 194583
shared_blks_read| 19
shared_blks_written | 0
local_blks_hit  | 0
local_blks_read | 0
local_blks_written  | 0
temp_blks_read  | 0
temp_blks_written   | 0

A couple of nit-picks: a query execution is probably not often
referred to as a "call", and total_time doesn't signify its units in
the name, but I eyeballed it to be seconds.

I cannot see superuser query text when I'm in a non-superuser role;
however, I can see the execution statistics:

userid  | 10
dbid| 16386
query   | <insufficient privilege>
calls   | 1448
total_time  | 0.696188
rows| 205616
shared_blks_hit | 13018
shared_blks_read| 14
shared_blks_written | 0
local_blks_hit  | 0
local_blks_read | 0
local_blks_written  | 0
temp_blks_read  | 0
temp_blks_written   | 0

Prepared statements are less informative, unless one knows their
naming convention:

query   | execute foo;

I don't know what JDBC users or ActiveRecord 3 users are going to
think about that.  However, just about everyone else will be happy.

The patch very nicely handles its intended task: folding together
queries with the same literal.  This is the key feature that makes the
output really useful, and is quite unique: the next best thing I know
of is pgfouine, which uses some heuristics and needs to be run over
specially formatted logs, usually offline.  This is much better, much
faster, more interactive, and much more sound when it comes to
normalizing queries.

Onto the mechanism: the patch is both a contrib module and changes to
core Postgres.  The changes to Postgres are mechanical in nature, but
voluminous.  The key change is to not only remember the position of
Const nodes in the query tree, but also their length, and this change
is really extensive although repetitive.  It was this swathe of change
that bitrotted the source, when new references to some flex constructs
were added.  Every such reference now needs to explicitly refer to
'.begins', which is the beginning position of the const -- this used
to be the only information tracked.  Because .length was never
required by Postgres in the past, core now fastidiously bookkeeps that
information without ever referring to it internally: only
pg_stat_statements does.

There is a performance test program (written in Python) conveniently
included in the contrib to allay fears that such const-length
bookkeeping may be too expensive.  I have not run it yet myself.

There is also a regression test -- also a python program -- presumably
because it would be more difficult to test this the normal way via
pg_regress, at least to get a full integration test.

I'm a little surprised that pg_stat_statements.c uses direct recursion
instead of the mutually recursive walker interface, which makes it
quite a bit different than most passes in the optimizer I've seen.
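
To make the contrast concrete, walker-style traversal of a query tree looks
roughly like this (illustrative only, not code from the patch, which recurses
directly; the patch additionally records each constant's length next to its
starting position):

#include "postgres.h"
#include "nodes/nodeFuncs.h"
#include "nodes/primnodes.h"

typedef struct ConstPositions
{
	int		nconsts;
	int		positions[64];
} ConstPositions;

static bool
const_position_walker(Node *node, ConstPositions *ctx)
{
	if (node == NULL)
		return false;
	if (IsA(node, Const) && ctx->nconsts < 64)
		ctx->positions[ctx->nconsts++] = ((Const *) node)->location;
	/* recurse into all child nodes via the generic walker */
	return expression_tree_walker(node, const_position_walker, (void *) ctx);
}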

I'm still looking at the mechanism in more detail, but I am having a
fairly easy time following it -- there's just a lot of it, to cover the
litany of Nodes.  I'll submit my editorializations of whitespace and
anti-bitrot to Peter Geoghegan via git (actually, that information is
below, for any curious onlookers).  If anyone wants to play with this
anti-bitrotted copy against a very recent master, please see:

$ git fetch https://github.com/fdr/postgres.git
pg_stat_statements_norm:pg_stat_statements_norm
$ git checkout pg_stat_statements_norm

All in all, it looks and works well, inside and out.

-- 
fdr



Re: [HACKERS] Review of: pg_stat_statements with query tree normalization

2012-01-15 Thread Peter Geoghegan
On 15 January 2012 11:41, Daniel Farina dan...@heroku.com wrote:
 I've *finally* gotten around to reviewing this patch.

 My first step was to de-bitrot it very slightly.  More on that in a moment.

Thanks.

 Prepared statements are less informative, unless one knows their
 naming convention:

 query               | execute foo;

 I don't know what JDBC users or ActiveRecord 3 users are going to
 think about that.  However, just about everyone else will be happy.

I should point out, though I think you might be aware of this already,
that the patch actually behaves in the same way as the existing
implementation here. It will only normalise prepared queries that are
prepared with PQprepare() or its underlying wire-protocol facility. If
you run the example in the pg_stat_statements docs, where pgbench is
passed -M prepared, that will work just as well as before. Perhaps
it's something that the patch should be able to do. That said, it
seems pretty likely that client libraries won't be dynamically
generating SQL Prepare/Execute statements under the hood.
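
To spell out the distinction for anyone following along: a wire-protocol
prepared statement goes through PQprepare()/PQexecPrepared() and is tracked
under its underlying query text, while a SQL-level EXECUTE is tracked as the
literal EXECUTE statement.  A hedged libpq sketch (table and statement names
are made up; error checking omitted):

#include <libpq-fe.h>

static void
run_both_ways(PGconn *conn)
{
	const char *params[1] = {"42"};

	/* protocol-level prepare: normalized like any other query */
	PQclear(PQprepare(conn, "get_item",
					  "SELECT * FROM items WHERE id = $1", 1, NULL));
	PQclear(PQexecPrepared(conn, "get_item", 1, params, NULL, NULL, 0));

	/* SQL-level prepare: recorded as the EXECUTE statement text */
	PQclear(PQexec(conn, "PREPARE get_item2(int) AS SELECT * FROM items WHERE id = $1"));
	PQclear(PQexec(conn, "EXECUTE get_item2(42)"));
}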

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services



Re: [HACKERS] exit() calls in libraries

2012-01-15 Thread Peter Eisentraut
On Mon, 2011-12-05 at 15:04 -0300, Alvaro Herrera wrote:
 Having had to battle some exit() calls in the PHP interpreter back
 when
 I was working on PL/php, I agree that they shouldn't be there --
 abort()
 seems more appropriate if the system is truly busted.  As for the
 fe-print.c code, I'm not really sure why we don't just remove it.
 
This is the patch that would get rid of all exit() calls in the
libraries.
diff --git i/src/interfaces/libpq/fe-print.c w/src/interfaces/libpq/fe-print.c
index 5fa3be0..a6c2959 100644
--- i/src/interfaces/libpq/fe-print.c
+++ w/src/interfaces/libpq/fe-print.c
@@ -112,17 +112,17 @@ PQprint(FILE *fout, const PGresult *res, const PQprintOpt *po)
 		if (!(fieldNames = (const char **) calloc(nFields, sizeof(char *))))
 		{
 			fprintf(stderr, libpq_gettext("out of memory\n"));
-			exit(1);
+			abort();
 		}
 		if (!(fieldNotNum = (unsigned char *) calloc(nFields, 1)))
 		{
 			fprintf(stderr, libpq_gettext("out of memory\n"));
-			exit(1);
+			abort();
 		}
 		if (!(fieldMax = (int *) calloc(nFields, sizeof(int))))
 		{
 			fprintf(stderr, libpq_gettext("out of memory\n"));
-			exit(1);
+			abort();
 		}
 		for (numFieldName = 0;
 			 po->fieldName && po->fieldName[numFieldName];
@@ -203,7 +203,7 @@ PQprint(FILE *fout, const PGresult *res, const PQprintOpt *po)
 			if (!(fields = (char **) calloc(nFields * (nTups + 1), sizeof(char *))))
 			{
 				fprintf(stderr, libpq_gettext("out of memory\n"));
-				exit(1);
+				abort();
 			}
 		}
 		else if (po->header && !po->html3)
@@ -391,7 +391,7 @@ do_field(const PQprintOpt *po, const PGresult *res,
 			if (!(fields[i * nFields + j] = (char *) malloc(plen + 1)))
 			{
 				fprintf(stderr, libpq_gettext("out of memory\n"));
-				exit(1);
+				abort();
 			}
 			strcpy(fields[i * nFields + j], pval);
 		}
@@ -462,7 +462,7 @@ do_header(FILE *fout, const PQprintOpt *po, const int nFields, int *fieldMax,
 		if (!border)
 		{
 			fprintf(stderr, libpq_gettext("out of memory\n"));
-			exit(1);
+			abort();
 		}
 		p = border;
 		if (po->standard)
@@ -605,7 +605,7 @@ PQdisplayTuples(const PGresult *res,
 		if (!fLength)
 		{
 			fprintf(stderr, libpq_gettext("out of memory\n"));
-			exit(1);
+			abort();
 		}
 
 		for (j = 0; j < nFields; j++)
@@ -704,7 +704,7 @@ PQprintTuples(const PGresult *res,
 			if (!tborder)
 			{
 				fprintf(stderr, libpq_gettext("out of memory\n"));
-				exit(1);
+				abort();
 			}
 			for (i = 0; i <= width; i++)
 				tborder[i] = '-';
diff --git i/src/interfaces/libpq/libpq-int.h w/src/interfaces/libpq/libpq-int.h
index 64dfcb2..a0d93e0 100644
--- i/src/interfaces/libpq/libpq-int.h
+++ w/src/interfaces/libpq/libpq-int.h
@@ -483,7 +483,7 @@ extern pgthreadlock_t pg_g_threadlock;
 #define PGTHREAD_ERROR(msg) \
 	do { \
 		fprintf(stderr, "%s\n", msg); \
-		exit(1); \
+		abort(); \
 	} while (0)
 
 
diff --git i/src/port/path.c w/src/port/path.c
index 9cb0b01..de345c2 100644
--- i/src/port/path.c
+++ w/src/port/path.c
@@ -445,7 +445,7 @@ get_progname(const char *argv0)
 	if (progname == NULL)
 	{
 		fprintf(stderr, "%s: out of memory\n", nodir_name);
-		exit(1);				/* This could exit the postmaster */
+		abort();				/* This could exit the postmaster */
 	}
 
 #if defined(__CYGWIN__) || defined(WIN32)



Re: [HACKERS] age(xid) on hot standby

2012-01-15 Thread Peter Eisentraut
On Wed, 2011-12-28 at 14:35 -0500, Tom Lane wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
  Excerpts from Peter Eisentraut's message of Wed Dec 28 15:04:09 -0300 2011:
  On a hot standby, this fails with:
  ERROR:  cannot assign TransactionIds during recovery
 
  I think we could just have the xid_age call
  GetCurrentTransactionIdIfAny, and if that returns InvalidXid, use
  ReadNewTransactionId instead.  That xid_age assigns a transaction seems
  more of an accident than really intended.
 
 The trouble with using ReadNewTransactionId is that it makes the results
 volatile, not stable as the function is declared to be.

Could we alleviate that problem with some caching within the function?




Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-15 Thread Andrew Dunstan



On 01/15/2012 01:37 AM, Tom Lane wrote:

Peter Eisentraut pete...@gmx.net writes:

I see that in some places our code already uses #ifdef
USE_ASSERT_CHECKING, presumably to hide similar issues.  But in most
cases using this would significantly butcher the code.  I found that
adding __attribute__((unused)) is cleaner.  Attached is a patch that
cleans up all the warnings I encountered.

Surely this will fail entirely on most non-gcc compilers?  Not to
mention that next month's gcc may complain "hey, you used this 'unused'
variable".  I think #ifdef USE_ASSERT_CHECKING is really the only way
if you care about quieting these warnings.  (Personally, I don't.)





It would possibly have some documentary value too. Just looking very 
quickly at Peter's patch, I don't really understand his assertion that 
this would significantly butcher the code. The worst effect would be 
that in a few cases we'd have to break up multiple declarations where 
one of the variables was in this class. That doesn't seem like a tragedy.
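
For anyone who wants to see the two idioms side by side, here's a schematic
(made-up names; plain assert() stands in for our Assert(), and the attribute
is the gcc-specific spelling from Peter's patch):

#include <assert.h>

static int
sum_checked(const int *vals, int n)
{
	int		total = 0;

	/* Idiom 2: keep the variable, mark it possibly unused so that a
	 * non-assert build (where assert() expands to nothing and the value
	 * is set but never read) doesn't warn.  gcc-specific. */
	int		nseen __attribute__((unused)) = 0;

	/* Idiom 1 (#ifdef USE_ASSERT_CHECKING) would instead wrap this
	 * declaration and the nseen++ below in #ifdef blocks. */
	for (int i = 0; i < n; i++)
	{
		total += vals[i];
		nseen++;
	}
	assert(nseen == n);
	return total;
}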


I like software that compiles in the normal use with few or no warnings. 
I should have thought that would appeal to most packagers, too.


cheers

andrew



Re: [HACKERS] lots of unused variable warnings in assert-free builds

2012-01-15 Thread Simon Riggs
On Sun, Jan 15, 2012 at 1:14 PM, Andrew Dunstan and...@dunslane.net wrote:


 On 01/15/2012 01:37 AM, Tom Lane wrote:

 Peter Eisentraut pete...@gmx.net writes:

 I see that in some places our code already uses #ifdef
 USE_ASSERT_CHECKING, presumably to hide similar issues.  But in most
 cases using this would significantly butcher the code.  I found that
 adding __attribute__((unused)) is cleaner.  Attached is a patch that
 cleans up all the warnings I encountered.

 Surely this will fail entirely on most non-gcc compilers?  Not to
 mention that next month's gcc may complain "hey, you used this 'unused'
 variable".  I think #ifdef USE_ASSERT_CHECKING is really the only way
 if you care about quieting these warnings.  (Personally, I don't.)





 It would possibly have some documentary value too. Just looking very quickly
 at Peter's patch, I don't really understand his assertion that this would
 significantly butcher the code. The worst effect would be that in a few
 cases we'd have to break up multiple declarations where one of the variables
 was in this class. That doesn't seem like a tragedy.

 I like software that compiles in the normal use with few or no warnings. I
 should have thought that would appeal to most packagers, too.


Sounds good, but let's not do it yet because we have a few patches to
commit first.

It would be good to minimise bit rot during the CF.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services



Re: [HACKERS] pgstat documentation tables

2012-01-15 Thread Magnus Hagander
On Sun, Jan 15, 2012 at 10:35, Greg Smith g...@2ndquadrant.com wrote:
 On 01/15/2012 03:51 AM, Magnus Hagander wrote:

 I'd like to turn that into one table for each view, with two columns,
 one being the name and the other being the description. That'll also
 make it possible to expand on the descriptions without making it
 completely unreadable, should we want to.


 Scott Mead's pg_stat_activity refactoring patch was already forced toward
 doing something similar for its documentation to be readable (maybe exactly
 the same as you were thinking; I can't tell because my SGML-fu is weak).  I
 feel doing this for every view in there is inevitable; it's just a matter of
 when to bite on the problem.  This whole section is barely usable in its
 current form.  I strongly hope the pg_stat_activity patch goes in, which will
 be close to a compelling event to motivate fixing the rest of them in a
 similar way.  I would advocate leading with that one until it seems right,
 then adopting the same style of reorg for the rest until they're all converted.

Yes, the pg_stat_activity patch will make it more important - but I
want to do it for *all* of them...

I'll be happy to do the mechanical work of turning them into tables,
but yes, I was planning to look at the pg_stat_activity patch first
all along :)

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Patch to allow users to kill their own queries

2012-01-15 Thread Magnus Hagander
On Fri, Jan 13, 2012 at 14:46, Magnus Hagander mag...@hagander.net wrote:
 On Fri, Jan 13, 2012 at 14:42, Greg Smith g...@2ndquadrant.com wrote:
 On 01/03/2012 12:59 PM, Tom Lane wrote:

 Noah Misch n...@leadboat.com writes:


 Regarding the other message, avoid composing a translated message from
 independently-translated parts.


 Yes.  I haven't looked at the patch, but I wonder whether it wouldn't be
 better to dodge both of these problems by having the subroutine return a
 success/failure result code, so that the actual ereport can be done at
 an outer level where the full message (and hint) can be written
 directly.



 Between your and Noah's comments I understand the issue and likely direction
 to sort this out now.  Bounced forward to the next CommitFest and back on my
 plate again.  Guess I'll be learning new and exciting things about
 translation this week.

 I should say that I have more or less finished this one off, since I
 figured you were working on more important things. Just a final round
 of cleanup and commit left, really, so you can take it off your plate
 again if you want to :-)

Finally, to end the bikeshedding, I've applied this patch after
another round of fairly extensive changes.

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/



Re: [HACKERS] Concurrent CREATE TABLE/DROP SCHEMA leaves inconsistent leftovers

2012-01-15 Thread Hitoshi Harada
On Sat, Jan 14, 2012 at 6:48 AM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Jan 14, 2012 at 5:25 AM, Hitoshi Harada umi.tan...@gmail.com wrote:
 The patch looks ok, though I wonder if we could have a way to release
 the lock on namespace much before the end of transaction.

 Well, that would kind of miss the point, wouldn't it?  I mean, the race
 is that the process dropping the schema might not see the newly
 created object.  The only way to prevent that is to hold some sort of
 lock until the newly created object is visible, which doesn't happen
 until commit.

 I know it's a limited situation, though. Maybe the right way would be
 to check the namespace at the end of the transaction if any object is
 created, rather than locking it.

 And then what?  If you find that you created an object in a namespace
 that's been subsequently dropped, what do you do about that?  As far
 as I can see, your only real choice would be to roll back the
 transaction that uncovers this problem, which is likely to produce
 more rollbacks than just letting the deadlock detector sort it out.

Right. I thought this patch introduced a lock on the schema for SELECT, but
actually we'd been locking the schema with SELECT since before the patch.
So the behavior becomes consistent (between SELECT and CREATE) now
with it. And I agree that my insistence is pointless after looking more closely.

Thanks,
-- 
Hitoshi Harada



Re: [HACKERS] JSON for PG 9.2

2012-01-15 Thread Andrew Dunstan



On 01/14/2012 03:06 PM, Andrew Dunstan wrote:




OK, here's a patch that does both query_to_json and array_to_json, 
along with docs and regression tests. It includes Robert's original 
patch, although I can produce a differential patch if required. It can 
also be pulled from https://bitbucket.org/adunstan/pgdevel






Here's an update that adds row_to_json, plus a bit more cleanup. Example:


   andrew=# SELECT row_to_json(q)
   FROM (SELECT $$a$$ || x AS b,
 y AS c,
 ARRAY[ROW(x.*,ARRAY[1,2,3]),
   ROW(y.*,ARRAY[4,5,6])] AS z
  FROM generate_series(1,2) x,
   generate_series(4,5) y) q;
                               row_to_json
    ---------------------------------------------------------------------
     {"b":"a1","c":4,"z":[{"f1":1,"f2":[1,2,3]},{"f1":4,"f2":[4,5,6]}]}
     {"b":"a1","c":5,"z":[{"f1":1,"f2":[1,2,3]},{"f1":5,"f2":[4,5,6]}]}
     {"b":"a2","c":4,"z":[{"f1":2,"f2":[1,2,3]},{"f1":4,"f2":[4,5,6]}]}
     {"b":"a2","c":5,"z":[{"f1":2,"f2":[1,2,3]},{"f1":5,"f2":[4,5,6]}]}
   (4 rows)


(This might be more to Robert's taste than query_to_json() :-) )

cheers

andrew


diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 152ef2f..f45b10b 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -269,6 +269,12 @@
entry/entry
entryXML data/entry
   /row
+
+  row
+   entrytypejson/type/entry
+   entry/entry
+   entryJSON data/entry
+  /row
  /tbody
 /tgroup
/table
@@ -4169,6 +4175,21 @@ SET xmloption TO { DOCUMENT | CONTENT };
/sect2
   /sect1
 
+  sect1 id=datatype-json
+   titleacronymJSON/ Type/title
+
+   indexterm zone=datatype-json
+primaryJSON/primary
+   /indexterm
+
+   para
+The typejson/type data type can be used to store JSON data.  Such
+data can also be stored as typetext/type, but the
+typejson/type data type has the advantage of checking that each
+stored value is a valid JSON value.
+   /para
+  /sect1
+
   array;
 
   rowtypes;
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2e06346..9368739 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -9615,6 +9615,77 @@ table2-mapping
   /sect2
  /sect1
 
+ sect1 id=functions-json
+  titleJSON functions/title
+
+  indexterm zone=datatype-json
+	primaryJSON/primary
+	secondaryFunctions and operators/secondary
+  /indexterm
+
+  para
+This section descripbes the functions that are available for creating
+JSON (see xref linkend=datatype-json) data.
+  /para
+
+  table id=functions-json-table
+titleJSON Support Functions/title
+tgroup cols=4
+ thead
+  row
+   entryFunction/entry
+   entryDescription/entry
+   entryExample/entry
+   entryExample Result/entry
+  /row
+ /thead
+ tbody
+  row
+   entry
+ indexterm
+  primaryquery_to_json/primary
+ /indexterm
+ literalquery_to_json(text, boolean)/literal
+   /entry
+   entry
+ Returns the result of running the query as JSON. If the
+ second parameter is true, there will be a line feed between records.
+   /entry
+   entryliteralquery_to_json('select 1 as a, $$foo$$ as b', false)/literal/entry
+   entryliteral[{a:1,b:foo}]/literal/entry
+  /row
+  row
+   entry
+ indexterm
+  primaryarray_to_json/primary
+ /indexterm
+ literalarray_to_json(anyarray)/literal
+   /entry
+   entry
+ Returns the array as JSON. A Postgres multi-dimensional array becomes a JSON
+ array of arrays.
+   /entry
+   entryliteralarray_to_json('{{1,5},{99,100}}'::int[])/literal/entry
+   entryliteral[[1,5],[99,100]]/literal/entry
+  /row
+  row
+   entry
+ indexterm
+  primaryrow_to_json/primary
+ /indexterm
+ literalrow_to_json(record)/literal
+   /entry
+   entry
+ Returns the row as JSON.
+   /entry
+   entryliteralrow_to_json(row(1,'foo'))/literal/entry
+   entryliteral{f1:1,f2:foo}/literal/entry
+  /row
+ /tbody
+/tgroup
+   /table
+
+ /sect1
 
  sect1 id=functions-sequence
   titleSequence Manipulation Functions/title
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8b48105..ddb2784 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -24,6 +24,7 @@
 #include rewrite/rewriteHandler.h
 #include tcop/tcopprot.h
 #include utils/builtins.h
+#include utils/json.h
 #include utils/lsyscache.h
 #include utils/rel.h
 #include utils/snapmgr.h
@@ -99,7 +100,6 @@ static void ExplainDummyGroup(const char *objtype, const char *labelname,
 static void ExplainXMLTag(const char *tagname, int flags, ExplainState *es);
 static void ExplainJSONLineEnding(ExplainState *es);
 static void ExplainYAMLLineStarting(ExplainState *es);
-static void escape_json(StringInfo buf, const char *str);
 static void escape_yaml(StringInfo buf, const char 

[HACKERS] WAL Restore process during recovery

2012-01-15 Thread Simon Riggs
The WALRestore process asynchronously executes restore_command while
recovery continues working.

It overlaps the download of the next WAL file with replay, to reduce
delays in file-based archive recovery.

It handles the file-only and streaming-plus-file cases correctly.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ce659ec..e8b0b69 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -40,6 +40,7 @@
 #include pgstat.h
 #include postmaster/bgwriter.h
 #include postmaster/startup.h
+#include postmaster/walrestore.h
 #include replication/walreceiver.h
 #include replication/walsender.h
 #include storage/bufmgr.h
@@ -187,7 +188,6 @@ static bool InArchiveRecovery = false;
 static bool restoredFromArchive = false;
 
 /* options taken from recovery.conf for archive recovery */
-static char *recoveryRestoreCommand = NULL;
 static char *recoveryEndCommand = NULL;
 static char *archiveCleanupCommand = NULL;
 static RecoveryTargetType recoveryTarget = RECOVERY_TARGET_UNSET;
@@ -575,8 +575,8 @@ bool reachedConsistency = false;
 
 static bool InRedo = false;
 
-/* Have we launched bgwriter during recovery? */
-static bool bgwriterLaunched = false;
+/* Have we launched background procs during archive recovery yet? */
+static bool ArchRecoveryBgProcsActive = false;
 
 /*
  * Information logged when we detect a change in one of the parameters
@@ -632,8 +632,6 @@ static bool XLogPageRead(XLogRecPtr *RecPtr, int emode, bool fetching_ckpt,
 			 bool randAccess);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
-static bool RestoreArchivedFile(char *path, const char *xlogfname,
-	const char *recovername, off_t expectedSize);
 static void ExecuteRecoveryCommand(char *command, char *commandName,
 	   bool failOnerror);
 static void PreallocXlogFiles(XLogRecPtr endptr);
@@ -2706,19 +2704,47 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
 
 	XLogFileName(xlogfname, tli, log, seg);
 
+#define TMPRECOVERYXLOG	RECOVERYXLOG
+
 	switch (source)
 	{
 		case XLOG_FROM_ARCHIVE:
+			/*
+			 * Check to see if the WALRestore process has already put the
+			 * next file in place while we were working. If so, use that.
+			 * If not, get it ourselves. This makes it easier to handle
+			 * initial state before the WALRestore is active, and also
+			 * handles the stop/start logic correctly when we have both
+			 * streaming and file based replication active.
+			 *
+			 * We queue up the next task for WALRestore after we've begun to
+			 * use this file later in XLogFileRead().
+			 *
+			 * If the WALRestore process is still active, the lock wait makes
+			 * us wait, which is just like we were executing the command
+			 * ourselves and so doesn't alter the logic elsewhere.
+			 */
+			if (XLogFileIsNowFullyRestored(tli, log, seg))
+			{
+snprintf(path, MAXPGPATH, XLOGDIR /%s, TMPRECOVERYXLOG);
+restoredFromArchive = true;
+break;
+			}
+
 			/* Report recovery progress in PS display */
 			snprintf(activitymsg, sizeof(activitymsg), waiting for %s,
 	 xlogfname);
 			set_ps_display(activitymsg, false);
 
 			restoredFromArchive = RestoreArchivedFile(path, xlogfname,
-	  RECOVERYXLOG,
+	  TMPRECOVERYXLOG,
 	  XLogSegSize);
+
 			if (!restoredFromArchive)
+			{
+LWLockRelease(WALRestoreCommandLock);
 return -1;
+			}
 			break;
 
 		case XLOG_FROM_PG_XLOG:
@@ -2748,18 +2774,42 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
 		if (stat(xlogfpath, statbuf) == 0)
 		{
 			if (unlink(xlogfpath) != 0)
+			{
+LWLockRelease(WALRestoreCommandLock);
 ereport(FATAL,
 		(errcode_for_file_access(),
 		 errmsg(could not remove file \%s\: %m,
 xlogfpath)));
+			}
 			reload = true;
 		}
 
 		if (rename(path, xlogfpath)  0)
+		{
+			LWLockRelease(WALRestoreCommandLock);
 			ereport(ERROR,
 (errcode_for_file_access(),
  errmsg(could not rename file \%s\ to \%s\: %m,
 		path, xlogfpath)));
+		}
+
+		/*
+		 * Make sure we recover from the new filename, so we can reuse the
+		 * temporary filename for asynchronous restore actions.
+		 */
+		strcpy(path, xlogfpath);
+
+		/*
+		 * Tell the WALRestore process to get the next file now.
+		 * Hopefully it will be ready for use in time for the next call the
+		 * Startup process makes to XLogFileRead().
+		 *
+		 * It might seem like we should do that earlier but then there is a
+		 * race condition that might lead to replacing RECOVERYXLOG with
+		 * another file before we've copied it.
+		 */
+		SetNextWALRestoreLogSeg(tli, log, seg);
+		LWLockRelease(WALRestoreCommandLock);
 
 		/*
 		 * If the existing segment was replaced, since walsenders might have
@@ -2911,8 +2961,11 @@ XLogFileClose(void)
  * For fixed-size files, the caller may pass the expected 

Re: [HACKERS] pgstat documentation tables

2012-01-15 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 Right now we have a single table on
 http://www.postgresql.org/docs/devel/static/monitoring-stats.html#MONITORING-STATS-VIEWS
 that lists all our statistics views ...
 I'd like to turn that into one table for each view,

Please follow the style already used for system catalogs; ie I think
there should be a summary table with one entry per view, and then a
separate description and table-of-columns for each view.

regards, tom lane



Re: [HACKERS] 9.2 Reviewfest - logging hooks patch

2012-01-15 Thread Martin Pihlak
On 01/13/2012 05:29 AM, Christopher Maujean wrote:
 Patch
 http://archives.postgresql.org/message-id/4edbefbb.9070...@gmail.com
 
 
 In reviewing the referenced patch, we found that it has no developer
 documentation and no regression tests. While it compiles and installs,
 there is nothing to tell us how to use (or test) it.
 

Thanks for reviewing!

I think the regression tests should be part of the module that uses
the hooks. Alas, there are none that we'd want to ship with Postgres
at this time.

The best I could offer for testing the hooks is a log forwarding module
I've been working on. Basically, what it does is install the
hooks, look at the messages emitted by the backend and, if needed, send
them on via UDP. Available at: https://github.com/mpihlak/pg_logforward

To install, run make install and modify postgresql.conf to include:

shared_preload_libraries = 'pg_logforward'
logforward.target_names = 'jsonsrv'
logforward.jsonsrv_host = '127.0.0.1'
logforward.jsonsrv_port = 23456

Now, after restarting postgres, you could use "nc -u -l 23456" to
receive the log. Alternatively, use testing/test_logserver.py as a
log server.
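
For the archives, the skeleton of a module like this is small; the sketch
below shows the general shape, where the hook name and signature are the ones
pg_logforward assumes the pending logging-hooks patch provides, so treat them
as an assumption rather than settled core API:

#include "postgres.h"
#include "fmgr.h"
#include "utils/elog.h"

PG_MODULE_MAGIC;

void		_PG_init(void);

static emit_log_hook_type prev_emit_log_hook = NULL;

static void
my_emit_log(ErrorData *edata)
{
	/* inspect edata->elevel / edata->message and forward it (e.g. via UDP),
	 * then chain to any previously installed hook */
	if (prev_emit_log_hook)
		prev_emit_log_hook(edata);
}

void
_PG_init(void)
{
	prev_emit_log_hook = emit_log_hook;
	emit_log_hook = my_emit_log;
}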

About documentation - it sure would be nice to have all the hooks
written up somewhere. Any suggestions as to where to put it?

regards,
Martin



Re: [HACKERS] libpq: PQcmdStatus, PQcmdTuples signatures can be painlessly improved

2012-01-15 Thread Peter Eisentraut
On Tue, 2012-01-10 at 14:10 -0500, Alex Goncharov wrote:
 | Note that const PGresult * would only warn against changing the
 | fields
 
 It would not warn, it would err (the compilation should fail).

No, const violations generally only produce warnings.

 
 | of the PGresult struct.  It doesn't do anything about changing the data
 | pointed to by pointers in the PGresult struct.  So what you are saying
 | doesn't follow.
 
 By this logic, passing 'const struct foo *' doesn't have any point and
 value, for any function.

It pretty much doesn't.

 But we know that this is done (and thank you
 for that) in many cases -- a good style, self-documentation and some
 protection.

I'm not disagreeing, but I'm pointing out that it won't do what you
expect.
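
A toy example of what I mean by that (not libpq's actual structs): const on
the pointed-to struct only blocks writes to its members, not writes through
pointer members, so "const PGresult *" is shallow protection at best.

struct toy_result
{
	int		ntuples;
	char   *errmsg;
};

static void
inspect(const struct toy_result *r)
{
	/* r->ntuples = 0;        the compiler complains: member of a const object */
	r->errmsg[0] = 'X';		/* accepted: the chars errmsg points to aren't const */
}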





Re: [HACKERS] Dry-run mode for pg_archivecleanup

2012-01-15 Thread Gabriele Bartolini

Hi Josh,

On 15/01/12 01:13, Josh Kupershmidt wrote:

I have signed on to review this patch for the 2012-01 CF. The patch
applies cleanly, includes the necessary documentation, and implements
a useful feature.

Thank you for dedicating your time to this matter.

I think the actual debugging line:

+   fprintf(stdout, "%s\n", WALFilePath);
My actual intention was to have the filename as output of the command, 
in order to easily pipe it to another script. Hence my first choice 
was to use the stdout channel, considering also that pg_archivecleanup 
in dry-run mode is harmless and does not touch the content of the directory.

Oh, and I think the "removing file..." debug message above should not
be printed in dry-run mode, lest we confuse the admin.

Yes, I agree with you here.

Let me know what you think about my reasons for the stdout channel, then 
I will send v2.


Thanks again!

Ciao,
Gabriele

--
 Gabriele Bartolini - 2ndQuadrant Italia
 PostgreSQL Training, Services and Support
 gabriele.bartol...@2ndquadrant.it | www.2ndQuadrant.it




Re: [HACKERS] pg_trigger_depth() v3 (was: TG_DEPTH)

2012-01-15 Thread Kevin Grittner
Kevin Grittner  wrote:
 Florian Pflug  wrote:
 
 The trigger depth is incremented before calling the trigger
 function in ExecCallTriggerFunc() and decremented right
 afterwards, which seems fine - apart from the fact that the
 decrement is skipped in case of an error. The patch handles that
 by saving and restoring the value of tg_depth in
 PushTransaction() and {Commit,Abort}SubTransaction(), respectively.

 While I can't come up with a failure case for that approach, it
 seems rather fragile - who's to say that nested trigger
 invocations correspond that tightly to sub-transactions?
 
 I believe the same effect could be achieved more cleanly by
 a TRY/CATCH block in ExecCallTriggerFunc().
 
Done that way in this version.
 
 * Other in-core PLs
 As it stands, the patch only export tg_depth to plpgsql functions,
 not to functions written in one of the other in-core PLs. It'd be
 good to change that, I believe - otherwise the other PLs become
 second-class citizens in the long run.
 
 Are you suggesting that this be implemented as a special trigger
 variable in every PL, or that it simply be a regular function that
 returns zero when not in a trigger and some positive value when
 called from a trigger? The latter seems like a pretty good idea to
 me. If that is done, is there any point to *also* having a TG_DEPTH
 trigger variable in plpgsql? (I don't think there is.)
 
I dropped the TG_DEPTH name and the patch just supports a
pg_trigger_depth() function now.  Useful values are probably:
 
0: No trigger active on the connection.
1: Top level trigger.  Useful to restrict certain DML to be allowed
   only from triggers.
>1: OK to do trigger-restricted DML.
greater than expected maximum depth: warn before hard crash
 
 [questions about code coverage in regression tests]
 
I altered the tests to improve code coverage.  In addition, since
this is no longer just a plpgsql feature, I moved the tests to the
triggers.sql file.
 
-Kevin




tg_depth.v3.patch
Description: Binary data



Re: [HACKERS] Our poll() based WaitLatch implementation is broken

2012-01-15 Thread Heikki Linnakangas

On 15.01.2012 09:26, Peter Geoghegan wrote:

Build Postgres master, on Linux or another platform that will use the
poll() implementation rather than the older select(). Send the
Postmaster SIGKILL. Observe that the WAL Writer lives on, representing
a denial of service as it stays attached to shared memory, busy
waiting (evident from the fact that it quickly leaks memory).


The poll()-based implementation checked for POLLIN on the 
postmaster-alive pipe, just like we check for the fd to become readable 
in the select() implementation. But poll() has a separate event code, 
POLLHUP, for the other end going away; it does not set POLLIN on the fd 
but POLLHUP.


Fixed, to check POLLHUP. I still kept the check for POLLIN; the pipe should 
never become readable, so if it does, something is badly wrong. I also 
threw in a check for POLLNVAL, which means "invalid request: fd not 
open". That should definitely not happen, but if it does, it seems good 
to treat it as postmaster death too. Even if the postmaster isn't dead 
yet, we could no longer detect when it does die.
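
For anyone who wants to see the difference in isolation, here is a minimal
standalone demo (assuming Linux-like poll() semantics; this is not the
committed fix, just an illustration of why POLLIN alone isn't enough):

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int			fds[2];
	struct pollfd pfd;

	if (pipe(fds) != 0)
		return 1;
	close(fds[1]);				/* simulate the writer (postmaster) going away */

	pfd.fd = fds[0];
	pfd.events = POLLIN;
	if (poll(&pfd, 1, 1000) > 0)
	{
		if (pfd.revents & (POLLHUP | POLLERR | POLLNVAL))
			printf("hangup/error: treat as postmaster death\n");
		else if (pfd.revents & POLLIN)
			printf("readable: also treat as death, the pipe should never have data\n");
	}
	return 0;
}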



The rationale for introducing the poll()-based implementation where
available was that it performed better than a select()-based one. I
wonder, how compelling a win is that expected to be?


Ganesh Venkitachalam did some micro-benchmarking after the latch patch 
was committed: 
http://archives.postgresql.org/pgsql-hackers/2010-09/msg01609.php. I 
don't think it makes any meaningful difference in real applications, but 
poll() also doesn't have an arbitrary limit on the range of fd numbers 
that can be used.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] pg_basebackup is not checking IDENTIFY_SYSTEM numbre of columns

2012-01-15 Thread Jaime Casanova
On Wed, Jan 11, 2012 at 5:11 PM, Magnus Hagander mag...@hagander.net wrote:

 No, no reason. Adding such a check would be a good idea.


OK, patch attached; it also adds a few PQclear() calls before
disconnect_and_exit().
BTW, in BaseBackup() at line 1149 (after the patch is applied) there
is an exit() instead of disconnect_and_exit(), and that is probably a
typo too.

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: 24x7 support and training
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
new file mode 100644
index 4007680..22d906a
*** a/src/bin/pg_basebackup/pg_basebackup.c
--- b/src/bin/pg_basebackup/pg_basebackup.c
*** BaseBackup(void)
*** 916,927 
  	{
  		fprintf(stderr, _(%s: could not identify system: %s\n),
  progname, PQerrorMessage(conn));
  		disconnect_and_exit(1);
  	}
! 	if (PQntuples(res) != 1)
  	{
! 		fprintf(stderr, _(%s: could not identify system, got %i rows\n),
! progname, PQntuples(res));
  		disconnect_and_exit(1);
  	}
  	sysidentifier = strdup(PQgetvalue(res, 0, 0));
--- 916,929 
  	{
  		fprintf(stderr, _(%s: could not identify system: %s\n),
  progname, PQerrorMessage(conn));
+ 		PQclear(res);
  		disconnect_and_exit(1);
  	}
! 	if (PQntuples(res) != 1 || PQnfields(res) != 3)
  	{
! 		fprintf(stderr, _(%s: could not identify system, got %i rows and %i fields\n),
! progname, PQntuples(res), PQnfields(res));
! 		PQclear(res);
  		disconnect_and_exit(1);
  	}
  	sysidentifier = strdup(PQgetvalue(res, 0, 0));
*** BaseBackup(void)
*** 954,965 
--- 956,969 
  	{
  		fprintf(stderr, _(%s: could not initiate base backup: %s),
  progname, PQerrorMessage(conn));
+ 		PQclear(res);
  		disconnect_and_exit(1);
  	}
  	if (PQntuples(res) != 1)
  	{
  		fprintf(stderr, _(%s: no start point returned from server\n),
  progname);
+ 		PQclear(res);
  		disconnect_and_exit(1);
  	}
  	strcpy(xlogstart, PQgetvalue(res, 0, 0));
*** BaseBackup(void)
*** 976,986 
--- 980,992 
  	{
  		fprintf(stderr, _(%s: could not get backup header: %s),
  progname, PQerrorMessage(conn));
+ 		PQclear(res);
  		disconnect_and_exit(1);
  	}
  	if (PQntuples(res)  1)
  	{
  		fprintf(stderr, _(%s: no data returned from server\n), progname);
+ 		PQclear(res);
  		disconnect_and_exit(1);
  	}
  
*** BaseBackup(void)
*** 1010,1015 
--- 1016,1022 
  	{
  		fprintf(stderr, _(%s: can only write single tablespace to stdout, database has %d\n),
  progname, PQntuples(res));
+ 		PQclear(res);
  		disconnect_and_exit(1);
  	}
  
*** BaseBackup(void)
*** 1051,1062 
--- 1058,1071 
  	{
  		fprintf(stderr, _(%s: could not get WAL end position from server\n),
  progname);
+ 		PQclear(res);
  		disconnect_and_exit(1);
  	}
  	if (PQntuples(res) != 1)
  	{
  		fprintf(stderr, _("%s: no WAL end position returned from server\n"),
  progname);
+ 		PQclear(res);
  		disconnect_and_exit(1);
  	}
  	strcpy(xlogend, PQgetvalue(res, 0, 0));
*** BaseBackup(void)
*** 1069,1074 
--- 1078,1084 
  	{
  		fprintf(stderr, _("%s: final receive failed: %s"),
  progname, PQerrorMessage(conn));
+ 		PQclear(res);
  		disconnect_and_exit(1);
  	}
  
*** BaseBackup(void)
*** 1089,1094 
--- 1099,1105 
  		{
  			fprintf(stderr, _("%s: could not send command to background pipe: %s\n"),
  	progname, strerror(errno));
+ 			PQclear(res);
  			disconnect_and_exit(1);
  		}
  
*** BaseBackup(void)
*** 1098,1121 
--- 1109,1136 
  		{
  			fprintf(stderr, _("%s: could not wait for child process: %s\n"),
  	progname, strerror(errno));
+ 			PQclear(res);
  			disconnect_and_exit(1);
  		}
  		if (r != bgchild)
  		{
  			fprintf(stderr, _("%s: child %i died, expected %i\n"),
  	progname, r, bgchild);
+ 			PQclear(res);
  			disconnect_and_exit(1);
  		}
  		if (!WIFEXITED(status))
  		{
  			fprintf(stderr, _("%s: child process did not exit normally\n"),
  	progname);
+ 			PQclear(res);
  			disconnect_and_exit(1);
  		}
  		if (WEXITSTATUS(status) != 0)
  		{
  			fprintf(stderr, _("%s: child process exited with error %i\n"),
  	progname, WEXITSTATUS(status));
+ 			PQclear(res);
  			disconnect_and_exit(1);
  		}
  		/* Exited normally, we're happy! */
*** BaseBackup(void)
*** 1130,1135 
--- 1145,1151 
  		{
  			fprintf(stderr, _("%s: could not parse xlog end position \"%s\"\n"),
  	progname, xlogend);
+ 			PQclear(res);
  			exit(1);
  		}
  		InterlockedIncrement(has_xlogendptr);
*** BaseBackup(void)
*** 1140,1145 
--- 1156,1162 
  			_dosmaperr(GetLastError());
  			fprintf(stderr, _("%s: could not wait for child thread: %s\n"),
  	progname, strerror(errno));
+ 			PQclear(res);
  			disconnect_and_exit(1);
  		}
  		if (GetExitCodeThread((HANDLE) bgchild, status) == 0)
*** BaseBackup(void)
*** 1147,1158 
--- 1164,1177 
  			

Re: [HACKERS] [PATCH] Support for foreign keys with arrays

2012-01-15 Thread Marco Nenciarini
Hello,

Il giorno dom, 11/12/2011 alle 19.45 -0500, Noah Misch ha scritto:
 On Sat, Dec 10, 2011 at 09:47:53AM +0100, Gabriele Bartolini wrote:
  So, here is a summary:
 
  --------------- --------- ---------
                 |   ON    |   ON    |
  Action         | DELETE  | UPDATE  |
  --------------- --------- ---------
  CASCADE        |   Row   |  Error  |
  SET NULL       |   Row   |   Row   |
  SET DEFAULT    |   Row   |   Row   |
  ARRAY CASCADE  | Element | Element |
  ARRAY SET NULL | Element | Element |
  NO ACTION      |    -    |    -    |
  RESTRICT       |    -    |    -    |
  --------------- --------- ---------
 
  If that's fine with you guys, Marco and I will refactor the development  
  based on these assumptions.
 
 Looks fine.
 

This is our latest version of the patch. Gabriele, Gianni and I have
discussed a lot and decided to send an initial patch which uses EACH
REFERENCES instead of ARRAY REFERENCES. The reason behind this is that
ARRAY REFERENCES generates an ambiguous grammar, and we all agreed that
EACH REFERENCES makes sense (and at the same time does not introduce any
new keyword). This is however open for discussion.


The patch now includes the following clauses on the delete/update
actions - as per previous emails:


--------------- --------- ---------
               |   ON    |   ON    |
Action         | DELETE  | UPDATE  |
--------------- --------- ---------
CASCADE        |   Row   |  Error  |
SET NULL       |   Row   |   Row   |
SET DEFAULT    |   Row   |   Row   |
ARRAY CASCADE  | Element | Element |
ARRAY SET NULL | Element | Element |
NO ACTION      |    -    |    -    |
RESTRICT       |    -    |    -    |
--------------- --------- ---------
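
To make the syntax concrete, here is a sketch of how it is meant to be used
(EACH REFERENCES and the ARRAY CASCADE action are the spellings proposed in
this thread, not committed syntax, and the tables are invented):

CREATE TABLE authors (
    id   integer PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE books (
    title      text PRIMARY KEY,
    author_ids integer[] EACH REFERENCES authors (id) ON DELETE ARRAY CASCADE
);

-- Per the ON DELETE column above, deleting one author removes just that
-- element from books.author_ids instead of deleting the whole books row.
DELETE FROM authors WHERE id = 42;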


We will resubmit the patch for the 2012-01 commit fest.


Thank you,
Marco


-- 
Marco Nenciarini - System manager @ Devise.IT
marco.nenciar...@devise.it | http://www.devise.it



foreign-key-array-v2.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Vacuum rate limit in KBps

2012-01-15 Thread Heikki Linnakangas

On 15.01.2012 10:24, Greg Smith wrote:

That got me thinking: if MB/s is what everyone wants to monitor, can we
provide a UI to set these parameters that way too? The attached patch is
a bit rough still, but it does that. The key was recognizing that the
cost delay plus cost limit can be converted into an upper limit on cost
units per second, presuming the writes themselves are free. If you then
also assume the worst case--that everything will end up dirty--by
throwing in the block size, too, you compute a maximum rate in MB/s.
That represents the fastest you can possibly write.


+1. I've been thinking we should do that for a long time, but haven't 
gotten around to it.


I think it makes more sense to use the max read rate as the main knob, 
rather than write rate. That's because the max read rate is higher than 
the write rate, when you don't need to dirty pages. Or do you think 
saturating the I/O system with writes is so much bigger a problem than 
read I/O that it makes more sense to emphasize the writes?


I was thinking of something like this, in postgresql.conf:

# - Vacuum Throttling -

#vacuum_cost_page_miss = 1.0    # measured on an arbitrary scale
#vacuum_cost_page_dirty = 2.0   # same scale as above
#vacuum_cost_page_hit = 0.1     # same scale as above
#vacuum_rate_limit = 8MB        # max reads per second

This is now similar to the cost settings for the planner, which is good.
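
For what it's worth, a back-of-the-envelope check of where the 8MB figure
comes from, assuming the current defaults (vacuum_cost_limit = 200, a 20 ms
cost delay, vacuum_cost_page_miss = 10 on today's scale) and 8 kB blocks;
purely illustrative arithmetic, not something from the patch:

SELECT (200 / 10)               -- page misses allowed per 20 ms nap
     * (1000 / 20)              -- naps per second
     * 8192.0 / (1024 * 1024)   AS max_read_mb_per_sec;
-- => 7.8125, i.e. the 8MB ballpark used above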


There's one serious concern I don't have a quick answer to. What do we
do with in-place upgrade of relations that specified a custom
vacuum_cost_limit? I can easily chew on getting the right logic to
convert those to equals in the new setting style, but I am not prepared
to go solely on the hook for all in-place upgrade work one might do
here. Would this be easiest to handle as one of those dump/restore
transformations?


It needs to be handled at dump/restore time. I'm not sure where that 
transformation belongs, though. Do we have any precedent for this? I 
think we have two options:


1. Accept the old autovacuum_cost_limit setting in CREATE TABLE, and 
transform it immediately into corresponding autovacuum_rate_limit setting.


2. Transform in pg_dump, so that the CREATE TABLE statements in the dump 
use the new autovacuum_rate_limit setting.
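
With option 1, something like this old-style dump line would keep working
and be converted on the fly (autovacuum_rate_limit is the name proposed in
this thread, the table name is invented, and the 16MB figure assumes the
default vacuum_cost_page_miss of 10 with a 20 ms delay):

-- old 9.1-style dump output, still accepted:
ALTER TABLE queue SET (autovacuum_vacuum_cost_limit = 400);
-- ...and interpreted as if the new-style setting had been given:
ALTER TABLE queue SET (autovacuum_rate_limit = '16MB');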


The advantage of option 1 is that dumps taken with an old 9.1 pg_dump 
still work on a 9.2 server. We usually try to preserve that 
backwards-compatibility, although we always recommend using the pg_dump 
from the newer version on upgrade. However, you need to know the 
vacuum_cost_page_miss setting effective in the old server to do the 
transformation correctly (or vacuum_cost_page_dirty, if we use the write 
max rate as the main knob as you suggested), and we don't have access 
to that when restoring a dump.



My guess is that's more sensible than the alternative
of making an on-read converter that only writes in the new format, then
worrying about upgrading all old pages before moving forward.


This doesn't require any page format changes, so I don't think the above 
sentence makes any sense.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Fix breakage from earlier plperl fix.

2012-01-15 Thread Andrew Dunstan



On 01/13/2012 07:09 PM, Alex Hunsaker wrote:

On Fri, Jan 13, 2012 at 16:07, Andrew Dunstan and...@dunslane.net  wrote:


On 01/12/2012 09:28 PM, Alex Hunsaker wrote:

Util.c/o not depending on plperl_helpers.h was also throwing me for a loop
so I fixed it and SPI.c... Thoughts?


Basically looks good, but I'm confused by this:

   do language plperl $$ elog('NOTICE', ${^TAINT}); $$;



Why is NOTICE quoted here? It's a constant that we've been careful to set up.

No reason.

[ Err, well, it was just part of me flailing around while trying to
figure out why elog was still crashing even though I had the issue
fixed. The real problem was Util.o was not being recompiled (due to
Util.c not depending on plperl_helpers.h) ]



Except that it doesn't work. The above produces no output in the 
regression tests.



I've committed it with that changed and result file changed accordingly.

cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2012-01-15 Thread Jeff Janes
On Fri, Jan 13, 2012 at 9:50 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Jeff Janes jeff.ja...@gmail.com writes:
 I don't understand why this is controversial.  In the current code, if
 you have a master and a single sync standby, and the master disappears
 and you promote the standby, now the new master is running *without a
 standby*.

 If you configured it to use sync rep, it won't accept any transactions
 until you give it a standby.  If you configured it not to, then it's you
 that has changed the replication requirements.

Sure, but isn't that a very common usage?  Maybe my perceptions are
out of whack, but I commonly hear about fail-over and rarely hear
about using more than one slave so that you can fail over and still
have a positive number of slaves.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Standalone synchronous master

2012-01-15 Thread Jeff Janes
On Fri, Jan 13, 2012 at 10:12 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Jeff Janes jeff.ja...@gmail.com wrote:

 I don't understand why this is controversial.

 I'm having a hard time seeing why this is considered a feature.  It
 seems to me what is being proposed is a mode with no higher
 integrity guarantee than asynchronous replication, but latency
 equivalent to synchronous replication.

There are never 100% guarantees.  You could always have two
independent failures (the WAL disk of the master and of the slave)
nearly simultaneously.

If you look at weaker guarantees, then with asynchronous replication
you are almost guaranteed to lose transactions on a fail-over of a
busy server, and with the proposed option you are almost guaranteed
not to, as long as disconnections are rare.

As far as latency, I think there are many cases when a small latency
is pretty much equivalent to zero latency.  A human on the other end
of a commit is unlikely to notice a latency of 0.1 seconds.

 I can see where it's
 tempting to want to think it gives something more in terms of
 integrity guarantees, but when I think it through, I'm not really
 seeing any actual benefit.

I think the value of having a synchronously replicated commit is
greater than zero but less than infinite.  I don't think it is
outrageous to think that that value could be approximately expressed
in seconds you are willing to wait for that replicated commit before
going ahead without it.


 If this fed into something such that people got jabber message,
 emails, or telephone calls any time it switched between synchronous
 and stand-alone mode, that would make it a built-in monitoring,
 fail-over, and alert system -- which *would* have some value.  But
 in the past we've always recommended external tools for such
 features.

Since synchronous_standby_names cannot be changed without bouncing the
server, we do not provide the tools for an external tool to make this
change cleanly.

Cheers,

Jeff

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Dry-run mode for pg_archivecleanup

2012-01-15 Thread Josh Kupershmidt
On Sun, Jan 15, 2012 at 3:02 PM, Gabriele Bartolini
gabriele.bartol...@2ndquadrant.it wrote:

 My actual intention was to have the filename as output of the command, in
 order to easily pipe it to another script. Hence my first choice was to
 use the stdout channel, considering also that pg_archivecleanup in dry-run
 mode is harmless and does not touch the content of the directory.

Oh, right - I should have re-read your initial email before diving
into the patch. That all makes sense given your intended purpose. I
guess your goal of constructing some simple way to pass the files
which would be removed on to another script is a little different than
what I initially thought the patch would be useful for, namely as a
testing/debugging aid for an admin.

Perhaps both goals could be met by making use of '--debug' together
with '--dry-run'. If they are both on, then an additional message like
"pg_archivecleanup: would remove file ..." would be printed to
stderr, along with just the filename printed to stdout you already
have.

Josh

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Group commit, revised

2012-01-15 Thread Peter Geoghegan
Attached is a patch that myself and Simon Riggs collaborated on. I
took the group commit patch that Simon posted to the list back in
November, and partially rewrote it. Here is that original thread:

http://archives.postgresql.org/pgsql-hackers/2011-11/msg00802.php

I've also attached the results of a pgbench-tools driven benchmark,
which are quite striking (Just the most relevant image - e-mail me
privately if you'd like a copy of the full report, as I don't want to
send a large PDF file to the list as a courtesy to others). Apart from
the obvious improvement in throughput, there is also a considerable
improvement in average and worst latency at all client counts.

To recap, the initial impetus to pursue this idea came from the
observation that with sync rep, we could get massive improvements in
the transactions-per-second throughput by simply adding clients. Greg
Smith performed a benchmark while in Amsterdam for the PgConf.EU
conference, which was discussed in a talk there. Over an
inter-continental connection from Amsterdam to his office in Baltimore
on the U.S. east coast, he saw TPS as reported by pgbench on what I
suppose was either an insert or update workload grow from a mere 10
TPS for a single connection to over 2000 TPS for about 300
connections. That was with the large, inherent latency imposed by
those sorts of distances (3822 miles/ 6150 km, about a 100ms ping time
on a decent connection). Quite obviously, as clients were added, the
server was able to batch increasing numbers of commits in each
confirmation message, resulting in this effect.

The main way that I've added value here is by refactoring and fixing
bugs. There were some tricky race conditions that caused the
regression tests to fail for that early draft patch, but there were a
few other small bugs too. There is an unprecedented latch pattern
introduced by this patch: Two processes (the WAL Writer and any given
connection backend) have a mutual dependency. Only one may sleep, in
which case the other is responsible for waking it. Much of the
difficulty in debugging this patch, and much of the complexity that
I've added, came from preventing both from simultaneously sleeping,
even in the face of various known failure modes like postmaster death.
If this happens, it does of course represent a denial-of-service, so
that's something that reviewers are going to want to heavily
scrutinise. I suppose that sync rep should be fine here, as it waits
on walsenders, but I haven't actually comprehensively tested the
patch, so there may be undiscovered unpleasant interactions with other
xlog related features. I can report that it passes the regression
tests successfully, and on an apparently consistent basis - I
battled with intermittent failures for a time.

Simon's original patch largely copied the syncrep.c code as an
expedient to prove the concept. Obviously this design was never
intended to get committed, and I've done some commonality and
variability analysis, refactoring to considerably slim down the new
groupcommit.c file by exposing some previously module-private
functions from syncrep.c .

I encourage others to reproduce my benchmark here. I attach a
pgbench-tools config. You can get the latest version of the tool at:

https://github.com/gregs1104/pgbench-tools

I also attach hdparm information for the disk that was used during
these benchmarks. Note that I have not disabled the write cache. It's
a Linux box, with ext4, running a recent kernel.

The benefits (and, admittedly, the controversies) of this patch go
beyond mere batching of commits: it also largely, though not entirely,
obviates the need for user backends to directly write/flush WAL, and
the WAL Writer may never sleep if it continually finds work to do -
wal_writer_delay is obsoleted, as are commit_siblings and
commit_delay. I suspect that they will not be missed. Of course, it
does all this to facilitate group commit itself. The group commit
feature does not have a GUC to control it, as this seems like
something that would be fairly pointless to turn off. FWIW, this is
currently the case for the recently introduced MariaDB group commit 
implementation.

Auxiliary processes cannot use group commit. The changes made prevent
them from availing of commit_siblings/commit_delay parallelism,
because it doesn't exist anymore.

Group commit is sometimes throttled, which seems appropriate - if a
backend requests that the WAL Writer flush an LSN deemed too far from
the known flushed point, that request is rejected and the backend goes
through another path, where XLogWrite() is called. Currently the group
commit infrastructure decides that on the sole basis of there being a
volume of WAL that is equivalent in size to checkpoint_segments
between the two points. This is probably a fairly horrible heuristic,
not least since it overloads checkpoint_segments, but is of course
only a first-pass effort. Bright ideas are always welcome.

Thoughts?

-- 
Peter Geoghegan       

[HACKERS] SKIP LOCKED DATA

2012-01-15 Thread Thomas Munro
Hi

Apologies for posting about new vapourware features for distant
future releases at a very busy time in the cycle for 9.2...

I am wondering out loud whether I am brave enough to try to propose
SKIP LOCKED DATA support and would be grateful for any feedback and/or
{en|dis}couragement.  I don't see it on the todo list, and didn't find
signs of others working on this (did I miss something?), but there are
examples of users asking for this feature (by various names) on the
mailing lists.  Has the idea already been rejected, is it
fundamentally infeasible for some glaring reason, or far too
complicated for new players?

What it is:

Like the existing NOWAIT option, it means your query doesn't wait for
other sessions when trying to get an exclusive lock.  However, rather
than returning an error if it would block, it simply skips over the
rows that couldn't be locked.

What it looks like in other RDBMSs:

DB2 (z/OS only): FOR UPDATE SKIP LOCKED DATA
Oracle: FOR UPDATE SKIP LOCKED
Sybase: FOR UPDATE READPAST
MS SQL Server: FOR UPDATE WITH (READPAST)

(I'm not 100% sure about the last two, which I found by googling for
equivalents of the first two, and there are no doubt subtle differences
among these).

What it's for:

A common usage for this is to increase parallelism in systems with
multiple workers taking jobs from a queue.  I've used it for this
purpose myself on another RDBMS, having seen it recommended for some
types of work queue implementation.  It may have other uses.
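
To make the queue case concrete, a worker would do something like this,
using the syntax proposed above (which does not exist in PostgreSQL today;
the jobs table is invented):

BEGIN;

SELECT id, payload
  FROM jobs
 WHERE state = 'pending'
 ORDER BY id
 LIMIT 1
   FOR UPDATE SKIP LOCKED DATA;

-- ...process the returned job, then mark it done; concurrent workers each
-- get a different unlocked row instead of blocking or erroring out.
UPDATE jobs SET state = 'done' WHERE id = 42;   -- the id returned above

COMMIT;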

How it might be implemented in PostgreSQL:

1.  Extend the grammar and parser to support SKIP LOCKED DATA (or some
other choice of words) in the same place that NOWAIT can appear.

2.  Modify heap_lock_tuple so that the boolean 'nowait' argument is
replaced by an enumeration LockWaitPolicy with values
LOCK_WAIT_POLICY_WAIT (= what false currently does),
LOCK_WAIT_POLICY_LOCK_OR_ERROR (= what true currently does),
LOCK_WAIT_POLICY_LOCK_OR_SKIP (= new behaviour).  Where currently
'nowait' is handled, the new case would also be handled, cleaning up
resources and returning a new HTSU_Result enumerator
HeapTupleWouldBlock.

3.  Modify ExecLockRows to pass the appropriate value to
heap_lock_tuple (presumably received via ExecRowMark, as nowait is
received currently).  Modify the switch on the result of that call,
treating the new case HeapTupleWouldBlock the same way that
HeapTupleSelfUpdated is treated -- that is, goto lnext to fetch the
next tuple.

4.  Probably some changes to handle table-level locks too.

5.  Probably many other things that I'm not aware of right now and
won't discover until I dig/ask further and/or run into a brick wall!

Useful?  Doable?

Thanks,

Thomas Munro

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] disable prompting by default in createuser

2012-01-15 Thread Josh Kupershmidt
On Thu, Dec 22, 2011 at 2:26 PM, Peter Eisentraut pete...@gmx.net wrote:
 On lör, 2011-11-26 at 01:28 +0200, Peter Eisentraut wrote:
 I propose that we change createuser so that it does not prompt for
 anything by default.  We can arrange options so that you can get prompts
 for whatever is missing, but by default, a call to createuser should
 just run CREATE USER with default options.  The fact that createuser
 issues prompts is always annoying when you create setup scripts, because
 you have to be careful to specify all the necessary options, and they
 are inconsistent and different between versions (although the last
 change about that was a long time ago), and the whole behavior seems
 contrary to the behavior of all other utilities.  I don't think there'd
 be a real loss by not prompting by default; after all, we don't really
 want to encourage users to create more superusers, do we?  Comments?

 Patch attached.  I'll add it to the next commitfest.

I looked at this patch for the 2012-01 CF. I like the idea, the
interactive-by-default behavior of createuser has always bugged me as
well.

I see this patch includes a small change to dropuser, to make the
'username' argument mandatory if --interactive is not set, for
symmetry with createuser's new behavior. That's dandy, though IMO we
shouldn't have -i be shorthand for --interactive with dropuser,
and something different with createuser (i.e. we should just get rid
of the -i alias for dropuser).

Another little inconsistency I see with the behavior when no username
to create or drop is given:

$  createuser
createuser: creation of new role failed: ERROR:  role "josh" already exists
$ dropuser
dropuser: missing required argument role name
Try dropuser --help for more information.

i.e. createuser tries taking either $PGUSER or the current username as
a default user to create, while dropuser just bails out. Personally, I
prefer just bailing out if no create/drop user is specified, but
either way I think they should be consistent.

Oh, and I think the leading whitespace of this help message:

printf(_("  --interactive prompt for missing role name
and attributes rather\n

should be made the same as for other commands with no short-alias, e.g.

printf(_("  --replication role can initiate replication\n"));

Other than those little complaints, everything looks good.

Josh

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Memory usage during sorting

2012-01-15 Thread Jeff Janes
In tuplesort.c, it initially reads tuples into memory until availMem
is exhausted.

It then switches to the tape sort algorithm, and allocates buffer
space for each tape it will use.  This substantially over-runs the
allowedMem, and drives availMem negative.

It works off this deficit by writing tuples to tape, and pfree-ing
their spot in the heap, which pfreed space is more or less randomly
scattered.  Once this is done it proceeds to alternate between freeing
more space in the heap and adding things to the heap (in a nearly
strict 1+1 alternation if the tuples are constant size).

The space freed up by the initial round of pfree where it is working
off the space deficit from inittapes is never re-used.  It also cannot
be paged out by the VM system, because it is scattered among actively
used memory.

I don't think that small chunks can be reused from one memory context
to another, but I haven't checked.  Even if it can be, during a big
sort like an index build the backend process doing the build may have
no other contexts which need to use it.

So having over-run workMem and stomped all over it to ensure no one
else can re-use it, we then scrupulously refuse to benefit from that
over-run amount ourselves.

The attached patch allows it to reuse that memory.  On my meager
system it reduced the building of an index on an integer column in a
skinny 200 million row totally randomly ordered table by about 3% from
a baseline of 25 minutes.

I think it would be better to pre-deduct the tape overhead amount we
will need if we decide to switch to tape sort from the availMem before
we even start reading (and then add it back if we do indeed make that
switch).  That way we wouldn't over-run the memory in the first place.
 However, that would cause apparent regressions in which sorts that
previously fit into maintenance_work_mem no longer do.  Boosting
maintenance_work_mem to a level that was actually being used
previously would fix those regressions, but pointing out that the
previous behavior was not optimal doesn't change the fact that people
are used to it and perhaps tuned to it.  So the attached patch seems
more backwards-friendly.


Cheers,

Jeff


sort_mem_usage_v1.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pgstat documentation tables

2012-01-15 Thread Greg Smith

On 01/15/2012 12:20 PM, Tom Lane wrote:

Please follow the style already used for system catalogs; ie I think
there should be a summary table with one entry per view, and then a
separate description and table-of-columns for each view.


Yes, that's a perfect precedent.  I think the easiest path forward here 
is to tweak the updated pg_stat_activity documentation, since that's 
being refactored first anyway.  That can be reformatted until it looks 
just like the system catalog documentation.  And then once that's done, 
the rest of them can be converted over to follow the same style.  I'd be 
willing to work on doing that in a way that improves what is documented, 
too.  The difficulty of working with the existing tables has been the 
deterrent for improving that section to me.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] archive_keepalive_command

2012-01-15 Thread Simon Riggs
On Fri, Dec 16, 2011 at 3:01 PM, Simon Riggs si...@2ndquadrant.com wrote:
 archive_command and restore_command describe how to ship WAL files
 to/from an archive.

 When there is nothing to ship, we delay sending WAL files. When there are
 no WAL files, the standby has no information at all.

 To provide some form of keepalive on quiet systems the
 archive_keepalive_command provides a generic hook to implement
 keepalives. This is implemented as a separate command to avoid storing
 keepalive messages in the archive, or at least allow overwrites using
 a single filename like keepalive.

 Examples
 archive_keepalive_command = 'arch_cmd keepalive'   # sends a file
 called keepalive to archive, overwrites allowed
 archive_keepalive_command = 'arch_cmd %f.%t.keepalive'  # sends a file
 like 0001000ABFE.20111216143517.keepalive

 If there is no WAL file to send, then we send a keepalive file
 instead. Keepalive is a small file that contains same contents as a
 streaming keepalive message (re: other patch on that).

 If no WAL file is available and we are attempting to restore in
 standby_mode, then we execute restore_keepalive_command to see if a
 keepalive file is available. Checks for a file in the specific
 keepalive format and then uses that to update last received info from
 master.

 e.g.
 restore_keepalive_command = 'restore_cmd keepalive'   # gets a file
 called keepalive to archive, overwrites allowed

Patch.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/transam/recovery.conf.sample b/src/backend/access/transam/recovery.conf.sample
index 5acfa57..fab288c 100644
--- a/src/backend/access/transam/recovery.conf.sample
+++ b/src/backend/access/transam/recovery.conf.sample
@@ -43,6 +43,13 @@
 #
 #restore_command = ''		# e.g. 'cp /mnt/server/archivedir/%f %p'
 #
+# restore_keepalive_command
+#
+# specifies an optional shell command to download keepalive files
+#  e.g. archive_keepalive_command = 'cp -f %p $ARCHIVE/keepalive /dev/null'
+#  e.g. restore_keepalive_command = 'cp $ARCHIVE/keepalive %p'
+#
+#restore_keepalive_command = ''
 #
 # archive_cleanup_command
 #
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ce659ec..2729141 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -73,8 +73,10 @@ int			CheckPointSegments = 3;
 int			wal_keep_segments = 0;
 int			XLOGbuffers = -1;
 int			XLogArchiveTimeout = 0;
+int			XLogArchiveKeepaliveTimeout = 10;	/* XXX set to 60 before commit */
 bool		XLogArchiveMode = false;
 char	   *XLogArchiveCommand = NULL;
+char	   *XLogArchiveKeepaliveCommand = NULL;
 bool		EnableHotStandby = false;
 bool		fullPageWrites = true;
 bool		log_checkpoints = false;
@@ -188,6 +190,7 @@ static bool restoredFromArchive = false;
 
 /* options taken from recovery.conf for archive recovery */
 static char *recoveryRestoreCommand = NULL;
+static char *recoveryRestoreKeepaliveCommand = NULL;
 static char *recoveryEndCommand = NULL;
 static char *archiveCleanupCommand = NULL;
 static RecoveryTargetType recoveryTarget = RECOVERY_TARGET_UNSET;
@@ -634,6 +637,7 @@ static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static bool RestoreArchivedFile(char *path, const char *xlogfname,
 	const char *recovername, off_t expectedSize);
+static void RestoreKeepaliveFile(void);
 static void ExecuteRecoveryCommand(char *command, char *commandName,
 	   bool failOnerror);
 static void PreallocXlogFiles(XLogRecPtr endptr);
@@ -2718,7 +2722,10 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
 	  RECOVERYXLOG,
 	  XLogSegSize);
 			if (!restoredFromArchive)
+			{
+RestoreKeepaliveFile();
 return -1;
+			}
 			break;
 
 		case XLOG_FROM_PG_XLOG:
@@ -3179,6 +3186,192 @@ not_available:
 	return false;
 }
 
+static void
+RestoreKeepaliveFile(void)
+{
+	char		keepalivepath[MAXPGPATH];
+	char		keepaliveRestoreCmd[MAXPGPATH];
+	char	   *dp;
+	char	   *endp;
+	const char *sp;
+	int			rc;
+	bool		signaled;
+	struct stat stat_buf;
+
+	/* In standby mode, restore_command might not be supplied */
+	if (recoveryRestoreKeepaliveCommand == NULL)
+		return;
+
+	snprintf(keepalivepath, MAXPGPATH, XLOGDIR "/archive_status/KEEPALIVE");
+
+	/*
+	 * Make sure there is no existing file in keepalivepath
+	 */
+	if (stat(keepalivepath, &stat_buf) == 0)
+	{
+		if (unlink(keepalivepath) != 0)
+			ereport(FATAL,
+	(errcode_for_file_access(),
+	 errmsg("could not remove file \"%s\": %m",
+			keepalivepath)));
+	}
+
+	/*
+	 * construct the command to be executed
+	 */
+	dp = keepaliveRestoreCmd;
+	endp = keepaliveRestoreCmd + MAXPGPATH - 1;
+	*endp = '\0';
+
+	for (sp = recoveryRestoreKeepaliveCommand; *sp; sp++)
+	{
+		if (*sp == '%')
+		{
+			switch (sp[1])
+			{
+case 'p':
+	/* %p: relative path of target file */
+	sp++;
+	

Re: [HACKERS] Vacuum rate limit in KBps

2012-01-15 Thread Greg Smith

On 01/15/2012 04:17 PM, Heikki Linnakangas wrote:
I think it makes more sense to use the max read rate as the main knob, 
rather than write rate. That's because the max read rate is higher 
than the write rate, when you don't need to dirty pages. Or do you 
think saturating the I/O system with writes is so much bigger a 
problem than read I/O that it makes more sense to emphasize the writes?


I haven't had the I/O rate logging available for long enough to have a 
good feel for which is more important to emphasize.  I'm agnostic on 
this.  I'd have no problem accepting the argument that exposing the 
larger of the two rates--which is the read one--makes for a cleaner UI.  
Or that it is the one more like other knobs setting precedents here.


My guess is that the changed documentation will actually be a bit 
cleaner that way.  I give an example in the patch of how read and write 
rate are related, based on the ratio of the values for dirty vs. hit.  I 
wasn't perfectly happy with how that was written yet, and I think it 
could be cleaner if the read rate is the primary tunable.


We usually try to preserve that backwards-compatibility, although we 
always recommend using the pg_dump from the newer version on upgrade. 
However, you need to know the vacuum_cost_page_miss setting effective 
in the old server to do the transformation correctly (or 
vacuum_cost_page_dirty, if we use the write max rate as the main knob 
as you suggested), and we don't have access when restoring a dump.


If someone does a storage parameter change to 
autovacuum_vacuum_cost_limit but doesn't touch 
autovacuum_vacuum_cost_delay there, I think it's possible to need the 
GUC value for autovacuum_vacuum_cost_delay too, which can then refer to 
vacuum_cost_delay as well.


I don't think that tweaking these parameters, particularly at the 
storage options level, is a popular thing to do.  I've run into 
customers who made changes there while trying to navigate the complexity 
of autovacuum tuning, but not very many.  My guess as I think about that 
history is that I've ended up reverting them as often, or maybe even 
slightly more often, than I've ended up keeping them around.  It's been 
hard to do well.  And the level of PostgreSQL deployment that reaches 
that stage of tuning, where they managed to tweak those productively, 
doesn't seem likely to do a blind upgrade to me.


One of the reasons I thought now was a good time to work on this change 
is because there's already things brewing that are going to make 9.2 
break a few more things than anyone would like, all for long-term 
positive benefits.  recovery.conf and pg_stat_activity changes are the 
two of those I've been tracking the closest.


My current thinking on this is that we ought to learn a lesson from the 
8.3 casting breakage saga and provide a clear 9.2 Upgrade Migration 
Guide that goes beyond just listing what changed in the release notes.  
Aim more toward having a checklist of things to look for and tools to 
help find them.  In this case, having the migration guide include a 
query that pokes through the catalogs looking for this particular 
customization would be helpful.  It would be easy to produce a bit of 
post-9.2 upgrade SQL that converted a customization here if you started 
by running something against the existing installation.
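
To be concrete, the query could be as simple as something like this (just a 
sketch against the standard catalogs, not something from the patch):

SELECT n.nspname, c.relname, c.reloptions
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
 WHERE c.reloptions IS NOT NULL
   AND array_to_string(c.reloptions, ',') LIKE '%autovacuum_vacuum_cost_%';
-- Any rows returned carry a per-table cost_limit/cost_delay override that
-- would need converting to the new rate-based setting after the upgrade.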


And that should be a wiki page so it can be extended as new things are 
discovered, some of which will include application specific guidance.  
The rough idea in this direction I put together for 8.3 (it didn't catch 
on for later versions) is at 
http://wiki.postgresql.org/wiki/Version_History  Note how that grew to 
include tsearch2 migration info and MediaWiki specific advice at one 
point (some of that was then lost in a Planet fire, but you can get the 
idea just from the article titles).  Needing a version migration guide 
with specific details about issues like this is something I think 
PostgreSQL needs to accept as part of the release cycle.  Expecting that 
pg_upgrade can transparently handle every possible change would be 
setting a high bar to clear, higher than I think is expected by the 
database industry at large.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: add timing of buffer I/O requests

2012-01-15 Thread Greg Smith

On 01/15/2012 05:14 PM, Ants Aasma wrote:

I hope that having a tool to measure the overhead and check the sanity
of clock sources is enough to answer the worries about the potential
performance hit. We could also check that the clock source is fast
enough on start-up/when the guc is changed, but that seems a bit too
much and leaves open the question about what is fast enough.


I've been thinking along those same lines--check at startup, provide 
some guidance on the general range of what's considered fast vs. slow in 
both the code and documentation.  What I'm hoping to do here is split 
your patch in half and work on the pg_test_timing contrib utility 
first.  That part answers some overdue questions about when EXPLAIN 
ANALYZE can be expected to add a lot of overhead, which means it's 
useful all on its own.  I'd like to see that utility go into 9.2, along 
with a new documentation section covering that topic.  I'll write the 
new documentation bit.


As far as the buffer timing goes, there is a lot of low-level timing 
information I'd like to see the database collect.  To pick a second 
example with very similar mechanics, I'd like to know which queries 
spend a lot of their time waiting on locks.  The subset of time a 
statement spends waiting just for commit related things is a third.  The 
industry standard term for these is wait events, as seen in Oracle, 
MySQL, MS SQL Server, etc.  That's so standard I don't see an 
intellectual property issue with PostgreSQL using the same term.  Talk 
with a random person who is converting from Oracle to PostgreSQL, ask 
them about their performance concerns.  At least 3/4 of those 
conversations I have mention being nervous about not having wait event data.


Right now, I feel the biggest hurdle to performance tuning PostgreSQL is 
not having good enough built-in query log analysis tools.  If the 
pg_stat_statements normalization upgrade in the CF queue is commited, 
that's enough to make me bump that to solved well enough.  After 
clearing that hurdle, figuring out how to log, analyze, and manage 
storage of wait events is the next biggest missing piece.  One of my top 
goals for 9.3 was to make sure that happened.


I don't think the long-term answer for how to manage wait event data is 
to collect it as part of pg_stat_statements though.  But I don't have a 
good alternate proposal, while you've submitted a patch that actually 
does something useful right now.  I'm going to think some more about how 
to reconcile all that.  There is an intermediate point to consider as 
well, which is just committing something that adjusts the core code to 
make the buffer wait event data available.  pg_stat_statements is easy 
enough to continue work on outside of core.  I could see a path where 
that happens in parallel with adding a better core wait event 
infrastructure, just to get the initial buffer wait info into people's 
hands earlier.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] separate initdb -A options for local and host

2012-01-15 Thread Robert Haas
On Sat, Jan 14, 2012 at 5:18 PM, Peter Eisentraut pete...@gmx.net wrote:
 On lör, 2011-11-26 at 01:20 +0200, Peter Eisentraut wrote:
 I think it would be useful to have separate initdb -A options for local
 and host entries.  In 9.1, we went out of our way to separate the peer
 and ident methods, but we have moved the confusion into the initdb -A
 option, where ident sometimes means peer, and peer sometimes means
 ident.  Moreover, having separate options would allow what I think
 would be a far more common use case, namely having local peer and host
 something other than ident, such as md5.

 I'm thinking, we could keep the existing -A option, but add long options
 such as --auth-local and --auth-host, to specify more detail.

 Here is a patch that implements exactly that.

I reviewed this patch.  It looks OK to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] triggered_change_notification v3

2012-01-15 Thread Kevin Grittner
Attached is a version of a previously posted patch which has been
modified based on on-list feedback from Álvaro.
 
This is a generalized trigger function which can be used as an AFTER
EACH ROW trigger on any table which has a primary key, and will send
notifications of operations for which the trigger is attached, with a
payload indicating the table, the operation, and the primary key.  A
channel can be specified in the CREATE TRIGGER command, but will
default to tcn if omitted.
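 
Usage is meant to look roughly like this (a sketch using the names from
this patch, with an invented table):
 
CREATE EXTENSION tcn;
 
CREATE TRIGGER customers_tcn
    AFTER INSERT OR UPDATE OR DELETE ON customers
    FOR EACH ROW EXECUTE PROCEDURE triggered_change_notification();
 
LISTEN tcn;
-- Each insert, update, or delete on customers then produces a NOTIFY on
-- the tcn channel whose payload names the table, the operation, and the
-- primary key value(s) of the affected row.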
 
I had previously submitted this as a core function, but with a mature
extensions feature now present, I reworked it as a contrib extension.
 If not accepted into contrib, I can post it to PGXN, but I thought
it seemed a reasonable candidate for contrib.
 
I took a shot at creating docs similar to other contrib extensions,
but couldn't quite figure out how to get them to build.  If someone
can give me pointers on that, I'll polish that up.  I find it hard to
work on the docs in sgml without being able to build and review the
html output.
 
-Kevin




tcn-v3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SKIP LOCKED DATA

2012-01-15 Thread Tom Lane
Thomas Munro mu...@ip9.org writes:
 I am wondering out loud whether I am brave enough to try to propose
 SKIP LOCKED DATA support and would be grateful for any feedback and/or
 {en|dis}couragement.  I don't see it on the todo list, and didn't find
 signs of others working on this (did I miss something?), but there are
 examples of users asking for this feature (by various names) on the
 mailing lists.  Has the idea already been rejected, is it
 fundamentally infeasible for some glaring reason, or far too
 complicated for new players?

It sounds to me like "silently give the wrong answers".  Are you sure
there are not better, more deterministic ways to solve your problem?

(Or in other words: the fact that Oracle has it isn't enough to persuade
me it's a good idea.)

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] patch : Allow toast tables to be moved to a different tablespace

2012-01-15 Thread Jaime Casanova
On Tue, Dec 13, 2011 at 12:29 PM, Julien Tachoires jul...@gmail.com wrote:

 OK, considering that, I don't see any way to handle the case raised by Jaime 
 :(

Did you consider what Álvaro suggested? Anyway, it seems it's too late for
this commitfest.
Are you intending to resume work on this for the next cycle?
Do we consider this a useful thing?

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: 24x7 support and training

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Publish checkpoint timing and sync files summary data to pg_stat_bgwriter

2012-01-15 Thread Greg Smith
One of the most useful bits of feedback on how well checkpoint I/O is 
going is the amount of time taken to sync files to disk.  Right now the 
only way to get that is to parse the logs.  The attached patch publishes 
the most useful three bits of data you could only get from 
log_checkpoints before out to pg_stat_bgwriter.  Easiest to just show an 
example:


$ createdb pgbench
$ psql -c "select pg_stat_reset_shared('bgwriter')"
$ pgbench -i -s 10 pgbench
$ psql -c checkpoint

$ tail $PGLOG
LOG:  checkpoint complete: wrote 590 buffers (14.4%); 0 transaction log 
file(s) added, 0 removed, 0 recycled; write=0.002 s, sync=0.117 s, 
total=1.746 s; sync files=31, longest=0.104 s, average=0.003 s


$ psql -c "select write_ms,sync_ms,sync_files from pg_stat_bgwriter"
 write_ms | sync_ms | sync_files
----------+---------+------------
        2 |     117 |         31

Like the other checkpoint statistics, these are updated in one lump when 
the checkpoint finishes.
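
Once those columns exist, deriving the same per-file average the log line 
shows is a one-liner (the three new columns are the ones this patch adds; 
the rest is stock SQL):

SELECT write_ms, sync_ms, sync_files,
       round(sync_ms::numeric / nullif(sync_files, 0), 1) AS avg_sync_ms_per_file
  FROM pg_stat_bgwriter;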


This includes rough documentation and the code isn't very complicated; a 
catversion bump is required.  Some open concerns I still have about this:


-The total amount of time in the sync phase is not the same as the total 
of all sync wait times.  I have published the total elapsed sync time.  
The total of the times spent waiting on individual sync calls is also 
available at the point when the above data is collected, and it could be 
published as well.  I don't think that's necessary yet, since they are 
almost identical right now.  But any feature that staggers sync calls 
around might make that distinction important.


-The total checkpoint time is larger than the total of write+sync.  You 
can see that here:  2ms write + 117ms sync = 119ms; total checkpoint 
time is actually 1746ms though.  While I don't normally find the total 
useful from a tuning perspective, it may be sensible to publish it just 
to further reduce any perceived need for log scraping here.


-None of the debug1 level information about individual sync calls can be 
published this way usefully.  I'd like that to come out somewhere 
eventually, but really what's needed there is a performance event 
logging history collector, not pg_stats.  I'm content that this patch 
gets the big picture data out; it's sufficient to let you rule out 
checkpoint problems if they aren't happening.  If you only have to drill 
into the logs when long times start showing up in this data, that's not 
so bad to live with.


-I can't tell for sure if this is working properly when log_checkpoints 
is off.  This now collects checkpoint end time data in all cases, 
whereas before it ignored that work if log_checkpoints was off.  I think 
I will have to fake the logging in this area, make it log some things 
even when log_checkpoints is off, to confirm it gives the right 
results.  I haven't done that yet.


I would like to be able to tell people they need never turn on 
log_checkpoints if they monitor pg_stat_bgwriter instead, because that 
sets a good precedent for what direction the database is going.  It 
would be nice for pg_stat_bgwriter's format to settle down for a long 
period too.  In addition to my small list of concerns here, there seem 
to be an increasing number of write path feature patches floating around 
here lately.  That all works against this being stable.  The main 
consumer of this data I use regularly is the pg_stat_bgwriter graphs 
that Munin produces, and my expectation is that myself and/or Magnus 
will make sure it's compatible with any changes made here for 9.2.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index a12a9a2..674091c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1182,6 +1182,34 @@ postgres: replaceableuser/ replaceabledatabase/ replaceablehost/ re
 </row>
 
 <row>
+  <entry><literal><function>pg_stat_get_bgwriter_write_ms()</function></literal></entry>
+  <entry><type>bigint</type></entry>
+  <entry>
+   Total amount of time that has been spent in the part of checkpoint
+   processing where files are written to disk, in milliseconds.
+  </entry>
+ </row>
+
+
+ <row>
+  <entry><literal><function>pg_stat_get_bgwriter_sync_ms()</function></literal></entry>
+  <entry><type>bigint</type></entry>
+  <entry>
+   Total amount of time that has been spent in the part of checkpoint
+   processing where files are synchronized to disk, in milliseconds.
+  </entry>
+ </row>
+
+ <row>
+  <entry><literal><function>pg_stat_get_bgwriter_sync_files()</function></literal></entry>
+  <entry><type>bigint</type></entry>
+  <entry>
+   Total number of files that have been synchronized to disk during
+   checkpoint processing.
+  </entry>
+ </row>
+
+ <row>
   <entry><literal><function>pg_stat_get_wal_senders()</function></literal></entry>
   <entry><type>setof record</type></entry>
 

Re: [HACKERS] Publish checkpoint timing and sync files summary data to pg_stat_bgwriter

2012-01-15 Thread Greg Smith

On 01/16/2012 01:28 AM, Greg Smith wrote:
-I can't tell for sure if this is working properly when 
log_checkpoints is off.  This now collects checkpoint end time data in 
all cases, whereas before it ignored that work if log_checkpoints was off.


...and there's at least one spot I missed located already:  inside of md.c.  
I'd forgotten how many spots where timing calls are optimized out are 
floating around this code path.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Checkpoint sync pause

2012-01-15 Thread Greg Smith
Last year at this point, I submitted an increasingly complicated 
checkpoint sync spreading feature.  I wasn't able to prove any 
repeatable drop in sync time latency from those patches.  While that was 
going on, and continuing into recently, the production server that 
started all this with its sync time latency issues didn't stop having 
that problem.  Data collection continued, new patches were tried.


There was a really simple triage step Simon and I made before getting 
into the complicated ones:  just delay for a few seconds between every 
single sync call made during a checkpoint.  That approach is still 
hanging around that server's patched PostgreSQL package set, and it 
still works better than anything more complicated we've tried so far.  
The recent split of background writer and checkpointer makes that whole 
thing even easier to do without rippling out to have unexpected 
consequences.


In order to be able to tune this usefully, you need to know information 
about how many files a typical checkpoint syncs.  That could be 
available without needing log scraping using the Publish checkpoint 
timing and sync files summary data to pg_stat_bgwriter addition I just 
submitted.  People who set this new checkpoint_sync_pause value too high 
can face checkpoints running over schedule, but you can measure how bad 
your exposure is with the new view information.
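
For example, something like this against the proposed view gives a rough 
feel for that exposure (sync_files comes from the companion patch; the 3 is 
just an example pause setting in seconds):

SELECT sync_files / nullif(checkpoints_timed + checkpoints_req, 0)
           AS avg_files_per_checkpoint,
       3 * sync_files / nullif(checkpoints_timed + checkpoints_req, 0)
           AS added_seconds_per_checkpoint_at_3s_pause
  FROM pg_stat_bgwriter;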


I owe the community a lot of data to prove this is useful before I'd 
expect it to be taken seriously.  I was planning to leave this whole 
area alone until 9.3.  But since recent submissions may pull me back 
into trying various ways of rearranging the write path for 9.2, I wanted 
to have my own miniature horse in that race.  It works simply:


...
2012-01-16 02:39:01.184 EST [25052]: DEBUG:  checkpoint sync: number=34 
file=base/16385/11766 time=0.006 msec
2012-01-16 02:39:01.184 EST [25052]: DEBUG:  checkpoint sync delay: 
seconds left=3
2012-01-16 02:39:01.284 EST [25052]: DEBUG:  checkpoint sync delay: 
seconds left=2
2012-01-16 02:39:01.385 EST [25052]: DEBUG:  checkpoint sync delay: 
seconds left=1
2012-01-16 02:39:01.860 EST [25052]: DEBUG:  checkpoint sync: number=35 
file=global/12007 time=375.710 msec
2012-01-16 02:39:01.860 EST [25052]: DEBUG:  checkpoint sync delay: 
seconds left=3
2012-01-16 02:39:01.961 EST [25052]: DEBUG:  checkpoint sync delay: 
seconds left=2
2012-01-16 02:39:02.061 EST [25052]: DEBUG:  checkpoint sync delay: 
seconds left=1
2012-01-16 02:39:02.161 EST [25052]: DEBUG:  checkpoint sync: number=36 
file=base/16385/11754 time=0.008 msec
2012-01-16 02:39:02.555 EST [25052]: LOG:  checkpoint complete: wrote 
2586 buffers (63.1%); 1 transaction log file(s) added, 0 removed, 0 
recycled; write=2.422 s, sync=13.282 s, total=16.123 s; sync files=36, 
longest=1.085 s, average=0.040 s


No docs yet, really need a better guide to tuning checkpoints as they 
exist now before there's a place to attach a discussion of this to.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0b792d2..54da69a 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -142,6 +142,7 @@ static BgWriterShmemStruct *BgWriterShmem;
 int			CheckPointTimeout = 300;
 int			CheckPointWarning = 30;
 double		CheckPointCompletionTarget = 0.5;
+int			CheckPointSyncPause = 0;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
@@ -157,6 +158,8 @@ static bool am_checkpointer = false;
 
 static bool ckpt_active = false;
 
+static int checkpoint_flags = 0;
+
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
 static XLogRecPtr ckpt_start_recptr;
@@ -643,6 +646,9 @@ CheckpointWriteDelay(int flags, double progress)
 	if (!am_checkpointer)
 		return;
 
+ 	/* Cache this value for a later sync delay */
+ 	checkpoint_flags=flags;
+
 	/*
 	 * Perform the usual duties and take a nap, unless we're behind
 	 * schedule, in which case we just try to catch up as quickly as possible.
@@ -685,6 +691,72 @@ CheckpointWriteDelay(int flags, double progress)
 }
 
 /*
+ * CheckpointSyncDelay -- control rate of checkpoint sync stage
+ *
+ * This function is called after each relation sync performed by mdsync().
+ * It delays for a fixed period while still making sure to absorb
+ * incoming fsync requests.
+ * 
+ * Due to where this is called with the md layer, it's not practical
+ * for it to be directly passed the checkpoint flags.  It's expected
+ * they will have been stashed within the checkpointer's local state
+ * by a call to CheckpointWriteDelay.
+ *
+ */
+void
+CheckpointSyncDelay()
+{
+	static int	absorb_counter = WRITES_PER_ABSORB;
+ 	int			sync_delay_secs = CheckPointSyncPause;
+ 
+	/* Do nothing if checkpoint is being executed by non-checkpointer process */
+	if