Re: [HACKERS] random_page_cost vs seq_page_cost

2012-01-11 Thread Benedikt Grundmann
(replying just to you)
On 10/01/12 15:22, Greg Smith wrote:
 On 1/5/12 5:04 AM, Benedikt Grundmann wrote:
 That sort of thing is one reason why all attempts so far to set
 random_page_cost based on physical characteristics haven't gone
 anywhere useful.  The setting is sort of overloaded right now, it's a
 fuzzy mix of true random seek cost blended with some notion of cache
 percentage. Trying to bring some measurements to bear on it is a less
 effective approach than what people actually do here.  Monitor the
 profile of query execution, change the value, see what happens.  Use
 that as feedback for what direction to keep going; repeat until
 you're just spinning with no improvements.
 
Thank you very much for the reply; it is very interesting.  I'm
excited to hear that documentation in that area will improve in
9.2.  It's interesting: postgres has remarkably good documentation,
but it is a sufficiently complex system that to tune the knobs it
provides sensibly you have to understand quite a lot about
what is going on.  A colleague of mine likes to say that
"all abstractions leak", which seems very appropriate in this case.

 We are not sure if the database used to choose differently
 before the move to the new hardware and the hardware is
 performing worse for random seeks.  Or if the planner is
 now making different choices.
 
 I don't recommend ever deploying new hardware without first doing
 some low-level benchmarks to validate its performance.  Once stuff
 goes into production, you can't do that anymore.  See
 http://www.2ndquadrant.com/en/talks/ for my hardware benchmarking
 talks if you'd like some ideas on what to collect.
 
We had actually done lots of tests on the sequential read performance.
But you are right we could have done better (and I'll definitely read
through your talks).
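
A minimal sketch of the kind of random-read timing check that could sit
alongside such sequential tests (the path and constants are placeholders,
not from the talks mentioned above; results only mean something if the file
is much larger than RAM or the page cache is dropped first):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE 8192
#define NUM_READS  1000

int main(void)
{
	char		buf[BLOCK_SIZE];
	int			fd = open("/data/testfile", O_RDONLY);	/* placeholder path */
	off_t		nblocks;
	struct timespec start, end;
	double		elapsed;

	if (fd < 0) { perror("open"); return 1; }
	nblocks = lseek(fd, 0, SEEK_END) / BLOCK_SIZE;
	if (nblocks <= 0) { fprintf(stderr, "file too small\n"); return 1; }

	srandom(42);
	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < NUM_READS; i++)
	{
		/* 8 kB reads at random offsets: roughly what random_page_cost models */
		off_t		offset = (random() % nblocks) * BLOCK_SIZE;

		if (pread(fd, buf, BLOCK_SIZE, offset) != BLOCK_SIZE)
		{ perror("pread"); return 1; }
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	elapsed = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
	printf("%d random %d-byte reads: %.3f s (%.2f ms/read)\n",
		   NUM_READS, BLOCK_SIZE, elapsed, elapsed * 1000.0 / NUM_READS);
	close(fd);
	return 0;
}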

Did you see my follow-up?  Based on the feedback we did further tests,
and it is now clear that neither the hardware nor the database version
is at fault.  A different plan is chosen by both the new and the old
database version when spun up on the database as it is right now.

Our best guess is that the clusters we ran after we had moved to the
new hardware (it having more disk space and faster sequential I/O made
that possible) changed the planner's perception of how the joins will
perform in relation to each other.

Cheers,

Bene

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] random_page_cost vs seq_page_cost

2012-01-11 Thread Benedikt Grundmann
On 11/01/12 08:26, Benedikt Grundmann wrote:
 (replying just to you)
Clearly I didn't.  Sigh. Getting myself a coffee now.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sending notifications from the master to the standby

2012-01-11 Thread Simon Riggs
On Wed, Jan 11, 2012 at 4:33 AM, Tom Lane t...@sss.pgh.pa.us wrote:

 BTW ... it occurs to me to ask whether we really have a solid use-case
 for having listeners attached to slave servers.  I have personally never
 seen an application for LISTEN/NOTIFY in which the listeners were
 entirely read-only.  Even if there are one or two cases out there, it's
 not clear to me that supporting it is worth the extra complexity that
 seems to be needed.

The idea is to support external caches that re-read the data when it changes.

If we can do that from the standby then we offload from the master.

Yes, there are other applications for LISTEN/NOTIFY and we wouldn't be
able to support them all with this.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpoint writeback via sync_file_range

2012-01-11 Thread Simon Riggs
On Wed, Jan 11, 2012 at 4:38 AM, Greg Smith g...@2ndquadrant.com wrote:
 On 1/10/12 9:14 PM, Robert Haas wrote:

 Based on that, I whipped up the attached patch, which,
 if sync_file_range is available, simply iterates through everything
 that will eventually be fsync'd before beginning the write phase and
 tells the Linux kernel to put them all under write-out.


 I hadn't really thought of using it that way.  The kernel expects that when
 this is called the normal way, you're going to track exactly which segments
 you want it to sync.  And that data isn't really passed through the fsync
 absorption code yet; the list of things to fsync has already lost that level
 of detail.

 What you're doing here doesn't care though, and I hadn't considered that
 SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its
 docs.  Used this way, it's basically fsync without the wait or guarantee; it
 just tries to push what's already dirty further ahead of the write queue
 than those writes would otherwise be.

I don't think this will help at all, I think it will just make things worse.

The problem comes from hammering the fsyncs one after the other. What
this patch does is initiate all of the fsyncs at the same time, so it
will max out the disks even more because this will hit all disks all
at once.

It does open the door to various other uses, so I think this work will
be useful.


 One idea I was thinking about here was building a little hash table inside
 of the fsync absorb code, tracking how many absorb operations have happened
 for whatever the most popular relation files are.  The idea is that we might
 say use sync_file_range every time N calls for a relation have come in,
 just to keep from ever accumulating too many writes to any one file before
 trying to nudge some of it out of there. The bat that keeps hitting me in
 the head here is that right now, a single fsync might have a full 1GB of
 writes to flush out, perhaps because it extended a table and then write more
 than that to it.  And in everything but a SSD or giant SAN cache situation,
 1GB of I/O is just too much to fsync at a time without the OS choking a
 little on it.

A better idea. Seems like it should be easy enough to keep a counter.

I see some other uses around large writes also.
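
A rough sketch of what such a counter might look like (all names and
thresholds here are hypothetical, not taken from the fsync-absorb code;
it only illustrates counting absorbed requests per file and nudging
writeback every Nth one):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdbool.h>

#define ABSORB_TABLE_SIZE  256	/* number of tracked files */
#define ABSORB_NUDGE_EVERY 64	/* push writeback every N absorbed requests */

typedef struct AbsorbCounter
{
	bool		used;
	int			file_id;		/* stand-in for a RelFileNode/segment key */
	int			count;
} AbsorbCounter;

static AbsorbCounter absorb_counters[ABSORB_TABLE_SIZE];

/* Called whenever an fsync request for file_id (open as fd) is absorbed. */
static void
absorb_count_and_maybe_nudge(int file_id, int fd)
{
	unsigned	slot = (unsigned) file_id % ABSORB_TABLE_SIZE;

	/* linear probing; if the table is full we simply skip the bookkeeping */
	for (int i = 0; i < ABSORB_TABLE_SIZE; i++)
	{
		AbsorbCounter *c = &absorb_counters[(slot + i) % ABSORB_TABLE_SIZE];

		if (!c->used || c->file_id == file_id)
		{
			c->used = true;
			c->file_id = file_id;
			if (++c->count >= ABSORB_NUDGE_EVERY)
			{
				/* ask the kernel to start writing this file's dirty pages */
				sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
				c->count = 0;
			}
			return;
		}
	}
}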

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpoint writeback via sync_file_range

2012-01-11 Thread Florian Weimer
* Greg Smith:

 One idea I was thinking about here was building a little hash table
 inside of the fsync absorb code, tracking how many absorb operations
 have happened for whatever the most popular relation files are.  The
 idea is that we might say use sync_file_range every time N calls
 for a relation have come in, just to keep from ever accumulating too
 many writes to any one file before trying to nudge some of it out of
 there. The bat that keeps hitting me in the head here is that right
 now, a single fsync might have a full 1GB of writes to flush out,
 perhaps because it extended a table and then write more than that to
 it.  And in everything but a SSD or giant SAN cache situation, 1GB of
 I/O is just too much to fsync at a time without the OS choking a
 little on it.

Isn't this pretty much like tuning vm.dirty_bytes?  We generally set it
to pretty low values, and it seems to help smooth out the checkpoints.

-- 
Florian Weimer                fwei...@bfk.de
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] log messages for archive recovery progress

2012-01-11 Thread Simon Riggs
2012/1/11 Euler Taveira de Oliveira eu...@timbira.com:
 On 08-01-2012 11:59, Satoshi Nagayasu / Uptime Technologies, LLC. wrote:
 [2011-12-08 15:14:36 JST] 16758: LOG:  restored log file 
 00080046 from archive
 [2011-12-08 15:14:36 JST] 16758: LOG:  recoverying 00080046
 [2011-12-08 15:14:36 JST] 16758: LOG:  restored log file 
 00080047 from archive
 [2011-12-08 15:14:36 JST] 16758: LOG:  recoverying 00080047
 cp: cannot stat `/backups/archlog/00080048': No such file 
 or directory
 [2011-12-08 15:14:37 JST] 16758: LOG:  could not restore file 
 00080048 from archive
 [2011-12-08 15:14:37 JST] 16758: LOG:  attempting to look into pg_xlog
 [2011-12-08 15:14:37 JST] 16758: LOG:  recoverying 00080048

 What about just 'restored log file 00080048 from pg_xlog'
 instead of the last two messages? If you can't read from pg_xlog emit 'could
 not restore file 00080048 from pg_xlog'.

Yes, simple is better.

We already have a message if the file is not present, we just need one
if we switch to using pg_xlog.

Please look at this.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e65962..055ae6d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2817,6 +2817,16 @@ XLogFileRead(uint32 log, uint32 seg, int emode, TimeLineID tli,
  xlogfname);
 		set_ps_display(activitymsg, false);
 
+		/*
+		 * If we're in archive recovery and the archive runs dry,
+		 * we switch to reading from the pg_xlog directly. In that case
+		 * we want to output a message that shows what just happened.
+		 */
+		if (InArchiveRecovery && source == XLOG_FROM_PG_XLOG)
+			ereport(LOG,
+					(errmsg("restored log file \"%s\" from pg_xlog",
+							xlogfname)));
+
 		/* Track source of data in assorted state variables */
 		readSource = source;
 		XLogReceiptSource = source;
@@ -2867,10 +2877,7 @@ XLogFileReadAnyTLI(uint32 log, uint32 seg, int emode, int sources)
 		{
 			fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_ARCHIVE, true);
 			if (fd != -1)
-			{
-				elog(DEBUG1, "got WAL segment from archive");
 return fd;
-			}
 		}
 
 		if (sources & XLOG_FROM_PG_XLOG)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Confusing EXPLAIN output in case of inherited tables

2012-01-11 Thread Ashutosh Bapat
Hi,
After running the regression tests, I ran EXPLAIN on one of the queries in regression
(test create_misc) and got the following output:
regression=# explain verbose select * into table ramp from road where name
~ '.*Ramp';
                                          QUERY PLAN
--------------------------------------------------------------------------------
 Result  (cost=0.00..154.00 rows=841 width=67)
   Output: public.road.name, public.road.thepath
   ->  Append  (cost=0.00..154.00 rows=841 width=67)
 ->  Seq Scan on public.road  (cost=0.00..135.05 rows=418 width=67)
   Output: public.road.name, public.road.thepath
   Filter: (public.road.name ~ '.*Ramp'::text)
 ->  Seq Scan on public.ihighway road  (cost=0.00..14.99 rows=367
width=67)
^
   Output: public.road.name, public.road.thepath
   ^^,   ^^
   Filter: (public.road.name ~ '.*Ramp'::text)
 ^^^
 ->  Seq Scan on public.shighway road  (cost=0.00..3.96 rows=56
width=67)
   Output: public.road.name, public.road.thepath
   Filter: (public.road.name ~ '.*Ramp'::text)
(12 rows)

regression=# \d+ road
Table "public.road"
 Column  | Type | Modifiers | Storage  | Stats target | Description
-+--+---+--+--+-
 name| text |   | extended |  |
 thepath | path |   | extended |  |
Indexes:
"rix" btree (name)
Child tables: ihighway,
  shighway
Has OIDs: no

Table road has children ihighway and shighway, as seen in the \d+
output above.  The EXPLAIN output of the Seq Scan nodes on the children
has public.road as the prefix for variables.  public.road could imply the
parent table road and thus cause confusion as to what is being referred
to: the columns of the parent table or of the child table.  In the EXPLAIN
output the child tables have road as their alias (as against public.road).
The alias comes from RangeTblEntry->eref->aliasname.  It might be better
to have road as the prefix in the variable names rather than public.road.

The reason why this happens is the code in get_variable()
3865     /* Exceptions occur only if the RTE is alias-less */
3866     if (rte->alias == NULL)
3867     {
3868         if (rte->rtekind == RTE_RELATION)
3869         {
3870             /*
3871              * It's possible that use of the bare refname would find another
3872              * more-closely-nested RTE, or be ambiguous, in which case we need
3873              * to specify the schemaname to avoid these errors.
3874              */
3875             if (find_rte_by_refname(rte->eref->aliasname, context) != rte)
3876                 schemaname = get_namespace_name(get_rel_namespace(rte->relid));
3877         }

If there is no alias, we find out the schema name and later add it to the
prefix.  In the inherited-table case, we are actually creating a kind of
alias for the child tables, and thus we should not look up the schema
name and add it to the prefix.  This case has been taken care of in
get_from_clause_item(),
6505         else if (rte->rtekind == RTE_RELATION &&
6506                  strcmp(rte->eref->aliasname, get_relation_name(rte->relid)) != 0)
6507         {
6508             /*
6509              * Apparently the rel has been renamed since the rule was made.
6510              * Emit a fake alias clause so that variable references will still
6511              * work.  This is not a 100% solution but should work in most
6512              * reasonable situations.
6513              */
6514             appendStringInfo(buf, " %s",
6515                              quote_identifier(rte->eref->aliasname));
6516             gavealias = true;
6517         }

I see similar code in ExplainTargetRel()
1778     if (objectname == NULL ||
1779         strcmp(rte->eref->aliasname, objectname) != 0)
1780         appendStringInfo(es->str, " %s",
1781                          quote_identifier(rte->eref->aliasname));

Based on this, here is a patch to not add the schema name to the prefix for a
variable.

I have run make check. All except inherit.sql passed. The expected output
change is included in the patch.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Enterprise Postgres Company
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 9ad54c5..2e87183 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -3858,21 +3858,26 @@ get_variable(Var *var, int levelsup, bool showstar, deparse_context *context)
 		return NULL;
 	}
 
 	/* Identify names to use */
 	schemaname = NULL;			/* default assumptions */
 	refname = rte->eref->aliasname;
 
 	/* Exceptions occur only if the RTE is alias-less */
 	if (rte->alias == NULL)
 	{
-		if (rte->rtekind == RTE_RELATION)
+		/*
+		 * If the rel has been renamed since the rule was made, that's
+		 * 

Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables

2012-01-11 Thread Chetan Suttraway
On Wed, Jan 11, 2012 at 5:13 PM, Ashutosh Bapat 
ashutosh.ba...@enterprisedb.com wrote:

 Hi,
 After running regression, I ran EXPLAIN on one of the queries in
 regression (test create_misc) and got following output
 regression=# explain verbose select * into table ramp from road where name
 ~ '.*Ramp';
  QUERY
 PLAN

 
  Result  (cost=0.00..154.00 rows=841 width=67)
Output: public.road.name, public.road.thepath
    ->  Append  (cost=0.00..154.00 rows=841 width=67)
      ->  Seq Scan on public.road  (cost=0.00..135.05 rows=418 width=67)
Output: public.road.name, public.road.thepath
Filter: (public.road.name ~ '.*Ramp'::text)
      ->  Seq Scan on public.ihighway road  (cost=0.00..14.99 rows=367
 width=67)
 ^
Output: public.road.name, public.road.thepath
^^,   ^^
Filter: (public.road.name ~ '.*Ramp'::text)
  ^^^
      ->  Seq Scan on public.shighway road  (cost=0.00..3.96 rows=56
 width=67)
Output: public.road.name, public.road.thepath
Filter: (public.road.name ~ '.*Ramp'::text)
 (12 rows)

 regression=# \d+ road
 Table public.road
  Column  | Type | Modifiers | Storage  | Stats target | Description
 -+--+---+--+--+-
  name| text |   | extended |  |
  thepath | path |   | extended |  |
 Indexes:
 rix btree (name)
 Child tables: ihighway,
   shighway
 Has OIDs: no

 Table road has children ihighway and shighway, as seen in the \d+
 output above.  The EXPLAIN output of the Seq Scan nodes on the children
 has public.road as the prefix for variables.  public.road could imply the
 parent table road and thus cause confusion as to what is being referred
 to: the columns of the parent table or of the child table.  In the EXPLAIN
 output the child tables have road as their alias (as against public.road).
 The alias comes from RangeTblEntry->eref->aliasname.  It might be better
 to have road as the prefix in the variable names rather than public.road.

 The reason why this happens is the code in get_variable()
 3865     /* Exceptions occur only if the RTE is alias-less */
 3866     if (rte->alias == NULL)
 3867     {
 3868         if (rte->rtekind == RTE_RELATION)
 3869         {
 3870             /*
 3871              * It's possible that use of the bare refname would find another
 3872              * more-closely-nested RTE, or be ambiguous, in which case we need
 3873              * to specify the schemaname to avoid these errors.
 3874              */
 3875             if (find_rte_by_refname(rte->eref->aliasname, context) != rte)
 3876                 schemaname = get_namespace_name(get_rel_namespace(rte->relid));
 3877         }

 If there is no alias, we find out the schema name and later add it to the
 prefix. In the inherited table case, we are actually creating a kind of
 alias for the children table and thus we should not find out the schema
 name and add it to the prefix. This case has been taken care of in
 get_from_clause_item(),
 6505         else if (rte->rtekind == RTE_RELATION &&
 6506                  strcmp(rte->eref->aliasname, get_relation_name(rte->relid)) != 0)
 6507         {
 6508             /*
 6509              * Apparently the rel has been renamed since the rule was made.
 6510              * Emit a fake alias clause so that variable references will still
 6511              * work.  This is not a 100% solution but should work in most
 6512              * reasonable situations.
 6513              */
 6514             appendStringInfo(buf, " %s",
 6515                              quote_identifier(rte->eref->aliasname));
 6516             gavealias = true;
 6517         }

 I see similar code in ExplainTargetRel()
 1778     if (objectname == NULL ||
 1779         strcmp(rte->eref->aliasname, objectname) != 0)
 1780         appendStringInfo(es->str, " %s",
 1781                          quote_identifier(rte->eref->aliasname));

 Based on this, here is patch to not add schemaname in the prefix for a
 variable.

 I have run make check. All except inherit.sql passed. The expected output
 change is included in the patch.

 --
 Best Wishes,
 Ashutosh Bapat
 EntepriseDB Corporation
 The Enterprise Postgres Company



 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


A table can inherit from one or more parent tables. So in that case,
qualifying the schema/table name
helps in finding out where the column is coming from.

Regards,
Chetan

-- 
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

 Website: www.enterprisedb.com

Re: [HACKERS] [WIP] Double-write with Fast Checksums

2012-01-11 Thread Heikki Linnakangas

On 10.01.2012 23:43, David Fetter wrote:

Please find attached a new revision of the double-write patch.  While
this one still uses the checksums from VMware, it's been
forward-ported to 9.2.

I'd like to hold off on merging Simon's checksum patch into this one
for now because there may be some independent issues.


Could you write this patch so that it doesn't depend on any of the 
checksum patches, please? That would make the patch smaller and easier 
to review, and it would allow benchmarking the performance impact of 
double-writes vs full page writes independent of checksums.


At the moment, double-writes are done in one batch, fsyncing the 
double-write area first and the data files immediately after that. 
That's probably beneficial if you have a BBU, and/or a fairly large 
shared_buffers setting, so that pages don't get swapped between OS and 
PostgreSQL cache too much. But when those assumptions don't hold, it 
would be interesting to treat the double-write buffers more like a 2nd 
WAL for full-page images. Whenever a dirty page is evicted from 
shared_buffers, write it to the double-write area, but don't fsync it or 
write it back to the data file yet. Instead, let it sit in the 
double-write area, and grow the double-write file(s) as necessary, until 
the next checkpoint comes along.


In general, I must say that I'm pretty horrified by all these extra 
fsync's this introduces. You really need a BBU to absorb them, and even 
then, you're fsyncing data files to disk much more frequently than you 
otherwise would.


Jignesh mentioned having run some performance tests with this. I would 
like to see those results, and some analysis and benchmarks of how 
settings like shared_buffers and the presence of BBU affect this, 
compared to full_page_writes=on and off.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Double-write with Fast Checksums

2012-01-11 Thread Simon Riggs
On Wed, Jan 11, 2012 at 12:13 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 At the moment, double-writes are done in one batch, fsyncing the
 double-write area first and the data files immediately after that. That's
 probably beneficial if you have a BBU, and/or a fairly large shared_buffers
 setting, so that pages don't get swapped between OS and PostgreSQL cache too
 much. But when those assumptions don't hold, it would be interesting to
 treat the double-write buffers more like a 2nd WAL for full-page images.
 Whenever a dirty page is evicted from shared_buffers, write it to the
 double-write area, but don't fsync it or write it back to the data file yet.
 Instead, let it sit in the double-write area, and grow the double-write
 file(s) as necessary, until the next checkpoint comes along.

 In general, I must say that I'm pretty horrified by all these extra fsync's
 this introduces. You really need a BBU to absorb them, and even then, you're
 fsyncing data files to disk much more frequently than you otherwise would.

Agreed. Almost exactly the design I've been mulling over while waiting
for the patch to get tidied up.

Interestingly, you use the term "double write buffer", which is a
concept that doesn't exist in the patch, and should.

You don't say it, but presumably the bgwriter would flush double write
buffers as needed. Perhaps the checkpointer could do that when not, so
we wouldn't need to send as many fsync messages.

Bottom line is that an increased number of fsyncs on main data files
will throw the balance of performance out, so other performance tuning
will go awry.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpoint writeback via sync_file_range

2012-01-11 Thread Simon Riggs
On Wed, Jan 11, 2012 at 9:28 AM, Simon Riggs si...@2ndquadrant.com wrote:

 It does open the door to various other uses, so I think this work will
 be useful.

Yes, I think this would allow a better design for the checkpointer.

Checkpoint scan will collect buffers to write for checkpoint and sort
them by fileid, like Koichi/Itagaki already suggested.

We then do all the writes for a particular file, then issue a
background sync_file_range, then sleep a little. Loop. At end of loop,
collect up and close the sync_file_range calls with a
SYNC_FILE_RANGE_WAIT_AFTER.

So we're interleaving the writes and fsyncs throughout the whole
checkpoint, not bursting the fsyncs at the end.

With that design we would just have a continuous checkpoint, rather
than having 0.5 or 0.9.
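
As a rough sketch of that interleaving (the buffer-writing helper, the
handles and the pause are all hypothetical; sync_file_range gives no
durability guarantee, so a final fsync per file would still be needed
before the checkpoint can complete):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

extern void write_dirty_buffers_for_file(int fd);	/* hypothetical helper */

void
checkpoint_write_interleaved(int *fds, int nfiles)
{
	/* Pass 1: write each file's dirty buffers, then start writeback on it. */
	for (int i = 0; i < nfiles; i++)
	{
		write_dirty_buffers_for_file(fds[i]);
		sync_file_range(fds[i], 0, 0, SYNC_FILE_RANGE_WRITE);
		usleep(100 * 1000);		/* brief pause before the next file */
	}

	/* Pass 2: collect up, waiting for the writeback started above. */
	for (int i = 0; i < nfiles; i++)
		sync_file_range(fds[i], 0, 0,
						SYNC_FILE_RANGE_WAIT_BEFORE |
						SYNC_FILE_RANGE_WRITE |
						SYNC_FILE_RANGE_WAIT_AFTER);
}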

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpoint writeback via sync_file_range

2012-01-11 Thread Andres Freund
On Wednesday, January 11, 2012 03:14:31 AM Robert Haas wrote:
 Greg Smith muttered a while ago about wanting to do something with
 sync_file_range to improve checkpoint behavior on Linux.  I thought he
 was talking about trying to sync only the range of blocks known to be
 dirty, which didn't seem like a very exciting idea, but after looking
 at the man page for sync_file_range, I think I understand what he was
 really going for: sync_file_range allows you to hint the Linux kernel
 that you'd like it to clean a certain set of pages.  I further recall
 from Greg's previous comments that in the scenarios he's seen,
 checkpoint I/O spikes are caused not so much by the data written out
 by the checkpoint itself but from the other dirty data in the kernel
 buffer cache.  Based on that, I whipped up the attached patch, which,
 if sync_file_range is available, simply iterates through everything
 that will eventually be fsync'd before beginning the write phase and
 tells the Linux kernel to put them all under write-out.
I played around with this before and my problem was that sync_file_range is not 
really a hint. It actually starts writeback *directly* and only returns when 
the I/O is placed inside the queue (at least that's the way it was back then). 
Which very quickly leads to it blocking all the time...
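
For reference, the call under discussion in its simplest standalone form
(the path is a placeholder; offset 0 and nbytes 0 cover the whole file, and
SYNC_FILE_RANGE_WRITE starts writeback without waiting for completion and
without any durability guarantee):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int main(void)
{
	/* placeholder path; any writable file with dirty pages will do */
	int			fd = open("/tmp/somefile", O_RDWR);

	if (fd < 0) { perror("open"); return 1; }

	/* offset 0, nbytes 0 = the whole file; start writeback, don't wait */
	if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE) != 0)
		perror("sync_file_range");

	/*
	 * This only schedules writeback (and can block while the requests are
	 * queued); it is not an fsync and gives no durability guarantee.
	 */
	return 0;
}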

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpoint writeback via sync_file_range

2012-01-11 Thread Andres Freund
On Wednesday, January 11, 2012 10:28:11 AM Simon Riggs wrote:
 On Wed, Jan 11, 2012 at 4:38 AM, Greg Smith g...@2ndquadrant.com wrote:
  On 1/10/12 9:14 PM, Robert Haas wrote:
  Based on that, I whipped up the attached patch, which,
  if sync_file_range is available, simply iterates through everything
  that will eventually be fsync'd before beginning the write phase and
  tells the Linux kernel to put them all under write-out.
  
  I hadn't really thought of using it that way.  The kernel expects that
  when this is called the normal way, you're going to track exactly which
  segments you want it to sync.  And that data isn't really passed through
  the fsync absorption code yet; the list of things to fsync has already
  lost that level of detail.
  
  What you're doing here doesn't care though, and I hadn't considered that
  SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its
  docs.  Used this way, it's basically fsync without the wait or guarantee;
  it just tries to push what's already dirty further ahead of the write
  queue than those writes would otherwise be.
 
 I don't think this will help at all, I think it will just make things
 worse.
 
 The problem comes from hammering the fsyncs one after the other. What
 this patch does is initiate all of the fsyncs at the same time, so it
 will max out the disks even more because this will hit all disks all
 at once.
The advantage of using sync_file_range that way is that it starts writeout but 
doesn't cause queue drains/barriers/whatever to be issued, which can be quite a 
significant speed gain. In theory.


Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpoint writeback via sync_file_range

2012-01-11 Thread Andres Freund
On Wednesday, January 11, 2012 10:33:47 AM Florian Weimer wrote:
 * Greg Smith:
  One idea I was thinking about here was building a little hash table
  inside of the fsync absorb code, tracking how many absorb operations
  have happened for whatever the most popular relation files are.  The
  idea is that we might say use sync_file_range every time N calls
  for a relation have come in, just to keep from ever accumulating too
  many writes to any one file before trying to nudge some of it out of
  there. The bat that keeps hitting me in the head here is that right
  now, a single fsync might have a full 1GB of writes to flush out,
  perhaps because it extended a table and then write more than that to
  it.  And in everything but a SSD or giant SAN cache situation, 1GB of
  I/O is just too much to fsync at a time without the OS choking a
  little on it.
 
 Isn't this pretty much like tuning vm.dirty_bytes?  We generally set it
 to pretty low values, and seems to help to smoothen the checkpoints.
If done correctly (in a much more invasive way) you could issue sync_file_range 
only for the areas of the file where checkpointing needs to happen and you could 
leave out e.g. hint-bit-only changes. Which could help to reduce the cost of 
checkpoints.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-11 Thread Andrew Dunstan



On 01/11/2012 01:18 AM, Pavel Stehule wrote:


I like this patch and this feature.


I'm about to read the patch in detail - I certainly like the feature.



I see only one issue - there is not functionality that helps generate
JSON in pg.

What do you think about functions: array_to_json(anyarray),
row_to_json(any) and format_json(text, text, ...)



Actually, more than these, I (and at least one very interested client) 
want query_to_json, which would do something like:


   # select q2json('select $$a$$ || x as b, y as c from generate_series(1,3) x, generate_series(4,5) y');
                                                   q2json
   ---------------------------------------------------------------------------------------------------------
    [{"b":"a1","c":4},{"b":"a1","c":5},{"b":"a2","c":4},{"b":"a2","c":5},{"b":"a3","c":4},{"b":"a3","c":5}]


No doubt several variants are possible such as returning a setof json, 
one per row, instead of a single json, and allowing query parameters as 
separate arguments (maybe just using variadic functions), but probably 
for a first go just something as simple as this would meet the case.


Given the short time span available before patches must be in, I am 
prepared to work on this ASAP.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-11 Thread Pavel Stehule
2012/1/11 Andrew Dunstan and...@dunslane.net:


 On 01/11/2012 01:18 AM, Pavel Stehule wrote:


 I like this patch and this feature.


 I'm about to read the patch in detail - I certainly like the feature.



 I see only one issue - there is not functionality that helps generate
 JSON in pg.

 What do you think about functions: array_to_json(anyarray),
 row_to_json(any) and format_json(text, text, ...)


 Actually, more than these, I (and at least one very interested client) want
 query_to_json, which would do something like:

   # select q2json('select $$a$$ || x as b, y as c from generate_series(1,3)
 x, generate_series(4,5) y');
                                                     q2json
 ---------------------------------------------------------------------------------------------------------
 [{"b":"a1","c":4},{"b":"a1","c":5},{"b":"a2","c":4},{"b":"a2","c":5},{"b":"a3","c":4},{"b":"a3","c":5}]


we have a query_to_xml, so there should be a similar query_to_json. But
this is not enough for usage from stored procedures. What about two rich functions:

* query_to_json - by your proposal
* array_to_json - with possibility to serialize array of records

This can be a basic set

Regards

Pavel





 No doubt several variants are possible such as returning a setof json, one
 per row, instead of a single json, and allowing query parameters as separate
 arguments (maybe just using variadic functions), but probably for a first go
 just something as simple as this would meet the case.



 Given the short time span available before patches must be in, I am prepared
 to work on this ASAP.

 cheers

 andrew


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-11 Thread Robert Haas
On Wed, Jan 11, 2012 at 1:18 AM, Pavel Stehule pavel.steh...@gmail.com wrote:
 2012/1/11 Robert Haas robertmh...@gmail.com:
 On Tue, Dec 20, 2011 at 9:06 PM, David E. Wheeler da...@kineticode.com 
 wrote:
 On Dec 20, 2011, at 10:39 AM, Claes Jakobsson wrote:
 Are people explicitly asking for a) *JSON* datatype or b) a type that lets 
 you store arbitrary complex semi-untyped data structures?

 Yes.

 if b) then this might get a lot more interesting

 JSON is the most popular/likely way to represent that, I think.

 On that note, here's an updated version of the patch I posted
 upthread, with some regression tests and minimal documentation.

 I like this patch and this feature.

 I see only one issue - there is not functionality that helps generate
 JSON in pg.

 What do you think about functions: array_to_json(anyarray),
 row_to_json(any) and format_json(text, text, ...)

I think we might want all of that stuff, but I doubt there is time to
do it for 9.2.

Actually, I think the next logical step would be to define equality
(is there an official definition of that for JSON?) and build a btree
opclass.  I believe the code I've already written could be extended to
construct an abstract syntax tree for those operations that need it.
But we need to make some decisions first.  A btree opclass requires a
total ordering, so we have to arbitrarily define whether 1 < true, 1 <
[1], 1 < "1", etc.
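
As a toy illustration of how arbitrary such an ordering has to be (none of
this is in the patch; the value representation and the rank order are made
up), a comparator could simply rank value types first and only compare
within a type:

#include <string.h>

/* Hypothetical JSON scalar representation, purely for illustration. */
typedef enum { JV_NULL, JV_FALSE, JV_TRUE, JV_NUMBER, JV_STRING } JsonValType;

typedef struct JsonVal
{
	JsonValType type;
	double		num;			/* valid when type == JV_NUMBER */
	const char *str;			/* valid when type == JV_STRING */
} JsonVal;

/*
 * Total order: values of different types are ordered by the (arbitrary)
 * enum rank above, e.g. every boolean sorts before every number and every
 * number before every string.  Arrays and objects are omitted, and
 * strcmp() stands in for whatever collation rule would actually be chosen.
 */
static int
jsonval_cmp(const JsonVal *a, const JsonVal *b)
{
	if (a->type != b->type)
		return (a->type < b->type) ? -1 : 1;

	switch (a->type)
	{
		case JV_NUMBER:
			return (a->num < b->num) ? -1 : (a->num > b->num) ? 1 : 0;
		case JV_STRING:
			return strcmp(a->str, b->str);
		default:
			return 0;			/* null, false, true only equal themselves */
	}
}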

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-11 Thread Pavel Stehule
2012/1/11 Robert Haas robertmh...@gmail.com:
 On Wed, Jan 11, 2012 at 1:18 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 2012/1/11 Robert Haas robertmh...@gmail.com:
 On Tue, Dec 20, 2011 at 9:06 PM, David E. Wheeler da...@kineticode.com 
 wrote:
 On Dec 20, 2011, at 10:39 AM, Claes Jakobsson wrote:
 Are people explicitly asking for a) *JSON* datatype or b) a type that 
 lets you store arbitrary complex semi-untyped data structures?

 Yes.

 if b) then this might get a lot more interesting

 JSON is the most popular/likely way to represent that, I think.

 On that note, here's an updated version of the patch I posted
 upthread, with some regression tests and minimal documentation.

 I like this patch and this feature.

 I see only one issue - there is not functionality that helps generate
 JSON in pg.

 What do you think about functions: array_to_json(anyarray),
 row_to_json(any) and format_json(text, text, ...)

 I think we might want all of that stuff, but I doubt there is time to
 do it for 9.2.

 Actually, I think the next logical step would be to define equality
 (is there an official definition of that for JSON?) and build a btree
 opclass.  I believe the code I've already written could be extended to
 construct an abstract syntax tree for those operations that need it.
 But we need to make some decisions first.  A btree opclass requires a
 total ordering, so we have to arbitrarily define whether 1 < true, 1 <
 [1], 1 < "1", etc.


I don't understand why we have to do it?

We don't support similar functionality for XML, so why for JSON?

Pavel

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpoint writeback via sync_file_range

2012-01-11 Thread Robert Haas
On Tue, Jan 10, 2012 at 11:38 PM, Greg Smith g...@2ndquadrant.com wrote:
 What you're doing here doesn't care though, and I hadn't considered that
 SYNC_FILE_RANGE_WRITE could be used that way on my last pass through its
 docs.  Used this way, it's basically fsync without the wait or guarantee; it
 just tries to push what's already dirty further ahead of the write queue
 than those writes would otherwise be.

Well, my goal was to make sure they got into the write queue rather
than just sitting in memory while the kernel twiddles its thumbs.  My
hope is that the kernel is smart enough that, when you put something
under write-out, the kernel writes it out as quickly as it can without
causing too much degradation in foreground activity.  If that turns
out to be an incorrect assumption, we'll need a different approach,
but I thought it might be worth trying something simple first and
seeing what happens.

 One idea I was thinking about here was building a little hash table inside
 of the fsync absorb code, tracking how many absorb operations have happened
 for whatever the most popular relation files are.  The idea is that we might
 say use sync_file_range every time N calls for a relation have come in,
 just to keep from ever accumulating too many writes to any one file before
 trying to nudge some of it out of there. The bat that keeps hitting me in
 the head here is that right now, a single fsync might have a full 1GB of
 writes to flush out, perhaps because it extended a table and then write more
 than that to it.  And in everything but a SSD or giant SAN cache situation,
 1GB of I/O is just too much to fsync at a time without the OS choking a
 little on it.

That's not a bad idea, but there's definitely some potential down
side: you might end up reducing write-combining quite significantly if
you keep pushing things out to files when it isn't really needed yet.
I was aiming to only push things out when we're 100% sure that they're
going to have to be fsync'd, and certainly any already-written buffers
that are in the OS cache at the start of a checkpoint fall into that
category.  That having been said, experimental evidence is king.

 I'll put this into my testing queue after the upcoming CF starts.

Thanks!

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] log messages for archive recovery progress

2012-01-11 Thread Satoshi Nagayasu

2012/01/11 19:56, Simon Riggs wrote:

2012/1/11 Euler Taveira de Oliveira eu...@timbira.com:

On 08-01-2012 11:59, Satoshi Nagayasu / Uptime Technologies, LLC. wrote:

[2011-12-08 15:14:36 JST] 16758: LOG:  restored log file 
00080046 from archive
[2011-12-08 15:14:36 JST] 16758: LOG:  recoverying 00080046
[2011-12-08 15:14:36 JST] 16758: LOG:  restored log file 
00080047 from archive
[2011-12-08 15:14:36 JST] 16758: LOG:  recoverying 00080047
cp: cannot stat `/backups/archlog/00080048': No such file or 
directory
[2011-12-08 15:14:37 JST] 16758: LOG:  could not restore file 
00080048 from archive
[2011-12-08 15:14:37 JST] 16758: LOG:  attempting to look into pg_xlog
[2011-12-08 15:14:37 JST] 16758: LOG:  recoverying 00080048



What about just 'restored log file 00080048 from pg_xlog'
instead of the last two messages? If you can't read from pg_xlog emit 'could
not restore file 00080048 from pg_xlog'.


Yes, simple is better.

We already have a message if the file is not present, we just need one
if we switch to using pg_xlog.

Please look at this.


Thanks for the patch. I agree that simple is better.

However, I'm a bit afraid that it will confuse DBAs if we use
"restored" in the pg_xlog replay context, because we have
already used "restored" to mean that a WAL file was successfully
copied (not replayed) from the archive directory into the pg_xlog
directory during archive recovery.

So, to describe the status of copying WAL files from the
archive directory, I think we can use "restored", or
"could not restore" on failure.

And to describe the status of replaying WAL files
in the pg_xlog directory (even if a WAL file was copied from the
archive), we should use "recover" or "replay".

I'll try to revise my proposed log messages again.

Thanks,
--
Satoshi Nagayasu sn...@uptime.jp
Uptime Technologies, LLC. http://www.uptime.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-11 Thread Robert Haas
On Wed, Jan 11, 2012 at 8:38 AM, Pavel Stehule pavel.steh...@gmail.com wrote:
 I don't understand why we have to do it?

 We don't support similar functionality for XML, so why for JSON?

Hrm.  Well, that's an interesting point.  Maybe we don't.  I assumed
that people would eventually want to optimize queries of the form
SELECT whatever FROM tab WHERE jsoncol = 'constant'.  If that's a
sufficiently marginal use case that we don't care, then fine.

One difference between JSON and XML is that XML really has no
well-defined comparison semantics.  For example, consider:

<foo><bar>1.0</bar></foo>
<foo><bar>1.0</bar> </foo>

If the XML is being used as a transport mechanism, then the extra
space is semantically insignificant, but if this is markup, then it
might matter a lot.  Also, consider:

<foo><bar>1.00</bar></foo>

That one might be equal if we think 1.0 is intended to be a number,
but if it's intended as a string then it's not.  We could perhaps do
comparisons in XML relative to some DTD or schema if those provide
details about what the values mean, but in a vacuum it's not
well-defined.  On the other hand, in JSON, it's pretty clear that { 1,
2, 3 } is the same value as {1,2,3} but "1,2,3" is different from "1,
2, 3".  There are some borderline cases that might need some sweat,
like whether 1 = 1.0 = 1.00 = 1e0, but in general the level of
ambiguity seems to me to be much less, making it more feasible here
than it would be for XML.

That having been said, uncertainties about whether we want this at all
(and if so in what form) are exactly why I didn't include this kind of
stuff in the patch to begin with, and I think that if we get this much
committed for 9.2 we'll be doing pretty well.  If we can agree on and
do more, great; if not, we'll at least have this much, which IMHO
would be an improvement over what we have now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.

2012-01-11 Thread Fujii Masao
On Sat, Dec 31, 2011 at 10:34 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Send new protocol keepalive messages to standby servers.
 Allows streaming replication users to calculate transfer latency
 and apply delay via internal functions. No external functions yet.

pq_flush_if_writable() needs to be called just after
WalSndKeepalive(). Otherwise, the
keepalive packet is not sent for a while.

+static void
+ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)

walEnd is not used in ProcessWalSndrMessage() at all. Can't we remove it?
If yes, walEnd field in WalSndrMessage is also not used anywhere, so ISTM
we can remove it.

+   elog(DEBUG2, "sendtime %s receipttime %s replication apply delay %d transfer latency %d",
+   timestamptz_to_str(sendTime),
+   timestamptz_to_str(lastMsgReceiptTime),
+   GetReplicationApplyDelay(),
+   GetReplicationTransferLatency());

The unit of replication apply delay and transfer latency should be in
the log message.

GetReplicationApplyDelay() and GetReplicationTransferLatency() are called
whenever the standby receives the message from the master. Which might
degrade the performance of replication a bit. So we should skip the above elog
when log_message = DEBUG2?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpoint writeback via sync_file_range

2012-01-11 Thread Greg Smith

On 1/11/12 4:33 AM, Florian Weimer wrote:

Isn't this pretty much like tuning vm.dirty_bytes?  We generally set it
to pretty low values, and seems to help to smoothen the checkpoints.


When I experimented with dropping the actual size of the cache, 
checkpoint spikes improved, but things like VACUUM ran terribly slow. 
On a typical medium to large server nowadays (let's say 16GB+), 
PostgreSQL needs to have gigabytes of write cache for good performance.


What we're aiming to here is keep the benefits of having that much write 
cache, while allowing checkpoint related work to send increasingly 
strong suggestions about ordering what it needs written soon.  There's 
basically three primary states on Linux to be concerned about here:


Dirty:  in the cache via standard write
|
v  pdflush does writeback at 5 or 10% dirty || sync_file_range push
|
Writeback
|
v  write happens in the background || fsync call
|
Stored on disk

The systems with bad checkpoint problems will typically have gigabytes 
Dirty, which is necessary for good performance.  It's very lazy about 
pushing things toward Writeback though.  Getting the oldest portions 
of the outstanding writes into the Writeback queue more aggressively 
should make the eventual fsync less likely to block.
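
For watching those two states while testing, a small helper that just echoes
the Dirty and Writeback lines from /proc/meminfo (the field names are the
standard Linux ones):

#include <stdio.h>
#include <string.h>

/* Print the current Dirty and Writeback totals from /proc/meminfo. */
int main(void)
{
	char		line[256];
	FILE	   *f = fopen("/proc/meminfo", "r");

	if (f == NULL) { perror("/proc/meminfo"); return 1; }

	while (fgets(line, sizeof(line), f) != NULL)
	{
		if (strncmp(line, "Dirty:", 6) == 0 ||
			strncmp(line, "Writeback:", 10) == 0)
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}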



--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpoint writeback via sync_file_range

2012-01-11 Thread Greg Smith

On 1/11/12 7:46 AM, Andres Freund wrote:

I played around with this before and my problem was that sync_file_range is not
really a hint. It actually starts writeback *directly* and only returns when
the io is placed inside the queue (at least thats the way it was back then).
Which very quickly leads to it blocking all the time...


Right, you're answering one of Robert's questions here:  yes, once 
something is pushed toward writeback, it moves toward an actual write 
extremely fast.  And the writeback queue can fill itself.  But we don't 
really care if this blocks.  There's a checkpointer process, it will be 
doing this work, and it has no other responsibilities anymore (as of 
9.2, which is why some of these approaches suddenly become practical). 
It's going to get blocked waiting for things sometimes, the way it 
already does rarely when it writes, and often when it calls fsync.



--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpoint writeback via sync_file_range

2012-01-11 Thread Andres Freund
On Wednesday, January 11, 2012 03:20:09 PM Greg Smith wrote:
 On 1/11/12 7:46 AM, Andres Freund wrote:
  I played around with this before and my problem was that sync_file_range
  is not really a hint. It actually starts writeback *directly* and only
  returns when the io is placed inside the queue (at least thats the way
  it was back then). Which very quickly leads to it blocking all the
  time...
 
 Right, you're answering one of Robert's questions here:  yes, once
 something is pushed toward writeback, it moves toward an actual write
 extremely fast.  And the writeback queue can fill itself.  But we don't
 really care if this blocks.  There's a checkpointer process, it will be
 doing this work, and it has no other responsibilities anymore (as of
 9.2, which is why some of these approaches suddenly become practical).
 It's going to get blocked waiting for things sometimes, the way it
 already does rarely when it writes, and often when it call fsync.
We do care, imo. The heavy pressure of putting it directly in the writeback queue 
leads to less efficient I/O because quite often it won't reorder sensibly with 
other I/O anymore and the like. At least that was my experience using it 
in another application.
Lots of that changed with Linux 3.2 (a near-complete rewrite of the writeback 
mechanism), so a bit of that might be moot anyway.

I definitely agree that 9.2 opens new possibilities there.

Andres

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] log messages for archive recovery progress

2012-01-11 Thread Simon Riggs
On Wed, Jan 11, 2012 at 1:54 PM, Satoshi Nagayasu sn...@uptime.jp wrote:

 However, I'm a bit afraid that it will confuse DBA if we use
 restored under the pg_xlog replay context, because we have
 already used restored that means a WAL file as successfully
 copied (not replayed) from archive directory into pg_xlog
 directory under the archive recovery context.

 So, to determine the status of copying WAL files from
 archive directory, I think we can use restored, or
 could not restore on failure.

 And to determine the status of replaying WAL files
 in pg_xlog directory (even if a WAL is copied from archive),
 we have to use recover or replay.

Agreed. I can change "restored" to "using", so we have two message types:

LOG:  restored log file 00080047 from archive
LOG:  using pre-existing log file 00080047 from pg_xlog

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-11 Thread Pavel Stehule
2012/1/11 Robert Haas robertmh...@gmail.com:
 On Wed, Jan 11, 2012 at 8:38 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 I don't understand why we have to do it?

 We don't support similar functionality for XML, so why for JSON?

 Hrm.  Well, that's an interesting point.  Maybe we don't.  I assumed
 that people would eventually want to optimize queries of the form
 SELECT whatever FROM tab WHERE jsoncol = 'constant'.  If that's a
 sufficiently marginal use case that we don't care, then fine.

 One difference between JSON and XML is that XML really has no
 well-defined comparison semantics.  For example, consider:

 <foo><bar>1.0</bar></foo>
 <foo><bar>1.0</bar> </foo>

 If the XML is being used as a transport mechanism, then the extra
 space is semantically insignificant, but if this is markup, then it
 might matter a lot.  Also, consider:

 <foo><bar>1.00</bar></foo>

 That one might be equal if we think 1.0 is intended to be a number,
 but if it's intended as a string then it's not.  We could perhaps do
 comparisons in XML relative to some DTD or schema if those provide
 details about what the values mean, but in a vacuum it's not
 well-defined.  On the other hand, in JSON, it's pretty clear that { 1,
 2, 3 } is the same value as {1,2,3} but "1,2,3" is different from "1,
 2, 3".  There are some borderline cases that might need some sweat,
 like whether 1 = 1.0 = 1.00 = 1e0, but in general the level of
 ambiguity seems to me to be much less, making it more feasible here
 than it would be for XML.

 That having been said, uncertainties about whether we want this at all
 (and if so in what form) are exactly why I didn't include this kind of
 stuff in the patch to begin with, and I think that if we get this much
 committed for 9.2 we'll be doing pretty well.  If we can agree on and
 do more, great; if not, we'll at least have this much, which IMHO
 would be an improvement over what we have now.


I understand it now. My opinion is that some operators and index search
can be in 9.2 - so use JSON just as a communication format now.

* we need to build JSON
* we need to check if something is valid JSON
* we need to store JSON

other steps should be (9.2):
* basic operators eq, neq
* some tool like XQuery - a simple query on a JSON document, available from
SQL, that can be used for functional indexes.




 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Double-write with Fast Checksums

2012-01-11 Thread Aidan Van Dyk
On Wed, Jan 11, 2012 at 7:13 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

 At the moment, double-writes are done in one batch, fsyncing the
 double-write area first and the data files immediately after that. That's
 probably beneficial if you have a BBU, and/or a fairly large shared_buffers
 setting, so that pages don't get swapped between OS and PostgreSQL cache too
 much. But when those assumptions don't hold, it would be interesting to
 treat the double-write buffers more like a 2nd WAL for full-page images.
 Whenever a dirty page is evicted from shared_buffers, write it to the
 double-write area, but don't fsync it or write it back to the data file yet.
 Instead, let it sit in the double-write area, and grow the double-write
 file(s) as necessary, until the next checkpoint comes along.

Ok, but for correctness, you need to *fsync* the double-write buffer
(WAL) before you can issue the write on the normal datafile at all.

All the double write can do is move the FPW from the WAL stream (done
at commit time) to some other double buffer space (which can be done
at write time).

It still has to fsync the write-ahead part of the double write
before it can write any of the normal part, or you leave the
torn-page possibility.

And you still need to keep all the write-ahead part of the
double-write around until all the normal writes have been fsynced
(checkpoint time) so you can redo them all on crash recovery.
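
In pseudocode, the ordering constraint looks roughly like this (every name
below is hypothetical; only the sequence of steps reflects what is described
above):

typedef struct Buffer Buffer;					/* opaque buffer handle */
extern void dw_append_page(Buffer *buf);		/* hypothetical helpers */
extern void dw_fsync(void);
extern void write_buffer_to_datafile(Buffer *buf);

void
evict_dirty_buffer(Buffer *buf)
{
	/* 1. Copy the full page image into the double-write area. */
	dw_append_page(buf);

	/*
	 * 2. fsync the double-write area *before* touching the data file;
	 *    otherwise a crash in the middle of the data-file write could
	 *    leave a torn page with no intact copy anywhere.
	 */
	dw_fsync();

	/* 3. Only now is the normal write to the data file safe. */
	write_buffer_to_datafile(buf);

	/*
	 * 4. The copy in the double-write area must be kept until the data
	 *    file has been fsync'd (at checkpoint), so it can be replayed
	 *    after a crash.
	 */
}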

So, I think that the work in double-writes has merit, but if it's done
correctly, it isn't this magic bullet that suddenly gives us atomic,
durable writes for free.

It has major advantages (including, but not limited too)
1) Moving the FPW out of normal WAL/commit processing
2) Allowing fine control of (possibly seperate) FPW locations on a per
tablespace/relation basis

It does this by moving the FPW/IO penalty from the commit time of a
backend dirtying the buffer first, to the eviction time of a backend
evicting a dirty buffer.  And if you're lucky enough that the
background writer is the only one writing dirty buffers, you'll see
lots of improvements in your performance (equivalent to running with
current FPW off).  But I have a feeling that many of us see backends
having to write dirty buffers often enough too that the reduction in
commit/WAL latency will be offset (hopefully not as much) by increased
query processing time as backends double-write dirty buffers.


a.



-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-11 Thread Robert Haas
On Wed, Jan 11, 2012 at 9:41 AM, Pavel Stehule pavel.steh...@gmail.com wrote:
 I understand it now. My opinion is so some operators and index search
 can be in 9.2 - so use a JSON just as communication format now.

 * we need to build JSON
 * we need to check if some is valid JSON
 * we need to store JSON

 other steps should be (9.2)
 * basic operators eq, neeq
 * some tool like XQuery - simple query on JSON document available from
 SQL that can be used for functional indexes.

That would be nice, but let's not let the perfect be the enemy of the
good.  We don't have a lot of time here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Double-write with Fast Checksums

2012-01-11 Thread Robert Haas
On Wed, Jan 11, 2012 at 9:47 AM, Aidan Van Dyk ai...@highrise.ca wrote:
 It does this by moving the FPW/IO penalty from the commit time of the
 backend that first dirties the buffer to the eviction time of the backend
 that evicts a dirty buffer.  And if you're lucky enough that the
 background writer is the only one writing dirty buffers, you'll see
 lots of improvement in your performance (equivalent to running with
 current FPW off).  But I have a feeling that many of us see backends
 having to write dirty buffers often enough that the reduction in
 commit/WAL latency will be offset (hopefully not by as much) by increased
 query processing time as backends double-write dirty buffers.

I have that feeling, too.  Someone needs to devote some time to
performance testing this stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.

2012-01-11 Thread Simon Riggs
On Wed, Jan 11, 2012 at 2:05 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Sat, Dec 31, 2011 at 10:34 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Send new protocol keepalive messages to standby servers.
 Allows streaming replication users to calculate transfer latency
 and apply delay via internal functions. No external functions yet.

Thanks for further review.

 pq_flush_if_writable() needs to be called just after WalSndKeepalive().
 Otherwise, the keepalive packet is not sent for a while.

It will get sent though won't it? Maybe not immediately. I guess we
may as well flush though, since we're not doing anything else - by
definition. Will add.

 +static void
 +ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)

 walEnd is not used in ProcessWalSndrMessage() at all. Can't we remove it?
 If yes, walEnd field in WalSndrMessage is also not used anywhere, so ISTM
 we can remove it.

It's there to allow extension of the message processing to be more
complex than it currently is. Changing the protocol is much harder
than changing a function call.

I'd like to keep it since it doesn't have any negative effects.

 +       elog(DEBUG2, "sendtime %s receipttime %s replication apply delay %d transfer latency %d",
 +                                       timestamptz_to_str(sendTime),
 +                                       timestamptz_to_str(lastMsgReceiptTime),
 +                                       GetReplicationApplyDelay(),
 +                                       GetReplicationTransferLatency());

 The units of replication apply delay and transfer latency should be in
 the log message.

OK, will do.

 GetReplicationApplyDelay() and GetReplicationTransferLatency() are called
 whenever the standby receives the message from the master. Which might
 degrade the performance of replication a bit. So we should skip the above elog
 when log_message = DEBUG2?

OK, will put in a specific test for you.
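
For reference, the transfer latency being logged there can be derived from
the keepalive's send timestamp and the local receipt timestamp roughly like
this.  This is only a sketch; the actual GetReplicationTransferLatency()
implementation may differ.

#include "postgres.h"
#include "utils/timestamp.h"

static int
transfer_latency_ms(TimestampTz sendTime, TimestampTz receiptTime)
{
    long    secs;
    int     microsecs;

    TimestampDifference(sendTime, receiptTime, &secs, &microsecs);

    /* Clock skew between master and standby can make this inaccurate. */
    return (int) (secs * 1000 + microsecs / 1000);
}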

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-11 Thread Pavel Stehule
2012/1/11 Robert Haas robertmh...@gmail.com:
 On Wed, Jan 11, 2012 at 9:41 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 I understand it now. My opinion is so some operators and index search
 can be in 9.2 - so use a JSON just as communication format now.

 * we need to build JSON
 * we need to check if some is valid JSON
 * we need to store JSON

 other steps should be (9.2)
 * basic operators eq, neeq
 * some tool like XQuery - simple query on JSON document available from
 SQL that can be used for functional indexes.

 That would be nice, but let's not let the perfect be the enemy of the
 good.  We don't have a lot of time here.


sorry - replace 9.2 by 9.3 - I am sorry

I am able to write the array_to_json function and Andrew can write query_to_json

Pavel

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-11 Thread Misa Simic

I am able to write the array_to_json function and Andrew can write query_to_json

+1

Thanks guys...

We are using a lot of JSON as a communication protocol...

Having core support for JSON, and those functions, will be a real life
saver...

Many thanks,

Misa

Sent from my Windows Phone
From: Pavel Stehule
Sent: 11/01/2012 16:22
To: Robert Haas
Cc: David E. Wheeler; Claes Jakobsson; Dimitri Fontaine; Merlin
Moncure; Andrew Dunstan; Magnus Hagander; Jan Urbański; Simon Riggs;
Joey Adams; Bruce Momjian; PostgreSQL-development Hackers; Jan Wieck
Subject: Re: [HACKERS] JSON for PG 9.2
2012/1/11 Robert Haas robertmh...@gmail.com:
 On Wed, Jan 11, 2012 at 9:41 AM, Pavel Stehule pavel.steh...@gmail.com 
 wrote:
 I understand it now. My opinion is so some operators and index search
 can be in 9.2 - so use a JSON just as communication format now.

 * we need to build JSON
 * we need to check if some is valid JSON
 * we need to store JSON

 other steps should be (9.2)
 * basic operators eq, neeq
 * some tool like XQuery - simple query on JSON document available from
 SQL that can be used for functional indexes.

 That would be nice, but let's not let the perfect be the enemy of the
 good.  We don't have a lot of time here.


sorry - replace 9.2 by 9.3 - I am sorry

I am able to write the array_to_json function and Andrew can write query_to_json

Pavel

 --
 Robert Haas
 EnterpriseDB: http://www.enterprisedb.com
 The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sending notifications from the master to the standby

2012-01-11 Thread Josh Berkus
Tom,

 BTW ... it occurs to me to ask whether we really have a solid use-case
 for having listeners attached to slave servers.  I have personally never
 seen an application for LISTEN/NOTIFY in which the listeners were
 entirely read-only.  Even if there are one or two cases out there, it's
 not clear to me that supporting it is worth the extra complexity that
 seems to be needed.

Actually, I've seen requests for it from my clients and on IRC.  Not
sure how serious those are, but users have brought it up.  Certainly
users intuitively think they should be able to LISTEN on a standby, and
are surprised when they find out they can't.

The basic idea is that if we can replicate LISTENs, then you can use
replication as a simple distributed (and lossy) queueing system.  This
is especially useful if the replica is geographically distant, and there
are a lot of listeners.

The obvious first use case for this is for cache invalidation.  For
example, we have one application where we're using Redis to queue cache
invalidation messages; if LISTEN/NOTIFY were replicated, we could use it
instead and simplify our infrastructure.
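
For what it's worth, the client side of that pattern is already trivial with
libpq; here is a minimal sketch of a read-only listener that would work
unchanged against a standby if NOTIFY were propagated.  The channel name and
the flush_local_cache() helper are made up for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <libpq-fe.h>

/* Hypothetical application hook: drop whatever the payload identifies. */
static void
flush_local_cache(const char *payload)
{
    printf("invalidate: %s\n", payload ? payload : "(all)");
}

int
main(void)
{
    PGconn *conn = PQconnectdb("dbname=app");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    PQclear(PQexec(conn, "LISTEN cache_invalidation"));

    for (;;)
    {
        int      sock = PQsocket(conn);
        fd_set   input_mask;
        PGnotify *notify;

        FD_ZERO(&input_mask);
        FD_SET(sock, &input_mask);
        if (select(sock + 1, &input_mask, NULL, NULL, NULL) < 0)
            break;

        PQconsumeInput(conn);
        while ((notify = PQnotifies(conn)) != NULL)
        {
            flush_local_cache(notify->extra);
            PQfreemem(notify);
        }
    }

    PQfinish(conn);
    return 0;
}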

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PL/Python result metadata

2012-01-11 Thread Peter Eisentraut
There is currently no reliable way to retrieve from a result object in
PL/Python the number, name, or type of the result columns.  You can get
the number and name if the query returned more than zero rows by looking
at the row dicts, but that is unreliable.  The type information isn't
available at all.

I propose to add two functions to the result object:

.colnames() returns a list of column names (strings)
.coltypes() returns a list of type OIDs (integers)

I just made that up because there is no guidance in the other standard
PLs for this sort of thing, AFAICT.

Patch attached.  Comments welcome.
diff --git i/doc/src/sgml/plpython.sgml w/doc/src/sgml/plpython.sgml
index 618f8d0..69c9c90 100644
--- i/doc/src/sgml/plpython.sgml
+++ w/doc/src/sgml/plpython.sgml
@@ -886,9 +886,11 @@ $$ LANGUAGE plpythonu;
list or dictionary object.  The result object can be accessed by
row number and column name.  It has these additional methods:
   <function>nrows</function> which returns the number of rows
-   returned by the query, and <function>status</function> which is the
-   <function>SPI_execute()</function> return value.  The result object
-   can be modified.
+   returned by the query, <function>status</function> which is the
+   <function>SPI_execute()</function> return value,
+   <function>colnames</function> which is the list of column names, and
+   <function>coltypes</function> which is the list of column type OIDs.  The
+   result object can be modified.
   </para>
 
   <para>
diff --git i/src/pl/plpython/expected/plpython_spi.out w/src/pl/plpython/expected/plpython_spi.out
index 3b4d7a3..cd76147 100644
--- i/src/pl/plpython/expected/plpython_spi.out
+++ w/src/pl/plpython/expected/plpython_spi.out
@@ -117,10 +117,12 @@ SELECT join_sequences(sequences) FROM sequences
 --
 CREATE FUNCTION result_nrows_test() RETURNS int
 AS $$
-plan = plpy.prepare("SELECT 1 UNION SELECT 2")
+plan = plpy.prepare("SELECT 1 AS foo, '11'::text AS bar UNION SELECT 2, '22'")
 plpy.info(plan.status()) # not really documented or useful
 result = plpy.execute(plan)
 if result.status() > 0:
+   plpy.info(result.colnames())
+   plpy.info(result.coltypes())
return result.nrows()
 else:
return None
@@ -128,6 +130,10 @@ $$ LANGUAGE plpythonu;
 SELECT result_nrows_test();
 INFO:  True
 CONTEXT:  PL/Python function result_nrows_test
+INFO:  ['foo', 'bar']
+CONTEXT:  PL/Python function result_nrows_test
+INFO:  [23, 25]
+CONTEXT:  PL/Python function result_nrows_test
  result_nrows_test 
 ---
  2
diff --git i/src/pl/plpython/plpy_resultobject.c w/src/pl/plpython/plpy_resultobject.c
index bf46a16..e7d14d4 100644
--- i/src/pl/plpython/plpy_resultobject.c
+++ w/src/pl/plpython/plpy_resultobject.c
@@ -12,6 +12,8 @@
 
 
 static void PLy_result_dealloc(PyObject *arg);
+static PyObject *PLy_result_colnames(PyObject *self, PyObject *unused);
+static PyObject *PLy_result_coltypes(PyObject *self, PyObject *unused);
 static PyObject *PLy_result_nrows(PyObject *self, PyObject *args);
 static PyObject *PLy_result_status(PyObject *self, PyObject *args);
 static Py_ssize_t PLy_result_length(PyObject *arg);
@@ -35,6 +37,8 @@ static PySequenceMethods PLy_result_as_sequence = {
 };
 
 static PyMethodDef PLy_result_methods[] = {
+	{"colnames", PLy_result_colnames, METH_NOARGS, NULL},
+	{"coltypes", PLy_result_coltypes, METH_NOARGS, NULL},
	{"nrows", PLy_result_nrows, METH_VARARGS, NULL},
	{"status", PLy_result_status, METH_VARARGS, NULL},
 	{NULL, NULL, 0, NULL}
@@ -96,6 +100,7 @@ PLy_result_new(void)
	ob->status = Py_None;
	ob->nrows = PyInt_FromLong(-1);
	ob->rows = PyList_New(0);
+	ob->tupdesc = NULL;
 
 	return (PyObject *) ob;
 }
@@ -108,11 +113,44 @@ PLy_result_dealloc(PyObject *arg)
	Py_XDECREF(ob->nrows);
	Py_XDECREF(ob->rows);
	Py_XDECREF(ob->status);
+	if (ob->tupdesc)
+	{
+		FreeTupleDesc(ob->tupdesc);
+		ob->tupdesc = NULL;
+	}
 
	arg->ob_type->tp_free(arg);
 }
 
 static PyObject *
+PLy_result_colnames(PyObject *self, PyObject *unused)
+{
+	PLyResultObject *ob = (PLyResultObject *) self;
+	PyObject   *list;
+	int			i;
+
+	list = PyList_New(ob->tupdesc->natts);
+	for (i = 0; i < ob->tupdesc->natts; i++)
+		PyList_SET_ITEM(list, i, PyString_FromString(NameStr(ob->tupdesc->attrs[i]->attname)));
+
+	return list;
+}
+
+static PyObject *
+PLy_result_coltypes(PyObject *self, PyObject *unused)
+{
+	PLyResultObject *ob = (PLyResultObject *) self;
+	PyObject   *list;
+	int			i;
+
+	list = PyList_New(ob->tupdesc->natts);
+	for (i = 0; i < ob->tupdesc->natts; i++)
+		PyList_SET_ITEM(list, i, PyInt_FromLong(ob->tupdesc->attrs[i]->atttypid));
+
+	return list;
+}
+
+static PyObject *
 PLy_result_nrows(PyObject *self, PyObject *args)
 {
 	PLyResultObject *ob = (PLyResultObject *) self;
diff --git i/src/pl/plpython/plpy_resultobject.h w/src/pl/plpython/plpy_resultobject.h
index 719828a..1b37d1d 100644
--- i/src/pl/plpython/plpy_resultobject.h
+++ w/src/pl/plpython/plpy_resultobject.h
@@ -5,6 +5,9 @@
 #ifndef PLPY_RESULTOBJECT_H
 #define PLPY_RESULTOBJECT_H
 
+#include 

Re: [HACKERS] PL/Python result metadata

2012-01-11 Thread Alvaro Herrera

Excerpts from Peter Eisentraut's message of Wed Jan 11 17:05:34 -0300 2012:
 There is currently no reliable way to retrieve from a result object in
 PL/Python the number, name, or type of the result columns.  You can get
 the number and name if the query returned more than zero rows by looking
 at the row dicts, but that is unreliable.  The type information isn't
 available at all.
 
 I propose to add two functions to the result object:
 
 .colnames() returns a list of column names (strings)
 .coltypes() returns a list of type OIDs (integers)

No typmods?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] psql case preserving completion

2012-01-11 Thread Peter Eisentraut
In psql, the tab completion always converts key words to upper case.  In
practice, I and I think most users type in lower case.  So then you end
up with commands looking like this:

=> alter TABLE foo add CONSTRAINT bar check (a > 0);

To address this, I have implemented a slightly different completion mode
that looks at the word being completed and converts the completed word
to the case of the original word.  (Well, it looks at the first letter.)

In fact, since almost all completions in psql are of this nature, I made
this the default mode for COMPLETE_WITH_CONST and COMPLETE_WITH_LIST and
added a new macro COMPLETE_WITH_LIST_CS that uses the old case-sensitive
behavior. The latter is used mainly for completing backslash commands.

After playing with this a little, I find the behavior more pleasing.
Less yelling. ;-)

Patch attached.
diff --git a/src/bin/psql/tab-complete.c b/src/bin/psql/tab-complete.c
index a27ef69..5d042f0 100644
--- a/src/bin/psql/tab-complete.c
+++ b/src/bin/psql/tab-complete.c
@@ -132,6 +132,7 @@ static const char *const * completion_charpp;	/* to pass a list of strings */
 static const char *completion_info_charp;		/* to pass a second string */
 static const char *completion_info_charp2;		/* to pass a third string */
 static const SchemaQuery *completion_squery;	/* to pass a SchemaQuery */
+static bool completion_case_sensitive;			/* completion is case sensitive */
 
 /*
  * A few macros to ease typing. You can use these to complete the given
@@ -155,15 +156,24 @@ do { \
 	matches = completion_matches(text, complete_from_schema_query); \
 } while (0)
 
+#define COMPLETE_WITH_LIST_CS(list) \
+do { \
+	completion_charpp = list; \
+	completion_case_sensitive = true; \
+	matches = completion_matches(text, complete_from_list); \
+} while (0)
+
 #define COMPLETE_WITH_LIST(list) \
 do { \
 	completion_charpp = list; \
+	completion_case_sensitive = false; \
 	matches = completion_matches(text, complete_from_list); \
 } while (0)
 
 #define COMPLETE_WITH_CONST(string) \
 do { \
 	completion_charp = string; \
+	completion_case_sensitive = false; \
 	matches = completion_matches(text, complete_from_const); \
 } while (0)
 
@@ -771,7 +781,7 @@ psql_completion(char *text, int start, int end)
 
 	/* If a backslash command was started, continue */
 	if (text[0] == '\\')
-		COMPLETE_WITH_LIST(backslash_commands);
+		COMPLETE_WITH_LIST_CS(backslash_commands);
 
 	/* Variable interpolation */
 	else if (text[0] == ':' && text[1] != ':')
@@ -2864,7 +2874,7 @@ psql_completion(char *text, int start, int end)
 			"null", "fieldsep", "tuples_only", "title", "tableattr",
 		"linestyle", "pager", "recordsep", NULL};
 
-		COMPLETE_WITH_LIST(my_list);
+		COMPLETE_WITH_LIST_CS(my_list);
 	}
 	else if (strcmp(prev2_wd, "\\pset") == 0)
 	{
@@ -2874,14 +2884,14 @@ psql_completion(char *text, int start, int end)
 			{"unaligned", "aligned", "wrapped", "html", "latex",
 			"troff-ms", NULL};
 
-			COMPLETE_WITH_LIST(my_list);
+			COMPLETE_WITH_LIST_CS(my_list);
 		}
 		else if (strcmp(prev_wd, "linestyle") == 0)
 		{
 			static const char *const my_list[] =
 			{"ascii", "old-ascii", "unicode", NULL};
 
-			COMPLETE_WITH_LIST(my_list);
+			COMPLETE_WITH_LIST_CS(my_list);
 		}
 	}
 	else if (strcmp(prev_wd, "\\set") == 0)
@@ -3234,6 +3244,31 @@ _complete_from_query(int is_schema_query, const char *text, int state)
 
 
 /*
+ * Make a pg_strdup copy of s and convert it to the same case as ref.
+ */
+static char *
+pg_strdup_same_case(const char *s, const char *ref)
+{
+	char *ret, *p;
+	unsigned char first = ref[0];
+
+	if (isalpha(first))
+	{
+		ret = pg_strdup(s);
+		if (islower(first))
+			for (p = ret; *p; p++)
+*p = pg_tolower((unsigned char) *p);
+		else
+			for (p = ret; *p; p++)
+*p = pg_toupper((unsigned char) *p);
+		return ret;
+	}
+	else
+		return pg_strdup(s);
+}
+
+
+/*
  * This function returns in order one of a fixed, NULL pointer terminated list
  * of strings (if matching). This can be used if there are only a fixed number
  * SQL words that can appear at certain spot.
@@ -3255,7 +3290,7 @@ complete_from_list(const char *text, int state)
 	{
 		list_index = 0;
 		string_length = strlen(text);
-		casesensitive = true;
+		casesensitive = completion_case_sensitive;
 		matches = 0;
 	}
 
@@ -3270,7 +3305,14 @@ complete_from_list(const char *text, int state)
 
 		/* Second pass is case insensitive, don't bother counting matches */
 		if (!casesensitive && pg_strncasecmp(text, item, string_length) == 0)
-			return pg_strdup(item);
+		{
+			if (completion_case_sensitive)
+return pg_strdup(item);
+			else
+/* If case insensitive matching was requested initially, return
+ * it in the case of what was already entered. */
+return pg_strdup_same_case(item, text);
+		}
 	}
 
 	/*
@@ -3300,12 +3342,16 @@ complete_from_list(const char *text, int state)
 static char *
 complete_from_const(const char *text, int state)
 {
-	(void) text;/* We don't care about what was entered
- * already. */
-
 	

Re: [HACKERS] psql case preserving completion

2012-01-11 Thread Pavel Stehule
2012/1/11 Peter Eisentraut pete...@gmx.net:
 In psql, the tab completion always converts key words to upper case.  In
 practice, I and I think most users type in lower case.  So then you end
 up with commands looking like this:

 => alter TABLE foo add CONSTRAINT bar check (a > 0);

 To address this, I have implemented a slightly different completion mode
 that looks at the word being completed and converts the completed word
 to the case of the original word.  (Well, it looks at the first letter.)

 In fact, since almost all completions in psql are of this nature, I made
 this the default mode for COMPLETE_WITH_CONST and COMPLETE_WITH_LIST and
 added a new macro COMPLETE_WITH_LIST_CS that uses the old case-sensitive
 behavior. The latter is used mainly for completing backslash commands.

 After playing with this a little, I find the behavior more pleasing.
 Less yelling. ;-)

 Patch attached.

+1

Pavel



 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [PATCH] renaming constraints

2012-01-11 Thread Peter Eisentraut
Here is a patch to add a command

ALTER TABLE ... RENAME CONSTRAINT ...

Currently, it only supports table constraints.  I have an almost
finished patch for renaming domain constraints, but it is easier to keep
it separate.
diff --git a/doc/src/sgml/ref/alter_table.sgml b/doc/src/sgml/ref/alter_table.sgml
index 1976f6d..cd9ee97 100644
--- a/doc/src/sgml/ref/alter_table.sgml
+++ b/doc/src/sgml/ref/alter_table.sgml
@@ -25,6 +25,8 @@ ALTER TABLE [ ONLY ] <replaceable class="PARAMETER">name</replaceable> [ * ]
<replaceable class="PARAMETER">action</replaceable> [, ... ]
ALTER TABLE [ ONLY ] <replaceable class="PARAMETER">name</replaceable> [ * ]
RENAME [ COLUMN ] <replaceable class="PARAMETER">column</replaceable> TO <replaceable class="PARAMETER">new_column</replaceable>
+ALTER TABLE [ ONLY ] <replaceable class="PARAMETER">name</replaceable> [ * ]
+RENAME CONSTRAINT <replaceable class="PARAMETER">constraint_name</replaceable> TO <replaceable class="PARAMETER">new_constraint_name</replaceable>
ALTER TABLE <replaceable class="PARAMETER">name</replaceable>
RENAME TO <replaceable class="PARAMETER">new_name</replaceable>
ALTER TABLE <replaceable class="PARAMETER">name</replaceable>
@@ -555,8 +557,8 @@ ALTER TABLE <replaceable class="PARAMETER">name</replaceable>
<listitem>
 <para>
  The <literal>RENAME</literal> forms change the name of a table
-  (or an index, sequence, or view) or the name of an individual column in
-  a table. There is no effect on the stored data.
+  (or an index, sequence, or view), the name of an individual column in
+  a table, or the name of a constraint of the table. There is no effect on the stored data.
 </para>
</listitem>
</varlistentry>
@@ -869,7 +871,8 @@ ALTER TABLE <replaceable class="PARAMETER">name</replaceable>

<para>
If a table has any descendant tables, it is not permitted to add,
-rename, or change the type of a column in the parent table without doing
+rename, or change the type of a column, or rename an inherited constraint
+in the parent table without doing
the same to the descendants.  That is, <command>ALTER TABLE ONLY</command>
will be rejected.  This ensures that the descendants always have
columns matching the parent.
@@ -969,6 +972,13 @@ ALTER TABLE distributors RENAME TO suppliers;
  </para>

  <para>
+   To rename an existing constraint:
+<programlisting>
+ALTER TABLE distributors RENAME CONSTRAINT zipchk TO zip_check;
+</programlisting>
+  </para>
+
+  <para>
To add a not-null constraint to a column:
<programlisting>
 ALTER TABLE distributors ALTER COLUMN street SET NOT NULL;
diff --git a/src/backend/commands/alter.c b/src/backend/commands/alter.c
index 7c658c0..ef8c5fb 100644
--- a/src/backend/commands/alter.c
+++ b/src/backend/commands/alter.c
@@ -57,6 +57,10 @@ ExecRenameStmt(RenameStmt *stmt)
			RenameCollation(stmt->object, stmt->newname);
 			break;
 
+		case OBJECT_CONSTRAINT:
+			RenameConstraint(stmt);
+			break;
+
 		case OBJECT_CONVERSION:
			RenameConversion(stmt->object, stmt->newname);
 			break;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c373016..33ba611 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2320,6 +2320,108 @@ renameatt(RenameStmt *stmt)
 	   stmt->behavior);
 }
 
+
+/*
+ * same logic as renameatt_internal
+ */
+static void
+rename_constraint_internal(Oid myrelid,
+		   const char *oldconname,
+		   const char *newconname,
+		   bool recurse,
+		   bool recursing,
+		   int expected_parents)
+{
+	Relation	targetrelation;
+	Oid			constraintOid;
+	HeapTuple   tuple;
+	Form_pg_constraint con;
+
+	targetrelation = relation_open(myrelid, AccessExclusiveLock);
+	/* don't tell it whether we're recursing; we allow changing typed tables here */
+	renameatt_check(myrelid, RelationGetForm(targetrelation), false);
+
+	constraintOid = get_constraint_oid(myrelid, oldconname, false);
+
+	tuple = SearchSysCache1(CONSTROID, ObjectIdGetDatum(constraintOid));
+	if (!HeapTupleIsValid(tuple))
+		elog(ERROR, "cache lookup failed for constraint %u",
+			 constraintOid);
+	con = (Form_pg_constraint) GETSTRUCT(tuple);
+
+	if (!con->conisonly)
+	{
+		if (recurse)
+		{
+			List	   *child_oids,
+*child_numparents;
+			ListCell   *lo,
+*li;
+
+			child_oids = find_all_inheritors(myrelid, AccessExclusiveLock,
+			 &child_numparents);
+
+			forboth(lo, child_oids, li, child_numparents)
+			{
+Oid			childrelid = lfirst_oid(lo);
+int			numparents = lfirst_int(li);
+
+if (childrelid == myrelid)
+	continue;
+
+rename_constraint_internal(childrelid, oldconname, newconname, false, true, numparents);
+			}
+		}
+		else
+		{
+			if (expected_parents == 0 &&
+find_inheritance_children(myrelid, NoLock) != NIL)
+ereport(ERROR,
+		(errcode(ERRCODE_INVALID_TABLE_DEFINITION),
+		 errmsg("inherited constraint \"%s\" must be renamed in child tables too",
+oldconname)));
+		}
+
+		if (con->coninhcount > expected_parents)
+			

Re: [HACKERS] [WIP] Double-write with Fast Checksums

2012-01-11 Thread Dan Scales
Thanks for all the comments and suggestions on the double-write patch.  We are 
working on generating performance results for the 9.2 patch, but there is 
enough difference between 9.0 and 9.2 that it will take some time.

One thing in 9.2 that may be causing problems with the current patch is the 
fact that the checkpointer and bgwriter are separated and can run at the same 
time (I think), and therefore will contend on the double-write file.  Is there 
any thought that the bgwriter might be paused while the checkpointer is doing a 
checkpoint, since the checkpointer is doing some of the cleaning that the 
bgwriter wants to do anyways?

The current patch (as mentioned) also may not do well if there are a lot of 
dirty-page evictions by backends, because of the extra fsyncing just to write 
individual buffers.  I think Heikki's (and Simon's) idea of a growing shared 
double-write buffer (only doing double-writes when it gets to a certain size) 
instead is a great idea that could deal with the dirty-page eviction issue with 
less performance hit.  It could also deal with the checkpointer/bgwriter 
contention, if we can't avoid that.  I will think about that approach and any 
issues that might arise.  But for now, we will work on getting performance 
numbers for the current patch.

With respect to all the extra fsyncs, I agree they are expensive if done on 
individual buffers by backends.  For the checkpointer, there will be extra 
fsyncs, but the batching helps greatly, and the fsyncs per batch are traded off 
against the often large  unpredictable fsyncs at the end of checkpoints.  In 
our performance runs on 9.0, the configuration was such that there were not a 
lot of dirty evictions, and the checkpointer/bgwriter was able to finish the 
checkpoint on time, even with the double writes.

And just wanted to reiterate one other benefit of double writes -- it greatly 
reduces the size of the WAL logs.

Thanks,

Dan

- Original Message -
From: Heikki Linnakangas heikki.linnakan...@enterprisedb.com
To: David Fetter da...@fetter.org
Cc: PG Hackers pgsql-hackers@postgresql.org, jks...@gmail.com
Sent: Wednesday, January 11, 2012 4:13:01 AM
Subject: Re: [HACKERS] [WIP] Double-write with Fast Checksums

On 10.01.2012 23:43, David Fetter wrote:
 Please find attached a new revision of the double-write patch.  While
 this one still uses the checksums from VMware, it's been
 forward-ported to 9.2.

 I'd like to hold off on merging Simon's checksum patch into this one
 for now because there may be some independent issues.

Could you write this patch so that it doesn't depend on any of the 
checksum patches, please? That would make the patch smaller and easier 
to review, and it would allow benchmarking the performance impact of 
double-writes vs full page writes independent of checksums.

At the moment, double-writes are done in one batch, fsyncing the 
double-write area first and the data files immediately after that. 
That's probably beneficial if you have a BBU, and/or a fairly large 
shared_buffers setting, so that pages don't get swapped between OS and 
PostgreSQL cache too much. But when those assumptions don't hold, it 
would be interesting to treat the double-write buffers more like a 2nd 
WAL for full-page images. Whenever a dirty page is evicted from 
shared_buffers, write it to the double-write area, but don't fsync it or 
write it back to the data file yet. Instead, let it sit in the 
double-write area, and grow the double-write file(s) as necessary, until 
the next checkpoint comes along.

In general, I must say that I'm pretty horrified by all these extra 
fsync's this introduces. You really need a BBU to absorb them, and even 
then, you're fsyncing data files to disk much more frequently than you 
otherwise would.

Jignesh mentioned having run some performance tests with this. I would 
like to see those results, and some analysis and benchmarks of how 
settings like shared_buffers and the presence of BBU affect this, 
compared to full_page_writes=on and off.

-- 
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_basebackup is not checking IDENTIFY_SYSTEM number of columns

2012-01-11 Thread Jaime Casanova
Hi,

I just noticed $SUBJECT, and this could lead us to a segmentation fault
if by accident we point to a system with a different number of columns
in IDENTIFY_SYSTEM; at least I pointed pg_receivexlog from current head
at a 9.0 installation and got that.

Any reason for not checking number of columns?

-- 
Jaime Casanova         www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PL/Python result metadata

2012-01-11 Thread Dimitri Fontaine
Peter Eisentraut pete...@gmx.net writes:
 .colnames() returns a list of column names (strings)
 .coltypes() returns a list of type OIDs (integers)

 I just made that up because there is no guidance in the other standard
 PLs for this sort of thing, AFAICT.

What about having the same or comparable API as in psycopg or DB API

  http://initd.org/psycopg/docs/cursor.html

You could expose a py.description structure?
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_basebackup is not checking IDENTIFY_SYSTEM number of columns

2012-01-11 Thread Magnus Hagander
On Wed, Jan 11, 2012 at 22:31, Jaime Casanova ja...@2ndquadrant.com wrote:
 Hi,

 I just noticed $SUBJECT, and this could lead us to a segmentation fault
 if by accident we point to a system with a different number of columns
 in IDENTIFY_SYSTEM; at least I pointed pg_receivexlog from current head
 at a 9.0 installation and got that.

 Any reason for not checking number of columns?

Hmm. I could've bet it did that. I must've taken that code out at some
point during refactoring.

No, no reason. Adding such a check would be a good idea.
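
The check itself is cheap; something along these lines before any column is
read would turn the crash into a clean error.  This is only a sketch of the
idea, not the actual fix, and the function name is made up.

#include <stdio.h>
#include <stdbool.h>
#include <libpq-fe.h>

static bool
identify_system_ok(PGconn *conn, PGresult **res_out)
{
    PGresult *res = PQexec(conn, "IDENTIFY_SYSTEM");

    if (PQresultStatus(res) != PGRES_TUPLES_OK ||
        PQntuples(res) != 1 ||
        PQnfields(res) < 3)     /* current head expects systemid, timeline, xlogpos */
    {
        fprintf(stderr, "unexpected IDENTIFY_SYSTEM response: %d rows, %d fields\n",
                PQntuples(res), PQnfields(res));
        PQclear(res);
        return false;
    }

    *res_out = res;
    return true;
}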

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2012-01-11 Thread Simon Riggs
On Sun, Jan 8, 2012 at 2:03 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sat, Jan 7, 2012 at 11:09 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sat, Jan 7, 2012 at 10:55 AM, Simon Riggs si...@2ndquadrant.com wrote:

 So there isn't any problem with there being incorrect checksums on
 blocks and you can turn the parameter on and off as often and as
 easily as you want. I think it can be USERSET but I wouldn't want to
 encourage users to see turning it off as a performance tuning feature.
 If the admin turns it on for the server, its on, so its SIGHUP.

 Any holes in that I haven't noticed?

 And of course, as soon as I wrote that I thought of the problem. We
 mustn't make a write that hasn't been covered by a FPW, so we must
 know ahead of time whether to WAL log hints or not. We can't simply
 turn it on/off any longer, now that we have to WAL log hint bits also.
 So thanks for making me think of that.

 We *could* make it turn on/off at each checkpoint, but its easier just
 to say that it can be turned on/off at server start.

 Attached patch v6 now handles hint bits and checksums correctly,
 following Heikki's comments.

 In recovery, setting a hint doesn't dirty a block if it wasn't already
 dirty. So we can write some hints, and we can set others but not write
 them.

 Lots of comments in the code.


v7

* Fixes merge conflict
* Minor patch cleanups
* Adds checksum of complete page including hole
* Calcs checksum in mdwrite() so we pick up all non-shared buffer writes also

Robert mentioned to me there were outstanding concerns on this patch.
I know of none, and have double checked the thread to confirm all
concerns are fully addressed. Adding to CF.
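
For reference, the Fletcher-16 scheme named in the patch's documentation
boils down to two running 8-bit sums over the page bytes.  The following is
a stand-alone sketch only: the patch computes the checksum in mdwrite() and
has to handle the page's own checksum field, which this example ignores, and
BLCKSZ is assumed to be the default 8192.

#include <stdint.h>
#include <stddef.h>

#define BLCKSZ 8192                 /* assumed default block size */

static uint16_t
fletcher16_page(const uint8_t *page)
{
    uint32_t sum1 = 0;
    uint32_t sum2 = 0;
    size_t   i;

    /* Checksum the complete page, including the hole. */
    for (i = 0; i < BLCKSZ; i++)
    {
        sum1 = (sum1 + page[i]) % 255;
        sum2 = (sum2 + sum1) % 255;
    }
    return (uint16_t) ((sum2 << 8) | sum1);
}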

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0cc3296..3cb8d2a 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1701,6 +1701,47 @@ SET ENABLE_SEQSCAN TO OFF;
   </listitem>
  </varlistentry>
 
+ <varlistentry id="guc-page-checksums" xreflabel="page_checksums">
+  <indexterm>
+   <primary><varname>page_checksums</> configuration parameter</primary>
+  </indexterm>
+  <term><varname>page_checksums</varname> (<type>boolean</type>)</term>
+  <listitem>
+   <para>
+When this parameter is on, the <productname>PostgreSQL</> server
+calculates checksums when it writes main database pages to disk,
+flagging the page as checksum protected.  When this parameter is off,
+no checksum is written, only a standard watermark in the page header.
+The database may thus contain a mix of pages with checksums and pages
+without checksums.
+   </para>
+
+   <para>
+When pages are read into shared buffers any page flagged with a
+checksum has the checksum re-calculated and compared against the
+stored value to provide greatly improved validation of page contents.
+   </para>
+
+   <para>
+Writes via temp_buffers are not checksummed.
+   </para>
+
+   <para>
+Turning this parameter off speeds normal operation, but
+might allow data corruption to go unnoticed. The checksum uses
+16-bit checksums, using the fast Fletcher 16 algorithm. With this
+parameter enabled there is still a non-zero probability that an error
+could go undetected, as well as a non-zero probability of false
+positives.
+   </para>
+
+   <para>
+This parameter can only be set at server start.
+The default is <literal>off</>.
+   </para>
+  </listitem>
+ </varlistentry>
+
  <varlistentry id="guc-wal-buffers" xreflabel="wal_buffers">
   <term><varname>wal_buffers</varname> (<type>integer</type>)</term>
   <indexterm>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 19ef66b..7c7b20e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -709,6 +709,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
 	bool		updrqst;
 	bool		doPageWrites;
	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+	bool		IsHint = (rmid == RM_SMGR_ID && info == XLOG_SMGR_HINT);
 	uint8		info_orig = info;
 
 	/* cross-check on whether we should be here or not */
@@ -955,6 +956,18 @@ begin:;
 	}
 
 	/*
+	 * If this is a hint record and we don't need a backup block then
+	 * we have no more work to do and can exit quickly without inserting
+	 * a WAL record at all. In that case return InvalidXLogRecPtr.
+	 */
+	if (IsHint && !(info & XLR_BKP_BLOCK_MASK))
+	{
+		LWLockRelease(WALInsertLock);
+		END_CRIT_SECTION();
+		return InvalidXLogRecPtr;
+	}
+
+	/*
 	 * If there isn't enough space on the current XLOG page for a record
 	 * header, advance to the next page (leaving the unused space as zeroes).
 	 */
@@ -3650,6 +3663,13 @@ RestoreBkpBlocks(XLogRecPtr lsn, XLogRecord *record, bool cleanup)
    BLCKSZ - (bkpb.hole_offset + 

Re: [HACKERS] PL/Python result metadata

2012-01-11 Thread Jan Urbański
On 11/01/12 22:52, Dimitri Fontaine wrote:
 Peter Eisentraut pete...@gmx.net writes:
 .colnames() returns a list of column names (strings)
 .coltypes() returns a list of type OIDs (integers)

 I just made that up because there is no guidance in the other standard
 PLs for this sort of thing, AFAICT.
 
 What about having the same or comparable API as in psycopg or DB API
 
   http://initd.org/psycopg/docs/cursor.html
 
 You could expose a py.description structure?

+1 for providing a read-only result.description. Not sure if it's worth
it to follow DB-API there, but maybe yes. Perhaps we could have a
result.description_ex information that's PG-specific or just not present
in PEP 249, like the typmod, collation and so on.

J

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Double-write with Fast Checksums

2012-01-11 Thread Josh Berkus
On 1/11/12 1:25 PM, Dan Scales wrote:
 And just wanted to reiterate one other benefit of double writes -- it greatly 
 reduces the size of the WAL logs.

Even if you're replicating?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] PGCon 2012 Call for Papers

2012-01-11 Thread Dan Langille
PGCon 2012 will be held 17-18 May 2012, in Ottawa at the University of
Ottawa.  It will be preceded by two days of tutorials on 15-16 May 2012.

We are now accepting proposals for talks.  Proposals can be quite 
simple. We do not require academic-style papers.

If you are doing something interesting with PostgreSQL, please submit
a proposal.  You might be one of the backend hackers or work on a
PostgreSQL related project and want to share your know-how with
others. You might be developing an interesting system using PostgreSQL
as the foundation. Perhaps you migrated from another database to
PostgreSQL and would like to share details.  These, and other stories
are welcome. Both users and developers are encouraged to share their
experiences.

Here are a some ideas to jump start your proposal process:

- novel ways in which PostgreSQL is used
- migration of production systems from another database
- data warehousing
- tuning PostgreSQL for different work loads
- replication and clustering
- hacking the PostgreSQL code
- PostgreSQL derivatives and forks
- applications built around PostgreSQL
- benchmarking and performance engineering
- case studies
- location-aware and mapping software with PostGIS
- The latest PostgreSQL features and features in development
- research and teaching with PostgreSQL
- things the PostgreSQL project could do better
- integrating PostgreSQL with 3rd-party software

Both users and developers are encouraged to share their experiences.

The schedule is:

 8 Jan 2012 Proposal acceptance begins
29 Jan 2012 Proposal acceptance ends
19 Feb 2012 Confirmation of accepted proposals

See also http://www.pgcon.org/2012/papers.php

Instructions for submitting a proposal to PGCon 2012 are available
from: http://www.pgcon.org/2012/submissions.php

-- 
Dan Langille - http://langille.org


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Double-write with Fast Checksums

2012-01-11 Thread Simon Riggs
On Wed, Jan 11, 2012 at 11:07 PM, Josh Berkus j...@agliodbs.com wrote:
 On 1/11/12 1:25 PM, Dan Scales wrote:
 And just wanted to reiterate one other benefit of double writes -- it 
 greatly reduces the size of the WAL logs.

 Even if you're replicating?

Yes, but it will increase random I/O on the standby when we replay if
we don't have FPWs.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLOCK_STATS

2012-01-11 Thread Jim Nasby
On Jan 10, 2012, at 3:16 AM, Simon Riggs wrote:
 On Tue, Jan 10, 2012 at 12:24 AM, Jim Nasby j...@nasby.net wrote:
 IIRC, pg_bench is *extremely* write-heavy. There's probably not that many 
 systems that operate that way. I suspect that most OLTP systems read more 
 than they write, and some probably have as much as a 10-1 ratio.
 
 IMHO the main PostgreSQL design objective is doing a flexible, general
 purpose 100% write workload. Which is why Hot Standby and
 LISTEN/NOTIFY are so important as mechanisms for offloading read
 traffic to other places, so we can scale the total solution beyond 1
 node without giving up the power of SQL.

There's a problem with that theory though... in an actual OLTP system it can be 
extremely difficult to effectively split read and write workloads unless you've 
got some really easy way to know that you're not reading data that was just 
modified. I realize that there are caching and some other tricks that can help 
here, but AFAICT they all have some pretty significant drawbacks that can 
easily limit where they can be used.
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Remembering bug #6123

2012-01-11 Thread Kevin Grittner
Going back through the patches we had to make to 9.0 to move to
PostgreSQL triggers, I noticed that I let the issues raised as bug
#6123 lie untouched during the 9.2 development cycle.  In my view,
the best suggestion for a solution was proposed by Florian here:
 
http://archives.postgresql.org/pgsql-hackers/2011-08/msg00388.php
 
As pointed out in a brief exchange after that post, there would be
ways to do the more exotic things that might be desired, while
preventing apparently straightforward code from doing surprising
things.  I'm not sure whether that discussion fully satisfies the
concerns raised by Robert, though.
 
Because I let this lapse, it only seems feasible to go forward with
a patch for 9.2 if there is consensus around Florian's proposal.  If
there is any dissent, I guess the thing to do is for me to gather
the issues and see about getting something into 9.3, once 9.2 work
has died down -- in five months or so.  Wisconsin Courts can
continue to deal with the issues using my more simple-minded patch,
but others still are getting surprised by it -- bug #6226 is
apparently another manifestation.
 
Comments?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sending notifications from the master to the standby

2012-01-11 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 BTW ... it occurs to me to ask whether we really have a solid use-case
 for having listeners attached to slave servers.  I have personally never
 seen an application for LISTEN/NOTIFY in which the listeners were
 entirely read-only.  Even if there are one or two cases out there, it's
 not clear to me that supporting it is worth the extra complexity that
 seems to be needed.

 The basic idea is that if we can replicate LISTENs, then you can use
 replication as a simple distributed (and lossy) queueing system.

Well, this is exactly what I don't believe.  A queueing system requires
that recipients be able to remove things from the queue.  You can't do
that on a slave server, because you can't make any change in the
database that would be visible to other users.

 The obvious first use case for this is for cache invalidation.

Yeah, upthread Simon pointed out that propagating notifies would be
useful for flushing caches in applications that watch the database in a
read-only fashion.  I grant that such a use-case is technically possible
within the limitations of a slave server; I'm just dubious that it's a
sufficiently attractive use-case to justify the complexity and future
maintenance costs of the sort of designs we are talking about.  Or in
other words: so far, cache invalidation is not the first use-case,
it's the ONLY POSSIBLE use-case.  That's not useful enough.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sending notifications from the master to the standby

2012-01-11 Thread Josh Berkus

 Yeah, upthread Simon pointed out that propagating notifies would be
 useful for flushing caches in applications that watch the database in a
 read-only fashion.  I grant that such a use-case is technically possible
 within the limitations of a slave server; I'm just dubious that it's a
 sufficiently attractive use-case to justify the complexity and future
 maintenance costs of the sort of designs we are talking about.  Or in
 other words: so far, cache invalidation is not the first use-case,
 it's the ONLY POSSIBLE use-case.  That's not useful enough.

Well, cache invalidation is a pretty common task; probably more than 50%
of all database applications need to do it.  Note that we're not just
talking about memcached for web applications here.  For example, one of
the companies quoted for PostgreSQL 9.0 release uses LISTEN/NOTIFY to
inform remote devices (POS systems) that there's new data available for
them. That's a form of cache invalidation.  It's certainly a more common
design pattern than using XML in the database.

However, there's the question of whether or not this patch actually
allows a master-slave replication system to support more Listeners more
efficiently than having them all simply listen to the master.  And what
impact it has on the performance of LISTEN/NOTIFY on standalone systems.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sending notifications from the master to the standby

2012-01-11 Thread Peter Geoghegan
On 11 January 2012 23:51, Josh Berkus j...@agliodbs.com wrote:

 Yeah, upthread Simon pointed out that propagating notifies would be
 useful for flushing caches in applications that watch the database in a
 read-only fashion.  I grant that such a use-case is technically possible
 within the limitations of a slave server; I'm just dubious that it's a
 sufficiently attractive use-case to justify the complexity and future
 maintenance costs of the sort of designs we are talking about.  Or in
 other words: so far, cache invalidation is not the first use-case,
 it's the ONLY POSSIBLE use-case.  That's not useful enough.

 Well, cache invalidation is a pretty common task; probably more than 50%
 of all database applications need to do it.

I agree that it would be nice to support this type of cache
invalidation - without commenting on the implementation, I think that
the concept is very useful, and of immediate benefit to a significant
number of people.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Double-write with Fast Checksums

2012-01-11 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 On Wed, Jan 11, 2012 at 11:07 PM, Josh Berkus j...@agliodbs.com wrote:
 On 1/11/12 1:25 PM, Dan Scales wrote:
 And just wanted to reiterate one other benefit of double writes -- it 
 greatly reduces the size of the WAL logs.

 Even if you're replicating?

 Yes, but it will increase random I/O on the standby when we replay if
 we don't have FPWs.

The question is how you prevent torn pages when a slave server crashes
during replay.  Right now, the presence of FPIs in the WAL stream,
together with the requirement that replay restart from a checkpoint,
is sufficient to guarantee that any torn pages will be fixed up.  If
you remove FPIs from WAL and don't transmit some substitute information,
ISTM you've lost protection against slave server crashes.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] WIP -- renaming implicit sequences

2012-01-11 Thread Thomas Munro
Hi,

Here is an unfinished patch to implement something which appears on
the TODO list under ALTER: automatic renaming of sequences created
with serial when the table and column names change.  I've often wanted
this feature and it seemed like a good starter project.  I'd be
grateful for any feedback and advice on how I could get it into
acceptable shape.  Example:

hack=# create table foo (id serial primary key);
NOTICE:  CREATE TABLE will create implicit sequence "foo_id_seq" for
serial column "foo.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index
"foo_pkey" for table "foo"
CREATE TABLE
hack=# alter table foo rename to bar;
NOTICE:  ALTER TABLE will rename implicit sequence "foo_id_seq" to "bar_id_seq"
ALTER TABLE
hack=# alter table bar rename id to snacks;
NOTICE:  ALTER TABLE will rename implicit sequence "bar_id_seq" to
"bar_snacks_seq"
ALTER TABLE

Sequences are considered to be renameable if they are owned by the
table, and have a name conforming to the name pattern used by CREATE
TABLE (table_column_seq with optional trailing numbers).  If you've
manually renamed a SEQUENCE so that it doesn't conform, it won't touch
it.  If you've created a SEQUENCE and declared it to be OWNED BY the
table, then it will be renamed only if it happens to conform.

I'm not sure what to do about permissions.  I guess it should silently
skip renaming sequences if the user doesn't have appropriate
privileges.

Useful?  Why would anyone not want this behaviour?  Have I used
inappropriate locking levels?  What should I read to understand the
rules of locking?  Have I failed to handle errors? Have I made memory
ownership mistakes?

Thanks!
Thomas Munro
diff --git a/src/backend/parser/parse_utilcmd.c b/src/backend/parser/parse_utilcmd.c
index 335bdc6..cd6318e 100644
--- a/src/backend/parser/parse_utilcmd.c
+++ b/src/backend/parser/parse_utilcmd.c
@@ -876,6 +876,160 @@ transformTableLikeClause(CreateStmtContext *cxt, TableLikeClause *table_like_cla
 	heap_close(relation, NoLock);
 }
 
+/*
+ * Check if name appears to be a sequence name as generated by CREATE TABLE for
+ * serial columns.  column_name may be NULL to mean any column.
+ */
+static bool
+matchesGeneratedSequenceForm(const char *name,
+			 const char *table_name,
+			 const char *column_name)
+{
+	/* Must start with the table name and an underscore. */
+	size_t table_name_len = strlen(table_name);
+	if (strncmp(name, table_name, table_name_len) != 0)
+		return false;
+	name += table_name_len;
+	if (name[0] != '_')
+		return false;
+	name += 1;
+	if (column_name != NULL)
+	{
+		/* Must match a specific column name. */
+		size_t column_name_len = strlen(column_name);
+		if (strncmp(name, column_name, column_name_len) != 0)
+		return false;
+	name += column_name_len;
+	}
+	else
+	{
+		/* Step over any column name. */
+		name = strrchr(name, '_');
+		if (name == NULL)
+			return false;
+	}
+	/* Must have a trailing 'seq'. */
+	if (strncmp(name, "_seq", 4) != 0)
+		return false;
+	name += 4;
+	/* We tolerate any number of digits at the end */
+	while (*name)
+	{
+		if (*name < '0' || *name > '9')
+		{
+			return false;
+		}
+		++name;
+	}
+	return true;
+}
+
+/*
+ * Given a sequence name as generated for serial columns, attempt to extract
+ * the column name as a newly allocated string.  If it can't be done, return
+ * NULL.
+ */
+static char *
+extractColumnName(const char *seq_name, const char *table_name)
+{
+	const char *column_begin = seq_name + strlen(table_name) + 1;
+	const char *column_end = strrchr(column_begin, '_');
+	if (column_end == NULL)
+	{
+		return NULL;
+	}
+	else 
+	{
+		size_t column_len = column_end - column_begin;
+		char *column = palloc(column_len + 1);
+		strcpy(column, column_begin);
+		column[column_len] = 0;
+		return column;
+	}
+}
+
+/*
+ * Expand a RenameStmt into a list of statements.  If the statement renames a
+ * table or a column, then this builds statements to rename any owned sequences
+ * that have a name apparently created implicitly.
+ */
+List *
+transformRenameStmt(RenameStmt *stmt)
+{
+	List *result = NIL;
+	if (stmt->renameType == OBJECT_TABLE || stmt->renameType == OBJECT_COLUMN)
+	{
+		Oid table_relid;
+		Relation table_rel;
+		List *sequences;
+		ListCell *cell;
+		char *table_name;
+
+		/* Get the table's (current) name. */
+		table_relid = RangeVarGetRelid(stmt->relation, NoLock, false);
+		table_rel = relation_open(table_relid, AccessExclusiveLock);
+		table_name = RelationGetRelationName(table_rel);
+		relation_close(table_rel, NoLock);
+		
+		/*
+		 * Find all owned sequences, and consider renaming them if they appear
+		 * to conform to the naming rule used for implicit sequence generation
+		 * during table creation.  We skip other owned sequences.
+		 */
+		sequences = getOwnedSequences(table_relid);
+		foreach(cell, sequences)
+		{
+			Oid seq_relid = lfirst_oid(cell);
+			Relation seq_rel = relation_open(seq_relid, AccessExclusiveLock);
+			char *seq_name = RelationGetRelationName(seq_rel); /* 

Re: [HACKERS] Remembering bug #6123

2012-01-11 Thread Tom Lane
Kevin Grittner kevin.gritt...@wicourts.gov writes:
 Going back through the patches we had to make to 9.0 to move to
 PostgreSQL triggers, I noticed that I let the issues raised as bug
 #6123 lie untouched during the 9.2 development cycle.  In my view,
 the best suggestion for a solution was proposed by Florian here:
 
 http://archives.postgresql.org/pgsql-hackers/2011-08/msg00388.php

Do you mean this:

After every BEFORE trigger invocation, if the trigger returned
non-NULL, check if latest row version is still the same as when
the trigger started. If not, complain.

While that sounds relatively safe, if possibly performance-impacting,
it's not apparent to me how it fixes the problem you complained of.
The triggers you were using were modifying rows other than the one
being targeted by the triggering action, so a test like the above would
not notice that they'd done anything.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] JSON for PG 9.2

2012-01-11 Thread Andrew Dunstan



On 01/11/2012 10:21 AM, Pavel Stehule wrote:

2012/1/11 Robert Haas robertmh...@gmail.com:

On Wed, Jan 11, 2012 at 9:41 AM, Pavel Stehule pavel.steh...@gmail.com  wrote:

I understand it now. My opinion is so some operators and index search
can be in 9.2 - so use a JSON just as communication format now.

* we need to build JSON
* we need to check if some is valid JSON
* we need to store JSON

other steps should be (9.2)
* basic operators eq, neeq
* some tool like XQuery - simple query on JSON document available from
SQL that can be used for functional indexes.

That would be nice, but let's not let the perfect be the enemy of the
good.  We don't have a lot of time here.


sorry - replace 9.2 by 9.3 - I am sorry

I am able to write the array_to_json function and Andrew can write query_to_json



For those who want to play along, see 
https://bitbucket.org/adunstan/pgdevel which has Robert's patch and my 
additions to it.


I'm actually half way through writing an array_to_json function, since 
it is necessary anyway for query_to_json. I hope to have a fairly 
complete working function in about 24 hours.


cheers

andrew


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] order of operations for pg_restore

2012-01-11 Thread Andrew Hammond
I'm working on a tool that runs pg_restore with -j 4. I notice that
after COPYing in the data, pg_restore does two indexes and a cluster
command in parallel. The first CREATE INDEX is running, the CLUSTER
command is waiting on it and the second CREATE INDEX is waiting on the
CLUSTER. This seems sub-optimal. Would it make sense to run the
CLUSTER command first? I'm pretty sure I can replicate the behavior if
necessary. Running 9.1.2.

Andrew

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP -- renaming implicit sequences

2012-01-11 Thread Tom Lane
Thomas Munro mu...@ip9.org writes:
 Here is an unfinished patch to implement something which appears on
 the TODO list under ALTER: automatic renaming of sequences created
 with serial when the table and column names change.  I've often wanted
 this feature and it seemed like a good starter project.
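
For context, the current behaviour that patch would change can be
demonstrated with a few commands (table and column names are illustrative):

CREATE TABLE foo (id serial PRIMARY KEY);    -- implicitly creates foo_id_seq
ALTER TABLE foo RENAME TO bar;               -- sequence is still foo_id_seq
ALTER TABLE bar RENAME COLUMN id TO bar_id;  -- still foo_id_seq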

Hmm ... this seems a bit inconsistent with the fact that we got rid of
automatic renaming of indexes a year or three back.  Won't renaming of
serials have all the same problems that caused us to give up on renaming
indexes?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [WIP] Double-write with Fast Checksums

2012-01-11 Thread Aidan Van Dyk
On Wed, Jan 11, 2012 at 7:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 The question is how you prevent torn pages when a slave server crashes
 during replay.  Right now, the presence of FPIs in the WAL stream,
 together with the requirement that replay restart from a checkpoint,
 is sufficient to guarantee that any torn pages will be fixed up.  If
 you remove FPIs from WAL and don't transmit some substitute information,
 ISTM you've lost protection against slave server crashes.

This double-write strategy is all an attempt to make writes durable.
You remove the FPW from the WAL stream only because your writes are
made durable using some other strategy, like the double-write.
Any standby will need to be using some strategy to make sure its
writes are durable, namely the same double-write.

So on a standby crash, it will replay whatever FPWs have accumulated
in its double-write buffer, to make sure its writes were consistent.
Exactly as the master would do.

a.

-- 
Aidan Van Dyk                                             Create like a god,
ai...@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] measuring spinning

2012-01-11 Thread Robert Haas
I've had cause, a few times this development cycle, to want to measure
the amount of spinning on each lwlock in the system.  To that end,
I've found the attached patch useful.  Note that if you don't define
LWLOCK_STATS, this changes nothing except that the return value from
s_lock becomes int rather than void.  If you do define LWLOCK_STATS,
then LWLockAcquire() counts the number of pg_usleep() calls that are
required to acquire each LWLock, in addition to the other statistics.
Since this has come up for me a few times now, I'd like to propose
including it in core.

Just to whet your appetite, here are the top spinners on a 32-client
SELECT-only test on a 32-core Itanium server.  All the locks not shown
below have two orders of magnitude less of a problem than these do.

lwlock 48: shacq 6042357 exacq 34590 blk 53 spin 1288
lwlock 42: shacq 5014729 exacq 34942 blk 58 spin 1321
lwlock 43: shacq 5448771 exacq 34725 blk 44 spin 1608
lwlock 44: shacq 6420992 exacq 34980 blk 67 spin 1713
lwlock 35: shacq 6353111 exacq 34256 blk 59 spin 1784
lwlock 38: shacq 6052801 exacq 34913 blk 70 spin 1801
lwlock 46: shacq 6401413 exacq 34698 blk 58 spin 1879
lwlock 36: shacq 6051887 exacq 35309 blk 73 spin 1915
lwlock 45: shacq 6812817 exacq 35170 blk 59 spin 1924
lwlock 33: shacq 6793666 exacq 35009 blk 59 spin 1955
lwlock 34: shacq 6395994 exacq 34941 blk 58 spin 2019
lwlock 40: shacq 6388136 exacq 34579 blk 51 spin 2054
lwlock 37: shacq 7250574 exacq 35242 blk 73 spin 2409
lwlock 39: shacq 7109729 exacq 34892 blk 65 spin 2632
lwlock 47: shacq 8243963 exacq 35256 blk 88 spin 3018
lwlock 328936: shacq 33992167 exacq 0 blk 0 spin 59816
lwlock 4: shacq 33994583 exacq 168 blk 40 spin 74018
lwlock 41: shacq 40098590 exacq 35001 blk 440 spin 81324
grand total: shacq 361775238 exacq 172965995 blk 2315 spin 245342

This is a 5-minute test run.  If I had to take a shot in the dark, the
buffer mapping lock in the #1 spot is the one protecting the root page
of pgbench_accounts_pkey, and the high-numbered lock in position #3 is
the content lock on the buffer itself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] measuring spinning

2012-01-11 Thread Thom Brown
On 12 January 2012 01:48, Robert Haas robertmh...@gmail.com wrote:
 I've had cause, a few times this development cycle, to want to measure
 the amount of spinning on each lwlock in the system.  To that end,
 I've found the attached patch useful.  Note that if you don't define
 LWLOCK_STATS, this changes nothing except that the return value from
 s_lock becomes int rather than void.  If you do define LWLOCK_STATS,
 then LWLockAcquire() counts the number of pg_usleep() calls that are
 required to acquire each LWLock, in addition to the other statistics.
 Since this has come up for me a few times now, I'd like to propose
 including it in core.

 Just to whet your appetite, here are the top spinners on a 32-client
 SELECT-only test on a 32-core Itanium server.  All the locks not shown
 below have two orders of magnitude less of a problem than these do.

 lwlock 48: shacq 6042357 exacq 34590 blk 53 spin 1288
 lwlock 42: shacq 5014729 exacq 34942 blk 58 spin 1321
 lwlock 43: shacq 5448771 exacq 34725 blk 44 spin 1608
 lwlock 44: shacq 6420992 exacq 34980 blk 67 spin 1713
 lwlock 35: shacq 6353111 exacq 34256 blk 59 spin 1784
 lwlock 38: shacq 6052801 exacq 34913 blk 70 spin 1801
 lwlock 46: shacq 6401413 exacq 34698 blk 58 spin 1879
 lwlock 36: shacq 6051887 exacq 35309 blk 73 spin 1915
 lwlock 45: shacq 6812817 exacq 35170 blk 59 spin 1924
 lwlock 33: shacq 6793666 exacq 35009 blk 59 spin 1955
 lwlock 34: shacq 6395994 exacq 34941 blk 58 spin 2019
 lwlock 40: shacq 6388136 exacq 34579 blk 51 spin 2054
 lwlock 37: shacq 7250574 exacq 35242 blk 73 spin 2409
 lwlock 39: shacq 7109729 exacq 34892 blk 65 spin 2632
 lwlock 47: shacq 8243963 exacq 35256 blk 88 spin 3018
 lwlock 328936: shacq 33992167 exacq 0 blk 0 spin 59816
 lwlock 4: shacq 33994583 exacq 168 blk 40 spin 74018
 lwlock 41: shacq 40098590 exacq 35001 blk 440 spin 81324
 grand total: shacq 361775238 exacq 172965995 blk 2315 spin 245342

 This is a 5-minute test run.  If I had to take a shot in the dark, the
 buffer mapping lock in the #1 spot is the one protecting the root page
 of pgbench_accounts_pkey, and the high-numbered lock in position #3 is
 the content lock on the buffer itself.

Patch missing.

-- 
Thom

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] order of operations for pg_restore

2012-01-11 Thread Andrew Dunstan



On 01/11/2012 07:57 PM, Andrew Hammond wrote:

I'm working on a tool that runs pg_restore with -j 4. I notice that
after COPYing in the data, pg_restore does two indexes and a cluster
command in parallel. The first CREATE INDEX is running, the CLUSTER
command is waiting on it and the second CREATE INDEX is waiting on the
CLUSTER. This seems sub-optimal. Would it make sense to run the
CLUSTER command first? I'm pretty sure I can replicate the behavior if
necessary. Running 9.1.2.





Well, we don't actually run CLUSTER. We run a command to mark a table as 
clustered on the index. The nasty part is that it's not a separate TOC 
member, it's in the same TOC as the index creation. But ALTER TABLE has 
different locking requirements from CREATE INDEX. If the clustered index 
is not one created from a constraint we could have the dependencies 
wrong. It looks like this is something we all missed when parallel 
restore was implemented. I think we might need to split the ALTER TABLE 
... CLUSTER from its parent statement.
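
Concretely, what the archive carries for a clustered table is not a CLUSTER
command but a marker statement along these lines (names are illustrative):

ALTER TABLE public.some_table CLUSTER ON some_table_pkey;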


cheers

andrew

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Send new protocol keepalive messages to standby servers.

2012-01-11 Thread Fujii Masao
On Thu, Jan 12, 2012 at 12:20 AM, Simon Riggs si...@2ndquadrant.com wrote:
 +static void
 +ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime)

 walEnd is not used in ProcessWalSndrMessage() at all. Can't we remove it?
 If yes, walEnd field in WalSndrMessage is also not used anywhere, so ISTM
 we can remove it.

 It's there to allow extension of the message processing to be more
 complex than it currently is. Changing the protocol is much harder
 than changing a function call.

 I'd like to keep it since it doesn't have any negative effects.

OK. Another problem with walEnd is that WalDataMessageHeader.walEnd is not
the same kind of location as WalSndrMessage.walEnd. The former indicates the
location up to which WAL has already been flushed (though maybe not sent yet),
i.e., the send-request location. OTOH, the latter indicates the location up
to which WAL has already been sent. Is this inconsistency intentional?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Confusing EXPLAIN output in case of inherited tables

2012-01-11 Thread Ashutosh Bapat
On Wed, Jan 11, 2012 at 5:25 PM, Chetan Suttraway 
chetan.suttra...@enterprisedb.com wrote:



 On Wed, Jan 11, 2012 at 5:13 PM, Ashutosh Bapat 
 ashutosh.ba...@enterprisedb.com wrote:

 Hi,
 After running regression, I ran EXPLAIN on one of the queries in
 regression (test create_misc) and got the following output:
 regression=# explain verbose select * into table ramp from road where
 name ~ '.*Ramp';
                                   QUERY PLAN
 -----------------------------------------------------------------------------
  Result  (cost=0.00..154.00 rows=841 width=67)
    Output: public.road.name, public.road.thepath
    ->  Append  (cost=0.00..154.00 rows=841 width=67)
          ->  Seq Scan on public.road  (cost=0.00..135.05 rows=418 width=67)
                Output: public.road.name, public.road.thepath
                Filter: (public.road.name ~ '.*Ramp'::text)
          ->  Seq Scan on public.ihighway road  (cost=0.00..14.99 rows=367 width=67)
                                          ^^^^
                Output: public.road.name, public.road.thepath
                               ^^^^              ^^^^
                Filter: (public.road.name ~ '.*Ramp'::text)
                                ^^^^
          ->  Seq Scan on public.shighway road  (cost=0.00..3.96 rows=56 width=67)
                Output: public.road.name, public.road.thepath
                Filter: (public.road.name ~ '.*Ramp'::text)
 (12 rows)

 regression=# \d+ road
                            Table "public.road"
  Column  | Type | Modifiers | Storage  | Stats target | Description
 -+--+---+--+--+-
  name| text |   | extended |  |
  thepath | path |   | extended |  |
 Indexes:
     "rix" btree (name)
 Child tables: ihighway,
   shighway
 Has OIDs: no

 Table road has children ihighway and shighway, as seen in the \d+
 output above. The EXPLAIN output of the Seq Scan nodes on the children has
 "public.road" as the prefix for variables. "public.road" could imply the
 parent table road and thus cause confusion as to what is being referred to:
 the columns of the parent table or of the child table. In the EXPLAIN output
 the child tables have "road" as an alias (as against "public.road"). The
 alias comes from RangeTblEntry->eref->aliasname. It might be better to have
 "road" as the prefix in the variable names rather than "public.road".

 The reason why this happens is the code in get_variable():
 3865     /* Exceptions occur only if the RTE is alias-less */
 3866     if (rte->alias == NULL)
 3867     {
 3868         if (rte->rtekind == RTE_RELATION)
 3869         {
 3870             /*
 3871              * It's possible that use of the bare refname would find another
 3872              * more-closely-nested RTE, or be ambiguous, in which case we need
 3873              * to specify the schemaname to avoid these errors.
 3874              */
 3875             if (find_rte_by_refname(rte->eref->aliasname, context) != rte)
 3876                 schemaname = get_namespace_name(get_rel_namespace(rte->relid));
 3877         }

 If there is no alias, we look up the schema name and later add it to the
 prefix. In the inherited-table case, we are actually creating a kind of
 alias for the child table, and thus we should not look up the schema
 name and add it to the prefix. This case has been taken care of in
 get_from_clause_item():
 6505     else if (rte->rtekind == RTE_RELATION &&
 6506              strcmp(rte->eref->aliasname, get_relation_name(rte->relid)) != 0)
 6507     {
 6508         /*
 6509          * Apparently the rel has been renamed since the rule was made.
 6510          * Emit a fake alias clause so that variable references will still
 6511          * work.  This is not a 100% solution but should work in most
 6512          * reasonable situations.
 6513          */
 6514         appendStringInfo(buf, " %s",
 6515                          quote_identifier(rte->eref->aliasname));
 6516         gavealias = true;
 6517     }

 I see similar code in ExplainTargetRel():
 1778     if (objectname == NULL ||
 1779         strcmp(rte->eref->aliasname, objectname) != 0)
 1780         appendStringInfo(es->str, " %s",
 1781                          quote_identifier(rte->eref->aliasname));

 Based on this, here is a patch to not add the schema name to the prefix
 for a variable.

 I have run make check. All except inherit.sql passed. The expected output
 change is included in the patch.

 --
 Best Wishes,
 Ashutosh Bapat
 EnterpriseDB Corporation
 The Enterprise Postgres Company



 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


 A table can inherit from one or more parent tables. So in that case,
 qualifying the schema/table name helps in finding out where the column
 is coming from.


Do you have any example