Re: [HACKERS] How to avoid base backup in automated failover

2012-10-16 Thread Amit Kapila
On Wednesday, October 17, 2012 11:22 AM chinnaobi wrote:
> Hey Haas,
> 
> Does the standby server have to wait for replication to catch up before
> promoting? Is there any parameter to configure this?
> 
> A few more questions on this part:
> 
> 1. How can we ensure that the standby has received all transactions sent by
> the primary up to the point the primary server died? (Meaning the dead
> primary and the standby server are exactly the same, so that when the dead
> primary comes back it can be turned into a standby without any issues.)

> 2. When the dead primary is turned into a standby, streaming does not
> happen because the current WAL location on the new standby is ahead of the
> WAL-sent location. In this case, how can I start streaming without taking
> a fresh base backup?
> 
> 3. When the dead primary comes back, the database still accepts data, so
> it goes out of sync with the current primary and streaming won't start. Is
> there any solution for this case?

I think points 2 and 3 can be addressed by the new feature being
implemented by Heikki, provided the standby has received all of the
primary's WAL up to the point it went down.
 
https://commitfest.postgresql.org/action/patch_view?id=933





Re: [HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ compatibility

2012-10-16 Thread Simon Riggs
On 16 October 2012 23:03, Josh Berkus  wrote:

> Can you explain in more detail how this would be used on the receiving
> side?  I'm unable to picture it from your description.

This will allow implementation of pgq in core, as discussed many times
at cluster hackers meetings.

> I'm also a bit reluctant to call this a "message queue", since it lacks
> the features required for it to be used as an application-level queue.

It's the input end of an application-level queue. In this design the
queue is like a table, so we need SQL grammar to support this new type
of object. "Replication message" doesn't describe it, since it has
little if anything to do with replication, and if anything it's a
message type, not a message.

You're right that Hannu needs to specify the rest of the design and
outline the API. The storage of the queue is "in WAL", which raises
questions about how the API will guarantee we read just once from the
queue and what happens when the queue overflows. The simple answer
would be to put everything in a table somewhere else, but that needs
more careful specification to show we have both ends of the queue and
a working design.

Do we need a new object at all? Can we not just define a record type,
then define messages using that type? At the moment I think the
named-object approach works better, but we should consider that.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] How to avoid base backup in automated failover

2012-10-16 Thread chinnaobi
Hey Haas,

Does the standby server have to wait for replication to catch up before
promoting? Is there any parameter to configure this?

A few more questions on this part:

1. How can we ensure that the standby has received all transactions sent by
the primary up to the point the primary server died? (Meaning the dead
primary and the standby server are exactly the same, so that when the dead
primary comes back it can be turned into a standby without any issues.)

2. When the dead primary is turned into a standby, streaming does not happen
because the current WAL location on the new standby is ahead of the WAL-sent
location. In this case, how can I start streaming without taking a fresh
base backup?

3. When the dead primary comes back, the database still accepts data, so it
goes out of sync with the current primary and streaming won't start. Is
there any solution for this case?

Reddy.







Re: [HACKERS] WebSphere Application Server support for postgres

2012-10-16 Thread Florent Guillaume
On Mon, Oct 15, 2012 at 10:28 AM, Dave Page  wrote:
> [Removing all lists except -hackers. Please do not cross-post to every
> list again!]
>
> On Mon, Oct 15, 2012 at 9:22 AM, John Nash
>  wrote:
>> Hi,
>>
>> Our IT Company systems architecture is based on IBM Websphere
>> Application Server, we would like to migrate our databases to
>> Postgres, the main problem which stops us from doing that is Postgres
>> is not supported by IBM Websphere Application Server.
>> There is a Request for Enhancement that has been opened in IBM Web in
>> order to solve this issue, if you are interested in this enhancement
>> to be done, please vote for the Enhancement in the following link:
>>
>> http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=27313
>
> A login is required to access that site. Can you provide a link or
> info that doesn't require login please?


The relevant content of the ticket is basically just:

Description: WebSphere Application Server support for JDBC access to
PostgreSQL databases.
Use case: In WAS, be able to define and use JDBC providers and
datasources pointing to PostgreSQL databases.

Florent

-- 
Florent Guillaume, Director of R&D, Nuxeo
Open Source, Java EE based, Enterprise Content Management (ECM)
http://www.nuxeo.com   http://www.nuxeo.org   +33 1 40 33 79 87




[HACKERS] Doc patch "only relevant" -> "relevant only"

2012-10-16 Thread Karl O. Pinc
Hi,

As long as I'm sending in trivial fixes to the docs, here's a bit of
wording that's been bugging me.

In a number of places the docs read "only relevant"; this patch
reverses that to read "relevant only".

I believe this reads better because it quickly
answers the question "is what?" with "is relevant",
making the sentence less of a strain to read.
"Only relevant" would be better if you really wanted
to emphasize the "only", which I don't think is called
for.

(Sending in such trivial patches makes me feel like
I'm bikeshedding.  Feel free to ignore them without comment.)

Regards,

Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein

diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index 71cf59e..02763b5 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -3596,7 +3596,7 @@ SELECT person.name, holidays.num_weeks FROM person, holidays
  as the canonical form for MAC addresses, and specifies the first
  form (with colons) as the bit-reversed notation, so that
  08-00-2b-01-02-03 = 01:00:4D:08:04:0C.  This convention is widely
- ignored nowadays, and it is only relevant for obsolete network
+ ignored nowadays, and it is relevant only for obsolete network
  protocols (such as Token Ring).  PostgreSQL makes no provisions
  for bit reversal, and all accepted formats use the canonical LSB
  order.
diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 4503830..b19f15e 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -427,7 +427,7 @@ PostgreSQL documentation
   

 Specify the superuser user name to use when disabling triggers.
-This is only relevant if --disable-triggers is used.
+This is relevant only if --disable-triggers is used.
 (Usually, it's better to leave this out, and instead start the
 resulting script as superuser.)

@@ -602,7 +602,7 @@ PostgreSQL documentation
   --disable-triggers
   

-This option is only relevant when creating a data-only dump.
+This option is relevant only when creating a data-only dump.
 It instructs pg_dump to include commands
 to temporarily disable triggers on the target tables while
 the data is reloaded.  Use this if you have referential
diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml
index 7c49c03..253ee01 100644
--- a/doc/src/sgml/ref/pg_dumpall.sgml
+++ b/doc/src/sgml/ref/pg_dumpall.sgml
@@ -190,7 +190,7 @@ PostgreSQL documentation
   

 Specify the superuser user name to use when disabling triggers.
-This is only relevant if --disable-triggers is used.
+This is relevant only if --disable-triggers is used.
 (Usually, it's better to leave this out, and instead start the
 resulting script as superuser.)

@@ -283,7 +283,7 @@ PostgreSQL documentation
   --disable-triggers
   

-This option is only relevant when creating a data-only dump.
+This option is relevant only when creating a data-only dump.
 It instructs pg_dumpall to include commands
 to temporarily disable triggers on the target tables while
 the data is reloaded.  Use this if you have referential
diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index b276da6..2993369 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -383,7 +383,7 @@
   

 Specify the superuser user name to use when disabling triggers.
-This is only relevant if --disable-triggers is used.
+This is relevant only if --disable-triggers is used.

   
  
@@ -458,7 +458,7 @@
   --disable-triggers
   

-This option is only relevant when performing a data-only restore.
+This option is relevant only when performing a data-only restore.
 It instructs pg_restore to execute commands
 to temporarily disable triggers on the target tables while
 the data is reloaded.  Use this if you have referential




[HACKERS] Doc patch, put commas in the right place in pg_restore docs

2012-10-16 Thread Karl O. Pinc
Hi,

Simple punctuation change to pg_restore docs.


Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index b276da6..e3520c0 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -468,9 +468,9 @@
 

 Presently, the commands emitted for
---disable-triggers must be done as superuser.  So, you
-should also specify a superuser name with -S, or
-preferably run pg_restore as a
+--disable-triggers must be done as superuser.  So you
+should also specify a superuser name with -S or,
+preferably, run pg_restore as a
 PostgreSQL superuser.

   




Re: [HACKERS] Suggestion for --truncate-tables to pg_restore

2012-10-16 Thread Karl O. Pinc
Hi,

Attached is version 3.

The convention seems to be to leave the operator at the
end of the line when breaking long lines, so I've done that.  I've
also added extra parentheses to make operator precedence explicit and
made the indentation reflect it.

On 09/23/2012 08:52:07 PM, Karl O. Pinc wrote:

> On 09/23/2012 12:24:27 AM, Karl O. Pinc wrote:

> > On 09/23/2012 12:19:07 AM, Karl O. Pinc wrote:
> > > On 09/21/2012 10:54:05 AM, Karl O. Pinc wrote:
> > > > On 09/20/2012 12:24:49 PM, Karl O. Pinc wrote:
> > > > 
> > > > > I've had problems using pg_restore --data-only when
> > > > > restoring individual schemas (which contain data which
> > > > > has had bad things done to it).  --clean does not work
> > > > > well because of dependent objects in other schemas.
> > > 
> > > Since there wasn't much more to do I've gone ahead
> > > and written the patch.  Works for me.
> > > 
> > > Against git master.
> > > Passes regression tests, but there's no regression
> > > tests for pg_restore so this does not say much.
> > > Since there's no regression tests I've not written one.
> > > 
> > > Since this is a real patch for application I've given
> > > it a new name (it's not a v2).
> > > 
> > > Truncate done right before COPY, since that's what
> > > the parallel restores do.


Karl 
Free Software:  "You don't pay back, you pay forward."
 -- Robert A. Heinlein

diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
index b276da6..488d8dc 100644
--- a/doc/src/sgml/ref/pg_restore.sgml
+++ b/doc/src/sgml/ref/pg_restore.sgml
@@ -539,6 +539,26 @@
  
 
  
+  --truncate-tables
+  
+   
+This option is relevant only when performing a data-only
+restore.  It instructs pg_restore
+to execute commands to truncate the target tables while the
+data is reloaded.  Use this when restoring tables or schemas
+and --clean cannot be used because dependent
+objects would be destroyed.
+   
+
+   
+ The --disable-triggers option will almost
+ always need to be used in conjunction with this option to
+ disable check constraints on foreign keys.
+   
+  
+ 
+
+ 
   --use-set-session-authorization
   

diff --git a/src/bin/pg_dump/pg_backup.h b/src/bin/pg_dump/pg_backup.h
index 3b49395..0aaf1d3 100644
--- a/src/bin/pg_dump/pg_backup.h
+++ b/src/bin/pg_dump/pg_backup.h
@@ -101,6 +101,8 @@ typedef struct _restoreOptions
 	int			noTablespace;	/* Don't issue tablespace-related commands */
 	int			disable_triggers;		/* disable triggers during data-only
 		 * restore */
+	int			truncate_tables;		/* truncate tables during data-only
+		 * restore */
 	int			use_setsessauth;/* Use SET SESSION AUTHORIZATION commands
  * instead of OWNER TO */
 	int			no_security_labels;		/* Skip security label entries */
diff --git a/src/bin/pg_dump/pg_backup_archiver.c b/src/bin/pg_dump/pg_backup_archiver.c
index 722b3e9..43b5806 100644
--- a/src/bin/pg_dump/pg_backup_archiver.c
+++ b/src/bin/pg_dump/pg_backup_archiver.c
@@ -311,6 +311,11 @@ RestoreArchive(Archive *AHX)
 	if (ropt->createDB && ropt->dropSchema)
 		exit_horribly(modulename, "-C and -c are incompatible options\n");
 
+	/* When the schema is dropped and re-created there is no
+	 * point in truncating tables. */
+	if (ropt->dropSchema && ropt->truncate_tables)
+		exit_horribly(modulename, "-c and --truncate-tables are incompatible options\n");
+
 	/*
 	 * -C is not compatible with -1, because we can't create a database inside
 	 * a transaction block.
@@ -412,6 +417,10 @@ RestoreArchive(Archive *AHX)
 		}
 	}
 
+	/* Truncate tables only when restoring data. */
+	if (!ropt->dataOnly && ropt->truncate_tables)
+		exit_horribly(modulename, "--truncate-tables requires the --data-only option\n");
+
 	/*
 	 * Setup the output file if necessary.
 	 */
@@ -553,6 +562,7 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 	int			retval = 0;
 	teReqs		reqs;
 	bool		defnDumped;
+	bool		truncate;
 
 	AH->currentTE = te;
 
@@ -687,15 +697,22 @@ restore_toc_entry(ArchiveHandle *AH, TocEntry *te,
 		 * server, so no need to see if we should issue BEGIN.
 		 */
 		StartTransaction(AH);
+		truncate = 1;
+	} else
+		/* Truncate the table when asked to. */
+		truncate = ropt->truncate_tables;
 
+	if (truncate) {
 		/*
 		 * If the server version is >= 8.4, make sure we issue
 		 * TRUNCATE with ONLY so that child tables are not
-		 * wiped.
+		 * wiped.  If we don't know the server version
+		 * then err on the side of safety.
 		 */
 		ahprintf(AH, "TRUNCATE TABLE %s%s;\n\n",
- (PQserverVersion(AH->connection) >= 80400 ?
-  "ONLY " : ""),
+ ((!AH->connection ||
+   PQserverVersion(AH->connection) >= 80400) ?
+  "ONLY " : ""),
  fmtId(te->tag));
 	}
 
diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
index f6c

Re: [HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ compatibility

2012-10-16 Thread Josh Berkus
Hannu,

Can you explain in more detail how this would be used on the receiving
side?  I'm unable to picture it from your description.

I'm also a bit reluctant to call this a "message queue", since it lacks
the features required for it to be used as an application-level queue.
"REPLICATION MESSAGE", maybe?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] change in LOCK behavior

2012-10-16 Thread Ants Aasma
On Thu, Oct 11, 2012 at 7:53 PM, Tom Lane  wrote:
> Maybe what we really need is to find a way to make taking a snapshot a
> lot cheaper, such that the whole need for this patch goes away.  We're
> not going to get far with the idea of making SnapshotNow MVCC-safe
> unless it becomes a lot cheaper to get an MVCC snapshot.  I recall some
> discussion of trying to reduce a snapshot to a WAL offset --- did that
> idea crash and burn, or is it still viable?

This was mostly covered in the cheaper snapshots thread. [1] Robert
decided to abandon the idea after concluding that the memory overhead
was untenable with very old snapshots. [2] I had a really hand-wavy
idea of lazily converting snapshots from sequence-number-based
snapshots to traditional list-of-xids snapshots to limit the overhead.
That idea was promptly shot down because in that incarnation it needed
snapshots to be stored in shared memory. [3] I have done some more
thinking on this topic, although I have to admit that it has been on
the backburner. It seems to me that the problems are all surmountable.

To recap briefly, the idea is to define visibility and snapshots
through commit sequence numbers (LSNs have problems due to async
commit). The tricky part is the data structure needed to support fast
xid-to-CSN lookup for visibility checks. To support visibility checks,
enough information needs to be kept so that the oldest CSN-based
snapshot can resolve its xmin-xmax range to CSNs. My idea currently is
to have two fixed-size shared-memory buffers and an overflow log. The
first ring buffer is a dense array mapping xids to CSNs. The overflow
entries from the dense ring buffer are checked to see whether they
might be invisible to any CSN-based snapshot, and if so are inserted
into the sparse buffer. The sparse buffer is a sorted array containing
xid-CSN pairs that are still running or are concurrent with an active
CSN-based snapshot. Once the sparse buffer fills up, the smallest
xid-CSN pairs are evicted to the CSN log. Long-running CSN-based
snapshots then need to read this log to build up the
SnapshotData->xip/subxip arrays. The backends can either discover that
their snapshots' CSN values have overflowed by checking the
appropriate horizon value, or be signaled via an interrupt to enable
CSN log cleanup ASAP.
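
To make the two-buffer scheme concrete, here is a rough sketch of the
shared-memory structures it implies. This is purely illustrative: every
name and size below is invented for the sketch, not taken from any patch.

#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t CommitSeqNo;

#define DENSE_MAP_SIZE  65536   /* recent xids mapped directly to CSNs */
#define SPARSE_MAP_SIZE  1024   /* overflow pairs still of interest */

typedef struct XidCsnEntry
{
    TransactionId xid;
    CommitSeqNo   csn;          /* invalid while the xid is still running */
} XidCsnEntry;

typedef struct CsnShmemState
{
    /* dense ring buffer: the CSN of xid (dense_base + i) is dense_map[i] */
    TransactionId dense_base;
    CommitSeqNo   dense_map[DENSE_MAP_SIZE];

    /* sparse buffer: xid/CSN pairs evicted from the dense ring that may
     * still be invisible to some active CSN-based snapshot, sorted by xid */
    int           sparse_used;
    XidCsnEntry   sparse_map[SPARSE_MAP_SIZE];

    /* pairs evicted from the sparse buffer move to the on-disk CSN log */
} CsnShmemState;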

I still have to work out some details on how to handle subtransaction
overflow, how to maintain reasonably fresh values for different
horizons, and what ordering barriers are necessary to get lock-free
visibility checks. The idea currently seems workable and will make
taking snapshots really cheap, while the worst-case maintenance
overhead is mostly shifted to sessions that run lots of writing
transactions and hold snapshots open for a long time.

If anyone is interested I can do a slightly longer write up detailing
what I have worked out so far.

Ants Aasma


[1] 
http://archives.postgresql.org/message-id/CA%2BTgmoaAjiq%3Dd%3DkYt3qNj%2BUvi%2BMB-aRovCwr75Ca9egx-Ks9Ag%40mail.gmail.com
[2] 
http://archives.postgresql.org/message-id/CA%2BTgmoYD6EhYy1Rb%2BSEuns5smreY1_3rAMeL%3D76rX8deijy56Q%40mail.gmail.com
[3] 
http://archives.postgresql.org/message-id/CA%2BCSw_uDfg2SBMicGNu13bpr2upbnVL_edoTbzvacR1FrNrZ1g%40mail.gmail.com
-- 
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de




Re: [HACKERS] tuplesort memory usage: grow_memtuples

2012-10-16 Thread Peter Geoghegan
On 16 October 2012 22:18, Greg Stark  wrote:
> That's assuming my committer bits haven't lapsed and people are ok
> with me stepping back into things?

I personally have no objections.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services




Re: [HACKERS] tuplesort memory usage: grow_memtuples

2012-10-16 Thread Greg Stark
On Tue, Oct 16, 2012 at 9:47 PM, Peter Geoghegan  wrote:
> The patch will now be marked "ready for committer". Does this need
> doc changes, in light of what is arguably a behavioural difference?
> You only mentioned release notes.

I'm happy to look at this one, probably next week at pgconf.eu. It seems like a
reasonable size patch to get back into things.

That's assuming my committer bits haven't lapsed and people are ok
with me stepping back into things?


-- 
greg




Re: [HACKERS] tuplesort memory usage: grow_memtuples

2012-10-16 Thread Peter Geoghegan
On 16 October 2012 14:24, Simon Riggs  wrote:
> If you describe in detail that it is a heuristic and why that is
> proposed over other approaches that should be sufficient for future
> generations to read and understand.

I've done so, in the attached revision. Things have been simplified
somewhat too.

The same basic strategy for sizing the tuplesort memtuples array
also exists in tuplestore. I wonder if we should repeat this there? I
suppose that could follow later.

The patch will now be marked "ready for committer". Does this need
doc changes, in light of what is arguably a behavioural difference?
You only mentioned release notes.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


sortmem_grow-v3.patch
Description: Binary data



Re: [HACKERS] proposal - assign result of query to psql variable

2012-10-16 Thread Pavel Stehule
2012/10/16 Shigeru HANADA :
> Hi Pavel,
>
> On Tue, Oct 16, 2012 at 6:59 AM, Pavel Stehule 
> wrote:
>> here is an updated patch. I moved a lot of code from the lexer to
>> command.c, and now \gset doesn't disable other backslash commands on
>> the same line.
>
> * lexer changes
> IIUC, the new function psql_scan_varlist_varname scans input and returns
> a variable name or a comma at each call, and command.c handles errors
> such as an invalid number of variables.  This new design seems better
> than the old one.
>
> However, IMHO the name psql_scan_varlist_varname sounds redundant and
> unintuitive.  I'd prefer psql_scan_slash_varlist, because it indicates
> that that function is expected to be used for arguments of backslash
> commands, like psql_scan_slash_command and psql_scan_slash_option.
> Thoughts?
>
> * multiple meta command
> Now both of the command sequences
>
>   $ SELECT 1, 2 \gset var1, var2 \g foo.txt
>   $ SELECT 1, 2 \g foo.txt \gset var1, var2
>
> set var1 and var2 to "1" and "2" respectively, and also write the result
> into foo.txt.  This is what users would expect.
>
> * Duplication of variables
> I found an issue we have not discussed.  Currently \gset accepts the same
> variable name multiple times in the list, and stores the last SELECT item
> in the duplicated variables.  For instance,
>
>   $ SELECT 1, 2 \gset var, var
>
> stores "2" into var.  I think this behavior is acceptable, but it might
> be worth mentioning in document.
>
> * extra fixes
> I fixed some minor issues below.  Please see attached v10 patch for details.
>
>   * remove unused macro OT_VARLIST
>   * remove unnecessary #include directive for common.h
>   * fill comment within 80 columns
>   * indent short variable name with tab
>   * add regression test case for combination of \g and \gset
>
> * bug on FETCH_COUNT = 1
> When FETCH_COUNT is set to 1, and the number of rows returned is 1 too,
> \gset shows extra "(1 row)".  This would be a bug in
> ExecQueryUsingCursor.  Please see the last test case in regression test
> psql_cmd.

I fixed this bug.

Regards

Pavel

>
> I'll mark this patch as "waiting author".
>
> Regards,
> --
> Shigeru HANADA


gset_11.diff
Description: Binary data



Re: [HACKERS] proposal - assign result of query to psql variable

2012-10-16 Thread Pavel Stehule
Hello

here is an updated patch. I moved a lot of code from the lexer to
command.c, and now \gset doesn't disable other backslash commands on
the same line.

Regards

Pavel

2012/10/15 Pavel Stehule :
> 2012/10/15 Pavel Stehule :
>> 2012/10/15 Shigeru HANADA :
>>> Hi Pavel,
>>>
>>> First of all, I'm sorry that my previous review was rough.  I looked at
>>> your patch and the existing code closely again.
>>>
>>> On 2012/10/15, at 12:57, Pavel Stehule  wrote:
>>>> 2012/10/14 Tom Lane :
>>>>> * ExecQueryUsingCursor's concept of how to process command results
>>>>>  (why? surely we don't need \gset to use a cursor)
>>>>
>>>> There were two possibilities, but not using a cursor is clearly the
>>>> better way
>>>
>>> +1 for supporting the case when FETCH_COUNT > 0, because users might
>>> set it mainly for other queries, and they would want to avoid changing
>>> the FETCH_COUNT setting for every query followed by \gset.
>>>
>>>>> * the psql lexer (adding a whole bunch of stuff that probably doesn't
>>>>>  belong there)
>>>>
>>>> ??
>>>
>>> I think that Tom is talking about psql_scan_slash_vars().  It seems too
>>> specific to the \gset command.  How about improving
>>> psql_scan_slash_options() so that it can handle a comma-separated
>>> variable list?  Although you might have tried it before.
>>> # Unused OT_VARLIST macro gave me the idea.
>>
>> yes, it is possible - I'll look at it this evening
>
> reusing psql_scan_slash_options is not a good idea - the interface of
> this function does not fit my purposes - and I cannot signal an error
> from this procedure. But I can minimize psql_scan_slash_var and move a
> lot of code out of the lexer file.
>
>>
>>>
>>>>> * the core psql settings construct (to store something that is in
>>>>>  no way a persistent setting)
>>>>
>>>> ??
>>>
>>> I thought that having \gset arguments in pset is reasonable, since \g
>>> uses pset.gfname to hold its one-shot setting.  Or should we refactor
>>> \g to fit with \gset?  I might be missing Tom's point...
>>>
>>>>> Surely there is a less ugly and invasive way to do this.  The fact
>>>>> that the reviewer keeps finding bizarre bugs like "another backslash
>>>>> command on the same line doesn't work" seems to me to be a good
>>>>> indication that this is touching things it shouldn't.
>>>>
>>>> - all these bugs are based on the lexer construct. A little
>>>> modification of the lexer is possible
>>>
>>> IMHO those issues come from the design rather than the implementation
>>> of the lexer.  AFAIK we don't have consensus about the case where both
>>> \g and \gset are used for a query like this:
>>>
>>> postgres=# SELECT 1 \gset var \\ \g foo.txt
>>>
>>> This seems legal.  Should we store "1" into var and write the result
>>> into foo.txt?  Or only one of them?  It's just an idea and it requires
>>> a new special character, but how about using the \g command for file,
>>> pipe, and variable?  In case we choose '&' for the variable prefix:
>>>
>>> postgres=# SELECT 'hello', 'wonderful', 'world!' \g &var1,,var2
>>>
>>> Anyway, we've had no psql meta-command that processes query results
>>> other than \g.  So we might have more design issues to consider.
>>
>> the current design is rigid - a small change could make it stop parsing
>> the target list when another backslash command is detected
>>
>>>
>>> BTW, what does the word "comma_expected" mean?  It's in the comment
>>> above psql_scan_slash_vars().  It might be a remnant of the old
>>> implementation.
>>
>> This identifier is mistaken - the comment is wrong and relates to the
>> old implementation - sorry. The first design was replaced by the state
>> machine described by VarlistParserState
>>
>>
>>
>>>
>>> Regards,
>>> --
>>> Shigeru HANADA
>>> shigeru.han...@gmail.com
>>>
>>>
>>>
>>>
>>>


gset09.diff
Description: Binary data



Re: [HACKERS] smgrsettransient mechanism is full of bugs

2012-10-16 Thread Alvaro Herrera
Tom Lane wrote:

> After further review, I have become convinced that in fact it's
> completely broken and needs to be redone from scratch.  The temp-file
> marking at the fd.c level can easily get out of sync with the marking
> at the smgr level, and that marking isn't too consistent with reality
> either, which means we have all of the following problems:

Oops.  Sorry about this.  Fortunately, as far as I can see, it only
results in excessive resource consumption, not data corruption or loss.

> I believe that we probably ought to revert this mechanism entirely, and
> build a new implementation based on these concepts:
> 
> * An SMgrRelation is transient if and only if it doesn't have an
> "owning" relcache entry.  Keep a list of all such SmgrRelations, and
> close them all at transaction end.  (Obviously, an SMgrRelation gets
> removed from the list if it acquires an owner mid-transaction.)
> 
> * There's no such concept as FD_XACT_TRANSIENT at the fd.c level.
> Rather, we close and delete the VFD entry when told to by SmgrRelation
> closure.

Makes sense.  It does seem simpler than the original approach.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Truncate if exists

2012-10-16 Thread Stafford, David x78061
On Mon, 15 Oct 2012 10:21:18 -0700, Robert Haas  wrote:
> Yeah, I think the functionality that we need is pretty much there
> already today.  What we need to do is to get the syntax to a point
> where people can write the code they want to write without getting
> tangled up by it.
> 
> I think the invention of DO was a big step in the right direction
> ...
> With DO, you can write the logic you want
> as an SQL statement, it's just a clunky and awkward SQL statement.  In
> my view the goal ought to be to refine that mechanism to remove the
> clunkiness and awkwardness, rather than to invent something completely
> new.

As someone who has worked with a number of databases, I can say that none of
them really gets this DDL integration completely right.  What I'd like to see is 1) a
predicate to easily test things about the schema (does this table, column, 
index, schema, etc. exist?  does it have the right type?) and 2) a way to 
conditionally execute DDL (and DML, which should fall right out, but it isn't 
really what this discussion is covering).  I would propose extending the 
current EXISTS / NOT EXISTS predicate as follows:

[NOT] EXISTS TABLE tab [ ( col [type] [ , col [type]]... ) ]
[NOT] EXISTS COLUMN tab.col [type] 
[NOT] EXISTS INDEX tab.idx [ ( col [ , col]... ) ]
[NOT] EXISTS CONSTRAINT tab.cstrt -- not sure what else might be useful here
[NOT] EXISTS ( select ) -- this is the current EXISTS predicate, of course
[NOT] EXISTS ANY ROWS FROM tab [join] [WHERE predicate] [GROUP BY col [ , col]...]
   -- exactly the same as (select 1 FROM etc.), only because I like it better
(the latter [which by no means am I nuts over; it's just that when extending
EXISTS I can't stop neatening it up to my personal preferences] could be
extended with [NOT] EXISTS MORE THAN n ROWS FROM and [NOT] EXISTS EXACTLY n
ROWS FROM.)

There is a new SQL statement: IF predicate true-statement [ELSE 
false-statement].

To actually execute this new IF statement, the executor would need an IF node 
that evaluates the predicate (with ANDs and ORs, just like all SQL predicates) 
and then executes the rest only if the predicate came out TRUE (or NOT FALSE; I 
forget which is usually used, and the difference with NULL could be useful, as 
long as it matches other predicates).  This moves one more bit of procedural 
logic into the executor.  

Another wrinkle is that the dependent statement might not compile, due to 
missing tables or whatnot.  Actually executing it while it doesn't compile is 
an error, but we want to defer that error until we actually decide we need to 
execute it.  Also, it's probably good to try compiling it again at that point.  
So my thought would be to try planning the dependent statement(s); if they 
compile, hook them to the IF node; if not, hook a DEFERRED node to the IF node. 
 The DEFERRED node has the parse tree (or raw string, whatever makes sense) of 
the statement; on execution it tries again to plan that statement; if it 
succeeds, run it; if not, error out.

I'd also add a SEQUENCE node to the executor.  It just runs its children in 
order (could be n-ary, or if fixed arity nodes are what is in the 
planner/executor today, could be binary, first left, then right, and right 
could be another SEQUENCE).  The DEFERRED node means that a CREATE statement 
could precede use of what is created in the same sequence and all could get 
planned (with some deferral to execution time) in advance and run in one lump.  
This implements DO at the executor level.
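
For concreteness, here is one purely illustrative shape the proposed IF,
SEQUENCE, and DEFERRED nodes could take.  Nothing like this exists in
PostgreSQL today; every name below is invented for the sketch.

typedef enum DdlNodeTag { NODE_IF, NODE_SEQUENCE, NODE_DEFERRED } DdlNodeTag;

typedef struct DdlNode DdlNode;

struct DdlNode
{
    DdlNodeTag tag;
    union
    {
        struct              /* IF: run true_child only if predicate holds */
        {
            DdlNode *predicate;     /* e.g. EXISTS TABLE ... */
            DdlNode *true_child;
            DdlNode *false_child;   /* optional ELSE branch */
        } if_node;
        struct              /* SEQUENCE: run left child, then right child */
        {
            DdlNode *left;
            DdlNode *right;         /* may itself be another SEQUENCE */
        } sequence;
        struct              /* DEFERRED: statement that failed to plan */
        {
            const char *query_string;   /* re-planned at execution time;
                                         * error out if it still fails */
        } deferred;
    } u;
};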

The biggest concepts left from plpgsql are looping and variables.  Most 
variables could be modeled as a single row value; SQL already can update a row, 
so the planning of assignments and calculations of scalars (and arrays, I 
think) already fits into things the planner knows about.  Table variables 
(which I don't know that plpgsql supports, but someday it should) are less 
defined.  Adding plpgsql's loops to the executor would let whole functions run 
under one trip through the executor.  This is beyond just improving the DDL 
support for scripts.

I have written a number of database upgrade scripts.  Over time we've made them 
less fragile, by checking for the existence of tables, indexes, and most 
recently, columns.  The usual sequence is:
1) check the existence of an index; check that the first few columns are 
correct; if not, drop the index
2) repeat for other indexes that have changed definition over time
3) check the existence of the table; create with current layout if it is missing
4) check for the presence of a column; if missing, alter the table to add it 
(of course, we can only add new columns at the end, and occasionally delete a 
column)
5) repeat for more columns
6) check the existence of 

Re: [HACKERS] proposal - assign result of query to psql variable

2012-10-16 Thread Shigeru HANADA
Hi Pavel,

On Tue, Oct 16, 2012 at 6:59 AM, Pavel Stehule 
wrote:
> here is an updated patch. I moved a lot of code from the lexer to
> command.c, and now \gset doesn't disable other backslash commands on
> the same line.

* lexer changes
IIUC, the new function psql_scan_varlist_varname scans input and returns
a variable name or a comma at each call, and command.c handles errors
such as an invalid number of variables.  This new design seems better
than the old one.

However, IMHO the name psql_scan_varlist_varname sounds redundant and
unintuitive.  I'd prefer psql_scan_slash_varlist, because it indicates
that that function is expected to be used for arguments of backslash
commands, like psql_scan_slash_command and psql_scan_slash_option.
Thoughts?

* multiple meta command
Now both of the command sequences

  $ SELECT 1, 2 \gset var1, var2 \g foo.txt
  $ SELECT 1, 2 \g foo.txt \gset var1, var2

set var1 and var2 to "1" and "2" respectively, and also write the result
into foo.txt.  This is what users would expect.

* Duplication of variables
I found an issue we have not discussed.  Currently \gset accepts the same
variable name multiple times in the list, and stores the last SELECT item
in the duplicated variables.  For instance,

  $ SELECT 1, 2 \gset var, var

stores "2" into var.  I think this behavior is acceptable, but it might
be worth mentioning in document.

* extra fixes
I fixed some minor issues below.  Please see attached v10 patch for details.

  * remove unused macro OT_VARLIST
  * remove unnecessary #include directive for common.h
  * fill comment within 80 columns
  * indent short variable name with tab
  * add regression test case for combination of \g and \gset

* bug on FETCH_COUNT = 1
When FETCH_COUNT is set to 1, and the number of rows returned is 1 too,
\gset shows extra "(1 row)".  This would be a bug in
ExecQueryUsingCursor.  Please see the last test case in regression test
psql_cmd.

I'll mark this patch as "waiting author".

Regards,
-- 
Shigeru HANADA
*** a/doc/src/sgml/ref/psql-ref.sgml
--- b/doc/src/sgml/ref/psql-ref.sgml
***
*** 1483,1490  testdb=>
  way. Use \i for that.) This means that
  if the query ends with (or contains) a semicolon, it is
  immediately executed. Otherwise it will merely wait in the
! query buffer; type semicolon or \g to send it, or
! \r to cancel.
  
  
  
--- 1483,1490 
  way. Use \i for that.) This means that
  if the query ends with (or contains) a semicolon, it is
  immediately executed. Otherwise it will merely wait in the
! query buffer; type semicolon, \g or
! \gset to send it, or \r to cancel.
  
  
  
***
*** 1617,1622  Tue Oct 26 21:40:57 CEST 1999
--- 1617,1644 

  

+ \gset variable [ ,variable ... ] 
+ 
+ 
+ 
+  Sends the current query input buffer to the server and stores the
+  query's output into the corresponding variables.  The preceding query
+  must return only one row, and the number of variables must be the same
+  as the number of elements in the SELECT list.  If
+  you don't need one of the items in the SELECT
+  list, you can omit the corresponding variable.
+  Example:
+ 
+ foo=> SELECT 'hello', 'wonderful', 'world!' \gset var1,,var3 
+ foo=> \echo :var1 :var3
+ hello world!
+ 
+ 
+ 
+   
+ 
+   
  \h or \help [ command ]
  
  
*** a/src/bin/psql/command.c
--- b/src/bin/psql/command.c
***
*** 71,77  static void printSSLInfo(void);
  static void checkWin32Codepage(void);
  #endif
  
! 
  
  /*--
   * HandleSlashCmds:
--- 71,86 
  static void checkWin32Codepage(void);
  #endif
  
! /*
!  * Possible states for the simple state machine used for parsing the
!  * target list - a list of variable names separated by commas.
!  */
! typedef enum
! {
! 	VARLIST_INITIAL,
! 	VARLIST_EXPECTED_COMMA,
! 	VARLIST_EXPECTED_COMMA_OR_IDENT
! } VarlistParserState;
  
  /*--
   * HandleSlashCmds:
***
*** 748,753  exec_command(const char *cmd,
--- 757,831 
  		status = PSQL_CMD_SEND;
  	}
  
+ 	/* \gset send query and store result */
+ 	else if (strcmp(cmd, "gset") == 0)
+ 	{
+ 		bool value_is_valid;
+ 		char *value;
+ 		VarlistParserState state = VARLIST_INITIAL;
+ 
+ 		/* expected valid target list */
+ 		success = true;
+ 
+ 		pset.gvars = NULL;
+ 		while ((value = psql_scan_varlist_varname(scan_state, &value_is_valid)))
+ 		{
+ 			if (value_is_valid && success)
+ 			{
+ if (strcmp(value, ",") == 0)
+ {
+ 	if (state == VARLIST_INITIAL ||
+ 			state == VARLIST_EXPECTED_COMMA_OR_IDENT)
+ 		pset.gvars = tglist_add(pset.gvars, NULL);
+ 	state = VARLIST_EXPECTED_COMMA_OR_IDENT;
+ }
+ else
+ {
+ 	if (state == VARLIST_INITIAL ||
+ 			state == VARLIST_EXPECTED_COMMA_OR_IDENT)
+ 	{
+ 		pset.gvars = tglist_add(pset.gvars, value);
+ 		

Re: [HACKERS] Global Sequences

2012-10-16 Thread Tom Lane
Simon Riggs  writes:
> On 16 October 2012 17:15, Tom Lane  wrote:
>> So I fully
>> expect that we're going to need something different from bog-standard
>> CREATE SEQUENCE.

> There's no point in that at all, as explained. It's sequences that
> need to work. We can already call my_nextval() rather than nextval()
> if we want a roll-your own sequence facility and can rewrite
> applications to call that, assuming UUID isn't appropriate.

I wasn't objecting to the concept of allowing nextval() to have
overloaded behaviors; more saying that that wasn't where to start the
design process.

In particular, the reason proposing a hook first seems backwards is that
if we have a catalog-level representation that some sequences are local
and others not, we should be using that to drive the determination of
whether to call a substitute function --- and maybe which one to call.
For instance, I could see attaching a function OID to each sequence
and then having nextval() call that function, instead of a hook per se.

Or maybe better, invent a level of indirection like a "sequence access
method" (comparable to index access methods) that provides a compatible
set of substitute functions for sequence operations.  If you want to
override nextval() for a sequence, don't you likely also need to
override setval(), currval(), etc?  Not to mention overriding ALTER
SEQUENCE's behavior.
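
As one purely illustrative reading of that idea, such an access method
might boil down to a function table along these lines; no such API
exists, and every name here is invented for the sketch.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Oid;

/* Hypothetical function table for a "sequence access method". */
typedef struct SequenceAmRoutine
{
    int64_t (*nextval) (Oid seqoid);
    int64_t (*currval) (Oid seqoid);
    void    (*setval)  (Oid seqoid, int64_t value, bool is_called);
    void    (*alter)   (Oid seqoid, const char *options); /* ALTER SEQUENCE */
} SequenceAmRoutine;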

regards, tom lane




Re: [HACKERS] Global Sequences

2012-10-16 Thread Simon Riggs
On 16 October 2012 17:17, Peter Eisentraut  wrote:
> On 10/16/12 9:20 AM, Simon Riggs wrote:
>> I've proposed a plugin for the allocation only. So the allocation
>> looks like anything you want.
>
> Are you planning to provide a reference implementation of some kind?

I'll provide hooks and a stub for testing.

A full implementation relies upon the physical transport used. For
BDR, there will be a full working version that relies upon that. TPL,
PGDG.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Global Sequences

2012-10-16 Thread Simon Riggs
On 16 October 2012 17:15, Tom Lane  wrote:
> Simon Riggs  writes:
>> On 16 October 2012 15:15, Tom Lane  wrote:
>>> I think this is a fundamentally wrong way to go about doing what you
>>> want to do.  It presumes that DDL-level manipulation of global sequences
>>> is exactly like local sequences; an assumption that is obviously false.
>
>> The message queue concept doesn't exist at all yet, so when we create
>> it we can specify anything we want. That is a different situation and
>> hence a different solution. CREATE SEQUENCE is SQL Standard and used
>> by SERIAL, many people's SQL, SQL generation tools etc.. My objective
>> is to come up with something that makes the standard code work
>> correctly in a replicated environment.
>
> I challenge you to find anything in the SQL standard that suggests that
> sequences have any nonlocal behavior.

No need for challenge, I agree, the SQL standard doesn't speak about
that. I didn't say it did.

> If anything, what you propose
> violates the standard,

And so that doesn't follow, but anyway...

> it doesn't make us follow it more closely.

I wasn't arguing that my proposal did that (made us follow standard
more closely).

My point is that application code exists that expects sequences to
Just Work, and so the aim of the proposal was to do that in a
replicated environment as well as single node.

> Furthermore, I find it hard to believe that people won't want both
> local and global sequences in the same database --- so one way or the
> other we need some DDL-level reflection of the difference.

Agreed, which is why I proposed some DDL-level syntax. Was that OK?

> A larger point though is that the various implementation choices you
> mentioned probably need to be configurable by DDL options.  I doubt that
> it will work well to say "install plugin A to get behavior X, or install
> plugin B to get behavior Y, and whichever you choose is not further
> configurable, it'll be the same behavior for all sequences".

Again, I accept that as of my last post, and I proposed syntax to provide it.

> So I fully
> expect that we're going to need something different from bog-standard
> CREATE SEQUENCE.

There's no point in that at all, as explained. It's sequences that
need to work. We can already call my_nextval() rather than nextval()
if we want a roll-your own sequence facility and can rewrite
applications to call that, assuming UUID isn't appropriate.

Please don't force people to rewrite their applications; it might not
go in the direction we want.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Global Sequences

2012-10-16 Thread Daniel Farina
On Tue, Oct 16, 2012 at 5:54 AM, Peter Eisentraut  wrote:
> On 10/15/12 5:33 PM, Simon Riggs wrote:
>> There are a few options
>> 1) Manual separation of the value space, so that N1 has 50% of
>> possible values and N2 has 50%. That has problems when we reconfigure
>> the cluster, and requires complex manual reallocation of values. So it
>> starts good but ends badly.
>> 2) Automatic separation of the value space. This could mimic the
>> manual operation, so it does everything for you - but thats just
>> making a bad idea automatic
>> 3) Lazy allocation from the value space. When a node is close to
>> running out of values, it requests a new allocation and coordinates
>> with all nodes to confirm the new allocation is good.
>
> What would the allocation service look like?  Is it another PostgreSQL
> server?  What's the communication protocol?  How would backups work?
> Crash recovery?

As a reasonable proxy for the first question, one may look at how
Twitter uses its home-grown software, Snowflake.

https://github.com/twitter/snowflake

A colleague, Blake Mizerany, wrote a smaller version called "noeqd",
based on the same ideas, but he wanted something with fewer
dependencies.  Unless you are very Java-library-and-toolchain adept
you might find this more fun to play with.

https://github.com/bmizerany/noeqd

-- 
fdr




Re: [HACKERS] Fix for log_line_prefix and session display

2012-10-16 Thread Bruce Momjian

Applied.

---

On Mon, Oct 15, 2012 at 02:22:33PM -0400, Bruce Momjian wrote:
> On Mon, Oct 15, 2012 at 10:01:29AM +0200, Albe Laurenz wrote:
> > Bruce Momjian wrote:
> > > Currently, our session id, displayed by log_line_prefix and CSV
> > output,
> > > is made up of the session start time epoch seconds and the process id.
> > > The problem is that the printf mask is currently %lx.%x, causing a
> > > process id less than 4096 to not display a full four hex digits after
> > > the decimal point.  I think this is confusing because the number .423
> > > appears higher than .1423, though it is not.  Here is what our current
> > > output looks like with log_line_prefix="%c: ":
> > > 
> > >   50785b3e.7ff9: ERROR:  syntax error at or near "test" at character 1
> > >   50785b3e.7ff9: STATEMENT:  test
> > >   50785b3e.144: ERROR:  syntax error at or near "test" at character 1
> > >   50785b3e.144: STATEMENT:  test
> > > 
> > > With my fix, here is the updated output:
> > > 
> > >   507864d3.7ff2: ERROR:  syntax error at or near "test" at character 1
> > >   507864d3.7ff2: STATEMENT:  test
> > >   507864d3.013d: ERROR:  syntax error at or near "test" at character 1
> > >   507864d3.013d: STATEMENT:  test
> > > 
> > > Patch attached.
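
For illustration, here is a minimal standalone program showing the
presumed mask change; the zero-padded %04x mask is inferred from the
outputs above rather than taken from the patch itself.

#include <stdio.h>

int main(void)
{
    long start_time = 0x507864d3;   /* session start time, epoch seconds */
    int  pid = 0x13d;               /* a process id below 4096 */

    printf("%lx.%x\n", start_time, pid);    /* old mask: 507864d3.13d  */
    printf("%lx.%04x\n", start_time, pid);  /* new mask: 507864d3.013d */
    return 0;
}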
> > 
> > Do you think that anybody wants to apply a linear ordering on
> > the second part of the session ID?  If you need the pid, you
> > can use %p.
> > 
> > I would say that this change makes sense if it causes disturbance
> > that the part after the period can be than 4 characters long
> > (it did not disturb me when I wrote a log file parser).
> > 
> > If that need is not urgent enough, maybe it would be better to
> > preserve the current behaviour in the (unlikely) event that somebody
> > relies on it.
> 
> I don't think anyone is picking apart the session id, but I do think the
> current output is confusing because the session id string length is
> pretty variable.  Anyone who is parsing the current session id will
> easily be able to parse the more consistent output.
> 
> -- 
>   Bruce Momjian  http://momjian.us
>   EnterpriseDB http://enterprisedb.com
> 
>   + It's impossible for everything to be true. +
> 
> 

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +




Re: [HACKERS] Global Sequences

2012-10-16 Thread Peter Eisentraut
On 10/16/12 9:20 AM, Simon Riggs wrote:
> I've proposed a plugin for the allocation only. So the allocation
> looks like anything you want.

Are you planning to provide a reference implementation of some kind?





Re: [HACKERS] Global Sequences

2012-10-16 Thread Tom Lane
Simon Riggs  writes:
> On 16 October 2012 15:15, Tom Lane  wrote:
>> I think this is a fundamentally wrong way to go about doing what you
>> want to do.  It presumes that DDL-level manipulation of global sequences
>> is exactly like local sequences; an assumption that is obviously false.

> The message queue concept doesn't exist at all yet, so when we create
> it we can specify anything we want. That is a different situation and
> hence a different solution. CREATE SEQUENCE is SQL Standard and used
> by SERIAL, many people's SQL, SQL generation tools etc.. My objective
> is to come up with something that makes the standard code work
> correctly in a replicated environment.

I challenge you to find anything in the SQL standard that suggests that
sequences have any nonlocal behavior.  If anything, what you propose
violates the standard, it doesn't make us follow it more closely.

Furthermore, I find it hard to believe that people won't want both
local and global sequences in the same database --- so one way or the
other we need some DDL-level reflection of the difference.

A larger point though is that the various implementation choices you
mentioned probably need to be configurable by DDL options.  I doubt that
it will work well to say "install plugin A to get behavior X, or install
plugin B to get behavior Y, and whichever you choose is not further
configurable, it'll be the same behavior for all sequences".  So I fully
expect that we're going to need something different from bog-standard
CREATE SEQUENCE.  Exactly what isn't clear --- but I think modifying
nextval's behavior is way down the list of concerns.

regards, tom lane




[HACKERS] Bugs in planner's equivalence-class processing

2012-10-16 Thread Tom Lane
I looked into the problem reported in bug #7604 (Bill MacArthur was kind
enough to send me a reproducer off-list).  The error can be demonstrated
by this example in the regression test database:

select f1, unique2, case when unique2 is null then f1 else 0 end
  from int4_tbl a left join tenk1 b on f1 = unique2
  where (case when unique2 is null then f1 else 0 end) = 0;

If you look at the output of the basic join:

regression=# select f1, unique2, case when unique2 is null then f1 else 0 end
regression-#   from int4_tbl a left join tenk1 b on f1 = unique2;
      f1     | unique2 |     case    
-------------+---------+-------------
           0 |       0 |           0
      123456 |         |      123456
     -123456 |         |     -123456
  2147483647 |         |  2147483647
 -2147483647 |         | -2147483647
(5 rows)

it's obvious that only the first of these rows passes the added WHERE
condition, so the query should produce just that row; but what you
actually get in 9.2 is all five rows!  EXPLAIN gives a clue what's
going wrong:

 Nested Loop Left Join  (cost=0.00..22.51 rows=1 width=8)
   ->  Seq Scan on int4_tbl a  (cost=0.00..1.05 rows=5 width=4)
   ->  Index Only Scan using tenk1_unique2 on tenk1 b  (cost=0.00..4.28 rows=1 width=4)
         Index Cond: (unique2 = a.f1)
         Filter: (CASE WHEN (unique2 IS NULL) THEN a.f1 ELSE 0 END = 0)
(5 rows)

The filter condition has been dropped down to the scan of tenk1 (relying
on a.f1 being passed in as a parameter), which is wrong since there it
is unable to filter null-extended rows produced by the left join
operator.

At first I didn't see how this could be happening: the code for placing
quals looks at RestrictInfo.nullable_relids, and this qual should have
tenk1 in its nullable_relids since it's above a left join to tenk1.
But gdb soon told me that that wasn't what join_clause_is_movable_to was
seeing.  Eventually I figured out that the equivalence-class machinery
was eating the clause (forming an EquivalenceClass consisting of the
CASE expression and the constant 0) and reconstructing a new clause that
didn't have nullable_relids set.

So this is essentially an oversight in the patch that added tracking of
nullable_relids.  I got confused about the difference between
outerjoin_delayed (this clause, as a whole, is not outerjoin_delayed
because its natural semantic level would be at the join anyway) and
having nonempty nullable_relids, and thought that equivalence-class
processing would never see such clauses because it doesn't accept
outerjoin_delayed clauses.  So there's no code in equivclass.c to
compute nullable_relids sets for constructed clauses.  At some point
it might be worth adding same, but for now I'm just going to tweak
distribute_qual_to_rels to not believe that clauses with nonempty
nullable_relids can be equivalence clauses.

It gets worse though.  Tracking clauses' nullable_relids is new in 9.2,
but variants of this example can be demonstrated to fail as far back
as 7.4.  Here's one:

select q1, unique2, thousand, hundred
 from int8_tbl a left join tenk1 b on q1 = unique2
where coalesce(thousand,123) = q1 and q1 = coalesce(hundred,123);

If you look at just the bare join output, there are no rows that
should pass the WHERE clause, but actually what you get in 7.4-9.1
is

 q1  | unique2 | thousand | hundred 
-----+---------+----------+---------
 123 |         |          |        
 123 |         |          |        
(2 rows)

which is completely broken because it's not even a subset of the bare
join output :-(.  9.1 shows this EXPLAIN output:

 Nested Loop Left Join  (cost=0.00..42.48 rows=1 width=20)
   Filter: (a.q1 = COALESCE(b.thousand, 123))
   ->  Seq Scan on int8_tbl a  (cost=0.00..1.05 rows=5 width=8)
   ->  Index Scan using tenk1_unique2 on tenk1 b  (cost=0.00..8.27 rows=1 width=12)
         Index Cond: (a.q1 = unique2)
         Filter: (COALESCE(thousand, 123) = COALESCE(hundred, 123))

What's happening here is that because the two WHERE clauses are not
outerjoin_delayed, they're being deconstructed as equivalence clauses,
and then we synthesize an equivalence constraint on the two COALESCE
expressions that can be applied at the tenk1 scan level.  That gets
rid of the row that should join to q1=123, allowing the bogus
null-extended join rows to be formed.

So this shows that, although 9.2 has a more extensive form of the
disease, the use of outerjoin_delayed to decide whether a clause can be
an equivalence clause is really quite wrong.  I believe what I need to
do to fix this is back-port the 9.2 logic that computes a
nullable_relids set for each clause.  We won't need to store it, just
check whether it's empty before letting the clause be treated as an
equivalence clause.

Is anybody concerned about the compatibility implications of fixing this
bug in the back branches?  I'm worried about people complaining that we
broke their application in a minor release.  Maybe they were depending
on incorrect behavior, but they might complain anyway.

Re: [HACKERS] pg_stat_lwlocks view - lwlocks statistics, round 2

2012-10-16 Thread Satoshi Nagayasu

2012/10/16 2:40, Jeff Janes wrote:

On Sun, Oct 14, 2012 at 9:43 AM, Tom Lane  wrote:

Satoshi Nagayasu  writes:

(2012/10/14 13:26), Fujii Masao wrote:

Tracing lwlock usage seems to still cause a small performance
overhead even if reporting is disabled. I believe some users would
prefer to avoid such overhead even if pg_stat_lwlocks is not available.
It should be up to a user to decide whether to trace lwlock usage, e.g.,
by using trace_lwlock parameter, I think.



Frankly speaking, I do not agree with disabling performance
instrumentation to improve performance. A DBA must *always* monitor
the performance metrics when running such a heavy workload.


This brings up a question that I don't think has been honestly
considered, which is exactly whom a feature like this is targeted at.
TBH I think it's of about zero use to DBAs (making the above argument
bogus).  It is potentially of use to developers, but a DBA is unlikely
to be able to do anything about lwlock-level contention even if he has
the knowledge to interpret the data.


Waiting on BufFreelistLock suggests increasing shared_buffers.

Waiting on ProcArrayLock perhaps suggests use of a connection pooler
(or does it?)

WALWriteLock suggests doing something about IO, either moving logs to
different disks, or getting BBU, or something.

WALInsertLock suggests trying to adapt your data loading process so it
can take advantage of bulk loading, or maybe increasing wal_buffers.

And a lot of waiting on any of the locks gives a piece of information
the DBA can use when asking the mailing lists for help, even if it
doesn't allow him to take unilateral action.
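
For illustration, the kind of check a DBA might run could look like
this (view and column names follow the pg_stat_lwlocks proposal in this
thread; they are hypothetical, not a released API):

  SELECT lwlockid, calls, waits, time_ms
    FROM pg_stat_lwlocks
   ORDER BY waits DESC
   LIMIT 5;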


So I feel it isn't something that should be turned on in production
builds.  I'd vote for enabling it by a non-default configure option,
and making sure that it doesn't introduce any overhead when the option
is off.


I think hackers would benefit from getting reports from DBAs in the
field with concrete data on bottlenecks.

If the only way to get this is to do some non-standard compile and
deploy it to production, or to create a "benchmarking" copy of the
production database system including a realistic workload driver and
run the non-standard compile there, either of those is going to
dramatically cut down on participation.


Agreed.

The hardest thing in investigating a performance issue is
reproducing the situation in an environment different
from the production environment.

I often see people struggling to reproduce a situation
with different hardware and a (similar but) different
workload. It is very time consuming, and it often
fails.

So we need to collect every piece of information that
would help us understand what's going on within
the production PostgreSQL, without any changes to
binaries or configuration in the production environment.

That's the reason why I stick to a "built-in" instrument,
and I disagree with disabling such an instrument even if
it has minor performance overhead.

A flight recorder must not be disabled. Collecting
performance data must be a top priority for a DBA.

Regards,



Cheers,

Jeff




--
Satoshi Nagayasu 
Uptime Technologies, LLC. http://www.uptime.jp


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

2012-10-16 Thread Jan Wieck

On 10/15/2012 3:25 PM, Andres Freund wrote:

On Monday, October 15, 2012 09:18:57 PM Peter Geoghegan wrote:

On 15 October 2012 19:19, Bruce Momjian  wrote:
> I think Robert is right that if Slony can't use the API, it is unlikely
> any other replication system could use it.

I don't accept that. Clearly there is a circular dependency, and
someone has to go first - why should the Slony guys invest in adopting
this technology if it is going to necessitate using a forked Postgres
with an uncertain future?


Well. I don't think (hope) anybody proposed making something release worthy for
Slony but rather a POC patch that proves the API is generic enough to be used
by them. If I (or somebody else familiar with this) work together with somebody
familiar with Slony internals I think such a POC shouldn't be too hard to
do.
I think some more input from that side is a good idea. I plan to send out an
email to possibly interested parties in about two weeks...


What Slony essentially sends to the receiver node is a COPY stream in 
the format Christopher described. That stream is directly copied into 
the receiving node's sl_log_N table and picked up there by a BEFORE 
INSERT apply trigger, which performs the corresponding 
INSERT/UPDATE/DELETE operation via prepared plans against the user tables.


For a POC I think it is sufficient to demonstrate that this copy stream 
can be generated out of the WAL decoding.


Note that Slony today does not touch columns in an UPDATE that have not 
changed in the original UPDATE on the origin. Sending toasted column 
values that haven't changed would be a substantial change to the 
storage efficiency on the replica. The consequence of this is that the 
number of columns that need to be in the UPDATE's SET clause varies. The 
log_cmdupdncols is there to separate the new column/value pairs from the 
column/key pairs of the updated row. The old row "key" in Slony is based 
on a unique index (preferably a PK, but any unique key will do). That 
makes cmdupdncols simply the number of column/value pairs minus the 
number of key columns. So it isn't too hard to figure out.
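
A minimal worked illustration, assuming a hypothetical table
t(id int PRIMARY KEY, a int, b int):

  UPDATE t SET a = 1 WHERE id = 5;

The logged row then carries two column/value pairs - (a,1) from the SET
clause plus (id,5) for the key - so log_cmdupdncols = 2 - 1 = 1, i.e. one
new-value pair precedes the key pair.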



Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-16 Thread Simon Riggs
On 16 October 2012 15:15, Tom Lane  wrote:
> Simon Riggs  writes:
>> So, proposal is to allow nextval() allocation to access a plugin,
>> rather than simply write a WAL record and increment. If the plugin is
>> loaded all sequences call it (not OIDs).
>
> I think this is a fundamentally wrong way to go about doing what you
> want to do.  It presumes that DDL-level manipulation of global sequences
> is exactly like local sequences; an assumption that is obviously false.
> What you really want is something vaguely like nextval but applied to
> a distinct type of object.  That is, I think we first need a different
> kind of object called a "global sequence" with its own DDL operations.
>
> The nearby thread about "message queue" objects seems rather closely
> related.  Perhaps it would be fruitful to think about the commonalities
> involved in two (or more?) new relkinds for global objects.

The message queue concept doesn't exist at all yet, so when we create
it we can specify anything we want. That is a different situation and
hence a different solution. CREATE SEQUENCE is SQL Standard and used
by SERIAL, many people's SQL, SQL generation tools, etc. My objective
is to come up with something that makes the standard code work
correctly in a replicated environment.

If rewriting the application was acceptable, we could just do as Peter
suggests and say "use UUIDs". Many other people who think rewriting
everything is OK spell that "CouchDB" etc.. But that doesn't solve the
problem at hand, which is making existing things work, rather than
force people to rethink and rewrite.

So CREATE GLOBAL SEQUENCE as a new kind of object altogether wouldn't
solve the problem I'm trying to address.

I guess we could use a decoration syntax on a sequence, like this...

ALTER SEQUENCE foo GLOBAL
or
ALTER SEQUENCE foo ALLOCATION FUNCTION myglobalalloc(); -- if we cared
to specify the alloc function on a per object basis.
or
ALTER SEQUENCE foo WITH (allocation=global);

So that we can explicitly specify which sequences to replicate
globally and which locally.

We'd need to have a userset GUC
   default_sequence_allocation = local (default) | global
so that SERIAL and other new sequences could pick up the required
behaviour when created.

In any case, we need to have a plugin/user-definable allocation
function for the next few years at least.
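
To make that concrete, a session under the syntax sketched above (all of
it proposed here, none of it implemented) might look like:

  SET default_sequence_allocation = global;
  CREATE TABLE orders (id serial PRIMARY KEY, note text);
  -- the implicit orders_id_seq now uses the global allocator
  ALTER SEQUENCE orders_id_seq WITH (allocation = local);  -- opt back out per object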

Thoughts?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in -c CLI option of pg_dump/pg_restore

2012-10-16 Thread Guillaume Lelarge
On Sat, 2012-10-13 at 16:47 +0200, Guillaume Lelarge wrote:
> Hi,
> 
> One of my colleagues, Jehan-Guillaume de Rorthais, found a weird
> behaviour of the "-c" command line option in the pg_restore tool while
> doing a training session. Here are the steps he followed:
> 
> createdb foo
> 
> pg_dump -Fc foo > foo.dump
> createdb bar
> pg_restore -c -d bar foo.dump
> 
> bar contains the same objects as foo (nothing unusual here), but... foo
> is no longer present. Actually, if you use the "-c" command line option,
> you get a "DROP DATABASE" statement. To me, it feels like a quite
> terrible bug.
> 
> It's quite easy to reproduce. Just create a database, and use pg_dump
> with the "-c" option:
> 
> createdb foo
> pg_dump -s -c foo | grep DATABASE
> 
> and you end up with this:
> 
> DROP DATABASE foo;
> 
> I tried from 8.3 till 9.2, and only 9.2 has this behaviour.
> 
> You'll find attached a patch that fixes this issue. Another colleague,
> Gilles Darold, tried it in every possible way, and it works. I'm not
> sure the test I added makes it a very good patch, but it fixes the bug.
> 

Any comments on this?


-- 
Guillaume
http://blog.guillaume.lelarge.info
http://www.dalibo.com



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC][PATCH] wal decoding, attempt #2 - Design Documents (really attached)

2012-10-16 Thread Jan Wieck

On 10/15/2012 4:43 PM, Simon Riggs wrote:

Jan spoke at length at PgCon, for all to hear, that what we are
building is a much better way than the trigger logging approach Slony
uses. I don't take that as carte blanche for approval of everything
being done, but it's going in the right direction with an open heart,
which is about as good as it gets.


The mechanism you are building for capturing changes is certainly a lot 
better than what Bucardo, Londiste and Slony are doing today. That much 
is true.


The flip side of the coin, however, is that all of today's logical 
replication systems are designed to be Postgres version agnostic to a 
degree. This means that the transition time from the existing, trigger 
based approach to the new WAL based mechanism will see both technologies in
approach to the new WAL based mechanism will see both technologies in 
parallel, which is no small thing to support. And that transition time 
may last for a good while. We still have people installing Slony 1.2 
because 2.0 (3 years old by now) requires Postgres 8.3 minimum.



Jan

--
Anyone who trades liberty for security deserves neither
liberty nor security. -- Benjamin Franklin


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-16 Thread Tom Lane
Simon Riggs  writes:
> So, proposal is to allow nextval() allocation to access a plugin,
> rather than simply write a WAL record and increment. If the plugin is
> loaded all sequences call it (not OIDs).

I think this is a fundamentally wrong way to go about doing what you
want to do.  It presumes that DDL-level manipulation of global sequences
is exactly like local sequences; an assumption that is obviously false.
What you really want is something vaguely like nextval but applied to
a distinct type of object.  That is, I think we first need a different
kind of object called a "global sequence" with its own DDL operations.

The nearby thread about "message queue" objects seems rather closely
related.  Perhaps it would be fruitful to think about the commonalities
involved in two (or more?) new relkinds for global objects.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] tuplesort memory usage: grow_memtuples

2012-10-16 Thread Simon Riggs
On 16 October 2012 13:42, Peter Geoghegan  wrote:
> On 14 October 2012 09:19, Simon Riggs  wrote:
>> This is a very useful optimisation, for both the low and the high end.
>
> Well, I'm about ready to mark this one "ready for committer". There is
> this outstanding issue in my revision of August 17th, though:
>
> +   /*
> +    * XXX: This feels quite brittle; is there a better principled
> +    * approach, that does not violate modularity?
> +    */
> +   newmemtupsize = (int) floor(oldmemtupsize * allowedMem / memNowUsed);
> +   state->fin_growth = true;
>
> I suppose that I should just recognise that this *is* nothing more
> than a heuristic, and leave it at that.

It's a simple and reasonable heuristic, and a great improvement on the
previous situation.

If you describe in detail that it is a heuristic and why that is
proposed over other approaches that should be sufficient for future
generations to read and understand.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-16 Thread Simon Riggs
On 16 October 2012 13:54, Peter Eisentraut  wrote:
> On 10/15/12 5:33 PM, Simon Riggs wrote:
>> There are a few options
>> 1) Manual separation of the value space, so that N1 has 50% of
>> possible values and N2 has 50%. That has problems when we reconfigure
>> the cluster, and requires complex manual reallocation of values. So it
>> starts good but ends badly.
>> 2) Automatic separation of the value space. This could mimic the
>> manual operation, so it does everything for you - but thats just
>> making a bad idea automatic
>> 3) Lazy allocation from the value space. When a node is close to
>> running out of values, it requests a new allocation and coordinates
>> with all nodes to confirm the new allocation is good.
>
> What would the allocation service look like?  Is it another PostgreSQL
> server?  What's the communication protocol?  How would backups work?
> Crash recovery?

I've proposed a plugin for the allocation only. So the allocation
looks like anything you want.

Crash recovery and backups would not need changes.

> Option 4 is of course to use UUIDs.

That is a user level option. If user chooses sequences, then we must
support them.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-16 Thread Andres Freund
On Tuesday, October 16, 2012 02:58:11 PM Andrew Dunstan wrote:
> On 10/16/2012 08:54 AM, Peter Eisentraut wrote:
> > Option 4 is of course to use UUIDs.
> 
> Yeah, I was wondering what this would really solve that using UUIDs
> wouldn't solve.

Large indexes over random values perform notably worse than mostly/completely 
ordered ones, which can be perfectly packed. That is besides the fact that 
UUIDs have 4/2 times the storage overhead of int4/int8.

That influences both query and modification performance.

Also, not allowing plain numeric PKs makes porting an application pretty 
annoying...
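
To illustrate the two designs being compared - a sketch only, where
uuid_generate_v4() assumes the contrib uuid-ossp extension:

  CREATE EXTENSION "uuid-ossp";

  CREATE TABLE t_int8 (id bigserial PRIMARY KEY, payload text);
  -- 8-byte, mostly ordered keys: packs well in the btree

  CREATE TABLE t_uuid (id uuid DEFAULT uuid_generate_v4() PRIMARY KEY,
                       payload text);
  -- 16-byte, random keys: larger and scattered across the index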

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-16 Thread Andrew Dunstan


On 10/16/2012 08:54 AM, Peter Eisentraut wrote:


Option 4 is of course to use UUIDs.




Yeah, I was wondering what this would really solve that using UUIDs 
wouldn't solve.


cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-16 Thread Peter Eisentraut
On 10/15/12 5:33 PM, Simon Riggs wrote:
> There are a few options
> 1) Manual separation of the value space, so that N1 has 50% of
> possible values and N2 has 50%. That has problems when we reconfigure
> the cluster, and requires complex manual reallocation of values. So it
> starts good but ends badly.
> 2) Automatic separation of the value space. This could mimic the
> manual operation, so it does everything for you - but thats just
> making a bad idea automatic
> 3) Lazy allocation from the value space. When a node is close to
> running out of values, it requests a new allocation and coordinates
> with all nodes to confirm the new allocation is good.

What would the allocation service look like?  Is it another PostgreSQL
server?  What's the communication protocol?  How would backups work?
Crash recovery?

Option 4 is of course to use UUIDs.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] BUG #7534: walreceiver takes long time to detect n/w breakdown

2012-10-16 Thread Heikki Linnakangas

On 15.10.2012 19:31, Fujii Masao wrote:

On Mon, Oct 15, 2012 at 11:27 PM, Heikki Linnakangas
  wrote:

On 15.10.2012 13:13, Heikki Linnakangas wrote:


Oh, I didn't remember that we've documented the specific structs that we
pass around. It's quite bogus anyway to explain the messages the way we
do currently, as they are actually dependent on the underlying
architecture's endianness and padding. I think we should refactor the
protocol to not transmit raw structs, but use pq_sendint and friends to
construct the messages. This was discussed earlier (see

http://archives.postgresql.org/message-id/4fe2279c.2070...@enterprisedb.com),
I think there's consensus that 9.3 would be a good time to do that as we
changed the XLogRecPtr format anyway.



This is what I came up with. The replication protocol is now
architecture-independent. The WAL format itself is still
architecture-dependent, of course, but this is useful if you want to e.g.
use pg_receivexlog to back up a server that runs on a different platform.

I chose the int64 format to transmit timestamps, even when compiled with
--disable-integer-datetimes.

Please review if you have the time..


Thanks for the patch!

When I ran pg_receivexlog, I encountered the following error.


Yeah, clearly I didn't test this near enough...

I fixed the bugs you bumped into, new version attached.


+   hdrlen = sizeof(int64) + sizeof(int64) + sizeof(int64);
+   hdrlen = sizeof(int64) + sizeof(int64) + sizeof(char);

These should be macros, to avoid calculation overhead?


The compiler will calculate this at compilation time; it's going to be a 
constant at runtime.


- Heikki
diff --git a/doc/src/sgml/protocol.sgml b/doc/src/sgml/protocol.sgml
index 3d72a16..5a32517 100644
--- a/doc/src/sgml/protocol.sgml
+++ b/doc/src/sgml/protocol.sgml
@@ -1366,7 +1366,8 @@ The commands accepted in walsender mode are:
   WAL data is sent as a series of CopyData messages.  (This allows
   other information to be intermixed; in particular the server can send
   an ErrorResponse message if it encounters a failure after beginning
-  to stream.)  The payload in each CopyData message follows this format:
+  to stream.)  The payload of each CopyData message from server to the
+  client contains a message of one of the following formats:
  
 
  
@@ -1390,34 +1391,32 @@ The commands accepted in walsender mode are:
   
   
   
-  Byte8
+  Int64
   
   
   
-  The starting point of the WAL data in this message, given in
-  XLogRecPtr format.
+  The starting point of the WAL data in this message.
   
   
   
   
   
-  Byte8
+  Int64
   
   
   
-  The current end of WAL on the server, given in
-  XLogRecPtr format.
+  The current end of WAL on the server.
   
   
   
   
   
-  Byte8
+  Int64
   
   
   
-  The server's system clock at the time of transmission,
-  given in TimestampTz format.
+  The server's system clock at the time of transmission, as
+  microseconds since midnight on 2000-01-01.
   
   
   
@@ -1445,25 +1444,12 @@ The commands accepted in walsender mode are:
continuation records can be sent in different CopyData messages.
  
  
-   Note that all fields within the WAL data and the above-described header
-   will be in the sending server's native format.  Endianness, and the
-   format for the timestamp, are unpredictable unless the receiver has
-   verified that the sender's system identifier matches its own
-   pg_control contents.
- 
- 
If the WAL sender process is terminated normally (during postmaster
shutdown), it will send a CommandComplete message before exiting.
This might not happen during an abnormal shutdown, of course.
  
 
  
-   The receiving process can send replies back to the sender at any time,
-   using one of the following message formats (also in the payload of a
-   CopyData message):
- 
-
- 
   
   
   
@@ -1495,12 +1481,23 @@ The commands accepted in walsender mode are:
   
   
   
-  Byte8
+  Int64
   
   
   
-  The server's system clock at the time of transmission,
-  given in TimestampTz format.
+  The server's system clock at the time of transmission, as
+  microseconds since midnight on 2000-01-01.
+  
+  
+  
+  
+  
+  Byte1
+  
+  
+  
+  1 means that the client should reply to this message as soon as
+  possible, to avoid a timeout disconnect. 0 otherwise.
   
   
   
@@ -1512,6 +1509,12 @@ The commands accepted in walsender mode are:
  
 
  
+   The receiving process can send re

Re: [HACKERS] tuplesort memory usage: grow_memtuples

2012-10-16 Thread Peter Geoghegan
On 14 October 2012 09:19, Simon Riggs  wrote:
> This is a very useful optimisation, for both the low and the high end.

Well, I'm about ready to mark this one "ready for committer". There is
this outstanding issue in my revision of August 17th, though:

+   /*
+    * XXX: This feels quite brittle; is there a better principled
+    * approach, that does not violate modularity?
+    */
+   newmemtupsize = (int) floor(oldmemtupsize * allowedMem / memNowUsed);
+   state->fin_growth = true;

I suppose that I should just recognise that this *is* nothing more
than a heuristic, and leave it at that.

-- 
Peter Geoghegan   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-16 Thread Simon Riggs
On 16 October 2012 13:26, Markus Wanner  wrote:

> Why does a "Global Sequences" API necessarily hook at the nextval() and
> setval() level? That sounds like it yields an awkward amount of
> duplicate work. Reading this thread, so far it looks like we agree that
> option 3) is the most feasible optimization (the strict ordering being
> the un-optimized starting point). Do we really need an API that allows
> for implementations of options 1) and 2)?

Where else would you put the hook? The hook's location as described
won't change whether you decide you want 1, 2 or 3.

> What I'd appreciate more is a common implementation for option 3) with
> an API to plug in different solutions to the underlying consensus problem.

Implementations will be similar, differing mostly in the topology and
transport layer, which means it's not going to be possible to provide
such a thing initially without slowing it down to the point where we
don't actually get it at all.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-16 Thread Markus Wanner
On 10/16/2012 12:47 AM, Josh Berkus wrote:
> I'd also love to hear from the PostgresXC folks on whether this solution
> works for them.  Postgres-R too.

In Postgres-R, option 3) is implemented. Though, by default sequences
work just like on a single machine, giving you monotonically increasing
sequence values - independent of the node you call nextval() from. IMO
that's the user's expectation. (And yes, this has a performance penalty.
But no, there's no compromise in availability).

It is implemented very much like the per-backend cache we already have
in vanilla Postgres, but taken to the per-node level. This gives the
user a nice compromise between strongly ordered and entirely random
values, allowing fine-tuning the trade off between performance and
laziness in the ordering (think of CACHE 10 vs. CACHE 1).
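
The single-node analogue of that trade-off is already visible in vanilla
Postgres:

  CREATE SEQUENCE seq_ordered CACHE 1;   -- strict allocation order,
                                         -- every nextval() touches shared state
  CREATE SEQUENCE seq_lazy CACHE 10;     -- each session grabs 10 values at once;
                                         -- cheaper, but values can appear
                                         -- out of order across sessions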

> If it works for all three of those
> tools, it's liable to work for any potential new tool.

In Postgres-R, this per-node cache uses additional attributes in the
pg_sequence system catalog to store state of this cache. This is
something I'm sure is not feasible to do from within a plugin.

Why does a "Global Sequences" API necessarily hook at the nextval() and
setval() level? That sounds like it yields an awkward amount of
duplicate work. Reading this thread, so far it looks like we agree that
option 3) is the most feasible optimization (the strict ordering being
the un-optimized starting point). Do we really need an API that allows
for implementations of options 1) and 2)?

What I'd appreciate more is a common implementation for option 3) with
an API to plug in different solutions to the underlying consensus problem.

Regards

Markus Wanner


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] First draft of snapshot snapshot building design document

2012-10-16 Thread Andres Freund
Hi All,

On Thursday, October 11, 2012 01:02:26 AM Peter Geoghegan wrote:
> The design document [2] really just explains the problem (which is the
> need for catalog metadata at a point in time to make sense of heap
> tuples), without describing the solution that this patch offers with
> any degree of detail. Rather, [2] says "How we build snapshots is
> somewhat intricate and complicated and seems to be out of scope for
> this document", which is unsatisfactory. I look forward to reading the
> promised document that describes this mechanism in more detail.

Here's the first version of the promised document. I hope it answers most of 
the questions.

Input welcome!

Greetings,

Andres
-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
= Snapshot Building =
:author: Andres Freund, 2ndQuadrant Ltd

== Why do we need timetravel catalog access ==

When doing WAL decoding (see DESIGN.txt for reasons to do so) we need to know
how the catalog looked at the point a record was inserted into the WAL, because
without that information we don't know much more about the record than its
length. It's just an arbitrary bunch of bytes without further information.
Unfortunately, due to the possibility of the table definition changing, we
cannot just access a newer version of the catalog and assume the table
definition is still the same.

If only the type information were required it might be enough to annotate the
WAL records with a bit more information (table oid, table name, column name,
column type), but as we want to be able to convert the output to more useful
formats like text we need to be able to call output functions. Those need a
normal environment, including the usual caches and normal catalog access to
look up operators, functions and other types.

Our solution to this is to add the capability to access the catalog in a way
that makes it look like it did when the record was inserted into the WAL. The
locking used during WAL generation guarantees the catalog is/was in a consistent
state at that point.

Interesting cases include:
- enums
- composite types
- extension types
- non-C functions
- relfilenode to table oid mapping

Due to Postgres' MVCC nature, regular modifications of a table's contents are
theoretically non-destructive. The problem is that there is no way to access
arbitrary points in time, even if the data for them is there.

This module adds the capability to do so in the very limited set of
circumstances we need it in for WAL decoding. It does *not* provide a facility
to do so in general.

A 'Snapshot' is the data structure used in Postgres to describe which tuples
are visible and which are not. We need to build a Snapshot which can be used to
access the catalog the way it looked when the WAL record was inserted.

Restrictions:
* Only works for catalog tables
* Snapshot modifications are somewhat expensive
* it cannot build initial visibility information for every point in time, it
  needs a specific set of circumstances for that
* limited window in which we can build snapshots

== How do we build timetravel snapshots ==

Hot Standby added infrastructure to build snapshots from WAL during recovery in
the 9.0 release. Most of that can be reused for our purposes.

We cannot reuse all of the HS infrastructure because:
* we are not in recovery
* we need to look *inside* transactions
* we need the capability to have multiple different snapshots around at the
same time

We need to provide two kinds of snapshots that are implemented rather
differently in their plain postgres incarnation:
* SnapshotNow
* SnapshotMVCC

We need both because if any operators use normal functions they will get
executed with SnapshotMVCC semantics while the catcache and related things will
rely on SnapshotNow semantics. Note that SnapshotNow here cannot be a normal
SnapshotNow because we wouldn't access the old version of the catalog in that
case. Instead it needs to be something like an MVCC snapshot with the correct
visibility information. That also means the snapshot won't have some of the
race issues a normal SnapshotNow has.

Every time a transaction that changed the catalog commits, all other
transactions will need a new snapshot that marks that transaction (and its
subtransactions) as visible.

Our snapshot representation is a bit different from normal snapshots, but we
still reuse the normal SnapshotData struct:
* Snapshot->xip contains all transactions we consider committed
* Snapshot->subxip contains all transactions belonging to our transaction,
  including the toplevel one

The meaning of ->xip is inverted in comparison with non-timetravel snapshots
because usually only a tiny percentage of committed transactions will have
modified the catalog between xmin and xmax. It also makes subtransaction
handling easier (we cannot query pg_subtrans).

== Building of initial snapshot ==

We can start building an initial snapshot as soon as we find either an
XLOG_RUNNING_XACTS or an XLOG_CHECKPOINT

Re: [HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ ccompatibility

2012-10-16 Thread Hannu Krosing

On 10/16/2012 11:29 AM, Hannu Krosing wrote:

On 10/16/2012 11:18 AM, Simon Riggs wrote:

On 16 October 2012 09:56, Hannu Krosing  wrote:

Hallo postgresql and replication hackers

This mail is an additional RFC which proposes a simple way to extend the
new logical replication feature so it can cover most usages of
skytools/pgq/londiste

While the current work for BDR/LCR (bi-directional replication/logical
replication) using WAL is theoretically enough to cover _replication_
offered by Londiste, it falls short in one important way - there is
currently no support for pure queueing, that is for "streams" of data
which do not need to be stored in the source database.

Fortunately there is a simple solution - do not store it in the source
database :)

The only thing needed for adding this is to have a table type which

a) generates an INSERT record in WAL

and

b) does not actually store the data in a local file

If implemented in userspace it would be a VIEW (or table) with a
before/instead trigger which logs the inserted data and then cancels
the insert.

I'm sure this thing could be implemented, but I leave the tech
discussion to those who are currently deep in WAL
generation/reconstruction.

If we implement logged only tables / queues we would not only enable
a more performant pgQ replacement for implementing full Londiste /
skytools functionality but would also become a very strong player to
be used as a persistent basis for message queueing solutions like
ActiveMQ, StorMQ, any Advanced Message Queuing Protocol (AMQP)
implementation and so on.


Hmm, I was assuming that we'd be able to do that by just writing extra
WAL directly. But now you've made me think about it, that would be
very ugly.

Doing it this way, as you suggest, would allow us to write WAL records
for queuing/replication to specific queue ids. It also allows us to
have privileges assigned. So this looks like a good idea and might
even be possible for 9.3.

I've got a feeling we may want the word QUEUE again in the future, so
I think we should call this a MESSAGE QUEUE.

CREATE MESSAGE QUEUE foo;
DROP MESSAGE QUEUE foo;

I would like this to be very similar to a table, so it would be

CREATE MESSAGE QUEUE(fieldname type, ...) foo;

perhaps even allowing defaults and constraints. Again, this
depends on how complex the implementation would be.

for the receiving side it would look like a table with only inserts,
and in this case there could even be a possibility to use it as
a remote log table.


To clarify - this is intended to be a mirror image of an UNLOGGED table

That is, as much as possible a full table, except that no data gets
written, which means that

a) indexes do not make any sense
b) exclusion and unique constraints don't make any sense
c) select, update and delete always see an empty table

all of these should probably throw an error, analogous to how VIEWs
currently work.


It could also be described as a write-only table, except that it is
possible to materialise it as a real table on the receiving side





GRANT INSERT ON MESSAGE QUEUE foo TO ...;
REVOKE INSERT ON MESSAGE QUEUE foo TO ...;

Rules wouldn't. DELETE and UPDATE wouldn't work, nor would SELECT.

Things for next release: Triggers, SELECT sees a stream of changes,
CHECK clauses to constrain what can be written.

One question: would we require the INSERT statement to parse against a
tupledesc, or would it be just a single blob of TEXT or can we send
any payload? I'd suggest just a single blob of TEXT, since that can be
XML or JSON etc easily enough.









--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ ccompatibility

2012-10-16 Thread Simon Riggs
On 16 October 2012 10:29, Hannu Krosing  wrote:

> I would like this to be very similar to a table, so it would be
>
> CREATE MESSAGE QUEUE(fieldname type, ...) foo;
>
> perhaps even allowing defaults and constraints. again, this
> depends on how complex the implementation would be.

Presumably just CHECK constraints, not UNIQUE or FKs.
Indexes would not be allowed.

> for the receiving side it would look like a table with only inserts,
> and in this case there could even be a possibility to use it as
> a remote log table.

The queue data would be available via the API, so it can look like anything.

It would be good to identify this with a new rmgr id.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ ccompatibility

2012-10-16 Thread Hannu Krosing

On 10/16/2012 11:18 AM, Simon Riggs wrote:

On 16 October 2012 09:56, Hannu Krosing  wrote:

Hallo postgresql and replication hackers

This mail is an additional RFC which proposes a simple way to extend the
new logical replication feature so it can cover most usages of
skytools/pgq/londiste

While the current work for BDR/LCR (bi-directional replication/logical
replication) using WAL is theoretically enough to cover _replication_
offered by Londiste, it falls short in one important way - there is
currently no support for pure queueing, that is for "streams" of data
which do not need to be stored in the source database.

Fortunately there is a simple solution - do not store it in the source
database :)

The only thing needed for adding this is to have a table type which

a) generates an INSERT record in WAL

and

b) does not actually store the data in a local file

If implemented in userspace it would be a VIEW (or table) with a
before/instead
trigger which logs the inserted data and then cancels the insert.

I'm sure this thing could be implemented, but I leave the tech discussion to
those who are currently deep in WAL generation/reconstruction.

If we implement logged only tables / queues we would not only enable a more
performant pgQ replacement for implementing full Londiste / skytools
functionality
but would also become a very strong player to be used as a persistent basis
for message queueing solutions like ActiveMQ, StorMQ, any Advanced Message
Queuing Protocol (AMQP) and so on.


Hmm, I was assuming that we'd be able to do that by just writing extra
WAL directly. But now you've made me think about it, that would be
very ugly.

Doing it this way, as you suggest, would allow us to write WAL records
for queuing/replication to specific queue ids. It also allows us to
have privileges assigned. So this looks like a good idea and might
even be possible for 9.3.

I've got a feeling we may want the word QUEUE again in the future, so
I think we should call this a MESSAGE QUEUE.

CREATE MESSAGE QUEUE foo;
DROP MESSAGE QUEUE foo;

I would like this to be very similar to a table, so it would be

CREATE MESSAGE QUEUE(fieldname type, ...) foo;

perhaps even allowing defaults and constraints. Again, this
depends on how complex the implementation would be.

for the receiving side it would look like a table with only inserts,
and in this case there could even be a possibility to use it as
a remote log table.



GRANT INSERT ON MESSAGE QUEUE foo TO ...;
REVOKE INSERT ON MESSAGE QUEUE foo TO ...;

Rules wouldn't. DELETE and UPDATE wouldn't work, nor would SELECT.

Things for next release: Triggers, SELECT sees a stream of changes,
CHECK clauses to constrain what can be written.

One question: would we require the INSERT statement to parse against a
tupledesc, or would it be just a single blob of TEXT or can we send
any payload? I'd suggest just a single blob of TEXT, since that can be
XML or JSON etc easily enough.





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ ccompatibility

2012-10-16 Thread Simon Riggs
On 16 October 2012 09:56, Hannu Krosing  wrote:
> Hallo postgresql and replication hackers
>
> This mail is an additional RFC which proposes a simple way to extend the
> new logical replication feature so it can cover most usages of
> skytools/pgq/londiste
>
> While the current work for BDR/LCR (bi-directional replication/logical
> replication) using WAL is theoretically enough to cover _replication_
> offered by Londiste, it falls short in one important way - there is
> currently no support for pure queueing, that is for "streams" of data
> which do not need to be stored in the source database.
>
> Fortunately there is a simple solution - do not store it in the source
> database :)
>
> The only thing needed for adding this is to have a table type which
>
> a) generates an INSERT record in WAL
>
> and
>
> b) does not actually store the data in a local file
>
> If implemented in userspace it would be a VIEW (or table) with a
> before/instead
> trigger which logs the inserted data and then cancels the insert.
>
> I'm sure this thing could be implemented, but I leave the tech discussion to
> those who are currently deep in WAL generation/reconstruction.
>
> If we implement logged only tables / queues we would not only enable a more
> performant pgQ replacement for implementing full Londiste / skytools
> functionality
> but would also become a very strong player to be used as a persistent basis
> for message queueing solutions like ActiveMQ, StorMQ, any Advanced Message
> Queuing Protocol (AMQP) and so on.


Hmm, I was assuming that we'd be able to do that by just writing extra
WAL directly. But now you've made me think about it, that would be
very ugly.

Doing it this way, as you suggest, would allow us to write WAL records
for queuing/replication to specific queue ids. It also allows us to
have privileges assigned. So this looks like a good idea and might
even be possible for 9.3.

I've got a feeling we may want the word QUEUE again in the future, so
I think we should call this a MESSAGE QUEUE.

CREATE MESSAGE QUEUE foo;
DROP MESSAGE QUEUE foo;

GRANT INSERT ON MESSAGE QUEUE foo TO ...;
REVOKE INSERT ON MESSAGE QUEUE foo TO ...;

Rules wouldn't. DELETE and UPDATE wouldn't work, nor would SELECT.

Things for next release: Triggers, SELECT sees a stream of changes,
CHECK clauses to constrain what can be written.

One question: would we require the INSERT statement to parse against a
tupledesc, or would it be just a single blob of TEXT or can we send
any payload? I'd suggest just a single blob of TEXT, since that can be
XML or JSON etc easily enough.
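
Putting the pieces together, a minimal usage sketch under the grammar
proposed above (all hypothetical syntax at this point):

  CREATE MESSAGE QUEUE foo;
  GRANT INSERT ON MESSAGE QUEUE foo TO queue_writer;
  INSERT INTO foo VALUES ('{"event": "new_user", "id": 42}');
  -- with the single-TEXT-blob variant, the payload here is just a JSON string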

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [RFC] CREATE QUEUE (log-only table) for londiste/pgQ ccompatibility

2012-10-16 Thread Hannu Krosing

Hallo postgresql and replication hackers

This mail is an additional RFC which proposes a simple way to extend the
new logical replication feature so it can cover most usages of 
skytools/pgq/londiste


While the current work for BDR/LCR (bi-directional replication/logical
replication) using WAL is theoretically enough to cover _replication_
offered by Londiste, it falls short in one important way - there is
currently no support for pure queueing, that is for "streams" of data
which do not need to be stored in the source database.


Fortunately there is a simple solution - do not store it in the source
database :)


The only thing needed for adding this is to have a table type which

a) generates an INSERT record in WAL

and

b) does not actually store the data in a local file

If implemented in userspace it would be a VIEW (or table) with a
before/instead trigger which logs the inserted data and then cancels
the insert.

I'm sure this thing could be implemented, but I leave the tech
discussion to those who are currently deep in WAL
generation/reconstruction.

If we implement logged only tables / queues we would not only enable
a more performant pgQ replacement for implementing full Londiste /
skytools functionality but would also become a very strong player to
be used as a persistent basis for message queueing solutions like
ActiveMQ, StorMQ, any Advanced Message Queuing Protocol (AMQP)
implementation and so on.

Comments?


Hannu Krosing



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Global Sequences

2012-10-16 Thread Yeb Havinga

On 2012-10-15 23:33, Simon Riggs wrote:

So, proposal is to allow nextval() allocation to access a plugin,
rather than simply write a WAL record and increment. If the plugin is
loaded all sequences call it (not OIDs).
+1. It is currently impossible to alter nextval's behaviour without 
making changes in core. It is possible to define an alternative 
implementation and try to force its use via the search_path, but 
serial datatypes are always bound to pg_catalog.nextval(). This would 
enable every distributed PostgreSQL system to make a cleaner 
implementation of global sequences than they currently have, and would 
also encourage reuse of distributed nextval implementations.
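
For reference, that binding is visible in what a serial column expands to
today (equivalent statements, per the documented serial behaviour):

  CREATE TABLE t (id serial);
  -- is shorthand for roughly:
  CREATE SEQUENCE t_id_seq;
  CREATE TABLE t (id integer NOT NULL DEFAULT nextval('t_id_seq'::regclass));
  ALTER SEQUENCE t_id_seq OWNED BY t.id;

Since the default calls the built-in nextval() directly, no search_path
setting can substitute a different implementation.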


regards,
Yeb

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers