Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Fujii Masao
On Fri, Mar 18, 2011 at 2:46 AM, Robert Haas robertmh...@gmail.com wrote:
 On further review, I've changed my mind.  Making synchronous_commit
 trump synchronous_replication is appealing conceptually, but it's
 going to lead to some weird corner cases.  For example, a transaction
 that drops a non-temporary relation always commits synchronously; and
 2PC also ignores synchronous_commit.  In the case where
 synchronous_commit=off and synchronous_replication=on, we'd either
 have to decide that these sorts of transactions aren't going to
 replicate synchronously (which would give synchronous_commit a rather
 long reach into areas it doesn't currently touch) or else that it's OK
 for CREATE TABLE foo () to be totally asynchronous but that DROP TABLE
 foo requires sync commit AND sync rep.  That's pretty weird.

 What makes more sense to me after having thought about this more
 carefully is to simply make a blanket rule that when
 synchronous_replication=on, synchronous_commit has no effect.  That is
 easy to understand and document.  I'm inclined to think it's OK to let
 synchronous_replication have this effect even if max_wal_senders=0 or
 synchronous_standby_names=''; you shouldn't turn
 synchronous_replication on just for kicks, and I don't think we want
 to complicate the test in RecordTransactionCommit() more than
 necessary.  We should, however, adjust the logic so that a transaction
 which has not written WAL can still commit asynchronously, because
 such a transaction has only touched temp or unlogged tables and so
 it's not important for it to make it to the standby, where that data
 doesn't exist anyway.
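
Restating the rule you describe in code form (just a rough standalone sketch; these
names are made up and this is not the actual RecordTransactionCommit() logic):

    #include <stdbool.h>

    /*
     * Rough sketch of the proposed rule -- not the real RecordTransactionCommit()
     * code; all names here are invented.
     */
    bool
    must_wait_for_local_flush(bool wrote_wal, bool sync_commit, bool sync_rep)
    {
        if (!wrote_wal)
            return false;   /* only temp/unlogged tables touched: commit async */
        return sync_commit || sync_rep;     /* sync rep overrides sync_commit=off */
    }

    bool
    must_wait_for_standby_ack(bool wrote_wal, bool sync_rep)
    {
        return wrote_wal && sync_rep;
    }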

In the first place, I think that it's complicated to keep those two parameters
separate. What about merging them into one parameter? What I'm thinking of
is to remove synchronous_replication and to extend the valid values of
synchronous_commit from on/off to async/local/remote/both. Each value
works as follows.

async   = (synchronous_commit = off && synchronous_replication = off)
async makes a transaction do local WAL flush and replication
asynchronously.

local   = (synchronous_commit = on && synchronous_replication = off)
local makes a transaction wait only for local WAL flush.

remote = (synchronous_commit = off && synchronous_replication = on)
remote makes a transaction wait only for replication. Local WAL flush is
performed by walwriter. This is useless in 9.1 because we must always
wait for local WAL flush when we wait for replication. But in the future,
if we're able to send WAL before the WAL write (i.e., send WAL from
wal_buffers), this might become useful. In 9.1, it seems reasonable to
omit this value.

both    = (synchronous_commit = on && synchronous_replication = on)
both makes a transaction wait for both local WAL flush and replication.
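
To illustrate the mapping, here is a hedged sketch in C (the enum and function
names are hypothetical, not a proposed patch):

    #include <stdbool.h>

    /* Sketch of the proposed single parameter; all names are hypothetical. */
    typedef enum
    {
        SYNC_COMMIT_ASYNC,      /* wait for neither local flush nor replication */
        SYNC_COMMIT_LOCAL,      /* wait for local WAL flush only */
        SYNC_COMMIT_REMOTE,     /* wait for standby ack only (future use) */
        SYNC_COMMIT_BOTH        /* wait for local WAL flush and standby ack */
    } SyncCommitLevel;

    /* Equivalent settings under the existing two-parameter scheme. */
    void
    to_old_parameters(SyncCommitLevel level, bool *sync_commit, bool *sync_rep)
    {
        *sync_commit = (level == SYNC_COMMIT_LOCAL || level == SYNC_COMMIT_BOTH);
        *sync_rep = (level == SYNC_COMMIT_REMOTE || level == SYNC_COMMIT_BOTH);
    }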

Thought?

Regards,

-- 
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI bug?

2011-03-18 Thread YAMAMOTO Takashi
Hi,

Thanks for quickly fixing the problems.
I tested the later version (a2eb9e0c08ee73208b5419f5a53a6eba55809b92),
and the only errors I got were "out of shared memory".  I'm not sure
whether they were caused by SSI activity or not.

YAMAMOTO Takashi


The following is a snippet from my application log:

PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc

PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: dynahash.c
PG_DIAG_SOURCE_LINE: 925
PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value

PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc

PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: dynahash.c
PG_DIAG_SOURCE_LINE: 925
PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value

PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc

PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: dynahash.c
PG_DIAG_SOURCE_LINE: 925
PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,

2011-03-18 Thread Heikki Linnakangas

On 18.03.2011 07:13, Fujii Masao wrote:

On Fri, Mar 18, 2011 at 1:17 AM, Robert Haas robertmh...@gmail.com wrote:

One thing I'm not quite clear on is what happens if we reach the
recovery target before we reach consistency, i.e. create a restore
point, flush xlog, abnormal shutdown, try to recover to the named restore
point.  Is there any possibility that we can end up paused before Hot
Standby has actually started up?  Because that would be fairly useless
and annoying.


Good catch! In that case, the same situation as (3) would happen.
I think that recovery should ignore pause_at_recovery_target until
it reaches the consistency point. If we do so, when the recovery target is
ahead of the consistency point, recovery just ends at an inconsistent point
and throws a FATAL error.


If the recovery target is set to a point before the database is consistent,
i.e. before minRecoveryPoint, we should throw an error before recovery even
starts. I'm not sure if we check that at the moment.


Not sure what to do if the recovery target is beyond minRecoveryPoint and
pause_at_recovery_target=true, but the server hasn't been opened for hot
standby yet (because it hasn't seen a running-xacts record yet). I agree
it's pretty useless and annoying to stop there.
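
To make Fujii's option concrete, here is a hedged sketch in plain C (not the real
recovery loop; all names are invented) of "ignore the pause until we are consistent
and hot standby is actually active":

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * Sketch only, not xlog.c code.  Idea: honor pause_at_recovery_target only
     * once recovery is consistent and hot standby connections are possible.
     */
    void
    maybe_pause_at_target(bool pause_at_target, bool reached_consistency,
                          bool hot_standby_active)
    {
        if (!pause_at_target)
            return;

        if (!reached_consistency || !hot_standby_active)
        {
            printf("WARNING: recovery target reached before hot standby started; not pausing\n");
            return;
        }

        printf("recovery paused at target; waiting for resume\n");
    }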


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Simon Riggs
On Fri, 2011-03-18 at 14:45 +0900, Fujii Masao wrote:
 On Fri, Mar 18, 2011 at 2:52 AM, Robert Haas robertmh...@gmail.com wrote:
  On Thu, Mar 10, 2011 at 3:04 PM, Robert Haas robertmh...@gmail.com wrote:
  -   /* Let the master know that we received some data. */
  -   XLogWalRcvSendReply();
  -   XLogWalRcvSendHSFeedback();
 
  This change completely eliminates the difference between write_location
  and flush_location in pg_stat_replication. If this change is reasonable, we
  should get rid of write_location from pg_stat_replication since it's useless.
  If not, this change should be reverted. I'm not sure whether monitoring
  the difference between write and flush locations is useful. But I guess that
  someone thought so and that code was added.
 
  I could go either way on this but clearly we need to do one or the other.
 
  I'm not really sure why this was part of the synchronous replication
  patch, but after mulling it over I think it's probably right to rip
  out write_location completely.  There shouldn't ordinarily be much of
  a gap between write location and flush location, so it's probably not
  worth the extra network overhead to keep track of it.  We might need
  to re-add some form of this in the future if we have a version of
  synchronous replication that only waits for confirmation of receipt
  rather than for confirmation of flush, but we don't have that in 9.1,
  so why bother?
 
  Barring objections, I'll go do that.
 
 I agree to get rid of write_location.

No, don't remove it.

We seem to be just looking for things to tweak without any purpose.
Removing this adds nothing for us.

We will have the column in the future, it is there now, so leave it.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Allowing multiple concurrent base backups

2011-03-18 Thread Heikki Linnakangas

On 17.03.2011 21:39, Robert Haas wrote:

On Mon, Jan 31, 2011 at 10:45 PM, Fujii Masao masao.fu...@gmail.com  wrote:

On Tue, Feb 1, 2011 at 1:31 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

Hmm, good point. It's harmless, but creating the history file in the first
place sure seems like a waste of time.


The attached patch changes pg_stop_backup so that it doesn't create
the backup history file if archiving is not enabled.

When I tested the multiple backups, I found that they can have the same
checkpoint location and the same history file name.


$ for ((i=0; i<4; i++)); do
pg_basebackup -D test$i -c fast -x -l test$i
done

$ cat test0/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test0

$ cat test1/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test1

$ cat test2/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test2

$ cat test3/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test3

$ ls archive/*.backup
archive/00010002.00B0.backup


This would cause a serious problem, because a backup-end record
indicating the same START WAL LOCATION can be written by the
first backup before the others finish. So we might wrongly think that
we've already reached a consistent state by reading the backup-end
record (written by the first backup) before reading the last required WAL
file.

/*
 * Force a CHECKPOINT.  Aside from being necessary to prevent torn
 * page problems, this guarantees that two successive backup runs will
 * have different checkpoint positions and hence different history
 * file names, even if nothing happened in between.
 *
 * We use CHECKPOINT_IMMEDIATE only if requested by user (via passing
 * fast = true).  Otherwise this can take awhile.
 */
RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
                  (fast ? CHECKPOINT_IMMEDIATE : 0));

This problem happens because the above code (in do_pg_start_backup)
doesn't actually ensure that concurrent backups have different
checkpoint locations. ISTM that we should change the above code, or
something elsewhere, to ensure that.


Yes, good point.


Or we should include the backup label name in the backup-end record, to
prevent a recovery from reading a backup-end record that isn't its own.


Backup labels are not guaranteed to be unique either, so including the 
backup label in the backup-end record doesn't solve the problem. But 
something else, like a backup-start counter in shared memory or the 
process id, would work.


It won't make the history file names unique, though. Now that we rely on 
the end-of-backup record for detecting end-of-backup, the history files 
are just for documentation purposes. Do we want to give up on history 
files for backups performed with pg_basebackup? Or we could include the 
backup counter or similar in the filename too.
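
For what it's worth, a minimal sketch (hypothetical, not PostgreSQL code) of what
such a backup-start counter could look like:

    #include <stdint.h>

    /*
     * Hypothetical sketch: a counter bumped at each base backup start.
     * Stamping it into the backup-end record (or the history file name)
     * would distinguish concurrent backups even when they share a
     * checkpoint location.
     */
    typedef struct BackupCtlData
    {
        uint64_t    backupStartCounter;     /* would live in shared memory,
                                             * protected by a suitable lock */
    } BackupCtlData;

    static BackupCtlData backupCtl;

    uint64_t
    assign_backup_id(void)
    {
        return ++backupCtl.backupStartCounter;
    }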


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Markus Wanner
Hi,

sorry for being late to join that bike-shedding discussion.

On 03/07/2011 05:09 PM, Alvaro Herrera wrote:
 I think these terms are used inconsistently enough across the industry
 that what would make the most sense would be to use the common term and
 document accurately what we mean by it, rather than relying on some
 external entity's definition, which could change (like wikipedia's).

I absolutely agree with Alvaro here.

The Wikipedia definition seems to only speak about one local and one
remote node.  Requiring an ack from at least one remote node seems to
cover that.

Not even Wikipedia goes further in its definition and tries to explain
what 'synchronous replication' could mean in case we have more than two
nodes.  A somewhat common expectation is that all nodes would have to
ack.  However, with such a requirement a single node failure brings your
cluster to a full stop.  So this isn't a practical option.

Google invented the term 'semi-synchronous' for something that's
essentially the same as what we have now, I think.  However, I
wholeheartedly hate that term (based on the reasoning that there's no
'semi-pregnant', either).

Others (like me) use 'synchronous' or (lately rather) 'eager' to mean
that only a majority of nodes needs to send an ACK.  I have to explain
what I mean every time.

In the end, I don't have a strong opinion either way, anymore.  I'm
happy to think of the replication between the master and the one standby
that's first to send an ACK as 'synchronous'.  (Even if those may well
be different standbys for different transactions.)

Hope to have brought some light into this discussion.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] I am confused after reading codes of PostgreSQL three week

2011-03-18 Thread Markus Wanner
Hom,

On 03/17/2011 04:49 PM, Kevin Grittner wrote:
 That's ambitious.

Absolutely, yes.  Exercise patience with yourself.

A method that hasn't been mentioned yet is digging out your debugger
and attaching it to a connected Postgres backend.  You can then issue a
query you are interested in and follow the backend doing its work.

That's particularly helpful in trying to find a certain spot of
interest.  Of course, it doesn't help much in getting the big picture.

Good luck on your journey through the code base.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] I am confused after reading codes of PostgreSQL three week

2011-03-18 Thread Brendan Jurd
On 18 March 2011 01:57, hom obsidian...@gmail.com wrote:
  I try to known how a database is implemented

This objective is so vast and so vague that it's difficult to give
meaningful help.

I'd emphasise Kevin Grittner's very worthwhile advice.  Try to break
your question down into smaller, more specific ones.  With a question
like "how does postgres work?" you're likely to flounder.  But with a
more targeted question, e.g., "what format does postgres use to save
data to disk?" or "how does postgres implement ORDER BY?", you can make
easier progress, and perhaps you could get more useful pointers from
the people on this list.

Have you read through the "Overview of System Internals" chapter in
the documentation [1]?  Perhaps it will help you identify the areas
you wish to explore further, and form more specific questions.

[1] http://www.postgresql.org/docs/current/static/overview.html

Cheers,
BJ

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] I am confused after reading codes of PostgreSQL three week

2011-03-18 Thread Vaibhav Kaushal
Hi,

That was the question I was facing 5 months ago, and trust me, I am still facing
it now. With an average of 6+ hours going into the PostgreSQL code, even with
best practices (as suggested by the developers), I still think I know less
than 10 percent. It is too huge to be swallowed at once.

I too had to break it down into pieces, and because everything is so
interconnected with everything else, it is quite complicated in the
beginning. Start with one piece (planner, parser, executor, storage
management, whatever) and slowly it should help you get the bigger picture.

regards,
Vaibhav


On Fri, Mar 18, 2011 at 3:39 PM, Brendan Jurd dire...@gmail.com wrote:

 On 18 March 2011 01:57, hom obsidian...@gmail.com wrote:
   I try to known how a database is implemented

 This objective is so vast and so vague that it's difficult to give
 meaningful help.

 I'd emphasise Kevin Grittner's very worthwhile advice.  Try to break
 your question down into smaller, more specific ones.  With a question
 like how does postgres work you're likely to flounder.  But with a
 more targeted question, e.g., what format does postgres use to save
 data to disk or how does postgres implement ORDER BY, you can make
 easier progress, and perhaps you could get more useful pointers from
 the people on this list.

 Have you read through the Overview of System Internals chapter in
 the documentation [1]?  Perhaps it will help you identify the areas
 you wish to explore further, and form more specific questions.

 [1] http://www.postgresql.org/docs/current/static/overview.html

 Cheers,
 BJ

 --
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] Allowing multiple concurrent base backups

2011-03-18 Thread Heikki Linnakangas

On 18.03.2011 10:48, Heikki Linnakangas wrote:

On 17.03.2011 21:39, Robert Haas wrote:

On Mon, Jan 31, 2011 at 10:45 PM, Fujii Masao masao.fu...@gmail.com
wrote:

On Tue, Feb 1, 2011 at 1:31 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:

Hmm, good point. It's harmless, but creating the history file in the
first
place sure seems like a waste of time.


The attached patch changes pg_stop_backup so that it doesn't create
the backup history file if archiving is not enabled.

When I tested the multiple backups, I found that they can have the same
checkpoint location and the same history file name.


$ for ((i=0; i<4; i++)); do
pg_basebackup -D test$i -c fast -x -l test$i
done

$ cat test0/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test0

$ cat test1/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test1

$ cat test2/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test2

$ cat test3/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test3

$ ls archive/*.backup
archive/00010002.00B0.backup


This would cause a serious problem, because a backup-end record
indicating the same START WAL LOCATION can be written by the
first backup before the others finish. So we might wrongly think that
we've already reached a consistent state by reading the backup-end
record (written by the first backup) before reading the last required
WAL file.

/*
 * Force a CHECKPOINT.  Aside from being necessary to prevent torn
 * page problems, this guarantees that two successive backup runs will
 * have different checkpoint positions and hence different history
 * file names, even if nothing happened in between.
 *
 * We use CHECKPOINT_IMMEDIATE only if requested by user (via passing
 * fast = true).  Otherwise this can take awhile.
 */
RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
                  (fast ? CHECKPOINT_IMMEDIATE : 0));

This problem happens because the above code (in do_pg_start_backup)
doesn't actually ensure that concurrent backups have different
checkpoint locations. ISTM that we should change the above code, or
something elsewhere, to ensure that.


Yes, good point.


Here's a patch based on that approach, ensuring that each base backup 
uses a different checkpoint as the start location. I think I'll commit 
this, rather than invent a new unique ID mechanism for backups. The 
latter would need changes in recovery and control file too, and I don't 
feel like tinkering with that at this stage.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 15af669..570f02b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -355,10 +355,13 @@ typedef struct XLogCtlInsert
 	 * exclusiveBackup is true if a backup started with pg_start_backup() is
 	 * in progress, and nonExclusiveBackups is a counter indicating the number
 	 * of streaming base backups currently in progress. forcePageWrites is
-	 * set to true when either of these is non-zero.
+	 * set to true when either of these is non-zero. lastBackupStart is the
+	 * latest checkpoint redo location used as a starting point for an online
+	 * backup.
 	 */
 	bool		exclusiveBackup;
 	int			nonExclusiveBackups;
+	XLogRecPtr	lastBackupStart;
 } XLogCtlInsert;
 
 /*
@@ -8809,6 +8812,19 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 		MAXPGPATH)));
 
 	/*
+	 * Force an XLOG file switch before the checkpoint, to ensure that the WAL
+	 * segment the checkpoint is written to doesn't contain pages with old
+	 * timeline IDs. That would otherwise happen if you called
+	 * pg_start_backup() right after restoring from a PITR archive: the first
+	 * WAL segment containing the startup checkpoint has pages in the
+	 * beginning with the old timeline ID. That can cause trouble at recovery:
+	 * we won't have a history file covering the old timeline if pg_xlog
+	 * directory was not included in the base backup and the WAL archive was
+	 * cleared too before starting the backup.
+	 */
+	RequestXLogSwitch();
+
+	/*
 	 * Mark backup active in shared memory.  We must do full-page WAL writes
 	 * during an on-line backup even if not doing so at other times, because
 	 * it's quite possible for the backup dump to obtain a torn (partially
@@ -8843,43 +8859,54 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 	XLogCtl->Insert.forcePageWrites = true;
 	LWLockRelease(WALInsertLock);
 
-	/*
-	 * Force an XLOG file switch before the checkpoint, to 

Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 3:22 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 If the recovery target is set to a point before the database is consistent,
 i.e. before minRecoveryPoint, we should throw an error before recovery even
 starts. I'm not sure if we check that at the moment.

I don't see how you could check that anyway.  How do you know where
you're going to see the given XID/timestamp/named restore point until
you actually get there?

 Not sure what to do if the recovery target is beyond minRecoveryPoint and
 pause_at_recovery_target=true, but the server hasn't been opened for hot
 standby yet (because it hasn't seen a running-xacts record yet). I agree
 it's pretty useless and annoying to stop there.

I think the reasonable options are "enter normal running" and "shut down".

In any event, it sounds like someone needs to fix this, and I don't
know enough to do it.  Can you or Fujii Masao do it?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 3:52 AM, Simon Riggs si...@2ndquadrant.com wrote:
 I agree to get rid of write_location.

 No, don't remove it.

 We seem to be just looking for things to tweak without any purpose.
 Removing this adds nothing for us.

 We will have the column in the future, it is there now, so leave it.

Well then can we revert the part of your patch that causes it to not
actually work any more?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 2:25 AM, Fujii Masao masao.fu...@gmail.com wrote:
 In the first place, I think that it's complicated to keep those two parameters
 separately. What about merging them to one parameter? What I'm thinking
 is to remove synchronous_replication and to increase the valid values of
 synchronous_commit from on/off to async/local/remote/both. Each value
 works as follows.

     async   = (synchronous_commit = off && synchronous_replication = off)
     async makes a transaction do local WAL flush and replication
 asynchronously.

     local   = (synchronous_commit = on && synchronous_replication = off)
     local makes a transaction wait only for local WAL flush.

     remote = (synchronous_commit = off && synchronous_replication = on)
     remote makes a transaction wait only for replication. Local WAL flush is
     performed by walwriter. This is useless in 9.1 because we must always
     wait for local WAL flush when we wait for replication. But in the future,
     if we're able to send WAL before the WAL write (i.e., send WAL from
     wal_buffers), this might become useful. In 9.1, it seems reasonable to
     omit this value.

     both    = (synchronous_commit = on && synchronous_replication = on)
     both makes a transaction wait for both local WAL flush and replication.

 Thought?

Well, if we want to make this all use one parameter, the obvious way
to do it that wouldn't break backward compatibility is to remove the
synchronous_replication parameter altogether and let
synchronous_commit take on the values on/local/off, where "on" means
wait for sync rep if it's enabled (i.e.
synchronous_standby_names != '' && max_wal_senders > 0) or otherwise just
wait for local WAL flush, "local" means just wait for local WAL flush,
and "off" means commit asynchronously.
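
In other words, roughly (a hedged sketch only, not a patch; the enum and helper
names are invented):

    #include <stdbool.h>

    /* Sketch of the proposed on/local/off semantics; names are invented. */
    typedef enum { SC_OFF, SC_LOCAL, SC_ON } SyncCommitSetting;

    bool
    wait_for_local_flush(SyncCommitSetting s)
    {
        return s != SC_OFF;         /* both "on" and "local" flush locally */
    }

    bool
    wait_for_sync_rep(SyncCommitSetting s, bool standby_names_set, int max_wal_senders)
    {
        /* "on" waits for the standby only if sync rep is actually configured */
        return s == SC_ON && standby_names_set && max_wal_senders > 0;
    }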

I'm OK with doing it that way if there's consensus on it, but I'm not
eager to break backward compatibility.  Simon/Heikki, any opinion on
that approach?

If we don't have consensus on that then I think we should just do what
I proposed above (and Simon agreed to).  I am not eager to spend any
longer than necessary hammering this out; I want to get to beta.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,

2011-03-18 Thread Heikki Linnakangas

On 18.03.2011 14:14, Robert Haas wrote:

On Fri, Mar 18, 2011 at 3:22 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com  wrote:

If the recovery target is set to a point before the database is consistent,
i.e. before minRecoveryPoint, we should throw an error before recovery even
starts. I'm not sure if we check that at the moment.


I don't see how you could check that anyway.  How do you know where
you're going to see the given XID/timestamp/named restore point until
you actually get there?


Oh, good point. I was thinking that the recovery target is a particular 
LSN, but clearly it's not.



Not sure what to do if the recovery target is beyond minRecoveryPoint and
pause_at_recovery_target=true, but the server hasn't been opened for hot
standby yet (because it hasn't seen a running-xacts record yet). I agree
it's pretty useless and annoying to stop there.


I think the reasonable options are "enter normal running" and "shut down".

In any event, it sounds like someone needs to fix this, and I don't
know enough to do it.  Can you or Fujii Masao do it?


You could also argue for "log a warning, continue until we can open for 
hot standby, then pause".


I can write the patch once we know what we want. All of those options 
sound reasonable to me. This is such a corner-case that it doesn't make 
sense to make it user-configurable, though.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-18 Thread Robert Haas
On Thu, Mar 17, 2011 at 6:00 PM, Jeff Davis pg...@j-davis.com wrote:
 On Wed, 2011-03-16 at 13:35 -0400, Robert Haas wrote:
 2. If a query cancel interrupt is received (pg_cancel_backend or ^C),
 then cancel the sync rep wait and issue a warning before acknowledging
 the commit.

 When I saw this commit, I noticed that the WARNING doesn't have an
 errcode(). It seems like it should -- this is the kind of thing that the
 client is likely to care about, and may want to handle specially.

Should I invent ERRCODE_WARNING_TRANSACTION_NOT_REPLICATED?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 8:27 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 You could also argue for "log a warning, continue until we can open for hot
 standby, then pause".

I don't like that one much.

 I can write the patch once we know what we want. All of those options sound
 reasonable to me. This is such a corner-case that it doesn't make sense to
 make it user-configurable, though.

I agree.  Since pause_at_recovery_target is ignored when
hot_standby=off, I think it would be consistent to treat the case
where hot_standby=on but can't actually be initiated the same way -
just ignore the pause request and enter normal running.  However, I
don't have a super-strong feeling that that's the only sensible way to
go, so count me as +0.5 for that approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 9:16 AM, MARK CALLAGHAN mdcal...@gmail.com wrote:
 On Fri, Mar 18, 2011 at 9:27 AM, Markus Wanner mar...@bluegap.ch wrote:
 Google invented the term semi-syncronous for something that's
 essentially the same that we have, now, I think.  However, I full
 heartedly hate that term (based on the reasoning that there's no
 semi-pregnant, either).

 We didn't invent the term, we just implemented something that Heikki
 Tuuri briefly described, for example:
 http://bugs.mysql.com/bug.php?id=7440

 In the Google patch and official MySQL version, the sequence is:
 1) commit on master
 2) wait for slave to ack
 3) return to user

 After step 1 another user on the master can observe the commit and the
 following is possible:
 1) commit on master
 2) other user observes that commit on master
 3) master blows up and a user observed a commit that never made it to a slave

 I do not think this sequence should be possible in a sync replication
 system. But it is possible in what has been implemented for MySQL.
 Thus it was named semi-sync rather than sync.

Thanks for the insight.  That can't happen with our implementation, I believe.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FK constraints NOT VALID by default?

2011-03-18 Thread Robert Haas
On Thu, Mar 17, 2011 at 5:32 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Mar 17, 2011 at 5:29 PM, Andrew Dunstan and...@dunslane.net wrote:
 Is this really intended?

 I sure hope not.

 That's a bug.  Not sure if it's a psql bug or a backend bug, but it's
 definitely a bug.

It's a backend bug.  Prior to Simon's patch, there was an existing
skip_validation flag in the Constraint node that indicated whether or
not a validation pass was necessary - in a newly created table, for
example, we know that it's NOT necessary, because the table can't
contain any rows (and therefore there can't be any rows that violate
the constraint).  The patch tries to make the very same flag indicate
whether the user wants the constraint to be added with the NOT VALID
attribute, which of course falls over because the Boolean only has two
values and there are three cases (validate it, don't validate it but
do mark it valid because the table is guaranteed to be empty, don't
validate it and mark it not valid).
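
To picture the three cases (a sketch only, not the actual fix):

    /* Sketch only, not the committed fix: one boolean cannot carry three states. */
    typedef enum
    {
        CONSTRAINT_VALIDATE,        /* scan the table and check existing rows */
        CONSTRAINT_SKIP_VALIDATION, /* no scan needed, but mark the constraint
                                     * valid (e.g. added on a new, empty table) */
        CONSTRAINT_NOT_VALID        /* no scan, and mark the constraint NOT VALID */
    } ConstraintValidationMode;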

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Kevin Grittner
MARK CALLAGHAN mdcal...@gmail.com wrote:
 Markus Wanner mar...@bluegap.ch wrote:
 
 Google invented the term semi-syncronous for something that's
 essentially the same that we have, now, I think.  However, I full
 heartedly hate that term (based on the reasoning that there's no
 semi-pregnant, either).
 
To be fair, what we're considering calling semi-synchronous is
something which tries to stay in synchronous mode but switches out
of it when necessary to meet availability targets.  Your analogy
doesn't match up at all well -- at least without getting really
ugly.
 
 We didn't invent the term, we just implemented something that
 Heikki Tuuri briefly described, for example:
 http://bugs.mysql.com/bug.php?id=7440
 
 In the Google patch and official MySQL version, the sequence is:
 1) commit on master
 2) wait for slave to ack
 3) return to user
 
 After step 1 another user on the master can observe the commit and
 the following is possible:
 1) commit on master
 2) other user observes that commit on master
 3) master blows up and a user observed a commit that never made it
 to a slave
 
 I do not think this sequence should be possible in a sync
 replication system.
 
Then the only thing you would consider sync replication, as far as I
can see, is two phase commit, which we already have.  So your use
case seems to be covered already, and we're trying to address other
people's needs.  The guarantee that some people are looking for is
that a successful commit means that the data has been persisted on
two separate servers.  Others want to try for that, but are willing
to compromise it for HA; in general I think they want to know when
the guarantee is not there so they can take action to get back to a
safer condition.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_dump -X

2011-03-18 Thread Robert Haas
On Mon, Mar 14, 2011 at 9:56 AM, Robert Haas robertmh...@gmail.com wrote:
 On Sat, Mar 12, 2011 at 12:56 AM, Bruce Momjian br...@momjian.us wrote:
 Presumably the point of deprecating the feature is that we'd
 eventually remove it.  If 4 major releases isn't long enough, what is?

 Good point.

 Unless there are further objections, I think we should go ahead and remove 
 this.

 If there ARE further objections, then please say what release you
 think it would be OK to remove it in, or why you think it's worth
 keeping around indefinitely given that the last version in which it
 was documented is now EOL.

Hearing no further objections, I have removed this code.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Robert Haas
On Mon, Mar 7, 2011 at 3:44 AM, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Mar 7, 2011 at 5:27 PM, Fujii Masao masao.fu...@gmail.com wrote:
 On Mon, Mar 7, 2011 at 7:51 AM, Simon Riggs si...@2ndquadrant.com wrote:
 Efficient transaction-controlled synchronous replication.
 If a standby is broadcasting reply messages and we have named
 one or more standbys in synchronous_standby_names then allow
 users who set synchronous_replication to wait for commit, which
 then provides strict data integrity guarantees. Design avoids
 sending and receiving transaction state information so minimises
 bookkeeping overheads. We synchronize with the highest priority
 standby that is connected and ready to synchronize. Other standbys
 can be defined to takeover in case of standby failure.

 This version has very strict behaviour; more relaxed options
 may be added at a later date.

 Pretty cool! I'd appreciate very much your efforts and contributions.

 And, I found one bug ;) You seem to have wrongly removed the check
 of max_wal_senders in SyncRepWaitForLSN. This can make the
 backend wait for replication even if max_wal_senders = 0. I could reproduce
 this problematic situation on my machine. The attached patch fixes this
 problem.

         if (strlen(SyncRepStandbyNames) > 0 && max_wal_senders == 0)
                 ereport(ERROR,
                                 (errmsg("Synchronous replication requires WAL streaming (max_wal_senders > 0)")));

 Shouldn't the above check also be required after pg_ctl reload, since
 synchronous_standby_names can be changed by SIGHUP?
 Or how about just removing it? If the patch I submitted is
 committed, the empty synchronous_standby_names and max_wal_senders = 0
 setting is no longer unsafe.

This configuration is now harmless in the sense that it no longer
horribly breaks the entire system, but it's still pretty useless, so
this might be deemed a valuable sanity check.  However, I'm reluctant
to leave it in there, because someone could change their config to
this state, pg_ctl reload, see everything working, and then later stop
the cluster and be unable to start it back up again.  Since most
people don't shut their database systems down very often, they might
not discover that they have an invalid config until much later.  I
think it's probably not a good idea to have configs that are valid on
reload but prevent startup, so I'm inclined to either remove this
check altogether or downgrade it to a warning.
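
To illustrate the warning variant, a standalone sketch (plain C, not a patch; the
real check would keep using ereport(), as in the snippet quoted above):

    #include <stdio.h>
    #include <string.h>

    /* Standalone illustration only; not PostgreSQL code. */
    void
    check_sync_rep_config(const char *sync_standby_names, int max_wal_senders)
    {
        if (strlen(sync_standby_names) > 0 && max_wal_senders == 0)
            fprintf(stderr,
                    "WARNING: synchronous_standby_names is set but max_wal_senders = 0; "
                    "synchronous replication will not work\n");
    }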

As a side note, it's not very obvious why some parts of PostmasterMain
report problems by doing write_stderr() and exit() while other parts
use ereport(ERROR).  This check and the nearby checks on WAL level are
immediately preceded and followed by other checks that use the
opposite technique.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] I am confused after reading codes of PostgreSQL three week

2011-03-18 Thread hom
2011/3/17 Bruce Momjian br...@momjian.us:
 hom wrote:
 Hi,

   I am trying to learn how a database is implemented, and I have been reading
 the PG source code for a month.

 Now, I only know a little about how PG works.  :(

 I just know that PG works like this, but I don't know why PG works like this.  :(  :(

 Even worse, I don't feel I understand the source code any better. It may be
 that I couldn't split the large modules into small pieces, which might help
 with understanding.

 Is there any article, or some other way, that could help me understand the source code?

 I assume you have looked at these places:

        http://wiki.postgresql.org/wiki/Developer_FAQ
        http://www.postgresql.org/developer/coding

 --
  Bruce Momjian  br...@momjian.us        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +


Thanks Bruce.
I am also reading your book "PostgreSQL Introduction and Concepts". :)

-- 
Best Wishes!

                                     hom

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] I am confused after reading codes of PostgreSQL three week

2011-03-18 Thread hom
2011/3/17 Kevin Grittner kevin.gritt...@wicourts.gov:
 hom obsidian...@gmail.com wrote:

 I try to known how a database is implemented and I have been
 reading PG source codes for a month.

 That's ambitious.

 find -name '*.h' -or -name '*.c' \
  | egrep -v '^\./src/test/.+/tmp_check/' \
  | xargs cat | wc -l
 1059144

 Depending on how you do the math, that's about 50,000 lines of code
 per day to get through it in the time you mention.

 Is there any article or some way could help understand the source
 code ?

 Your best bet would be to follow links from the Developers tab on
 the main PostgreSQL web site:

 http://www.postgresql.org/developer/

 In particular the Developer FAQ page:

 http://wiki.postgresql.org/wiki/Developer_FAQ

 And the Coding links:

 http://www.postgresql.org/developer/coding

 may help.

 Before reading code in a directory, be sure to read any README
 file(s) in that directory carefully.

 It helps to read this list.

 In spite of reviewing all of that myself, it was rather intimidating
 when I went to work on a major patch 14 months ago.  Robert Haas
 offered some good advice which served me well in that effort --
 divide the effort in to a series of incremental steps, each of which
 deals with a small enough portion of the code to get your head
 around.  As you work in any one narrow area, it becomes increasingly
 clear; with that as a base you can expand your scope.

 When you're working in the code, it is tremendously helpful to use
 an editor with ctags support (or similar IDE functionality).

 I hope this is helpful.  Good luck.

 -Kevin


Thanks Kevin.
I will follow your advice, and I will also post questions to the
mailing list for help.
Thanks a lot.

-- 
Best Wishes!

                                     hom

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FK constraints NOT VALID by default?

2011-03-18 Thread Simon Riggs
On Fri, 2011-03-18 at 09:39 -0400, Robert Haas wrote:
 On Thu, Mar 17, 2011 at 5:32 PM, Robert Haas robertmh...@gmail.com wrote:
  On Thu, Mar 17, 2011 at 5:29 PM, Andrew Dunstan and...@dunslane.net wrote:
  Is this really intended?
 
  I sure hope not.
 
  That's a bug.  Not sure if it's a psql bug or a backend bug, but it's
  definitely a bug.
 
 It's a backend bug.  Prior to Simon's patch, there was an existing
 skip_validation flag in the Constraint node that indicated whether or
 not a validation pass was necessary - in a newly created table, for
 example, we know that it's NOT necessary, because the table can't
 contain any rows (and therefore there can't be any rows that violate
 the constraint).  The patch tries to make the very same flag indicate
 whether the user wants the constraint to be added with the NOT VALID
 attribute, which of course falls over because the Boolean only has two
 values and there are three cases (validate it, don't validate it but
 do mark it valid because the table is guaranteed to be empty, don't
 validate it and mark it not valid).

Thanks Robert. Yes, my bad. Will fix.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-18 Thread Jeff Davis
On Fri, 2011-03-18 at 08:27 -0400, Robert Haas wrote:
 On Thu, Mar 17, 2011 at 6:00 PM, Jeff Davis pg...@j-davis.com wrote:
  On Wed, 2011-03-16 at 13:35 -0400, Robert Haas wrote:
  2. If a query cancel interrupt is received (pg_cancel_backend or ^C),
  then cancel the sync rep wait and issue a warning before acknowledging
  the commit.
 
  When I saw this commit, I noticed that the WARNING doesn't have an
  errcode(). It seems like it should -- this is the kind of thing that the
  client is likely to care about, and may want to handle specially.
 
 Should I invent ERRCODE_WARNING_TRANSACTION_NOT_REPLICATED?

I think it's reasonable to invent a new code here. Perhaps use the word
"synchronous" rather than "replicated", though?

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Simon Riggs
On Fri, 2011-03-18 at 13:16 +, MARK CALLAGHAN wrote:
 On Fri, Mar 18, 2011 at 9:27 AM, Markus Wanner mar...@bluegap.ch wrote:
  Google invented the term semi-syncronous for something that's
  essentially the same that we have, now, I think.  However, I full
  heartedly hate that term (based on the reasoning that there's no
  semi-pregnant, either).
 
 We didn't invent the term, we just implemented something that Heikki
 Tuuri briefly described, for example:
 http://bugs.mysql.com/bug.php?id=7440
 
 In the Google patch and official MySQL version, the sequence is:
 1) commit on master
 2) wait for slave to ack
 3) return to user
 
 After step 1 another user on the master can observe the commit and the
 following is possible:
 1) commit on master
 2) other user observes that commit on master
 3) master blows up and a user observed a commit that never made it to a slave
 
 I do not think this sequence should be possible in a sync replication
 system. But it is possible in what has been implemented for MySQL.
 Thus it was named semi-sync rather than sync.

Thanks for clearing it up Mark.

We should definitely not be calling what we have semi-sync. The
semantics are very different.

In PostgreSQL other users cannot observe the commit until an
acknowledgement has been received.
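
Schematically (a hedged sketch; these are placeholder functions, not the real
xact.c code):

    #include <stdio.h>

    /* Placeholder functions for illustration only. */
    static void flush_local_wal(void)      { puts("commit record flushed to local WAL"); }
    static void wait_for_standby_ack(void) { puts("standby acknowledged the flush"); }
    static void make_commit_visible(void)  { puts("commit becomes visible to other sessions"); }

    /* The sync rep wait happens before the commit is made visible. */
    void
    commit_transaction(void)
    {
        flush_local_wal();
        wait_for_standby_ack();
        make_commit_visible();
    }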

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Markus Wanner
Mark,

On 03/18/2011 02:16 PM, MARK CALLAGHAN wrote:
 We didn't invent the term, we just implemented something that Heikki
 Tuuri briefly described, for example:
 http://bugs.mysql.com/bug.php?id=7440

Oh, okay, good to know who to blame ;-)  However, I didn't mean to
offend anybody.

 I do not think this sequence should be possible in a sync replication
 system. But it is possible in what has been implemented for MySQL.
 Thus it was named semi-sync rather than sync.

Sure?

Their documentation [1] isn't entirely clear on that at first: "the master
blocks after the commit is done and waits until at least one
semisynchronous slave acknowledges that it has received all events for
the transaction", and "the slave acknowledges receipt of a transaction's
events only after the events have been written to its relay log and
flushed to disk."

But then it continues to say that [the master is] "waiting for
acknowledgment from a slave after having performed a commit", so this
indeed sounds like the transaction is visible to other sessions before
the slave ACKs.

So, semi-sync may show temporary inconsistencies in case of a master
failure.  Wow!

Regards

Markus Wanner


[1] MySQL 5.5 reference manual, 17.3.8. Semisynchronous Replication:
http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Fix various possible problems with synchronous replication.

2011-03-18 Thread Robert Haas
On Thu, Mar 17, 2011 at 1:59 PM, Thom Brown t...@linux.com wrote:
 On 17 March 2011 17:55, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Mar 17, 2011 at 1:24 PM, Thom Brown t...@linux.com wrote:
 errdetail("The transaction has already been committed locally but
 might have not been replicated to the standby.")));
 errdetail("The transaction has committed locally, but may not have
 replicated to the standby.")));

 Could we have these saying precisely the same thing?

 Yeah.  Which is better?

 Personally I prefer the 2nd.  It reads better somehow.

I hacked on this a bit more and ended up with a hybrid of the two.
Hope you like it; but anyway it's consistent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 10:17 AM, Jeff Davis pg...@j-davis.com wrote:
 On Fri, 2011-03-18 at 08:27 -0400, Robert Haas wrote:
 On Thu, Mar 17, 2011 at 6:00 PM, Jeff Davis pg...@j-davis.com wrote:
  On Wed, 2011-03-16 at 13:35 -0400, Robert Haas wrote:
  2. If a query cancel interrupt is received (pg_cancel_backend or ^C),
  then cancel the sync rep wait and issue a warning before acknowledging
  the commit.
 
  When I saw this commit, I noticed that the WARNING doesn't have an
  errcode(). It seems like it should -- this is the kind of thing that the
  client is likely to care about, and may want to handle specially.

 Should I invent ERRCODE_WARNING_TRANSACTION_NOT_REPLICATED?

 I think it's reasonable to invent a new code here. Perhaps use the word
 synchronous rather than replicated, though?

I think we have to, because it's definitely not the same situation
that someone would expect after ERRCODE_QUERY_CANCELLED.

But ERRCODE_WARNING_TRANSACTION_NOT_SYNCHRONOUS, which is what I read
your reply as suggesting, seems pretty wonky.  I wouldn't know what
that meant.  Another option might be:

ERRCODE_(WARNING_?)REPLICATION_WAIT_CANCELLED

...which might have something to recommend it.
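
If we go that way, the definition would presumably follow the errcodes.h pattern;
a hedged sketch (the helper macros are reproduced from memory, and the SQLSTATE
value below is made up, just something in the 01xxx warning class):

    /* Sketch only; the macro is reproduced from memory, the code value is made up. */
    #define PGSIXBIT(ch)    (((ch) - '0') & 0x3F)
    #define MAKE_SQLSTATE(ch1, ch2, ch3, ch4, ch5) \
        (PGSIXBIT(ch1) + (PGSIXBIT(ch2) << 6) + (PGSIXBIT(ch3) << 12) + \
         (PGSIXBIT(ch4) << 18) + (PGSIXBIT(ch5) << 24))

    /* hypothetical SQLSTATE in the 01xxx (warning) class; name is one candidate */
    #define ERRCODE_WARNING_REPLICATION_WAIT_CANCELLED  MAKE_SQLSTATE('0', '1', '0', '0', '8')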

Other thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] I am confused after reading codes of PostgreSQL three week

2011-03-18 Thread hom
2011/3/18 Markus Wanner mar...@bluegap.ch:
 Hom,

 On 03/17/2011 04:49 PM, Kevin Grittner wrote:
 That's ambitious.

 Absolutely, yes.  Exercise patience with yourself.

 A method that hasn't been mentioned, yet, is digging out your debugger
 and attach it to a connected Postgres backend.  You can then issue a
 query you are interested in and follow the backend doing its work.

 That's particularly helpful in trying to find a certain spot of
 interest.  Of course, it doesn't help much in getting the big picture.

 Good luck on your journey through the code base.

 Regards

 Markus Wanner


Thanks Markus.
It's a hard time at the beginning.
I should stay patient. :)

-- 
Best Wishes!

                                     hom

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] I am confused after reading codes of PostgreSQL three week

2011-03-18 Thread hom
2011/3/18 Brendan Jurd dire...@gmail.com:
 On 18 March 2011 01:57, hom obsidian...@gmail.com wrote:
  I try to known how a database is implemented

 This objective is so vast and so vague that it's difficult to give
 meaningful help.

 I'd emphasise Kevin Grittner's very worthwhile advice.  Try to break
 your question down into smaller, more specific ones.  With a question
 like how does postgres work you're likely to flounder.  But with a
 more targeted question, e.g., what format does postgres use to save
 data to disk or how does postgres implement ORDER BY, you can make
 easier progress, and perhaps you could get more useful pointers from
 the people on this list.

 Have you read through the Overview of System Internals chapter in
 the documentation [1]?  Perhaps it will help you identify the areas
 you wish to explore further, and form more specific questions.

 [1] http://www.postgresql.org/docs/current/static/overview.html

 Cheers,
 BJ


Thanks Brendan.
I had a quick glance at "Overview of System Internals" before.
I think it is time to read it again.

-- 
Best Wishes!

                                     hom

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Markus Wanner
Hi,

On 03/18/2011 02:40 PM, Kevin Grittner wrote:
 Then the only thing you would consider sync replication, as far as I
 can see, is two phase commit

I think waiting for the ACK before actually making the changes from the
transaction visible (COMMIT) would suffice for disallowing such an
inconsistency to manifest.  But obviously, MySQL decided it's not worth
doing that, as it's such a rare event and a short period of time that
may show inconsistencies...

 people's needs.  The guarantee that some people are looking for is
 that a successful commit means that the data has been persisted on
 two separate servers.

Well, MySQL's semi-sync also seems to guarantee that WRT the client
confirmation.  And transactions always appear committed *before* the
client receives the COMMIT acknowledgement, due to the time it takes for
the ACK to arrive at the client.

It's just the commit *before* receiving the slave's ACK which might
make a transaction visible that's not durable yet.  But I guess that
simplified the implementation for them...

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Fix various possible problems with synchronous replication.

2011-03-18 Thread Thom Brown
On 18 March 2011 14:23, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Mar 17, 2011 at 1:59 PM, Thom Brown t...@linux.com wrote:
 On 17 March 2011 17:55, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Mar 17, 2011 at 1:24 PM, Thom Brown t...@linux.com wrote:
 errdetail("The transaction has already been committed locally but
 might have not been replicated to the standby.")));
 errdetail("The transaction has committed locally, but may not have
 replicated to the standby.")));

 Could we have these saying precisely the same thing?

 Yeah.  Which is better?

 Personally I prefer the 2nd.  It reads better somehow.

 I hacked on this a bit more and ended up with a hybrid of the two.
 Hope you like it; but anyway it's consistent.

Yes, cheers :)

-- 
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] I am confused after reading codes of PostgreSQL three week

2011-03-18 Thread hom
2011/3/18 Vaibhav Kaushal vaibhavkaushal...@gmail.com:
 Hi,
 That was the question I was facing 5 months ago and trust me I am doing it
 even now. With an average of 6+ hours going into PostgreSQL Code, even with
 best practices (as suggested by the developers) I still think I know less
 than 10 percent. It is too huge to be swallowed at once.
 I too had to break it down into pieces and because everything is so
 interconnected with everything else, it is quite complicated in the
 beginning. Start with one piece; planner, parser, executor, storage
 management whatever and slowly it should help you get the bigger picture.
 regards,
 Vaibhav
 I had to break it into

Thanks Vaibhav .
I have stepped into the parser before, but I ran into a problem:

when I debug and step into scanner_init(), Eclipse always opens scan.l
and the execution order does not match that file.
I think it should actually be scan.c, but I don't know how to trace
into scan.c :(
PS: I have turned the Search for duplicate source files option on.

I have posted to the mailing list, but it has not been solved.

here is the link:
http://postgresql.1045698.n5.nabble.com/Open-unmatch-source-file-when-step-into-parse-analyze-in-Eclipse-td3408033.html

-- 
Best Wishes!

                                     hom

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Kevin Grittner
Simon Riggs si...@2ndquadrant.com wrote:
 
 In PostgreSQL other users cannot observe the commit until an
 acknowledgement has been received.
 
Really?  I hadn't picked up on that.  That makes for a lot of
complication on crash-and-recovery of a master, but if we can pull
it off, that's really cool.  If we do that and MySQL doesn't, we
definitely don't want to use the same terminology they do, which
would imply the same behavior.
 
Apologies for not picking up on that aspect of the implementation.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Greg Stark
On Thu, Mar 17, 2011 at 5:46 PM, Robert Haas robertmh...@gmail.com wrote:
 What makes more sense to me after having thought about this more
 carefully is to simply make a blanket rule that when
 synchronous_replication=on, synchronous_commit has no effect.  That is
 easy to understand and document.

For what it's worth, "has no effect" doesn't make much sense to me.
It's a boolean: either commits are going to block or they're not.

What happened to the idea of a three-way switch?

synchronous_commit = off
synchronous_commit = disk
synchronous_commit = replica

With "on" being a synonym for "disk" for backwards compatibility.

Then we could add more options later for more complex conditions like
waiting for one server in each data centre or waiting for one of a
certain set of servers ignoring the less reliable mirrors, etc.
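
Just to sketch it (the values below other than today's on/off are purely the
proposal, not anything that exists, so they're shown commented out):

SET synchronous_commit = off;  -- don't wait at all
SET synchronous_commit = on;   -- wait for local WAL flush, today's behaviour
-- proposed, hypothetical:
-- SET synchronous_commit = disk;     -- explicit spelling of today's "on"
-- SET synchronous_commit = replica;  -- additionally wait for a standby ack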

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread rsmogura

On Thu, 17 Mar 2011 16:02:18 -0500, Kevin Grittner wrote:

Radosław Smogura rsmog...@softperience.eu wrote:


I have implemented initial concept of 2nd level cache. Idea is to
keep some segments of shared memory for special buffers (e.g.
indices) to prevent overwrite those by other operations. I added
those functionality to nbtree index scan.

I tested this with doing index scan, seq read, drop system
buffers, do index scan and in few places I saw performance
improvements, but actually, I'm not sure if this was just random
or intended improvement.


I've often wondered about this.  In a database I developed back in
the '80s it was clearly a win to have a special cache for index
entries and other special pages closer to the database than the
general cache.  A couple things have changed since the '80s (I mean,
besides my waistline and hair color), and PostgreSQL has many
differences from that other database, so I haven't been sure it
would help as much, but I have wondered.

I can't really look at this for a couple weeks, but I'm definitely
interested.  I suggest that you add this to the next CommitFest as a
WIP patch, under the Performance category.

https://commitfest.postgresql.org/action/commitfest_view/open


There is few places to optimize code as well, and patch need many
work, but may you see it and give opinions?


For something like this it makes perfect sense to show proof of
concept before trying to cover everything.

-Kevin


Yes, there is some change, and I looked at this more carefully, as my 
performance results weren't what I expected. I found PG uses a 
BufferAccessStrategy to do sequential scans, so my test query took only 32 
buffers from the pool and didn't overwrite the index pool too much. This BAS 
is really surprising. In any case, when I finish polishing I will send a good 
patch, with proof.


Actually, the idea of this patch was like this:
Some operations require many buffers, and PG uses the clock sweep to get 
the next free buffer, so it may overwrite an index buffer. From the point of 
view of good database design we should use indices, so purging an index out 
of the cache will affect performance.


As a side effect I saw that this 2nd level keeps the pg_* indices in 
memory too, so I am thinking of including a 3rd-level cache for some pg_* tables.


Regards,
Radek

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 10:55 AM, Greg Stark gsst...@mit.edu wrote:
 On Thu, Mar 17, 2011 at 5:46 PM, Robert Haas robertmh...@gmail.com wrote:
 What makes more sense to me after having thought about this more
 carefully is to simply make a blanket rule that when
 synchronous_replication=on, synchronous_commit has no effect.  That is
 easy to understand and document.

 For what it's worth has no effect doesn't make much sense to me.
 It's a boolean, either commits are going to block or they're not.

 What happened to the idea of a three-way switch?

 synchronous_commit = off
 synchronous_commit = disk
 synchronous_commit = replica

 With on being a synonym for disk for backwards compatibility.

 Then we could add more options later for more complex conditions like
 waiting for one server in each data centre or waiting for one of a
 certain set of servers ignoring the less reliable mirrors, etc.

This is similar to what I suggested upthread, except that I suggested
on/local/off, with the default being on.  That way if you set
synchronous_standby_names, you get synchronous replication without
changing another setting, but you can say local instead if for some
reason you want the middle behavior.  If we're going to do it all with
one GUC, I think that way makes more sense.  If you're running sync
rep, you might still have some transactions that you don't care about,
but that's what async commit is for.  It's a funny kind of transaction
that we're OK with losing if we have a failover but we're not OK with
losing if we have a local crash from which we recover without failing
over.
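
To illustrate that last bit: a transaction can already opt down to async
commit on its own today -- just a sketch, with "audit_log" as a made-up table:

CREATE TABLE audit_log (t timestamptz, note text);   -- example table only

BEGIN;
SET LOCAL synchronous_commit = off;   -- this commit won't wait
INSERT INTO audit_log VALUES (now(), 'not worth waiting for');
COMMIT;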

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Kevin Grittner
rsmogura rsmog...@softperience.eu wrote:
 
 Yes, there is some change, and I looked at this more carefully, as
 my performance results weren't what I expected. I found PG uses a
 BufferAccessStrategy to do sequential scans, so my test query took
 only 32 buffers from the pool and didn't overwrite the index pool too
 much. This BAS is really surprising. In any case, when I finish
 polishing I will send a good patch, with proof.
 
Yeah, that heuristic makes this less critical, for sure.
 
 Actually, the idea of this patch was like this:
 Some operations require many buffers, and PG uses the clock sweep to
 get the next free buffer, so it may overwrite an index buffer. From the
 point of view of good database design we should use indices, so purging
 an index out of the cache will affect performance.
 
 As a side effect I saw that this 2nd level keeps the pg_* indices
 in memory too, so I am thinking of including a 3rd-level cache for some
 pg_* tables.
 
Well, the more complex you make it the more overhead there is, which
makes it harder to come out ahead.  FWIW, in musing about it (as
recently as this week), my idea was to add another field which would
factor into the clock sweep calculations.  For indexes, it might be
levels above leaf pages.  I haven't reviewed the code in depth to
know how to use it, this was just idle daydreaming based on that
prior experience.  It's far from certain that the concept will
actually prove beneficial in PostgreSQL.
 
Maybe the thing to focus on first is the oft-discussed benchmark
farm (similar to the build farm), with a good mix of loads, so
that the impact of changes can be better tracked for multiple
workloads on a variety of platforms and configurations.  Without
something like that it is very hard to justify the added complexity
of an idea like this in terms of the performance benefit gained.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-18 Thread Jeff Davis
On Fri, 2011-03-18 at 10:27 -0400, Robert Haas wrote:
 ERRCODE_(WARNING_?)REPLICATION_WAIT_CANCELLED
 
 ...which might have something to recommend it.

Works for me.

Regards,
Jeff Davis


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Simon Riggs
On Fri, 2011-03-18 at 11:07 -0400, Robert Haas wrote:
 On Fri, Mar 18, 2011 at 10:55 AM, Greg Stark gsst...@mit.edu wrote:
  On Thu, Mar 17, 2011 at 5:46 PM, Robert Haas robertmh...@gmail.com wrote:
  What makes more sense to me after having thought about this more
  carefully is to simply make a blanket rule that when
  synchronous_replication=on, synchronous_commit has no effect.  That is
  easy to understand and document.
 
  For what it's worth has no effect doesn't make much sense to me.
  It's a boolean, either commits are going to block or they're not.
 
  What happened to the idea of a three-way switch?
 
  synchronous_commit = off
  synchronous_commit = disk
  synchronous_commit = replica
 
  With on being a synonym for disk for backwards compatibility.
 
  Then we could add more options later for more complex conditions like
  waiting for one server in each data centre or waiting for one of a
  certain set of servers ignoring the less reliable mirrors, etc.
 
 This is similar to what I suggested upthread, except that I suggested
 on/local/off, with the default being on.  That way if you set
 synchronous_standby_names, you get synchronous replication without
 changing another setting, but you can say local instead if for some
 reason you want the middle behavior.  If we're going to do it all with
 one GUC, I think that way makes more sense.  If you're running sync
 rep, you might still have some transactions that you don't care about,
 but that's what async commit is for.  It's a funny kind of transaction
 that we're OK with losing if we have a failover but we're not OK with
 losing if we have a local crash from which we recover without failing
 over.

I much prefer a single switch, which is what I originally suggested.
Changing the meaning of synchronous_commit seems a problem.

durability = localmemory
durability = localdisk
(durability = remotereceive - has no meaning in current code)
durability = remotedisk
durability = remoteapply

It also allows us to add further levels in the future.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Markus Wanner
On 03/18/2011 03:52 PM, Kevin Grittner wrote:
 Really?  I hadn't picked up on that.  That makes for a lot of
 complication on crash-and-recovery of a master

What complication do you have in mind here?

I think of it the opposite way (at least for Postgres, that is):
committing a transaction that's not acknowledged means having to revert
a (locally only) committed transaction if you want to use the current
data to recover to some cluster-agreed state.  (Of course, you can
always simply transfer the whole

If you don't commit the transaction before the ACK in the first place,
you don't have anything special to do upon recovery.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Heikki Linnakangas

On 18.03.2011 16:52, Kevin Grittner wrote:

Simon Riggs si...@2ndquadrant.com  wrote:


In PostgreSQL other users cannot observe the commit until an
acknowledgement has been received.


Really?  I hadn't picked up on that.  That makes for a lot of
complication on crash-and-recovery of a master, but if we can pull
it off, that's really cool.  If we do that and MySQL doesn't, we
definitely don't want to use the same terminology they do, which
would imply the same behavior.


To be clear: other users cannot observe the commit until standby 
acknowledges it - unless the master crashes while waiting for the 
acknowledgment. If that happens, the commit will be visible to everyone 
after recovery.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread MARK CALLAGHAN
On Fri, Mar 18, 2011 at 2:19 PM, Markus Wanner mar...@bluegap.ch wrote:

 Their documentation [1] isn't entirely clear on that first: the master
 blocks after the commit is done and waits until at least one
 semisynchronous slave acknowledges that it has received all events for
 the transaction and the slave acknowledges receipt of a transaction's
 events only after the events have been written to its relay log and
 flushed to disk.

 But then continues to say that [the master is] waiting for
 acknowledgment from a slave after having performed a commit, so this
 indeed sounds like the transaction is visible to other sessions before
 the slave ACKs.

Yes, their docs are not clear on this.

-- 
Mark Callaghan
mdcal...@gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-18 Thread Heikki Linnakangas

On 18.03.2011 17:38, Jeff Davis wrote:

On Fri, 2011-03-18 at 10:27 -0400, Robert Haas wrote:

ERRCODE_(WARNING_?)REPLICATION_WAIT_CANCELLED

...which might have something to recommend it.


Works for me.


Yes, sounds reasonable. Without WARNING_, please.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread MARK CALLAGHAN
On Fri, Mar 18, 2011 at 2:37 PM, Markus Wanner mar...@bluegap.ch wrote:
 Hi,

 On 03/18/2011 02:40 PM, Kevin Grittner wrote:
 Then the only thing you would consider sync replication, as far as I
 can see, is two phase commit

 I think waiting for the ACK before actually making the changes from the
 transaction visible (COMMIT) would suffice for disallowing such an
 inconsistency to manifest.  But obviously, MySQL decided it's not worth
 doing that, as it's such a rare event and a short period of time that
 may show inconsistencies...

There are fewer options for implementing this in MySQL because
replication requires a binlog on the master and that requires the
internal use of XA to keep the binlog and InnoDB in sync as they are
separate resource managers. In theory, this can be changed so that
commit is only forced for the binlog and then on a crash missing
transactions could be copied from the binlog to InnoDB but I don't
think this will ever change.

By "fewer options" I mean that commit in MySQL with InnoDB and the
binlog requires:
1) prepare to InnoDB (force transaction log to disk for changes from
this transaction)
2) write binlog events from this transaction to the binlog
3) write XID event to the binlog (at this point transaction commit is
official, will survive a crash)
4) force binlog to disk
5) release row locks held by transaction in innodb
6) write commit record to innodb transaction log
7) force write of commit record to disk

Group commit is done for the fsyncs from steps 1 and 7. It is not done
for the fsync done in step 4.

Regardless, the processing above is complicated even without
semi-sync. AFAIK, semi-sync code occurs after step 7 but I have not
looked at the official version of semi-sync code in MySQL and my
memory of the work we did at Google is vague.

It is great if Postgres doesn't have this issue. It wasn't clear to me
from lurking on this list. I hope your docs highlight the behavior as
not having the issue is a big deal.

-- 
Mark Callaghan
mdcal...@gmail.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Simon Riggs
On Fri, 2011-03-18 at 17:47 +0200, Heikki Linnakangas wrote:
 On 18.03.2011 16:52, Kevin Grittner wrote:
  Simon Riggs si...@2ndquadrant.com  wrote:
 
  In PostgreSQL other users cannot observe the commit until an
  acknowledgement has been received.
 
  Really?  I hadn't picked up on that.  That makes for a lot of
  complication on crash-and-recovery of a master, but if we can pull
  it off, that's really cool.  If we do that and MySQL doesn't, we
  definitely don't want to use the same terminology they do, which
  would imply the same behavior.
 
 To be clear: other users cannot observe the commit until standby 
 acknowledges it - unless the master crashes while waiting for the 
 acknowledgment. If that happens, the commit will be visible to everyone 
 after recovery.

No, only in the case where you choose not to failover to the standby
when you crash, which would be a fairly strange choice after the effort
to set up the standby. In a correctly configured and operated cluster
what I say above is fully correct and needs no addendum.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Maybe the thing to focus on first is the oft-discussed benchmark
 farm (similar to the build farm), with a good mix of loads, so
 that the impact of changes can be better tracked for multiple
 workloads on a variety of platforms and configurations.  Without
 something like that it is very hard to justify the added complexity
 of an idea like this in terms of the performance benefit gained.

A related area that could use some looking at is why performance tops
out at shared_buffers ~8GB and starts to fall thereafter.  InnoDB can
apparently handle much larger buffer pools without a performance
drop-off.  There are some advantages to our reliance on the OS buffer
cache, to be sure, but as RAM continues to grow this might start to
get annoying.  On a 4GB system you might have shared_buffers set to
25% of memory, but on a 64GB system it'll be a smaller percentage, and
as memory capacities continue to climb it'll be smaller still.
Unfortunately I don't have the hardware to investigate this, but it's
worth thinking about, especially if we're thinking of doing things
that add more caching.
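
(Aside: for anyone who does have such a machine, contrib's pg_buffercache is
handy for watching what actually occupies shared_buffers while experimenting;
roughly along these lines, assuming 9.1's CREATE EXTENSION:)

CREATE EXTENSION pg_buffercache;

SELECT c.relname, count(*) AS buffers
  FROM pg_buffercache b
  JOIN pg_class c
    ON b.relfilenode = pg_relation_filenode(c.oid)
   AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                             WHERE datname = current_database()))
 GROUP BY c.relname
 ORDER BY buffers DESC
 LIMIT 10;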

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Kevin Grittner
 On 18.03.2011 16:52, Kevin Grittner wrote:
 Simon Riggs si...@2ndquadrant.com  wrote:
 
 In PostgreSQL other users cannot observe the commit until an
 acknowledgement has been received.
 
 Really?  I hadn't picked up on that.  That makes for a lot of
 complication on crash-and-recovery of a master, but if we can
 pull it off, that's really cool.
 
Markus Wanner mar...@bluegap.ch wrote:
 
 What complication do you have in mind here?
 
Basically, what Heikki addresses.  It has to be committed after
crash and recovery, and deal with replicas which may or may not have
been notified and may or may not have applied the transaction.
 
Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
 
 To be clear: other users cannot observe the commit until standby 
 acknowledges it - unless the master crashes while waiting for the 
 acknowledgment. If that happens, the commit will be visible to
 everyone after recovery.
 
Right.  If other transactions cannot see the transaction before the
COMMIT returns, I was kinda assuming that this was the behavior,
because otherwise one or more replicas could be ahead of the master
after recovery, which would be horribly broken.  I agree that the
behavior which you describe is much better than allowing other
transactions to see the work of the pending COMMIT.
 
In fact, on further reflection, allowing other transactions to see
work before the committing transaction returns could lead to broken
behavior if that viewing transaction took some action based on what
it saw, the master crashed, recovery was done using a standby, and
that standby hadn't persisted the transaction.  So this behavior is
necessary for correct operation.  Even though that perfect storm of
events might be fairly rare, the difference in the level of
confidence in correctness is significant, and certainly something to
brag about.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 12:19 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, 2011-03-18 at 17:47 +0200, Heikki Linnakangas wrote:
 On 18.03.2011 16:52, Kevin Grittner wrote:
  Simon Riggs si...@2ndquadrant.com  wrote:
 
  In PostgreSQL other users cannot observe the commit until an
  acknowledgement has been received.
 
  Really?  I hadn't picked up on that.  That makes for a lot of
  complication on crash-and-recovery of a master, but if we can pull
  it off, that's really cool.  If we do that and MySQL doesn't, we
  definitely don't want to use the same terminology they do, which
  would imply the same behavior.

 To be clear: other users cannot observe the commit until standby
 acknowledges it - unless the master crashes while waiting for the
 acknowledgment. If that happens, the commit will be visible to everyone
 after recovery.

 No, only in the case where you choose not to failover to the standby
 when you crash, which would be a fairly strange choice after the effort
 to set up the standby. In a correctly configured and operated cluster
 what I say above is fully correct and needs no addendum.

Except it doesn't work that way.  If, say, a backend on the master
core dumps, the system will perform a crash and restart cycle, and the
transaction will become visible whether it's yet been replicated or
not.  Since we now have a GUC to suppress restart after a backend
crash, it's theoretically possible to set up the system so that this
doesn't occur, but it'd take quite a bit of work to make it robust and
automatic, and it's certainly not the default out of the box.

The fundamental problem here is that once you update CLOG and flush
the corresponding WAL record, there is no going backward.  You can
hold the system in some intermediate state where the transaction still
holds locks and is excluded from MVCC snapshots, but there's no way to
back up.  So there are bound to be corner cases where the
wait doesn't last as long as you want, and stuff leaks out around the
edges.  It's fundamentally impossible to guarantee that you'll remain
in that intermediate state forever - what do you do if a meteor hits
the synchronous standby and at the same time you lose power to the
master?  No amount of configuration will save you from coming back on
line with a visible-but-unreplicated transaction.  I'm not knocking
the system; I think what we have is impressively good.  But pretending
that corner cases can't happen gets us nowhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 11:58 AM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 On 18.03.2011 17:38, Jeff Davis wrote:

 On Fri, 2011-03-18 at 10:27 -0400, Robert Haas wrote:

 ERRCODE_(WARNING_?)REPLICATION_WAIT_CANCELLED

 ...which might have something to recommend it.

 Works for me.

 Yes, sounds reasonable. Without WARNING_, please.

The reason I included WARNING is because warnings have their own
section in errcodes.txt, and each errcode is marked E for error or W
for warning.  Since we CAN'T actually error out here, I thought it
might be more appropriate to make this a warning; and all of the
existing such codes contain WARNING.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: pgindent (was Re: [HACKERS] Header comments in the recently added files)

2011-03-18 Thread Robert Haas
On Thu, Mar 10, 2011 at 11:25 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Mar 10, 2011 at 10:59 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Speaking of running scripts, I think we should run pgindent now.

 Yeah, +1 for doing it as soon as Tom is at a good stopping point.  It
 makes things a lot simpler later on.

 IIRC the argument for an early pgindent run was to standardize the new
 code for easier review.  I expect to be spending a whole lot of time
 reading collate and SSI code over the next few weeks, so I'm in favor
 of pgindent'ing that stuff first.  But I guess we need the typedef
 list update before anything can happen.

 That's one good reason.  Another is that this is presumably the time
 of the cycle when there are the fewest outstanding patches, making it
 a good time for changes that are likely to conflict with lots of other
 things.

 At any rate, it sounds like Andrew needs a few days to get the typedef
 list together, so let's wait for that to happen and then we'll see
 where we are.

Andrew, any update on this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 Simon Riggs si...@2ndquadrant.com wrote:
 
 No, only in the case where you choose not to failover to the
 standby when you crash, which would be a fairly strange choice
 after the effort to set up the standby. In a correctly configured
 and operated cluster what I say above is fully correct and needs
 no addendum.
 
 what do you do if a meteor hits the synchronous standby and at the
 same time you lose power to the master?  No amount of
 configuration will save you from coming back on line with a
 visible-but-unreplicated transaction. 
 
You don't even need to postulate an extreme condition like that; we
prefer to have a DBA pull the trigger on a failover, rather than
trust the STONITH call to software.  This is particularly true when
the master is local to its primary users and the replica is remote
to them.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Alvaro Herrera
Excerpts from rsmogura's message of Fri Mar 18 11:57:48 -0300 2011:

  Actually, the idea of this patch was like this:
  Some operations require many buffers, and PG uses the clock sweep to get 
  the next free buffer, so it may overwrite an index buffer. From the point of 
  view of good database design we should use indices, so purging an index out 
  of the cache will affect performance.

The BufferAccessStrategy stuff was written to solve this problem.

  As a side effect I saw that this 2nd level keeps the pg_* indices in 
  memory too, so I am thinking of including a 3rd-level cache for some pg_* tables.

Keep in mind that there's already another layer of caching (see
syscache.c) for system catalogs on top of the buffer cache.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-18 Thread Simon Riggs
On Thu, 2011-03-17 at 09:33 -0400, Robert Haas wrote:

 Thanks for the review!

Lets have a look here...

You've added a test inside the lock to see if there is a standby, which
I took out for performance reasons. Maybe there's another way, I know
that code is fiddly.

You've also added back in the lock acquisition at wakeup with very
little justification, which was a major performance hit.

Together that's about a 20% hit in performance in Yeb's tests. I think
you should spend a little time thinking how to retune that.


I see handling added for ProcDiePending and QueryCancelPending directly
into syncrep.c without any comments in postgres.c to indicate that you
bypass ProcessInterrupts() in some cases. That looks pretty hokey to me.

SyncRepUpdateSyncStandbysDefined() is added into walwriter, which means
waiters won't be released if we do a sighup during a fast shutdown,
since the walwriter gets killed as soon as that starts. I'm thinking
bgwriter should handle that now.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_ctl restart - behaviour based on wrong instance

2011-03-18 Thread Erik Rijkers
I am not sure the following pg_ctl behaviour is really a bug, but I find it 
unexpected enough to
report.

I was testing synchronous replication in a test setup on a single machine.  
(After all, one could
have different instances on different arrays, right?  If you think this is an 
unlikely use-case,
perhaps the following is not important.)

There are two installations of 9.1devel (git as of today):
  primary: /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1
  standby: /var/data1/pg_stuff/pg_installations/pgsql.vanilla_2

The standby's data_directory is generated by pg_basebackup from vanilla_1.

The problem is the very first run of  pg_ctl restart:

pg_ctl first correctly decides that the standby instance (=vanilla_2) isn't yet 
running:

pg_ctl: PID file 
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data/postmaster.pid does
not exist

This is OK and expected.  But then it continues (in the logfile) with:

FATAL:  lock file postmaster.pid already exists
HINT:  Is another postmaster (PID 20519) running in data directory
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data?

So, complaints about the *other* instance.  It doesn't happen once a successful 
start (with pg_ctl
start) has happened.

It starts fine when started right away with 'start' instead of  'restart'.

Also, if it has been started once, it will react to 'pg_ctl restart' without 
the errors.

I'll attach a shell script that provokes the error; see the 'restart' on the 
line with the
comment 'HERE'.

It would seem (see below) that pg_ctl's final conclusion about the standby (that 
it has started up)
is wrong; the standby does *not* eventually start.


Below the output of the attached shell script.  (careful - it deletes stuff)
(It still contains some debug lines, but I didn't want to change it too much.)


$ clear; ./split_vanilla.sh

PGPASSFILE=/home/rijkers/.pg_rijkers
waiting for server to shut down done
server stopped
waiting for server to shut down done
server stopped
waiting for server to start done
server started
removed `/var/data1/pg_stuff/archive_dir/00010018'
removed `/var/data1/pg_stuff/archive_dir/00010019'
removed 
`/var/data1/pg_stuff/archive_dir/00010019.0020.backup'
removed `/var/data1/pg_stuff/archive_dir/0001001A'
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/pg_basebackup
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived

BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin
PGPORT=6564
PGPASSFILE=/home/rijkers/.pg_rijkers
PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/pg_ctl
waiting for server to shut down done
server stopped
waiting for server to start done
server started
UIDPID  PPID  C STIME TTY  STAT   TIME CMD
rijkers  20519 1 20 17:19 pts/25   S+ 0:00
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/postgres -D
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data
rijkers  20521 20519  0 17:19 ?Ss 0:00  \_ postgres: writer process
rijkers  20522 20519  0 17:19 ?Ss 0:00  \_ postgres: wal writer 
process
rijkers  20523 20519  0 17:19 ?Ss 0:00  \_ postgres: autovacuum 
launcher process
rijkers  20524 20519  0 17:19 ?Ss 0:00  \_ postgres: archiver 
process
rijkers  20525 20519  0 17:19 ?Ss 0:00  \_ postgres: stats 
collector process

BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/bin
PGPORT=6664
PGPASSFILE=/home/rijkers/.pg_rijkers
PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/bin/pg_ctl
pg_ctl: PID file 
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data/postmaster.pid does
not exist
Is server running?
starting server anyway
waiting for server to 
start... done
server started

-- logfile 1:
LOG:  database system is shut down
LOG:  database system was shut down at 2011-03-18 17:19:54 CET
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections

-- logfile 2:
LOG:  shutting down
LOG:  database system is shut down
FATAL:  lock file postmaster.pid already exists
HINT:  Is another postmaster (PID 20519) running in data directory
/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data?




thanks,

Erik Rijkers


split_vanilla.sh
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 1:15 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Thu, 2011-03-17 at 09:33 -0400, Robert Haas wrote:
 Thanks for the review!

 Lets have a look here...

 You've added a test inside the lock to see if there is a standby, which
 I took out for performance reasons. Maybe there's another way, I know
 that code is fiddly.

 You've also added back in the lock acquisition at wakeup with very
 little justification, which was a major performance hit.

 Together that's about a 20% hit in performance in Yeb's tests. I think
 you should spend a little time thinking how to retune that.

Ouch.  Do you have a link that describes his testing methodology?  I
will look at it.

 I see handling added for ProcDiePending and QueryCancelPending directly
 into syncrep.c without any comments in postgres.c to indicate that you
 bypass ProcessInterrupts() in some cases. That looks pretty hokey to me.

I can add some comments.  Unfortunately, it's not feasible to call
ProcessInterrupts() directly from this point in the code - it causes a
database panic.

 SyncRepUpdateSyncStandbysDefined() is added into walwriter, which means
 waiters won't be released if we do a sighup during a fast shutdown,
 since the walwriter gets killed as soon as that starts. I'm thinking
 bgwriter should handle that now.

Hmm.  I was thinking that doing it in WAL writer would make it more
responsive, but since this is a fairly unlikely scenario, it's
probably not worth complicating the shutdown sequence to do it the way
I did.  I'll move it to bgwriter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Greg Stark
On Fri, Mar 18, 2011 at 4:33 PM, Robert Haas robertmh...@gmail.com wrote:
 The fundamental problem here is that once you update CLOG and flush
 the corresponding WAL record, there is no going backward.  You can
 hold the system in some intermediate state where the transaction still
 holds locks and is excluded from MVCC snapshots, but there's no way to
 back up.  So there are bound to be corner cases where the
 wait doesn't last as long as you want, and stuff leaks out around the
 edges.


I'm finding this whole idea of hiding the committed transaction until
the slave acks it kind of strange. It means there are times when the
slave is actually *ahead* of the master which would actually be kind
of hard to code against if you're trying to use the slave as a
possibly-not-up-to-date mirror.

I think promising that the COMMIT doesn't return until the transaction
and all previous transactions are replicated is enough. We don't have
to promise that nobody else will see it either. Those same
transactions eventually have to commit as well and if they want that
level of protection they can block waiting until they're replicated as
well which will imply that anything they depended on will be
replicated.

This is akin to the synchronous_commit=off case where other
transactions can see your data as soon as you commit even before the
xlog is fsynced. If you have synchronous_commit mode enabled then
you'll block until your xlog is fsynced and that will implicitly mean
the other transactions you saw were also fsynced.

-- 
greg

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI bug?

2011-03-18 Thread Kevin Grittner
YAMAMOTO Takashi y...@mwd.biglobe.ne.jp wrote:
 
 thanks for quickly fixing problems.
 
Thanks for the rigorous testing.  :-)
 
 i tested the later version
 (a2eb9e0c08ee73208b5419f5a53a6eba55809b92) and only errors i got
 was out of shared memory.  i'm not sure if it was caused by SSI
 activities or not.
 
 PG_DIAG_SEVERITY: WARNING
 PG_DIAG_SQLSTATE: 53200
 PG_DIAG_MESSAGE_PRIMARY: out of shared memory
 PG_DIAG_SOURCE_FILE: shmem.c
 PG_DIAG_SOURCE_LINE: 190
 PG_DIAG_SOURCE_FUNCTION: ShmemAlloc
 
 PG_DIAG_SEVERITY: ERROR
 PG_DIAG_SQLSTATE: 53200
 PG_DIAG_MESSAGE_PRIMARY: out of shared memory
 PG_DIAG_SOURCE_FILE: dynahash.c
 PG_DIAG_SOURCE_LINE: 925
 PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value
 
Nor am I.  Some additional information would help.
 
(1)  Could you post the non-default configuration settings?
 
(2)  How many connections are in use in your testing?
 
(3)  Can you give a rough categorization of how many of what types
of transactions are in the mix?
 
(4)  Are there any long-running transactions?
 
(5)  How many of these errors do you get in what amount of time?
 
(6)  Does the application continue to run relatively sanely, or does
it fall over at this point?
 
(7)  The message hint would help pin it down, or a stack trace at
the point of the error would help more.  Is it possible to get
either?  Looking over the code, it appears that the only places that
SSI could generate that error, it would cancel that transaction with
the hint You might need to increase
max_pred_locks_per_transaction. and otherwise allow normal
processing.
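 
For reference, that's the knob in question; a quick way to check it, and a
sketch of raising it (it sizes shared memory, so a change needs a restart):
 
SHOW max_pred_locks_per_transaction;   -- default is 64
-- in postgresql.conf:
--   max_pred_locks_per_transaction = 128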
 
Even with the above information it may be far from clear where
allocations are going past their maximum, since one HTAB could grab
more than its share and starve another which is staying below its
maximum.  I'll take a look at the possibility of adding a warning
or some such when an HTAB expands past its maximum size.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Japanese developers?

2011-03-18 Thread Josh Berkus
All,

I've heard from JPUG and all directors are OK.  Not sure about all
members, though.  All staff of SRA are also OK.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI bug?

2011-03-18 Thread Dan Ports

It would probably also be worth monitoring the size of pg_locks to see
how many predicate locks are being held.
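
For example (SIReadLock is the lock mode pg_locks reports for SSI predicate
locks, so this gives a rough count):

SELECT count(*) FROM pg_locks WHERE mode = 'SIReadLock';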

On Fri, Mar 18, 2011 at 12:50:16PM -0500, Kevin Grittner wrote:
 Even with the above information it may be far from clear where
 allocations are going past their maximum, since one HTAB could grab
 more than its share and starve another which is staying below its
 maximum.  I'll take a look at the possibility of adding a warning
 or some such when an HTAB expands past its maximum size.

Yes -- considering how few shared memory HTABs have sizes that are
really dynamic, I'd be inclined to take a close look at SSI and
max_pred_locks_per_transaction regardless of where the failed
allocation took place. But I am surprised to see that error message
without SSI's hint about increasing max_pred_locks_per_transaction.

Dan

-- 
Dan R. K. Ports  MIT CSAILhttp://drkp.net/

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Jim Nasby
On Mar 18, 2011, at 11:19 AM, Robert Haas wrote:
 On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner
 kevin.gritt...@wicourts.gov wrote:
 A related area that could use some looking at is why performance tops
 out at shared_buffers ~8GB and starts to fall thereafter.  InnoDB can
 apparently handle much larger buffer pools without a performance
 drop-off.  There are some advantages to our reliance on the OS buffer
 cache, to be sure, but as RAM continues to grow this might start to
 get annoying.  On a 4GB system you might have shared_buffers set to
 25% of memory, but on a 64GB system it'll be a smaller percentage, and
 as memory capacities continue to climb it'll be smaller still.
 Unfortunately I don't have the hardware to investigate this, but it's
 worth thinking about, especially if we're thinking of doing things
 that add more caching.

+1

To take the opposite approach... has anyone looked at having the OS just manage 
all caching for us? Something like MMAPed shared buffers? Even if we find the 
issue with large shared buffers, we still can't dedicate serious amounts of 
memory to them because of work_mem issues. Granted, that's something else on 
the TODO list, but it really seems like we're re-inventing the wheels that the 
OS has already created here...
--
Jim C. Nasby, Database Architect   j...@nasby.net
512.569.9461 (cell) http://jim.nasby.net



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-18 Thread Alvaro Herrera
Excerpts from Robert Haas's message of Fri Mar 18 14:25:16 -0300 2011:
 On Fri, Mar 18, 2011 at 1:15 PM, Simon Riggs si...@2ndquadrant.com wrote:

  SyncRepUpdateSyncStandbysDefined() is added into walwriter, which means
  waiters won't be released if we do a sighup during a fast shutdown,
  since the walwriter gets killed as soon as that starts. I'm thinking
  bgwriter should handle that now.
 
 Hmm.  I was thinking that doing it in WAL writer would make it more
 responsive, but since this is a fairly unlikely scenario, it's
 probably not worth complicating the shutdown sequence to do it the way
 I did.  I'll move it to bgwriter.

Can't they both do it?

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Markus Wanner
On 03/18/2011 06:35 PM, Greg Stark wrote:
 I think promising that the COMMIT doesn't return until the transaction
 and all previous transactions are replicated is enough. We don't have
 to promise that nobody else will see it either. Those same
 transactions eventually have to commit as well

No, they don't have to.  They can ROLLBACK, get aborted, lose connection
to the master, etc.  The issue here is that, given the MySQL scheme,
these transactions see a snapshot that's not durable, because at that
point in time, no standby guarantees to have stored the transaction to
be committed, yet.  So in case of a failover, you'd suddenly see a
different snapshot (and lose changes of that transaction).

 This is akin to the synchronous_commit=off case where other
 transactions can see your data as soon as you commit even before the
 xlog is fsynced. If you have synchronous_commit mode enabled then
 you'll block until your xlog is fsynced and that will implicitly mean
 the other transactions you saw were also fsynced.

Somewhat, yes.  And for exactly that reason, most users run with
synchronous_commit enabled.  They don't want to lose committed transactions.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Markus Wanner
Simon,

On 03/18/2011 05:19 PM, Simon Riggs wrote:
 Simon Riggs si...@2ndquadrant.com  wrote:
 In PostgreSQL other users cannot observe the commit until an
 acknowledgement has been received.

On other nodes as well?  To me that means the standby needs to hold back
COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
that it committed the transaction there.  How else could the slave know
when to commit its ACKed transactions?

 No, only in the case where you choose not to failover to the standby
 when you crash, which would be a fairly strange choice after the effort
 to set up the standby. In a correctly configured and operated cluster
 what I say above is fully correct and needs no addendum.

If you don't failover, how can the standby be ahead of the master, given
it takes measures not to be during normal operation?

Eager to understand... ;-)

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Markus Wanner
On 03/18/2011 05:27 PM, Kevin Grittner wrote:
 Basically, what Heikki addresses.  It has to be committed after
 crash and recovery, and deal with replicas which may or may not have
 been notified and may or may not have applied the transaction.

Huh?  I'm not quite following here.  Committing additional transactions
isn't a problem, reverting committed transactions is.

And yes, given that we only wait for ACK from a single standby, you'd
have to failover to exactly *that* standby to guarantee consistency.

 In fact, on further reflection, allowing other transactions to see
 work before the committing transaction returns could lead to broken
 behavior if that viewing transaction took some action based on the
 that, the master crashed, recovery was done using a standby, and
 that standby hadn't persisted the transaction.  So this behavior is
 necessary for good behavior.

I fully agree to that.

Regards

Markus

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Simon Riggs
On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:
 Simon,
 
 On 03/18/2011 05:19 PM, Simon Riggs wrote:
  Simon Riggs si...@2ndquadrant.com  wrote:
  In PostgreSQL other users cannot observe the commit until an
  acknowledgement has been received.
 
 On other nodes as well?  To me that means the standby needs to hold back
 COMMIT of an ACKed transaction, until it receives a re-ACK from the master,
 that it committed the transaction there.  How else could the slave know
 when to commit its ACKed transactions?

We could do that easily enough, actually, if we wished.

Do we wish?

  No, only in the case where you choose not to failover to the standby
  when you crash, which would be a fairly strange choice after the effort
  to set up the standby. In a correctly configured and operated cluster
  what I say above is fully correct and needs no addendum.
 
 If you don't failover, how can the standby be ahead of the master, given
 it takes measures not to be during normal operation?
 
 Eager to understand... ;-)
 
 Regards
 
 Markus

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Kevin Grittner
Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:
 
  Simon Riggs si...@2ndquadrant.com  wrote:
  In PostgreSQL other users cannot observe the commit until an
  acknowledgement has been received.
 
 On other nodes as well?  To me that means the standby needs to
 hold back COMMIT of an ACKed transaction, until it receives a re-ACK
 from the master, that it committed the transaction there.  How
 else could the slave know when to commit its ACKed transactions?
 
 We could do that easily enough, actually, if we wished.
 
 Do we wish?
 
+1
 
If we're going out of our way to suppress it on the master until the
COMMIT returns, it shouldn't be showing on the replicas before that.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Markus Wanner
On 03/18/2011 08:29 PM, Simon Riggs wrote:
 We could do that easily enough, actually, if we wished.
 
 Do we wish?

I personally don't see any problem letting a standby show a snapshot
before the master.  I'd consider it unneeded network traffic.  But then
again, I'm completely biased.

Regards

Markus Wanner

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI bug?

2011-03-18 Thread Kevin Grittner
Dan Ports d...@csail.mit.edu wrote:
 
 I am surprised to see that error message without SSI's hint about
 increasing max_pred_locks_per_transaction.
 
After reviewing this, I think something along the following lines
might be needed, for a start.  I'm not sure the Asserts are actually
needed; they basically are checking that the current behavior of
hash_search doesn't change.
 
I'm still looking at whether it's sane to try to issue a warning
when an HTAB exceeds the number of entries declared as its max_size
when it was created.
 
-Kevin

--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1604,12 +1604,7 @@ RegisterPredicateLockingXid(const TransactionId xid)
     sxid = (SERIALIZABLEXID *) hash_search(SerializableXidHash,
                                            &sxidtag,
                                            HASH_ENTER, &found);
-    if (!sxid)
-        /* This should not be possible, based on allocation. */
-        ereport(ERROR,
-                (errcode(ERRCODE_OUT_OF_MEMORY),
-                 errmsg("out of shared memory")));
-
+    Assert(sxid != NULL);
     Assert(!found);
 
     /* Initialize the structure. */
@@ -2046,7 +2041,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
     target = (PREDICATELOCKTARGET *)
         hash_search_with_hash_value(PredicateLockTargetHash,
                                     targettag, targettaghash,
-                                    HASH_ENTER, &found);
+                                    HASH_ENTER_NULL, &found);
     if (!target)
         ereport(ERROR,
                 (errcode(ERRCODE_OUT_OF_MEMORY),
@@ -2061,7 +2056,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
     lock = (PREDICATELOCK *)
         hash_search_with_hash_value(PredicateLockHash, &locktag,
             PredicateLockHashCodeFromTargetHashCode(&locktag, targettaghash),
-                                    HASH_ENTER, &found);
+                                    HASH_ENTER_NULL, &found);
     if (!lock)
         ereport(ERROR,
                 (errcode(ERRCODE_OUT_OF_MEMORY),
@@ -3252,7 +3247,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
         predlock = hash_search_with_hash_value(PredicateLockHash, &tag,
                     PredicateLockHashCodeFromTargetHashCode(&tag,
                                                             targettaghash),
-                                               HASH_ENTER, &found);
+                                               HASH_ENTER_NULL, &found);
         if (!predlock)
             ereport(ERROR,
                     (errcode(ERRCODE_OUT_OF_MEMORY),
@@ -4279,10 +4274,7 @@ predicatelock_twophase_recover(TransactionId xid, uint16 info,
     sxid = (SERIALIZABLEXID *) hash_search(SerializableXidHash,
                                            &sxidtag,
                                            HASH_ENTER, &found);
-    if (!sxid)
-        ereport(ERROR,
-                (errcode(ERRCODE_OUT_OF_MEMORY),
-                 errmsg("out of shared memory")));
+    Assert(sxid != NULL);
     Assert(!found);
     sxid->myXact = (SERIALIZABLEXACT *) sxact;
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] sync rep fsync=off

2011-03-18 Thread Robert Haas
While investigating Simon's complaint about my patch of a few days
ago, I discovered that synchronous replication appears to slow to a
crawl if fsync is turned off on the standby.

I'm not sure why this is happening or what the right behavior is in
this case, but I think some kind of adjustment is needed because the
current behavior is quite surprising.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 2:15 PM, Jim Nasby j...@nasby.net wrote:
 +1

 To take the opposite approach... has anyone looked at having the OS just 
 manage all caching for us? Something like MMAPed shared buffers? Even if we 
 find the issue with large shared buffers, we still can't dedicate serious 
 amounts of memory to them because of work_mem issues. Granted, that's 
 something else on the TODO list, but it really seems like we're re-inventing 
 the wheels that the OS has already created here...

The problem is that the OS doesn't offer any mechanism that would
allow us to obey the WAL-before-data rule.
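
To make the rule concrete, here is a minimal standalone sketch of
WAL-before-data in plain POSIX C (illustrative only, not PostgreSQL's
xlog/bufmgr code; the write_change() helper and file names are invented):
the WAL record describing a change must be durably on disk before the
modified data page may be overwritten in place.  With mmap'ed shared
buffers the kernel could flush the data page at any time, which is what
makes the rule hard to enforce that way.

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write one change: WAL record first, fsync it, only then the data page. */
static int
write_change(int wal_fd, int data_fd,
			 const char *wal_rec, size_t wal_len,
			 const char *page, size_t page_len, off_t page_off)
{
	/* 1. append the WAL record and make it durable */
	if (write(wal_fd, wal_rec, wal_len) != (ssize_t) wal_len)
		return -1;
	if (fsync(wal_fd) != 0)
		return -1;

	/* 2. only now is it safe to write the data page in place */
	if (pwrite(data_fd, page, page_len, page_off) != (ssize_t) page_len)
		return -1;
	return 0;
}

int
main(void)
{
	int			wal_fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0600);
	int			data_fd = open("table.dat", O_RDWR | O_CREAT, 0600);
	const char	rec[] = "insert tuple 42";
	const char	page[] = "page image containing tuple 42";

	if (wal_fd < 0 || data_fd < 0)
		return 1;
	return write_change(wal_fd, data_fd, rec, sizeof(rec),
						page, sizeof(page), 0) == 0 ? 0 : 1;
}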

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 3:29 PM, Simon Riggs si...@2ndquadrant.com wrote:
 On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote:
 Simon,

 On 03/18/2011 05:19 PM, Simon Riggs wrote:
 Simon Riggs si...@2ndquadrant.com wrote:
  In PostgreSQL other users cannot observe the commit until an
  acknowledgement has been received.

 On other nodes as well?  To me that means the standby needs to hold back
 COMMIT of an ACKed transaction, until receives a re-ACK from the master,
 that it committed the transaction there.  How else could the slave know
 when to commit its ACKed transactions?

 We could do that easily enough, actually, if we wished.

 Do we wish?

Seems like it would be nice, but isn't it dreadfully expensive?
Wouldn't you need to prevent the slave from applying the WAL until the
master has released the sync rep waiters?  You'd need a whole new
series of messages back and forth.

Since the current solution is intended to support data-loss-free
failover, but NOT to guarantee a consistent view of the world from a
SQL level, I doubt it's worth paying any price for this.  Certainly in
the hot_standby=off case it's a nonissue.  We might need to think
harder about it when and if someone implements an 'apply' level though,
because this would seem more of a concern in that case (though I
haven't thought through all the details).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 Since the current solution is intended to support data-loss-free
 failover, but NOT to guarantee a consistent view of the world from
 a SQL level, I doubt it's worth paying any price for this.
 
Well, that brings us back to the question of why we would want to
suppress the view of the data on the master until the replica
acknowledges the commit.  It *is* committed on the master, we're
just holding off on telling the committer about it until we can
honor the guarantee of replication.  If it can be seen on the
replica before the committer gets such acknowledgment, why not on the
master?
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Simon Riggs
On Fri, 2011-03-18 at 17:08 -0400, Aidan Van Dyk wrote:
 On Fri, Mar 18, 2011 at 3:41 PM, Markus Wanner mar...@bluegap.ch wrote:
  On 03/18/2011 08:29 PM, Simon Riggs wrote:
  We could do that easily enough, actually, if we wished.
 
  Do we wish?
 
  I personally don't see any problem letting a standby show a snapshot
  before the master.  I'd consider it unneeded network traffic.  But then
  again, I'm completely biased.
 
 In fact, we *need* to have standbys show a snapshot before the master.
 
 By the time the master acks the commit to the client, the snapshot
 must be visible to all clients connected to both the master and the
 synchronous slave.
 
 Even with just a single-server PostgreSQL cluster, other
 clients (backends) can see the commit before the committing client
 receives the ACK.  It's just that on a single server, the time period
 for that is small.
 
 Sync rep increases that time period by the length of time from when
 the slave reaches the commit point in the WAL stream to when its ack
 of that point gets back to the WAL sender.  Ideally, that ACK time is
 small.
 
 Adding another round trip in there just for a "go almost to $COMMIT,
 ok, now go to $COMMIT" type of WAL/ack is going to be pessimal for
 performance, and still not improve the *guarantees* it can make.
 
 It can only slightly reduce, but not eliminate, the window where the
 master has WAL that the slave doesn't, and without a complete
 elimination (where you just switch the problem to the slave having
 data that the master doesn't), you haven't changed any of the
 guarantees sync rep can make (or not).

Well explained observation. Agreed.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Simon Riggs
On Fri, 2011-03-18 at 16:24 -0500, Kevin Grittner wrote:
 Robert Haas robertmh...@gmail.com wrote:
  
  Since the current solution is intended to support data-loss-free
  failover, but NOT to guarantee a consistent view of the world from
  a SQL level, I doubt it's worth paying any price for this.
  
 Well, that brings us back to the question of why we would want to
 suppress the view of the data on the master until the replica
 acknowledges the commit.  It *is* committed on the master, we're
 just holding off on telling the committer about it until we can
 honor the guarantee of replication.  If it can be seen on the
 replica before the committer gets such acknowledgment, why not on the
 master?

I think the issue is explicit acknowledgement, not visibility.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/books/
 PostgreSQL Development, 24x7 Support, Training and Services
 


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 2:55 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Robert Haas's message of vie mar 18 14:25:16 -0300 2011:
 On Fri, Mar 18, 2011 at 1:15 PM, Simon Riggs si...@2ndquadrant.com wrote:

  SyncRepUpdateSyncStandbysDefined() is added into walwriter, which means
  waiters won't be released if we do a sighup during a fast shutdown,
  since the walwriter gets killed as soon as that starts. I'm thinking
  bgwriter should handle that now.

 Hmm.  I was thinking that doing it in WAL writer would make it more
 responsive, but since this is a fairly unlikely scenario, it's
 probably not worth complicating the shutdown sequence to do it the way
 I did.  I'll move it to bgwriter.

 Can't they both do it?

Yeah, but it seems fairly pointless.  In retrospect, I probably should
have done it the way Simon is proposing to begin with.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 5:24 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:

 Since the current solution is intended to support data-loss-free
 failover, but NOT to guarantee a consistent view of the world from
 a SQL level, I doubt it's worth paying any price for this.

 Well, that brings us back to the question of why we would want to
 suppress the view of the data on the master until the replica
 acknowledges the commit.  It *is* committed on the master, we're
 just holding off on telling the committer about it until we can
 honor the guarantee of replication.  If it can be seen on the
 replica before the committer gets such acknowledgment, why not on the
 master?

Well, the idea is that we don't want to let people depend on the value
until it's guaranteed to be durably committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Kevin Grittner
Robert Haas robertmh...@gmail.com wrote:
 
 Well, the idea is that we don't want to let people depend on the
 value until it's guaranteed to be durably committed.
 
OK, so if you see it on the replica, you know it is in at least two
places.  I guess that makes sense.  It kinda feels wrong to see a
view of the replica which is ahead of the master, but I guess it's
the least of the evils.  I guess we should document it, though, so
nobody has a false expectation that seeing something on the replica
means that a connection looking at the master will see something
that current.
 
-Kevin

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] SSI bug?

2011-03-18 Thread Kevin Grittner
Kevin Grittner kevin.gritt...@wicourts.gov wrote:
 
 I'm still looking at whether it's sane to try to issue a warning
 when an HTAB exceeds the number of entries declared as its
 max_size when it was created.
 
I think this does it.
 
If nothing else, it might be instructive to use it while testing the
SSI patch.  Would it make any sense to slip this into 9.1, or should
I add it to the first 9.2 CF?
 
-Kevin

*** a/src/backend/storage/ipc/shmem.c
--- b/src/backend/storage/ipc/shmem.c
*************** ShmemInitHash(const char *name, /* table string name for shmem index */
*** 268,273 ****
--- 268,274 ----
  	 *
  	 * The shared memory allocator must be specified too.
  	 */
+ 	infoP->max_size = max_size;
  	infoP->dsize = infoP->max_dsize = hash_select_dirsize(max_size);
  	infoP->alloc = ShmemAlloc;
  	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
*** a/src/backend/utils/hash/dynahash.c
--- b/src/backend/utils/hash/dynahash.c
*************** struct HASHHDR
*** 129,134 ****
--- 129,135 ----
  	long		ffactor;		/* target fill factor */
  	long		max_dsize;		/* 'dsize' limit if directory is fixed size */
  	long		ssize;			/* segment size --- must be power of 2 */
+ 	long		max_size;		/* maximum number of entries expected */
  	int			sshift;			/* segment shift = log2(ssize) */
  	int			nelem_alloc;	/* number of entries to allocate at once */
  
*************** hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
*** 368,373 ****
--- 369,375 ----
  	hdefault(hashp);
  
  	hctl = hashp->hctl;
+ 	hctl->max_size = info->max_size;
  
  	if (flags & HASH_PARTITION)
  	{
*************** element_alloc(HTAB *hashp, int nelem)
*** 1333,1338 ****
--- 1335,1341 ----
  	HASHELEMENT *tmpElement;
  	HASHELEMENT *prevElement;
  	int			i;
+ 	bool		warningNeeded;
  
  	/* Each element has a HASHELEMENT header plus user data. */
  	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctlv->entrysize);
*************** element_alloc(HTAB *hashp, int nelem)
*** 1360,1369 ****
--- 1363,1378 ----
  	/* freelist could be nonempty if two backends did this concurrently */
  	firstElement->link = hctlv->freeList;
  	hctlv->freeList = prevElement;
+ 	warningNeeded = (hctlv->max_size > 0 && hctlv->nentries == hctlv->max_size);
  
  	if (IS_PARTITIONED(hctlv))
  		SpinLockRelease(&hctlv->mutex);
  
+ 	if (warningNeeded)
+ 		ereport(WARNING,
+ 				(errmsg("hash table \"%s\" has more entries than expected", hashp->tabname),
+ 				 errdetail("The maximum was set to %li on creation.", hctlv->max_size)));
+ 
  	return true;
  }
  
*** a/src/include/utils/hsearch.h
--- b/src/include/utils/hsearch.h
*************** typedef struct HASHCTL
*** 69,74 ****
--- 69,75 ----
  	long		dsize;			/* (initial) directory size */
  	long		max_dsize;		/* limit to dsize if dir size is limited */
  	long		ffactor;		/* fill factor */
+ 	long		max_size;		/* maximum number of entries expected */
  	Size		keysize;		/* hash key length in bytes */
  	Size		entrysize;		/* total user element size in bytes */
  	HashValueFunc hash;			/* hash function */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19

2011-03-18 Thread Robert Haas
Responding to this again, somewhat out of order...

On Fri, Mar 18, 2011 at 1:15 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Together that's about a 20% hit in performance in Yeb's tests. I think
 you should spend a little time thinking how to retune that.

I've spent some time playing around with pgbench and so far I haven't
been able to reliably reproduce this, which is not to say I don't
believe the effect is real, but rather that either I'm doing something
completely wrong, or it requires some specific setup to measure that
doesn't match my environment, or that it's somewhat finicky to
reproduce, or some combination of the above.

 You've added a test inside the lock to see if there is a standby, which
 I took out for performance reasons. Maybe there's another way, I know
 that code is fiddly.

It seems pretty easy to remove the branch from the test at the top of
the function by just rearranging things a bit.  Patch attached; does
this help?

 You've also added back in the lock acquisition at wakeup with very
 little justification, which was a major performance hit.

I have a very difficult time believing this is a real problem.  That
extra lock acquisition and release only happens if WaitLatchOrSocket()
returns but MyProc->syncRepState still appears to be SYNC_REP_WAITING.
That should only happen if the latch wait hits the timeout (which
takes 60 s!) or if the precise memory-ordering problem that code was
put in to fix is occurring (in which case it should dramatically
*improve* performance, by avoiding an extra 60 s wait).  I stuck in a
call to elog(LOG, "got here") and it didn't fire even once in a
5-minute pgbench test (~45k transactions).  So I have a hard time
crediting this for any performance problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


sync-standbys-defined-rearrangement.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 5:48 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 Robert Haas robertmh...@gmail.com wrote:
 Well, the idea is that we don't want to let people depend on the
 value until it's guaranteed to be durably committed.

 OK, so if you see it on the replica, you know it is in at least two
 places.  I guess that makes sense.  It kinda feels wrong to see a
 view of the replica which is ahead of the master, but I guess it's
 the least of the evils.  I guess we should document it, though, so
 nobody has a false expectation that seeing something on the replica
 means that a connection looking at the master will see something
 that current.

Yeah, it can go both ways: a snapshot taken on the standby can be
either earlier or later in the commit ordering than the master.
That's counterintuitive, but I see no reason to stress about it.  It's
perfectly reasonable to set up a server with synchronous replication
for enhanced durability and also enable hot standby just for
convenience, but without actually relying on it all that heavily, or
only for non-critical reporting purposes.  Synchronous replication,
like asynchronous replication, is basically a high-availability tool.
As long as it does that well, I'm not going to get worked up about the
fact that it doesn't address every other use case someone might want.
We can always add more frammishes in future releases.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] pg_last_xact_replay_timestamp meaning

2011-03-18 Thread Bruce Momjian
I just applied a doc patch for pg_last_xact_replay_timestamp, and the
text now says:

   entryGet timestamp of last transaction replayed during recovery.
This is the time at which the commit or abort WAL record for that
transaction was generated on the primary.
If no transactions have been replayed during recovery, this function
returns NULL.  Otherwise, if recovery is still in progress this will
increase monotonically.  If recovery has completed then this value will
remain static at the value of the last transaction applied during that
recovery.  When the server has been started normally without recovery
the function returns NULL.

Is this really the last commit/abort record or the last WAL record? 
What should it be?  Is the name of this function correct?  Do we care
only about commit/abort records? Why?

-- 
  Bruce Momjian  br...@momjian.ushttp://momjian.us
  EnterpriseDB http://enterprisedb.com

  + It's impossible for everything to be true. +

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Radosław Smogura
Kevin Grittner kevin.gritt...@wicourts.gov Thursday 17 March 2011 22:02:18
 Radosław Smogura rsmog...@softperience.eu wrote:
  I have implemented initial concept of 2nd level cache. Idea is to
  keep some segments of shared memory for special buffers (e.g.
  indices) to prevent overwrite those by other operations. I added
  those functionality to nbtree index scan.
  
  I tested this with doing index scan, seq read, drop system
  buffers, do index scan and in few places I saw performance
  improvements, but actually, I'm not sure if this was just random
  or intended improvement.
 
 I've often wondered about this.  In a database I developed back in
 the '80s it was clearly a win to have a special cache for index
 entries and other special pages closer to the database than the
 general cache.  A couple things have changed since the '80s (I mean,
 besides my waistline and hair color), and PostgreSQL has many
 differences from that other database, so I haven't been sure it
 would help as much, but I have wondered.
 
 I can't really look at this for a couple weeks, but I'm definitely
 interested.  I suggest that you add this to the next CommitFest as a
 WIP patch, under the Performance category.
 
 https://commitfest.postgresql.org/action/commitfest_view/open
 
  There is few places to optimize code as well, and patch need many
  work, but may you see it and give opinions?
 
 For something like this it makes perfect sense to show proof of
 concept before trying to cover everything.
 
 -Kevin

Here I attach the latest version of the patch with a few performance
improvements (the code is still dirty), some reports from the tests, as
well as my simple test scripts.

Actually there is a small improvement without dropping system caches, and
a bigger one with dropping.  Compared with the original PG version under
the same configuration I see a small performance decrease (if these tests
are precise enough to measure anything at all), but an increase with the
2nd level buffers enabled... or maybe I compared the reports badly.

In the tests I tried to choose typical, simple queries.

Regards,
Radek


2nd_lvl_cache_20110318.diff.bz2
Description: application/bzip


test-scritps_20110319_0026.tar.bz2
Description: application/bzip-compressed-tar


reports_20110318.tar.bz2
Description: application/bzip-compressed-tar

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 2nd Level Buffer Cache

2011-03-18 Thread Josh Berkus
On 3/18/11 11:15 AM, Jim Nasby wrote:
 To take the opposite approach... has anyone looked at having the OS just 
 manage all caching for us? Something like MMAPed shared buffers? Even if we 
 find the issue with large shared buffers, we still can't dedicate serious 
 amounts of memory to them because of work_mem issues. Granted, that's 
 something else on the TODO list, but it really seems like we're re-inventing 
 the wheels that the OS has already created here...

As far as I know, no OS has a more sophisticated approach to eviction
than LRU.  And clock-sweep is a significant improvement on performance
over LRU for frequently accessed database objects ... plus our
optimizations around not overwriting the whole cache for things like VACUUM.
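
For readers unfamiliar with it, a toy clock-sweep in C looks roughly like
this (a minimal sketch with invented names, not PostgreSQL's freelist.c):
every access bumps a per-buffer usage count, and the sweeping hand
decrements counts and only evicts a buffer it finds unpinned with a count
of zero, so frequently touched buffers survive several sweeps.

#include <stdio.h>

#define NBUFFERS 16
#define MAX_USAGE 5

typedef struct
{
	int			usage_count;	/* bumped on access, capped at MAX_USAGE */
	int			pinned;			/* nonzero while in use; never evicted */
} BufDesc;

static BufDesc buffers[NBUFFERS];
static int	clock_hand = 0;

/* Called whenever a buffer is accessed. */
static void
buffer_touch(int i)
{
	if (buffers[i].usage_count < MAX_USAGE)
		buffers[i].usage_count++;
}

/* Sweep until we find an unpinned buffer whose usage count has decayed to 0. */
static int
clock_sweep_victim(void)
{
	for (;;)
	{
		BufDesc    *b = &buffers[clock_hand];
		int			victim = clock_hand;

		clock_hand = (clock_hand + 1) % NBUFFERS;
		if (b->pinned)
			continue;			/* skip buffers somebody is using */
		if (b->usage_count > 0)
		{
			b->usage_count--;	/* recently used: give it another chance */
			continue;
		}
		return victim;			/* cold and unpinned: evict this one */
	}
}

int
main(void)
{
	buffer_touch(3);
	buffer_touch(3);
	buffer_touch(7);
	printf("first victim: buffer %d\n", clock_sweep_victim());
	return 0;
}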

2-level caches work well for a variety of applications.

Now, what would be *really* useful is some way to avoid all the data
copying we do between shared_buffers and the FS cache.

-- 
  -- Josh Berkus
 PostgreSQL Experts Inc.
 http://www.pgexperts.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,

2011-03-18 Thread Tom Lane
Fujii Masao masao.fu...@gmail.com writes:
 On Fri, Mar 18, 2011 at 1:17 AM, Robert Haas robertmh...@gmail.com wrote:
 Sorry, I've not been able to understand the point well yet. We should
 just use elog(ERROR) instead? But since ERROR in startup process
 is treated as FATAL, I'm not sure whether it's worth using ERROR
 instead. Or you meant another things?

 Yeah, I think he's saying that an ERROR in the startup process is
 better than a FATAL, even though the effect is the same.

 We've already been using FATAL all over the recovery code. We should
 s/FATAL/ERROR/g there (at least readRecoveryCommandFile)?

Possibly, but as you say, it doesn't make that much difference in the
startup process.  What is bothering me is the prospect of elog(FATAL)
in the postmaster.  Code associated with GUC validity checking is likely
to get executed in the postmaster, which is why it should not throw
anything stronger than the normal GUC complaint levels.  Even if the
patch as proposed is for code that could only be reached in the startup
process today, somebody might decide to rearrange it ...

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.

2011-03-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 As a side note, it's not very obvious why some parts of PostmasterMain
 report problems by doing write_stderr() and exit() while other parts
 use ereport(ERROR).  This check and the nearby checks on WAL level are
 immediately preceded and followed by other checks that use the
 opposite technique.

This question is answered in postmaster.c's header comment:

 * Error Reporting:
 *  Use write_stderr() only for reporting interactive errors
 *  (essentially, bogus arguments on the command line).  Once the
 *  postmaster is launched, use ereport().  In particular, don't use
 *  write_stderr() for anything that occurs after pmdaemonize.

Code that is involved in GUC variable processing is in a gray area, though,
since it can be invoked both before and after pmdaemonize.  It might be
a good idea to convert all the calls into ereports and maintain a state
flag in elog.c to determine what to do.
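
A minimal standalone sketch of that state-flag idea (invented names, not
actual postmaster/elog.c code) might look like this: one reporting routine
writes to stderr while the process is still attached to the launching
terminal, and switches to the server log once a flag says we have
daemonized.

#include <stdarg.h>
#include <stdbool.h>
#include <stdio.h>

static bool am_daemonized = false;	/* would be flipped by pmdaemonize() */

static void
report_postmaster_error(const char *fmt, ...)
{
	va_list		ap;

	va_start(ap, fmt);
	if (!am_daemonized)
	{
		/* interactive case: whoever launched us can still see stderr */
		vfprintf(stderr, fmt, ap);
		fputc('\n', stderr);
	}
	else
	{
		/* daemonized: stderr may point nowhere useful, so append to the
		 * server log instead (a stand-in for the ereport() path) */
		FILE	   *log = fopen("postmaster.log", "a");

		if (log != NULL)
		{
			vfprintf(log, fmt, ap);
			fputc('\n', log);
			fclose(log);
		}
	}
	va_end(ap);
}

int
main(void)
{
	report_postmaster_error("bogus argument on the command line: %s", "--frobnicate");
	am_daemonized = true;		/* pretend pmdaemonize() has run */
	report_postmaster_error("invalid value for parameter \"%s\"", "wal_level");
	return 0;
}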

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_ctl restart - behaviour based on wrong instance

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers e...@xs4all.nl wrote:
 This is OK and expected.  But then it continues (in the logfile) with:

 FATAL:  lock file postmaster.pid already exists
 HINT:  Is another postmaster (PID 20519) running in data directory
 /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data?

 So, complaints about the *other* instance.  It doesn't happen once a 
 successful start (with pg_ctl
 start) has happened.

I'm guessing that leftover postmaster.pid contents might be
responsible for this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_last_xact_replay_timestamp meaning

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 7:23 PM, Bruce Momjian br...@momjian.us wrote:
 I just applied a doc patch for pg_last_xact_replay_timestamp, and the
 text now says:

       entryGet timestamp of last transaction replayed during recovery.
        This is the time at which the commit or abort WAL record for that
        transaction was generated on the primary.
        If no transactions have been replayed during recovery, this function
        returns NULL.  Otherwise, if recovery is still in progress this will
        increase monotonically.  If recovery has completed then this value will
        remain static at the value of the last transaction applied during that
        recovery.  When the server has been started normally without recovery
        the function returns NULL.

 Is this really the last commit/abort record or the last WAL record?
 What should it be?  Is the name of this function correct?  Do we care
 only about commit/abort records? Why?

Commit and abort records have a timestamp.  Other WAL records don't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pg_last_xact_replay_timestamp meaning

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 9:35 PM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, Mar 18, 2011 at 7:23 PM, Bruce Momjian br...@momjian.us wrote:
 I just applied a doc patch for pg_last_xact_replay_timestamp, and the
 text now says:

       entryGet timestamp of last transaction replayed during recovery.
        This is the time at which the commit or abort WAL record for that
        transaction was generated on the primary.
        If no transactions have been replayed during recovery, this function
        returns NULL.  Otherwise, if recovery is still in progress this will
        increase monotonically.  If recovery has completed then this value 
 will
        remain static at the value of the last transaction applied during that
        recovery.  When the server has been started normally without recovery
        the function returns NULL.

 Is this really the last commit/abort record or the last WAL record?
 What should it be?  Is the name of this function correct?  Do we care
 only about commit/abort records? Why?

 Commit and abort records have a timestamp.  Other WAL records don't.

Incidentally, there's an open item related to this:

* pg_last_xact_replay_timestamp limitations
linking to 
http://archives.postgresql.org/message-id/201012071131.55211.gabi.jul...@broadsign.com

I'm not sure why this is important enough to be worth being on this
list, but... is this resolved now?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sync Rep v19

2011-03-18 Thread Robert Haas
On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao masao.fu...@gmail.com wrote:
 * Smart shutdown
 Smart shutdown should wait for all the waiting backends to be acked, and
 should not cause them to forcibly exit. But this leads shutdown to get stuck
 infinitely if there is no walsender at that time. To enable them to be acked
 even in that situation, we need to change postmaster so that it accepts the
 replication connection even during smart shutdown (until we reach
 PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser
 connection to cancel backup during smart shutdown. So I don't think that
 the idea to accept the replication connection during smart shutdown is so
 ugly.

 * Fast shutdown
 I agree with you about fast shutdown. Fast shutdown should cause all the
 backends including waiting ones to exit immediately. At that time, the
 non-acked backend should not return the success, according to the
 definition of sync rep. So we need to change a backend so that it gets rid
 of itself from the waiting queue and exits before returning the success,
 when it receives SIGTERM. This change leads the waiting backends to
 do the same even when pg_terminate_backend is called. But since
 they've not been acked yet, it seems to be reasonable to prevent them
 from returning the COMMIT.

 Comments? I'll create the patch barring objection.

The fast smart shutdown part of this problem has been addressed.  The
smart shutdown case still needs work, and I think the consensus was
that your proposal above was the best way to go with it.

Do you still want to work up a patch for this?  If so, I can review.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Sync Rep v19

2011-03-18 Thread Robert Haas
On Fri, Mar 18, 2011 at 10:25 PM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao masao.fu...@gmail.com wrote:
 * Smart shutdown
 Smart shutdown should wait for all the waiting backends to be acked, and
 should not cause them to forcibly exit. But this leads shutdown to get stuck
 infinitely if there is no walsender at that time. To enable them to be acked
 even in that situation, we need to change postmaster so that it accepts the
 replication connection even during smart shutdown (until we reach
 PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser
 connection to cancel backup during smart shutdown. So I don't think that
 the idea to accept the replication connection during smart shutdown is so
 ugly.

 * Fast shutdown
 I agree with you about fast shutdown. Fast shutdown should cause all the
 backends including waiting ones to exit immediately. At that time, the
 non-acked backend should not return the success, according to the
 definition of sync rep. So we need to change a backend so that it gets rid
 of itself from the waiting queue and exits before returning the success,
 when it receives SIGTERM. This change leads the waiting backends to
 do the same even when pg_terminate_backend is called. But since
 they've not been acked yet, it seems to be reasonable to prevent them
 from returning the COMMIT.

 Comments? I'll create the patch barring objection.

 The fast smart shutdown part of this problem has been addressed.  The

Ugh.  I mean the fast shutdown, of course, not the fast smart
shutdown.  Anyway, point is:

fast shutdown now OK
smart shutdown still not OK
do you want to write a patch?

:-)

 smart shutdown case still needs work, and I think the consensus was
 that your proposal above was the best way to go with it.

 Do you still want to work up a patch for this?  If so, I can review.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Collations versus record-returning functions

2011-03-18 Thread Tom Lane
I'm making pretty good progress on the task of splitting input and
output collations for expression nodes.  There remains one case in the
regression tests that is giving a non-expected result.  It involves this
function:

CREATE FUNCTION dup (f1 anyelement, f2 out anyelement, f3 out anyarray)
AS 'select $1, array[$1,$1]' LANGUAGE sql;

which is used like this:

SELECT a, (dup(b)).* FROM collate_test3 ORDER BY 2;

where the column collate_test3.b is declared with a nondefault
collation.  In HEAD, the output of dup() inherits that collation from
the b Var, and then so does the field selected from it, and so the ORDER
BY sorts according to the column collation.  With my patch as it stands,
you get sorting by the default collation.  The reason is that the output
of dup() is of type RECORD, which is a noncollatable type according to
pg_type, so the function result is labeled noncollatable even though it
has collatable inputs.  The FieldSelect node then has to choose the
default collation since there's nothing else available.

ISTM there are basically three things we might do about this:

(1) Decide that the patch's behavior is correct and what's embodied in
the regression expected file is wrong.

(2) Allow collations to propagate up through nodes that deliver
noncollatable outputs.

(3) Decide that composite types are collatable.

I don't much care for #2, mainly because it seems likely to cause a lot
of unnecessary collation conflicts.  It has a lot of flavor of action
at a distance in most cases, too, even though it might seem natural
in this particular example.

#3 has some appeal but it also seems fraught with definitional issues,
for instance:

* Something like
row('a' collate "C", 'b' collate "en_US")
doesn't have a resolvable collation, but I don't think we want it
to throw an error.

* If composites are collatable, one would logically expect a table's
rowtype to inherit the common collation of its columns (if there is
one).  This is problematic because then the rowtype collation can change
after ALTER TABLE ADD COLUMN/DROP COLUMN/ALTER COLUMN TYPE, which is
something we can't support even if it seemed like a good idea.

Also, I fail to see any support for #3 in the SQL standard.  So far as
I can find, the standard only considers that string types are collatable.
By the same token, it doesn't look to me like the standard is expecting
#2 to happen, since only string types can carry collation.

Thoughts?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [COMMITTERS] pgsql: Document the all-balls IPv6 address.

2011-03-18 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Fri, Mar 18, 2011 at 10:19 PM, Andrew Dunstan and...@dunslane.net wrote:
 On 03/18/2011 09:18 PM, Robert Haas wrote:
 all balls seems like a colloquialism best avoided in our documentation.

 It's already there, although I agree it's infelicitous.

 I vote for taking it out.  I think that could be interpreted as inappropriate.

IIRC, the pre-existing usage refers to time 00:00:00.  It does not seem
especially useful to adopt the same terminology for network addresses;
that's more likely to confuse people than anything else.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] VACUUM FULL deadlock with backend startup

2011-03-18 Thread Nikhil Sontakke
Hi,

We encountered a deadlock involving VACUUM FULL (surprise surprise!
:)) in PG 8.3.13 (and still not fixed in 9.0 AFAICS although the
window appears much smaller). The call spits out the following
deadlock info:

ERROR: SQLSTATE 40P01: deadlock detected
DETAIL:   Process 12479 waits for AccessExclusiveLock on relation 2663
of database 16384; blocked by process 14827.
Process 14827 waits for AccessShareLock on relation 1259 of database
16384; blocked by process 12479.
LOCATION: DeadLockReport, deadlock.c:918

It looked familiar, so I dug up the archives and found that Tom had
committed a fix for a similar deadlock via git commitid: 715120e7

However this current deadlock involved an index with oid 2663, which
is ClassNameNspIndexId. Clearly this was another case of locking the
index directly without taking a lock on the parent catalog. Further
sleuthing revealed that the culprit function was InitCatCachePhase2,
which directly calls index_open in the process startup phase.

Reproducing this was easy once you know the culprit (excruciatingly
difficult if you do not know the exact race window). I added a sleep
inside the InitCatCachePhase2 function before the call to index_open.
Then I invoked a VACUUM FULL pg_class from another session, halting it
in gdb just before it took the exclusive lock via try_relation_open.
When a new PG process sleeps inside InitCatCachePhase2, we then let the
VF process take the lock, stopping it just after that point.

When the startup continues after the sleep, it will take the
ClassNameNspIndexId share lock, but then hang trying to take a share
lock on pg_class in RelationReloadIndexInfo. Simply continue the VF
process in gdb, which will try to take the exclusive lock to vacuum the
index. This will reproduce the deadlock in all its glory.

The fix is similar to the earlier commit by Tom. I tested this fix
against 8.3.13. We lock the parent catalog now before calling
index_open. Patch against git HEAD attached with this mail. I guess we
will backpatch this? Tom's last commit was backpatched till 8.2 I
think.

Regards,
Nikhils
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index d0e364e..c9386aa 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -26,6 +26,7 @@
 #ifdef CATCACHE_STATS
 #include "storage/ipc.h"		/* for on_proc_exit */
 #endif
+#include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/inval.h"
@@ -967,8 +968,16 @@ InitCatCachePhase2(CatCache *cache, bool touch_index)
 	{
 		Relation	idesc;
 
+		/*
+		 * We must lock the underlying catalog before locking the index to
+		 * avoid deadlock, since RelationReloadIndexInfo might well need to
+		 * read the catalog, and if anyone else is exclusive-locking this
+		 * catalog and index they'll be doing it in that order.
+		 */
+		LockRelationOid(cache->cc_reloid, AccessShareLock);
 		idesc = index_open(cache->cc_indexoid, AccessShareLock);
 		index_close(idesc, AccessShareLock);
+		UnlockRelationOid(cache->cc_reloid, AccessShareLock);
 	}
 }
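
The principle behind the fix can be shown with a tiny standalone program
(generic pthreads code, not PostgreSQL's lock manager; all names here are
invented): whenever two code paths need the same pair of locks, they must
acquire them in the same order, here always the "catalog" before its
"index", so neither path can end up waiting on the other.  Build with
-pthread.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t catalog_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t index_lock = PTHREAD_MUTEX_INITIALIZER;

/* analogous to backend startup touching the index, with the patch applied */
static void *
backend_startup(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&catalog_lock);	/* parent catalog first */
	pthread_mutex_lock(&index_lock);	/* then its index */
	puts("startup: opened and closed the index");
	pthread_mutex_unlock(&index_lock);
	pthread_mutex_unlock(&catalog_lock);
	return NULL;
}

/* analogous to VACUUM FULL, which also takes the catalog before the index */
static void *
vacuum_full(void *arg)
{
	(void) arg;
	pthread_mutex_lock(&catalog_lock);
	pthread_mutex_lock(&index_lock);
	puts("vacuum: rebuilt the index");
	pthread_mutex_unlock(&index_lock);
	pthread_mutex_unlock(&catalog_lock);
	return NULL;
}

int
main(void)
{
	pthread_t	a,
				b;

	pthread_create(&a, NULL, backend_startup, NULL);
	pthread_create(&b, NULL, vacuum_full, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}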
 

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] I am confused after reading codes of PostgreSQL three week

2011-03-18 Thread Vaibhav Kaushal
Hello hom,

Frankly I am a learner as well. The experts here are almost always ready
to help and would be a better source of information. 

Moreover, I am also using Eclipse, but I do not use it for building the
source.  I use it only as a source code browser (it's easy in a GUI,
isn't it?).  I am trying to learn about the executor, so I can't say much
about the parser.  However, I suppose you need to know the rules of the
tools flex and bison to understand the parser.  And why are you looking
into scan.c?  It is generated by flex, dear.  Read scan.l and gram.y
instead; those are the files responsible for the major work done by the
parser.

If you are keen on the parser, go learn lex and yacc (or flex and bison
... they are almost the same) and then go through the scan.l and gram.y
files.  It is actually an _extremely_ tough job to read the generated
files.  Once again, do turn off the "Search for duplicate source files"
option.  There are no duplicate files in the source tree.

Also, if you are using a copy of the source tree that was built once in
the workspace, things can be a little different.

@others: Well, I do know that there are a few books on the market written
by the devs, but how much do they help when I have already been banging
my head against the source for the last 5 months?


Regards,
Vaibhav



On Fri, 2011-03-18 at 22:44 +0800, hom wrote:
 2011/3/18 Vaibhav Kaushal vaibhavkaushal...@gmail.com:
  Hi,
  That was the question I was facing 5 months ago and trust me I am doing it
  even now. With an average of 6+ hours going into PostgreSQL Code, even with
  best practices (as suggested by the developers) I still think I know less
  than 10 percent. It is too huge to be swallowed at once.
  I too had to break it down into pieces and because everything is so
  interconnected with everything else, it is quite complicated in the
  beginning. Start with one piece; planner, parser, executor, storage
  management whatever and slowly it should help you get the bigger picture.
  regards,
  Vaibhav
  I had to break it into
 
 Thanks Vaibhav.
 I have stepped into the parser before, but I ran into a problem:
 
 when I debug and step into scanner_init(), Eclipse always opens scan.l
 and the execution order does not match the file.
 I think it should actually be scan.c, but I don't know how to trace
 into scan.c :(
 PS: I have turned the "Search for duplicate source files" option on.
 
 I have posted to the mailing list, but it has not been solved.
 
 here is the link:
 http://postgresql.1045698.n5.nabble.com/Open-unmatch-source-file-when-step-into-parse-analyze-in-Eclipse-td3408033.html
 



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] tolower() identifier downcasing versus multibyte encodings

2011-03-18 Thread Tom Lane
I've been able to reproduce the behavior described here:
http://archives.postgresql.org/pgsql-general/2011-03/msg00538.php
It's specific to UTF8 locales on Mac OS X.  I'm not sure if the
problem can manifest anywhere else; considering that OS X's UTF8
locales have a general reputation of being broken, it may only
happen on that platform.

What is happening is that downcase_truncate_identifier() tries to
downcase identifiers like this:

	unsigned char ch = (unsigned char) ident[i];

	if (ch >= 'A' && ch <= 'Z')
		ch += 'a' - 'A';
	else if (IS_HIGHBIT_SET(ch) && isupper(ch))
		ch = tolower(ch);
	result[i] = (char) ch;

This is of course incapable of successfully downcasing any multibyte
characters, but there's an assumption that isupper() won't return TRUE
for a character fragment in a multibyte locale.  However, on OS X
it seems that that's not the case :-(.  For the particular example
cited by Francisco Figueiredo, I see the byte sequence \303\251
converted to \343\251, because isupper() returns TRUE for \303 and
then tolower() returns \343.  The byte \251 is not changed, but the
damage is already done: we now have an invalidly-encoded string.
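
For anyone who wants to see the effect in isolation, here is a small
self-contained program that applies the same byte-at-a-time case folding
to a UTF-8 string (illustrative only; whether the corruption reproduces
depends on what the platform's isupper()/tolower() do for bytes with the
high bit set, e.g. the OS X UTF8 locales described above):

#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int
main(void)
{
	unsigned char ident[] = "caf\303\251";		/* "café" in UTF-8 */
	unsigned char result[sizeof(ident)];
	int			i;

	setlocale(LC_CTYPE, "");	/* pick up a UTF-8 locale from the environment */

	for (i = 0; ident[i] != '\0'; i++)
	{
		unsigned char ch = ident[i];

		if (ch >= 'A' && ch <= 'Z')
			ch += 'a' - 'A';
		else if (ch >= 0x80 && isupper(ch))		/* bogus for multibyte input */
			ch = tolower(ch);
		result[i] = ch;
	}
	result[i] = '\0';

	/* On an affected platform \303 comes back as \343 and the sequence is no
	 * longer valid UTF-8; elsewhere the string prints unchanged. */
	printf("%s\n", (char *) result);
	return 0;
}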

It looks like the blame for the subsequent disappearance of the bogus
data lies with fprintf back on the client side; that surprises me a bit
because I'd only heard of glibc being so cavalier with data it thought
was invalidly encoded.  But anyway, the origin of the problem is in the
downcasing transformation.

We could possibly fix this by not attempting the downcasing
transformation on high-bit-set characters unless the encoding is
single-byte.  However, we have the exact same downcasing logic embedded
in the functions in src/port/pgstrcasecmp.c, and those don't have any
convenient way of knowing what the prevailing encoding is --- when
compiled for frontend use, they can't use pg_database_encoding_max_length.

Or we could bite the bullet and start using str_tolower(), but the
performance implications of that are unpleasant; not to mention that
we really don't want to re-introduce the Turkish problem with
unexpected handling of i/I in identifiers.

Or we could go the other way and stop downcasing non-ASCII letters
altogether.

None of these options seem terribly attractive.  Thoughts?

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers