Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 2:46 AM, Robert Haas robertmh...@gmail.com wrote:

On further review, I've changed my mind. Making synchronous_commit trump synchronous_replication is appealing conceptually, but it's going to lead to some weird corner cases. For example, a transaction that drops a non-temporary relation always commits synchronously, and 2PC also ignores synchronous_commit. In the case where synchronous_commit=off and synchronous_replication=on, we'd either have to decide that these sorts of transactions aren't going to replicate synchronously (which would give synchronous_commit a rather long reach into areas it doesn't currently touch) or else that it's OK for CREATE TABLE foo () to be totally asynchronous but that DROP TABLE foo requires sync commit AND sync rep. That's pretty weird. What makes more sense to me, after having thought about this more carefully, is to simply make a blanket rule that when synchronous_replication=on, synchronous_commit has no effect. That is easy to understand and document. I'm inclined to think it's OK to let synchronous_replication have this effect even if max_wal_senders=0 or synchronous_standby_names=''; you shouldn't turn synchronous_replication on just for kicks, and I don't think we want to complicate the test in RecordTransactionCommit() more than necessary. We should, however, adjust the logic so that a transaction which has not written WAL can still commit asynchronously: such a transaction has only touched temp or unlogged tables, so it's not important for it to make it to the standby, where that data doesn't exist anyway.

In the first place, I think that it's complicated to keep those two parameters separate. What about merging them into one parameter? What I'm thinking is to remove synchronous_replication and to extend the valid values of synchronous_commit from on/off to async/local/remote/both. Each value works as follows.
async = (synchronous_commit = off, synchronous_replication = off)
    async makes a transaction do local WAL flush and replication asynchronously.

local = (synchronous_commit = on, synchronous_replication = off)
    local makes a transaction wait only for local WAL flush.

remote = (synchronous_commit = off, synchronous_replication = on)
    remote makes a transaction wait only for replication. Local WAL flush is performed by walwriter. This is useless in 9.1 because we must always wait for local WAL flush when we wait for replication. But in the future, if we're able to send WAL before it is written locally (i.e., send WAL from wal_buffers), this might become useful. In 9.1, it seems reasonable to remove this value.

both = (synchronous_commit = on, synchronous_replication = on)
    both makes a transaction wait for both local WAL flush and replication.

Thoughts?

Regards,

-- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
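To make the proposal concrete, the four proposed values map onto the two existing 9.1 GUCs as follows. The merged single-parameter syntax in the comments is hypothetical (it is Fujii's proposal, not committed behavior); only the paired settings reflect the actual separate parameters being discussed.

```
# Proposed merged values and their equivalents in the two separate GUCs:
#
#   synchronous_commit = async   <=>  synchronous_commit = off, synchronous_replication = off
#   synchronous_commit = local   <=>  synchronous_commit = on,  synchronous_replication = off
#   synchronous_commit = remote  <=>  synchronous_commit = off, synchronous_replication = on
#   synchronous_commit = both    <=>  synchronous_commit = on,  synchronous_replication = on
#
# Example under the separate parameters -- equivalent to the proposed "both":
synchronous_commit = on
synchronous_replication = on
synchronous_standby_names = 'standby1'   # sync rep also needs a standby configured
```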
Re: [HACKERS] SSI bug?
hi, thanks for quickly fixing problems. i tested the later version (a2eb9e0c08ee73208b5419f5a53a6eba55809b92) and the only errors i got were "out of shared memory". i'm not sure whether they were caused by SSI activity or not. the following is a snippet from my application log (the same WARNING/ERROR pair was repeated three times):

PG_DIAG_SEVERITY: WARNING
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: shmem.c
PG_DIAG_SOURCE_LINE: 190
PG_DIAG_SOURCE_FUNCTION: ShmemAlloc

PG_DIAG_SEVERITY: ERROR
PG_DIAG_SQLSTATE: 53200
PG_DIAG_MESSAGE_PRIMARY: out of shared memory
PG_DIAG_SOURCE_FILE: dynahash.c
PG_DIAG_SOURCE_LINE: 925
PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value

YAMAMOTO Takashi
Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,
On 18.03.2011 07:13, Fujii Masao wrote:

On Fri, Mar 18, 2011 at 1:17 AM, Robert Haas robertmh...@gmail.com wrote: One thing I'm not quite clear on is what happens if we reach the recovery target before we reach consistency. i.e. create restore point, flush xlog, abnormal shutdown, try to recover to named restore point. Is there any possibility that we can end up paused before Hot Standby has actually started up? Because that would be fairly useless and annoying.

Good catch! In that case, the same situation as (3) would happen. I think that recovery should ignore pause_at_recovery_target until it reaches the consistent point. If we do so, when the recovery target is ahead of the consistent point, recovery just ends at an inconsistent point and throws a FATAL error.

If the recovery target is set to before the consistent point, i.e. before minRecoveryPoint, we should throw an error before recovery even starts. I'm not sure if we check that at the moment.

I'm not sure what to do if the recovery target is beyond minRecoveryPoint and pause_at_recovery_target=true, but the server hasn't been opened for hot standby yet (because it hasn't seen a running-xacts record yet). I agree it's pretty useless and annoying to stop there.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, 2011-03-18 at 14:45 +0900, Fujii Masao wrote:

On Fri, Mar 18, 2011 at 2:52 AM, Robert Haas robertmh...@gmail.com wrote:

On Thu, Mar 10, 2011 at 3:04 PM, Robert Haas robertmh...@gmail.com wrote:

- /* Let the master know that we received some data. */
- XLogWalRcvSendReply();
- XLogWalRcvSendHSFeedback();

This change completely eliminates the difference between write_location and flush_location in pg_stat_replication. If this change is reasonable, we should get rid of write_location from pg_stat_replication since it's useless. If not, this change should be reverted. I'm not sure whether monitoring the difference between write and flush locations is useful. But I guess that someone thought so, and that code was added.

I could go either way on this, but clearly we need to do one or the other. I'm not really sure why this was part of the synchronous replication patch, but after mulling it over I think it's probably right to rip out write_location completely. There shouldn't ordinarily be much of a gap between the write location and the flush location, so it's probably not worth the extra network overhead to keep track of it. We might need to re-add some form of this in the future if we have a version of synchronous replication that only waits for confirmation of receipt rather than confirmation of flush, but we don't have that in 9.1, so why bother? Barring objections, I'll go do that.

I agree to get rid of write_location.

No, don't remove it. We seem to be just looking for things to tweak without any purpose. Removing this adds nothing for us. We will have the column in the future; it is there now, so leave it.

-- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Allowing multiple concurrent base backups
On 17.03.2011 21:39, Robert Haas wrote:

On Mon, Jan 31, 2011 at 10:45 PM, Fujii Masao masao.fu...@gmail.com wrote:

On Tue, Feb 1, 2011 at 1:31 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Hmm, good point. It's harmless, but creating the history file in the first place sure seems like a waste of time.

The attached patch changes pg_stop_backup so that it doesn't create the backup history file if archiving is not enabled. When I tested multiple backups, I found that they can have the same checkpoint location and the same history file name.

$ for ((i=0; i<4; i++)); do
    pg_basebackup -D test$i -c fast -x -l test$i
done

$ cat test0/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test0

$ cat test1/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test1

$ cat test2/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test2

$ cat test3/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test3

$ ls archive/*.backup
archive/00010002.00B0.backup

This would cause a serious problem. Because the backup-end record, which indicates the same START WAL LOCATION, can be written by the first backup before the other finishes, we might wrongly conclude that we've already reached a consistent state upon reading the backup-end record (written by the first backup), before reading the last required WAL file.

/*
 * Force a CHECKPOINT. Aside from being necessary to prevent torn
 * page problems, this guarantees that two successive backup runs will
 * have different checkpoint positions and hence different history
 * file names, even if nothing happened in between.
 *
 * We use CHECKPOINT_IMMEDIATE only if requested by user (via passing
 * fast = true). Otherwise this can take awhile.
 */
RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
                  (fast ? CHECKPOINT_IMMEDIATE : 0));

This problem happens because the above code (in do_pg_start_backup) doesn't actually ensure that concurrent backups have different checkpoint locations. ISTM that we should change the above, or something elsewhere, to ensure that.

Yes, good point.

Or we should include the backup label name in the backup-end record, to prevent a recovery from reading a backup-end record that is not its own.

Backup labels are not guaranteed to be unique either, so including the backup label in the backup-end record doesn't solve the problem. But something else, like a backup-start counter in shared memory or a process id, would work. It won't make the history file names unique, though. Now that we rely on the end-of-backup record for detecting end of backup, the history files are just for documentation purposes. Do we want to give up on history files for backups performed with pg_basebackup? Or we can include the backup counter or similar in the filename too.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Hi, sorry for being late to join this bike-shedding discussion.

On 03/07/2011 05:09 PM, Alvaro Herrera wrote: I think these terms are used inconsistently enough across the industry that what would make the most sense would be to use the common term and document accurately what we mean by it, rather than relying on some external entity's definition, which could change (like Wikipedia's).

I absolutely agree with Alvaro here. The Wikipedia definition seems to speak only about one local and one remote node. Requiring an ack from at least one remote node seems to cover that. Not even Wikipedia goes further in its definition and tries to explain what 'synchronous replication' could mean in case we have more than two nodes.

A somewhat common expectation is that all nodes would have to ack. However, with such a requirement a single node failure brings your cluster to a full stop, so this isn't a practical option.

Google invented the term semi-synchronous for something that's essentially the same as what we have now, I think. However, I wholeheartedly hate that term (based on the reasoning that there's no semi-pregnant, either). Others (like me) use synchronous or (lately rather) eager to mean that only a majority of nodes need to send an ACK. I have to explain what I mean every time.

In the end, I don't have a strong opinion either way anymore. I'm happy to think of the replication between the master and the one standby that sends an ACK first as synchronous. (Even if those may well be different standbys for different transactions.)

Hope to have brought some light into this discussion.

Regards Markus Wanner
Re: [HACKERS] I am confused after reading codes of PostgreSQL three week
Hom,

On 03/17/2011 04:49 PM, Kevin Grittner wrote: That's ambitious.

Absolutely, yes. Exercise patience with yourself.

A method that hasn't been mentioned yet is digging out your debugger and attaching it to a connected Postgres backend. You can then issue a query you are interested in and follow the backend doing its work. That's particularly helpful in trying to find a certain spot of interest. Of course, it doesn't help much in getting the big picture.

Good luck on your journey through the code base.

Regards Markus Wanner
Re: [HACKERS] I am confused after reading codes of PostgreSQL three week
On 18 March 2011 01:57, hom obsidian...@gmail.com wrote: I try to know how a database is implemented

This objective is so vast and so vague that it's difficult to give meaningful help. I'd emphasise Kevin Grittner's very worthwhile advice: try to break your question down into smaller, more specific ones. With a question like "how does postgres work" you're likely to flounder. But with a more targeted question, e.g., "what format does postgres use to save data to disk" or "how does postgres implement ORDER BY", you can make easier progress, and perhaps you could get more useful pointers from the people on this list.

Have you read through the "Overview of System Internals" chapter in the documentation [1]? Perhaps it will help you identify the areas you wish to explore further, and form more specific questions.

[1] http://www.postgresql.org/docs/current/static/overview.html

Cheers, BJ
Re: [HACKERS] I am confused after reading codes of PostgreSQL three week
Hi,

That was the question I was facing 5 months ago, and trust me, I am still doing it even now. With an average of 6+ hours going into the PostgreSQL code, even with best practices (as suggested by the developers) I still think I know less than 10 percent. It is too huge to be swallowed at once. I too had to break it down into pieces, and because everything is so interconnected with everything else, it is quite complicated in the beginning. Start with one piece (planner, parser, executor, storage management, whatever) and slowly it should help you get the bigger picture.

regards, Vaibhav

On Fri, Mar 18, 2011 at 3:39 PM, Brendan Jurd dire...@gmail.com wrote: On 18 March 2011 01:57, hom obsidian...@gmail.com wrote: I try to know how a database is implemented This objective is so vast and so vague that it's difficult to give meaningful help. I'd emphasise Kevin Grittner's very worthwhile advice: try to break your question down into smaller, more specific ones. With a question like "how does postgres work" you're likely to flounder. But with a more targeted question, e.g., "what format does postgres use to save data to disk" or "how does postgres implement ORDER BY", you can make easier progress, and perhaps you could get more useful pointers from the people on this list. Have you read through the "Overview of System Internals" chapter in the documentation [1]? Perhaps it will help you identify the areas you wish to explore further, and form more specific questions. [1] http://www.postgresql.org/docs/current/static/overview.html Cheers, BJ
Re: [HACKERS] Allowing multiple concurrent base backups
On 18.03.2011 10:48, Heikki Linnakangas wrote: On 17.03.2011 21:39, Robert Haas wrote: On Mon, Jan 31, 2011 at 10:45 PM, Fujii Masao masao.fu...@gmail.com wrote: On Tue, Feb 1, 2011 at 1:31 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Hmm, good point. It's harmless, but creating the history file in the first place sure seems like a waste of time.

The attached patch changes pg_stop_backup so that it doesn't create the backup history file if archiving is not enabled. When I tested the multiple backups, I found that they can have the same checkpoint location and the same history file name.

$ for ((i=0; i<4; i++)); do
    pg_basebackup -D test$i -c fast -x -l test$i
done

$ cat test0/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test0

$ cat test1/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test1

$ cat test2/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test2

$ cat test3/backup_label
START WAL LOCATION: 0/2B0 (file 00010002)
CHECKPOINT LOCATION: 0/2E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test3

$ ls archive/*.backup
archive/00010002.00B0.backup

This would cause a serious problem. Because the backup-end record which indicates the same START WAL LOCATION can be written by the first backup before the other finishes. So we might think wrongly that we've already reached a consistency state by reading the backup-end record (written by the first backup) before reading the last required WAL file.

/*
 * Force a CHECKPOINT. Aside from being necessary to prevent torn
 * page problems, this guarantees that two successive backup runs will
 * have different checkpoint positions and hence different history
 * file names, even if nothing happened in between.
 *
 * We use CHECKPOINT_IMMEDIATE only if requested by user (via passing
 * fast = true). Otherwise this can take awhile.
 */
RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
                  (fast ? CHECKPOINT_IMMEDIATE : 0));

This problem happens because the above code (in do_pg_start_backup) actually doesn't ensure that the concurrent backups have the different checkpoint locations. ISTM that we should change the above or elsewhere to ensure that.

Yes, good point.

Here's a patch based on that approach, ensuring that each base backup uses a different checkpoint as the start location. I think I'll commit this, rather than invent a new unique ID mechanism for backups. The latter would need changes in recovery and the control file too, and I don't feel like tinkering with that at this stage.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 15af669..570f02b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -355,10 +355,13 @@ typedef struct XLogCtlInsert
  * exclusiveBackup is true if a backup started with pg_start_backup() is
  * in progress, and nonExclusiveBackups is a counter indicating the number
  * of streaming base backups currently in progress. forcePageWrites is
- * set to true when either of these is non-zero.
+ * set to true when either of these is non-zero. lastBackupStart is the
+ * latest checkpoint redo location used as a starting point for an online
+ * backup.
  */
  bool exclusiveBackup;
  int nonExclusiveBackups;
+ XLogRecPtr lastBackupStart;
 } XLogCtlInsert;

 /*
@@ -8809,6 +8812,19 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 MAXPGPATH)));

 /*
+ * Force an XLOG file switch before the checkpoint, to ensure that the WAL
+ * segment the checkpoint is written to doesn't contain pages with old
+ * timeline IDs. That would otherwise happen if you called
+ * pg_start_backup() right after restoring from a PITR archive: the first
+ * WAL segment containing the startup checkpoint has pages in the
+ * beginning with the old timeline ID. That can cause trouble at recovery:
+ * we won't have a history file covering the old timeline if pg_xlog
+ * directory was not included in the base backup and the WAL archive was
+ * cleared too before starting the backup.
+ */
+ RequestXLogSwitch();
+
+ /*
  * Mark backup active in shared memory. We must do full-page WAL writes
  * during an on-line backup even if not doing so at other times, because
  * it's quite possible for the backup dump to obtain a torn (partially
@@ -8843,43 +8859,54 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  XLogCtl->Insert.forcePageWrites = true;
  LWLockRelease(WALInsertLock);

- /*
- * Force an XLOG file switch before the checkpoint, to
Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,
On Fri, Mar 18, 2011 at 3:22 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: If the recovery target is set to before the consistent point, i.e. before minRecoveryPoint, we should throw an error before recovery even starts. I'm not sure if we check that at the moment.

I don't see how you could check that anyway. How do you know where you're going to see the given XID/timestamp/named restore point until you actually get there?

Not sure what to do if the recovery target is beyond minRecoveryPoint and pause_at_recovery_target=true, but the server hasn't been opened for hot standby yet (because it hasn't seen a running-xacts record yet). I agree it's pretty useless and annoying to stop there.

I think the reasonable options are "enter normal running" and "shut down". In any event, it sounds like someone needs to fix this, and I don't know enough to do it. Can you or Fujii Masao do it?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 3:52 AM, Simon Riggs si...@2ndquadrant.com wrote: I agree to get rid of write_location. No, don't remove it. We seem to be just looking for things to tweak without any purpose. Removing this adds nothing for us. We will have the column in the future, it is there now, so leave it.

Well then, can we revert the part of your patch that causes it to not actually work any more?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 2:25 AM, Fujii Masao masao.fu...@gmail.com wrote:

In the first place, I think that it's complicated to keep those two parameters separate. What about merging them into one parameter? What I'm thinking is to remove synchronous_replication and to extend the valid values of synchronous_commit from on/off to async/local/remote/both. Each value works as follows.

async = (synchronous_commit = off, synchronous_replication = off)
    async makes a transaction do local WAL flush and replication asynchronously.

local = (synchronous_commit = on, synchronous_replication = off)
    local makes a transaction wait only for local WAL flush.

remote = (synchronous_commit = off, synchronous_replication = on)
    remote makes a transaction wait only for replication. Local WAL flush is performed by walwriter. This is useless in 9.1 because we must always wait for local WAL flush when we wait for replication. But in the future, if we're able to send WAL before it is written locally (i.e., send WAL from wal_buffers), this might become useful. In 9.1, it seems reasonable to remove this value.

both = (synchronous_commit = on, synchronous_replication = on)
    both makes a transaction wait for both local WAL flush and replication.

Thoughts?

Well, if we want to make this all use one parameter, the obvious way to do it that wouldn't break backward compatibility is to remove the synchronous_replication parameter altogether and let synchronous_commit take on the values on/local/off, where "on" means wait for sync rep if it's enabled (i.e. synchronous_standby_names != '' and max_wal_senders > 0) or otherwise just wait for local WAL flush, "local" means just wait for local WAL flush, and "off" means commit asynchronously. I'm OK with doing it that way if there's consensus on it, but I'm not eager to break backward compatibility. Simon/Heikki, any opinion on that approach? If we don't have consensus on that, then I think we should just do what I proposed above (and Simon agreed to).
I am not eager to spend any longer than necessary hammering this out; I want to get to beta.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,
On 18.03.2011 14:14, Robert Haas wrote: On Fri, Mar 18, 2011 at 3:22 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: If the recovery target is set to before the consistent point, i.e. before minRecoveryPoint, we should throw an error before recovery even starts. I'm not sure if we check that at the moment.

I don't see how you could check that anyway. How do you know where you're going to see the given XID/timestamp/named restore point until you actually get there?

Oh, good point. I was thinking that the recovery target is a particular LSN, but clearly it's not.

Not sure what to do if the recovery target is beyond minRecoveryPoint and pause_at_recovery_target=true, but the server hasn't been opened for hot standby yet (because it hasn't seen a running-xacts record yet). I agree it's pretty useless and annoying to stop there.

I think the reasonable options are "enter normal running" and "shut down". In any event, it sounds like someone needs to fix this, and I don't know enough to do it. Can you or Fujii Masao do it?

You could also argue for "log a warning, continue until we can open for hot standby, then pause". I can write the patch once we know what we want. All of those options sound reasonable to me. This is such a corner case that it doesn't make sense to make it user-configurable, though.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
On Thu, Mar 17, 2011 at 6:00 PM, Jeff Davis pg...@j-davis.com wrote: On Wed, 2011-03-16 at 13:35 -0400, Robert Haas wrote: 2. If a query cancel interrupt is received (pg_cancel_backend or ^C), then cancel the sync rep wait and issue a warning before acknowledging the commit.

When I saw this commit, I noticed that the WARNING doesn't have an errcode(). It seems like it should -- this is the kind of thing that the client is likely to care about, and may want to handle specially.

Should I invent ERRCODE_WARNING_TRANSACTION_NOT_REPLICATED?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,
On Fri, Mar 18, 2011 at 8:27 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: You could also argue for "log a warning, continue until we can open for hot standby, then pause".

I don't like that one much.

I can write the patch once we know what we want. All of those options sound reasonable to me. This is such a corner case that it doesn't make sense to make it user-configurable, though.

I agree. Since pause_at_recovery_target is ignored when hot_standby=off, I think it would be consistent to treat the case where hot_standby=on but hot standby can't actually be initiated the same way: just ignore the pause request and enter normal running. However, I don't have a super-strong feeling that that's the only sensible way to go, so count me as +0.5 for that approach.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 9:16 AM, MARK CALLAGHAN mdcal...@gmail.com wrote: On Fri, Mar 18, 2011 at 9:27 AM, Markus Wanner mar...@bluegap.ch wrote: Google invented the term semi-synchronous for something that's essentially the same as what we have now, I think. However, I wholeheartedly hate that term (based on the reasoning that there's no semi-pregnant, either).

We didn't invent the term; we just implemented something that Heikki Tuuri briefly described, for example: http://bugs.mysql.com/bug.php?id=7440

In the Google patch and the official MySQL version, the sequence is:

1) commit on master
2) wait for slave to ack
3) return to user

After step 1, another user on the master can observe the commit, and the following is possible:

1) commit on master
2) other user observes that commit on master
3) master blows up, and a user observed a commit that never made it to a slave

I do not think this sequence should be possible in a sync replication system. But it is possible in what has been implemented for MySQL. Thus it was named semi-sync rather than sync.

Thanks for the insight. That can't happen with our implementation, I believe.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] FK constraints NOT VALID by default?
On Thu, Mar 17, 2011 at 5:32 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Mar 17, 2011 at 5:29 PM, Andrew Dunstan and...@dunslane.net wrote: Is this really intended? I sure hope not. That's a bug. Not sure if it's a psql bug or a backend bug, but it's definitely a bug.

It's a backend bug. Prior to Simon's patch, there was an existing skip_validation flag in the Constraint node that indicated whether or not a validation pass was necessary; in a newly created table, for example, we know that it's NOT necessary, because the table can't contain any rows (and therefore there can't be any rows that violate the constraint). The patch tries to make the very same flag indicate whether the user wants the constraint to be added with the NOT VALID attribute. That falls over because the boolean has only two values while there are three cases: validate the constraint; don't validate it, but do mark it valid because the table is guaranteed to be empty; and don't validate it and mark it not valid.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
MARK CALLAGHAN mdcal...@gmail.com wrote: Markus Wanner mar...@bluegap.ch wrote: Google invented the term semi-synchronous for something that's essentially the same as what we have now, I think. However, I wholeheartedly hate that term (based on the reasoning that there's no semi-pregnant, either). To be fair, what we're considering calling semi-synchronous is something which tries to stay in synchronous mode but switches out of it when necessary to meet availability targets. Your analogy doesn't match up at all well -- at least without getting really ugly. We didn't invent the term; we just implemented something that Heikki Tuuri briefly described, for example: http://bugs.mysql.com/bug.php?id=7440 In the Google patch and official MySQL version, the sequence is: 1) commit on master 2) wait for slave to ack 3) return to user After step 1, another user on the master can observe the commit, and the following is possible: 1) commit on master 2) other user observes that commit on master 3) master blows up, and a user observed a commit that never made it to a slave I do not think this sequence should be possible in a sync replication system. Then the only thing you would consider sync replication, as far as I can see, is two-phase commit, which we already have. So your use case seems to be covered already, and we're trying to address other people's needs. The guarantee that some people are looking for is that a successful commit means that the data has been persisted on two separate servers. Others want to try for that, but are willing to compromise it for HA; in general I think they want to know when the guarantee is not there so they can take action to get back to a safer condition. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_dump -X
On Mon, Mar 14, 2011 at 9:56 AM, Robert Haas robertmh...@gmail.com wrote: On Sat, Mar 12, 2011 at 12:56 AM, Bruce Momjian br...@momjian.us wrote: Presumably the point of deprecating the feature is that we'd eventually remove it. If 4 major releases isn't long enough, what is? Good point. Unless there are further objections, I think we should go ahead and remove this. If there ARE further objections, then please say what release you think it would be OK to remove it in, or why you think it's worth keeping around indefinitely given that the last version in which it was documented is now EOL. Hearing no further objections, I have removed this code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Mon, Mar 7, 2011 at 3:44 AM, Fujii Masao masao.fu...@gmail.com wrote: On Mon, Mar 7, 2011 at 5:27 PM, Fujii Masao masao.fu...@gmail.com wrote: On Mon, Mar 7, 2011 at 7:51 AM, Simon Riggs si...@2ndquadrant.com wrote: Efficient transaction-controlled synchronous replication. If a standby is broadcasting reply messages and we have named one or more standbys in synchronous_standby_names then allow users who set synchronous_replication to wait for commit, which then provides strict data integrity guarantees. Design avoids sending and receiving transaction state information so minimises bookkeeping overheads. We synchronize with the highest priority standby that is connected and ready to synchronize. Other standbys can be defined to take over in case of standby failure. This version has very strict behaviour; more relaxed options may be added at a later date. Pretty cool! I very much appreciate your efforts and contributions. And... I found one bug ;) You seem to have wrongly removed the check of max_wal_senders in SyncRepWaitForLSN. This can make the backend wait for replication even if max_wal_senders = 0. I could reproduce this problematic situation on my machine. The attached patch fixes this problem: if (strlen(SyncRepStandbyNames) > 0 && max_wal_senders == 0) ereport(ERROR, (errmsg("Synchronous replication requires WAL streaming (max_wal_senders > 0)"))); Shouldn't the above check also be required after pg_ctl reload, since synchronous_standby_names can be changed by SIGHUP? Or how about just removing it? If the patch I submitted is committed, a non-empty synchronous_standby_names with max_wal_senders = 0 is no longer unsafe. This configuration is now harmless in the sense that it no longer horribly breaks the entire system, but it's still pretty useless, so this might be deemed a valuable sanity check. 
However, I'm reluctant to leave it in there, because someone could change their config to this state, pg_ctl reload, see everything working, and then later stop the cluster and be unable to start it back up again. Since most people don't shut their database systems down very often, they might not discover that they have an invalid config until much later. I think it's probably not a good idea to have configs that are valid on reload but prevent startup, so I'm inclined to either remove this check altogether or downgrade it to a warning. As a side note, it's not very obvious why some parts of PostmasterMain report problems by doing write_stderr() and exit() while other parts use ereport(ERROR). This check and the nearby checks on WAL level are immediately preceded and followed by other checks that use the opposite technique. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
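Downgrading the check to a warning, as Robert suggests, might look roughly like the following. This is a hypothetical sketch with a stubbed-out check function, not the real PostmasterMain code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result of validating the sync-rep settings at startup. */
typedef enum { SYNCREP_CONFIG_OK, SYNCREP_CONFIG_USELESS } SyncRepConfigStatus;

/* Non-empty synchronous_standby_names with max_wal_senders = 0 is harmless
 * but useless, so report it without refusing to start: warn, don't error. */
static SyncRepConfigStatus
check_sync_rep_config(const char *standby_names, int max_wal_senders)
{
    if (standby_names != NULL && strlen(standby_names) > 0 &&
        max_wal_senders == 0)
        return SYNCREP_CONFIG_USELESS;
    return SYNCREP_CONFIG_OK;
}
```

With this shape, a config that passes on reload can never block a later startup; the worst outcome is a warning in the log.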
Re: [HACKERS] I am confused after reading codes of PostgreSQL three week
2011/3/17 Bruce Momjian br...@momjian.us: hom wrote: Hi, I'm trying to learn how a database is implemented, and I have been reading the PG source code for a month. Now I only know a little about how PG works. :( I know what PG does, but I don't know why it works that way. :( :( Even worse, I don't feel I understand the source code much better; it may be that I couldn't split the large modules into small pieces, which would help with understanding. Is there any article or some other way that could help me understand the source code? I assume you have looked at these places: http://wiki.postgresql.org/wiki/Developer_FAQ http://www.postgresql.org/developer/coding -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + Thanks Bruce. I am also reading your book PostgreSQL Introduction and Concepts. :) -- Best Wishes! hom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] I am confused after reading codes of PostgreSQL three week
2011/3/17 Kevin Grittner kevin.gritt...@wicourts.gov: hom obsidian...@gmail.com wrote: I'm trying to learn how a database is implemented and I have been reading the PG source code for a month. That's ambitious. find -name '*.h' -or -name '*.c' | egrep -v '^\./src/test/.+/tmp_check/' | xargs cat | wc -l reports 1059144. Depending on how you do the math, that's about 50,000 lines of code per day to get through it in the time you mention. Is there any article or some way that could help me understand the source code? Your best bet would be to follow links from the Developers tab on the main PostgreSQL web site: http://www.postgresql.org/developer/ In particular the Developer FAQ page: http://wiki.postgresql.org/wiki/Developer_FAQ And the Coding links: http://www.postgresql.org/developer/coding may help. Before reading code in a directory, be sure to read any README file(s) in that directory carefully. It helps to read this list. In spite of reviewing all of that myself, it was rather intimidating when I went to work on a major patch 14 months ago. Robert Haas offered some good advice which served me well in that effort -- divide the effort into a series of incremental steps, each of which deals with a small enough portion of the code to get your head around. As you work in any one narrow area, it becomes increasingly clear; with that as a base you can expand your scope. When you're working in the code, it is tremendously helpful to use an editor with ctags support (or similar IDE functionality). I hope this is helpful. Good luck. -Kevin Thanks Kevin. I will follow your advice, and I will also post questions to the mailing list for help. Thanks a lot. -- Best Wishes! hom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FK constraints NOT VALID by default?
On Fri, 2011-03-18 at 09:39 -0400, Robert Haas wrote: On Thu, Mar 17, 2011 at 5:32 PM, Robert Haas robertmh...@gmail.com wrote: On Thu, Mar 17, 2011 at 5:29 PM, Andrew Dunstan and...@dunslane.net wrote: Is this really intended? I sure hope not. That's a bug. Not sure if it's a psql bug or a backend bug, but it's definitely a bug. It's a backend bug. Prior to Simon's patch, there was an existing skip_validation flag in the Constraint node that indicated whether or not a validation pass was necessary - in a newly created table, for example, we know that it's NOT necessary, because the table can't contain any rows (and therefore there can't be any rows that violate the constraint). The patch tries to make the very same flag indicate whether the user wants the constraint to be added with the NOT VALID attribute, which of course falls over because the Boolean only has two values and there are three cases (validate it, don't validate it but do mark it valid because the table is guaranteed to be empty, don't validate it and mark it not valid). Thanks Robert. Yes, my bad. Will fix. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
On Fri, 2011-03-18 at 08:27 -0400, Robert Haas wrote: On Thu, Mar 17, 2011 at 6:00 PM, Jeff Davis pg...@j-davis.com wrote: On Wed, 2011-03-16 at 13:35 -0400, Robert Haas wrote: 2. If a query cancel interrupt is received (pg_cancel_backend or ^C), then cancel the sync rep wait and issue a warning before acknowledging the commit. When I saw this commit, I noticed that the WARNING doesn't have an errcode(). It seems like it should -- this is the kind of thing that the client is likely to care about, and may want to handle specially. Should I invent ERRCODE_WARNING_TRANSACTION_NOT_REPLICATED? I think it's reasonable to invent a new code here. Perhaps use the word synchronous rather than replicated, though? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, 2011-03-18 at 13:16 +0000, MARK CALLAGHAN wrote: On Fri, Mar 18, 2011 at 9:27 AM, Markus Wanner mar...@bluegap.ch wrote: Google invented the term semi-synchronous for something that's essentially the same as what we have now, I think. However, I wholeheartedly hate that term (based on the reasoning that there's no semi-pregnant, either). We didn't invent the term; we just implemented something that Heikki Tuuri briefly described, for example: http://bugs.mysql.com/bug.php?id=7440 In the Google patch and official MySQL version, the sequence is: 1) commit on master 2) wait for slave to ack 3) return to user After step 1, another user on the master can observe the commit, and the following is possible: 1) commit on master 2) other user observes that commit on master 3) master blows up, and a user observed a commit that never made it to a slave I do not think this sequence should be possible in a sync replication system. But it is possible in what has been implemented for MySQL. Thus it was named semi-sync rather than sync. Thanks for clearing it up, Mark. We should definitely not be calling what we have semi-sync. The semantics are very different. In PostgreSQL other users cannot observe the commit until an acknowledgement has been received. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
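The ordering difference Simon describes can be modeled as a toy: both protocols perform the same two steps, but in a different order, and only the semi-sync order can leave an observed-but-unreplicated commit if the master dies between the steps. (As Heikki notes downthread, crash recovery changes the picture again; this sketch models only the pre-crash window, with invented names.)

```c
#include <stdbool.h>

/* Toy model: run a commit sequence, stopping after `steps_before_crash`
 * steps to simulate the master dying mid-commit. */
typedef struct
{
    bool visible;   /* other sessions on the master can see the commit */
    bool acked;     /* the standby/slave has acknowledged the commit   */
} TxnState;

/* MySQL-style semi-sync, per the thread: commit (become visible) first,
 * then wait for the slave's ack. */
static TxnState
semi_sync_commit(int steps_before_crash)
{
    TxnState t = {false, false};
    if (steps_before_crash >= 1) t.visible = true;  /* step 1: commit on master */
    if (steps_before_crash >= 2) t.acked = true;    /* step 2: slave acks       */
    return t;
}

/* PostgreSQL sync rep, per Simon: wait for the ack before making the
 * commit visible to other sessions. */
static TxnState
sync_rep_commit(int steps_before_crash)
{
    TxnState t = {false, false};
    if (steps_before_crash >= 1) t.acked = true;    /* step 1: standby acks     */
    if (steps_before_crash >= 2) t.visible = true;  /* step 2: commit visible   */
    return t;
}
```

A crash after step 1 is exactly Mark's scenario: in the semi-sync order the commit was visible without being acked, while in the sync-rep order a visible commit always implies an ack.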
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Mark, On 03/18/2011 02:16 PM, MARK CALLAGHAN wrote: We didn't invent the term, we just implemented something that Heikki Tuuri briefly described, for example: http://bugs.mysql.com/bug.php?id=7440 Oh, okay, good to know who to blame ;-) However, I didn't mean to offend anybody. I do not think this sequence should be possible in a sync replication system. But it is possible in what has been implemented for MySQL. Thus it was named semi-sync rather than sync. Sure? Their documentation [1] isn't entirely clear on that first: the master blocks after the commit is done and waits until at least one semisynchronous slave acknowledges that it has received all events for the transaction and the slave acknowledges receipt of a transaction's events only after the events have been written to its relay log and flushed to disk. But then continues to say that [the master is] waiting for acknowledgment from a slave after having performed a commit, so this indeed sounds like the transaction is visible to other sessions before the slave ACKs. So, semi-sync may show temporary inconsistencies in case of a master failure. Wow! Regards Markus Wanner [1] MySQL 5.5 reference manual, 17.3.8. Semisynchronous Replication: http://dev.mysql.com/doc/refman/5.5/en/replication-semisync.html -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [COMMITTERS] pgsql: Fix various possible problems with synchronous replication.
On Thu, Mar 17, 2011 at 1:59 PM, Thom Brown t...@linux.com wrote: On 17 March 2011 17:55, Robert Haas robertmh...@gmail.com wrote: On Thu, Mar 17, 2011 at 1:24 PM, Thom Brown t...@linux.com wrote: errdetail(The transaction has already been committed locally but might have not been replicated to the standby.))); errdetail(The transaction has committed locally, but may not have replicated to the standby.))); Could we have these saying precisely the same thing? Yeah. Which is better? Personally I prefer the 2nd. It reads better somehow. I hacked on this a bit more and ended up with a hybrid of the two. Hope you like it; but anyway it's consistent. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
On Fri, Mar 18, 2011 at 10:17 AM, Jeff Davis pg...@j-davis.com wrote: On Fri, 2011-03-18 at 08:27 -0400, Robert Haas wrote: On Thu, Mar 17, 2011 at 6:00 PM, Jeff Davis pg...@j-davis.com wrote: On Wed, 2011-03-16 at 13:35 -0400, Robert Haas wrote: 2. If a query cancel interrupt is received (pg_cancel_backend or ^C), then cancel the sync rep wait and issue a warning before acknowledging the commit. When I saw this commit, I noticed that the WARNING doesn't have an errcode(). It seems like it should -- this is the kind of thing that the client is likely to care about, and may want to handle specially. Should I invent ERRCODE_WARNING_TRANSACTION_NOT_REPLICATED? I think it's reasonable to invent a new code here. Perhaps use the word synchronous rather than replicated, though? I think we have to, because it's definitely not the same situation that someone would expect after ERRCODE_QUERY_CANCELLED. But ERRCODE_WARNING_TRANSACTION_NOT_SYNCHRONOUS, which is what I read your reply as suggesting, seems pretty wonky. I wouldn't know what that meant. Another option might be: ERRCODE_(WARNING_?)REPLICATION_WAIT_CANCELLED ...which might have something to recommend it. Other thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] I am confused after reading codes of PostgreSQL three week
2011/3/18 Markus Wanner mar...@bluegap.ch: Hom, On 03/17/2011 04:49 PM, Kevin Grittner wrote: That's ambitious. Absolutely, yes. Exercise patience with yourself. A method that hasn't been mentioned yet is digging out your debugger and attaching it to a connected Postgres backend. You can then issue a query you are interested in and follow the backend doing its work. That's particularly helpful in trying to find a certain spot of interest. Of course, it doesn't help much in getting the big picture. Good luck on your journey through the code base. Regards Markus Wanner Thanks Markus. It's a hard time at the beginning. I should stay patient. :) -- Best Wishes! hom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] I am confused after reading codes of PostgreSQL three week
2011/3/18 Brendan Jurd dire...@gmail.com: On 18 March 2011 01:57, hom obsidian...@gmail.com wrote: I try to know how a database is implemented This objective is so vast and so vague that it's difficult to give meaningful help. I'd emphasise Kevin Grittner's very worthwhile advice. Try to break your question down into smaller, more specific ones. With a question like "how does postgres work" you're likely to flounder. But with a more targeted question, e.g., "what format does postgres use to save data to disk" or "how does postgres implement ORDER BY", you can make easier progress, and perhaps you could get more useful pointers from the people on this list. Have you read through the Overview of System Internals chapter in the documentation [1]? Perhaps it will help you identify the areas you wish to explore further, and form more specific questions. [1] http://www.postgresql.org/docs/current/static/overview.html Cheers, BJ Thanks Brendan. I had a quick glance at the Overview of System Internals before. I think it is time to read it again. -- Best Wishes! hom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Hi, On 03/18/2011 02:40 PM, Kevin Grittner wrote: Then the only thing you would consider sync replication, as far as I can see, is two phase commit I think waiting for the ACK before actually making the changes from the transaction visible (COMMIT) would suffice to prevent such an inconsistency from manifesting. But obviously, MySQL decided it's not worth doing that, as it's such a rare event and a short period of time that may show inconsistencies... people's needs. The guarantee that some people are looking for is that a successful commit means that the data has been persisted on two separate servers. Well, MySQL's semi-sync also seems to guarantee that WRT the client confirmation. And transactions always appear committed *before* the client receives the COMMIT acknowledgement, due to the time it takes for the ACK to arrive at the client. It's just that committing *before* receiving the slave's ACK might make a transaction visible that's not durable yet. But I guess that simplified the implementation for them... Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [COMMITTERS] pgsql: Fix various possible problems with synchronous replication.
On 18 March 2011 14:23, Robert Haas robertmh...@gmail.com wrote: On Thu, Mar 17, 2011 at 1:59 PM, Thom Brown t...@linux.com wrote: On 17 March 2011 17:55, Robert Haas robertmh...@gmail.com wrote: On Thu, Mar 17, 2011 at 1:24 PM, Thom Brown t...@linux.com wrote: errdetail(The transaction has already been committed locally but might have not been replicated to the standby.))); errdetail(The transaction has committed locally, but may not have replicated to the standby.))); Could we have these saying precisely the same thing? Yeah. Which is better? Personally I prefer the 2nd. It reads better somehow. I hacked on this a bit more and ended up with a hybrid of the two. Hope you like it; but anyway it's consistent. Yes, cheers :) -- Thom Brown Twitter: @darkixion IRC (freenode): dark_ixion Registered Linux user: #516935 EnterpriseDB UK: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] I am confused after reading codes of PostgreSQL three week
2011/3/18 Vaibhav Kaushal vaibhavkaushal...@gmail.com: Hi, That was the question I was facing 5 months ago and, trust me, I am doing it even now. With an average of 6+ hours going into the PostgreSQL code, even with best practices (as suggested by the developers), I still think I know less than 10 percent. It is too huge to be swallowed at once. I too had to break it down into pieces, and because everything is so interconnected with everything else, it is quite complicated in the beginning. Start with one piece: planner, parser, executor, storage management, whatever, and slowly it should help you get the bigger picture. regards, Vaibhav Thanks Vaibhav. I have stepped into the parser before, but I hit a problem: when I debug and step into scanner_init(), Eclipse always opens scan.l, and the execution order does not match the file. I think it should actually be scan.c, but I don't know how to trace into scan.c :( PS: I have turned the Search for duplicate source files option on. I have posted to the mailing list, but it has not been solved. Here is the link: http://postgresql.1045698.n5.nabble.com/Open-unmatch-source-file-when-step-into-parse-analyze-in-Eclipse-td3408033.html -- Best Wishes! hom -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Simon Riggs si...@2ndquadrant.com wrote: In PostgreSQL other users cannot observe the commit until an acknowledgement has been received. Really? I hadn't picked up on that. That makes for a lot of complication on crash-and-recovery of a master, but if we can pull it off, that's really cool. If we do that and MySQL doesn't, we definitely don't want to use the same terminology they do, which would imply the same behavior. Apologies for not picking up on that aspect of the implementation. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Thu, Mar 17, 2011 at 5:46 PM, Robert Haas robertmh...@gmail.com wrote: What makes more sense to me after having thought about this more carefully is to simply make a blanket rule that when synchronous_replication=on, synchronous_commit has no effect. That is easy to understand and document. For what it's worth, "has no effect" doesn't make much sense to me. It's a boolean: either commits are going to block or they're not. What happened to the idea of a three-way switch? synchronous_commit = off synchronous_commit = disk synchronous_commit = replica With "on" being a synonym for "disk" for backwards compatibility. Then we could add more options later for more complex conditions, like waiting for one server in each data centre, or waiting for one of a certain set of servers while ignoring the less reliable mirrors, etc. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 2nd Level Buffer Cache
On Thu, 17 Mar 2011 16:02:18 -0500, Kevin Grittner wrote: Radosław Smogura rsmog...@softperience.eu wrote: I have implemented an initial concept of a 2nd level cache. The idea is to keep some segments of shared memory for special buffers (e.g. indexes) to prevent them from being overwritten by other operations. I added this functionality to the nbtree index scan. I tested this by doing an index scan, a seq read, dropping system buffers, then doing the index scan again, and in a few places I saw performance improvements, but actually I'm not sure if this was just random or an intended improvement. I've often wondered about this. In a database I developed back in the '80s it was clearly a win to have a special cache for index entries and other special pages closer to the database than the general cache. A couple things have changed since the '80s (I mean, besides my waistline and hair color), and PostgreSQL has many differences from that other database, so I haven't been sure it would help as much, but I have wondered. I can't really look at this for a couple weeks, but I'm definitely interested. I suggest that you add this to the next CommitFest as a WIP patch, under the Performance category. https://commitfest.postgresql.org/action/commitfest_view/open There are a few places to optimize the code as well, and the patch needs much work, but could you look at it and give opinions? For something like this it makes perfect sense to show proof of concept before trying to cover everything. -Kevin Yes, there has been some change, and I looked at this more carefully, as my performance results weren't what I expected. I found PG uses a BufferAccessStrategy to do sequential scans, so my test query took only 32 buffers from the pool and didn't overwrite the index pool much. This BAS is really surprising. In any case, when I finish polishing I will send a good patch, with proof. Actually the idea of this patch was like this: some operations require many buffers; PG uses a clock sweep to get the next free buffer, so it may overwrite an index buffer. 
From the point of view of good database design we should use indexes, so purging an index out of the cache will affect performance. As a side effect I saw that this 2nd level cache keeps the pg_* indexes in memory too, so I'm thinking of including a 3rd level cache for some pg_* tables. Regards, Radek -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
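The core idea in the patch — a clock sweep that cannot evict buffers in a reserved index segment — can be sketched as a toy model. This is an invented illustration of the technique, not the patch itself (the real patch partitions shared memory differently):

```c
#include <stdbool.h>

/* Toy buffer pool: the first INDEX_RESERVED buffers form a hypothetical
 * "2nd level" segment for index pages; the clock sweep for ordinary pages
 * circulates only over the remaining shared segment, so a large seq scan
 * can never evict a cached index page. */
#define NBUFFERS       8
#define INDEX_RESERVED 3

typedef struct
{
    bool in_use;
    int  usage_count;   /* decremented by the sweep; 0 means evictable */
} BufDesc;

static BufDesc pool[NBUFFERS];
static int clock_hand = INDEX_RESERVED;

/* Classic clock sweep ("second chance"), restricted to the shared segment. */
static int
get_victim_buffer(void)
{
    for (;;)
    {
        int b = clock_hand;

        clock_hand++;
        if (clock_hand >= NBUFFERS)
            clock_hand = INDEX_RESERVED;   /* wrap, skipping the index segment */

        if (!pool[b].in_use || pool[b].usage_count == 0)
            return b;
        pool[b].usage_count--;             /* give it a second chance */
    }
}
```

The design trade-off discussed downthread applies directly: the reserved segment shrinks the pool available to everything else, so the win depends on index pages being reused more than the heap pages they displace.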
Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 10:55 AM, Greg Stark gsst...@mit.edu wrote: On Thu, Mar 17, 2011 at 5:46 PM, Robert Haas robertmh...@gmail.com wrote: What makes more sense to me after having thought about this more carefully is to simply make a blanket rule that when synchronous_replication=on, synchronous_commit has no effect. That is easy to understand and document. For what it's worth has no effect doesn't make much sense to me. It's a boolean, either commits are going to block or they're not. What happened to the idea of a three-way switch? synchronous_commit = off synchronous_commit = disk synchronous_commit = replica With on being a synonym for disk for backwards compatibility. Then we could add more options later for more complex conditions like waiting for one server in each data centre or waiting for one of a certain set of servers ignoring the less reliable mirrors, etc. This is similar to what I suggested upthread, except that I suggested on/local/off, with the default being on. That way if you set synchronous_standby_names, you get synchronous replication without changing another setting, but you can say local instead if for some reason you want the middle behavior. If we're going to do it all with one GUC, I think that way makes more sense. If you're running sync rep, you might still have some transactions that you don't care about, but that's what async commit is for. It's a funny kind of transaction that we're OK with losing if we have a failover but we're not OK with losing if we have a local crash from which we recover without failing over. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 2nd Level Buffer Cache
rsmogura rsmog...@softperience.eu wrote: Yes, there is some change, and I looked at this more carefully, as my performance results wasn't such as I expected. I found PG uses BufferAccessStrategy to do sequence scans, so my test query took only 32 buffers from pool and didn't overwritten index pool too much. This BAS is really surprising. In any case when I end polishing I will send good patch, with proof. Yeah, that heuristic makes this less critical, for sure. Actually idea of this patch was like this: Some operations requires many buffers, PG uses clock sweep to get next free buffer, so it may overwrite index buffer. From point of view of good database design We should use indices, so purging out index from cache will affect performance. As the side effect I saw that this 2nd level keeps pg_* indices in memory too, so I think to include 3rd level cache for some pg_* tables. Well, the more complex you make it the more overhead there is, which makes it harder to come out ahead. FWIW, in musing about it (as recently as this week), my idea was to add another field which would factor into the clock sweep calculations. For indexes, it might be levels above leaf pages. I haven't reviewed the code in depth to know how to use it, this was just idle daydreaming based on that prior experience. It's far from certain that the concept will actually prove beneficial in PostgreSQL. Maybe the thing to focus on first is the oft-discussed benchmark farm (similar to the build farm), with a good mix of loads, so that the impact of changes can be better tracked for multiple workloads on a variety of platforms and configurations. Without something like that it is very hard to justify the added complexity of an idea like this in terms of the performance benefit gained. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
On Fri, 2011-03-18 at 10:27 -0400, Robert Haas wrote: ERRCODE_(WARNING_?)REPLICATION_WAIT_CANCELLED ...which might have something to recommend it. Works for me. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, 2011-03-18 at 11:07 -0400, Robert Haas wrote: On Fri, Mar 18, 2011 at 10:55 AM, Greg Stark gsst...@mit.edu wrote: On Thu, Mar 17, 2011 at 5:46 PM, Robert Haas robertmh...@gmail.com wrote: What makes more sense to me after having thought about this more carefully is to simply make a blanket rule that when synchronous_replication=on, synchronous_commit has no effect. That is easy to understand and document. For what it's worth has no effect doesn't make much sense to me. It's a boolean, either commits are going to block or they're not. What happened to the idea of a three-way switch? synchronous_commit = off synchronous_commit = disk synchronous_commit = replica With on being a synonym for disk for backwards compatibility. Then we could add more options later for more complex conditions like waiting for one server in each data centre or waiting for one of a certain set of servers ignoring the less reliable mirrors, etc. This is similar to what I suggested upthread, except that I suggested on/local/off, with the default being on. That way if you set synchronous_standby_names, you get synchronous replication without changing another setting, but you can say local instead if for some reason you want the middle behavior. If we're going to do it all with one GUC, I think that way makes more sense. If you're running sync rep, you might still have some transactions that you don't care about, but that's what async commit is for. It's a funny kind of transaction that we're OK with losing if we have a failover but we're not OK with losing if we have a local crash from which we recover without failing over. I much prefer a single switch, which is what I originally suggested. Changing the meaning of synchronous_commit seems a problem. 
durability = localmemory
durability = localdisk
(durability = remotereceive - has no meaning in current code)
durability = remotedisk
durability = remoteapply
It also allows us to add more options in the future. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
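Simon's proposal would collapse the two settings into one GUC set in postgresql.conf. A hypothetical sketch of what that might have looked like; note that "durability" and these value names were proposals on this thread, not settings that exist in any released PostgreSQL:

```
# Hypothetical postgresql.conf fragment -- illustrates the proposed single
# "durability" GUC; neither the parameter nor these values were ever committed.
durability = remotedisk    # wait for the standby to flush the commit to disk
# other proposed levels: localmemory, localdisk, remotereceive, remoteapply
```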
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On 03/18/2011 03:52 PM, Kevin Grittner wrote: Really? I hadn't picked up on that. That makes for a lot of complication on crash-and-recovery of a master What complication do you have in mind here? I think of it the opposite way (at least for Postgres, that is): committing a transaction that's not acknowledged means having to revert a (locally only) committed transaction if you want to use the current data to recover to some cluster-agreed state. (Of course, you can always simply transfer the whole ...) If you don't commit the transaction before the ACK in the first place, you don't have anything special to do upon recovery. Regards Markus Wanner
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On 18.03.2011 16:52, Kevin Grittner wrote: Simon Riggs si...@2ndquadrant.com wrote: In PostgreSQL other users cannot observe the commit until an acknowledgement has been received. Really? I hadn't picked up on that. That makes for a lot of complication on crash-and-recovery of a master, but if we can pull it off, that's really cool. If we do that and MySQL doesn't, we definitely don't want to use the same terminology they do, which would imply the same behavior. To be clear: other users cannot observe the commit until the standby acknowledges it - unless the master crashes while waiting for the acknowledgment. If that happens, the commit will be visible to everyone after recovery. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 2:19 PM, Markus Wanner mar...@bluegap.ch wrote: Their documentation [1] isn't entirely clear on that. First: "the master blocks after the commit is done and waits until at least one semisynchronous slave acknowledges that it has received all events for the transaction" and "the slave acknowledges receipt of a transaction's events only after the events have been written to its relay log and flushed to disk". But then it continues to say that "[the master is] waiting for acknowledgment from a slave after having performed a commit", so this indeed sounds like the transaction is visible to other sessions before the slave ACKs. Yes, their docs are not clear on this. -- Mark Callaghan mdcal...@gmail.com
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
On 18.03.2011 17:38, Jeff Davis wrote: On Fri, 2011-03-18 at 10:27 -0400, Robert Haas wrote: ERRCODE_(WARNING_?)REPLICATION_WAIT_CANCELLED ...which might have something to recommend it. Works for me. Yes, sounds reasonable. Without WARNING_, please. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 2:37 PM, Markus Wanner mar...@bluegap.ch wrote: Hi, On 03/18/2011 02:40 PM, Kevin Grittner wrote: Then the only thing you would consider sync replication, as far as I can see, is two phase commit I think waiting for the ACK before actually making the changes from the transaction visible (COMMIT) would suffice for disallowing such an inconsistency to manifest. But obviously, MySQL decided it's not worth doing that, as it's such a rare event and a short period of time that may show inconsistencies... There are fewer options for implementing this in MySQL because replication requires a binlog on the master and that requires the internal use of XA to keep the binlog and InnoDB in sync as they are separate resource managers. In theory, this can be changed so that commit is only forced for the binlog and then on a crash missing transactions could be copied from the binlog to InnoDB, but I don't think this will ever change. By "fewer options" I mean that commit in MySQL with InnoDB and the binlog requires:
1) prepare to InnoDB (force transaction log to disk for changes from this transaction)
2) write binlog events from this transaction to the binlog
3) write XID event to the binlog (at this point transaction commit is official, will survive a crash)
4) force binlog to disk
5) release row locks held by transaction in InnoDB
6) write commit record to InnoDB transaction log
7) force write of commit record to disk
Group commit is done for the fsyncs from steps 1 and 7. It is not done for the fsync done in step 4. Regardless, the processing above is complicated even without semi-sync. AFAIK, semi-sync code occurs after step 7, but I have not looked at the official version of semi-sync code in MySQL and my memory of the work we did at Google is vague. It is great if Postgres doesn't have this issue. It wasn't clear to me from lurking on this list. I hope your docs highlight the behavior, as not having the issue is a big deal.
-- Mark Callaghan mdcal...@gmail.com
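The seven-step ordering Callaghan describes can be sketched as pseudocode. This is only an illustration of the sequencing constraints he lists (which fsyncs happen, and where group commit applies) -- the function and step names are invented here, not MySQL source code:

```python
# Hedged sketch of the MySQL/InnoDB + binlog commit sequence described above.
# Step names are illustrative labels, not real server internals.

def innodb_binlog_commit(log):
    log.append("innodb_prepare_fsync")   # 1) prepare: force InnoDB txn log to disk (group commit possible)
    log.append("write_binlog_events")    # 2) write this transaction's events to the binlog
    log.append("write_binlog_xid")       # 3) XID event: commit now official, survives a crash
    log.append("binlog_fsync")           # 4) force binlog to disk (no group commit here, per Callaghan)
    log.append("release_row_locks")      # 5) release InnoDB row locks held by the transaction
    log.append("write_innodb_commit")    # 6) write commit record to the InnoDB transaction log
    log.append("innodb_commit_fsync")    # 7) force commit record to disk (group commit possible)
    # per the discussion above, the semi-sync ACK wait would come after this point
    return log

steps = innodb_binlog_commit([])
```

The point of the sketch is that durability is established at step 3/4 (binlog), before row locks are released and before any semi-sync wait, which is why visibility can precede the slave's ACK.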
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, 2011-03-18 at 17:47 +0200, Heikki Linnakangas wrote: On 18.03.2011 16:52, Kevin Grittner wrote: Simon Riggs si...@2ndquadrant.com wrote: In PostgreSQL other users cannot observe the commit until an acknowledgement has been received. Really? I hadn't picked up on that. That makes for a lot of complication on crash-and-recovery of a master, but if we can pull it off, that's really cool. If we do that and MySQL doesn't, we definitely don't want to use the same terminology they do, which would imply the same behavior. To be clear: other users cannot observe the commit until standby acknowledges it - unless the master crashes while waiting for the acknowledgment. If that happens, the commit will be visible to everyone after recovery. No, only in the case where you choose not to fail over to the standby when you crash, which would be a fairly strange choice after the effort to set up the standby. In a correctly configured and operated cluster what I say above is fully correct and needs no addendum. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] 2nd Level Buffer Cache
On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Maybe the thing to focus on first is the oft-discussed benchmark farm (similar to the build farm), with a good mix of loads, so that the impact of changes can be better tracked for multiple workloads on a variety of platforms and configurations. Without something like that it is very hard to justify the added complexity of an idea like this in terms of the performance benefit gained. A related area that could use some looking at is why performance tops out at shared_buffers ~8GB and starts to fall thereafter. InnoDB can apparently handle much larger buffer pools without a performance drop-off. There are some advantages to our reliance on the OS buffer cache, to be sure, but as RAM continues to grow this might start to get annoying. On a 4GB system you might have shared_buffers set to 25% of memory, but on a 64GB system it'll be a smaller percentage, and as memory capacities continue to climb it'll be smaller still. Unfortunately I don't have the hardware to investigate this, but it's worth thinking about, especially if we're thinking of doing things that add more caching. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On 18.03.2011 16:52, Kevin Grittner wrote: Simon Riggs si...@2ndquadrant.com wrote: In PostgreSQL other users cannot observe the commit until an acknowledgement has been received. Really? I hadn't picked up on that. That makes for a lot of complication on crash-and-recovery of a master, but if we can pull it off, that's really cool. Markus Wanner mar...@bluegap.ch wrote: What complication do you have in mind here? Basically, what Heikki addresses. It has to be committed after crash and recovery, and deal with replicas which may or may not have been notified and may or may not have applied the transaction. Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: To be clear: other users cannot observe the commit until standby acknowledges it - unless the master crashes while waiting for the acknowledgment. If that happens, the commit will be visible to everyone after recovery. Right. I was kinda assuming that this was the behavior, because otherwise one or more replicas could be ahead of the master after recovery, which would be horribly broken. I agree that the behavior which you describe is much better than allowing other transactions to see the work of the pending COMMIT. In fact, on further reflection, allowing other transactions to see work before the committing transaction returns could lead to broken behavior if that viewing transaction took some action based on that, the master crashed, recovery was done using a standby, and that standby hadn't persisted the transaction. So this behavior is necessary for good behavior. Even though that perfect storm of events might be fairly rare, the difference in the level of confidence in correctness is significant, and certainly something to brag about. -Kevin
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 12:19 PM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, 2011-03-18 at 17:47 +0200, Heikki Linnakangas wrote: On 18.03.2011 16:52, Kevin Grittner wrote: Simon Riggs si...@2ndquadrant.com wrote: In PostgreSQL other users cannot observe the commit until an acknowledgement has been received. Really? I hadn't picked up on that. That makes for a lot of complication on crash-and-recovery of a master, but if we can pull it off, that's really cool. If we do that and MySQL doesn't, we definitely don't want to use the same terminology they do, which would imply the same behavior. To be clear: other users cannot observe the commit until standby acknowledges it - unless the master crashes while waiting for the acknowledgment. If that happens, the commit will be visible to everyone after recovery. No, only in the case where you choose not to failover to the standby when you crash, which would be a fairly strange choice after the effort to set up the standby. In a correctly configured and operated cluster what I say above is fully correct and needs no addendum. Except it doesn't work that way. If, say, a backend on the master core dumps, the system will perform a crash and restart cycle, and the transaction will become visible whether it's yet been replicated or not. Since we now have a GUC to suppress restart after a backend crash, it's theoretically possible to set up the system so that this doesn't occur, but it'd take quite a bit of work to make it robust and automatic, and it's certainly not the default out of the box. The fundamental problem here is that once you update CLOG and flush the corresponding WAL record, there is no going backward. You can hold the system in some intermediate state where the transaction still holds locks and is excluded from MVCC snapshots, but there's no way to back up. So there are bound to be corner cases where the wait doesn't last as long as you want, and stuff leaks out around the edges.
It's fundamentally impossible to guarantee that you'll remain in that intermediate state forever - what do you do if a meteor hits the synchronous standby and at the same time you lose power to the master? No amount of configuration will save you from coming back on line with a visible-but-unreplicated transaction. I'm not knocking the system; I think what we have is impressively good. But pretending that corner cases can't happen gets us nowhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
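Robert's "no going backward" point can be captured in a toy model: once the commit record is flushed to durable storage, crash recovery will replay it, so a crash during the sync-rep wait still produces a visible commit. This is a conceptual sketch with invented names, not PostgreSQL code:

```python
# Toy model of the argument above: a commit record flushed to the (durable)
# WAL cannot be taken back; recovery replays it, making the transaction
# visible even if the standby never acknowledged. Illustrative names only.

WAL = []          # durable log: survives a simulated crash
visible = set()   # transactions other sessions can observe

def commit(xid, standby_acks):
    WAL.append(("commit", xid))     # flush the commit record: the point of no return
    if standby_acks:                # normal path: ack received, make it visible
        visible.add(xid)
        return "committed"
    return "waiting"                # intermediate state: durable but hidden

def crash_and_recover():
    # Replay: every flushed commit becomes visible, acknowledged or not.
    for rec, xid in WAL:
        if rec == "commit":
            visible.add(xid)
```

For example, `commit(1, standby_acks=False)` leaves transaction 1 hidden; after `crash_and_recover()` it is visible anyway, which is exactly the corner case Robert describes.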
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
On Fri, Mar 18, 2011 at 11:58 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: On 18.03.2011 17:38, Jeff Davis wrote: On Fri, 2011-03-18 at 10:27 -0400, Robert Haas wrote: ERRCODE_(WARNING_?)REPLICATION_WAIT_CANCELLED ...which might have something to recommend it. Works for me. Yes, sounds reasonable. Without WARNING_, please. The reason I included WARNING is because warnings have their own section in errcodes.txt, and each errcode is marked E for error or W for warning. Since we CAN'T actually error out here, I thought it might be more appropriate to make this a warning; and all of the existing such codes contain WARNING. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pgindent (was Re: [HACKERS] Header comments in the recently added files)
On Thu, Mar 10, 2011 at 11:25 AM, Robert Haas robertmh...@gmail.com wrote: On Thu, Mar 10, 2011 at 10:59 AM, Tom Lane t...@sss.pgh.pa.us wrote: Robert Haas robertmh...@gmail.com writes: Speaking of running scripts, I think we should run pgindent now. Yeah, +1 for doing it as soon as Tom is at a good stopping point. It makes things a lot simpler later on. IIRC the argument for an early pgindent run was to standardize the new code for easier review. I expect to be spending a whole lot of time reading collate and SSI code over the next few weeks, so I'm in favor of pgindent'ing that stuff first. But I guess we need the typedef list update before anything can happen. That's one good reason. Another is that this is presumably the time of the cycle when there are the fewest outstanding patches, making it a good time for changes that are likely to conflict with lots of other things. At any rate, it sounds like Andrew needs a few days to get the typedef list together, so let's wait for that to happen and then we'll see where we are. Andrew, any update on this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Robert Haas robertmh...@gmail.com wrote: Simon Riggs si...@2ndquadrant.com wrote: No, only in the case where you choose not to failover to the standby when you crash, which would be a fairly strange choice after the effort to set up the standby. In a correctly configured and operated cluster what I say above is fully correct and needs no addendum. what do you do if a meteor hits the synchronous standby and at the same time you lose power to the master? No amount of configuration will save you from coming back on line with a visible-but-unreplicated transaction. You don't even need to postulate an extreme condition like that; we prefer to have a DBA pull the trigger on a failover, rather than trust the STONITH call to software. This is particularly true when the master is local to its primary users and the replica is remote to them. -Kevin
Re: [HACKERS] 2nd Level Buffer Cache
Excerpts from rsmogura's message of Fri Mar 18 11:57:48 -0300 2011: Actually the idea of this patch was like this: some operations require many buffers; PG uses the clock sweep to get the next free buffer, so it may overwrite an index buffer. From the point of view of good database design we should use indexes, so purging an index out of the cache will affect performance. The BufferAccessStrategy stuff was written to solve this problem. As a side effect I saw that this 2nd level keeps pg_* indexes in memory too, so I am thinking of including a 3rd level cache for some pg_* tables. Keep in mind that there's already another layer of caching (see syscache.c) for system catalogs on top of the buffer cache. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
On Thu, 2011-03-17 at 09:33 -0400, Robert Haas wrote: Thanks for the review! Let's have a look here... You've added a test inside the lock to see if there is a standby, which I took out for performance reasons. Maybe there's another way; I know that code is fiddly. You've also added back in the lock acquisition at wakeup with very little justification, which was a major performance hit. Together that's about a 20% hit in performance in Yeb's tests. I think you should spend a little time thinking how to retune that. I see handling added for ProcDiePending and QueryCancelPending directly into syncrep.c without any comments in postgres.c to indicate that you bypass ProcessInterrupts() in some cases. That looks pretty hokey to me. SyncRepUpdateSyncStandbysDefined() is added into walwriter, which means waiters won't be released if we do a sighup during a fast shutdown, since the walwriter gets killed as soon as that starts. I'm thinking bgwriter should handle that now. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services
[HACKERS] pg_ctl restart - behaviour based on wrong instance
I am not sure the following pg_ctl behaviour is really a bug, but I find it unexpected enough to report. I was testing synchronous replication in a test setup on a single machine. (After all, one could have different instances on different arrays, right? If you think this is an unlikely use-case, perhaps the following is not important.) There are two installations of 9.1devel (git as of today): primary: /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1 standby: /var/data1/pg_stuff/pg_installations/pgsql.vanilla_2 The standby's data_directory is generated by pg_basebackup from vanilla_1. The problem is the very first run of pg_ctl restart: pg_ctl first correctly decides that the standby instance (=vanilla_2) isn't yet running: pg_ctl: PID file /var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data/postmaster.pid does not exist This is OK and expected. But then it continues (in the logfile) with: FATAL: lock file postmaster.pid already exists HINT: Is another postmaster (PID 20519) running in data directory /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data? So, complaints about the *other* instance. It doesn't happen once a successful start (with pg_ctl start) has happened. It starts fine when started right away with 'start' instead of 'restart'. Also, if it has been started once, it will react to 'pg_ctl restart' without the errors. I'll attach a shell-script, that provokes the error, see the 'restart' on the line with the comment: 'HERE' It would seem (see below) that pg_ctl's final decision about the standby, (that is has started up) is wrong; the standby does *not* eventually start. Below the output of the attached shell script. (careful - it deletes stuff) (It still contains some debug lines, but I didn't want to change it too much.) 
$ clear; ./split_vanilla.sh PGPASSFILE=/home/rijkers/.pg_rijkers waiting for server to shut down done server stopped waiting for server to shut down done server stopped waiting for server to start done server started removed `/var/data1/pg_stuff/archive_dir/00010018' removed `/var/data1/pg_stuff/archive_dir/00010019' removed `/var/data1/pg_stuff/archive_dir/00010019.0020.backup' removed `/var/data1/pg_stuff/archive_dir/0001001A' /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/pg_basebackup NOTICE: pg_stop_backup complete, all required WAL segments have been archived BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin PGPORT=6564 PGPASSFILE=/home/rijkers/.pg_rijkers PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/pg_ctl waiting for server to shut down done server stopped waiting for server to start done server started UIDPID PPID C STIME TTY STAT TIME CMD rijkers 20519 1 20 17:19 pts/25 S+ 0:00 /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/bin/postgres -D /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data rijkers 20521 20519 0 17:19 ?Ss 0:00 \_ postgres: writer process rijkers 20522 20519 0 17:19 ?Ss 0:00 \_ postgres: wal writer process rijkers 20523 20519 0 17:19 ?Ss 0:00 \_ postgres: autovacuum launcher process rijkers 20524 20519 0 17:19 ?Ss 0:00 \_ postgres: archiver process rijkers 20525 20519 0 17:19 ?Ss 0:00 \_ postgres: stats collector process BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/bin PGPORT=6664 PGPASSFILE=/home/rijkers/.pg_rijkers PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data /var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/bin/pg_ctl pg_ctl: PID file /var/data1/pg_stuff/pg_installations/pgsql.vanilla_2/data/postmaster.pid does not exist Is server running? starting server anyway waiting for server to start... 
done server started -- logfile 1: LOG: database system is shut down LOG: database system was shut down at 2011-03-18 17:19:54 CET LOG: autovacuum launcher started LOG: database system is ready to accept connections -- logfile 2: LOG: shutting down LOG: database system is shut down FATAL: lock file postmaster.pid already exists HINT: Is another postmaster (PID 20519) running in data directory /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data? thanks, Erik Rijkers split_vanilla.sh Description: Binary data
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
On Fri, Mar 18, 2011 at 1:15 PM, Simon Riggs si...@2ndquadrant.com wrote: On Thu, 2011-03-17 at 09:33 -0400, Robert Haas wrote: Thanks for the review! Let's have a look here... You've added a test inside the lock to see if there is a standby, which I took out for performance reasons. Maybe there's another way, I know that code is fiddly. You've also added back in the lock acquisition at wakeup with very little justification, which was a major performance hit. Together that's about a 20% hit in performance in Yeb's tests. I think you should spend a little time thinking how to retune that. Ouch. Do you have a link that describes his testing methodology? I will look at it. I see handling added for ProcDiePending and QueryCancelPending directly into syncrep.c without any comments in postgres.c to indicate that you bypass ProcessInterrupts() in some cases. That looks pretty hokey to me. I can add some comments. Unfortunately, it's not feasible to call ProcessInterrupts() directly from this point in the code - it causes a database panic. SyncRepUpdateSyncStandbysDefined() is added into walwriter, which means waiters won't be released if we do a sighup during a fast shutdown, since the walwriter gets killed as soon as that starts. I'm thinking bgwriter should handle that now. Hmm. I was thinking that doing it in WAL writer would make it more responsive, but since this is a fairly unlikely scenario, it's probably not worth complicating the shutdown sequence to do it the way I did. I'll move it to bgwriter. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 4:33 PM, Robert Haas robertmh...@gmail.com wrote: The fundamental problem here is that once you update CLOG and flush the corresponding WAL record, there is no going backward. You can hold the system in some intermediate state where the transaction still holds locks and is excluded from MVCC snapshots, but there's no way to back up. So there are bound to be corner cases where the wait doesn't last as long as you want, and stuff leaks out around the edges. I'm finding this whole idea of hiding the committed transaction until the slave acks it kind of strange. It means there are times when the slave is actually *ahead* of the master, which would actually be kind of hard to code against if you're trying to use the slave as a possibly-not-up-to-date mirror. I think promising that the COMMIT doesn't return until the transaction and all previous transactions are replicated is enough. We don't have to promise that nobody else will see it either. Those same transactions eventually have to commit as well, and if they want that level of protection they can block waiting until they're replicated as well, which will imply that anything they depended on will be replicated. This is akin to the synchronous_commit=off case where other transactions can see your data as soon as you commit even before the xlog is fsynced. If you have synchronous_commit mode enabled then you'll block until your xlog is fsynced and that will implicitly mean the other transactions you saw were also fsynced. -- greg
Re: [HACKERS] SSI bug?
YAMAMOTO Takashi y...@mwd.biglobe.ne.jp wrote: thanks for quickly fixing problems. Thanks for the rigorous testing. :-) i tested the later version (a2eb9e0c08ee73208b5419f5a53a6eba55809b92) and only errors i got was out of shared memory. i'm not sure if it was caused by SSI activities or not. PG_DIAG_SEVERITY: WARNING PG_DIAG_SQLSTATE: 53200 PG_DIAG_MESSAGE_PRIMARY: out of shared memory PG_DIAG_SOURCE_FILE: shmem.c PG_DIAG_SOURCE_LINE: 190 PG_DIAG_SOURCE_FUNCTION: ShmemAlloc PG_DIAG_SEVERITY: ERROR PG_DIAG_SQLSTATE: 53200 PG_DIAG_MESSAGE_PRIMARY: out of shared memory PG_DIAG_SOURCE_FILE: dynahash.c PG_DIAG_SOURCE_LINE: 925 PG_DIAG_SOURCE_FUNCTION: hash_search_with_hash_value Nor am I. Some additional information would help. (1) Could you post the non-default configuration settings? (2) How many connections are in use in your testing? (3) Can you give a rough categorization of how many of what types of transactions are in the mix? (4) Are there any long-running transactions? (5) How many of these errors do you get in what amount of time? (6) Does the application continue to run relatively sanely, or does it fall over at this point? (7) The message hint would help pin it down, or a stack trace at the point of the error would help more. Is it possible to get either? Looking over the code, it appears that the only places that SSI could generate that error, it would cancel that transaction with the hint You might need to increase max_pred_locks_per_transaction. and otherwise allow normal processing. Even with the above information it may be far from clear where allocations are going past their maximum, since one HTAB could grab more than its share and starve another which is staying below its maximum. I'll take a look at the possibility of adding a warning or some such when an HTAB expands past its maximum size. 
-Kevin
Re: [HACKERS] Japanese developers?
All, I've heard from JPUG and all directors are OK. Not sure about all members, though. All staff of SRA are also OK. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] SSI bug?
It would probably also be worth monitoring the size of pg_locks to see how many predicate locks are being held. On Fri, Mar 18, 2011 at 12:50:16PM -0500, Kevin Grittner wrote: Even with the above information it may be far from clear where allocations are going past their maximum, since one HTAB could grab more than its share and starve another which is staying below its maximum. I'll take a look at the possibility of adding a warning or some such when an HTAB expands past its maximum size. Yes -- considering how few shared memory HTABs have sizes that are really dynamic, I'd be inclined to take a close look at SSI and max_pred_locks_per_transaction regardless of where the failed allocation took place. But I am surprised to see that error message without SSI's hint about increasing max_pred_locks_per_transaction. Dan -- Dan R. K. Ports MIT CSAIL http://drkp.net/
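Dan's suggestion can be done through the pg_locks view, where SSI's SIREAD predicate locks appear with lock mode SIReadLock. A rough sketch of such a monitoring query (assuming the 9.1-era pg_locks columns; adjust to taste):

```sql
-- Count SSI predicate (SIREAD) locks by granularity, to watch growth
-- against max_pred_locks_per_transaction. Requires a live server session.
SELECT locktype, count(*)
  FROM pg_locks
 WHERE mode = 'SIReadLock'
 GROUP BY locktype;
```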
Re: [HACKERS] 2nd Level Buffer Cache
On Mar 18, 2011, at 11:19 AM, Robert Haas wrote: On Fri, Mar 18, 2011 at 11:14 AM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: A related area that could use some looking at is why performance tops out at shared_buffers ~8GB and starts to fall thereafter. InnoDB can apparently handle much larger buffer pools without a performance drop-off. There are some advantages to our reliance on the OS buffer cache, to be sure, but as RAM continues to grow this might start to get annoying. On a 4GB system you might have shared_buffers set to 25% of memory, but on a 64GB system it'll be a smaller percentage, and as memory capacities continue to climb it'll be smaller still. Unfortunately I don't have the hardware to investigate this, but it's worth thinking about, especially if we're thinking of doing things that add more caching. +1 To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here... -- Jim C. Nasby, Database Architect j...@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
Excerpts from Robert Haas's message of vie mar 18 14:25:16 -0300 2011: On Fri, Mar 18, 2011 at 1:15 PM, Simon Riggs si...@2ndquadrant.com wrote: SyncRepUpdateSyncStandbysDefined() is added into walwriter, which means waiters won't be released if we do a sighup during a fast shutdown, since the walwriter gets killed as soon as that starts. I'm thinking bgwriter should handle that now. Hmm. I was thinking that doing it in WAL writer would make it more responsive, but since this is a fairly unlikely scenario, it's probably not worth complicating the shutdown sequence to do it the way I did. I'll move it to bgwriter. Can't they both do it? -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On 03/18/2011 06:35 PM, Greg Stark wrote: I think promising that the COMMIT doesn't return until the transaction and all previous transactions are replicated is enough. We don't have to promise that nobody else will see it either. Those same transactions eventually have to commit as well No, they don't have to. They can ROLLBACK, get aborted, lose connection to the master, etc.. The issue here is that, given the MySQL scheme, these transactions see a snapshot that's not durable, because at that point in time, no standby guarantees to have stored the transaction to be committed, yet. So in case of a failover, you'd suddenly see a different snapshot (and lose changes of that transaction). This is akin to the synchronous_commit=off case where other transactions can see your data as soon as you commit even before the xlog is fsynced. If you have synchronous_commit mode enabled then you'll block until your xlog is fsynced and that will implicitly mean the other transactions you saw were also fsynced. Somewhat, yes. And for exactly that reason, most users run with synchronous_commit enabled. They don't want to lose committed transactions. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Simon, On 03/18/2011 05:19 PM, Simon Riggs wrote: Simon Riggs si...@2ndquadrant.com wrote: In PostgreSQL other users cannot observe the commit until an acknowledgement has been received. On other nodes as well? To me that means the standby needs to hold back COMMIT of an ACKed transaction, until it receives a re-ACK from the master, that it committed the transaction there. How else could the slave know when to commit its ACKed transactions? No, only in the case where you choose not to failover to the standby when you crash, which would be a fairly strange choice after the effort to set up the standby. In a correctly configured and operated cluster what I say above is fully correct and needs no addendum. If you don't failover, how can the standby be ahead of the master, given it takes measures not to be during normal operation? Eager to understand... ;-) Regards Markus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On 03/18/2011 05:27 PM, Kevin Grittner wrote: Basically, what Heikki addresses. It has to be committed after crash and recovery, and deal with replicas which may or may not have been notified and may or may not have applied the transaction. Huh? I'm not quite following here. Committing additional transactions isn't a problem, reverting committed transactions is. And yes, given that we only wait for ACK from a single standby, you'd have to failover to exactly *that* standby to guarantee consistency. In fact, on further reflection, allowing other transactions to see work before the committing transaction returns could lead to broken behavior if that viewing transaction took some action based on that, the master crashed, recovery was done using a standby, and that standby hadn't persisted the transaction. So this behavior is necessary for good behavior. I fully agree with that. Regards Markus -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote: Simon, On 03/18/2011 05:19 PM, Simon Riggs wrote: Simon Riggssi...@2ndquadrant.com wrote: In PostgreSQL other users cannot observe the commit until an acknowledgement has been received. On other nodes as well? To me that means the standby needs to hold back COMMIT of an ACKed transaction, until receives a re-ACK from the master, that it committed the transaction there. How else could the slave know when to commit its ACKed transactions? We could do that easily enough, actually, if we wished. Do we wish? No, only in the case where you choose not to failover to the standby when you crash, which would be a fairly strange choice after the effort to set up the standby. In a correctly configured and operated cluster what I say above is fully correct and needs no addendum. If you don't failover, how can the standby be ahead of the master, given it takes measures not to be during normal operation? Eager to understand... ;-) Regards Markus -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Simon Riggs si...@2ndquadrant.com wrote: On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote: Simon Riggssi...@2ndquadrant.com wrote: In PostgreSQL other users cannot observe the commit until an acknowledgement has been received. On other nodes as well? To me that means the standby needs to hold back COMMIT of an ACKed transaction, until receives a re-ACK from the master, that it committed the transaction there. How else could the slave know when to commit its ACKed transactions? We could do that easily enough, actually, if we wished. Do we wish? +1 If we're going out of our way to suppress it on the master until the COMMIT returns, it shouldn't be showing on the replicas before that. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On 03/18/2011 08:29 PM, Simon Riggs wrote: We could do that easily enough, actually, if we wished. Do we wish? I personally don't see any problem letting a standby show a snapshot before the master. I'd consider it unneeded network traffic. But then again, I'm completely biased. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SSI bug?
Dan Ports d...@csail.mit.edu wrote: I am surprised to see that error message without SSI's hint about increasing max_predicate_locks_per_xact. After reviewing this, I think something along the following lines might be needed, for a start. I'm not sure the Asserts are actually needed; they basically are checking that the current behavior of hash_search doesn't change. I'm still looking at whether it's sane to try to issue a warning when an HTAB exceeds the number of entries declared as its max_size when it was created. -Kevin

--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1604,12 +1604,7 @@ RegisterPredicateLockingXid(const TransactionId xid)
 	sxid = (SERIALIZABLEXID *) hash_search(SerializableXidHash,
 										   &sxidtag,
 										   HASH_ENTER, &found);
-	if (!sxid)
-		/* This should not be possible, based on allocation. */
-		ereport(ERROR,
-				(errcode(ERRCODE_OUT_OF_MEMORY),
-				 errmsg("out of shared memory")));
-
+	Assert(sxid != NULL);
 	Assert(!found);
 
 	/* Initialize the structure. */
@@ -2046,7 +2041,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
 	target = (PREDICATELOCKTARGET *)
 		hash_search_with_hash_value(PredicateLockTargetHash,
 									targettag, targettaghash,
-									HASH_ENTER, &found);
+									HASH_ENTER_NULL, &found);
 	if (!target)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -2061,7 +2056,7 @@ CreatePredicateLock(const PREDICATELOCKTARGETTAG *targettag,
 	lock = (PREDICATELOCK *)
 		hash_search_with_hash_value(PredicateLockHash, &locktag,
 			PredicateLockHashCodeFromTargetHashCode(&locktag, targettaghash),
-									HASH_ENTER, &found);
+									HASH_ENTER_NULL, &found);
 	if (!lock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -3252,7 +3247,7 @@ ReleaseOneSerializableXact(SERIALIZABLEXACT *sxact, bool partial,
 			predlock = hash_search_with_hash_value(PredicateLockHash,
 												   &tag,
 			PredicateLockHashCodeFromTargetHashCode(&tag, targettaghash),
-												   HASH_ENTER, &found);
+												   HASH_ENTER_NULL, &found);
 			if (!predlock)
 				ereport(ERROR,
 						(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -4279,10 +4274,7 @@ predicatelock_twophase_recover(TransactionId xid, uint16 info,
 		sxid = (SERIALIZABLEXID *) hash_search(SerializableXidHash,
 											   &sxidtag,
 											   HASH_ENTER, &found);
-		if (!sxid)
-			ereport(ERROR,
-					(errcode(ERRCODE_OUT_OF_MEMORY),
-					 errmsg("out of shared memory")));
+		Assert(sxid != NULL);
 		Assert(!found);
 
 		sxid->myXact = (SERIALIZABLEXACT *) sxact;
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] sync rep fsync=off
While investigating Simon's complaint about my patch of a few days ago, I discovered that synchronous replication appears to slow to a crawl if fsync is turned off on the standby. I'm not sure why this is happening or what the right behavior is in this case, but I think some kind of adjustment is needed because the current behavior is quite surprising. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 2nd Level Buffer Cache
On Fri, Mar 18, 2011 at 2:15 PM, Jim Nasby j...@nasby.net wrote: +1 To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here... The problem is that the OS doesn't offer any mechanism that would allow us to obey the WAL-before-data rule. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 3:29 PM, Simon Riggs si...@2ndquadrant.com wrote: On Fri, 2011-03-18 at 20:19 +0100, Markus Wanner wrote: Simon, On 03/18/2011 05:19 PM, Simon Riggs wrote: Simon Riggs si...@2ndquadrant.com wrote: In PostgreSQL other users cannot observe the commit until an acknowledgement has been received. On other nodes as well? To me that means the standby needs to hold back COMMIT of an ACKed transaction, until it receives a re-ACK from the master, that it committed the transaction there. How else could the slave know when to commit its ACKed transactions? We could do that easily enough, actually, if we wished. Do we wish? Seems like it would be nice, but isn't it dreadfully expensive? Wouldn't you need to prevent the slave from applying the WAL until the master has released the sync rep waiters? You'd need a whole new series of messages back and forth. Since the current solution is intended to support data-loss-free failover, but NOT to guarantee a consistent view of the world from a SQL level, I doubt it's worth paying any price for this. Certainly in the hot_standby=off case it's a nonissue. We might need to think harder about it when and if someone implements an 'apply' level though, because this would seem more of a concern in that case (though I haven't thought through all the details). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Robert Haas robertmh...@gmail.com wrote: Since the current solution is intended to support data-loss-free failover, but NOT to guarantee a consistent view of the world from a SQL level, I doubt it's worth paying any price for this. Well, that brings us back to the question of why we would want to suppress the view of the data on the master until the replica acknowledges the commit. It *is* committed on the master, we're just holding off on telling the committer about it until we can honor the guarantee of replication. If it can be seen on the replica before the committer get such acknowledgment, why not on the master? -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, 2011-03-18 at 17:08 -0400, Aidan Van Dyk wrote: On Fri, Mar 18, 2011 at 3:41 PM, Markus Wanner mar...@bluegap.ch wrote: On 03/18/2011 08:29 PM, Simon Riggs wrote: We could do that easily enough, actually, if we wished. Do we wish? I personally don't see any problem letting a standby show a snapshot before the master. I'd consider it unneeded network traffic. But then again, I'm completely biased. In fact, we *need* to have standbys show a snapshot before the master. By the time the master acks the commit to the client, the snapshot must be visible to all clients connected to both the master and the synchronous slave. Even with just a single-server postgresql cluster, other clients (backends) can see the commit before the committing client receives the ACK. Just that on a single server, the time period for that is small. Sync rep increases that time period by the length of time from when the slave reaches the commit point in the WAL stream to when its ack of that point gets back to the wal sender. Ideally, that ACK time is small. Adding another round trip in there just for a "go almost to $COMMIT; ok, now go to $COMMIT" type of WAL/ack is going to be pessimal for performance, and still not improve the *guarantees* it can make. It can only slightly reduce, but not eliminate, the window where the master has WAL that the slave doesn't, and without a complete elimination (where you just switch the problem to be the slave has the data that the master doesn't), you haven't changed any of the guarantees sync rep can make (or not). Well explained observation. Agreed. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, 2011-03-18 at 16:24 -0500, Kevin Grittner wrote: Robert Haas robertmh...@gmail.com wrote: Since the current solution is intended to support data-loss-free failover, but NOT to guarantee a consistent view of the world from a SQL level, I doubt it's worth paying any price for this. Well, that brings us back to the question of why we would want to suppress the view of the data on the master until the replica acknowledges the commit. It *is* committed on the master, we're just holding off on telling the committer about it until we can honor the guarantee of replication. If it can be seen on the replica before the committer get such acknowledgment, why not on the master? I think the issue is explicit acknowledgement, not visibility. -- Simon Riggs http://www.2ndQuadrant.com/books/ PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
On Fri, Mar 18, 2011 at 2:55 PM, Alvaro Herrera alvhe...@commandprompt.com wrote: Excerpts from Robert Haas's message of vie mar 18 14:25:16 -0300 2011: On Fri, Mar 18, 2011 at 1:15 PM, Simon Riggs si...@2ndquadrant.com wrote: SyncRepUpdateSyncStandbysDefined() is added into walwriter, which means waiters won't be released if we do a sighup during a fast shutdown, since the walwriter gets killed as soon as that starts. I'm thinking bgwriter should handle that now. Hmm. I was thinking that doing it in WAL writer would make it more responsive, but since this is a fairly unlikely scenario, it's probably not worth complicating the shutdown sequence to do it the way I did. I'll move it to bgwriter. Can't they both do it? Yeah, but it seems fairly pointless. In retrospect, I probably should have done it the way Simon is proposing to begin with. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 5:24 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Robert Haas robertmh...@gmail.com wrote: Since the current solution is intended to support data-loss-free failover, but NOT to guarantee a consistent view of the world from a SQL level, I doubt it's worth paying any price for this. Well, that brings us back to the question of why we would want to suppress the view of the data on the master until the replica acknowledges the commit. It *is* committed on the master, we're just holding off on telling the committer about it until we can honor the guarantee of replication. If it can be seen on the replica before the committer get such acknowledgment, why not on the master? Well, the idea is that we don't want to let people depend on the value until it's guaranteed to be durably committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Robert Haas robertmh...@gmail.com wrote: Well, the idea is that we don't want to let people depend on the value until it's guaranteed to be durably committed. OK, so if you see it on the replica, you know it is in at least two places. I guess that makes sense. It kinda feels wrong to see a view of the replica which is ahead of the master, but I guess it's the least of the evils. I guess we should document it, though, so nobody has a false expectation that seeing something on the replica means that a connection looking at the master will see something that current. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] SSI bug?
Kevin Grittner kevin.gritt...@wicourts.gov wrote: I'm still looking at whether it's sane to try to issue a warning when an HTAB exceeds the number of entries declared as its max_size when it was created. I think this does it. If nothing else, it might be instructive to use it while testing the SSI patch. Would it make any sense to slip this into 9.1, or should I add it to the first 9.2 CF? -Kevin

*** a/src/backend/storage/ipc/shmem.c
--- b/src/backend/storage/ipc/shmem.c
*************** ShmemInitHash(const char *name, /* table string name for shmem index */
*** 268,273 ****
--- 268,274 ----
  	 *
  	 * The shared memory allocator must be specified too.
  	 */
+ 	infoP->max_size = max_size;
  	infoP->dsize = infoP->max_dsize = hash_select_dirsize(max_size);
  	infoP->alloc = ShmemAlloc;
  	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
*** a/src/backend/utils/hash/dynahash.c
--- b/src/backend/utils/hash/dynahash.c
*************** struct HASHHDR
*** 129,134 ****
--- 129,135 ----
  	long		ffactor;		/* target fill factor */
  	long		max_dsize;		/* 'dsize' limit if directory is fixed size */
  	long		ssize;			/* segment size --- must be power of 2 */
+ 	long		max_size;		/* maximum number of entries expected */
  	int			sshift;			/* segment shift = log2(ssize) */
  	int			nelem_alloc;	/* number of entries to allocate at once */
*************** hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
*** 368,373 ****
--- 369,375 ----
  	hdefault(hashp);
  	hctl = hashp->hctl;
+ 	hctl->max_size = info->max_size;
  	if (flags & HASH_PARTITION)
  	{
*************** element_alloc(HTAB *hashp, int nelem)
*** 1333,1338 ****
--- 1335,1341 ----
  	HASHELEMENT *tmpElement;
  	HASHELEMENT *prevElement;
  	int			i;
+ 	bool		warningNeeded;
 
  	/* Each element has a HASHELEMENT header plus user data. */
  	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctlv->entrysize);
*************** element_alloc(HTAB *hashp, int nelem)
*** 1360,1369 ****
--- 1363,1378 ----
  	/* freelist could be nonempty if two backends did this concurrently */
  	firstElement->link = hctlv->freeList;
  	hctlv->freeList = prevElement;
+ 	warningNeeded = (hctlv->max_size > 0 && hctlv->nentries == hctlv->max_size);
  	if (IS_PARTITIONED(hctlv))
  		SpinLockRelease(&hctlv->mutex);
+ 	if (warningNeeded)
+ 		ereport(WARNING,
+ 				(errmsg("hash table \"%s\" has more entries than expected", hashp->tabname),
+ 				 errdetail("The maximum was set to %li on creation.", hctlv->max_size)));
+ 
  	return true;
  }
*** a/src/include/utils/hsearch.h
--- b/src/include/utils/hsearch.h
*************** typedef struct HASHCTL
*** 69,74 ****
--- 69,75 ----
  	long		dsize;			/* (initial) directory size */
  	long		max_dsize;		/* limit to dsize if dir size is limited */
  	long		ffactor;		/* fill factor */
+ 	long		max_size;		/* maximum number of entries expected */
  	Size		keysize;		/* hash key length in bytes */
  	Size		entrysize;		/* total user element size in bytes */
  	HashValueFunc hash;			/* hash function */
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Sync Rep and shutdown Re: [HACKERS] Sync Rep v19
Responding to this again, somewhat out of order... On Fri, Mar 18, 2011 at 1:15 PM, Simon Riggs si...@2ndquadrant.com wrote: Together that's about a 20% hit in performance in Yeb's tests. I think you should spend a little time thinking how to retune that. I've spent some time playing around with pgbench and so far I haven't been able to reliably reproduce this, which is not to say I don't believe the effect is real, but rather that either I'm doing something completely wrong, or it requires some specific setup to measure that doesn't match my environment, or that it's somewhat finicky to reproduce, or some combination of the above. You've added a test inside the lock to see if there is a standby, which I took out for performance reasons. Maybe there's another way, I know that code is fiddly. It seems pretty easy to remove the branch from the test at the top of the function by just rearranging things a bit. Patch attached; does this help? You've also added back in the lock acquisition at wakeup with very little justification, which was a major performance hit. I have a very difficult time believing this is a real problem. That extra lock acquisition and release only happens if WaitLatchOrSocket() returns but MyProc->syncRepState still appears to be SYNC_REP_WAITING. That should only happen if the latch wait hits the timeout (which takes 60 s!) or if the precise memory-ordering problem that the re-acquisition was put in to fix is occurring (in which case it should dramatically *improve* performance, by avoiding an extra 60 s wait). I stuck in a call to elog(LOG, "got here") and it didn't fire even once in a 5-minute pgbench test (~45k transactions). So I have a hard time crediting this for any performance problem.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company sync-standbys-defined-rearrangement.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
On Fri, Mar 18, 2011 at 5:48 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Robert Haas robertmh...@gmail.com wrote: Well, the idea is that we don't want to let people depend on the value until it's guaranteed to be durably committed. OK, so if you see it on the replica, you know it is in at least two places. I guess that makes sense. It kinda feels wrong to see a view of the replica which is ahead of the master, but I guess it's the least of the evils. I guess we should document it, though, so nobody has a false expectation that seeing something on the replica means that a connection looking at the master will see something that current. Yeah, it can go both ways: a snapshot taken on the standby can be either earlier or later in the commit ordering than the master. That's counterintuitive, but I see no reason to stress about it. It's perfectly reasonable to set up a server with synchronous replication for enhanced durability and also enable hot standby just for convenience, but without actually relying on it all that heavily, or only for non-critical reporting purposes. Synchronous replication, like asynchronous replication, is basically a high-availability tool. As long as it does that well, I'm not going to get worked up about the fact that it doesn't address every other use case someone might want. We can always add more frammishes in future releases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] pg_last_xact_replay_timestamp meaning
I just applied a doc patch for pg_last_xact_replay_timestamp, and the text now says: entryGet timestamp of last transaction replayed during recovery. This is the time at which the commit or abort WAL record for that transaction was generated on the primary. If no transactions have been replayed during recovery, this function returns NULL. Otherwise, if recovery is still in progress this will increase monotonically. If recovery has completed then this value will remain static at the value of the last transaction applied during that recovery. When the server has been started normally without recovery the function returns NULL. Is this really the last commit/abort record or the last WAL record? What should it be? Is the name of this function correct? Do we care only about commit/abort records? Why? -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 2nd Level Buffer Cache
Kevin Grittner kevin.gritt...@wicourts.gov Thursday 17 March 2011 22:02:18 Radosław Smogura rsmog...@softperience.eu wrote: I have implemented an initial concept of a 2nd level cache. The idea is to keep some segments of shared memory for special buffers (e.g. indices) to prevent those being overwritten by other operations. I added this functionality to the nbtree index scan. I tested this by doing an index scan, seq read, drop system buffers, index scan again, and in a few places I saw performance improvements, but actually, I'm not sure if this was just random or an intended improvement. I've often wondered about this. In a database I developed back in the '80s it was clearly a win to have a special cache for index entries and other special pages closer to the database than the general cache. A couple things have changed since the '80s (I mean, besides my waistline and hair color), and PostgreSQL has many differences from that other database, so I haven't been sure it would help as much, but I have wondered. I can't really look at this for a couple weeks, but I'm definitely interested. I suggest that you add this to the next CommitFest as a WIP patch, under the Performance category. https://commitfest.postgresql.org/action/commitfest_view/open There are a few places to optimize the code as well, and the patch needs more work, but could you look at it and give opinions? For something like this it makes perfect sense to show proof of concept before trying to cover everything. -Kevin Here I attach the latest version of the patch with a few performance improvements (the code is still dirty) and some reports from tests, as well as my simple test scripts. Actually there is a small improvement without dropping system caches, and a bigger one with dropping. I see a small performance decrease (if we can talk about measuring based on these tests) against the original PG version when dealing with the same configuration, but an increase with 2nd level buffers... or maybe I compared the reports badly. In the tests I tried to choose typical, simple queries.
Regards, Radek 2nd_lvl_cache_20110318.diff.bz2 Description: application/bzip test-scritps_20110319_0026.tar.bz2 Description: application/bzip-compressed-tar reports_20110318.tar.bz2 Description: application/bzip-compressed-tar -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 2nd Level Buffer Cache
On 3/18/11 11:15 AM, Jim Nasby wrote: To take the opposite approach... has anyone looked at having the OS just manage all caching for us? Something like MMAPed shared buffers? Even if we find the issue with large shared buffers, we still can't dedicate serious amounts of memory to them because of work_mem issues. Granted, that's something else on the TODO list, but it really seems like we're re-inventing the wheels that the OS has already created here... As far as I know, no OS has a more sophisticated approach to eviction than LRU. And clock-sweep is a significant improvement on performance over LRU for frequently accessed database objects ... plus our optimizations around not overwriting the whole cache for things like VACUUM. 2-level caches work well for a variety of applications. Now, what would be *really* useful is some way to avoid all the data copying we do between shared_buffers and the FS cache. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Basic Recovery Control functions for use in Hot Standby. Pause,
Fujii Masao masao.fu...@gmail.com writes: On Fri, Mar 18, 2011 at 1:17 AM, Robert Haas robertmh...@gmail.com wrote: Sorry, I've not been able to understand the point well yet. We should just use elog(ERROR) instead? But since ERROR in startup process is treated as FATAL, I'm not sure whether it's worth using ERROR instead. Or you meant another things? Yeah, I think he's saying that an ERROR in the startup process is better than a FATAL, even though the effect is the same. We've already been using FATAL all over the recovery code. We should s/FATAL/ERROR/g there (at least readRecoveryCommandFile)? Possibly, but as you say, it doesn't make that much difference in the startup process. What is bothering me is the prospect of elog(FATAL) in the postmaster. Code associated with GUC validity checking is likely to get executed in the postmaster, which is why it should not throw anything stronger than the normal GUC complaint levels. Even if the patch as proposed is for code that could only be reached in the startup process today, somebody might decide to rearrange it ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [COMMITTERS] pgsql: Efficient transaction-controlled synchronous replication.
Robert Haas robertmh...@gmail.com writes: As a side note, it's not very obvious why some parts of PostmasterMain report problems by doing write_stderr() and exit() while other parts use ereport(ERROR). This check and the nearby checks on WAL level are immediately preceded and followed by other checks that use the opposite technique. This question is answered in postmaster.c's header comment:

 * Error Reporting:
 * Use write_stderr() only for reporting interactive errors
 * (essentially, bogus arguments on the command line).  Once the
 * postmaster is launched, use ereport().  In particular, don't use
 * write_stderr() for anything that occurs after pmdaemonize.

Code that is involved in GUC variable processing is in a gray area, though, since it can be invoked both before and after pmdaemonize. It might be a good idea to convert all the calls into ereports and maintain a state flag in elog.c to determine what to do. regards, tom lane
Re: [HACKERS] pg_ctl restart - behaviour based on wrong instance
On Fri, Mar 18, 2011 at 1:19 PM, Erik Rijkers e...@xs4all.nl wrote: This is OK and expected. But then it continues (in the logfile) with: FATAL: lock file postmaster.pid already exists HINT: Is another postmaster (PID 20519) running in data directory /var/data1/pg_stuff/pg_installations/pgsql.vanilla_1/data? So, complaints about the *other* instance. It doesn't happen once a successful start (with pg_ctl start) has happened. I'm guessing that leftover postmaster.pid contents might be responsible for this? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_last_xact_replay_timestamp meaning
On Fri, Mar 18, 2011 at 7:23 PM, Bruce Momjian br...@momjian.us wrote: I just applied a doc patch for pg_last_xact_replay_timestamp, and the text now says: Get timestamp of last transaction replayed during recovery. This is the time at which the commit or abort WAL record for that transaction was generated on the primary. If no transactions have been replayed during recovery, this function returns NULL. Otherwise, if recovery is still in progress this will increase monotonically. If recovery has completed then this value will remain static at the value of the last transaction applied during that recovery. When the server has been started normally without recovery the function returns NULL. Is this really the last commit/abort record or the last WAL record? What should it be? Is the name of this function correct? Do we care only about commit/abort records? Why? Commit and abort records have a timestamp. Other WAL records don't. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] pg_last_xact_replay_timestamp meaning
On Fri, Mar 18, 2011 at 9:35 PM, Robert Haas robertmh...@gmail.com wrote: On Fri, Mar 18, 2011 at 7:23 PM, Bruce Momjian br...@momjian.us wrote: I just applied a doc patch for pg_last_xact_replay_timestamp, and the text now says: Get timestamp of last transaction replayed during recovery. This is the time at which the commit or abort WAL record for that transaction was generated on the primary. If no transactions have been replayed during recovery, this function returns NULL. Otherwise, if recovery is still in progress this will increase monotonically. If recovery has completed then this value will remain static at the value of the last transaction applied during that recovery. When the server has been started normally without recovery the function returns NULL. Is this really the last commit/abort record or the last WAL record? What should it be? Is the name of this function correct? Do we care only about commit/abort records? Why? Commit and abort records have a timestamp. Other WAL records don't. Incidentally, there's an open item related to this: * pg_last_xact_replay_timestamp limitations linking to http://archives.postgresql.org/message-id/201012071131.55211.gabi.jul...@broadsign.com I'm not sure why this is important enough to be worth being on this list, but... is this resolved now? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Sync Rep v19
On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao masao.fu...@gmail.com wrote: * Smart shutdown Smart shutdown should wait for all the waiting backends to be acked, and should not cause them to forcibly exit. But this leads shutdown to get stuck infinitely if there is no walsender at that time. To enable them to be acked even in that situation, we need to change postmaster so that it accepts the replication connection even during smart shutdown (until we reach PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser connection to cancel backup during smart shutdown. So I don't think that the idea to accept the replication connection during smart shutdown is so ugly. * Fast shutdown I agree with you about fast shutdown. Fast shutdown should cause all the backends including waiting ones to exit immediately. At that time, the non-acked backend should not return the success, according to the definition of sync rep. So we need to change a backend so that it gets rid of itself from the waiting queue and exits before returning the success, when it receives SIGTERM. This change leads the waiting backends to do the same even when pg_terminate_backend is called. But since they've not been acked yet, it seems to be reasonable to prevent them from returning the COMMIT. Comments? I'll create the patch barring objection. The fast smart shutdown part of this problem has been addressed. The smart shutdown case still needs work, and I think the consensus was that your proposal above was the best way to go with it. Do you still want to work up a patch for this? If so, I can review. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Sync Rep v19
On Fri, Mar 18, 2011 at 10:25 PM, Robert Haas robertmh...@gmail.com wrote: On Tue, Mar 8, 2011 at 7:05 AM, Fujii Masao masao.fu...@gmail.com wrote: * Smart shutdown Smart shutdown should wait for all the waiting backends to be acked, and should not cause them to forcibly exit. But this leads shutdown to get stuck infinitely if there is no walsender at that time. To enable them to be acked even in that situation, we need to change postmaster so that it accepts the replication connection even during smart shutdown (until we reach PM_SHUTDOWN_2 state). Postmaster has already accepted the superuser connection to cancel backup during smart shutdown. So I don't think that the idea to accept the replication connection during smart shutdown is so ugly. * Fast shutdown I agree with you about fast shutdown. Fast shutdown should cause all the backends including waiting ones to exit immediately. At that time, the non-acked backend should not return the success, according to the definition of sync rep. So we need to change a backend so that it gets rid of itself from the waiting queue and exits before returning the success, when it receives SIGTERM. This change leads the waiting backends to do the same even when pg_terminate_backend is called. But since they've not been acked yet, it seems to be reasonable to prevent them from returning the COMMIT. Comments? I'll create the patch barring objection. The fast smart shutdown part of this problem has been addressed. The Ugh. I mean the fast shutdown, of course, not the fast smart shutdown. Anyway, point is: fast shutdown now OK; smart shutdown still not OK; do you want to write a patch? :-) smart shutdown case still needs work, and I think the consensus was that your proposal above was the best way to go with it. Do you still want to work up a patch for this? If so, I can review.
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
[HACKERS] Collations versus record-returning functions
I'm making pretty good progress on the task of splitting input and output collations for expression nodes. There remains one case in the regression tests that is giving a non-expected result. It involves this function: CREATE FUNCTION dup (f1 anyelement, f2 out anyelement, f3 out anyarray) AS 'select $1, array[$1,$1]' LANGUAGE sql; which is used like this: SELECT a, (dup(b)).* FROM collate_test3 ORDER BY 2; where the column collate_test3.b is declared with a nondefault collation. In HEAD, the output of dup() inherits that collation from the b Var, and then so does the field selected from it, and so the ORDER BY sorts according to the column collation. With my patch as it stands, you get sorting by the default collation. The reason is that the output of dup() is of type RECORD, which is a noncollatable type according to pg_type, so the function result is labeled noncollatable even though it has collatable inputs. The FieldSelect node then has to choose the default collation since there's nothing else available. ISTM there are basically three things we might do about this: (1) Decide that the patch's behavior is correct and what's embodied in the regression expected file is wrong. (2) Allow collations to propagate up through nodes that deliver noncollatable outputs. (3) Decide that composite types are collatable. I don't much care for #2, mainly because it seems likely to cause a lot of unnecessary collation conflicts. It has a lot of flavor of action at a distance in most cases, too, even though it might seem natural in this particular example. #3 has some appeal but it also seems fraught with definitional issues, for instance: * Something like row('a' collate C, 'b' collate en_US) doesn't have a resolvable collation, but I don't think we want it to throw an error. * If composites are collatable, one would logically expect a table's rowtype to inherit the common collation of its columns (if there is one). 
This is problematic because then the rowtype collation can change after ALTER TABLE ADD COLUMN/DROP COLUMN/ALTER COLUMN TYPE, which is something we can't support even if it seemed like a good idea. Also, I fail to see any support for #3 in the SQL standard. So far as I can find, the standard only considers that string types are collatable. By the same token, it doesn't look to me like the standard is expecting #2 to happen, since only string types can carry collation. Thoughts? regards, tom lane
Re: [HACKERS] [COMMITTERS] pgsql: Document the all-balls IPv6 address.
Robert Haas robertmh...@gmail.com writes: On Fri, Mar 18, 2011 at 10:19 PM, Andrew Dunstan and...@dunslane.net wrote: On 03/18/2011 09:18 PM, Robert Haas wrote: all balls seems like a colloquialism best avoided in our documentation. It's already there, although I agree it's infelicitous. I vote for taking it out. I think that could be interpreted as inappropriate. IIRC, the pre-existing usage refers to time 00:00:00. It does not seem especially useful to adopt the same terminology for network addresses; that's more likely to confuse people than anything else. regards, tom lane
[HACKERS] VACUUM FULL deadlock with backend startup
Hi, We encountered a deadlock involving VACUUM FULL (surprise surprise! :)) in PG 8.3.13 (and still not fixed in 9.0 AFAICS although the window appears much smaller). The call spits out the following deadlock info: ERROR: SQLSTATE 40P01: deadlock detected DETAIL: Process 12479 waits for AccessExclusiveLock on relation 2663 of database 16384; blocked by process 14827. Process 14827 waits for AccessShareLock on relation 1259 of database 16384; blocked by process 12479. LOCATION: DeadLockReport, deadlock.c:918 It looked familiar, so I dug up the archives and found that Tom had committed a fix for a similar deadlock via git commitid: 715120e7 However this current deadlock involved an index with oid 2663, which is ClassNameNspIndexId. Clearly this was another case of locking the index directly without taking a lock on the parent catalog. Further sleuthing revealed that the culprit function was InitCatCachePhase2, which directly calls index_open in the process startup phase. Reproducing this was easy once you know the culprit, (excruciatingly difficult if you do not know the exact race window). I added a sleep inside the InitCatCachePhase2 function before calling index_open. Then I invoked a VACUUM FULL pg_class from another session, halting it in gdb just before taking the exclusive lock via try_relation_open. When a new PG process sleeps inside InitCatCachePhase2, we then take the lock in the VF process, waiting just after it. When the startup continues after the sleep, it will take the ClassNameNspIndexId share lock, but hang to take a share lock on pg_class in RelationReloadIndexInfo. Simply continue the VF process in gdb which will try to take the exclusive lock to vacuum the index. This will reproduce the deadlock in all its glory. The fix is similar to the earlier commit by Tom. I tested this fix against 8.3.13. We lock the parent catalog now before calling index_open. Patch against git HEAD attached with this mail. I guess we will backpatch this? 
Tom's last commit was backpatched till 8.2 I think.

Regards,
Nikhils

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index d0e364e..c9386aa 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -26,6 +26,7 @@
 #ifdef CATCACHE_STATS
 #include "storage/ipc.h"		/* for on_proc_exit */
 #endif
+#include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
 #include "utils/inval.h"
@@ -967,8 +968,16 @@ InitCatCachePhase2(CatCache *cache, bool touch_index)
 	{
 		Relation	idesc;

+		/*
+		 * We must lock the underlying catalog before locking the index to
+		 * avoid deadlock, since RelationReloadIndexInfo might well need to
+		 * read the catalog, and if anyone else is exclusive-locking this
+		 * catalog and index they'll be doing it in that order.
+		 */
+		LockRelationOid(cache->cc_reloid, AccessShareLock);
 		idesc = index_open(cache->cc_indexoid, AccessShareLock);
 		index_close(idesc, AccessShareLock);
+		UnlockRelationOid(cache->cc_reloid, AccessShareLock);
 	}
 }
Re: [HACKERS] I am confused after reading codes of PostgreSQL three week
Hello hom, Frankly I am a learner as well. The experts here are almost always ready to help and would be a better source of information. Moreover, I am also using Eclipse, but I do not use it for building the source. I use it only as a source code browser (it's easy in a GUI, isn't it?). I am trying to learn about the executor so I can't say much about the parser. However, I suppose that you must know the rules of the tools flex and bison to understand the parser. And why are you into scan.c? It is created by flex, dear. Read scan.l and gram.y instead. It is these files which are responsible for the major work done by the parser. If you are keen about the parser, go learn lex and yacc (or flex and bison ... they are almost the same) and then go through the scan.l and gram.y files. It is actually an _extremely_ tough job to read the generated files. Once again, do turn off the Search for duplicate source files option. There are no duplicate files in the source tree. Also, if you are using the copy of the source tree which was built once in the workspace, things can be a little different. @others: Well, I do know that there are a few books in the market written by the devs, but how much does it help when I have already been banging my head against the source for the last 5 months? Regards, Vaibhav

On Fri, 2011-03-18 at 22:44 +0800, hom wrote: 2011/3/18 Vaibhav Kaushal vaibhavkaushal...@gmail.com: Hi, That was the question I was facing 5 months ago and trust me I am doing it even now. With an average of 6+ hours going into PostgreSQL code, even with best practices (as suggested by the developers) I still think I know less than 10 percent. It is too huge to be swallowed at once. I too had to break it down into pieces and because everything is so interconnected with everything else, it is quite complicated in the beginning. Start with one piece: planner, parser, executor, storage management, whatever, and slowly it should help you get the bigger picture. regards, Vaibhav

Thanks Vaibhav. I have stepped into the parser before, but I ran into a problem: when I debug and step into scanner_init(), Eclipse always finds scan.l, and the execution order does not match the file. I think it should actually be scan.c, but I don't know how to trace into scan.c :( PS: I have turned the Search for duplicate source files option on. I have posted to the mailing list, but it has not been solved. Here is the link: http://postgresql.1045698.n5.nabble.com/Open-unmatch-source-file-when-step-into-parse-analyze-in-Eclipse-td3408033.html
[HACKERS] tolower() identifier downcasing versus multibyte encodings
I've been able to reproduce the behavior described here: http://archives.postgresql.org/pgsql-general/2011-03/msg00538.php It's specific to UTF8 locales on Mac OS X. I'm not sure if the problem can manifest anywhere else; considering that OS X's UTF8 locales have a general reputation of being broken, it may only happen on that platform. What is happening is that downcase_truncate_identifier() tries to downcase identifiers like this:

    unsigned char ch = (unsigned char) ident[i];

    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    else if (IS_HIGHBIT_SET(ch) && isupper(ch))
        ch = tolower(ch);
    result[i] = (char) ch;

This is of course incapable of successfully downcasing any multibyte characters, but there's an assumption that isupper() won't return TRUE for a character fragment in a multibyte locale. However, on OS X it seems that that's not the case :-(. For the particular example cited by Francisco Figueiredo, I see the byte sequence \303\251 converted to \343\251, because isupper() returns TRUE for \303 and then tolower() returns \343. The byte \251 is not changed, but the damage is already done: we now have an invalidly-encoded string. It looks like the blame for the subsequent disappearance of the bogus data lies with fprintf back on the client side; that surprises me a bit because I'd only heard of glibc being so cavalier with data it thought was invalidly encoded. But anyway, the origin of the problem is in the downcasing transformation. We could possibly fix this by not attempting the downcasing transformation on high-bit-set characters unless the encoding is single-byte. However, we have the exact same downcasing logic embedded in the functions in src/port/pgstrcasecmp.c, and those don't have any convenient way of knowing what the prevailing encoding is --- when compiled for frontend use, they can't use pg_database_encoding_max_length.
Or we could bite the bullet and start using str_tolower(), but the performance implications of that are unpleasant; not to mention that we really don't want to re-introduce the Turkish problem with unexpected handling of i/I in identifiers. Or we could go the other way and stop downcasing non-ASCII letters altogether. None of these options seem terribly attractive. Thoughts? regards, tom lane