Re: [HACKERS] 16-bit page checksums for 9.2
On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark st...@mit.edu wrote:
> On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs si...@2ndquadrant.com wrote:
>> Checksums merely detect a problem, whereas FPWs correct a problem if it happens, but only in crash situations. So this does nothing to remove the need for FPWs, though checksum detection could be used for double write buffers also.
>
> This is missing the point. If you have a torn page on a page that is only dirty due to hint bits then the checksum will show a spurious checksum failure. It will detect a problem that isn't there.

It will detect a problem that *is* there, but one you are classifying as a non-problem because it is a correctable or acceptable bit error. Given that acceptable bit errors on hints cover no more than 1% of a block, the great likelihood is that the bit error is unacceptable in any case, so false-positive page errors are in fact very rare.

Any bit error is an indicator of problems on the external device, so many would regard any bit error as unacceptable.

> The problem is that there is no WAL indicating the hint bit change. And if the torn page includes the new checksum but not the new hint bit or vice versa it will be a checksum mismatch. The strategy discussed in the past was moving all the hint bits to a common area and skipping them in the checksum. No amount of double writing or buffering or locking will avoid this problem.

I completely agree we should do this, but we are unable to do it now, so this patch is a stop-gap and provides a much-requested feature *now*.

In the future, we will be able to tell the difference between an acceptable and an unacceptable bit error. Right now, all we have is the ability to detect a bit error and, as I point out above, that is 99% of the problem solved, at least.
-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] 16-bit page checksums for 9.2
Simon Riggs wrote:
> On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark wrote:
>> The problem is that there is no WAL indicating the hint bit change. And if the torn page includes the new checksum but not the new hint bit or vice versa it will be a checksum mismatch.

With *just* this patch, true. An OS crash or hardware failure could sometimes create an invalid page.

>> The strategy discussed in the past was moving all the hint bits to a common area and skipping them in the checksum. No amount of double writing or buffering or locking will avoid this problem.

I don't believe that. Double-writing is a technique to avoid torn pages, but it requires a checksum to work. This chicken-and-egg problem requires the checksum to be implemented first.

> I completely agree we should do this, but we are unable to do it now, so this patch is a stop-gap and provides a much requested feature *now*.

Yes, for people who trust their environment to prevent torn pages, or who are willing to tolerate one bad page per OS crash in return for quick reporting of data corruption from unreliable file systems, this is a good feature even without double-writes.

> In the future, we will be able to tell the difference between an acceptable and an unacceptable bit error.

A double-write patch would provide that, and it sounds like VMware has a working patch for that which is being polished for submission. It would need to wait until we have some consensus on the checksum patch before it can be finalized. I'll try to review the patch from this thread today, to do what I can to move that along.

-Kevin
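The mismatch Greg describes, and which this message concedes for the patch on its own, can be reproduced in a few lines. What follows is only a sketch under stated assumptions: the offsets, the 4 kB flush unit, and the Fletcher-style sum are all hypothetical stand-ins chosen for illustration, not the page layout or the checksum algorithm from the patch under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ      8192   /* page size discussed in the thread */
#define DEMO_OS_PAGE     4096   /* assumed unit the OS may flush on its own */
#define DEMO_CSUM_OFFSET 8      /* hypothetical: checksum sits in the page header */
#define DEMO_HINT_OFFSET 6000   /* hypothetical: a hint bit in the second half */

/* Fletcher-style 16-bit sum that treats the checksum field itself as zero. */
uint16_t
demo_checksum(const uint8_t *page)
{
    uint32_t a = 0, b = 0;

    for (size_t i = 0; i < DEMO_BLCKSZ; i++)
    {
        uint8_t byte = page[i];

        if (i >= DEMO_CSUM_OFFSET && i < DEMO_CSUM_OFFSET + sizeof(uint16_t))
            byte = 0;
        a = (a + byte) % 255;
        b = (b + a) % 255;
    }
    return (uint16_t) ((b << 8) | a);
}

/* Store the current checksum in the page header. */
void
demo_stamp(uint8_t *page)
{
    uint16_t sum = demo_checksum(page);

    memcpy(page + DEMO_CSUM_OFFSET, &sum, sizeof(sum));
}

/* Return 1 if the stored checksum matches the page contents, else 0. */
int
demo_verify(const uint8_t *page)
{
    uint16_t stored;

    memcpy(&stored, page + DEMO_CSUM_OFFSET, sizeof(stored));
    return stored == demo_checksum(page);
}

/*
 * Set one hint bit in the buffered copy, restamp it, then let only the
 * first OS page reach "disk". Returns the verification result for the
 * resulting on-disk page.
 */
int
demo_hint_bit_tear_verifies(void)
{
    static uint8_t disk[DEMO_BLCKSZ];
    static uint8_t buffer[DEMO_BLCKSZ];

    memset(disk, 0x10, DEMO_BLCKSZ);
    demo_stamp(disk);
    assert(demo_verify(disk));              /* clean page verifies */

    memcpy(buffer, disk, DEMO_BLCKSZ);
    buffer[DEMO_HINT_OFFSET] |= 0x01;       /* hint-bit-only change, no WAL */
    demo_stamp(buffer);

    memcpy(disk, buffer, DEMO_OS_PAGE);     /* torn write: first half only */
    return demo_verify(disk);
}
```

In this sketch demo_hint_bit_tear_verifies() returns 0: the on-disk page carries the new checksum but the old hint bit, so verification fails even though no tuple data was lost.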
Re: [HACKERS] 16-bit page checksums for 9.2
On Sat, Dec 24, 2011 at 04:01:02PM +, Simon Riggs wrote:
> On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund and...@anarazel.de wrote:
>> Why don't you use the same tricks as the former patch and copy the buffer, compute the checksum on that, and then write out that copy (you can even do both at the same time). I have a hard time believing that the additional copy is more expensive than the locking.
>
> ISTM we can't write and copy at the same time because the checksum is not a trailer field.

Of course you can. If the checksum is in the trailer field you get the nice property that the whole block has a constant checksum. However, if you store the checksum elsewhere you just need to change the checking algorithm to copy the checksum out, zero those bytes, run the checksum, and compare the result with the extracted checksum. Not pretty, but I don't think it makes a difference in performance.

Have a nice day,
--
Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/
He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
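The checking algorithm Martijn outlines (copy the checksum out, zero those bytes, run the checksum, compare) might look roughly like the sketch below. The field offset and the Fletcher-style sum are assumptions made for illustration; the thread does not say where the patch stores the checksum or which algorithm it uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ      8192   /* page size discussed in the thread */
#define DEMO_CSUM_OFFSET 8      /* hypothetical position of the 16-bit field */

/* Fletcher-style 16-bit sum; a stand-in, not the algorithm from the patch. */
uint16_t
demo_sum16(const uint8_t *data, size_t len)
{
    uint32_t a = 0, b = 0;

    assert(data != NULL);
    for (size_t i = 0; i < len; i++)
    {
        a = (a + data[i]) % 255;
        b = (b + a) % 255;
    }
    return (uint16_t) ((b << 8) | a);
}

/* Checksum a private copy of the page with the checksum field zeroed. */
uint16_t
demo_page_checksum(const uint8_t *page)
{
    uint8_t copy[DEMO_BLCKSZ];

    memcpy(copy, page, DEMO_BLCKSZ);
    memset(copy + DEMO_CSUM_OFFSET, 0, sizeof(uint16_t));
    return demo_sum16(copy, DEMO_BLCKSZ);
}

/* Writer side: compute the checksum and store it in the page. */
void
demo_page_set_checksum(uint8_t *page)
{
    uint16_t sum = demo_page_checksum(page);

    memcpy(page + DEMO_CSUM_OFFSET, &sum, sizeof(sum));
}

/* Reader side: extract the stored value, recompute, and compare. */
int
demo_page_verify(const uint8_t *page)
{
    uint16_t stored;

    memcpy(&stored, page + DEMO_CSUM_OFFSET, sizeof(stored));
    return stored == demo_page_checksum(page);
}
```

Working on a copy keeps the caller's buffer untouched during verification, which is in the spirit of Andres's suggestion to checksum a copy rather than hold a lock on the shared buffer.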
Re: [HACKERS] reprise: pretty print viewdefs
On 12/24/2011 02:26 PM, Greg Stark wrote:
> On Thu, Dec 22, 2011 at 5:52 PM, Andrew Dunstan and...@dunslane.net wrote:
>> I've looked at that, and it was discussed a bit previously. It's more complex because it requires that we keep track of (or calculate) where we are on the line,
>
> You might try a compromise, just spit out all the columns on one line *unless* either the previous or next column is longer than something like 30 columns. So if you have a long list of short columns it just gets wrapped by your terminal but if you have complex expressions like CASE expressions or casts or so on they go on a line by themselves.

I think that sounds too complex, honestly. Here's what I have working:

/*
 * If the field we're adding already has a leading newline
 * or wrap mode is disabled (pretty_wrap < 0), don't add one.
 * Otherwise, add one, plus some indentation,
 * if either the new field would cause an
 * overflow or the last field had a multiline spec.
 */

Here's an illustration: http://developer.postgresql.org/~adunstan/pg_get_viewdef.png

cheers

andrew
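The rule in Andrew's comment can be restated as a small predicate. This is a sketch only: the function name and parameters are hypothetical, it assumes a negative pretty_wrap means wrapping is disabled, and it assumes "overflow" means the current column plus the field length exceeds pretty_wrap. It is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Decide whether a newline (plus indentation) should be emitted before
 * appending `field` to the current output line.
 *
 * field              text of the next target-list entry
 * pretty_wrap        wrap column; negative is assumed to disable wrapping
 * current_col        hypothetical count of characters already on the line
 * last_was_multiline whether the previous field spanned several lines
 */
bool
demo_needs_wrap(const char *field, int pretty_wrap, int current_col,
                bool last_was_multiline)
{
    assert(field != NULL);

    /* The field brings its own newline, or wrapping is off: add nothing. */
    if (field[0] == '\n' || pretty_wrap < 0)
        return false;

    /* Otherwise wrap on overflow or after a multiline field. */
    bool overflows = current_col + (int) strlen(field) > pretty_wrap;

    return overflows || last_was_multiline;
}
```

Under these assumptions a long run of short columns stays on one line until it reaches the wrap column, while anything following a multiline expression such as a CASE starts on a fresh line.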
Re: [HACKERS] 16-bit page checksums for 9.2
On Sun, Dec 25, 2011 at 5:08 AM, Simon Riggs si...@2ndquadrant.com wrote:
> On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark st...@mit.edu wrote:
>> On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs si...@2ndquadrant.com wrote:
>>> Checksums merely detect a problem, whereas FPWs correct a problem if it happens, but only in crash situations. So this does nothing to remove the need for FPWs, though checksum detection could be used for double write buffers also.
>>
>> This is missing the point. If you have a torn page on a page that is only dirty due to hint bits then the checksum will show a spurious checksum failure. It will detect a problem that isn't there.
>
> It will detect a problem that *is* there, but one you are classifying as a non-problem because it is a correctable or acceptable bit error.

I don't agree with this. We don't WAL-log hint bit changes precisely because it's OK if they make it to disk and it's OK if they don't. Given that, I don't see how we can say that writing out only half of a page that has had hint bit changes is a problem. It's not.

(And if it is, then we ought to WAL-log all such changes regardless of whether CRCs are in use.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Moving more work outside WALInsertLock
On Fri, Dec 23, 2011 at 2:54 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:
> Sorry. Last minute changes, didn't retest properly.. Here's another attempt.

I tried this one out on Nate Boley's system. Looks pretty good.

m = master, x = with xloginsert-scale-2 patch. shared_buffers = 8GB, maintenance_work_mem = 1GB, synchronous_commit = off, checkpoint_segments = 300, checkpoint_timeout = 15min, checkpoint_completion_target = 0.9, wal_writer_delay = 20ms. pgbench, scale factor 100, median of five five-minute runs.

Permanent tables:

m01 tps = 631.875547 (including connections establishing)
x01 tps = 611.443724 (including connections establishing)
m08 tps = 4573.701237 (including connections establishing)
x08 tps = 4576.242333 (including connections establishing)
m16 tps = 7697.783265 (including connections establishing)
x16 tps = 7837.028713 (including connections establishing)
m24 tps = 11613.690878 (including connections establishing)
x24 tps = 12924.027954 (including connections establishing)
m32 tps = 10684.931858 (including connections establishing)
x32 tps = 14168.419730 (including connections establishing)
m80 tps = 10259.628774 (including connections establishing)
x80 tps = 13864.651340 (including connections establishing)

And, on unlogged tables:

m01 tps = 681.805851 (including connections establishing)
x01 tps = 665.120212 (including connections establishing)
m08 tps = 4753.823067 (including connections establishing)
x08 tps = 4638.690397 (including connections establishing)
m16 tps = 8150.519673 (including connections establishing)
x16 tps = 8082.504658 (including connections establishing)
m24 tps = 14069.077657 (including connections establishing)
x24 tps = 13934.955205 (including connections establishing)
m32 tps = 18736.317650 (including connections establishing)
x32 tps = 1.585420 (including connections establishing)
m80 tps = 17709.683344 (including connections establishing)
x80 tps = 18330.488958 (including connections establishing)
Unfortunately, it does look like there is some raw loss of performance when WALInsertLock is NOT badly contended; hence the drop-off at a single client on permanent tables, and up through 24 clients on unlogged tables.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
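To put the permanent-table figures above in relative terms, the percentage change of the patched build against master can be computed per client count. The sketch below uses only the tps values quoted in this message; the struct and function names are made up for the illustration, and the unlogged-table rows are left out because one of them did not survive archiving intact.

```c
#include <assert.h>
#include <stddef.h>

/* One benchmark row: client count, master tps, patched tps. */
typedef struct
{
    int    clients;
    double master_tps;
    double patched_tps;
} DemoResult;

/* Permanent-table medians as quoted in the message. */
static const DemoResult demo_permanent[] = {
    {1,  631.875547,   611.443724},
    {8,  4573.701237,  4576.242333},
    {16, 7697.783265,  7837.028713},
    {24, 11613.690878, 12924.027954},
    {32, 10684.931858, 14168.419730},
    {80, 10259.628774, 13864.651340},
};

/* Percentage change of patched relative to master. */
double
demo_pct_change(double master_tps, double patched_tps)
{
    assert(master_tps > 0.0);
    return (patched_tps - master_tps) / master_tps * 100.0;
}

/* Percentage change for a given client count, or 0.0 if no such row. */
double
demo_pct_change_for(int clients)
{
    size_t n = sizeof(demo_permanent) / sizeof(demo_permanent[0]);

    for (size_t i = 0; i < n; i++)
    {
        if (demo_permanent[i].clients == clients)
            return demo_pct_change(demo_permanent[i].master_tps,
                                   demo_permanent[i].patched_tps);
    }
    return 0.0;
}
```

For these numbers the single-client case comes out at roughly -3.2% and the 32-client case at roughly +32.6%, which matches the pattern described: a small loss when the lock is uncontended and a large gain once it is contended.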
[HACKERS] Standalone synchronous master
Hi all,

I’m new here so maybe someone else already has this in the works?

Anyway, proposed change/patch:

Add a new parameter:

synchronous_standalone_master = on | off

To control whether a master configured with synchronous_commit = on is allowed to stop waiting for standby WAL sync when all synchronous standby WAL senders are disconnected.

Current behavior is that the master waits indefinitely until a synchronous standby becomes available or until synchronous_commit is disabled manually. This would still be the default, so synchronous_standalone_master defaults to off.

Previously discussed here:

http://archives.postgresql.org/pgsql-hackers/2010-10/msg01009.php

I’m attaching a working patch against master/HEAD and I hope the spirit of Christmas will make you look kindly on my attempt :) or something ...

It works fine and I added some extra logging so that it would be possible to follow more easily from an admin's point of view.

It looks like this when starting the primary server with synchronous_standalone_master = on:

$ ./postgres
LOG: database system was shut down at 2011-12-25 20:27:13 CET
-- No standby is connected at startup
LOG: not waiting for standby synchronization
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
-- First sync standby connects here so switch to sync mode
LOG: standby tx0113 is now the synchronous standby with priority 1
LOG: waiting for standby synchronization
-- standby wal receiver on the standby is killed (SIGKILL)
LOG: unexpected EOF on standby connection
LOG: not waiting for standby synchronization
-- restart standby so that it connects again
LOG: standby tx0113 is now the synchronous standby with priority 1
LOG: waiting for standby synchronization
-- standby wal receiver is first stopped (SIGSTOP) to make sure we have outstanding waits in the primary, then killed (SIGKILL)
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on standby connection
LOG: not waiting for standby synchronization
-- client now finally receives commit ACK that was hanging due to the SIGSTOP:ed wal receiver on the standby node

And so on ... any comments are welcome :)

Thanks and cheers,
/A

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0cc3296..6367dcc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2182,6 +2182,24 @@ SET ENABLE_SEQSCAN TO OFF;
       </listitem>
      </varlistentry>
+     <varlistentry id="guc-synchronous-standalone-master" xreflabel="synchronous-standalone-master">
+      <term><varname>synchronous_standalone_master</varname> (<type>boolean</type>)</term>
+      <indexterm>
+       <primary><varname>synchronous_standalone_master</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Specifies how the master behaves when <xref linkend="guc-synchronous-commit">
+        is set to <literal>on</> and <xref linkend="guc-synchronous-standby-names"> is configured but no
+appropriate standby servers are currently connected. If enabled, the master will
+continue processing transactions alone. If disabled, all the transactions on the
+master are blocked until a synchronous standby has appeared.
+
+        The default is disabled.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index e9ae1e8..706af88 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -353,6 +353,8 @@ CheckpointerMain(void)
 	/* Do this once before starting the loop, then just at SIGHUP time. */
 	SyncRepUpdateSyncStandbysDefined();
+	SyncRepUpdateSyncStandaloneAllowed();
+	SyncRepCheckIfStandaloneMaster();
 	/*
 	 * Loop forever
@@ -382,6 +384,7 @@ CheckpointerMain(void)
 			ProcessConfigFile(PGC_SIGHUP);
 			/* update global shmem state for sync rep */
 			SyncRepUpdateSyncStandbysDefined();
+			SyncRepUpdateSyncStandaloneAllowed();
 		}
 		if (checkpoint_requested)
 		{
@@ -658,6 +661,7 @@ CheckpointWriteDelay(int flags, double progress)
 			ProcessConfigFile(PGC_SIGHUP);
 			/* update global shmem state for sync rep */
 			SyncRepUpdateSyncStandbysDefined();
+			SyncRepUpdateSyncStandaloneAllowed();
 		}
 		AbsorbFsyncRequests();
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 95de6c7..fd3e782 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -59,6 +59,8 @@
 /* User-settable parameters for sync rep */
 char	   *SyncRepStandbyNames;
+bool		SyncRepStandaloneMasterAllowed;
+
 #define
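The behaviour proposed in this message can be summarised as a single decision: should a committing backend wait for a standby or not? The function below is a hypothetical restatement of the prose description only. It is not taken from the attached patch (which is cut off above), and its name and parameters are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return true if a commit should wait for standby WAL sync.
 *
 * synchronous_commit        value of synchronous_commit (on/off)
 * standby_names_configured  whether synchronous_standby_names is set
 * connected_sync_standbys   number of synchronous standbys connected now
 * standalone_allowed        the proposed synchronous_standalone_master
 */
bool
demo_commit_must_wait(bool synchronous_commit,
                      bool standby_names_configured,
                      int connected_sync_standbys,
                      bool standalone_allowed)
{
    assert(connected_sync_standbys >= 0);

    /* Synchronous replication is not in effect at all. */
    if (!synchronous_commit || !standby_names_configured)
        return false;

    /* A synchronous standby is available: wait for it as usual. */
    if (connected_sync_standbys > 0)
        return true;

    /*
     * No synchronous standby is connected. Existing behaviour is to wait
     * indefinitely; the proposed setting lets the master carry on alone.
     */
    return !standalone_allowed;
}
```

With standalone_allowed left at its default of off, the function reduces to the existing behaviour, which is the backward-compatibility property the proposal describes.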
Re: [HACKERS] Page Checksums
On Mon, Dec 19, 2011 at 7:16 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote:
> It seems to me that on a typical production system you would probably have zero or one such page per OS crash

Incidentally I don't think this is right. There are really two kinds of torn pages:

1) The kernel vm has many dirty 4k pages and decides to flush one 4k page of a Postgres 8k buffer but not the other one. It doesn't sound very logical for it to do this but it has the same kind of tradeoffs to make that Postgres does and there could easily be cases where the extra book-keeping required to avoid it isn't deemed worthwhile. The two memory pages might not even land on the same part of the disk anyway, so flushing one and not the other might be reasonable.

In this case there could be an unbounded number of such torn pages and they can stay torn on disk for a long period of time, so the torn pages may not have been actively being written when the crash occurred. On Linux these torn pages will always be on memory page boundaries -- ie 4k blocks on x86.

2) The i/o system was in the process of writing out blocks and the system lost power or crashed as they were being written out. In this case there will probably only be 0 or 1 torn pages -- perhaps as many as the scsi queue depth if there's some weird i/o scheduling going on.

In this case the torn page could be on a hardware block boundary -- often 512 byte boundaries (or if the drives don't guarantee otherwise it could corrupt a disk block).

--
greg
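The two cases differ in the granularity at which the tear can fall: 4k memory pages for a kernel-level partial flush, 512-byte blocks for an interrupted device write. A minimal simulation of both, assuming an 8k Postgres page and using made-up helper names, could look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ 8192    /* Postgres page size discussed in the thread */

/*
 * Simulate an interrupted write: only the first `units_written` units of
 * `unit` bytes are copied from the in-memory buffer to the disk image.
 */
void
demo_partial_write(uint8_t *disk, const uint8_t *buffer,
                   size_t unit, size_t units_written)
{
    size_t nbytes = unit * units_written;

    assert(unit > 0 && nbytes <= DEMO_BLCKSZ);
    memcpy(disk, buffer, nbytes);
}

/*
 * Return the first offset at which the disk image differs from the buffer,
 * or DEMO_BLCKSZ if the whole page was written. This locates the tear only
 * when old and new contents differ at that byte, which the caller arranges
 * by filling the two images with different values.
 */
size_t
demo_tear_offset(const uint8_t *disk, const uint8_t *buffer)
{
    for (size_t i = 0; i < DEMO_BLCKSZ; i++)
    {
        if (disk[i] != buffer[i])
            return i;
    }
    return DEMO_BLCKSZ;
}
```

Calling demo_partial_write with unit 4096 and one unit written puts the tear at offset 4096, the memory-page boundary of case 1; with unit 512 and three units written it lands at offset 1536, a boundary that only the hardware-level failure of case 2 can produce.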