Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-25 Thread Simon Riggs
On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark st...@mit.edu wrote:
 On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Checksums merely detect a problem, whereas FPWs correct a problem if
 it happens, but only in crash situations.

 So this does nothing to remove the need for FPWs, though checksum
 detection could be used for double write buffers also.

 This is missing the point. If you have a torn page on a page that is
 only dirty due to hint bits then the checksum will show a spurious
 checksum failure. It will detect a problem that isn't there.

It will detect a problem that *is* there, but one you are classifying
as a non-problem because it is a correctable or acceptable bit
error. Given that acceptable bit errors on hints cover no more than 1%
of a block, the great likelihood is that the bit error is unacceptable
in any case, so false-positive page errors are in fact very rare.

Any bit error is an indicator of problems on the external device, so
many would regard any bit error as unacceptable.

 The problem is that there is no WAL indicating the hint bit change.
 And if the torn page includes the new checksum but not the new hint
 bit or vice versa it will be a checksum mismatch.

 The strategy discussed in the past was moving all the hint bits to a
 common area and skipping them in the checksum. No amount of double
 writing or buffering or locking will avoid this problem.

I completely agree we should do this, but we are unable to do it now,
so this patch is a stop-gap and provides a much requested feature
*now*.

In the future, we will be able to tell the difference between an
acceptable and an unacceptable bit error. Right now, all we have is
the ability to detect a bit error, and as I point out above that
solves 99% of the problem, at least.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-25 Thread Kevin Grittner
 Simon Riggs  wrote:
 On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark  wrote:
 
 The problem is that there is no WAL indicating the hint bit
 change. And if the torn page includes the new checksum but not the
 new hint bit or vice versa it will be a checksum mismatch.
 
With *just* this patch, true.  An OS crash or hardware failure could
sometimes create an invalid page.
 
 The strategy discussed in the past was moving all the hint bits to
 a common area and skipping them in the checksum. No amount of
 double writing or buffering or locking will avoid this problem.
 
I don't believe that.  Double-writing is a technique to avoid torn
pages, but it requires a checksum to work.  This chicken-and-egg
problem requires the checksum to be implemented first.
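The interplay Kevin describes can be sketched in a few lines. This is a hypothetical illustration, not the VMware patch or actual PostgreSQL code: the block is first written and fsync'd to a double-write area, and only then written in place, so a torn in-place write can be repaired from the durable copy (which a checksum is needed to validate on recovery).

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical double-write sketch.  Step 1 makes a durable copy of
 * the page in a sequential double-write file; step 2 does the in-place
 * write.  If step 2 is torn by a crash, recovery finds a checksum
 * mismatch in the data file and restores the page from the
 * double-write file (whose own copy is validated the same way). */
static int double_write(int dwfd, int datafd, off_t offset,
                        const uint8_t *page, size_t len)
{
    /* 1. durable copy in the double-write area */
    if (pwrite(dwfd, page, len, 0) != (ssize_t) len)
        return -1;
    if (fsync(dwfd) != 0)
        return -1;

    /* 2. the in-place write; a tear here is now recoverable */
    if (pwrite(datafd, page, len, offset) != (ssize_t) len)
        return -1;
    return 0;
}
```

The ordering is the whole point: the in-place write must not start until the double-write copy is known durable, which is why the technique costs an extra write plus an fsync per flushed page.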
 
 I completely agree we should do this, but we are unable to do it
 now, so this patch is a stop-gap and provides a much requested
 feature *now*.
 
Yes, for people who trust their environment to prevent torn pages, or
who are willing to tolerate one bad page per OS crash in return for
quick reporting of data corruption from unreliable file systems, this
is a good feature even without double-writes.
 
 In the future, we will be able to tell the difference between an
 acceptable and an unacceptable bit error.
 
A double-write patch would provide that, and it sounds like VMware
has a working patch for that which is being polished for submission. 
It would need to wait until we have some consensus on the checksum
patch before it can be finalized.  I'll try to review the patch from
this thread today, to do what I can to move that along.
 
-Kevin



Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-25 Thread Martijn van Oosterhout
On Sat, Dec 24, 2011 at 04:01:02PM +, Simon Riggs wrote:
 On Sat, Dec 24, 2011 at 3:54 PM, Andres Freund and...@anarazel.de wrote:
  Why don't you use the same tricks as the former patch and copy the buffer,
  compute the checksum on that, and then write out that copy (you can even do
  both at the same time). I have a hard time believing that the additional copy
  is more expensive than the locking.
 
 ISTM we can't write and copy at the same time because the checksum is
 not a trailer field.

Of course you can. If the checksum is in the trailer field you get the
nice property that the whole block has a constant checksum. However, if
you store the checksum elsewhere you just need to change the checking
algorithm to copy the checksum out, zero those bytes and run the
checksum and compare with the extracted checksum.

Not pretty, but I don't think it makes a difference in performance.
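The verification Martijn describes can be sketched as follows. This is a hedged illustration, not PostgreSQL code: the block size, checksum offset, and Fletcher-style sum are all stand-ins for whatever the patch actually uses. The trick is simply to copy the stored checksum out, zero its bytes, checksum the whole block, and compare.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192
#define CHECKSUM_OFFSET 8   /* hypothetical in-header location */

/* Toy 16-bit Fletcher-style checksum; a stand-in for the real
 * algorithm. */
static uint16_t block_checksum(const uint8_t *buf, size_t len)
{
    uint32_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++)
    {
        a = (a + buf[i]) % 255;
        b = (b + a) % 255;
    }
    return (uint16_t) ((b << 8) | a);
}

/* Writing side: compute the checksum over the block with the checksum
 * field zeroed, then store it into the field. */
static void set_block_checksum(uint8_t *page)
{
    uint16_t sum;

    memset(page + CHECKSUM_OFFSET, 0, sizeof(sum));
    sum = block_checksum(page, BLCKSZ);
    memcpy(page + CHECKSUM_OFFSET, &sum, sizeof(sum));
}

/* Reading side: extract the stored checksum from a copy, zero the
 * field in the copy, recompute, and compare. */
static bool verify_block(const uint8_t *page)
{
    uint8_t  copy[BLCKSZ];
    uint16_t stored;

    memcpy(copy, page, BLCKSZ);
    memcpy(&stored, copy + CHECKSUM_OFFSET, sizeof(stored));
    memset(copy + CHECKSUM_OFFSET, 0, sizeof(stored));
    return block_checksum(copy, BLCKSZ) == stored;
}
```

A trailer-field checksum avoids the extra copy on verification, which is the "nice property" mentioned above; the mid-page variant only costs this extra memcpy and memset per check.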

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] reprise: pretty print viewdefs

2011-12-25 Thread Andrew Dunstan



On 12/24/2011 02:26 PM, Greg Stark wrote:

On Thu, Dec 22, 2011 at 5:52 PM, Andrew Dunstan and...@dunslane.net wrote:

I've looked at that, and it was discussed a bit previously. It's more
complex because it requires that we keep track of (or calculate) where we
are on the line,

You might try a compromise, just spit out all the columns on one line
*unless* either the previous or next column is longer than something
like 30 columns. So if you have a long list of short columns it just
gets wrapped by your terminal but if you have complex expressions like
CASE expressions or casts or so on they go on a line by themselves.



I think that sounds too complex, honestly. Here's what I have working:

/*
 * If the field we're adding already has a leading newline
 * or wrap mode is disabled (pretty_wrap < 0), don't add one.
 * Otherwise, add one, plus some indentation, if either the
 * new field would cause an overflow or the last field had a
 * multiline spec.
 */

Here's an illustration: 
http://developer.postgresql.org/~adunstan/pg_get_viewdef.png
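The rule in that comment can be sketched as a single predicate. This is a hypothetical distillation, not the actual patch code; the parameter names are illustrative.

```c
#include <stdbool.h>

/* Hypothetical sketch of the wrapping rule described above: emit a
 * leading newline (plus indentation) before the next output field
 * unless the field already starts with one or wrapping is disabled
 * (pretty_wrap < 0); wrap when the field would overflow the target
 * width or the previous field spanned multiple lines. */
static bool need_newline(const char *field, int pretty_wrap,
                         int cur_col, int field_len,
                         bool last_was_multiline)
{
    if (field[0] == '\n' || pretty_wrap < 0)
        return false;
    return (cur_col + field_len > pretty_wrap) || last_was_multiline;
}
```

This keeps short columns packed onto one line while forcing complex expressions (CASE, casts) onto their own lines, which is what the screenshot illustrates.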


cheers

andrew





Re: [HACKERS] 16-bit page checksums for 9.2

2011-12-25 Thread Robert Haas
On Sun, Dec 25, 2011 at 5:08 AM, Simon Riggs si...@2ndquadrant.com wrote:
 On Sat, Dec 24, 2011 at 8:06 PM, Greg Stark st...@mit.edu wrote:
 On Sat, Dec 24, 2011 at 4:06 PM, Simon Riggs si...@2ndquadrant.com wrote:
 Checksums merely detect a problem, whereas FPWs correct a problem if
 it happens, but only in crash situations.

 So this does nothing to remove the need for FPWs, though checksum
 detection could be used for double write buffers also.

 This is missing the point. If you have a torn page on a page that is
 only dirty due to hint bits then the checksum will show a spurious
 checksum failure. It will detect a problem that isn't there.

 It will detect a problem that *is* there, but one you are classifying
 as a non-problem because it is a correctable or acceptable bit
 error.

I don't agree with this.  We don't WAL-log hint bit changes precisely
because it's OK if they make it to disk and it's OK if they don't.
Given that, I don't see how we can say that writing out only half of a
page that has had hint bit changes is a problem.  It's not.

(And if it is, then we ought to WAL-log all such changes regardless of
whether CRCs are in use.)

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Moving more work outside WALInsertLock

2011-12-25 Thread Robert Haas
On Fri, Dec 23, 2011 at 2:54 PM, Heikki Linnakangas
heikki.linnakan...@enterprisedb.com wrote:
 Sorry. Last minute changes, didn't retest properly.. Here's another attempt.

I tried this one out on Nate Boley's system.  Looks pretty good.

m = master, x = with xloginsert-scale-2 patch.  shared_buffers = 8GB,
maintenance_work_mem = 1GB, synchronous_commit = off,
checkpoint_segments = 300, checkpoint_timeout = 15min,
checkpoint_completion_target = 0.9, wal_writer_delay = 20ms.  pgbench,
scale factor 100, median of five five-minute runs.

Permanent tables:

m01 tps = 631.875547 (including connections establishing)
x01 tps = 611.443724 (including connections establishing)
m08 tps = 4573.701237 (including connections establishing)
x08 tps = 4576.242333 (including connections establishing)
m16 tps = 7697.783265 (including connections establishing)
x16 tps = 7837.028713 (including connections establishing)
m24 tps = 11613.690878 (including connections establishing)
x24 tps = 12924.027954 (including connections establishing)
m32 tps = 10684.931858 (including connections establishing)
x32 tps = 14168.419730 (including connections establishing)
m80 tps = 10259.628774 (including connections establishing)
x80 tps = 13864.651340 (including connections establishing)

And, on unlogged tables:

m01 tps = 681.805851 (including connections establishing)
x01 tps = 665.120212 (including connections establishing)
m08 tps = 4753.823067 (including connections establishing)
x08 tps = 4638.690397 (including connections establishing)
m16 tps = 8150.519673 (including connections establishing)
x16 tps = 8082.504658 (including connections establishing)
m24 tps = 14069.077657 (including connections establishing)
x24 tps = 13934.955205 (including connections establishing)
m32 tps = 18736.317650 (including connections establishing)
x32 tps = 1.585420 (including connections establishing)
m80 tps = 17709.683344 (including connections establishing)
x80 tps = 18330.488958 (including connections establishing)

Unfortunately, it does look like there is some raw loss of performance
when WALInsertLock is NOT badly contended; hence the drop-off at a
single client on permanent tables, and up through 24 clients on
unlogged tables.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



[HACKERS] Standalone synchronous master

2011-12-25 Thread Alexander Björnhagen
Hi all,

I’m new here, so maybe someone else already has this in the works?

Anyway, proposed change/patch :

Add a new parameter :

synchronous_standalone_master = on | off

To control whether a master configured with synchronous_commit = on is
allowed to stop waiting for standby WAL sync when all synchronous
standby WAL senders are disconnected.

Current behavior is that the master waits indefinitely until a
synchronous standby becomes available or until synchronous_commit is
disabled manually. This would still be the default, so
synchronous_standalone_master defaults to off.
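The proposed decision can be summarized in a small predicate. This is an illustrative sketch of the behavior described above, not the patch's actual code or function names:

```c
#include <stdbool.h>

/* Hypothetical sketch of the proposed commit-wait decision: with
 * synchronous_standalone_master = on, a commit waits for sync rep
 * only while at least one synchronous standby is connected; with it
 * off (the default and current behavior), commits block until a
 * synchronous standby appears. */
static bool commit_must_wait(bool synchronous_commit,
                             bool sync_standby_connected,
                             bool standalone_master_allowed)
{
    if (!synchronous_commit)
        return false;
    if (sync_standby_connected)
        return true;
    /* no sync standby: wait indefinitely unless standalone mode */
    return !standalone_master_allowed;
}
```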

Previously discussed here :

http://archives.postgresql.org/pgsql-hackers/2010-10/msg01009.php


I’m attaching a working patch against master/HEAD, and I hope the
spirit of Christmas will make you look kindly on my attempt :) or
something ...

It works fine, and I added some extra logging so that it is easier to
follow from an admin's point of view.

It looks like this when starting the primary server with
synchronous_standalone_master = on :

$ ./postgres
LOG:  database system was shut down at 2011-12-25 20:27:13 CET
  -- No standby is connected at startup
LOG:  not waiting for standby synchronization
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
  -- First sync standby connects here so switch to sync mode
LOG:  standby tx0113 is now the synchronous standby with priority 1
LOG:  waiting for standby synchronization
  -- standby wal receiver on the standby is killed (SIGKILL)
LOG:  unexpected EOF on standby connection
LOG:  not waiting for standby synchronization
  -- restart standby so that it connects again
LOG:  standby tx0113 is now the synchronous standby with priority 1
LOG:  waiting for standby synchronization
  -- standby wal receiver is first stopped (SIGSTOP) to make sure
we have outstanding waits in the primary, then killed (SIGKILL)
LOG:  could not receive data from client: Connection reset by peer
LOG:  unexpected EOF on standby connection
LOG:  not waiting for standby synchronization
  -- client now finally receives commit ACK that was hanging due
to the SIGSTOP:ed wal receiver on the standby node


And so on ... any comments are welcome :)

Thanks and cheers,

/A
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 0cc3296..6367dcc 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2182,6 +2182,24 @@ SET ENABLE_SEQSCAN TO OFF;
   </listitem>
  </varlistentry>
 
+ <varlistentry id="guc-synchronous-standalone-master" xreflabel="synchronous_standalone_master">
+  <term><varname>synchronous_standalone_master</varname> (<type>boolean</type>)</term>
+  <indexterm>
+   <primary><varname>synchronous_standalone_master</varname> configuration parameter</primary>
+  </indexterm>
+  <listitem>
+   <para>
+    Specifies how the master behaves when <xref linkend="guc-synchronous-commit">
+    is set to <literal>on</literal> and <xref linkend="guc-synchronous-standby-names"> is configured but no
+    appropriate standby servers are currently connected. If enabled, the master will
+    continue processing transactions alone. If disabled, all the transactions on the
+    master are blocked until a synchronous standby has appeared.
+
+    The default is disabled.
+   </para>
+  </listitem>
+ </varlistentry>
+
+
  </variablelist>
 </sect2>
 
diff --git a/src/backend/postmaster/checkpointer.c 
b/src/backend/postmaster/checkpointer.c
index e9ae1e8..706af88 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -353,6 +353,8 @@ CheckpointerMain(void)
 
/* Do this once before starting the loop, then just at SIGHUP time. */
SyncRepUpdateSyncStandbysDefined();
+   SyncRepUpdateSyncStandaloneAllowed();
+   SyncRepCheckIfStandaloneMaster();
 
/*
 * Loop forever
@@ -382,6 +384,7 @@ CheckpointerMain(void)
ProcessConfigFile(PGC_SIGHUP);
/* update global shmem state for sync rep */
SyncRepUpdateSyncStandbysDefined();
+   SyncRepUpdateSyncStandaloneAllowed();
}
if (checkpoint_requested)
{
@@ -658,6 +661,7 @@ CheckpointWriteDelay(int flags, double progress)
ProcessConfigFile(PGC_SIGHUP);
/* update global shmem state for sync rep */
SyncRepUpdateSyncStandbysDefined();
+   SyncRepUpdateSyncStandaloneAllowed();
}
 
AbsorbFsyncRequests();
diff --git a/src/backend/replication/syncrep.c 
b/src/backend/replication/syncrep.c
index 95de6c7..fd3e782 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -59,6 +59,8 @@
 /* User-settable parameters for sync rep */
 char  *SyncRepStandbyNames;
 
+bool   SyncRepStandaloneMasterAllowed;
+
 #define 

Re: [HACKERS] Page Checksums

2011-12-25 Thread Greg Stark
On Mon, Dec 19, 2011 at 7:16 PM, Kevin Grittner
kevin.gritt...@wicourts.gov wrote:
 It seems to me that on a typical production system you would
 probably have zero or one such page per OS crash

Incidentally I don't think this is right. There are really two kinds
of torn pages:

1) The kernel vm has many dirty 4k pages and decides to flush one 4k
page of a Postgres 8k buffer but not the other one. It doesn't sound
very logical for it to do this but it has the same kind of tradeoffs
to make that Postgres does and there could easily be cases where the
extra book-keeping required to avoid it isn't deemed worthwhile. The
two memory pages might not even land on the same part of the disk
anyways so flushing one and not the other might be reasonable.

In this case there could be an unbounded number of such torn pages and
they can stay torn on disk for a long period of time so the torn pages
may not have been actively being written when the crash occurred. On
Linux these torn pages will always be on memory page boundaries -- ie
4k blocks on x86.

2) The i/o system was in the process of writing out blocks and the
system lost power or crashed as they were being written out. In this
case there will probably only be 0 or 1 torn pages -- perhaps as many
as the scsi queue depth if there's some weird i/o scheduling going on.
In this case the torn page could be on a hardware block boundary --
often 512 byte boundaries (or if the drives don't guarantee otherwise
it could corrupt a disk block).

-- 
greg
