Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Andres Freund
On 2013-03-06 13:34:21 +0200, Heikki Linnakangas wrote:
 On 06.03.2013 10:41, Simon Riggs wrote:
 On 5 March 2013 18:02, Jeff Davis <pg...@j-davis.com> wrote:
 
 Fletcher is probably significantly faster than CRC-16, because I'm just
 doing int32 addition in a tight loop.
 
 Simon originally chose Fletcher, so perhaps he has more to say.
 
 IIRC the research showed Fletcher was significantly faster for only a
 small loss in error detection rate.
 
 It was sufficient to make our error detection > 1 million times
 better, possibly more. That seems sufficient to enable early detection
 of problems, since if we missed the first error, a second is very
 likely to be caught (etc). So I am assuming that we're trying to catch
 a pattern of errors early, rather than guarantee we can catch the very
 first error.
 
 Fletcher's checksum is good in general, I was mainly worried about
 truncating the Fletcher-64 into two 8-bit values. I can't spot any obvious
 weakness in it, but if it's indeed faster and as good as a straightforward
 Fletcher-16, I wonder why that method is not more widely used.

I personally am not that convinced that Fletcher is such a good choice given
that, AFAIK, it doesn't distinguish between all-zero and all-one runs that are
long enough.
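
[A minimal demonstration in C of the weakness being described -- illustrative
only, not code from the patch.  With sums reduced mod 255, a 32-bit word of
all zeros and one of all ones contribute the same value, because 2^32 - 1 is
divisible by 255:]

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Reduce a buffer of 32-bit words the way a mod-255 Fletcher-style
     * sum would.  0x00000000 and 0xFFFFFFFF both reduce to 0 mod 255. */
    static unsigned fold255(const uint32_t *w, size_t n)
    {
        uint64_t sum = 0;
        size_t i;

        for (i = 0; i < n; i++)
            sum += w[i] % 255;
        return (unsigned) (sum % 255);
    }

    int main(void)
    {
        uint32_t zeros[4] = {0, 0, 0, 0};
        uint32_t ones[4] = {0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF};

        /* Prints "0 0": runs of 0x00 and runs of 0xFF collide. */
        printf("%u %u\n", fold255(zeros, 4), fold255(ones, 4));
        return 0;
    }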

 Another thought is that perhaps something like CRC32C would be faster to
 calculate on modern hardware, and could be safely truncated to 16-bits using
 the same technique you're using to truncate the Fletcher's Checksum. Greg's
 tests showed that the overhead of CRC calculation is significant in some
 workloads, so it would be good to spend some time to optimize that. It'd be
 difficult to change the algorithm in a future release without breaking
 on-disk compatibility, so let's make sure we pick the best one.

I had implemented a noticeably faster CRC32 implementation somewhere
around 201005202227.49990.and...@anarazel.de . I have since repeatedly
seen pg's CRC32 implementation being a major limitation, so I think
brushing up that patch would be a good idea.
We might think about switching the polynomial for WAL at the same time,
given, as you say, CRC32c is available in hardware. The bigger problem
is probably stuff like the control file et al.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Garick Hamlin
On Wed, Mar 06, 2013 at 01:34:21PM +0200, Heikki Linnakangas wrote:
 On 06.03.2013 10:41, Simon Riggs wrote:
 On 5 March 2013 18:02, Jeff Davis <pg...@j-davis.com> wrote:

 Fletcher is probably significantly faster than CRC-16, because I'm just
 doing int32 addition in a tight loop.

 Simon originally chose Fletcher, so perhaps he has more to say.

 IIRC the research showed Fletcher was significantly faster for only a
 small loss in error detection rate.

 It was sufficient to make our error detection > 1 million times
 better, possibly more. That seems sufficient to enable early detection
 of problems, since if we missed the first error, a second is very
 likely to be caught (etc). So I am assuming that we're trying to catch
 a pattern of errors early, rather than guarantee we can catch the very
 first error.

 Fletcher's checksum is good in general, I was mainly worried about  
 truncating the Fletcher-64 into two 8-bit values. I can't spot any  
 obvious weakness in it, but if it's indeed faster and as good as a  
 straightforward Fletcher-16, I wonder why that method is not more widely  
 used.

I was wondering about the effectiveness of this resulting truncated hash
function as well.

 Another thought is that perhaps something like CRC32C would be faster to  
 calculate on modern hardware, and could be safely truncated to 16-bits  
 using the same technique you're using to truncate the Fletcher's  
 Checksum. Greg's tests showed that the overhead of CRC calculation is  
 significant in some workloads, so it would be good to spend some time to  
 optimize that. It'd be difficult to change the algorithm in a future  
 release without breaking on-disk compatibility, so let's make sure we  
 pick the best one.

If picking a CRC why not a short optimal one rather than truncate CRC32C?

I've been reading about optimal checksums for small messages for other 
reasons and found this paper quite good.

http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf

I was interested in small messages and small checksums so this paper may not be
as much help here.

Other than CRCs and Fletcher sums, Pearson hashing with a 16-bit block might
be worth considering.  Either a Pearson hash or a 16-bit CRC is small enough to
implement with a lookup table rather than a formula.
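
[For reference, a hedged sketch of a table-driven 16-bit Pearson hash along
the lines Garick mentions -- illustrative only; the permutation table and the
two starting values are arbitrary choices, not a vetted design:]

    #include <stdint.h>
    #include <stddef.h>

    static uint8_t T[256];

    /* Build an arbitrary fixed permutation of 0..255 for the Pearson
     * table; any permutation works, though quality varies with the
     * choice.  The LCG here just drives the demo shuffle. */
    static void pearson_init(void)
    {
        uint32_t s = 12345;
        int i;

        for (i = 0; i < 256; i++)
            T[i] = (uint8_t) i;
        for (i = 255; i > 0; i--)
        {
            int j;
            uint8_t tmp;

            s = s * 1103515245 + 12345;
            j = (int) ((s >> 16) % (uint32_t) (i + 1));
            tmp = T[i]; T[i] = T[j]; T[j] = tmp;
        }
    }

    /* 16-bit Pearson: two 8-bit passes started from different values,
     * one table lookup per byte per pass. */
    static uint16_t pearson16(const uint8_t *buf, size_t len)
    {
        uint8_t h1 = 0, h2 = 1;
        size_t i;

        for (i = 0; i < len; i++)
        {
            h1 = T[h1 ^ buf[i]];
            h2 = T[h2 ^ buf[i]];
        }
        return (uint16_t) (h1 | (h2 << 8));
    }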

I've been wondering what kinds of errors we expect.  Single bit flips?  Large
swaths of corrupted bytes?  Are we more worried about collisions (the odds
that total garbage has the same checksum) or the odds that we detect a flip
of n bits?  Since the message is large, and a write to the wrong location
seems about as likely as a bit flip, I would think a Pearson hash would do well.

Any choice seems like it would be a nice improvement for noticing a
storage-stack problem.  The differences would be subtle: can I accurately
estimate the odds of undetected corruption that occurred since the condition
was first detected, or does the checksum/hash perform poorly?

Garick



 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Andres Freund
On 2013-03-06 11:21:21 -0500, Garick Hamlin wrote:
 If picking a CRC why not a short optimal one rather than truncate CRC32C?

CRC32C is available in hardware since SSE4.2.
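
[A sketch of that hardware path, for the archives -- illustrative only: SSE4.2
exposes the CRC32 instruction through the _mm_crc32_* intrinsics in
nmmintrin.h.  Real code would need a runtime CPUID check and a software
fallback, and this must be compiled with -msse4.2.  Truncating to 16 bits
just keeps the low half of the final value:]

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <nmmintrin.h>

    /* CRC32C over a buffer using the SSE4.2 instruction, truncated to
     * 16 bits.  Sketch only: no CPUID check, no fallback. */
    uint16_t crc32c_hw16(const void *data, size_t len)
    {
        const unsigned char *p = data;
        uint64_t crc = 0xFFFFFFFF;

        while (len >= 8)
        {
            uint64_t chunk;

            memcpy(&chunk, p, 8);   /* avoid unaligned access */
            crc = _mm_crc32_u64(crc, chunk);
            p += 8;
            len -= 8;
        }
        while (len-- > 0)
            crc = _mm_crc32_u8((uint32_t) crc, *p++);
        return (uint16_t) ~crc;     /* keep the low 16 bits */
    }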

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Robert Haas
On Mon, Mar 4, 2013 at 3:13 PM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
 On 04.03.2013 20:58, Greg Smith wrote:

 There
 is no such thing as a stable release of btrfs, and no timetable for when
 there will be one. I could do some benchmarks of that but I didn't think
 they were very relevant. Who cares how fast something might run when it
 may not work correctly? btrfs might as well be /dev/null to me right
 now--sure it's fast, but maybe the data won't be there at all.

 This PostgreSQL patch hasn't seen any production use, either. In fact, I'd
 consider btrfs to be more mature than this patch. Unless you think that
 there will be some major changes for the worse in btrfs performance, it's
 perfectly valid and useful to compare the two.

 A comparison with ZFS would be nice too. That's mature, and has checksums.

We've had a few EnterpriseDB customers who have had fantastically
painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
ZFS block size to the PostgreSQL block size is supposed to make these
problems go away, but in my experience it does not have that effect.
So I think telling people who want checksums "go use ZFS" is a lot
like telling them "oh, I see you have a hangnail, we recommend that
you solve that by cutting your arm off with a rusty saw."

There may be good reasons to reject this patch.  Or there may not.
But I completely disagree with the idea that asking them to solve the
problem at the filesystem level is sensible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Josh Berkus

 There may be good reasons to reject this patch.  Or there may not.
 But I completely disagree with the idea that asking them to solve the
 problem at the filesystem level is sensible.

Yes, can we get back to the main issues with the patch?

1) argument over whether the checksum is sufficient to detect most
errors, or if it will give users false confidence.

2) performance overhead.

Based on Smith's report, I consider (2) to be a deal-killer right now.
The level of overhead reported by him would prevent the users I work
with from ever employing checksums on production systems.

Specifically, writing checksums during a read-only query is a defect I
think is prohibitively bad.  When we first talked about this feature for
9.2, we were going to exclude hint bits from checksums, in order to
avoid this issue; what happened to that?

(FWIW, I still support the idea of moving hint bits to a separate
filehandle, as we do with the FSM, but clearly that's not happening for
9.3 ...)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Josh Berkus
Robert,

 We've had a few EnterpriseDB customers who have had fantastically
 painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
 ZFS block size to the PostgreSQL block size is supposed to make these
 problems go away, but in my experience it does not have that effect.
 So I think telling people who want checksums "go use ZFS" is a lot
 like telling them "oh, I see you have a hangnail, we recommend that
 you solve that by cutting your arm off with a rusty saw."

Wow, what platform are you using ZFS on?

(we have a half-dozen clients on ZFS ...)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Robert Haas
On Wed, Mar 6, 2013 at 2:14 PM, Josh Berkus <j...@agliodbs.com> wrote:
 Based on Smith's report, I consider (2) to be a deal-killer right now.

I was pretty depressed by those numbers, too.

 The level of overhead reported by him would prevent the users I work
 with from ever employing checksums on production systems.

Agreed.

 Specifically, writing checksums during a read-only query is a defect I
 think is prohibitively bad.

That particular part doesn't bother me so much as some of the others -
but let's step back and look at the larger issue.  I suspect we can
all agree that the performance of this feature is terrible.  The
questions I think we should be asking are:

1. Are the problems fundamental, or things where we can reasonably
foresee future improvement?  The latter situation wouldn't bother me
very much even if the current situation is pretty bad, but if there's
no real hope of improvement, that's more of a problem.

2. Are the performance results sufficiently bad that we think this
would be more of a liability than an asset?

 When we first talked about this feature for
 9.2, we were going to exclude hint bits from checksums, in order to
 avoid this issue; what happened to that?

I don't think anyone ever thought that was a particularly practical
design.  I certainly don't.

 (FWIW, I still support the idea of moving hint bits to a separate
 filehandle, as we do with the FSM, but clearly that's not happening for
 9.3 ...)

Or, most likely, ever.  The whole benefit of hint bits is that the
information you need is available in the same bytes you have to read
anyway.  Moving the information to another fork (not filehandle) would
probably give up most of the benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Robert Haas
On Wed, Mar 6, 2013 at 6:00 PM, Josh Berkus <j...@agliodbs.com> wrote:
 We've had a few EnterpriseDB customers who have had fantastically
 painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
 ZFS block size to the PostgreSQL block size is supposed to make these
 problems go away, but in my experience it does not have that effect.
 So I think telling people who want checksums "go use ZFS" is a lot
 like telling them "oh, I see you have a hangnail, we recommend that
 you solve that by cutting your arm off with a rusty saw."

 Wow, what platform are you using ZFS on?

 (we have a half-dozen clients on ZFS ...)

Not us, customers.  But as to platform, I have yet to run across
anyone running ZFS on anything but Solaris.  I'd be interested to hear
your experiences.   Mine rhyme with sun a play dreaming.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Joshua D. Drake


On 03/06/2013 03:06 PM, Robert Haas wrote:


On Wed, Mar 6, 2013 at 6:00 PM, Josh Berkus <j...@agliodbs.com> wrote:

We've had a few EnterpriseDB customers who have had fantastically
painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
ZFS block size to the PostgreSQL block size is supposed to make these
problems go away, but in my experience it does not have that effect.
So I think telling people who want checksums "go use ZFS" is a lot
like telling them "oh, I see you have a hangnail, we recommend that
you solve that by cutting your arm off with a rusty saw."


Wow, what platform are you using ZFS on?

(we have a half-dozen clients on ZFS ...)


Not us, customers.  But as to platform, I have yet to run across
anyone running ZFS on anything but Solaris.  I'd be interested to hear
your experiences.   Mine rhyme with sun a play dreaming.


I would guess he meant on x86_64 or SPARC.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Tom Lane
Andres Freund <and...@2ndquadrant.com> writes:
 On 2013-03-06 11:21:21 -0500, Garick Hamlin wrote:
 If picking a CRC why not a short optimal one rather than truncate CRC32C?

 CRC32C is available in hardware since SSE4.2.

I think that should be at most a fourth-order consideration, since we
are not interested solely in Intel hardware, nor do we have any portable
way of getting at such a feature even if the hardware has it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Craig Ringer
On 03/06/2013 07:34 PM, Heikki Linnakangas wrote:
 It'd be difficult to change the algorithm in a future release without
 breaking on-disk compatibility,
On-disk compatibility is broken with major releases anyway, so I don't
see this as a huge barrier.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Andres Freund
On 2013-03-07 08:37:40 +0800, Craig Ringer wrote:
 On 03/06/2013 07:34 PM, Heikki Linnakangas wrote:
  It'd be difficult to change the algorithm in a future release without
  breaking on-disk compatibility,
 On-disk compatibility is broken with major releases anyway, so I don't
 see this as a huge barrier.

Uh, pg_upgrade?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith

On 3/6/13 1:34 PM, Robert Haas wrote:

We've had a few EnterpriseDB customers who have had fantastically
painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
ZFS block size to the PostgreSQL block size is supposed to make these
problems go away, but in my experience it does not have that effect.


There are a couple of major tuning issues you have to get right for good 
ZFS performance, like its tendency to gobble more RAM than is 
necessarily appropriate for a PostgreSQL host.  If you nail down all 
those and carefully setup everything it can work OK.  When Sun had a 
bunch of good engineers working on the problem they certainly pulled it 
off.  I managed a 3TB database on a ZFS volume for a while myself. 
Being able to make filesystem snapshots cleanly and easily was very nice.


As for the write performance implications of COW, though, at a couple of 
points I was only able to keep that system ingesting data fast enough if 
I turned fsync off :(  It's not as if even ZFS makes all the filesystem 
issues the database worries about go away either.  Take a look at 
http://www.c0t0d0s0.org/archives/6071-No,-ZFS-really-doesnt-need-a-fsck.html 
as an example.  That should leave you with a healthy concern over ZFS 
handling of power interruption and lying drives.  [NTFS and ext3] have 
the same problem, but it has different effects that aren't as visible 
as in ZFS.  ext4 actually fixed this for most hardware though, and I 
believe ZFS still has the same uberblock concern.  ZFS reliability and 
its page checksums are good, but they're not magic for eliminating torn 
page issues.


Normally I would agree with Heikki's "let's wait a few years and see if 
the filesystem will take care of it" theory.  But for me, the "when do 
we get checksums?" clock started ticking in 2006 when ZFS popularized 
its implementation, and now it's gone off and it keeps ringing at new 
places.  I would love it if FreeBSD had caught a massive popularity 
wave in the last few years, so ZFS was running in a lot more places. 
Instead, what I keep seeing is Linux deployments with filesystem 
choices skewed toward the conservative.  Forget about the leading 
edge--I'd be happy if I could get one large customer to migrate off of 
ext3...


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Craig Ringer
On 03/07/2013 08:41 AM, Andres Freund wrote:
 On 2013-03-07 08:37:40 +0800, Craig Ringer wrote:
 On 03/06/2013 07:34 PM, Heikki Linnakangas wrote:
 It'd be difficult to change the algorithm in a future release without
 breaking on-disk compatibility,
 On-disk compatibility is broken with major releases anyway, so I don't
 see this as a huge barrier.
 Uh, pg_upgrade?
Yeah. I was thinking that pg_upgrade copes with a lot of
incompatibilities already, but this is lower-level. Darn.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Jim Nasby

On 3/4/13 7:04 PM, Daniel Farina wrote:

Corruption has easily occupied more than one person-month of time last
year for us.


Just FYI for anyone that's experienced corruption... we've looked into doing 
row-level checksums at work. The only challenge we ran into was how to check 
them when reading data back. I don't remember the details but there was an 
issue with doing this via SELECT rules. It would be possible if you were 
willing to put writable views on all your tables (which isn't actually as 
horrible as it sounds; it wouldn't be hard to write a function to automagically 
do that for you).


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Jim Nasby

On 3/6/13 1:14 PM, Josh Berkus wrote:



There may be good reasons to reject this patch.  Or there may not.
But I completely disagree with the idea that asking them to solve the
problem at the filesystem level is sensible.


Yes, can we get back to the main issues with the patch?

1) argument over whether the checksum is sufficient to detect most
errors, or if it will give users false confidence.

2) performance overhead.

Based on Smith's report, I consider (2) to be a deal-killer right now.
The level of overhead reported by him would prevent the users I work
with from ever employing checksums on production systems.


FWIW, the write workload most likely wouldn't be a problem for us. I am 
concerned about the reported 24-32% hit when reading back in from FS cache... 
that might kill this for us.

I'm working on doing a test to see how bad it actually is for us... but getting 
stuff like that done at work is like pulling teeth, so we'll see...


Specifically, writing checksums during a read-only query is a defect I
think is prohibitively bad.  When we first talked about this feature for
9.2, we were going to exclude hint bits from checksums, in order to
avoid this issue; what happened to that?

(FWIW, I still support the idea of moving hint bits to a separate
filehandle, as we do with the FSM, but clearly that's not happening for
9.3 ...)


+1


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith

On 3/6/13 6:34 AM, Heikki Linnakangas wrote:

Another thought is that perhaps something like CRC32C would be faster to
calculate on modern hardware, and could be safely truncated to 16-bits
using the same technique you're using to truncate the Fletcher's
Checksum. Greg's tests showed that the overhead of CRC calculation is
significant in some workloads, so it would be good to spend some time to
optimize that. It'd be difficult to change the algorithm in a future
release without breaking on-disk compatibility, so let's make sure we
pick the best one.


Simon sent over his first rev of this using a quick-to-compute 16-bit 
checksum as a reasonable trade-off, one that it's possible to do right 
now.  It's not optimal in a few ways, but it catches single-bit errors 
that are missed right now, and Fletcher-16 computes quickly and without 
a large amount of code.  It's worth double-checking that the code is 
using the best Fletcher-16 approach available.  I've started on that, 
but I'm working on your general performance concerns first, with the 
implementation that's already there.


From what I've read so far, I think picking Fletcher-16 instead of the 
main alternative, CRC-16-IBM AKA CRC-16-ANSI, is a reasonable choice. 
There's a good table showing the main possibilities here at 
https://en.wikipedia.org/wiki/Cyclic_redundancy_check


One day I hope that in-place upgrade learns how to do page format 
upgrades, with the sort of background conversion tools and necessary 
tracking metadata we've discussed for that work.  When that day comes, I 
would expect it to be straightforward to upgrade pages from 16 bit 
Fletcher checksums to 32 bit CRC-32C ones.  Ideally we would be able to 
jump on the CRC-32C train today, but there's nowhere to put all 32 bits. 
 Using a Fletcher 16 bit checksum for 9.3 doesn't prevent the project 
from going that way later though, once page header expansion is a solved 
problem.


The problem with running CRC32C in software is that the standard fast 
approach uses a "slicing" technique that requires a chunk of 
pre-computed data to be around, a moderately large lookup table.  I don't 
see that there's any advantage to having all that baggage around if 
you're just going to throw away half of the result anyway.  More on 
CRC32Cs in my next message.
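
[For the archives, a byte-at-a-time sketch of that table-driven approach --
illustrative only: one 256-entry table derived from the reflected Castagnoli
polynomial.  The slicing variants extend this idea with four or eight such
tables so several input bytes are processed per iteration:]

    #include <stdint.h>
    #include <stddef.h>

    static uint32_t crc32c_table[256];

    /* Build the lookup table from the reflected Castagnoli polynomial.
     * Call once before crc32c_sw(). */
    static void crc32c_init(void)
    {
        uint32_t i;

        for (i = 0; i < 256; i++)
        {
            uint32_t crc = i;
            int j;

            for (j = 0; j < 8; j++)
                crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
            crc32c_table[i] = crc;
        }
    }

    /* One table lookup per input byte; the table is the pre-computed
     * baggage being discussed above. */
    static uint32_t crc32c_sw(const unsigned char *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFF;

        while (len-- > 0)
            crc = crc32c_table[(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
        return ~crc;
    }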


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith

On 3/6/13 1:24 PM, Tom Lane wrote:

Andres Freund <and...@2ndquadrant.com> writes:

On 2013-03-06 11:21:21 -0500, Garick Hamlin wrote:

If picking a CRC why not a short optimal one rather than truncate CRC32C?



CRC32C is available in hardware since SSE4.2.


I think that should be at most a fourth-order consideration, since we
are not interested solely in Intel hardware, nor do we have any portable
way of getting at such a feature even if the hardware has it.


True, but that situation might actually improve.

The Castagnoli CRC-32C that's accelerated on the better Intel CPUs is 
also used to protect iSCSI and SCTP (a streaming protocol).  And there 
is an active project to use a CRC32C to checksum ext4 metadata blocks on 
Linux:  https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums 
https://groups.google.com/forum/?fromgroups=#!topic/linux.kernel/APKfoMzjgdY


Now, that project doesn't make the Postgres feature obsolete, because 
there's nowhere to put checksum data for every block on ext4 without 
whacking block alignment.  The filesystem can't make an extra 32 bits 
appear on every block any more than we can.  It's using a similar trick 
to the PG checksum feature, grabbing some empty space just for the 
metadata then shoving the CRC32C into there.  But the fact that this is 
going on means that there are already Linux kernel modules built with 
both software/hardware accelerated versions of the CRC32C function.  And 
the iSCSI/SCTP use cases mean it's not out of the question this will 
show up in other useful forms one day.  Maybe two years from now, there 
will be a common Linux library that autoconf can find to compute the CRC 
for us--with hardware acceleration when available, in software if not.


The first of those ext4 links above even discusses the exact sort of 
issue we're facing.  The author wonders if the easiest way to proceed 
for 16 bit checksums is to compute the CRC32C, then truncate it, simply 
because CRC32C creation is so likely to get hardware help one day.  I 
think that logic doesn't really apply to the PostgreSQL case as strongly 
though, as the time before we can expect a hardware-accelerated 
version to be available is much further off for us than it is for a 
Linux kernel developer.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Stark
On Wed, Mar 6, 2013 at 11:04 PM, Robert Haas <robertmh...@gmail.com> wrote:
 When we first talked about this feature for
 9.2, we were going to exclude hint bits from checksums, in order to
 avoid this issue; what happened to that?

 I don't think anyone ever thought that was a particularly practical
 design.  I certainly don't.

Really? I thought it was pretty much the consensus for a good while.

The main problem it ran into was that we kept turning up hint bits
that we didn't realize we had. Index line pointers turned out to have
hint bits, page headers have one, and so on. As long as it was just
the heap page per-tuple transaction hint bits it seemed plausible to
just skip them or move them all to a contiguous block. Once it
started to look like the checksumming code had to know about every
data structure on every page it seemed a bit daunting. But that wasn't
something we realized for quite a long time.



-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith
TL;DR summary:  on a system I thought was a fair middle of the road 
server, pgbench tests are averaging about a 2% increase in WAL writes 
and a 2% slowdown when I turn on checksums.  There are a small number of 
troublesome cases where that overhead rises to closer to 20%, an upper 
limit that's shown up in a few tests aiming to stress this feature now.


On 3/4/13 10:09 PM, Jeff Davis wrote:

= Test 2 - worst-case overhead for calculating checksum while reading data =

Jeff saw an 18% slowdown, I get 24 to 32%.  This one bothers me because
the hit is going to happen during the very common situation where data
is shuffling a lot between a larger OS cache and shared_buffers taking a
relatively small fraction.


I believe that test 1 and test 2 can be improved a little, if there is a
need. Right now we copy the page and then calculate the checksum on the
copy. If we instead calculate as we're copying, I believe it will make
it significantly faster.


It's good to know there's at least some ideas for optimizing this one 
further.  I think the situation where someone has:


shared_buffers < database < total RAM

is fairly common for web applications.  For people on Amazon EC2 
instances for example, giving out the performance tuning advice of "get 
a bigger instance until the database fits in RAM" works amazingly well. 
 If the hotspot of that data set fits in shared_buffers, those people 
will still be in good shape even with checksums enabled.  If the hot 
working set is spread out more randomly, though, it's not impossible to 
see how they could suffer regularly from this ~20% OS cache <-> 
shared_buffers movement penalty.


Regardless, Jeff's three cases are good synthetic exercises to see 
worst-case behavior, but they are magnifying small differences.  To see 
a more general case, I ran through a series of pgbench tests in its 
standard write mode.  In order to be useful, I ended up using a system 
with a battery-backed write cache, but with only a single drive 
attached.  I needed fsync to be fast to keep that from being the 
bottleneck.  But I wanted physical I/O to be slow.  I ran three test 
sets at various size/client loads:  one without the BBWC (which I kept 
here because it gives some useful scale to the graphs), one with the 
baseline 9.3 code, and one with checksums enabled on the cluster.  I did 
only basic postgresql.conf tuning:


 checkpoint_segments | 64
 shared_buffers      | 2GB

There's two graphs comparing sets attached, you can see that the 
slowdown of checksums for this test is pretty minor.  There is a clear 
gap between the two plots, but it's not a very big one, especially if 
you note how much difference a BBWC makes.


I put the numeric results into a spreadsheet, also attached.  There's so 
much noise in pgbench results that I found it hard to get a single 
number for the difference; they bounce around about +/-5% here. 
Averaging across everything gives a solid 2% drop when checksums are on 
which looked detectable above the noise.


Things are worse on the bigger data sets.  At the highest size I tested, 
the drop was more like 7%.  The two larger size / low client count 
results I got were really bad, 25% and 16% drops.  I think this is 
closing in on the range of things:  perhaps only 2% when most of your 
data fits in shared_buffers, more like 10% if your database is bigger, 
and in the worst case 20% is possible.  I don't completely trust those 
25/16% numbers though, I'm going to revisit that configuration.


The other thing I track now in pgbench-tools is how many bytes of WAL 
are written.  Since the total needs to be measured relative to work 
accomplished, the derived number that looks useful there is average 
bytes of WAL per transaction.  On smaller database this is around 6K, 
while larger databases topped out for me at around 22K WAL 
bytes/transaction.  Remember that the pgbench transaction is several 
statements.  Updates touch different blocks in pgbench_accounts, index 
blocks, and the small tables.


The WAL increase from checksumming is a bit more consistent than the TPS 
rates.  Many cases were 3 to 5%.  There was one ugly case where it hit 
30%, and I want to dig into where that came from more.  On average, 
again it was a 2% increase over the baseline.


Cases where you spew hint bit WAL data where before none were written 
(Jeff's test #3) remain a far worse performer than any of these.  Since 
pgbench does a VACUUM before starting, none of those cases were 
encountered here though.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
[Attachments: clients-sets.png, scaling-sets.png, Checksum-pgbench.xls (MS-Excel spreadsheet)]

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Daniel Farina
On Wed, Mar 6, 2013 at 8:17 PM, Greg Smith <g...@2ndquadrant.com> wrote:
 TL;DR summary:  on a system I thought was a fair middle of the road server,
 pgbench tests are averaging about a 2% increase in WAL writes and a 2%
 slowdown when I turn on checksums.  There are a small number of troublesome
 cases where that overhead rises to closer to 20%, an upper limit that's
 shown up in a few tests aiming to stress this feature now.

I have only done some cursory research, but CPU time of 20% seems to be
expected for InnoDB's CRC computation [0].  Although a galling number,
this comparison with other systems may be a way to see how much of
that overhead is avoidable or just the price of entry.  It's unclear
how this 20% CPU time compares to your whole-system results above, but
it's enough to suggest that nothing comes for (nearly) free.

[0]: http://mysqlha.blogspot.com/2009/05/innodb-checksum-performance.html

--
fdr


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith

On 3/7/13 12:15 AM, Daniel Farina wrote:

I have only done some cursory research, but CPU time of 20% seems to be
expected for InnoDB's CRC computation [0].  Although a galling number,
this comparison with other systems may be a way to see how much of
that overhead is avoidable or just the price of entry.  It's unclear
how this 20% CPU time compares to your whole-system results above, but
it's enough to suggest that nothing comes for (nearly) free.


That does provide a useful measuring point:  how long does the 
computation take compared to the memcpy that moves the buffer around. 
It looks like they started out at 3.2 times the cost of a memcpy, and 
with enough optimization ended up at 1.27 times.


The important thing to keep in mind is that shared_buffers works pretty 
well at holding on to the most frequently accessed information.  A 
typical server I see will show pg_statio information suggesting 90%+ of 
block requests are coming from hits there, the rest misses suggesting a 
mix of OS cache and real disk reads.  Let's say 90% are hits, 5% are 
fetches at this 20% penalty, and 5% are real reads where the checksum 
time is trivial compared to physical disk I/O.  That works out to be a 
real average slowdown of 6%.  I think way more deployments are going to 
be like that case, which matches most of my pgbench runs, than the worse 
case workloads.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-05 Thread Simon Riggs
On 5 March 2013 01:04, Daniel Farina <dan...@heroku.com> wrote:

 Corruption has easily occupied more than one person-month of time last
 year for us.  This year to date I've burned two weeks, although
 admittedly this was probably the result of statistical clustering.
 Other colleagues of mine have probably put in a week or two in
 aggregate in this year to date.  The ability to quickly, accurately,
 and maybe at some later date proactively finding good backups to run
 WAL recovery from is one of the biggest strides we can make in the
 operation of Postgres.  The especially ugly cases are where the page
 header is not corrupt, so full page images can carry along malformed
 tuples...basically, when the corruption works its way into the WAL,
 we're in much worse shape.  Checksums would hopefully prevent this
 case, converting them into corrupt pages that will not be modified.

 It would be better yet if I could write tools to find the last-good
 version of pages, and so I think tight integration with Postgres will
 see a lot of benefits that would be quite difficult and non-portable
 when relying on file system checksumming.

 You are among the most well-positioned to make assessments of the cost
 of the feature, but I thought you might appreciate a perspective of
 the benefits, too.  I think they're large, and for me they are the
 highest pole in the tent for what makes Postgres stressful to operate
 as-is today.  It's a testament to the quality of the programming in
 Postgres that Postgres programming error is not the largest problem.

That's good perspective.

I think we all need to be clear that committing this patch also
commits the community (via the committer) to significant work and
responsibility around this, and my minimum assessment of it is 1 month
per year for 3-5 years, much of that on the committer. In effect
this will move time and annoyance experienced by users of Postgres
back onto developers of Postgres. That is where it should be, but the
effect will be large and easily noticeable, IMHO.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-05 Thread Heikki Linnakangas

On 04.03.2013 09:11, Simon Riggs wrote:

On 3 March 2013 18:24, Greg Smith <g...@2ndquadrant.com> wrote:


The 16-bit checksum feature seems functional, with two sources of overhead.
There's some CPU time burned to compute checksums when pages enter the
system.  And there's extra overhead for WAL logging hint bits.  I'll
quantify both of those better in another message.


It's crunch time. Do you and Jeff believe this patch should be
committed to Postgres core?

Are there objectors?


In addition to my hostility towards this patch in general, there are 
some specifics in the patch I'd like to raise (read out in a grumpy voice):


If you enable checksums, the free space map never gets updated in a 
standby. It will slowly drift to be completely out of sync with reality, 
which could lead to significant slowdown and bloat after failover.


Since the checksums are an all-or-nothing cluster-wide setting, the 
three extra flags in the page header, PD_CHECKSUMS1, PD_CHECKSUM2 and 
PD_HEADERCHECK, are not needed. Let's leave them out. That keeps the 
code simpler, and leaves the bits free for future use. If we want to 
enable such per-page setting in the future, we can add it later. For a 
per-relation scheme, they're not needed.



+ * The checksum algorithm is a modified Fletcher 64-bit (which is
+ * order-sensitive). The modification is because, at the end, we have two
+ * 64-bit sums, but we only have room for a 16-bit checksum. So, instead of
+ * using a modulus of 2^32 - 1, we use 2^8 - 1; making it also resemble a
+ * Fletcher 16-bit. We don't use Fletcher 16-bit directly, because processing
+ * single bytes at a time is slower.


How does the error detection rate of this compare with e.g CRC-16? Is 
there any ill effect from truncating the Fletcher sums like this?



+   /*
+    * Store the sums as bytes in the checksum. We add one to shift the range
+    * from 0..255 to 1..256, to make zero invalid for checksum bytes (which
+    * seems wise).
+    */
+   p8Checksum[0] = (sum1 % 255) + 1;
+   p8Checksum[1] = (sum2 % 255) + 1;


That's a bit odd. We don't avoid zero in the WAL CRC, and I don't recall 
seeing that in other checksum implementations either. 16-bits is not 
very wide for a checksum, and this eats about 1% of the space of valid 
values.


I can see that it might be a handy debugging aid to avoid 0. But there's 
probably no need to avoid 0 in both bytes, it seems enough to avoid a 
completely zero return value.


XLogCheckBuffer() and XLogCheckBufferNeedsBackup() read the page LSN 
without a lock. That's not atomic, so it could incorrectly determine 
that a page doesn't need to be backed up. We used to always hold an 
exclusive lock on the buffer when it's called, which prevents 
modifications to the LSN, but that's no longer the case.


Shouldn't SetBufferCommitInfoNeedsSave() check the BM_PERMANENT flag? I 
think it will generate WAL records for unlogged tables as it is.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-05 Thread Jeff Davis

Thank you for the review.

On Tue, 2013-03-05 at 11:35 +0200, Heikki Linnakangas wrote:
 If you enable checksums, the free space map never gets updated in a 
 standby. It will slowly drift to be completely out of sync with reality, 
 which could lead to significant slowdown and bloat after failover.

Will investigate.

 Since the checksums are an all-or-nothing cluster-wide setting, the 
 three extra flags in the page header, PD_CHECKSUMS1, PD_CHECKSUM2 and 
 PD_HEADERCHECK, are not needed. Let's leave them out. That keeps the 
 code simpler, and leaves the bits free for future use. If we want to 
 enable such per-page setting in the future, we can add it later. For a 
 per-relation scheme, they're not needed.

They don't really need to be there, I just put them there because it
seemed wise if we ever want to allow online enabling/disabling of
checksums. But I will remove them.

 How does the error detection rate of this compare with e.g CRC-16? Is 
 there any ill effect from truncating the Fletcher sums like this?

I don't recall if I published these results or not, but I loaded a
table, and used pageinspect to get the checksums of the pages. I then
ran various GROUP BY queries to see if I could find any clustering
or stepping of the checksum values, and I could not. The distribution
seemed very uniform across the 255^2 space.

I tried to think of other problems, like missing errors in the high or
low bits of a word or a page (similar to the issue with mod 256
described below), but I couldn't find any. I'm not enough of an expert
to say more than that about the error detection rate.

Fletcher is probably significantly faster than CRC-16, because I'm just
doing int32 addition in a tight loop.
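
[As an illustration of the loop being described -- a minimal sketch, not the
patch's actual code -- assuming the page is treated as an array of 32-bit
words; for an 8K page both sums fit comfortably in 64 bits:]

    #include <stdint.h>
    #include <stddef.h>

    /* Modified Fletcher over a page of 32-bit words: plain integer
     * additions inside the loop, with the mod-255 folding deferred to
     * the end.  The second sum makes the result order-sensitive. */
    uint16_t fletcher_page_sketch(const uint32_t *words, size_t nwords)
    {
        uint64_t sum1 = 0, sum2 = 0;
        size_t i;

        for (i = 0; i < nwords; i++)
        {
            sum1 += words[i];   /* no modulus inside the loop */
            sum2 += sum1;
        }
        /* Fold each sum into 1..255 so neither checksum byte is zero. */
        return (uint16_t) (((sum1 % 255) + 1) |
                           (((sum2 % 255) + 1) << 8));
    }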

Simon originally chose Fletcher, so perhaps he has more to say.

 That's a bit odd. We don't avoid zero in the WAL crc, and I don't recall 
 seeing that in other checksum implementations either. 16-bits is not 
 very wide for a checksum, and this eats about 1% of the space of valid 
 values.
 
 I can see that it might be a handy debugging aid to avoid 0. But there's 
 probably no need to avoid 0 in both bytes, it seems enough to avoid a 
 completely zero return value.

http://en.wikipedia.org/wiki/Fletcher%27s_checksum

If you look at the section on Fletcher-16, it discusses the choice of
the modulus. If we used 256, then an error anywhere except the lowest
byte of a 4-byte word read from the page would be missed.
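
[Worked out, for reference: the sums are taken over 32-bit words, and since
2^8 = 256 ≡ 1 (mod 255), we also have 2^16 ≡ 1 and 2^24 ≡ 1 (mod 255), so
every byte of a word contributes to the sum after the mod-255 reduction.
With a modulus of 256 instead, 2^8 ≡ 0, and only the lowest byte of each
word would survive the reduction.]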

Considering that I was using only 255 values anyway, I thought I might
as well shift the values away from zero.

We could get slightly better by using all combinations. I also
considered chopping the 64-bit ints into 16-bit chunks and XORing them
together. But when I saw the fact that we avoided zero with the other
approach, I kind of liked it, and kept it.

 XLogCheckBuffer() and XLogCheckBufferNeedsBackup() read the page LSN 
 without a lock. That's not atomic, so it could incorrectly determine 
 that a page doesn't need to be backed up. We used to always hold an 
 exclusive lock on the buffer when it's called, which prevents 
 modifications to the LSN, but that's no longer the case.

Will investigate, but it sounds like a buffer header lock will fix it.

 Shouldn't SetBufferCommitInfoNeedsSave() check the BM_PERMANENT flag? I 
 think it will generate WAL records for unlogged tables as it is.

Yes, thank you.

Also, in FlushBuffer(), this patch moves the clearing of the
BM_JUST_DIRTIED bit to before the WAL flush. That seems to expand the
window during which a change to a page will prevent it from being marked
clean. Do you see any performance problem with that?

The alternative is to take the buffer header lock twice: once to get the
LSN, then WAL flush, then another header lock to clear BM_JUST_DIRTIED.
Not sure if that's better or worse. This goes back to Simon's patch, so
he may have a comment here, as well.

I'll post a new patch with these comments addressed, probably tomorrow
so that I have some time to self-review and do some basic testing.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 09:11, Simon Riggs wrote:

Are there objectors?


FWIW, I still think that checksumming belongs in the filesystem, not 
PostgreSQL. If you go ahead with this anyway, at the very least I'd like 
to see some sort of a comparison with e.g btrfs. How do performance, 
error-detection rate, and behavior on error compare? Any other metrics 
that are relevant here?


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 10:36 +0200, Heikki Linnakangas wrote:
 On 04.03.2013 09:11, Simon Riggs wrote:
  Are there objectors?
 
 FWIW, I still think that checksumming belongs in the filesystem, not 
 PostgreSQL.

Doing checksums in the filesystem has some downsides. One is that you
need to use a copy-on-write filesystem like btrfs or zfs, which (by
design) will fragment the heap on random writes. If we're going to start
pushing people toward those systems, we will probably need to spend some
effort to mitigate this problem (aside: my patch to remove
PD_ALL_VISIBLE might get some new wind behind it).

There are also other issues, like what fraction of our users can freely
move to btrfs, and when. If it doesn't happen to be already there, you
need root to get it there, which has never been a requirement before.

I don't fundamentally disagree. We probably need to perform reasonably
well on btrfs in COW mode[1] regardless, because a lot of people will be
using it a few years from now. But there are a lot of unknowns here, and
I'm concerned about tying checksums to a series of things that will be
resolved a few years from now, if ever.

[1] Interestingly, you can turn off COW mode on btrfs, but you lose
checksums if you do.

  If you go ahead with this anyway, at the very least I'd like 
 to see some sort of a comparison with e.g btrfs. How do performance, 
 error-detection rate, and behavior on error compare? Any other metrics 
 that are relevant here?

I suspect it will be hard to get an apples-to-apples comparison here
because of the heap fragmentation, which means that a sequential scan is
not so sequential. That may be acceptable for some workloads but not for
others, so it would get tricky to compare. And any performance numbers
from an experimental filesystem are somewhat suspect anyway.

Also, it's a little more challenging to test corruption on a filesystem,
because you need to find the location of the file you want to corrupt,
and corrupt it out from underneath the filesystem.

Greg may have more comments on this matter.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Greg Smith

On 3/4/13 2:11 AM, Simon Riggs wrote:

It's crunch time. Do you and Jeff believe this patch should be
committed to Postgres core?


I want to see a GUC to allow turning this off, to avoid the problem I 
saw where a non-critical header corruption problem can cause an entire 
page to be unreadable.  A variation on that capable of turning this off 
altogether, as Craig suggested, is a good idea too.


Those are both simple fixes, and I would be pleased to see this 
committed at that point.


I'll write up a long discussion of filesystem trends and why I think 
this is more relevant than ever if that's the main objection now.  There 
is no such thing as a stable release of btrfs, and no timetable for when 
there will be one.  I could do some benchmarks of that but I didn't 
think they were very relevant.  Who cares how fast something might run 
when it may not work correctly?  btrfs might as well be /dev/null to me 
right now--sure it's fast, but maybe the data won't be there at all. 
How long has it taken the Linux kernel to reach the point it handles 
write barriers and fsync correctly?  It does not give me a lot of 
confidence that now is the time they'll suddenly start executing on 
database filesystem mechanics perfectly.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-02-25 at 01:30 -0500, Greg Smith wrote:
 Attached are some bit-rot updates to the checksums patches.  The 
 replace-tli one still works fine.  I fixed a number of conflicts in the 
 larger patch.  The one I've attached here isn't 100% to project 
 standards--I don't have all the context diff tools setup yet for 
 example.  I expect to revise this more now that I've got the whole week 
 cleared to work on CF submissions.

Thank you for the rebase. I redid the rebase myself and came up with
essentially the same result, but there was an additional problem that
needed fixing after the materialized view patch. 

I will post a new version tonight that includes those fixes as well as
something to address these recent comments (probably just another GUC).
Further comment in another reply.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 20:58, Greg Smith wrote:

There
is no such thing as a stable release of btrfs, and no timetable for when
there will be one. I could do some benchmarks of that but I didn't think
they were very relevant. Who cares how fast something might run when it
may not work correctly? btrfs might as well be /dev/null to me right
now--sure it's fast, but maybe the data won't be there at all.


This PostgreSQL patch hasn't seen any production use, either. In fact, 
I'd consider btrfs to be more mature than this patch. Unless you think 
that there will be some major changes for the worse in btrfs 
performance, it's perfectly valid and useful to compare the two.


A comparison with ZFS would be nice too. That's mature, and has checksums.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 11:52 +0800, Craig Ringer wrote:
 I also suspect that at least in the first release it might be desirable
 to have an option that essentially says "something's gone horribly wrong
 and we no longer want to check or write checksums; we want a
 non-checksummed DB that can still read our data from before we turned
 checksumming off."  Essentially, a way for someone who's trying
 checksumming in production after their staging tests worked out OK to
 abort and go back to the non-checksummed case without having to do a
 full dump and reload.

A recovery option to extract data sounds like a good idea, but I don't
want to go as far as you are suggesting here.

An option to ignore checksum failures (while still printing the
warnings) sounds like all we need here. I think Greg's point that the
page might be written out again (hiding the corruption) is a very good
one, but the same is true for zero_damaged_pages. So we can just still
allow the writes to proceed (including setting the checksum on write),
and the system should be as available as it would be without checksums.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 22:13 +0200, Heikki Linnakangas wrote:
 On 04.03.2013 20:58, Greg Smith wrote:
  There
  is no such thing as a stable release of btrfs, and no timetable for when
  there will be one. I could do some benchmarks of that but I didn't think
  they were very relevant. Who cares how fast something might run when it
  may not work correctly? btrfs might as well be /dev/null to me right
  now--sure it's fast, but maybe the data won't be there at all.
 
 This PostgreSQL patch hasn't seen any production use, either. In fact, 
 I'd consider btrfs to be more mature than this patch. Unless you think 
 that there will be some major changes for the worse in btrfs 
 performance, it's perfectly valid and useful to compare the two.
 
 A comparison with ZFS would be nice too. That's mature, and has checksums.

Is there any reason why we can't have both postgres and filesystem
checksums? The same user might not want both (or might, if neither are
entirely trustworthy yet), but I think it's too early to declare one as
the "right" solution and the other not. Even with a stable btrfs, I
pointed out a number of reasons users might not want it, and reasons
that the project should not depend on it.

Numbers are always nice, but it takes a lot of effort to come up with
them. What kind of numbers are you looking for, and how *specifically*
will those numbers affect the decision?

If btrfs with checksums is 10% slower than ext4 with postgres checksums,
does that mean we should commit the postgres checksums?

On the other side of the coin, if btrfs with checksums is exactly the
same speed as ext4 with no postgres checksums (i.e. checksums are free
if we use btrfs), does that mean postgres checksums should be rejected?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 18:00, Jeff Davis wrote:

On Mon, 2013-03-04 at 10:36 +0200, Heikki Linnakangas wrote:

On 04.03.2013 09:11, Simon Riggs wrote:

Are there objectors?


FWIW, I still think that checksumming belongs in the filesystem, not
PostgreSQL.


Doing checksums in the filesystem has some downsides. One is that you
need to use a copy-on-write filesystem like btrfs or zfs, which (by
design) will fragment the heap on random writes.


Yeah, fragmentation will certainly hurt some workloads. But how badly, 
and which workloads, and how does that compare with the work that 
PostgreSQL has to do to maintain the checksums? I'd like to see some 
data on those things.



There are also other issues, like what fraction of our users can freely
move to btrfs, and when. If it doesn't happen to be already there, you
need root to get it there, which has never been a requirement before.


If you're serious enough about your data that you want checksums, you 
should be able to choose your filesystem.



  If you go ahead with this anyway, at the very least I'd like
to see some sort of a comparison with e.g btrfs. How do performance,
error-detection rate, and behavior on error compare? Any other metrics
that are relevant here?


I suspect it will be hard to get an apples-to-apples comparison here
because of the heap fragmentation, which means that a sequential scan is
not so sequential. That may be acceptable for some workloads but not for
others, so it would get tricky to compare.


An apples-to-apples comparison is to run the benchmark and see what 
happens. If it gets fragmented as hell on btrfs, and performance tanks 
because of that, then that's your result. If avoiding fragmentation is 
critical to the workload, then with btrfs you'll want to run the 
defragmenter in the background to keep it in order, and factor that into 
the test case.


I realize that performance testing is laborious. But we can't skip it 
and assume that the patch performs fine, because it's hard to benchmark.


- Heikki




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Sun, 2013-03-03 at 22:18 -0500, Greg Smith wrote:
 As for a design of a GUC that might be useful here, the option itself 
 strikes me as being like archive_mode in its general use.  There is an 
 element of parameters like wal_sync_method or enable_cassert though, 
 where the options available vary depending on how you built the cluster. 
   Maybe name it checksum_level with options like this:
 
 off:  only valid option if you didn't enable checksums with initdb
 enforcing:  full checksum behavior as written right now.
 unvalidated:  broken checksums on reads are ignored.

I think GUCs should be orthogonal to initdb settings. If nothing else,
it's extra effort to get initdb to write the right postgresql.conf.

A single new GUC that prevents checksum failures from causing an error
seems sufficient to address the concerns you, Dan, and Craig raised.

We would still calculate the checksum and print the warning; and then
pass it through the rest of the header checks. If the header checks
pass, then it proceeds. If the header checks fail, and if
zero_damaged_pages is off, then it would still generate an error (as
today).

So: ignore_checksum_failures = on|off ?
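
To make that concrete, the intended read-path logic would look roughly
like this (a sketch with hypothetical names, not the actual patch code):

    /* Sketch of the proposed behavior; names are illustrative. */
    if (!PageChecksumIsValid(page, blkno))      /* hypothetical check */
    {
        if (ignore_checksum_failures)
            ereport(WARNING,
                    (errmsg("page verification failed, continuing anyway")));
        else
            ereport(ERROR,
                    (errmsg("page verification failed")));
    }

    /* The pre-existing header sanity checks still apply either way. */
    if (!PageHeaderIsValid(page))
    {
        if (zero_damaged_pages)
        {
            ereport(WARNING, (errmsg("invalid page header; zeroing page")));
            MemSet(page, 0, BLCKSZ);
        }
        else
            ereport(ERROR, (errmsg("invalid page header")));
    }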

 The main tricky case I see in that is where you read in a page with a 
 busted checksum using "unvalidated".  Ideally you wouldn't write such a 
 page back out again, because it's going to hide that it's corrupted in 
 some way already.  How to enforce that though?  Perhaps "unvalidated" 
 could only be allowed in a read-only transaction?

That's a good point. But we already have zero_damaged_pages, which does
something similar. And it's supposed to be a recovery option to get the
data out rather than something to run in online mode. It will still
print the warning, so it won't completely hide the corruption.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 13:58 -0500, Greg Smith wrote:
 On 3/4/13 2:11 AM, Simon Riggs wrote:
  It's crunch time. Do you and Jeff believe this patch should be
  committed to Postgres core?
 
 I want to see a GUC to allow turning this off, to avoid the problem I 
 saw where a non-critical header corruption problem can cause an entire 
 page to be unreadable.  A variation on that capable of turning this off 
 altogether, as Craig suggested, is a good idea too.

Based on your comments as well those of Dan and Craig, I am leaning
toward a GUC that causes a checksum failure to be ignored. It will still
emit the checksum failure warning, but proceed.

That will then fall through to the normal header checks we've always
had, and the same zero_damaged_pages option. So, to get past a really
corrupt page, you'd need to set ignore_checksum_failure and
zero_damaged_pages.

 I'll write up a long discussion of filesystem trends and why I think 
 this is more relevant than ever if that's the main objection now.  There 
 is no such thing as a stable release of btrfs, and no timetable for when 
 there will be one.  I could do some benchmarks of that but I didn't 
 think they were very relevant.  Who cares how fast something might run 
 when it may not work correctly?  btrfs might as well be /dev/null to me 
 right now--sure it's fast, but maybe the data won't be there at all. 
 How long has it taken the Linux kernel to reach the point it handles 
 write barriers and fsync correctly?  It does not give me a lot of 
 confidence that now is the time they'll suddenly start executing on 
 database filesystem mechanics perfectly.

I have a similar viewpoint here. It will take significant effort to come
up with anything, and I'm not sure how meaningful the numbers would be.
Even if btrfs is great, this feature is not mutually exclusive with
btrfs:
  * users might not have easy access to run the filesystem
  * they might not trust it
  * they might get poor performance numbers
  * postgres checksums might provide a good test of btrfs checksums, and
vice-versa, until both are stable

Additionally, I don't like the idea of depending so heavily on what
linux is doing. If there are performance problems that affect postgres,
will they fix them? Will they introduce new ones? Are there a zillion
tuneable options that a new user has to get right in order to run
postgres efficiently, and will poor settings mean a bunch of "postgres
is slow" blog posts?

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jim Nasby

On 3/4/13 10:00 AM, Jeff Davis wrote:

On Mon, 2013-03-04 at 10:36 +0200, Heikki Linnakangas wrote:

On 04.03.2013 09:11, Simon Riggs wrote:

 Are there objectors?


FWIW, I still think that checksumming belongs in the filesystem, not
PostgreSQL.

Doing checksums in the filesystem has some downsides.


Additionally, no filesystem I'm aware of checksums the data in the filesystem 
cache. A PG checksum would.

I'll also mention that this debate has been had in the past. The time to object 
to the concept of a checksumming feature was a long time ago, before a ton of 
development effort went into this... :(




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jim Nasby

On 3/4/13 2:48 PM, Jeff Davis wrote:

On Mon, 2013-03-04 at 13:58 -0500, Greg Smith wrote:

On 3/4/13 2:11 AM, Simon Riggs wrote:

 It's crunch time. Do you and Jeff believe this patch should be
 committed to Postgres core?


I want to see a GUC to allow turning this off, to avoid the problem I
saw where a non-critical header corruption problem can cause an entire
page to be unreadable.  A variation on that capable of turning this off
altogether, as Craig suggested, is a good idea too.

Based on your comments as well those of Dan and Craig, I am leaning
toward a GUC that causes a checksum failure to be ignored. It will still
emit the checksum failure warning, but proceed.


I suggest we paint that GUC along the lines of checksum_failure_log_level, 
defaulting to ERROR. That way if someone wanted to completely bury the elogs down at, say, DEBUG, 
they could.

My $2.98 (inflation adjusted).




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote:
 Yeah, fragmentation will certainly hurt some workloads. But how badly, 
 and which workloads, and how does that compare with the work that 
 PostgreSQL has to do to maintain the checksums? I'd like to see some 
 data on those things.

I think we all would. Btrfs will be a major filesystem in a few years,
and we should be ready to support it.

Unfortunately, it's easier said than done. What you're talking about
seems like a significant benchmark report that encompasses a lot of
workloads. And there's a concern that a lot of it will be invalidated if
they are still improving the performance of btrfs.

 If you're serious enough about your data that you want checksums, you 
 should be able to choose your filesystem.

I simply disagree. I am targeting my feature at casual users. They may
not have a lot of data or a dedicated DBA, but the data they do have
might be very important transactional data.

And right now, if they take a backup of their data, it will contain all
of the corruption from the original. And since corruption is silent
today, then they would probably think the backup is fine, and may delete
the previous good backups.

 An apples-to-apples comparison is to run the benchmark and see what 
 happens. If it gets fragmented as hell on btrfs, and performance tanks 
 because of that, then that's your result. If avoiding fragmentation is 
 critical to the workload, then with btrfs you'll want to run the 
 defragmenter in the background to keep it in order, and factor that into 
 the test case.

Again, easier said than done. To get real fragmentation problems, the
data set needs to be huge, and we need to reach a steady state of this
background defrag process, and a million other things.

 I realize that performance testing is laborious. But we can't skip it 
 and assume that the patch performs fine, because it's hard to benchmark.

You aren't asking me to benchmark the patch in question. You are asking
me to benchmark a filesystem that very few people actually run postgres
on in production. I don't think that's a reasonable requirement.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 22:51, Jim Nasby wrote:

The time to
object to the concept of a checksumming feature was a long time ago,
before a ton of development effort went into this... :(


I did. Development went ahead anyway.

- Heikki




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 22:40, Jeff Davis wrote:

Is there any reason why we can't have both postgres and filesystem
checksums?


Of course not. But if we can get away without checksums in Postgres, 
that's better, because then we don't need to maintain that feature in 
Postgres. If the patch gets committed, it's not mission accomplished. 
There will be discussion and need for further development on things like 
what to do if you get a checksum failure, patches to extend the 
checksums to cover things like the clog and other non-data files and so 
forth. And it's an extra complication that will need to be taken into 
account when developing other new features; in particular, hint bit 
updates need to write a WAL record. Even if you have all the current 
hint bits covered, it's an extra hurdle for future patches that might 
want to have hint bits in e.g new index access methods.



The same user might not want both (or might, if neither are
entirely trustworthy yet), but I think it's too early to declare one as
the right solution and the other not. Even with btrfs stable, I
pointed out a number of reasons users might not want it, and reasons
that the project should not depend on it.


The PostgreSQL project would not be depending on it, any more than the 
project depends on filesystem snapshots for backup purposes, or the OS 
memory manager for caching.



Numbers are always nice, but it takes a lot of effort to come up with
them. What kind of numbers are you looking for, and how *specifically*
will those numbers affect the decision?


Benchmark of vanilla PostgreSQL, PostgreSQL + this patch, and PostgreSQL 
running on btrfs or ZFS with data checksums enabled. DBT-2 might be a 
good candidate, as it's I/O heavy. That would be a good general test; in 
addition it would be good to see a benchmark of the worst case scenario 
for the fragmentation you're expecting to see on btrfs, as well as a 
worst case scenario for the extra WAL traffic with the patch.



If btrfs with checksums is 10% slower than ext4 with postgres checksums,
does that mean we should commit the postgres checksums?


In my opinion, a 10% gain would not be worth it, and we should not 
commit in that case.



On the other side of the coin, if btrfs with checksums is exactly the
same speed as ext4 with no postgres checksums (i.e. checksums are free
if we use btrfs), does that mean postgres checksums should be rejected?


Yes, I think so. I'm sure at least some others will disagree; Greg 
already made it quite clear that he doesn't care how the performance of 
this compares with btrfs.


- Heikki




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread k...@rice.edu
On Mon, Mar 04, 2013 at 01:00:09PM -0800, Jeff Davis wrote:
 On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote:
  If you're serious enough about your data that you want checksums, you 
  should be able to choose your filesystem.
 
 I simply disagree. I am targeting my feature at casual users. They may
 not have a lot of data or a dedicated DBA, but the data they do have
 might be very important transactional data.
 
 And right now, if they take a backup of their data, it will contain all
 of the corruption from the original. And since corruption is silent
 today, then they would probably think the backup is fine, and may delete
 the previous good backups.
 
+1

There is no reasonable availability of checksum capable filesystems across
PostgreSQL's supported OSes. It really needs to be available in core.

Regards,
Ken




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 23:00, Jeff Davis wrote:

On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote:

Yeah, fragmentation will certainly hurt some workloads. But how badly,
and which workloads, and how does that compare with the work that
PostgreSQL has to do to maintain the checksums? I'd like to see some
data on those things.


I think we all would. Btrfs will be a major filesystem in a few years,
and we should be ready to support it.


Perhaps we should just wait a few years? If we suspect that this becomes 
obsolete in a few years, it's probably better to just wait, than add a 
feature we'll have to keep maintaining. Assuming it gets committed 
today, it's going to take a year or two for 9.3 to get released and all 
the bugs ironed out, anyway.


- Heikki




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jim Nasby

On 3/4/13 3:00 PM, Heikki Linnakangas wrote:

On 04.03.2013 22:51, Jim Nasby wrote:

The time to
object to the concept of a checksumming feature was a long time ago,
before a ton of development effort went into this... :(


I did. Development went ahead anyway.


Right, because the community felt that this was valuable enough to do 
regardless of things like FS checksumming. But now you're bringing the issue up 
yet again, this time after a large amount of time has been invested.

I know that you're doing what you feel is best for the project, but in this 
case the community didn't agree with your view. Raising the same objection at 
this point is not productive.




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 22:51, Jim Nasby wrote:

Additionally, no filesystem I'm aware of checksums the data in the
filesystem cache. A PG checksum would.


The patch says:


+ * IMPORTANT NOTE -
+ * The checksum is not valid at all times on a data page. We set it before we
+ * flush page/buffer, and implicitly invalidate the checksum when we modify the
+ * page. A heavily accessed buffer might then spend most of its life with an
+ * invalid page checksum, so testing random pages in the buffer pool will tell
+ * you nothing. The reason for this is that the checksum detects otherwise
+ * silent errors caused by the filesystems on which we rely. We do not protect
+ * buffers against uncorrectable memory errors, since these have a very low
+ * measured incidence according to research on large server farms,
+ * http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf, discussed 2010/12/22.


It's still true that it does in fact cover pages in the filesystem 
cache, but apparently that's not important.


- Heikki




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Stephen Frost
* Heikki Linnakangas (hlinnakan...@vmware.com) wrote:
 Perhaps we should just wait a few years? If we suspect that this
 becomes obsolete in a few years, it's probably better to just wait,
 than add a feature we'll have to keep maintaining. Assuming it gets
 committed today, it's going to take a year or two for 9.3 to get
 released and all the bugs ironed out, anyway.

For my 2c, I don't see it being obsolete in a few years, even if every
existing FS out there gets checksumming (which won't happen, imv).
It's quite clear that there is still ongoing development in the
filesystem space and any new software will have its own set of bugs.
Having a layer of protection built into PG will undoubtedly be a good
thing and will be used by our users.

Thanks,

Stephen




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Craig Ringer
On 03/05/2013 04:48 AM, Jeff Davis wrote:
 We would still calculate the checksum and print the warning; and then
 pass it through the rest of the header checks. If the header checks
 pass, then it proceeds. If the header checks fail, and if
 zero_damaged_pages is off, then it would still generate an error (as
 today).

 So: ignore_checksum_failures = on|off ?
That seems reasonable to me. It would be important to document clearly
in postgresql.conf and on the docs for the option that enabling this
option can "launder" data corruption, so that blocks that we suspected
were damaged are marked clean on rewrite. So long as that's clearly
documented I'm personally quite comfortable with your suggestion, since
my focus is just making sure I can get a DB back to a fully operational
state as quickly as possible when that's necessary.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Greg Smith

On 3/4/13 3:13 PM, Heikki Linnakangas wrote:

This PostgreSQL patch hasn't seen any production use, either. In fact,
I'd consider btrfs to be more mature than this patch. Unless you think
that there will be some major changes for the worse in performance in
btrfs, it's perfectly valid and useful to compare the two.


I think my last message came out with a bit more hostile attitude about 
this than I intended it to; sorry about that.  My problem with this idea 
comes from looking at the history of how Linux has failed to work 
properly before.  The best example I can point at is the one I 
documented at 
http://www.postgresql.org/message-id/4b512d0d.4030...@2ndquadrant.com 
along with this handy pgbench chart: 
http://www.phoronix.com/scan.php?page=article&item=ubuntu_lucid_alpha2&num=3


TPS on pgbench dropped from 1102 to about 110 after a kernel bug fix. 
It was 10X as fast in some kernel versions because fsync wasn't working 
properly.  Kernel filesystem issues have regularly resulted in data not 
being written to disk when it should have been, inflating the results 
accordingly.  "Fake" writes due to lying drives, write barriers that 
only actually work on server-class hardware, write barriers that don't 
work on md volumes, and then this one; it's a recurring pattern.  It's 
not the fault of the kernel developers, it's a hard problem and drive 
manufacturers aren't making it easy for them.


My concern, then, is that if the comparison target is btrfs performance, 
how do we know it's working reliably?  The track record says that bugs 
in this area usually inflate results, compared with a correct 
implementation.  You are certainly right that this checksum code is less 
mature than btrfs; it's just over a year old after all.  I feel quite 
good that it's not benchmarking faster than it really is, especially 
when I can directly measure how the write volume is increasing in the 
worst result.


I can't say that btrfs is slower or faster than it will eventually be 
due to bugs; I can't tell you the right way to tune btrfs for 
PostgreSQL; and I haven't even had anyone asking the question yet. 
Right now, the main thing I know about testing performance on Linux 
kernels new enough to support btrfs is that they're just generally slow 
running PostgreSQL.  See the multiple confirmed regression issues at 
http://www.postgresql.org/message-id/60b572d9298d944580f7d51195dd30804357fa4...@vmbx125.ihostexchange.net 
for example.  That new kernel mess needs to get sorted out too one day. 
 Why does database performance suck on kernel 3.2?  I don't know yet, 
but it doesn't help me get excited about assuming btrfs results will be 
useful.


ZFS was supposed to save everyone from worrying about corruption issues. 
 That didn't work out, I think due to the commercial agenda behind its 
development.  Now we have btrfs coming in some number of years, a 
project still tied more than I would like to Oracle.  I'm not too 
optimistic about that one either.  It doesn't help that now the original 
project lead, Chris Mason, has left there and is working at 
FusionIO--and that company's filesystem plans don't include 
checksumming, either.  (See 
http://www.fusionio.com/blog/under-the-hood-of-the-iomemory-sdk/ for a 
quick intro to what they're doing right now, which includes bypassing 
the Linux filesystem layer with their own flash optimized but POSIX 
compliant directFS)


There is an optimistic future path I can envision where btrfs matures 
quickly and in a way that performs well for PostgreSQL.  Maybe we'll end 
up there, and if that happens everyone can look back and say this was a 
stupid idea.  But there are a lot of other outcomes I see as possible 
here, and in all the rest of them having some checksumming capabilities 
available is a win.


One of the areas PostgreSQL has a solid reputation on is being trusted 
to run as reliably as possible.  All of the deployment trends I'm seeing 
have people moving toward less reliable hardware.  VMs, cloud systems, 
regular drives instead of hardware RAID, etc.  A lot of people badly 
want to leave behind the era of the giant database server, and have a 
lot of replicas running on smaller/cheaper systems instead.  There's a 
useful advocacy win for the project if lower grade hardware can be used 
to hit a target reliability level, with software picking up some of the 
error detection job instead.  Yes, it costs something in terms of future 
maintenance on the codebase, as new features almost invariably do.  If I 
didn't see being able to make noise about the improved reliability of 
PostgreSQL as valuable enough to consider it anyway, I wouldn't even be 
working on this thing.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jim Nasby

On 3/4/13 5:20 PM, Craig Ringer wrote:

On 03/05/2013 04:48 AM, Jeff Davis wrote:

We would still calculate the checksum and print the warning; and then
pass it through the rest of the header checks. If the header checks
pass, then it proceeds. If the header checks fail, and if
zero_damaged_pages is off, then it would still generate an error (as
today).

So: ignore_checksum_failures = on|off ?

That seems reasonable to me. It would be important to document clearly
in postgresql.conf and on the docs for the option that enabling this
option can "launder" data corruption, so that blocks that we suspected
were damaged are marked clean on rewrite. So long as that's clearly
documented I'm personally quite comfortable with your suggestion, since
my focus is just making sure I can get a DB back to a fully operational
state as quickly as possible when that's necessary.


I replied to this somewhere else in the thread when I overlooked Jeff's 
original post, so sorry for the noise... :(

Would it be better to do checksum_logging_level = <valid elog levels>? That 
way someone could set the notification to anything from DEBUG up to PANIC. ISTM the 
default should be ERROR.




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Craig Ringer
On 03/05/2013 08:15 AM, Jim Nasby wrote:

 Would it be better to do checksum_logging_level = <valid elog levels>
 ? That way someone could set the notification to anything from DEBUG
 up to PANIC. ISTM the default should be ERROR. 
That seems nice at first brush, but I don't think it holds up.

All our other log_level parameters control only output. If I saw that
parameter, I would think aah, this is how we control the detail and
verbosity of messages regarding checksum checking and maintenance. I
would be totally astonished if I changed it and it actually affected the
system's data integrity checking and enforcement processes. Logging
control GUCs control what we show to what clients/log files, not what
log statements get executed; they're a filter and don't control the
behaviour of the emitting log point.

Control over whether checksum failures are an error or merely warned
about is reasonable, but I strongly disagree with the idea of making
this seem like it's just a logging parameter.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jim Nasby

On 3/4/13 6:22 PM, Craig Ringer wrote:

On 03/05/2013 08:15 AM, Jim Nasby wrote:


Would it be better to do checksum_logging_level = <valid elog levels>
? That way someone could set the notification to anything from DEBUG
up to PANIC. ISTM the default should be ERROR.

That seems nice at first brush, but I don't think it holds up.

All our other log_level parameters control only output. If I saw that
parameter, I would think "aah, this is how we control the detail and
verbosity of messages regarding checksum checking and maintenance." I
would be totally astonished if I changed it and it actually affected the
system's data integrity checking and enforcement processes. Logging
control GUCs control what we show to what clients/log files, not what
log statements get executed; they're a filter and don't control the
behaviour of the emitting log point.

Control over whether checksum failures are an error or merely warned
about is reasonable, but I strongly disagree with the idea of making
this seem like it's just a logging parameter.


Good point. I thought we actually had precedent for controlling the level that 
something gets logged at, but now that you mention it I guess we don't. And 
this could sure as hell cause confusion.

So yeah, your original idea sounds best.




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Josh Berkus
Heikki,

 Perhaps we should just wait a few years? If we suspect that this becomes
 obsolete in a few years, it's probably better to just wait, than add a
 feature we'll have to keep maintaining. Assuming it gets committed
 today, it's going to take a year or two for 9.3 to get released and all
 the bugs ironed out, anyway.

You are far more optimistic about FS development than I am:

* Windows and OSX are unlikely to ever have usable FS checksums
* BTRFS may be years away from being production-quality for DB server,
and (given the current dev priorities) may *never* be suitable for DB
servers.
* For various reasons, many users may stay with other filesystems, even
on Linux.
* All filesystems have bugs, and the FS may be itself causing the
corruption.
* FS checksums may not catch underlying driver bugs (i.e. better to have
two checks than one if you KNOW something is wrong)

We have people who could use PostgreSQL-level checksums *now* because
they are having data corruption issues *now* and need a tool to help
determine what layer the corruption is occurring at.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Daniel Farina
On Mon, Mar 4, 2013 at 1:22 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 04.03.2013 23:00, Jeff Davis wrote:

 On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote:

 Yeah, fragmentation will certainly hurt some workloads. But how badly,
 and which workloads, and how does that compare with the work that
 PostgreSQL has to do to maintain the checksums? I'd like to see some
 data on those things.


 I think we all would. Btrfs will be a major filesystem in a few years,
 and we should be ready to support it.


 Perhaps we should just wait a few years? If we suspect that this becomes
 obsolete in a few years, it's probably better to just wait, than add a
 feature we'll have to keep maintaining. Assuming it gets committed today,
 it's going to take a year or two for 9.3 to get released and all the bugs
 ironed out, anyway.

Putting aside the not-so-rosy predictions seen elsewhere in this
thread about the availability of a high performance, reliable
checksumming file system available on common platforms, I'd like to
express what benefit this feature will have to me:

Corruption has easily occupied more than one person-month of time last
year for us.  This year to date I've burned two weeks, although
admittedly this was probably the result of statistical clustering.
Other colleagues of mine have probably put in a week or two in
aggregate in this year to date.  The ability to quickly, accurately,
and maybe at some later date proactively find good backups to run
WAL recovery from is one of the biggest strides we can make in the
operation of Postgres.  The especially ugly cases are where the page
header is not corrupt, so full page images can carry along malformed
tuples...basically, when the corruption works its way into the WAL,
we're in much worse shape.  Checksums would hopefully prevent this
case, converting them into corrupt pages that will not be modified.

It would be better yet if I could write tools to find the last-good
version of pages, and so I think tight integration with Postgres will
see a lot of benefits that would be quite difficult and non-portable
when relying on file system checksumming.

You are among the most well-positioned to make assessments of the cost
of the feature, but I thought you might appreciate a perspective of
the benefits, too.  I think they're large, and for me they are the
highest pole in the tent for what makes Postgres stressful to operate
as-is today.  It's a testament to the quality of the programming in
Postgres that Postgres programming error is not the largest problem.

For a sense of reference, I think the next largest operational problem
is the disruption caused by logical backups, e.g. pg_dump, and in
particular its long running transactions and sessions.




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 14:57 -0600, Jim Nasby wrote:
 I suggest we paint that GUC along the lines of
 checksum_failure_log_level, defaulting to ERROR. That way if someone
 wanted to completely bury the elogs down at, say, DEBUG, they could.

The reason I didn't want to do that is because it's essentially a
recovery feature. A boolean seems more appropriate than a slider.

That's a good point about burying the messages with DEBUG, but I think
it might be slightly over-engineering it. I am willing to change it if
others want it, though.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 23:22 +0200, Heikki Linnakangas wrote:
 On 04.03.2013 23:00, Jeff Davis wrote:
  On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote:
  Yeah, fragmentation will certainly hurt some workloads. But how badly,
  and which workloads, and how does that compare with the work that
  PostgreSQL has to do to maintain the checksums? I'd like to see some
  data on those things.
 
  I think we all would. Btrfs will be a major filesystem in a few years,
  and we should be ready to support it.
 
 Perhaps we should just wait a few years? If we suspect that this becomes 
 obsolete in a few years

I do not expect it to be obsolete, even if btrfs is stable and fast
today.

Consider this hypothetical scenario: what if btrfs performs acceptably
well today, but they tune it away from our needs later and it tanks
performance? Then, when we complain, the btrfs people say for DB
workloads, you should turn off COW, or use ext4 or XFS. And then we say
but we want checksums. And then they tell us that real databases do
their own checksums.

Then what?

I don't think that scenario is very outlandish. Postgres is essentially
a COW system (for tuples), and stacking COW on top of COW does not seem
like a good idea (neither for filesystems nor actual cows). So it may be
within reason for the filesystem folks to say "we're doing the wrong
thing", and then checksums are our problem again. Additionally, I don't
have a lot of faith that linux will address all of our btrfs complaints
(even legitimate ones) in a reasonable amount of time, if ever.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 23:11 +0200, Heikki Linnakangas wrote:
 Of course not. But if we can get away without checksums in Postgres, 
 that's better, because then we don't need to maintain that feature in 
 Postgres. If the patch gets committed, it's not mission accomplished. 
 There will be discussion and need for further development on things like 
 what to do if you get a checksum failure, patches to extend the 
 checksums to cover things like the clog and other non-data files and so 
 forth. And it's an extra complication that will need to be taken into 
 account when developing other new features; in particular, hint bit 
 updates need to write a WAL record. Even if you have all the current 
 hint bits covered, it's an extra hurdle for future patches that might 
 want to have hint bits in e.g new index access methods.

The example you chose of adding a hint bit is a little overstated -- as
far as I can tell, setting a hint bit follows pretty much the same
pattern as before, except that I renamed the function to
MarkBufferDirtyHint().
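
For illustration, the call pattern would look roughly like this (a
sketch; the exact signature is whatever the patch defines, and the old
function name is given only for orientation):

    /* Setting a hint bit, before and after the rename (illustrative). */
    tuple->t_infomask |= HEAP_XMIN_COMMITTED;   /* set the hint */
    MarkBufferDirtyHint(buffer);   /* formerly SetBufferCommitInfoNeedsSave(buffer) */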

But I agree in general. If complexity can be removed or avoided, that is
a very good thing. But right now, we have no answer to a real problem
that other databases do have an answer for. To me, the benefit is worth
the cost.

We aren't going down an irreversible path by adding checksums. If every
platform has a good checksumming filesystem and there is no demand for
the postgres code any more, we can deprecate it and remove it. But at
least users would have something between now and then.

 The PostgreSQL project would not be depending on it, any more than the 
 project depends on filesystem snapshots for backup purposes, or the OS 
 memory manager for caching.

I don't understand your analogies at all. We have WAL-protected base
backups so that users can get a consistent snapshot without filesystem
snapshots. To follow the analogy, we want postgres checksums so that the
user can be protected without filesystem checksums.

I would agree with you if we could point users somewhere and actually
recommend something and say "what you're doing now is wrong, do X
instead" (though if there is only one such X, we are dependent on it).
But even if we fast forward to three years from now: if someone shows up
saying that XFS gives him the best performance, but wants checksums,
will we really be able to say you are wrong to be using XFS; use
Btrfs?

One of the things I like about postgres is that we don't push a lot of
hard trade-offs on users. Several people (including you) put in effort
recently to support unlogged gist indexes. Are there some huge number of
users there that can't live without unlogged gist indexes? Probably not.
But that is one less thing that potential users have to trade away, and
one less thing to be confused or frustrated about.

I want to get to the point where checksums are the default, and only
advanced users would disable them. If that point comes in the form of
checksumming filesystems that are fast enough and enabled by default on
most of the platforms we support, that's fine with me. But I'm not very
sure that it will happen that way ever, and certainly not soon.

  If btrfs with checksums is 10% slower than ext4 with postgres checksums,
  does that mean we should commit the postgres checksums?
 
 In my opinion, a 10% gain would not be worth it, and we should not 
 commit in that case.
 
  On the other side of the coin, if btrfs with checksums is exactly the
  same speed as ext4 with no postgres checksums (i.e. checksums are free
  if we use btrfs), does that mean postgres checksums should be rejected?
 
 Yes, I think so. I'm sure at least some others will disagree; Greg 
 already made it quite clear that he doesn't care how the performance of 
 this compares with btrfs.

If all paths lead to rejection, what are these tests supposed to
accomplish, exactly?

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Sun, 2013-03-03 at 18:05 -0500, Greg Smith wrote:
 = Test 1 - find worst-case overhead for the checksum calculation on write =
 
 This can hit 25% of runtime when you isolate it out.  I'm not sure if 
 the way I'm running this multiple times makes sense yet.  This one is so 
 much slower on my Mac that I can barely see a change at all.
 
 = Test 2 - worst-case overhead for calculating checksum while reading data =
 
 Jeff saw an 18% slowdown, I get 24 to 32%.  This one bothers me because 
 the hit is going to happen during the very common situation where data 
 is shuffling a lot between a larger OS cache and shared_buffers taking a 
 relatively small fraction.  If that issue were cracked, such that 
 shared_buffers could be 50% of RAM, I think the typical real-world 
 impact of this would be easier to take.

I believe that test 1 and test 2 can be improved a little, if there is a
need. Right now we copy the page and then calculate the checksum on the
copy. If we instead calculate as we're copying, I believe it will make
it significantly faster.

I decided against doing that, because it decreased the readability, and
we can always do that later as an optimization. That should mitigate the
case you have in mind, which is a very legitimate concern. I'll wait for
someone to ask for it, though.
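
For what it's worth, a fused loop might look something like this (a
self-contained sketch, not the patch's code; the final 16-bit folding
is only illustrative):

    #include <stdint.h>
    #include <stddef.h>

    /*
     * Sketch: accumulate a Fletcher-style sum in the same pass that
     * copies the page, instead of copying first and checksumming the copy.
     */
    static uint16_t
    copy_and_checksum(uint32_t *dst, const uint32_t *src, size_t nwords)
    {
        uint32_t sum1 = 0, sum2 = 0;
        size_t   i;

        for (i = 0; i < nwords; i++)
        {
            uint32_t w = src[i];

            dst[i] = w;     /* the copy */
            sum1 += w;      /* the checksum, in the same pass */
            sum2 += sum1;
        }
        /* Fold the two running sums down to 16 bits (illustrative only). */
        return (uint16_t) ((sum1 ^ sum2) & 0xFFFF);
    }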

 = Test 3 - worst-case WAL overhead =
 
 This is the really nasty one.  The 10,000,000 rows touched by the SELECT 
 statement here create no WAL in a non-checksum environment.  When 
 checksums are on, 368,513,656 bytes of WAL are written, so about 37 
 bytes per row.

Yeah, nothing we can do about this.

 Right now the whole hint bit mechanism and its overhead are treated as 
 an internal detail that isn't in the regular documentation.  I think 
 committing this sort of checksum patch will require exposing some of the 
 implementation to the user in the documentation, so people can 
 understand what the trouble cases are--either in advance or when trying 
 to puzzle out why they're hitting one of them.

Any particular sections that you think would be good to update?

Thank you for the test results.

Regards,
Jeff Davis






Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Craig Ringer
On 03/02/2013 12:48 AM, Daniel Farina wrote:
 On Sun, Feb 24, 2013 at 10:30 PM, Greg Smith g...@2ndquadrant.com wrote:
 Attached is some bit rot updates to the checksums patches.  The replace-tli
 one still works fine
 I rather badly want this feature, and if the open issues with the
 patch have hit zero, I'm thinking about applying it, shipping it, and
 turning it on.  Given that the heap format has not changed, the main
 affordance I may check for is if I can work in backwards compatibility
 (while not maintaining the checksums, of course) in case of an
 emergency.

Did you get a chance to see whether you can run it in
checksum-validation-and-update-off backward compatible mode? This seems
like an important thing to have working (and tested for) in case of
bugs, performance issues or other unforseen circumstances.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Greg Smith
And here's an updated version of the checksum corruption testing wrapper 
script already.  This includes an additional safety check that you've 
set PGDATA to a location that can be erased.  Presumably no one else 
would like to accidentally do this:


rm -rf /*

Like I just did.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


check-check.sh
Description: Bourne shell script



Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Greg Smith

On 12/19/12 6:30 PM, Jeff Davis wrote:

I ran a few tests.
Test 1 - find worst-case overhead for the checksum calculation on write:
Test 2 - worst-case overhead for calculating checksum while reading data
Test 3 - worst-case WAL overhead


What I've done is wrap all of these tests into a shell script that runs 
them 3 times each, with and without checksums.  That includes steps like 
the syncs at the spots where Jeff found they helped improve repeatability.
these manually before and didn't notice enough of a difference to argue 
with any of his results at the time.  Having them packaged up usefully 
means I can try some additional platforms too, and other testers should 
be easily able to take a crack at it.


On the last one, in addition to runtime I directly measure how many 
bytes of WAL are written.  It's 0 in the usual case, where the hint bit 
changes triggered by the first SELECT * FROM foo don't generate any WAL.
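
For reference, one way to measure that from SQL (a sketch assuming the
9.2-era functions pg_current_xlog_insert_location() and
pg_xlog_location_diff()):

    SELECT pg_current_xlog_insert_location();  -- note the location before
    SELECT * FROM foo;                         -- the workload under test
    SELECT pg_current_xlog_insert_location();  -- and after
    -- bytes of WAL generated between the two points, e.g.:
    SELECT pg_xlog_location_diff('1/B000060', '1/B000000');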


Detailed results with both my and Jeff's numbers are in the attached 
spreadsheet.  I did my tests on a Mac writing to SSD, to try and get 
some variety in the test platforms.  The main difference there is that 
Test 1 is much slower on my system, enough so that the slowdown isn't as 
pronounced.


Remember, these are a set of tests designed to magnify the worst case 
here.  I don't feel any of these results make the feature uncommittable. 
 The numbers I'm getting are not significantly different from the ones 
Jeff posted back in December, and those were acceptable to some of the 
early adopter candidates I've been surveying informally.  These numbers 
are amplifying overhead without doing much in the way of real disk I/O, 
which can easily be a lot more expensive than any of this.  I do think 
there needs to be a bit more documentation of the potential downsides to 
checksumming written though, since they are pretty hefty in some situations.


I'm going to get some pgbench results next, to try and put this into a 
more realistic context too.  The numbers for this round break down like 
this:


= Test 1 - find worst-case overhead for the checksum calculation on write =

This can hit 25% of runtime when you isolate it out.  I'm not sure if 
the way I'm running this multiple times makes sense yet.  This one is so 
much slower on my Mac that I can barely see a change at all.


= Test 2 - worst-case overhead for calculating checksum while reading data =

Jeff saw an 18% slowdown, I get 24 to 32%.  This one bothers me because 
the hit is going to happen during the very common situation where data 
is shuffling a lot between a larger OS cache and shared_buffers taking a 
relatively small fraction.  If that issue were cracked, such that 
shared_buffers could be 50% of RAM, I think the typical real-world 
impact of this would be easier to take.


= Test 3 - worst-case WAL overhead =

This is the really nasty one.  The 10,000,000 rows touched by the SELECT 
statement here create no WAL in a non-checksum environment.  When 
checksums are on, 368,513,656 bytes of WAL are written, so about 37 
bytes per row.
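
Roughly, the shape of that worst case (a sketch, not the exact
statements from the script):

    -- Load rows, checkpoint, then force hint-bit setting on every page:
    CREATE TABLE foo AS SELECT generate_series(1, 10000000) AS i;
    CHECKPOINT;
    SELECT * FROM foo;   -- first scan sets hint bits; with checksums on,
                         -- each page dirtied this way must be WAL-logged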


Jeff saw this increase runtime by 135%, going from 1000ms to 2350ms.  My 
multiple runs are jumping around in a way I also don't trust fully yet. 
 But the first and best of the ones I'm seeing goes from 1660ms to 
4013ms, which is a 140% increase.  The others are even worse.  I suspect 
I'm filling a cache that isn't cleared before the second and third run 
are over.  I'll know for sure when I switch back to Linux.


The really nasty case I can see making people really cranky is where 
someone has fsync on, a slowly rotating drive, and then discovers this 
slowing read statements.  There's already a decent share of "why is it 
writing when I do 'SELECT *'?" complaints around the block I/O, which is 
fully asynchronous in a lot of cases.


Right now the whole hint bit mechanism and its overhead are treated as 
an internal detail that isn't in the regular documentation.  I think 
committing this sort of checksum patch will require exposing some of the 
implementation to the user in the documentation, so people can 
understand what the trouble cases are--either in advance or when trying 
to puzzle out why they're hitting one of them.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


ChecksumPerfResults.xls
Description: MS-Excel spreadsheet
# Configuration for testing checksum performance.
#
# Primary goals:
# -Reasonable buffer size and checkpoint parameters
# -Prevent interference from the bgwriter or autovacuum.
# -Turn off fsync so that it's measuring the calculation
#  overhead, not the effort of actually writing to disk.
#
autovacuum = off
fsync = off
bgwriter_lru_maxpages = 0
shared_buffers = 1024MB
checkpoint_segments = 64



perf-checksum.sh
Description: Bourne shell script
--
-- Convert hex value to a decimal one.  It's possible to do this using
-- undocumented features of the bit type, such as:
--
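-- For illustration (one such bit-type cast trick; an assumed example,
-- not taken from the original script):
SELECT x'ff'::int;   -- 255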

Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Greg Smith

On 3/3/13 9:22 AM, Craig Ringer wrote:

Did you get a chance to see whether you can run it in
checksum-validation-and-update-off backward compatible mode? This seems
like an important thing to have working (and tested for) in case of
bugs, performance issues or other unforseen circumstances.


There isn't any way to do this in the current code.  The big 
simplification Jeff introduced here, to narrow complexity toward a 
commit candidate, was to make checksumming a cluster-level decision. 
You get it for everything or not at all.


The problem I posted about earlier today, where a header checksum error 
can block access to the entire relation, could be resolved with some 
sort of "ignore read checksums" GUC.  But that's impractical right now 
for the write side of things.  There have been a long list of metadata 
proposals to handle situations where part of a cluster is checksummed, 
but not all of it.  Once that sort of feature is implemented, it becomes 
a lot easier to talk about selectively disabling writes.


As for a design of a GUC that might be useful here, the option itself 
strikes me as being like archive_mode in its general use.  There is an 
element of parameters like wal_sync_method or enable_cassert though, 
where the options available vary depending on how you built the cluster. 
 Maybe name it checksum_level with options like this:


off:  only valid option if you didn't enable checksums with initdb
enforcing:  full checksum behavior as written right now.
unvalidated:  broken checksums on reads are ignored.

The main tricky case I see in that is where you read in a page with a 
busted checksum using "unvalidated".  Ideally you wouldn't write such a 
page back out again, because it's going to hide that it's corrupted in 
some way already.  How to enforce that though?  Perhaps "unvalidated" 
could only be allowed in a read-only transaction?
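
In postgresql.conf terms the idea would look something like this (purely
illustrative; no such setting exists today):

    # Hypothetical setting, for illustration only:
    #checksum_level = enforcing    # off | enforcing | unvalidated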


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Craig Ringer
On 03/04/2013 11:18 AM, Greg Smith wrote:
 On 3/3/13 9:22 AM, Craig Ringer wrote:
 Did you get a chance to see whether you can run it in
 checksum-validation-and-update-off backward compatible mode? This seems
 like an important thing to have working (and tested for) in case of
 bugs, performance issues or other unforseen circumstances.

 There isn't any way to do this in the current code.  The big
 simplification Jeff introduced here, to narrow complexity toward a
 commit candidate, was to make checksumming a cluster-level decision.
 You get it for everything or not at all.

 The problem I posted about earlier today, where a header checksum
 error can block access to the entire relation, could be resolved with
 some sort of "ignore read checksums" GUC.  But that's impractical
 right now for the write side of things.  There have been a long list
 of metadata proposals to handle situations where part of a cluster is
 checksummed, but not all of it.  Once that sort of feature is
 implemented, it becomes a lot easier to talk about selectively
 disabling writes.

 As for a design of a GUC that might be useful here, the option itself
 strikes me as being like archive_mode in its general use.  There is an
 element of parameters like wal_sync_method or enable_cassert though,
 where the options available vary depending on how you built the
 cluster.  Maybe name it checksum_level with options like this:

 off:  only valid option if you didn't enable checksums with initdb
 enforcing:  full checksum behavior as written right now.
 unvalidated:  broken checksums on reads are ignored.

 The main tricky case I see in that is where you read in a page with a
 busted checksum using "unvalidated".  Ideally you wouldn't write such
 a page back out again, because it's going to hide that it's corrupted
 in some way already.  How to enforce that though?  Perhaps
 "unvalidated" could only be allowed in a read-only transaction?
That sounds like a really good step for disaster recovery, yes.

I also suspect that at least in the first release it might be desirable
to have an option that essentially says "something's gone horribly wrong
and we no longer want to check or write checksums, we want a
non-checksummed DB that can still read our data from before we turned
checksumming off". Essentially, a way for someone who's trying
checksumming in production after their staging tests worked out OK to
abort and go back to the non-checksummed case without having to do a
full dump and reload.

Given that, I suspect we need a 4th state, like "disabled" or
"unvalidating_writable", where we ignore checksums completely and
maintain the checksum-enabled layout but just write padding to the
checksum fields and don't bother to check them on reading.

My key concern boils down to being able to get someone up and running
quickly and with minimal disruption if something we didn't think of goes
wrong. "Oh, you have to dump and reload your 1TB database before you can
start writing to it again" isn't going to cut it.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Greg Smith

On 3/3/13 10:52 PM, Craig Ringer wrote:

I also suspect that at least in the first release it might be desirable
 to have an option that essentially says "something's gone horribly wrong
 and we no longer want to check or write checksums, we want a
 non-checksummed DB that can still read our data from before we turned
 checksumming off".


I see that as being something that involves disabling the cluster-wide 
flag that turns checksumming on, the one that is reported by 
pg_controldata.  I think it would have to be a one-way, system down kind 
of change, which I think is fair given the ugly (but feasible) situation 
you're describing.  It would need to be something stronger than a GUC. 
Once you start writing out pages without checksums, you're back into the 
fuzzy state where some pages have them, others don't, and there's no 
good way to deal with that yet.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Craig Ringer
On 03/04/2013 12:19 PM, Greg Smith wrote:
 On 3/3/13 10:52 PM, Craig Ringer wrote:
 I also suspect that at least in the first release it might be desirable
 to have an option that essentially says "something's gone horribly wrong
 and we no longer want to check or write checksums, we want a
 non-checksummed DB that can still read our data from before we turned
 checksumming off".

 I see that as being something that involves disabling the cluster-wide
 flag that turns checksumming on, the one that is reported by
 pg_controldata.  I think it would have to be a one-way, system down
 kind of change, which I think is fair given the ugly (but feasible)
 situation you're describing.  It would need to be something stronger
 than a GUC. Once you start writing out pages without checksums, you're
 back into the fuzzy state where some pages have them, others don't,
 and there's no good way to deal with that yet.

Agreed, I was envisioning a one-way process where re-enabling checksums
would involve a re-initdb and reload. A DB restart seems perfectly
reasonable; it's just a full dump and reload before they can get running
again that I feel must be avoided.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Simon Riggs
On 3 March 2013 18:24, Greg Smith g...@2ndquadrant.com wrote:

 The 16-bit checksum feature seems functional, with two sources of overhead.
 There's some CPU time burned to compute checksums when pages enter the
 system.  And there's extra overhead for WAL logging hint bits.  I'll
 quantify both of those better in another message.

It's crunch time. Do you and Jeff believe this patch should be
committed to Postgres core?

Are there objectors?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-03-01 Thread Daniel Farina
On Sun, Feb 24, 2013 at 10:30 PM, Greg Smith g...@2ndquadrant.com wrote:
 Attached are some bit rot updates to the checksums patches.  The replace-tli
 one still works fine

I rather badly want this feature, and if the open issues with the
patch have hit zero, I'm thinking about applying it, shipping it, and
turning it on.  Given that the heap format has not changed, the main
affordance I may check for is whether I can work in backwards
compatibility (while not maintaining the checksums, of course) in case
of an emergency.

-- 
fdr




Re: [HACKERS] Enabling Checksums

2013-01-28 Thread Robert Haas
On Sun, Jan 27, 2013 at 5:28 PM, Jeff Davis pg...@j-davis.com wrote:
 There's a maximum of one FPI per page per cycle, and we need the FPI for
 any modified page in this design regardless.

 So, deferring the XLOG_HINT WAL record doesn't change the total number
 of FPIs emitted. The only savings would be on the trivial XLOG_HINT wal
 record itself, because we might notice that it's not necessary in the
 case where some other WAL action happened to the page.

OK, I see.  So the case where this really hurts is where a page is
updated for hint bits only and then not touched again for the
remainder of the checkpoint cycle.

  At first glance, it seems sound as long as the WAL FPI makes it to disk
  before the data. But to meet that requirement, it seems like we'd need
  to write an FPI and then immediately flush WAL before cleaning a page,
  and that doesn't seem like a win. Do you (or Simon) see an opportunity
  here that I'm missing?

 I am not sure that isn't a win.  After all, we can need to flush WAL
 before flushing a buffer anyway, so this is just adding another case -

 Right, but if we get the WAL record in earlier, there is a greater
 chance that it goes out with some unrelated WAL flush, and we don't need
 to flush the WAL to clean the buffer at all. Separating WAL insertions
 from WAL flushes seems like a fairly important goal, so I'm a little
 skeptical of a proposal to narrow that gap so drastically.

 It's hard to analyze without a specific proposal on the table. But if
 cleaning pages requires a WAL record followed immediately by a flush, it
 seems like that would increase the number of actual WAL flushes we need
 to do by a lot.

Yeah, maybe.  I think Simon had a good argument for not pursuing this
route, anyway.

 and the payoff is that if the initial access to a page, setting hint
 bits, is quickly followed by a write operation, we avoid the need for
 any extra WAL to cover the hint bit change.  I bet that's common,
 because if updating, you'll usually need to look at the tuples on the
 page and decide whether they are visible to your scan before, say,
 updating one of them.

 That's a good point, I'm just not sure how to avoid that problem without a
 lot of complexity or a big cost. It seems like we want to defer the
 XLOG_HINT WAL record for a short time; but not wait so long that we need
 to clean the buffer or miss a chance to piggyback on another WAL flush.

  By the way, the approach I took was to add the heap buffer to the WAL
  chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible.
  It seemed simpler to understand than trying to add a bunch of options to
  MarkBufferDirty.

 Unless I am mistaken, that's going to heavily penalize the case where
 the user vacuums an insert-only table.  It will emit much more WAL
 than currently.

 Yes, that's true, but I think that's pretty fundamental to this
 checksums design (and of course it only applies if checksums are
 enabled). We need to make sure an FPI is written and the LSN bumped
 before we write a page.

 That's why I was pushing a little on various proposals to either remove
 or mitigate the impact of hint bits (load path, remove PD_ALL_VISIBLE,
 cut down on the less-important hint bits, etc.). Maybe those aren't
 viable, but that's why I spent time on them.

 There are some other options, but I cringe a little bit thinking about
 them. One is to simply exclude the PD_ALL_VISIBLE bit from the checksum
 calculation, so that a torn page doesn't cause a problem (though
 obviously that one bit would be vulnerable to corruption). Another is to
 use a double-write buffer, but that didn't seem to go very far. Or, we
 could abandon the whole thing and tell people to use ZFS/btrfs/NAS/SAN.

I am inclined to think that we shouldn't do any of this stuff for now.
 I think it's OK if the first version of checksums is
not-that-flexible and/or not-that-performant.  We can optimize for
those things later.  Trying to monkey with this at the same time we're
trying to get checksums in risks diverting focus from getting
checksums done at all, and also risks introducing new data corruption
bugs.  We have a reputation of long standing for getting it right
first and then getting it to perform well later, so it shouldn't be a
total shock if we take that approach here, too.  I see no reason to
think that the performance problems must be solved up front or not at
all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Enabling Checksums

2013-01-27 Thread Simon Riggs
On 25 January 2013 20:29, Robert Haas robertmh...@gmail.com wrote:

 The checksums patch also introduces another behavior into
 SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record
 if checksums are enabled (to avoid torn page hazards). That's only
 necessary for changes where the caller does not write WAL itself and
 doesn't bump the LSN of the data page. (There's a reason the caller
 can't easily write the XLOG_HINT WAL itself.) So, we could introduce
 another flag needsWAL that would control whether we write the
 XLOG_HINT WAL or not (only applies with checksums on, of course).

 I thought Simon had the idea, at some stage, of writing a WAL record
 to cover hint-bit changes only at the time we *write* the buffer and
 only if no FPI had already been emitted that checkpoint cycle.  I'm
 not sure whether that approach was sound, but if so it seems more
 efficient than this approach.

The requirement is that we ensure that a FPI is written to WAL before
any changes to the block are made.

The patch does that by inserting an XLOG_HINT_WAL record when we set a
hint. The insert is a no-op if we've already written the FPI in this
checkpoint cycle and we don't even reach there except when dirtying a
clean data block.
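
Put another way, the test reduces to comparing the page LSN against the
redo pointer of the running checkpoint. A toy version of that check
(Python, hypothetical names, LSNs as plain integers):

def xlog_hint_record_needed(page_lsn, redo_ptr):
    # If the page LSN is past the checkpoint's redo pointer, some WAL
    # record since the checkpoint already carried an FPI for this page,
    # and the XLOG_HINT insert can be skipped.
    return page_lsn <= redo_ptr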

If we attempted to defer the FPI last thing before write, we'd need to
cope with the case that writes at checkpoint occur after the logical
start of the checkpoint, and also with the overhead of additional
writes at checkpoint time.

I don't see any advantage in deferring the FPI, but I do see
disadvantage in complicating this.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-01-27 Thread Robert Haas
On Sun, Jan 27, 2013 at 3:50 AM, Simon Riggs si...@2ndquadrant.com wrote:
 If we attempted to defer the FPI last thing before write, we'd need to
 cope with the case that writes at checkpoint occur after the logical
 start of the checkpoint, and also with the overhead of additional
 writes at checkpoint time.

Oh, good point.  That's surely a good reason not to do it that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Enabling Checksums

2013-01-27 Thread Jeff Davis
On Sat, 2013-01-26 at 23:23 -0500, Robert Haas wrote:
  If we were to try to defer writing the WAL until the page was being
  written, the most it would possibly save is the small XLOG_HINT WAL
  record; it would not save any FPIs.
 
 How is the XLOG_HINT_WAL record kept small and why does it not itself
 require an FPI?

There's a maximum of one FPI per page per cycle, and we need the FPI for
any modified page in this design regardless.

So, deferring the XLOG_HINT WAL record doesn't change the total number
of FPIs emitted. The only savings would be on the trivial XLOG_HINT wal
record itself, because we might notice that it's not necessary in the
case where some other WAL action happened to the page.

  At first glance, it seems sound as long as the WAL FPI makes it to disk
  before the data. But to meet that requirement, it seems like we'd need
  to write an FPI and then immediately flush WAL before cleaning a page,
  and that doesn't seem like a win. Do you (or Simon) see an opportunity
  here that I'm missing?
 
 I am not sure that isn't a win.  After all, we can need to flush WAL
 before flushing a buffer anyway, so this is just adding another case -

Right, but if we get the WAL record in earlier, there is a greater
chance that it goes out with some unrelated WAL flush, and we don't need
to flush the WAL to clean the buffer at all. Separating WAL insertions
from WAL flushes seems like a fairly important goal, so I'm a little
skeptical of a proposal to narrow that gap so drastically.

It's hard to analyze without a specific proposal on the table. But if
cleaning pages requires a WAL record followed immediately by a flush, it
seems like that would increase the number of actual WAL flushes we need
to do by a lot.

 and the payoff is that if the initial access to a page, setting hint
 bits, is quickly followed by a write operation, we avoid the need for
 any extra WAL to cover the hint bit change.  I bet that's common,
 because if updating, you'll usually need to look at the tuples on the
 page and decide whether they are visible to your scan before, say,
 updating one of them.

That's a good point, I'm just not sure how to avoid that problem without a
lot of complexity or a big cost. It seems like we want to defer the
XLOG_HINT WAL record for a short time; but not wait so long that we need
to clean the buffer or miss a chance to piggyback on another WAL flush.

  By the way, the approach I took was to add the heap buffer to the WAL
  chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible.
  It seemed simpler to understand than trying to add a bunch of options to
  MarkBufferDirty.
 
 Unless I am mistaken, that's going to heavily penalize the case where
 the user vacuums an insert-only table.  It will emit much more WAL
 than currently.

Yes, that's true, but I think that's pretty fundamental to this
checksums design (and of course it only applies if checksums are
enabled). We need to make sure an FPI is written and the LSN bumped
before we write a page.

That's why I was pushing a little on various proposals to either remove
or mitigate the impact of hint bits (load path, remove PD_ALL_VISIBLE,
cut down on the less-important hint bits, etc.). Maybe those aren't
viable, but that's why I spent time on them.

There are some other options, but I cringe a little bit thinking about
them. One is to simply exclude the PD_ALL_VISIBLE bit from the checksum
calculation, so that a torn page doesn't cause a problem (though
obviously that one bit would be vulnerable to corruption). Another is to
use a double-write buffer, but that didn't seem to go very far. Or, we
could abandon the whole thing and tell people to use ZFS/btrfs/NAS/SAN.
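
For what it's worth, the first of those options amounts to masking the
bit out on both the compute and verify sides. A toy illustration
(Python, zlib.crc32 as a stand-in checksum, bit value and offset
hypothetical):

import zlib

PD_ALL_VISIBLE = 0x04   # hypothetical flag bit
PD_FLAGS_OFFSET = 10    # hypothetical offset of pd_flags in the header

def checksum_ignoring_all_visible(page):
    buf = bytearray(page)
    # zero the bit before checksumming, so a torn write that flips only
    # this bit still verifies (at the cost of not protecting the bit)
    buf[PD_FLAGS_OFFSET] &= ~PD_ALL_VISIBLE
    return zlib.crc32(bytes(buf)) & 0xffff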

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-01-26 Thread Robert Haas
On Fri, Jan 25, 2013 at 9:35 PM, Jeff Davis pg...@j-davis.com wrote:
 On Fri, 2013-01-25 at 15:29 -0500, Robert Haas wrote:
 I thought Simon had the idea, at some stage, of writing a WAL record
 to cover hint-bit changes only at the time we *write* the buffer and
 only if no FPI had already been emitted that checkpoint cycle.  I'm
 not sure whether that approach was sound, but if so it seems more
 efficient than this approach.

 My patch is based on his original idea; although I've made quite a lot
 of changes, I believe that I have stuck to his same basic design w.r.t.
 WAL.

 This patch does not cause a new FPI to be emitted if one has already
 been emitted this cycle. It also does not emit a WAL record at all if an
 FPI has already been emitted.

 If we were to try to defer writing the WAL until the page was being
 written, the most it would possibly save is the small XLOG_HINT WAL
 record; it would not save any FPIs.

How is the XLOG_HINT_WAL record kept small and why does it not itself
require an FPI?

 At first glance, it seems sound as long as the WAL FPI makes it to disk
 before the data. But to meet that requirement, it seems like we'd need
 to write an FPI and then immediately flush WAL before cleaning a page,
 and that doesn't seem like a win. Do you (or Simon) see an opportunity
 here that I'm missing?

I am not sure that isn't a win.  After all, we can need to flush WAL
before flushing a buffer anyway, so this is just adding another case -
and the payoff is that if the initial access to a page, setting hint
bits, is quickly followed by a write operation, we avoid the need for
any extra WAL to cover the hint bit change.  I bet that's common,
because if updating, you'll usually need to look at the tuples on the
page and decide whether they are visible to your scan before, say,
updating one of them.

 By the way, the approach I took was to add the heap buffer to the WAL
 chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible.
 It seemed simpler to understand than trying to add a bunch of options to
 MarkBufferDirty.

Unless I am mistaken, that's going to heavily penalize the case where
the user vacuums an insert-only table.  It will emit much more WAL
than currently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Enabling Checksums

2013-01-25 Thread Robert Haas
On Thu, Jan 10, 2013 at 1:06 AM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2012-12-04 at 01:03 -0800, Jeff Davis wrote:
 For now, I rebased the patches against master, and did some very minor
 cleanup.

 I think there is a problem here when setting PD_ALL_VISIBLE. I thought I
 had analyzed that before, but upon review, it doesn't look right.
 Setting PD_ALL_VISIBLE needs to be associated with a WAL action somehow,
 and a bumping of the LSN, otherwise there is a torn page hazard.

 The solution doesn't seem particularly difficult, but there are a few
 strange aspects and I'm not sure exactly which path I should take.

 First of all, the relationship between MarkBufferDirty and
 SetBufferCommitInfoNeedsSave is a little confusing. The comment over
 MarkBufferDirty is confusing because it says that the caller must have
 an exclusive lock, or else bad data could be written. But that doesn't
 have to do with marking the buffer dirty, that has to do with the data
 page change you make while you are marking it dirty -- if it's a single
 bit change, then there is no risk that I can see.

 In the current code, the only real code difference between the two is
 that SetBufferCommitInfoNeedsSave might fail to mark the buffer dirty if
 there is a race. So, in the current code, we could actually combine the
 two by passing a force flag (if true, behaves like MarkBufferDirty, if
 false, behaves like SetBufferCommitInfoNeedsSave).

 The checksums patch also introduces another behavior into
 SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record
 if checksums are enabled (to avoid torn page hazards). That's only
 necessary for changes where the caller does not write WAL itself and
 doesn't bump the LSN of the data page. (There's a reason the caller
 can't easily write the XLOG_HINT WAL itself.) So, we could introduce
 another flag needsWAL that would control whether we write the
 XLOG_HINT WAL or not (only applies with checksums on, of course).

I thought Simon had the idea, at some stage, of writing a WAL record
to cover hint-bit changes only at the time we *write* the buffer and
only if no FPI had already been emitted that checkpoint cycle.  I'm
not sure whether that approach was sound, but if so it seems more
efficient than this approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Enabling Checksums

2013-01-25 Thread Jeff Davis
On Fri, 2013-01-25 at 15:29 -0500, Robert Haas wrote:
 I thought Simon had the idea, at some stage, of writing a WAL record
 to cover hint-bit changes only at the time we *write* the buffer and
 only if no FPI had already been emitted that checkpoint cycle.  I'm
 not sure whether that approach was sound, but if so it seems more
 efficient than this approach.

My patch is based on his original idea; although I've made quite a lot
of changes, I believe that I have stuck to his same basic design w.r.t.
WAL.

This patch does not cause a new FPI to be emitted if one has already
been emitted this cycle. It also does not emit a WAL record at all if an
FPI has already been emitted.

If we were to try to defer writing the WAL until the page was being
written, the most it would possibly save is the small XLOG_HINT WAL
record; it would not save any FPIs.

At first glance, it seems sound as long as the WAL FPI makes it to disk
before the data. But to meet that requirement, it seems like we'd need
to write an FPI and then immediately flush WAL before cleaning a page,
and that doesn't seem like a win. Do you (or Simon) see an opportunity
here that I'm missing?

By the way, the approach I took was to add the heap buffer to the WAL
chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible.
It seemed simpler to understand than trying to add a bunch of options to
MarkBufferDirty.

Regards,
Jeff Davis






Re: [HACKERS] Enabling Checksums

2013-01-24 Thread Jeff Davis
On Wed, 2013-01-16 at 17:38 -0800, Jeff Davis wrote:
 New version of checksums patch.

And another new version of both patches.

Changes:
 * Rebased.
 * Rename SetBufferCommitInfoNeedsSave to MarkBufferDirtyHint. Now that
it's being used more places, it makes sense to give it a more generic
name.
 * My colleague, Yingjie He, noticed that the FSM doesn't write any WAL,
and therefore we must protect those operations against torn pages. That
seems simple enough: just use MarkBufferDirtyHint (formerly
SetBufferCommitInfoNeedsSave) instead of MarkBufferDirty. The FSM
changes are not critical, so the fact that we may lose the dirty bit is
OK.

Regards,
Jeff Davis


replace-tli-with-checksums-20130124.patch.gz


checksums-20130124.patch.gz



Re: [HACKERS] Enabling Checksums

2013-01-16 Thread Jeff Davis
New version of checksums patch.

Changes:
  * rebased
  * removed two duplicate lines; apparently the result of a bad merge
  * Added heap page to WAL chain when logging an XLOG_HEAP2_VISIBLE to
avoid torn page issues updating PD_ALL_VISIBLE. This is the most
significant change.
  * minor comment cleanup

No open issues that I'm aware of with the patch itself.

Greg appears to have made some progress on the automated corruption
tester.

Note to reviewers: I also have a patch out to remove PD_ALL_VISIBLE
entirely. The code around PD_ALL_VISIBLE is quite tricky (with or
without this patch), so if the PD_ALL_VISIBLE patch is committed first
then it will make reviewing this patch easier. Regardless, the second
patch to be committed will need to be rebased on top of the first.

Regards,
Jeff Davis


replace-tli-with-checksums-20130116.patch.gz


checksums-20130116.patch.gz



Re: [HACKERS] Enabling Checksums

2013-01-16 Thread Jeff Davis
On Tue, 2013-01-15 at 19:36 -0500, Greg Smith wrote:
 First rev of a simple corruption program is attached, in very C-ish 
 Python.

Great. Did you verify that my patch works as you expect at least in the
simple case?

  The parameters I settled on are to accept a relation name, byte 
 offset, byte value, and what sort of operation to do:  overwrite, AND, 
 OR, XOR.  I like XOR here because you can fix it just by running the 
 program again.

Oh, good idea.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-01-15 Thread Greg Smith
First rev of a simple corruption program is attached, in very C-ish 
Python.  The parameters I settled on are to accept a relation name, byte 
offset, byte value, and what sort of operation to do:  overwrite, AND, 
OR, XOR.  I like XOR here because you can fix it just by running the 
program again.  Rewriting this in C would not be terribly difficult, and 
most of the time spent on this version was figuring out what to do.
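
The XOR mode is its own inverse, which is why a second identical run
repairs the damage:

original = 0x80
mask = 0x01
corrupted = original ^ mask      # 0x81, as in the session below
restored = corrupted ^ mask      # back to 0x80
assert restored == original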


This follows Jeff's idea that the most subtle corruption is the hardest 
to spot, so testing should aim at the smallest unit of change.  If you 
can spot a one bit error in an unused byte of a page, presumably that 
will catch large errors like a byte swap.  I find some grim amusement 
that the checksum performance testing I've been trying to do got stuck 
behind a problem with a tiny, hard to detect single bit of corruption.


Here's pgbench_accounts being corrupted, the next to last byte on this line:

$ pgbench -i -s 1
$ ./pg_corrupt pgbench_accounts show
Reading byte 0 within file /usr/local/var/postgres/base/16384/25242
Current byte= 0 / $00
$ hexdump /usr/local/var/postgres/base/16384/25242 | head
000 00 00 00 00 00 00 00 00 00 00 04 00 0c 01 80 01
...
$ ./pg_corrupt pgbench_accounts 14 1
/usr/local/var/postgres base/16384/25242 8192 13434880 1640
Reading byte 14 within file /usr/local/var/postgres/base/16384/25242
Current byte= 128 / $80
Modified byte= 129 / $81
File modified successfully
$ hexdump /usr/local/var/postgres/base/16384/25242 | head
000 00 00 00 00 00 00 00 00 00 00 04 00 0c 01 81 01

That doesn't impact selecting all of the rows:

$ psql -c "select count(*) from pgbench_accounts"
 count
--------
 100000

And pg_dump works fine against the table too.  Tweaking this byte looks 
like a reasonable first test case for seeing if checksums can catch an 
error that query execution doesn't.


Next I'm going to test the functional part of the latest checksum patch; 
duplicate Jeff's targeted performance tests; and then run some of my 
own.  I wanted to get this little tool circulating now that it's useful 
first.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
#!/usr/bin/env python
#
# pg_corrupt
#
# Read in a byte of a PostgreSQL relation (a table or index) and allow writing
# it back with an altered value.
#
# Greg Smith g...@2ndquadrant.com
# Copyright (c) 2013, Heroku, Inc.
# Released under The PostgreSQL Licence
#

from os.path import join
from subprocess import Popen,PIPE
import sys
import psycopg2

class Operations:
    SHOW = 0
    WRITE = 1
    AND = 2
    OR = 3
    XOR = 4
    text = ['show', 'write', 'and', 'or', 'xor']

def controldata_blocks_per_segment(pgdata):
    # Default for a standard build: 1 GB segments of 8 KB blocks
    blocks_per_seg = 131072
    try:
        # TODO This doesn't work when called in an Emacs compile shell
        out, err = Popen("pg_controldata %s" % pgdata, stdout=PIPE,
                         shell=True).communicate()
        control_data = out.splitlines()
        for c in control_data:
            if c.startswith("Blocks per segment of large relation:"):
                blocks_per_seg = int(c.split(":")[1])
    except:
        print "Cannot determine blocks per segment, using default of", blocks_per_seg
    return blocks_per_seg

def get_table_info(conn, relation):
    cur = conn.cursor()
    q = """SELECT
           current_setting('data_directory') AS data_directory,
           pg_relation_filepath(oid),
           current_setting('block_size') AS block_size,
           pg_relation_size(oid),
           relpages
           FROM pg_class
           WHERE relname='%s'""" % relation
    cur.execute(q)
    if cur.rowcount != 1:
        print "Error: did not return 1 row from pg_class lookup of %s" % relation
        return None
    table_info = {}

    for i in cur:
        table_info['relation'] = relation
        table_info['pgdata'] = i[0]
        table_info['filepath'] = i[1]
        table_info['block_size'] = int(i[2])
        table_info['relation_size'] = i[3]
        table_info['relpages'] = i[4]
        table_info['base_file_name'] = join(i[0], i[1])
        table_info['blocks_per_seg'] = \
            controldata_blocks_per_segment(table_info['pgdata'])
        table_info['bytes_per_seg'] = table_info['block_size'] * \
            table_info['blocks_per_seg']

    cur.close()
    return table_info

def operate(table_info, byte_offset, operation, value):
    if byte_offset > table_info['relation_size']:
        print "Error:  trying to change byte %s but relation %s is only %s bytes" % \
            (byte_offset, table_info['relation'], table_info['relation_size'])
        return

    if byte_offset < 0:
        print "Error:  cannot use negative byte offsets"
        return

    # Relations larger than one segment are split across numbered files
    file_name = table_info['base_file_name']
    file_seq = int(byte_offset / table_info['bytes_per_seg'])
    if file_seq > 0:
        file_name = file_name + ".%s" % file_seq
        file_offset = byte_offset - file_seq * table_info['bytes_per_seg']
    else:
        file_offset = byte_offset

    print "Reading byte", file_offset, "within file", file_name

Re: [HACKERS] Enabling Checksums

2013-01-12 Thread Greg Smith

On 12/19/12 6:30 PM, Jeff Davis wrote:

The idea is to prevent interference from the bgwriter or autovacuum.
Also, I turn off fsync so that it's measuring the calculation overhead,
not the effort of actually writing to disk.


With my test server issues sorted, what I did was set up a single 7200RPM 
drive with a battery-backed write cache card.  That way fsync doesn't 
bottleneck things.  And I too realized that limit had to be cracked 
before anything useful could be done.  Having the BBWC card is a bit 
better than fsync=off, because we'll get something more like the 
production workload out of it. I/O will be realistic, but limited to 
what one drive can pull off.



Without checksums, it takes about 1000ms. With checksums, about 2350ms.
I also tested with checksums but without the CHECKPOINT commands above,
and it was also 1000ms.


I think we need to use lower checkpoint_segments to try and trigger more 
checkpoints.  My 10 minute pgbench-tools runs will normally have at most 
3 checkpoints.  I would think something like 10 would be more useful, to 
make sure we're spending enough time seeing extra WAL writes.



This test is more plausible than the other two, so it's more likely to
be a real problem. So, the biggest cost of checksums is, by far, the
extra full-page images in WAL, which matches our expectations.


What I've done with pgbench-tools is actually measure the amount of WAL 
from the start to the end of the test run.  To analyze it you need to 
scale it a bit; computing wal bytes / commit seems to work.
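
For reference, here's roughly how that per-run measurement can be
reproduced by hand with psycopg2 and the 9.2-era WAL location functions
(run_workload() is a made-up placeholder for the benchmark itself):

import psycopg2

conn = psycopg2.connect(dbname='postgres')
conn.autocommit = True
cur = conn.cursor()

def run_workload():
    # placeholder for the actual benchmark; returns number of commits
    cur.execute("SELECT 1")
    return 1

cur.execute("SELECT pg_current_xlog_location()")
start = cur.fetchone()[0]

n_commits = run_workload()

cur.execute("SELECT pg_xlog_location_diff(pg_current_xlog_location(), %s)",
            (start,))
wal_bytes = cur.fetchone()[0]
print(wal_bytes / float(n_commits))   # WAL bytes per commit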


pgbench-tools also launches vmstat and iostat in a way that it's 
possible to graph the values later.  The interesting results I'm seeing 
are when the disk is about 80% busy and when it's 100% busy.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2013-01-10 Thread Simon Riggs
On 10 January 2013 06:06, Jeff Davis pg...@j-davis.com wrote:

 The checksums patch also introduces another behavior into
 SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record
 if checksums are enabled (to avoid torn page hazards). That's only
 necessary for changes where the caller does not write WAL itself and
 doesn't bump the LSN of the data page. (There's a reason the caller
 can't easily write the XLOG_HINT WAL itself.) So, we could introduce
 another flag needsWAL that would control whether we write the
 XLOG_HINT WAL or not (only applies with checksums on, of course).

That wouldn't work because it can't know the exact answer to that, but
the way the patch does this is already correct.

XLOG_HINT_WAL doesn't always write a WAL record; it only does so when
necessary. See XLogInsert().

Didn't fully understand other comments. Do we need an answer now?
My head is somewhere else.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-01-10 Thread Jeff Davis
  The checksums patch also introduces another behavior into
  SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record
  if checksums are enabled (to avoid torn page hazards). That's only
  necessary for changes where the caller does not write WAL itself and
  doesn't bump the LSN of the data page. (There's a reason the caller
  can't easily write the XLOG_HINT WAL itself.) So, we could introduce
  another flag needsWAL that would control whether we write the
  XLOG_HINT WAL or not (only applies with checksums on, of course).
 
 That wouldn't work because it can't know the exact answer to that, but
 the way the patch does this is already correct.

The name I chose was poor, but the flag should mean "the caller does not
write WAL associated with this action". If that flag is true, and if
checksums are enabled, then it would do an XLogInsert, which may write
WAL (depending on the LSN check).

That part of the patch is correct currently, but the problem is with
updates to PD_ALL_VISIBLE. Let me try to explain again:

Calls to PageSetAllVisible are not directly associated with a WAL
action, but they are associated with a call to MarkBufferDirty and do
have an exclusive content lock on the buffer. There's a torn page hazard
there for checksums, because without any WAL action associated with the
data page, there is no backup page.
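
To spell the hazard out: an 8K page spans several disk sectors, so a
crash can persist the sector holding the new checksum without the sector
holding the changed bit, and verification then fails even though only
one bit was logically changed. A toy demonstration (Python, zlib.crc32
as a stand-in, 4K sectors assumed; the real patch truncates to 16 bits):

import zlib

PAGE = 8192
SECTOR = 4096

old = bytearray(b'\xaa' * PAGE)    # page contents before the change
new = bytearray(old)
new[5000] |= 0x01                  # hint-style single-bit change
stored = zlib.crc32(bytes(new))    # checksum written into the header

# torn write: the first sector (with the new header/checksum) reaches
# disk, but the sector containing the changed bit does not
torn = new[:SECTOR] + old[SECTOR:]
assert zlib.crc32(bytes(torn)) != stored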

One idea might be to use SetBufferCommitInfoNeedsSave (which will write
WAL if necessary) instead of MarkBufferDirty. But that is unsafe,
because it might not actually mark the buffer dirty due to a race
(documented in SetBufferCommitInfoNeedsSave). So that's why I wanted to
refactor MarkBufferDirty/SetBufferCommitInfoNeedsSave, to separate the
concept that it may need a WAL record from the concept that actually
dirtying the page is optional.

Another idea is to make the WAL action for visibilitymap_set have
another item in the chain pointing to the heap buffer, and bump the heap
LSN.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-01-09 Thread Jeff Davis
On Tue, 2012-12-04 at 01:03 -0800, Jeff Davis wrote:
 For now, I rebased the patches against master, and did some very minor
 cleanup.

I think there is a problem here when setting PD_ALL_VISIBLE. I thought I
had analyzed that before, but upon review, it doesn't look right.
Setting PD_ALL_VISIBLE needs to be associated with a WAL action somehow,
and a bumping of the LSN, otherwise there is a torn page hazard.

The solution doesn't seem particularly difficult, but there are a few
strange aspects and I'm not sure exactly which path I should take.

First of all, the relationship between MarkBufferDirty and
SetBufferCommitInfoNeedsSave is a little confusing. The comment over
MarkBufferDirty is confusing because it says that the caller must have
an exclusive lock, or else bad data could be written. But that doesn't
have to do with marking the buffer dirty, that has to do with the data
page change you make while you are marking it dirty -- if it's a single
bit change, then there is no risk that I can see.

In the current code, the only real code difference between the two is
that SetBufferCommitInfoNeedsSave might fail to mark the buffer dirty if
there is a race. So, in the current code, we could actually combine the
two by passing a force flag (if true, behaves like MarkBufferDirty, if
false, behaves like SetBufferCommitInfoNeedsSave).

The checksums patch also introduces another behavior into
SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record
if checksums are enabled (to avoid torn page hazards). That's only
necessary for changes where the caller does not write WAL itself and
doesn't bump the LSN of the data page. (There's a reason the caller
can't easily write the XLOG_HINT WAL itself.) So, we could introduce
another flag needsWAL that would control whether we write the
XLOG_HINT WAL or not (only applies with checksums on, of course).

The reason for all of this is because the setting of PD_ALL_VISIBLE does
not fit MarkBufferDirty, because MarkBufferDirty does not write the
XLOG_HINT WAL and neither does the caller. But it also doesn't fit
SetBufferCommitInfoNeedsSave, because that is subject to a race. If
MarkBufferDirty had the signature:

  MarkBufferDirty(Buffer buffer, bool force, bool needsWAL)

then normal page changes would look like:
  MarkBufferDirty(buffer, true, false)
setting PD_ALL_VISIBLE would look like:
  MarkBufferDirty(buffer, true, true)
and setting a hint would look like:
  MarkBufferDirty(buffer, false, true)
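
A toy model of those three call patterns (Python; the stubs are mine,
not from the patch):

checksums_enabled = True
redo_ptr = 100                       # current checkpoint's redo pointer

def xlog_insert_hint(buf):
    # stub: pretend we inserted XLOG_HINT (with its FPI) and got a new LSN
    return redo_ptr + 1

def try_header_lock(buf):
    # stub: in the real code this can fail under concurrency
    return True

def mark_buffer_dirty(buf, force, needs_wal):
    if needs_wal and checksums_enabled and buf['lsn'] <= redo_ptr:
        buf['lsn'] = xlog_insert_hint(buf)   # at most one FPI per cycle
    if force:
        buf['dirty'] = True                  # MarkBufferDirty: guaranteed
    elif try_header_lock(buf):
        buf['dirty'] = True                  # SetBufferCommitInfoNeedsSave:
                                             # may lose the race, stay clean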


Another approach would be for the caller who sets PD_ALL_VISIBLE to
write WAL. But that requires inventing a new WAL record or chaining the
heap block onto the wal entry when doing visibilitymap_set (only
necessary when checksums are on). That seems somewhat of a hack, but
perhaps it's not too bad.

Also, I have another patch posted that is removing PD_ALL_VISIBLE
entirely, which is dampening my enthusiasm to do too much work that
might be thrown away. So right now, I'm leaning toward just adding the
heap buffer to the WAL chain during visibilitymap_set.

Thoughts?

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2012-12-20 Thread Martijn van Oosterhout
On Tue, Dec 18, 2012 at 04:06:02AM -0500, Greg Smith wrote:
 On 12/18/12 3:17 AM, Simon Riggs wrote:
 Clearly part of the response could involve pg_dump on the damaged
 structure, at some point.
 
 This is the main thing I wanted to try out more, once I have a
 decent corruption generation tool.  If you've corrupted a single
 record but can still pg_dump the remainder, that seems the best we
 can do to help people recover from that.  Providing some
 documentation on how to figure out what rows are in that block,
 presumably by using the contrib inspection tools, would be helpful
 too.

FWIW, Postgres is pretty resilient against corruption. I've maintained
a postgres db on a server with bad memory (don't ask) and since most
scrambling was in text strings you just got funny output sometimes. The
most common failure was a memory allocation failure as postgres tried
to copy a datum whose length field was corrupted.

If things went really wonky you could identify the bad tuples by hand
and then delete them by ctid. Regular reindexing helped too.
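
For reference, that by-hand deletion is just a ctid-qualified DELETE
once the bad tuple has been narrowed down; e.g. with psycopg2 (database,
table name, and ctid are all made up):

import psycopg2

conn = psycopg2.connect(dbname='mydb')     # hypothetical database
cur = conn.cursor()
# ctid located earlier by bisecting which rows error out when read
cur.execute("DELETE FROM damaged_table WHERE ctid = '(1234,5)'")
conn.commit()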

All I'm saying is that a mode where you log a warning but proceed
anyway is useful.  It won't pin down the exact error, but it will tell
you where to look and help find the non-obvious corruption (so you can
possibly fix it by hand).

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Enabling Checksums

2012-12-19 Thread Jeff Davis
On Tue, 2012-12-04 at 01:03 -0800, Jeff Davis wrote:
  4. We need some general performance testing to show whether this is
  insane or not.

I ran a few tests.

Test 1 - find worst-case overhead for the checksum calculation on write:

   fsync = off
   bgwriter_lru_maxpages = 0
   shared_buffers = 1024MB
   checkpoint_segments = 64
   autovacuum = off

The idea is to prevent interference from the bgwriter or autovacuum.
Also, I turn off fsync so that it's measuring the calculation overhead,
not the effort of actually writing to disk.

drop table foo;
create table foo(i int, j int) with (fillfactor=50);
create index foo_idx on foo(i);
insert into foo select g%25, -1 from generate_series(1,1000) g;
checkpoint;
-- during the following sleep, issue an OS sync
-- to make test results more consistent
select pg_sleep(30);
\timing on
update foo set j=-1 where i = 0;
select pg_sleep(2);
checkpoint;
update foo set j=-1 where i = 0;
select pg_sleep(2);
checkpoint;
update foo set j=-1 where i = 0;
select pg_sleep(2);
checkpoint;
\timing off

I am measuring the time of the CHECKPOINT command, not the update. The
update is just to dirty all of the pages (they should all be HOT
updates). Without checksums, it takes about 400ms. With checksums, it
takes about 500ms. That overhead is quite low, considering that the
bottleneck is almost always somewhere else (like actually writing to
disk).

Test 2 - worst-case overhead for calculating checksum while reading data

Same configuration as above. This time, just load a big table:

drop table foo;
create table foo(i int, j int) with (fillfactor=50);
insert into foo select g%25, -1 from generate_series(1,1000) g;
-- make sure hint bits and PD_ALL_VISIBLE are set everywhere
select count(*) from foo;
vacuum;
vacuum;
vacuum;
select relfilenode from pg_class where relname='foo';

Then shut down the server and restart it. Then do a "cat
data/base/12055/* > /dev/null" to get the table loaded into the OS
buffer cache. Then do:

\timing on
SELECT COUNT(*) FROM foo;

So, shared buffers are cold, but OS cache is warm. This should test the
overhead of going from the OS to shared buffers, which requires the
checksum calculation. Without checksums is around 820ms; with checksums
around 970ms. Again, this is quite reasonable, because I would expect
the bottleneck to be reading from the disk rather than the calculation
itself.

Test 3 - worst-case WAL overhead

For this test, I also left fsync off, because I didn't want to test the
effort to flush WAL (which shouldn't really be required for this test,
anyway). This was simpler:

  drop table foo;
  create table foo(i int, j int) with (fillfactor=50);
  insert into foo select g%25, -1 from generate_series(1,1000) g;
  checkpoint;
  select pg_sleep(1);
  checkpoint;
  select pg_sleep(30); -- do an OS sync while this is running
  \timing on
  SELECT COUNT(*) FROM foo;

Without checksums, it takes about 1000ms. With checksums, about 2350ms.
I also tested with checksums but without the CHECKPOINT commands above,
and it was also 1000ms.

This test is more plausible than the other two, so it's more likely to
be a real problem. So, the biggest cost of checksums is, by far, the
extra full-page images in WAL, which matches our expectations.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2012-12-18 Thread Simon Riggs
On 18 December 2012 02:21, Jeff Davis pg...@j-davis.com wrote:
 On Mon, 2012-12-17 at 19:14 +, Simon Riggs wrote:
 We'll need a way of expressing some form of corruption tolerance.
 zero_damaged_pages is just insane,

 The main problem I see with zero_damaged_pages is that it could
 potentially write out the zero page, thereby really losing your data if
 it wasn't already lost. (Of course, we document that you should have a
 backup first, but it's still dangerous). I assume that this is the same
 problem you are talking about.

I think we should discuss whether we accept my premise? Checksums will
actually detect more errors than we see now, and people will want to
do something about that. Returning to backup is one way of handling
it, but on a busy production system with pressure on, there is
incentive to implement a workaround, not a fix. It's not an easy call
to say "we've got 3 corrupt blocks, so I'm going to take the whole
system offline while I restore from backup".

If you do restore from backup, and the backup also contains the 3
corrupt blocks, what then?

Clearly part of the response could involve pg_dump on the damaged
structure, at some point.

 I suppose we could have a new ReadBufferMaybe function that would only
 be used by a sequential scan; and then just skip over the page if it's
 corrupt, depending on a GUC. That would at least allow sequential scans
 to (partially) work, which might be good enough for some data recovery
 situations. If a catalog index is corrupted, that could just be rebuilt.
 Haven't thought about the details, though.

Not sure if you're being facetious here or not. Mild reworking of the
logic for heap page access could cope with a NULL buffer response and
subsequent looping, which would allow us to run pg_dump against a
damaged table to allow data to be saved, keeping the file intact for
further analysis.

I'm suggesting we work a little harder than "your block is corrupt"
and give some thought to what the user will do next. Indexes are a
good case, because we can/should report the block error, mark the
index as invalid and then hint that it should be rebuilt.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2012-12-18 Thread Greg Smith

On 12/18/12 3:17 AM, Simon Riggs wrote:

Clearly part of the response could involve pg_dump on the damaged
structure, at some point.


This is the main thing I wanted to try out more, once I have a decent 
corruption generation tool.  If you've corrupted a single record but can 
still pg_dump the remainder, that seems the best we can do to help 
people recover from that.  Providing some documentation on how to figure 
out what rows are in that block, presumably by using the contrib 
inspection tools, would be helpful too.



Indexes are a good case, because we can/should report the block error, mark the
index as invalid and then hint that it should be rebuilt.


Marking a whole index invalid because there's one bad entry has enough 
downsides that I'm not sure how much we'd want to automate that.  Not 
having that index available could easily result in an effectively down 
system due to low performance.  The choices are uglier if it's backing a 
unique constraint.


In general, what I hope people will be able to do is switch over to 
their standby server, and then investigate further.  I think it's 
unlikely that people willing to pay for block checksums will only have 
one server.  Having some way to nail down if the same block is bad on a 
given standby seems like a useful interface we should offer, and it 
shouldn't take too much work.  Ideally you won't find the same 
corruption there.  I'd like a way to check the entirety of a standby for 
checksum issues, ideally run right after it becomes current.  It seems 
the most likely way to see corruption on one of those is to replicate a 
corrupt block.


There is no good way to make the poor soul who has no standby server 
happy here.  You're just choosing between bad alternatives.  The first 
block error is often just that--the first one, to be joined by others 
soon afterward.  My experience at how drives fail says the second error 
is a lot more likely after you've seen one.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2012-12-18 Thread Kevin Grittner
Greg Smith wrote:

 In general, what I hope people will be able to do is switch over to 
 their standby server, and then investigate further. I think it's 
 unlikely that people willing to pay for block checksums will only have 
 one server. Having some way to nail down if the same block is bad on a 
 given standby seems like a useful interface we should offer, and it 
 shouldn't take too much work. Ideally you won't find the same 
 corruption there. I'd like a way to check the entirety of a standby for 
 checksum issues, ideally run right after it becomes current. It seems 
 the most likely way to see corruption on one of those is to replicate a 
 corrupt block.
 
 There is no good way to make the poor soul who has no standby server 
 happy here. You're just choosing between bad alternatives. The first 
 block error is often just that--the first one, to be joined by others 
 soon afterward. My experience at how drives fail says the second error 
 is a lot more likely after you've seen one.

+1 on all of that.

-Kevin




Re: [HACKERS] Enabling Checksums

2012-12-18 Thread Greg Stark
 There is no good way to make the poor soul who has no standby server
 happy here. You're just choosing between bad alternatives. The first
 block error is often just that--the first one, to be joined by others
 soon afterward. My experience at how drives fail says the second error
 is a lot more likely after you've seen one.

For what it's worth, Oracle allows you to recover a specific block from
backups, including replaying the archive logs for that one block.

-- 
greg




Re: [HACKERS] Enabling Checksums

2012-12-18 Thread Jeff Davis
On Tue, 2012-12-18 at 08:17 +, Simon Riggs wrote:
 I think we should discuss whether we accept my premise? Checksums will
 actually detect more errors than we see now, and people will want to
 do something about that. Returning to backup is one way of handling
 it, but on a busy production system with pressure on, there is
 incentive to implement a workaround, not a fix. It's not an easy call
 to say "we've got 3 corrupt blocks, so I'm going to take the whole
 system offline while I restore from backup".

Up until now, my assumption has generally been that, upon finding the
corruption, the primary course of action is taking that server down
(hopefully you have a good replica), and do some kind of restore or sync
a new replica.

It sounds like you are exploring other possibilities.

  I suppose we could have a new ReadBufferMaybe function that would only
  be used by a sequential scan; and then just skip over the page if it's
  corrupt, depending on a GUC. That would at least allow sequential scans
  to (partially) work, which might be good enough for some data recovery
  situations. If a catalog index is corrupted, that could just be rebuilt.
  Haven't thought about the details, though.
 
 Not sure if you're being facetious here or not.

No. It was an incomplete thought (as I said), but sincere.

 Mild reworking of the
 logic for heap page access could cope with a NULL buffer response and
 subsequent looping, which would allow us to run pg_dump against a
 damaged table to allow data to be saved, keeping file intact for
 further analysis.

Right.

 I'm suggesting we work a little harder than "your block is corrupt"
 and give some thought to what the user will do next. Indexes are a
 good case, because we can/should report the block error, mark the
 index as invalid and then hint that it should be rebuilt.

Agreed; this applies to any derived data.

I don't think it will be very practical to keep a server running in this
state forever, but it might give enough time to reach a suitable
maintenance window.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2012-12-18 Thread Jeff Davis
On Tue, 2012-12-18 at 04:06 -0500, Greg Smith wrote:
 Having some way to nail down if the same block is bad on a 
 given standby seems like a useful interface we should offer, and it 
 shouldn't take too much work.  Ideally you won't find the same 
 corruption there.  I'd like a way to check the entirety of a standby for 
 checksum issues, ideally run right after it becomes current.  It seems 
 the most likely way to see corruption on one of those is to replicate a 
 corrupt block.

Part of the design is that pg_basebackup would verify checksums during
replication, so we should not replicate corrupt blocks (of course,
that's not implemented yet, so it's still a concern for now).

And we can also have ways to do background/offline checksum verification
with a separate utility.
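
The rough shape such a utility might take, scanning one relation segment
block by block (Python; verify_page() is left as a placeholder, since
the on-disk checksum format is exactly what's being settled in this
thread):

BLOCK_SIZE = 8192

def verify_page(page, block_no):
    # placeholder: recompute the page checksum, compare with the header copy
    return True

def scan_segment(path):
    bad = []
    with open(path, 'rb') as f:
        block_no = 0
        while True:
            page = f.read(BLOCK_SIZE)
            if len(page) < BLOCK_SIZE:
                break                      # partial/final read: stop
            if not verify_page(page, block_no):
                bad.append(block_no)
            block_no += 1
    return bad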

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2012-12-17 Thread Dimitri Fontaine
Jeff Davis pg...@j-davis.com writes:
 -A relation name
 -Corruption type (an entry from this list)
 -How many blocks to touch
 
 I'll just loop based on the count, randomly selecting a block each time 
 and messing with it in that way.

For the "messing with it" part, did you consider zzuf?

  http://caca.zoy.org/wiki/zzuf

 Does it make sense to have a separate executable (pg_corrupt) just for
 corrupting the data as a test? Or should it be part of a
 corruption-testing harness (pg_corruptiontester?), that introduces the
 corruption and then verifies that it's properly detected?

Maybe we need our own zzuf implementation, though.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] Enabling Checksums

2012-12-17 Thread Simon Riggs
On 14 December 2012 20:15, Greg Smith g...@2ndquadrant.com wrote:
 On 12/14/12 3:00 PM, Jeff Davis wrote:

 After some thought, I don't see much value in introducing multiple
 instances of corruption at a time. I would think that the smallest unit
 of corruption would be the hardest to detect, so introducing many of
 them in one pass makes it easier to detect.


 That seems reasonable.  It would eliminate a lot of issues with reproducing
 a fault too.  I can just print the impacted block number presuming it will
 show up in a log, and make it possible to override picking one at random
 with a command line input.


Discussing this makes me realise that we need a more useful response
than just "your data is corrupt", so the user can respond "yes, I know,
I'm trying to save what's left".

We'll need a way of expressing some form of corruption tolerance.
zero_damaged_pages is just insane; much better if we set
corruption_tolerance = N to allow us to skip N corrupt pages before
failing, with -1 meaning keep skipping forever. Settable by superuser
only.
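
As a toy model of the proposed semantics (Python, names hypothetical;
-1 keeps skipping forever):

corruption_tolerance = 3          # the proposed superuser-only GUC

def page_is_corrupt(page):
    return page is None           # stand-in for a real checksum check

def rows_on(page):
    return page                   # stand-in: treat a page as a list of rows

def heap_scan(pages):
    skipped = 0
    for page in pages:
        if page_is_corrupt(page):
            if corruption_tolerance == -1 or skipped < corruption_tolerance:
                skipped += 1
                continue          # answer becomes silently incomplete
            raise IOError('corrupt page, tolerance exhausted')
        for row in rows_on(page):
            yield row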

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2012-12-17 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 Discussing this makes me realise that we need a more useful response
 than just "your data is corrupt", so the user can respond "yes, I know,
 I'm trying to save what's left".

 We'll need a way of expressing some form of corruption tolerance.
 zero_damaged_pages is just insane; much better if we set
 corruption_tolerance = N to allow us to skip N corrupt pages before
 failing, with -1 meaning keep skipping forever. Settable by superuser
 only.

Define "skip".  Extra points if it makes sense for an index.  And what
about things like pg_clog pages?

regards, tom lane




Re: [HACKERS] Enabling Checksums

2012-12-17 Thread Simon Riggs
On 17 December 2012 19:29, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Discussing this makes me realise that we need a more useful response
 than just "your data is corrupt", so the user can respond "yes, I know,
 I'm trying to save what's left".

 We'll need a way of expressing some form of corruption tolerance.
 zero_damaged_pages is just insane; much better if we set
 corruption_tolerance = N to allow us to skip N corrupt pages before
 failing, with -1 meaning keep skipping forever. Settable by superuser
 only.

 Define "skip".

Allow data access, but accept that the answer is silently incomplete.
Not really much difference from zero_damaged_pages, which just removes
the error by removing any chance of repair or recovery, and then
silently gives the wrong answer.

 Extra points if it makes sense for an index.

I guess not, but that's no barrier to it working on heap pages only,
in my suggested use case.

 And what about things like pg_clog pages?

SLRUs aren't checksummed because of their lack of header space.
Perhaps that is a major point against the patch.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2012-12-17 Thread Jeff Davis
On Mon, 2012-12-17 at 19:14 +, Simon Riggs wrote:
 We'll need a way of expressing some form of corruption tolerance.
 zero_damaged_pages is just insane,

The main problem I see with zero_damaged_pages is that it could
potentially write out the zero page, thereby really losing your data if
it wasn't already lost. (Of course, we document that you should have a
backup first, but it's still dangerous.) I assume that this is the same
problem you are talking about.

I suppose we could have a new ReadBufferMaybe function that would only
be used by a sequential scan, and then just skip over the page if it's
corrupt, depending on a GUC. That would at least allow sequential scans
to (partially) work, which might be good enough for some data recovery
situations. If a catalog index is corrupted, that could just be rebuilt.
Haven't thought about the details, though.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2012-12-14 Thread Jeff Davis
On Wed, 2012-12-12 at 17:52 -0500, Greg Smith wrote:
 I can take this on, as part of the QA around checksums working as 
 expected.  The result would be a Python program; I don't have quite 
 enough time to write this in C or re-learn Perl to do it right now.  But 
 this won't be a lot of code.  If it's tossed one day as simply a 
 prototype for something more permanent, I think it's still worth doing now.
 
 The UI I'm thinking of for what I'm going to call pg_corrupt is a CLI 
 that asks for:
 
 - A relation name
 - Corruption type (an entry from this list)
 - How many blocks to touch
 
 I'll just loop based on the count, randomly selecting a block each time 
 and messing with it in that way.
 
 The randomness seed should be printed as part of the output, so that 
 it's possible to re-create the damage exactly later.  If the server doesn't 
 handle it correctly, we'll want to be able to replicate the condition it 
 choked on exactly later, just based on the tool's log output.
 
 Any other requests?

After some thought, I don't see much value in introducing multiple
instances of corruption at a time. I would think that the smallest unit
of corruption would be the hardest to detect, so introducing many of
them in one pass makes it easier to detect.

For example, if we introduce an all-ones page, and also transpose two
pages, the all-ones error might be detected even if the transpose error
is not being detected properly. And we'd not know that the transpose
error was not being detected, because the error appears as soon as it
sees the all-ones page.

Does it make sense to have a separate executable (pg_corrupt) just for
corrupting the data as a test? Or should it be part of a
corruption-testing harness (pg_corruptiontester?), that introduces the
corruption and then verifies that it's properly detected?

Regards,
Jeff Davis






Re: [HACKERS] Enabling Checksums

2012-12-14 Thread Greg Smith

On 12/14/12 3:00 PM, Jeff Davis wrote:

After some thought, I don't see much value in introducing multiple
instances of corruption at a time. I would think that the smallest unit
of corruption would be the hardest to detect, so introducing many of
them in one pass makes it easier to detect.


That seems reasonable.  It would eliminate a lot of issues with 
reproducing a fault too.  I can just print the impacted block number 
presuming it will show up in a log, and make it possible to override 
picking one at random with a command line input.



Does it make sense to have a separate executable (pg_corrupt) just for
corrupting the data as a test? Or should it be part of a
corruption-testing harness (pg_corruptiontester?), that introduces the
corruption and then verifies that it's properly detected?


Let me see what falls out of the coding; I don't think this part needs 
to get nailed down yet.  Building a corruption testing harness is going 
to involve a lot of creating new clusters and test data to torture. 
It's a different style of problem than injecting faults in the first place.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2012-12-12 Thread Greg Smith

On 12/5/12 6:49 PM, Simon Riggs wrote:

* Zeroing pages, making pages all 1s
* Transposing pages
* Moving chunks of data sideways in a block
* Flipping bits randomly
* Flipping data endianness
* Destroying particular catalog tables or structures


I can take this on, as part of the QA around checksums working as 
expected.  The result would be a Python program; I don't have quite 
enough time to write this in C or re-learn Perl to do it right now.  But 
this won't be a lot of code.  If it's tossed one day as simply a 
prototype for something more permanent, I think it's still worth doing now.


The UI I'm thinking of for what I'm going to call pg_corrupt is a CLI 
that asks for:


- A relation name
- Corruption type (an entry from this list)
- How many blocks to touch

I'll just loop based on the count, randomly selecting a block each time 
and messing with it in that way.


The randomness seed should be printed as part of the output, so that 
it's possible to re-create the damage exactly later.  If the server doesn't 
handle it correctly, we'll want to be able to replicate the condition it 
choked on exactly later, just based on the tool's log output.
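
To make that concrete, here is a rough skeleton of what I have in
mind -- Python, with the option names, the corruption table, and the
direct writes to a relation's data file (server stopped) all being
illustrative assumptions rather than settled decisions:

    import argparse, os, random

    BLCKSZ = 8192                         # assumes default block size

    def zero_block(f, blkno):
        f.seek(blkno * BLCKSZ)
        f.write(b'\x00' * BLCKSZ)

    CORRUPTIONS = {'zero': zero_block}    # other types would go here

    def main():
        p = argparse.ArgumentParser(prog='pg_corrupt')
        p.add_argument('relfile')         # path to relation data file
        p.add_argument('corruption', choices=CORRUPTIONS)
        p.add_argument('--blocks', type=int, default=1)
        p.add_argument('--block', type=int)   # override the random pick
        p.add_argument('--seed', type=int)
        args = p.parse_args()

        # Print the seed so the exact damage can be re-created later.
        seed = args.seed if args.seed is not None else random.randrange(1 << 31)
        rng = random.Random(seed)
        print("pg_corrupt: seed=%d" % seed)

        nblocks = os.path.getsize(args.relfile) // BLCKSZ
        with open(args.relfile, 'r+b') as f:
            for _ in range(args.blocks):
                blkno = args.block if args.block is not None \
                        else rng.randrange(nblocks)
                print("pg_corrupt: %s block %d" % (args.corruption, blkno))
                CORRUPTIONS[args.corruption](f, blkno)

    if __name__ == '__main__':
        main()

Keeping each corruption type as a small function keyed by name should
make it easy to fill in the rest of the list later.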


Any other requests?

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2012-12-06 Thread Kevin Grittner
Robert Haas wrote:
 Jeff Davis pg...@j-davis.com wrote:
 Or, I could write up a test framework in ruby or python, using
 the appropriate pg driver, and some not-so-portable shell
 commands to start and stop the server. Then, I can publish that
 on this list, and that would at least make it easier to test
 semi-manually and give greater confidence in pre-commit
 revisions.
 
 That latter approach is similar to what happened with SSI's
 isolation tester. It started out in Python, and then Heikki
 rewrote it in C.
 If Python/Ruby code is massively simpler to write than the C
 code, that might be a good way to start out. It'll be an aid to
 reviewers even if neither it nor any descendent gets committed.
 
 Frankly, I think some automated testing harness (written in C or
 Perl) that could do fault-injection tests as part of the
 buildfarm would be amazingly awesome. I'm drooling just thinking
 about it. But I guess that's getting ahead of myself.

There may be room for both.

My experience was that the dtester tool from Markus made it
relatively easy for me to hack up new tests which gave detailed
information about which permutations were behaving as desired,
which were known not to be covered, and which had regressions.
That speed of adding new tests and detail about improvements or
regressions allowed faster development than would have been
possible with the isolation tester that Heikki wrote in C.

On the other hand, dtester requires Python (in fact, I think it
requires Python version 2.x where x is 5 or greater), a requirement
which I don't think we want to add for builds. It wasn't very
compatible with the normal make check environment, either in how it
was run or in its output. And it was much slower than the isolation
test framework -- like by about an order of magnitude.

So for a completed product on which you want to test for
regressions, the isolation tester is much better. For a development
effort on the scale of SSI, I would want to have dtester or
something very like it available.

Neither one quite handles tests for all the types of concurrency
conditions that one might want. I had some idea how to add some
additional useful cases to dtester, and it didn't look outrageously
hard. I haven't really looked at how to do that in the isolation
tester, so I don't know how hard it would be there.

-Kevin




Re: [HACKERS] Enabling Checksums

2012-12-05 Thread Robert Haas
On Tue, Dec 4, 2012 at 6:17 PM, Jeff Davis pg...@j-davis.com wrote:
 Or, I could write up a test framework in ruby or python, using the
 appropriate pg driver, and some not-so-portable shell commands to start
 and stop the server. Then, I can publish that on this list, and that
 would at least make it easier to test semi-manually and give greater
 confidence in pre-commit revisions.

That latter approach is similar to what happened with SSI's isolation
tester.  It started out in Python, and then Heikki rewrote it in C.
If Python/Ruby code is massively simpler to write than the C code,
that might be a good way to start out.  It'll be an aid to reviewers
even if neither it nor any descendent gets committed.

Frankly, I think some automated testing harness (written in C or Perl)
that could do fault-injection tests as part of the buildfarm would be
amazingly awesome.  I'm drooling just thinking about it.  But I guess
that's getting ahead of myself.
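
Even the semi-manual harness Jeff describes is not much code. A
sketch, assuming psycopg2 as the driver and with the data directory,
relation file path, and table name all made up:

    import subprocess
    import psycopg2                        # "the appropriate pg driver"

    DATADIR = '/tmp/pgdata'                # hypothetical test cluster
    RELFILE = DATADIR + '/base/16384/16385'  # hypothetical relation file

    def pg_ctl(mode):
        # the "not-so-portable shell commands" to start/stop the server
        subprocess.check_call(['pg_ctl', '-D', DATADIR, '-w', mode])

    def corrupt_first_block():
        with open(RELFILE, 'r+b') as f:
            f.write(b'\xff' * 8192)        # stomp on block 0

    def test_detection():
        pg_ctl('stop')
        corrupt_first_block()
        pg_ctl('start')
        conn = psycopg2.connect(dbname='test')
        try:
            conn.cursor().execute('SELECT count(*) FROM some_table')
            print('FAIL: corruption was not detected')
        except psycopg2.Error as e:
            print('OK, detected: %s' % e)
        finally:
            conn.close()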

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Enabling Checksums

2012-12-05 Thread Simon Riggs
On 5 December 2012 23:40, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Dec 4, 2012 at 6:17 PM, Jeff Davis pg...@j-davis.com wrote:
 Or, I could write up a test framework in ruby or python, using the
 appropriate pg driver, and some not-so-portable shell commands to start
 and stop the server. Then, I can publish that on this list, and that
 would at least make it easier to test semi-manually and give greater
 confidence in pre-commit revisions.

 That latter approach is similar to what happened with SSI's isolation
 tester.  It started out in Python, and then Heikki rewrote it in C.
 If Python/Ruby code is massively simpler to write than the C code,
 that might be a good way to start out.  It'll be an aid to reviewers
 even if neither it nor any descendent gets committed.

 Frankly, I think some automated testing harness (written in C or Perl)
 that could do fault-injection tests as part of the buildfarm would be
 amazingly awesome.  I'm drooling just thinking about it.  But I guess
 that's getting ahead of myself.

Agreed, though we can restrict that to a few things at first.

* Zeroing pages, making pages all 1s
* Transposing pages
* Moving chunks of data sideways in a block
* Flipping bits randomly
* Flipping data endianness
* Destroying particular catalog tables or structures

etc

As a contrib module, so we can be sure to never install it. ;-)
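
For scale: the first few items on that list come out to a handful of
lines each, if the tool writes directly to a relation's data file with
the server shut down. A sketch, with the block size and all function
names being assumptions:

    BLCKSZ = 8192

    def make_all_ones(f, blkno):
        # "making pages all 1s"
        f.seek(blkno * BLCKSZ)
        f.write(b'\xff' * BLCKSZ)

    def transpose_blocks(f, blk_a, blk_b):
        # "Transposing pages": swap two whole blocks in place
        f.seek(blk_a * BLCKSZ); a = f.read(BLCKSZ)
        f.seek(blk_b * BLCKSZ); b = f.read(BLCKSZ)
        f.seek(blk_a * BLCKSZ); f.write(b)
        f.seek(blk_b * BLCKSZ); f.write(a)

    def shift_chunk(f, blkno, offset, length, delta):
        # "Moving chunks of data sideways in a block"
        f.seek(blkno * BLCKSZ)
        page = bytearray(f.read(BLCKSZ))
        page[offset + delta:offset + delta + length] = \
            page[offset:offset + length]
        f.seek(blkno * BLCKSZ)
        f.write(page[:BLCKSZ])    # clip in case the shift grew the page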

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2012-12-04 Thread Jeff Davis
On Mon, 2012-12-03 at 13:16 +, Simon Riggs wrote:
 On 3 December 2012 09:56, Simon Riggs si...@2ndquadrant.com wrote:
 
  I think the way forwards for this is...
 
  1. Break out the changes around inCommit flag, since that is just
  uncontroversial refactoring. I can do that. That reduces the noise
  level in the patch and makes it easier to understand the meaningful
  changes.
 
 Done.

Thank you.

One minor thing I noticed: it looks like nwaits is a useless variable.
Your original checksums patch used it to generate a warning, but now
that is gone. It's not throwing a compiler warning for some reason.

Regards,
Jeff Davis





