Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Andres Freund
On 2013-03-06 13:34:21 +0200, Heikki Linnakangas wrote:
 On 06.03.2013 10:41, Simon Riggs wrote:
 On 5 March 2013 18:02, Jeff Davis <pg...@j-davis.com> wrote:
 
 Fletcher is probably significantly faster than CRC-16, because I'm just
 doing int32 addition in a tight loop.
 
 Simon originally chose Fletcher, so perhaps he has more to say.
 
 IIRC the research showed Fletcher was significantly faster for only a
 small loss in error detection rate.
 
 It was sufficient to make our error detection > 1 million times
 better, possibly more. That seems sufficient to enable early detection
 of problems, since if we missed the first error, a second is very
 likely to be caught (etc). So I am assuming that we're trying to catch
 a pattern of errors early, rather than guarantee we can catch the very
 first error.
 
 Fletcher's checksum is good in general, I was mainly worried about
 truncating the Fletcher-64 into two 8-bit values. I can't spot any obvious
 weakness in it, but if it's indeed faster and as good as a straightforward
 Fletcher-16, I wonder why that method is not more widely used.

I personally am not that convinced that Fletcher is such a good choice given
that, AFAIK, it doesn't distinguish between all-zero and all-one runs that are
long enough.
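
[A minimal demonstration in C of the weakness being described -- illustrative
only, not code from the patch.  With sums reduced mod 255, a 32-bit word of
all zeros and one of all ones contribute the same value, because 2^32 - 1 is
divisible by 255:]

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Reduce a buffer of 32-bit words the way a mod-255 Fletcher-style
     * sum would.  0x00000000 and 0xFFFFFFFF both reduce to 0 mod 255. */
    static unsigned fold255(const uint32_t *w, size_t n)
    {
        uint64_t sum = 0;
        size_t i;

        for (i = 0; i < n; i++)
            sum += w[i] % 255;
        return (unsigned) (sum % 255);
    }

    int main(void)
    {
        uint32_t zeros[4] = {0, 0, 0, 0};
        uint32_t ones[4] = {0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF};

        /* Prints "0 0": runs of 0x00 and runs of 0xFF collide. */
        printf("%u %u\n", fold255(zeros, 4), fold255(ones, 4));
        return 0;
    }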

 Another thought is that perhaps something like CRC32C would be faster to
 calculate on modern hardware, and could be safely truncated to 16-bits using
 the same technique you're using to truncate the Fletcher's Checksum. Greg's
 tests showed that the overhead of CRC calculation is significant in some
 workloads, so it would be good to spend some time to optimize that. It'd be
 difficult to change the algorithm in a future release without breaking
 on-disk compatibility, so let's make sure we pick the best one.

I had implemented a noticeably faster CRC32 implementation somewhere
around 201005202227.49990.and...@anarazel.de . I have since repeatedly
seen pg's CRC32 implementation being a major limitation, so I think
brushing up that patch would be a good idea.
We might think about switching the polynomial for WAL at the same time,
given, as you say, CRC32c is available in hardware. The bigger problem
is probably stuff like the control file et al.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Garick Hamlin
On Wed, Mar 06, 2013 at 01:34:21PM +0200, Heikki Linnakangas wrote:
 On 06.03.2013 10:41, Simon Riggs wrote:
 On 5 March 2013 18:02, Jeff Davis <pg...@j-davis.com> wrote:

 Fletcher is probably significantly faster than CRC-16, because I'm just
 doing int32 addition in a tight loop.

 Simon originally chose Fletcher, so perhaps he has more to say.

 IIRC the research showed Fletcher was significantly faster for only a
 small loss in error detection rate.

 It was sufficient to make our error detection > 1 million times
 better, possibly more. That seems sufficient to enable early detection
 of problems, since if we missed the first error, a second is very
 likely to be caught (etc). So I am assuming that we're trying to catch
 a pattern of errors early, rather than guarantee we can catch the very
 first error.

 Fletcher's checksum is good in general, I was mainly worried about  
 truncating the Fletcher-64 into two 8-bit values. I can't spot any  
 obvious weakness in it, but if it's indeed faster and as good as a  
 straightforward Fletcher-16, I wonder why that method is not more widely  
 used.

I was wondering about the effectiveness of this resulting truncated hash
function as well.

 Another thought is that perhaps something like CRC32C would be faster to  
 calculate on modern hardware, and could be safely truncated to 16-bits  
 using the same technique you're using to truncate the Fletcher's  
 Checksum. Greg's tests showed that the overhead of CRC calculation is  
 significant in some workloads, so it would be good to spend some time to  
 optimize that. It'd be difficult to change the algorithm in a future  
 release without breaking on-disk compatibility, so let's make sure we  
 pick the best one.

If picking a CRC why not a short optimal one rather than truncate CRC32C?

I've been reading about optimal checksums for small messages for other 
reasons and found this paper quite good.

http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf

I was interested in small messages and small checksums so this paper may not be
as much help here.

Other than CRCs and Fletcher sums, Pearson hashing with a 16-bit block might
be worth considering.  Either a Pearson hash or a 16-bit CRC is small enough to
implement with a lookup table rather than a formula.
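
[For reference, a hedged sketch of a table-driven 16-bit Pearson hash along
the lines Garick mentions -- illustrative only; the permutation table and the
two starting values are arbitrary choices, not a vetted design:]

    #include <stdint.h>
    #include <stddef.h>

    static uint8_t T[256];

    /* Build an arbitrary fixed permutation of 0..255 for the Pearson
     * table; any permutation works, though quality varies with the
     * choice.  The LCG here just drives the demo shuffle. */
    static void pearson_init(void)
    {
        uint32_t s = 12345;
        int i;

        for (i = 0; i < 256; i++)
            T[i] = (uint8_t) i;
        for (i = 255; i > 0; i--)
        {
            int j;
            uint8_t tmp;

            s = s * 1103515245 + 12345;
            j = (int) ((s >> 16) % (uint32_t) (i + 1));
            tmp = T[i]; T[i] = T[j]; T[j] = tmp;
        }
    }

    /* 16-bit Pearson: two 8-bit passes started from different values,
     * one table lookup per byte per pass. */
    static uint16_t pearson16(const uint8_t *buf, size_t len)
    {
        uint8_t h1 = 0, h2 = 1;
        size_t i;

        for (i = 0; i < len; i++)
        {
            h1 = T[h1 ^ buf[i]];
            h2 = T[h2 ^ buf[i]];
        }
        return (uint16_t) (h1 | (h2 << 8));
    }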

I've been wondering what kinds of errors we expect.  Single bit flips?  Large
swaths of corrupted bytes?  Are we more worried about collisions (the odds
that total garbage has the same checksum) or the odds that we detect a flip
of n bits?  Since the message is large, and a write to the wrong location
seems about as likely as a bit flip, I would think a Pearson hash would do well.

Any choice seems like it would be a nice improvement for noticing a
storage-stack problem.  The differences would be subtle: can I accurately
estimate the odds of undetected corruption that occurred since the condition
was first detected, or does the checksum/hash perform poorly?

Garick



 -- 
 Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-hackers


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Andres Freund
On 2013-03-06 11:21:21 -0500, Garick Hamlin wrote:
 If picking a CRC why not a short optimal one rather than truncate CRC32C?

CRC32C is available in hardware since SSE4.2.
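
[A sketch of that hardware path, for the archives -- illustrative only: SSE4.2
exposes the CRC32 instruction through the _mm_crc32_* intrinsics in
nmmintrin.h.  Real code would need a runtime CPUID check and a software
fallback, and this must be compiled with -msse4.2.  Truncating to 16 bits
just keeps the low half of the final value:]

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <nmmintrin.h>

    /* CRC32C over a buffer using the SSE4.2 instruction, truncated to
     * 16 bits.  Sketch only: no CPUID check, no fallback. */
    uint16_t crc32c_hw16(const void *data, size_t len)
    {
        const unsigned char *p = data;
        uint64_t crc = 0xFFFFFFFF;

        while (len >= 8)
        {
            uint64_t chunk;

            memcpy(&chunk, p, 8);   /* avoid unaligned access */
            crc = _mm_crc32_u64(crc, chunk);
            p += 8;
            len -= 8;
        }
        while (len-- > 0)
            crc = _mm_crc32_u8((uint32_t) crc, *p++);
        return (uint16_t) ~crc;     /* keep the low 16 bits */
    }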

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Robert Haas
On Mon, Mar 4, 2013 at 3:13 PM, Heikki Linnakangas
<hlinnakan...@vmware.com> wrote:
 On 04.03.2013 20:58, Greg Smith wrote:

 There
 is no such thing as a stable release of btrfs, and no timetable for when
 there will be one. I could do some benchmarks of that but I didn't think
 they were very relevant. Who cares how fast something might run when it
 may not work correctly? btrfs might as well be /dev/null to me right
 now--sure it's fast, but maybe the data won't be there at all.

 This PostgreSQL patch hasn't seen any production use, either. In fact, I'd
 consider btrfs to be more mature than this patch. Unless you think that
 there will be some major changes for the worse in btrfs performance, it's
 perfectly valid and useful to compare the two.

 A comparison with ZFS would be nice too. That's mature, and has checksums.

We've had a few EnterpriseDB customers who have had fantastically
painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
ZFS block size to the PostgreSQL block size is supposed to make these
problems go away, but in my experience it does not have that effect.
So I think telling people who want checksums "go use ZFS" is a lot
like telling them "oh, I see you have a hangnail, we recommend that
you solve that by cutting your arm off with a rusty saw."

There may be good reasons to reject this patch.  Or there may not.
But I completely disagree with the idea that asking them to solve the
problem at the filesystem level is sensible.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Josh Berkus

 There may be good reasons to reject this patch.  Or there may not.
 But I completely disagree with the idea that asking them to solve the
 problem at the filesystem level is sensible.

Yes, can we get back to the main issues with the patch?

1) argument over whether the checksum is sufficient to detect most
errors, or if it will give users false confidence.

2) performance overhead.

Based on Smith's report, I consider (2) to be a deal-killer right now.
The level of overhead reported by him would prevent the users I work
with from ever employing checksums on production systems.

Specifically, writing checksums during a read-only query is a defect I
think is prohibitively bad.  When we first talked about this feature for
9.2, we were going to exclude hint bits from checksums, in order to
avoid this issue; what happened to that?

(FWIW, I still support the idea of moving hint bits to a separate
filehandle, as we do with the FSM, but clearly that's not happening for
9.3 ...)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Josh Berkus
Robert,

 We've had a few EnterpriseDB customers who have had fantastically
 painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
 ZFS block size to the PostgreSQL block size is supposed to make these
 problems go away, but in my experience it does not have that effect.
 So I think telling people who want checksums "go use ZFS" is a lot
 like telling them "oh, I see you have a hangnail, we recommend that
 you solve that by cutting your arm off with a rusty saw."

Wow, what platform are you using ZFS on?

(we have a half-dozen clients on ZFS ...)

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Robert Haas
On Wed, Mar 6, 2013 at 2:14 PM, Josh Berkus <j...@agliodbs.com> wrote:
 Based on Smith's report, I consider (2) to be a deal-killer right now.

I was pretty depressed by those numbers, too.

 The level of overhead reported by him would prevent the users I work
 with from ever employing checksums on production systems.

Agreed.

 Specifically, writing checksums during a read-only query is a defect I
 think is prohibitively bad.

That particular part doesn't bother me so much as some of the others -
but let's step back and look at the larger issue.  I suspect we can
all agree that the performance of this feature is terrible.  The
questions I think we should be asking are:

1. Are the problems fundamental, or things where we can reasonably
foresee future improvement?  The latter situation wouldn't bother me
very much even if the current situation is pretty bad, but if there's
no real hope of improvement, that's more of a problem.

2. Are the performance results sufficiently bad that we think this
would be more of a liability than an asset?

 When we first talked about this feature for
 9.2, we were going to exclude hint bits from checksums, in order to
 avoid this issue; what happened to that?

I don't think anyone ever thought that was a particularly practical
design.  I certainly don't.

 (FWIW, I still support the idea of moving hint bits to a separate
 filehandle, as we do with the FSM, but clearly that's not happening for
 9.3 ...)

Or, most likely, ever.  The whole benefit of hint bits is that the
information you need is available in the same bytes you have to read
anyway.  Moving the information to another fork (not filehandle) would
probably give up most of the benefit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Robert Haas
On Wed, Mar 6, 2013 at 6:00 PM, Josh Berkus <j...@agliodbs.com> wrote:
 We've had a few EnterpriseDB customers who have had fantastically
 painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
 ZFS block size to the PostgreSQL block size is supposed to make these
 problems go away, but in my experience it does not have that effect.
 So I think telling people who want checksums "go use ZFS" is a lot
 like telling them "oh, I see you have a hangnail, we recommend that
 you solve that by cutting your arm off with a rusty saw."

 Wow, what platform are you using ZFS on?

 (we have a half-dozen clients on ZFS ...)

Not us, customers.  But as to platform, I have yet to run across
anyone running ZFS on anything but Solaris.  I'd be interested to hear
your experiences.   Mine rhyme with sun a play dreaming.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Joshua D. Drake


On 03/06/2013 03:06 PM, Robert Haas wrote:


On Wed, Mar 6, 2013 at 6:00 PM, Josh Berkus <j...@agliodbs.com> wrote:

We've had a few EnterpriseDB customers who have had fantastically
painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
ZFS block size to the PostgreSQL block size is supposed to make these
problems go away, but in my experience it does not have that effect.
So I think telling people who want checksums "go use ZFS" is a lot
like telling them "oh, I see you have a hangnail, we recommend that
you solve that by cutting your arm off with a rusty saw."


Wow, what platform are you using ZFS on?

(we have a half-dozen clients on ZFS ...)


Not us, customers.  But as to platform, I have yet to run across
anyone running ZFS on anything but Solaris.  I'd be interested to hear
your experiences.   Mine rhyme with sun a play dreaming.


I would guess he meant on x86_64 or SPARC.

JD

--
Command Prompt, Inc. - http://www.commandprompt.com/
PostgreSQL Support, Training, Professional Services and Development
High Availability, Oracle Conversion, Postgres-XC
@cmdpromptinc - 509-416-6579


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Tom Lane
Andres Freund <and...@2ndquadrant.com> writes:
 On 2013-03-06 11:21:21 -0500, Garick Hamlin wrote:
 If picking a CRC why not a short optimal one rather than truncate CRC32C?

 CRC32C is available in hardware since SSE4.2.

I think that should be at most a fourth-order consideration, since we
are not interested solely in Intel hardware, nor do we have any portable
way of getting at such a feature even if the hardware has it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Craig Ringer
On 03/06/2013 07:34 PM, Heikki Linnakangas wrote:
 It'd be difficult to change the algorithm in a future release without
 breaking on-disk compatibility,
On-disk compatibility is broken with major releases anyway, so I don't
see this as a huge barrier.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Andres Freund
On 2013-03-07 08:37:40 +0800, Craig Ringer wrote:
 On 03/06/2013 07:34 PM, Heikki Linnakangas wrote:
  It'd be difficult to change the algorithm in a future release without
  breaking on-disk compatibility,
 On-disk compatibility is broken with major releases anyway, so I don't
 see this as a huge barrier.

Uh, pg_upgrade?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith

On 3/6/13 1:34 PM, Robert Haas wrote:

We've had a few EnterpriseDB customers who have had fantastically
painful experiences with PostgreSQL + ZFS.  Supposedly, aligning the
ZFS block size to the PostgreSQL block size is supposed to make these
problems go away, but in my experience it does not have that effect.


There are a couple of major tuning issues you have to get right for good 
ZFS performance, like its tendency to gobble more RAM than is 
necessarily appropriate for a PostgreSQL host.  If you nail down all 
those and carefully setup everything it can work OK.  When Sun had a 
bunch of good engineers working on the problem they certainly pulled it 
off.  I managed a 3TB database on a ZFS volume for a while myself. 
Being able to make filesystem snapshots cleanly and easily was very nice.


As for the write performance implications of COW, though, at a couple of 
points I was only able to keep that system ingesting data fast enough if 
I turned fsync off :(  It's not as if even ZFS makes all the filesystem 
issues the database worries about go away either.  Take a look at 
http://www.c0t0d0s0.org/archives/6071-No,-ZFS-really-doesnt-need-a-fsck.html 
as an example.  That should leave you with a healthy concern over ZFS 
handling of power interruption and lying drives.  [NTFS and ext3] have 
the same problem, but it has different effects that aren't as visible 
as in ZFS.  ext4 actually fixed this for most hardware though, and I 
believe ZFS still has the same uberblock concern.  ZFS reliability and 
its page checksums are good, but they're not magic for eliminating torn 
page issues.


Normally I would agree with Heikki's "let's wait a few years and see if 
the filesystem will take care of it" theory.  But for me, the "when do 
we get checksums?" clock started ticking in 2006 when ZFS popularized 
its implementation, and now it's gone off and it keeps ringing at new 
places.  I would love it if FreeBSD had caught a massive popularity 
wave in the last few years, so ZFS was running in a lot more places. 
Instead, what I keep seeing is Linux deployments with filesystem 
choices skewed toward the conservative.  Forget about the leading 
edge--I'd be happy if I could get one large customer to migrate off of 
ext3...


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Craig Ringer
On 03/07/2013 08:41 AM, Andres Freund wrote:
 On 2013-03-07 08:37:40 +0800, Craig Ringer wrote:
 On 03/06/2013 07:34 PM, Heikki Linnakangas wrote:
 It'd be difficult to change the algorithm in a future release without
 breaking on-disk compatibility,
 On-disk compatibility is broken with major releases anyway, so I don't
 see this as a huge barrier.
 Uh, pg_upgrade?
Yeah. I was thinking that pg_upgrade copes with a lot of
incompatibilities already, but this is lower-level. Darn.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Jim Nasby

On 3/4/13 7:04 PM, Daniel Farina wrote:

Corruption has easily occupied more than one person-month of time last
year for us.


Just FYI for anyone that's experienced corruption... we've looked into doing 
row-level checksums at work. The only challenge we ran into was how to check 
them when reading data back. I don't remember the details but there was an 
issue with doing this via SELECT rules. It would be possible if you were 
willing to put writable views on all your tables (which isn't actually as 
horrible as it sounds; it wouldn't be hard to write a function to automagically 
do that for you).


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Jim Nasby

On 3/6/13 1:14 PM, Josh Berkus wrote:



There may be good reasons to reject this patch.  Or there may not.
But I completely disagree with the idea that asking them to solve the
problem at the filesystem level is sensible.


Yes, can we get back to the main issues with the patch?

1) argument over whether the checksum is sufficient to detect most
errors, or if it will give users false confidence.

2) performance overhead.

Based on Smith's report, I consider (2) to be a deal-killer right now.
The level of overhead reported by him would prevent the users I work
with from ever employing checksums on production systems.


FWIW, the write workload most likely wouldn't be a problem for us. I am 
concerned about the reported 24-32% hit when reading back in from FS cache... 
that might kill this for us.

I'm working on doing a test to see how bad it actually is for us... but getting 
stuff like that done at work is like pulling teeth, so we'll see...


Specifically, writing checksums during a read-only query is a defect I
think is prohibitively bad.  When we first talked about this feature for
9.2, we were going to exclude hint bits from checksums, in order to
avoid this issue; what happened to that?

(FWIW, I still support the idea of moving hint bits to a separate
filehandle, as we do with the FSM, but clearly that's not happening for
9.3 ...)


+1


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith

On 3/6/13 6:34 AM, Heikki Linnakangas wrote:

Another thought is that perhaps something like CRC32C would be faster to
calculate on modern hardware, and could be safely truncated to 16-bits
using the same technique you're using to truncate the Fletcher's
Checksum. Greg's tests showed that the overhead of CRC calculation is
significant in some workloads, so it would be good to spend some time to
optimize that. It'd be difficult to change the algorithm in a future
release without breaking on-disk compatibility, so let's make sure we
pick the best one.


Simon sent over his first rev of this using a quick-to-compute 16-bit 
checksum as a reasonable trade-off, one that it's possible to do right 
now.  It's not optimal in a few ways, but it catches single-bit errors 
that are missed right now, and Fletcher-16 computes quickly and without 
a large amount of code.  It's worth double-checking that the code is 
using the best Fletcher-16 approach available.  I've started on that, 
but I'm working on your general performance concerns first, with the 
implementation that's already there.


From what I've read so far, I think picking Fletcher-16 instead of the 
main alternative, CRC-16-IBM AKA CRC-16-ANSI, is a reasonable choice. 
There's a good table showing the main possibilities here at 
https://en.wikipedia.org/wiki/Cyclic_redundancy_check


One day I hope that in-place upgrade learns how to do page format 
upgrades, with the sort of background conversion tools and necessary 
tracking metadata we've discussed for that work.  When that day comes, I 
would expect it to be straightforward to upgrade pages from 16 bit 
Fletcher checksums to 32 bit CRC-32C ones.  Ideally we would be able to 
jump on the CRC-32C train today, but there's nowhere to put all 32 bits. 
 Using a Fletcher 16 bit checksum for 9.3 doesn't prevent the project 
from going that way later though, once page header expansion is a solved 
problem.


The problem with running CRC32C in software is that the standard fast 
approach uses a "slicing" technique that requires a chunk of 
pre-computed data to be around, a moderately large lookup table.  I don't 
see that there's any advantage to having all that baggage around if 
you're just going to throw away half of the result anyway.  More on 
CRC32Cs in my next message.
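
[For the archives, a byte-at-a-time sketch of that table-driven approach --
illustrative only: one 256-entry table derived from the reflected Castagnoli
polynomial.  The slicing variants extend this idea with four or eight such
tables so several input bytes are processed per iteration:]

    #include <stdint.h>
    #include <stddef.h>

    static uint32_t crc32c_table[256];

    /* Build the lookup table from the reflected Castagnoli polynomial.
     * Call once before crc32c_sw(). */
    static void crc32c_init(void)
    {
        uint32_t i;

        for (i = 0; i < 256; i++)
        {
            uint32_t crc = i;
            int j;

            for (j = 0; j < 8; j++)
                crc = (crc >> 1) ^ ((crc & 1) ? 0x82F63B78 : 0);
            crc32c_table[i] = crc;
        }
    }

    /* One table lookup per input byte; the table is the pre-computed
     * baggage being discussed above. */
    static uint32_t crc32c_sw(const unsigned char *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFF;

        while (len-- > 0)
            crc = crc32c_table[(crc ^ *buf++) & 0xFF] ^ (crc >> 8);
        return ~crc;
    }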


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith

On 3/6/13 1:24 PM, Tom Lane wrote:

Andres Freund <and...@2ndquadrant.com> writes:

On 2013-03-06 11:21:21 -0500, Garick Hamlin wrote:

If picking a CRC why not a short optimal one rather than truncate CRC32C?



CRC32C is available in hardware since SSE4.2.


I think that should be at most a fourth-order consideration, since we
are not interested solely in Intel hardware, nor do we have any portable
way of getting at such a feature even if the hardware has it.


True, but that situation might actually improve.

The Castagnoli CRC-32C that's accelerated on the better Intel CPUs is 
also used to protect iSCSI and SCTP (a streaming protocol).  And there 
is an active project to use a CRC32C to checksum ext4 metadata blocks on 
Linux:  https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums 
https://groups.google.com/forum/?fromgroups=#!topic/linux.kernel/APKfoMzjgdY


Now, that project doesn't make the Postgres feature obsolete, because 
there's nowhere to put checksum data for every block on ext4 without 
whacking block alignment.  The filesystem can't make an extra 32 bits 
appear on every block any more than we can.  It's using a similar trick 
to the PG checksum feature, grabbing some empty space just for the 
metadata then shoving the CRC32C into there.  But the fact that this is 
going on means that there are already Linux kernel modules built with 
both software/hardware accelerated versions of the CRC32C function.  And 
the iSCSI/SCTP use cases mean it's not out of the question this will 
show up in other useful forms one day.  Maybe two years from now, there 
will be a common Linux library that autoconf can find to compute the CRC 
for us--with hardware acceleration when available, in software if not.


The first of those ext4 links above even discusses the exact sort of 
issue we're facing.  The author wonders if the easiest way to proceed 
for 16 bit checksums is to compute the CRC32C, then truncate it, simply 
because CRC32C creation is so likely to get hardware help one day.  I 
think that logic doesn't really apply to the PostgreSQL case as strongly 
though, as the time before we can expect a hardware-accelerated 
version to be available is much further off for us than it is for a 
Linux kernel developer.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Stark
On Wed, Mar 6, 2013 at 11:04 PM, Robert Haas <robertmh...@gmail.com> wrote:
 When we first talked about this feature for
 9.2, we were going to exclude hint bits from checksums, in order to
 avoid this issue; what happened to that?

 I don't think anyone ever thought that was a particularly practical
 design.  I certainly don't.

Really? I thought it was pretty much the consensus for a good while.

The main problem it ran into was that we kept turning up hint bits
that we didn't realize we had. Index line pointers turned out to have
hint bits, page headers have one, and so on. As long as it was just
the heap page per-tuple transaction hint bits it seemed plausible to
just skip them or move them all to a contiguous block. Once it
started to look like the checksumming code had to know about every
data structure on every page it seemed a bit daunting. But that wasn't
something we realized for quite a long time.



-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith
TL;DR summary:  on a system I thought was a fair middle of the road 
server, pgbench tests are averaging about a 2% increase in WAL writes 
and a 2% slowdown when I turn on checksums.  There are a small number of 
troublesome cases where that overhead rises to closer to 20%, an upper 
limit that's shown up in a few tests aiming to stress this feature now.


On 3/4/13 10:09 PM, Jeff Davis wrote:

= Test 2 - worst-case overhead for calculating checksum while reading data =

Jeff saw an 18% slowdown, I get 24 to 32%.  This one bothers me because
the hit is going to happen during the very common situation where data
is shuffling a lot between a larger OS cache and shared_buffers taking a
relatively small fraction.


I believe that test 1 and test 2 can be improved a little, if there is a
need. Right now we copy the page and then calculate the checksum on the
copy. If we instead calculate as we're copying, I believe it will make
it significantly faster.


It's good to know there's at least some ideas for optimizing this one 
further.  I think the situation where someone has:


shared_buffers < database < total RAM

is fairly common for web applications.  For people on Amazon EC2 
instances for example, giving out the performance tuning advice of "get 
a bigger instance until the database fits in RAM" works amazingly well. 
 If the hotspot of that data set fits in shared_buffers, those people 
will still be in good shape even with checksums enabled.  If the hot 
working set is spread out more randomly, though, it's not impossible to 
see how they could suffer regularly from this ~20% OS cache <-> 
shared_buffers movement penalty.


Regardless, Jeff's three cases are good synthetic exercises to see 
worst-case behavior, but they are magnifying small differences.  To see 
a more general case, I ran through a series of pgbench tests in its 
standard write mode.  In order to be useful, I ended up using a system 
with a battery-backed write cache, but with only a single drive 
attached.  I needed fsync to be fast to keep that from being the 
bottleneck.  But I wanted physical I/O to be slow.  I ran three test 
sets at various size/client loads:  one without the BBWC (which I kept 
here because it gives some useful scale to the graphs), one with the 
baseline 9.3 code, and one with checksums enabled on the cluster.  I did 
only basic postgresql.conf tuning:


 checkpoint_segments | 64
 shared_buffers      | 2GB

There's two graphs comparing sets attached, you can see that the 
slowdown of checksums for this test is pretty minor.  There is a clear 
gap between the two plots, but it's not a very big one, especially if 
you note how much difference a BBWC makes.


I put the numeric results into a spreadsheet, also attached.  There's so 
much noise in pgbench results that I found it hard to get a single 
number for the difference; they bounce around about +/-5% here. 
Averaging across everything gives a solid 2% drop when checksums are on 
which looked detectable above the noise.


Things are worse on the bigger data sets.  At the highest size I tested, 
the drop was more like 7%.  The two larger size / low client count 
results I got were really bad, 25% and 16% drops.  I think this is 
closing in on the range of things:  perhaps only 2% when most of your 
data fits in shared_buffers, more like 10% if your database is bigger, 
and in the worst case 20% is possible.  I don't completely trust those 
25/16% numbers though, I'm going to revisit that configuration.


The other thing I track now in pgbench-tools is how many bytes of WAL 
are written.  Since the total needs to be measured relative to work 
accomplished, the derived number that looks useful there is average 
bytes of WAL per transaction.  On smaller database this is around 6K, 
while larger databases topped out for me at around 22K WAL 
bytes/transaction.  Remember that the pgbench transaction is several 
statements.  Updates touch different blocks in pgbench_accounts, index 
blocks, and the small tables.


The WAL increase from checksumming is a bit more consistent than the TPS 
rates.  Many cases were 3 to 5%.  There was one ugly case where it hit 
30%, and I want to dig into where that came from more.  On average, 
again it was a 2% increase over the baseline.


Cases where you spew hint bit WAL data where before none were written 
(Jeff's test #3) remain a far worse performer than any of these.  Since 
pgbench does a VACUUM before starting, none of those cases were 
encountered here though.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
[Attachments: clients-sets.png, scaling-sets.png, Checksum-pgbench.xls (MS-Excel spreadsheet)]

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Daniel Farina
On Wed, Mar 6, 2013 at 8:17 PM, Greg Smith <g...@2ndquadrant.com> wrote:
 TL;DR summary:  on a system I thought was a fair middle of the road server,
 pgbench tests are averaging about a 2% increase in WAL writes and a 2%
 slowdown when I turn on checksums.  There are a small number of troublesome
 cases where that overhead rises to closer to 20%, an upper limit that's
 shown up in a few tests aiming to stress this feature now.

I have only done some cursory research, but CPU time of 20% seems to be
expected for InnoDB's CRC computation [0].  Although a galling number,
this comparison with other systems may be a way to see how much of
that overhead is avoidable or just the price of entry.  It's unclear
how this 20% CPU time compares to your whole-system results above, but
it's enough to suggest that nothing comes for (nearly) free.

[0]: http://mysqlha.blogspot.com/2009/05/innodb-checksum-performance.html

--
fdr


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-06 Thread Greg Smith

On 3/7/13 12:15 AM, Daniel Farina wrote:

I have only done some cursory research, but CPU time of 20% seems to be
expected for InnoDB's CRC computation [0].  Although a galling number,
this comparison with other systems may be a way to see how much of
that overhead is avoidable or just the price of entry.  It's unclear
how this 20% CPU time compares to your whole-system results above, but
it's enough to suggest that nothing comes for (nearly) free.


That does provide a useful measuring point:  how long does the 
computation take compared to the memcpy that moves the buffer around. 
It looks like they started out at 3.2 times the cost of a memcpy, and 
with enough optimization ended up at 1.27 times.


The important thing to keep in mind is that shared_buffers works pretty 
well at holding on to the most frequently accessed information.  A 
typical server I see will show pg_statio information suggesting 90%+ of 
block requests are coming from hits there, the rest misses suggesting a 
mix of OS cache and real disk reads.  Let's say 90% are hits, 5% are 
fetches at this 20% penalty, and 5% are real reads where the checksum 
time is trivial compared to physical disk I/O.  That works out to be a 
real average slowdown of 6%.  I think way more deployments are going to 
be like that case, which matches most of my pgbench runs, than the worse 
case workloads.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-05 Thread Simon Riggs
On 5 March 2013 01:04, Daniel Farina <dan...@heroku.com> wrote:

 Corruption has easily occupied more than one person-month of time last
 year for us.  This year to date I've burned two weeks, although
 admittedly this was probably the result of statistical clustering.
 Other colleagues of mine have probably put in a week or two in
 aggregate in this year to date.  The ability to quickly, accurately,
 and maybe at some later date proactively finding good backups to run
 WAL recovery from is one of the biggest strides we can make in the
 operation of Postgres.  The especially ugly cases are where the page
 header is not corrupt, so full page images can carry along malformed
 tuples...basically, when the corruption works its way into the WAL,
 we're in much worse shape.  Checksums would hopefully prevent this
 case, converting them into corrupt pages that will not be modified.

 It would be better yet if I could write tools to find the last-good
 version of pages, and so I think tight integration with Postgres will
 see a lot of benefits that would be quite difficult and non-portable
 when relying on file system checksumming.

 You are among the most well-positioned to make assessments of the cost
 of the feature, but I thought you might appreciate a perspective of
 the benefits, too.  I think they're large, and for me they are the
 highest pole in the tent for what makes Postgres stressful to operate
 as-is today.  It's a testament to the quality of the programming in
 Postgres that Postgres programming error is not the largest problem.

That's good perspective.

I think we all need to be clear that committing this patch also
commits the community (via the committer) to significant work and
responsibility around this, and my minimum assessment of it is 1 month
per year for 3-5 years, much of that on the committer. In effect
this will move time and annoyance experienced by users of Postgres
back onto developers of Postgres. That is where it should be, but the
effect will be large and easily noticeable, IMHO.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-05 Thread Heikki Linnakangas

On 04.03.2013 09:11, Simon Riggs wrote:

On 3 March 2013 18:24, Greg Smith <g...@2ndquadrant.com> wrote:


The 16-bit checksum feature seems functional, with two sources of overhead.
There's some CPU time burned to compute checksums when pages enter the
system.  And there's extra overhead for WAL logging hint bits.  I'll
quantify both of those better in another message.


It's crunch time. Do you and Jeff believe this patch should be
committed to Postgres core?

Are there objectors?


In addition to my hostility towards this patch in general, there are 
some specifics in the patch I'd like to raise (read out in a grumpy voice):


If you enable checksums, the free space map never gets updated in a 
standby. It will slowly drift to be completely out of sync with reality, 
which could lead to significant slowdown and bloat after failover.


Since the checksums are an all-or-nothing cluster-wide setting, the 
three extra flags in the page header, PD_CHECKSUMS1, PD_CHECKSUM2 and 
PD_HEADERCHECK, are not needed. Let's leave them out. That keeps the 
code simpler, and leaves the bits free for future use. If we want to 
enable such per-page setting in the future, we can add it later. For a 
per-relation scheme, they're not needed.



+ * The checksum algorithm is a modified Fletcher 64-bit (which is
+ * order-sensitive). The modification is because, at the end, we have two
+ * 64-bit sums, but we only have room for a 16-bit checksum. So, instead of
+ * using a modulus of 2^32 - 1, we use 2^8 - 1; making it also resemble a
+ * Fletcher 16-bit. We don't use Fletcher 16-bit directly, because processing
+ * single bytes at a time is slower.


How does the error detection rate of this compare with e.g CRC-16? Is 
there any ill effect from truncating the Fletcher sums like this?



+   /*
+    * Store the sums as bytes in the checksum. We add one to shift the range
+    * from 0..255 to 1..256, to make zero invalid for checksum bytes (which
+    * seems wise).
+    */
+   p8Checksum[0] = (sum1 % 255) + 1;
+   p8Checksum[1] = (sum2 % 255) + 1;


That's a bit odd. We don't avoid zero in the WAL CRC, and I don't recall 
seeing that in other checksum implementations either. 16-bits is not 
very wide for a checksum, and this eats about 1% of the space of valid 
values.


I can see that it might be a handy debugging aid to avoid 0. But there's 
probably no need to avoid 0 in both bytes, it seems enough to avoid a 
completely zero return value.


XLogCheckBuffer() and XLogCheckBufferNeedsBackup() read the page LSN 
without a lock. That's not atomic, so it could incorrectly determine 
that a page doesn't need to be backed up. We used to always hold an 
exclusive lock on the buffer when it's called, which prevents 
modifications to the LSN, but that's no longer the case.


Shouldn't SetBufferCommitInfoNeedsSave() check the BM_PERMANENT flag? I 
think it will generate WAL records for unlogged tables as it is.


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-05 Thread Jeff Davis

Thank you for the review.

On Tue, 2013-03-05 at 11:35 +0200, Heikki Linnakangas wrote:
 If you enable checksums, the free space map never gets updated in a 
 standby. It will slowly drift to be completely out of sync with reality, 
 which could lead to significant slowdown and bloat after failover.

Will investigate.

 Since the checksums are an all-or-nothing cluster-wide setting, the 
 three extra flags in the page header, PD_CHECKSUMS1, PD_CHECKSUM2 and 
 PD_HEADERCHECK, are not needed. Let's leave them out. That keeps the 
 code simpler, and leaves the bits free for future use. If we want to 
 enable such per-page setting in the future, we can add it later. For a 
 per-relation scheme, they're not needed.

They don't really need to be there, I just put them there because it
seemed wise if we ever want to allow online enabling/disabling of
checksums. But I will remove them.

 How does the error detection rate of this compare with e.g CRC-16? Is 
 there any ill effect from truncating the Fletcher sums like this?

I don't recall if I published these results or not, but I loaded a
table, and used pageinspect to get the checksums of the pages. I then
ran various GROUP BY queries to see if I could find any clustering
or stepping of the checksum values, and I could not. The distribution
seemed very uniform across the 255^2 space.

I tried to think of other problems, like missing errors in the high or
low bits of a word or a page (similar to the issue with mod 256
described below), but I couldn't find any. I'm not enough of an expert
to say more than that about the error detection rate.

Fletcher is probably significantly faster than CRC-16, because I'm just
doing int32 addition in a tight loop.
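
[As an illustration of the loop being described -- a minimal sketch, not the
patch's actual code -- assuming the page is treated as an array of 32-bit
words; for an 8K page both sums fit comfortably in 64 bits:]

    #include <stdint.h>
    #include <stddef.h>

    /* Modified Fletcher over a page of 32-bit words: plain integer
     * additions inside the loop, with the mod-255 folding deferred to
     * the end.  The second sum makes the result order-sensitive. */
    uint16_t fletcher_page_sketch(const uint32_t *words, size_t nwords)
    {
        uint64_t sum1 = 0, sum2 = 0;
        size_t i;

        for (i = 0; i < nwords; i++)
        {
            sum1 += words[i];   /* no modulus inside the loop */
            sum2 += sum1;
        }
        /* Fold each sum into 1..255 so neither checksum byte is zero. */
        return (uint16_t) (((sum1 % 255) + 1) |
                           (((sum2 % 255) + 1) << 8));
    }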

Simon originally chose Fletcher, so perhaps he has more to say.

 That's a bit odd. We don't avoid zero in the WAL crc, and I don't recall 
 seeing that in other checksum implementations either. 16-bits is not 
 very wide for a checksum, and this eats about 1% of the space of valid 
 values.
 
 I can see that it might be a handy debugging aid to avoid 0. But there's 
 probably no need to avoid 0 in both bytes, it seems enough to avoid a 
 completely zero return value.

http://en.wikipedia.org/wiki/Fletcher%27s_checksum

If you look at the section on Fletcher-16, it discusses the choice of
the modulus. If we used 256, then an error anywhere except the lowest
byte of a 4-byte word read from the page would be missed.
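
[Worked out, for reference: the sums are taken over 32-bit words, and since
2^8 = 256 ≡ 1 (mod 255), we also have 2^16 ≡ 1 and 2^24 ≡ 1 (mod 255), so
every byte of a word contributes to the sum after the mod-255 reduction.
With a modulus of 256 instead, 2^8 ≡ 0, and only the lowest byte of each
word would survive the reduction.]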

Considering that I was using only 255 values anyway, I thought I might
as well shift the values away from zero.

We could get slightly better by using all combinations. I also
considered chopping the 64-bit ints into 16-bit chunks and XORing them
together. But when I saw the fact that we avoided zero with the other
approach, I kind of liked it, and kept it.

 XLogCheckBuffer() and XLogCheckBufferNeedsBackup() read the page LSN 
 without a lock. That's not atomic, so it could incorrectly determine 
 that a page doesn't need to be backed up. We used to always hold an 
 exclusive lock on the buffer when it's called, which prevents 
 modifications to the LSN, but that's no longer the case.

Will investigate, but it sounds like a buffer header lock will fix it.

 Shouldn't SetBufferCommitInfoNeedsSave() check the BM_PERMANENT flag? I 
 think it will generate WAL records for unlogged tables as it is.

Yes, thank you.

Also, in FlushBuffer(), this patch moves the clearing of the
BM_JUST_DIRTIED bit to before the WAL flush. That seems to expand the
window during which a change to a page will prevent it from being marked
clean. Do you see any performance problem with that?

The alternative is to take the buffer header lock twice: once to get the
LSN, then WAL flush, then another header lock to clear BM_JUST_DIRTIED.
Not sure if that's better or worse. This goes back to Simon's patch, so
he may have a comment here, as well.

I'll post a new patch with these comments addressed, probably tomorrow
so that I have some time to self-review and do some basic testing.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 09:11, Simon Riggs wrote:

Are there objectors?


FWIW, I still think that checksumming belongs in the filesystem, not 
PostgreSQL. If you go ahead with this anyway, at the very least I'd like 
to see some sort of a comparison with e.g btrfs. How do performance, 
error-detection rate, and behavior on error compare? Any other metrics 
that are relevant here?


- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 10:36 +0200, Heikki Linnakangas wrote:
 On 04.03.2013 09:11, Simon Riggs wrote:
  Are there objectors?
 
 FWIW, I still think that checksumming belongs in the filesystem, not 
 PostgreSQL.

Doing checksums in the filesystem has some downsides. One is that you
need to use a copy-on-write filesystem like btrfs or zfs, which (by
design) will fragment the heap on random writes. If we're going to start
pushing people toward those systems, we will probably need to spend some
effort to mitigate this problem (aside: my patch to remove
PD_ALL_VISIBLE might get some new wind behind it).

There are also other issues, like what fraction of our users can freely
move to btrfs, and when. If it doesn't happen to be already there, you
need root to get it there, which has never been a requirement before.

I don't fundamentally disagree. We probably need to perform reasonably
well on btrfs in COW mode[1] regardless, because a lot of people will be
using it a few years from now. But there are a lot of unknowns here, and
I'm concerned about tying checksums to a series of things that will be
resolved a few years from now, if ever.

[1] Interestingly, you can turn off COW mode on btrfs, but you lose
checksums if you do.

  If you go ahead with this anyway, at the very least I'd like 
 to see some sort of a comparison with e.g btrfs. How do performance, 
 error-detection rate, and behavior on error compare? Any other metrics 
 that are relevant here?

I suspect it will be hard to get an apples-to-apples comparison here
because of the heap fragmentation, which means that a sequential scan is
not so sequential. That may be acceptable for some workloads but not for
others, so it would get tricky to compare. And any performance numbers
from an experimental filesystem are somewhat suspect anyway.

Also, it's a little more challenging to test corruption on a filesystem,
because you need to find the location of the file you want to corrupt,
and corrupt it out from underneath the filesystem.

Greg may have more comments on this matter.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Greg Smith

On 3/4/13 2:11 AM, Simon Riggs wrote:

It's crunch time. Do you and Jeff believe this patch should be
committed to Postgres core?


I want to see a GUC to allow turning this off, to avoid the problem I 
saw where a non-critical header corruption problem can cause an entire 
page to be unreadable.  A variation on that capable of turning this off 
altogether, as Craig suggested, is a good idea too.


Those are both simple fixes, and I would be pleased to see this 
committed at that point.


I'll write up a long discussion of filesystem trends and why I think 
this is more relevant than ever if that's the main objection now.  There 
is no such thing as a stable release of btrfs, and no timetable for when 
there will be one.  I could do some benchmarks of that but I didn't 
think they were very relevant.  Who cares how fast something might run 
when it may not work correctly?  btrfs might as well be /dev/null to me 
right now--sure it's fast, but maybe the data won't be there at all. 
How long has it taken the Linux kernel to reach the point it handles 
write barriers and fsync correctly?  It does not give me a lot of 
confidence that now is the time they'll suddenly start executing on 
database filesystem mechanics perfectly.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-02-25 at 01:30 -0500, Greg Smith wrote:
 Attached are some bit-rot updates to the checksums patches.  The 
 replace-tli one still works fine.  I fixed a number of conflicts in the 
 larger patch.  The one I've attached here isn't 100% to project 
 standards--I don't have all the context diff tools setup yet for 
 example.  I expect to revise this more now that I've got the whole week 
 cleared to work on CF submissions.

Thank you for the rebase. I redid the rebase myself and came up with
essentially the same result, but there was an additional problem that
needed fixing after the materialized view patch. 

I will post a new version tonight that includes those fixes as well as
something to address these recent comments (probably just another GUC).
Further comment in another reply.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 20:58, Greg Smith wrote:

There
is no such thing as a stable release of btrfs, and no timetable for when
there will be one. I could do some benchmarks of that but I didn't think
they were very relevant. Who cares how fast something might run when it
may not work correctly? btrfs might as well be /dev/null to me right
now--sure it's fast, but maybe the data won't be there at all.


This PostgreSQL patch hasn't seen any production use, either. In fact, 
I'd consider btrfs to be more mature than this patch. Unless you think 
that there will be some major changes for the worse in btrfs 
performance, it's perfectly valid and useful to compare the two.


A comparison with ZFS would be nice too. That's mature, and has checksums.

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 11:52 +0800, Craig Ringer wrote:
 I also suspect that at least in the first release it might be desirable
 to have an option that essentially says "something's gone horribly wrong
 and we no longer want to check or write checksums; we want a
 non-checksummed DB that can still read our data from before we turned
 checksumming off."  Essentially, a way for someone who's trying
 checksumming in production after their staging tests worked out OK to
 abort and go back to the non-checksummed case without having to do a
 full dump and reload.

A recovery option to extract data sounds like a good idea, but I don't
want to go as far as you are suggesting here.

An option to ignore checksum failures (while still printing the
warnings) sounds like all we need here. I think Greg's point that the
page might be written out again (hiding the corruption) is a very good
one, but the same is true for zero_damaged_pages. So we can just still
allow the writes to proceed (including setting the checksum on write),
and the system should be as available as it would be without checksums.

Regards,
Jeff Davis



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 22:13 +0200, Heikki Linnakangas wrote:
 On 04.03.2013 20:58, Greg Smith wrote:
  There
  is no such thing as a stable release of btrfs, and no timetable for when
  there will be one. I could do some benchmarks of that but I didn't think
  they were very relevant. Who cares how fast something might run when it
  may not work correctly? btrfs might as well be /dev/null to me right
  now--sure it's fast, but maybe the data won't be there at all.
 
 This PostgreSQL patch hasn't seen any production use, either. In fact, 
 I'd consider btrfs to be more mature than this patch. Unless you think 
 that there will be some major changes for the worse in btrfs 
 performance, it's perfectly valid and useful to compare the two.
 
 A comparison with ZFS would be nice too. That's mature, and has checksums.

Is there any reason why we can't have both postgres and filesystem
checksums? The same user might not want both (or might, if neither are
entirely trustworthy yet), but I think it's too early to declare one as
the "right" solution and the other not. Even with a stable btrfs, I
pointed out a number of reasons users might not want it, and reasons
that the project should not depend on it.

Numbers are always nice, but it takes a lot of effort to come up with
them. What kind of numbers are you looking for, and how *specifically*
will those numbers affect the decision?

If btrfs with checksums is 10% slower than ext4 with postgres checksums,
does that mean we should commit the postgres checksums?

On the other side of the coin, if btrfs with checksums is exactly the
same speed as ext4 with no postgres checksums (i.e. checksums are free
if we use btrfs), does that mean postgres checksums should be rejected?

Regards,
Jeff Davis




-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 18:00, Jeff Davis wrote:

On Mon, 2013-03-04 at 10:36 +0200, Heikki Linnakangas wrote:

On 04.03.2013 09:11, Simon Riggs wrote:

Are there objectors?


FWIW, I still think that checksumming belongs in the filesystem, not
PostgreSQL.


Doing checksums in the filesystem has some downsides. One is that you
need to use a copy-on-write filesystem like btrfs or zfs, which (by
design) will fragment the heap on random writes.


Yeah, fragmentation will certainly hurt some workloads. But how badly, 
and which workloads, and how does that compare with the work that 
PostgreSQL has to do to maintain the checksums? I'd like to see some 
data on those things.



There are also other issues, like what fraction of our users can freely
move to btrfs, and when. If it doesn't happen to be already there, you
need root to get it there, which has never been a requirement before.


If you're serious enough about your data that you want checksums, you 
should be able to choose your filesystem.



  If you go ahead with this anyway, at the very least I'd like
to see some sort of a comparison with e.g btrfs. How do performance,
error-detection rate, and behavior on error compare? Any other metrics
that are relevant here?


I suspect it will be hard to get an apples-to-apples comparison here
because of the heap fragmentation, which means that a sequential scan is
not so sequential. That may be acceptable for some workloads but not for
others, so it would get tricky to compare.


An apples-to-apples comparison is to run the benchmark and see what 
happens. If it gets fragmented as hell on btrfs, and performance tanks 
because of that, then that's your result. If avoiding fragmentation is 
critical to the workload, then with btrfs you'll want to run the 
defragmenter in the background to keep it in order, and factor that into 
the test case.


I realize that performance testing is laborious. But we can't skip it 
and assume that the patch performs fine, because it's hard to benchmark.


- Heikki




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Sun, 2013-03-03 at 22:18 -0500, Greg Smith wrote:
 As for a design of a GUC that might be useful here, the option itself 
 strikes me as being like archive_mode in its general use.  There is an 
 element of parameters like wal_sync_method or enable_cassert though, 
 where the options available vary depending on how you built the cluster. 
   Maybe name it checksum_level with options like this:
 
 off:  only valid option if you didn't enable checksums with initdb
 enforcing:  full checksum behavior as written right now.
 unvalidated:  broken checksums on reads are ignored.

I think GUCs should be orthogonal to initdb settings. If nothing else,
it's extra effort to get initdb to write the right postgresql.conf.

A single new GUC that prevents checksum failures from causing an error
seems sufficient to address the concerns you, Dan, and Craig raised.

We would still calculate the checksum and print the warning; and then
pass it through the rest of the header checks. If the header checks
pass, then it proceeds. If the header checks fail, and if
zero_damaged_pages is off, then it would still generate an error (as
today).

So: ignore_checksum_failures = on|off ?
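
To make that concrete, the intended read-path logic would look roughly
like this (a sketch with hypothetical names, not the actual patch code):

    /* Sketch of the proposed behavior; names are illustrative. */
    if (!PageChecksumIsValid(page, blkno))      /* hypothetical check */
    {
        if (ignore_checksum_failures)
            ereport(WARNING,
                    (errmsg("page verification failed, continuing anyway")));
        else
            ereport(ERROR,
                    (errmsg("page verification failed")));
    }

    /* The pre-existing header sanity checks still apply either way. */
    if (!PageHeaderIsValid(page))
    {
        if (zero_damaged_pages)
        {
            ereport(WARNING, (errmsg("invalid page header; zeroing page")));
            MemSet(page, 0, BLCKSZ);
        }
        else
            ereport(ERROR, (errmsg("invalid page header")));
    }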

 The main tricky case I see in that is where you read in a page with a 
 busted checksum using "unvalidated".  Ideally you wouldn't write such a 
 page back out again, because it's going to hide that it's corrupted in 
 some way already.  How to enforce that though?  Perhaps "unvalidated" 
 could only be allowed in a read-only transaction?

That's a good point. But we already have zero_damaged_pages, which does
something similar. And it's supposed to be a recovery option to get the
data out rather than something to run in online mode. It will still
print the warning, so it won't completely hide the corruption.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 13:58 -0500, Greg Smith wrote:
 On 3/4/13 2:11 AM, Simon Riggs wrote:
  It's crunch time. Do you and Jeff believe this patch should be
  committed to Postgres core?
 
 I want to see a GUC to allow turning this off, to avoid the problem I 
 saw where a non-critical header corruption problem can cause an entire 
 page to be unreadable.  A variation on that capable of turning this off 
 altogether, as Craig suggested, is a good idea too.

Based on your comments as well those of Dan and Craig, I am leaning
toward a GUC that causes a checksum failure to be ignored. It will still
emit the checksum failure warning, but proceed.

That will then fall through to the normal header checks we've always
had, and the same zero_damaged_pages option. So, to get past a really
corrupt page, you'd need to set ignore_checksum_failure and
zero_damaged_pages.

 I'll write up a long discussion of filesystem trends and why I think 
 this is more relevant than ever if that's the main objection now.  There 
 is no such thing as a stable release of btrfs, and no timetable for when 
 there will be one.  I could do some benchmarks of that but I didn't 
 think they were very relevant.  Who cares how fast something might run 
 when it may not work correctly?  btrfs might as well be /dev/null to me 
 right now--sure it's fast, but maybe the data won't be there at all. 
 How long has it taken the Linux kernel to reach the point it handles 
 write barriers and fsync correctly?  It does not give me a lot of 
 confidence that now is the time they'll suddenly start executing on 
 database filesystem mechanics perfectly.

I have a similar viewpoint here. It will take significant effort to come
up with anything, and I'm not sure how meaningful the numbers would be.
Even if btrfs is great, this feature is not mutually exclusive with
btrfs:
  * users might not have easy access to run the filesystem
  * they might not trust it
  * they might get poor performance numbers
  * postgres checksums might provide a good test of btrfs checksums, and
vice-versa, until both are stable

Additionally, I don't like the idea of depending so heavily on what
linux is doing. If there are performance problems that affect postgres,
will they fix them? Will they introduce new ones? Are there a zillion
tuneable options that a new user has to get right in order to run
postgres efficiently, and will poor settings mean a bunch of "postgres
is slow" blog posts?

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jim Nasby

On 3/4/13 10:00 AM, Jeff Davis wrote:

On Mon, 2013-03-04 at 10:36 +0200, Heikki Linnakangas wrote:

On 04.03.2013 09:11, Simon Riggs wrote:

 Are there objectors?


FWIW, I still think that checksumming belongs in the filesystem, not
PostgreSQL.

Doing checksums in the filesystem has some downsides.


Additionally, no filesystem I'm aware of checksums the data in the filesystem 
cache. A PG checksum would.

I'll also mention that this debate has been had in the past. The time to object 
to the concept of a checksumming feature was a long time ago, before a ton of 
development effort went into this... :(




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jim Nasby

On 3/4/13 2:48 PM, Jeff Davis wrote:

On Mon, 2013-03-04 at 13:58 -0500, Greg Smith wrote:

On 3/4/13 2:11 AM, Simon Riggs wrote:

 It's crunch time. Do you and Jeff believe this patch should be
 committed to Postgres core?


I want to see a GUC to allow turning this off, to avoid the problem I
saw where a non-critical header corruption problem can cause an entire
page to be unreadable.  A variation on that capable of turning this off
altogether, as Craig suggested, is a good idea too.

Based on your comments as well those of Dan and Craig, I am leaning
toward a GUC that causes a checksum failure to be ignored. It will still
emit the checksum failure warning, but proceed.


I suggest we paint that GUC along the lines of checksum_failure_log_level, 
defaulting to ERROR. That way if someone wanted to completely bury the elogs down at, say, DEBUG, 
they could.

My $2.98 (inflation adjusted).




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote:
 Yeah, fragmentation will certainly hurt some workloads. But how badly, 
 and which workloads, and how does that compare with the work that 
 PostgreSQL has to do to maintain the checksums? I'd like to see some 
 data on those things.

I think we all would. Btrfs will be a major filesystem in a few years,
and we should be ready to support it.

Unfortunately, it's easier said than done. What you're talking about
seems like a significant benchmark report that encompasses a lot of
workloads. And there's a concern that a lot of it will be invalidated if
they are still improving the performance of btrfs.

 If you're serious enough about your data that you want checksums, you 
 should be able to choose your filesystem.

I simply disagree. I am targeting my feature at casual users. They may
not have a lot of data or a dedicated DBA, but the data they do have
might be very important transactional data.

And right now, if they take a backup of their data, it will contain all
of the corruption from the original. And since corruption is silent
today, then they would probably think the backup is fine, and may delete
the previous good backups.

 An apples-to-apples comparison is to run the benchmark and see what 
 happens. If it gets fragmented as hell on btrfs, and performance tanks 
 because of that, then that's your result. If avoiding fragmentation is 
 critical to the workload, then with btrfs you'll want to run the 
 defragmenter in the background to keep it in order, and factor that into 
 the test case.

Again, easier said than done. To get real fragmentation problems, the
data set needs to be huge, and we need to reach a steady state of this
background defrag process, and a million other things.

 I realize that performance testing is laborious. But we can't skip it 
 and assume that the patch performs fine, because it's hard to benchmark.

You aren't asking me to benchmark the patch in question. You are asking
me to benchmark a filesystem that very few people actually run postgres
on in production. I don't think that's a reasonable requirement.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 22:51, Jim Nasby wrote:

The time to
object to the concept of a checksumming feature was a long time ago,
before a ton of development effort went into this... :(


I did. Development went ahead anyway.

- Heikki




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 22:40, Jeff Davis wrote:

Is there any reason why we can't have both postgres and filesystem
checksums?


Of course not. But if we can get away without checksums in Postgres, 
that's better, because then we don't need to maintain that feature in 
Postgres. If the patch gets committed, it's not mission accomplished. 
There will be discussion and need for further development on things like 
what to do if you get a checksum failure, patches to extend the 
checksums to cover things like the clog and other non-data files and so 
forth. And it's an extra complication that will need to be taken into 
account when developing other new features; in particular, hint bit 
updates need to write a WAL record. Even if you have all the current 
hint bits covered, it's an extra hurdle for future patches that might 
want to have hint bits in e.g new index access methods.



The same user might not want both (or might, if neither are
entirely trustworthy yet), but I think it's too early to declare one as
the right solution and the other not. Even with btrfs stable, I
pointed out a number of reasons users might not want it, and reasons
that the project should not depend on it.


The PostgreSQL project would not be depending on it, any more than the 
project depends on filesystem snapshots for backup purposes, or the OS 
memory manager for caching.



Numbers are always nice, but it takes a lot of effort to come up with
them. What kind of numbers are you looking for, and how *specifically*
will those numbers affect the decision?


Benchmark of vanilla PostgreSQL, PostgreSQL + this patch, and PostgreSQL 
running on btrfs or ZFS with data checksums enabled. DBT-2 might be a 
good candidate, as it's I/O heavy. That would be a good general test; in 
addition it would be good to see a benchmark of the worst case scenario 
for the fragmentation you're expecting to see on btrfs, as well as a 
worst case scenario for the extra WAL traffic with the patch.



If btrfs with checksums is 10% slower than ext4 with postgres checksums,
does that mean we should commit the postgres checksums?


In my opinion, a 10% gain would not be worth it, and we should not 
commit in that case.



On the other side of the coin, if btrfs with checksums is exactly the
same speed as ext4 with no postgres checksums (i.e. checksums are free
if we use btrfs), does that mean postgres checksums should be rejected?


Yes, I think so. I'm sure at least some others will disagree; Greg 
already made it quite clear that he doesn't care how the performance of 
this compares with btrfs.


- Heikki




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread k...@rice.edu
On Mon, Mar 04, 2013 at 01:00:09PM -0800, Jeff Davis wrote:
 On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote:
  If you're serious enough about your data that you want checksums, you 
  should be able to choose your filesystem.
 
 I simply disagree. I am targeting my feature at casual users. They may
 not have a lot of data or a dedicated DBA, but the data they do have
 might be very important transactional data.
 
 And right now, if they take a backup of their data, it will contain all
 of the corruption from the original. And since corruption is silent
 today, then they would probably think the backup is fine, and may delete
 the previous good backups.
 
+1

There is no reasonable availability of checksum capable filesystems across
PostgreSQL's supported OSes. It really needs to be available in core.

Regards,
Ken




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 23:00, Jeff Davis wrote:

On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote:

Yeah, fragmentation will certainly hurt some workloads. But how badly,
and which workloads, and how does that compare with the work that
PostgreSQL has to do to maintain the checksums? I'd like to see some
data on those things.


I think we all would. Btrfs will be a major filesystem in a few years,
and we should be ready to support it.


Perhaps we should just wait a few years? If we suspect that this becomes 
obsolete in a few years, it's probably better to just wait, than add a 
feature we'll have to keep maintaining. Assuming it gets committed 
today, it's going to take a year or two for 9.3 to get released and all 
the bugs ironed out, anyway.


- Heikki




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jim Nasby

On 3/4/13 3:00 PM, Heikki Linnakangas wrote:

On 04.03.2013 22:51, Jim Nasby wrote:

The time to
object to the concept of a checksumming feature was a long time ago,
before a ton of development effort went into this... :(


I did. Development went ahead anyway.


Right, because the community felt that this was valuable enough to do 
regardless of things like FS checksumming. But now you're bringing the issue up 
yet again, this time after a large amount of time has been invested.

I know that you're doing what you feel is best for the project, but in this 
case the community didn't agree with your view. Raising the same objection at 
this point is not productive.




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Heikki Linnakangas

On 04.03.2013 22:51, Jim Nasby wrote:

Additionally, no filesystem I'm aware of checksums the data in the
filesystem cache. A PG checksum would.


The patch says:


+ * IMPORTANT NOTE -
+ * The checksum is not valid at all times on a data page. We set it before we
+ * flush page/buffer, and implicitly invalidate the checksum when we modify the
+ * page. A heavily accessed buffer might then spend most of its life with an
+ * invalid page checksum, so testing random pages in the buffer pool will tell
+ * you nothing. The reason for this is that the checksum detects otherwise
+ * silent errors caused by the filesystems on which we rely. We do not protect
+ * buffers against uncorrectable memory errors, since these have a very low
+ * measured incidence according to research on large server farms,
+ * http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf, discussed 2010/12/22.


It's still true that it does in fact cover pages in the filesystem 
cache, but apparently that's not important.


- Heikki




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Stephen Frost
* Heikki Linnakangas (hlinnakan...@vmware.com) wrote:
 Perhaps we should just wait a few years? If we suspect that this
 becomes obsolete in a few years, it's probably better to just wait,
 than add a feature we'll have to keep maintaining. Assuming it gets
 committed today, it's going to take a year or two for 9.3 to get
 released and all the bugs ironed out, anyway.

For my 2c, I don't see it being obsolete in a few years, even if every
existing FS out there gets checksumming (which won't happen, imv).
It's quite clear that there is still ongoing development in the
filesystem space and any new software will have its own set of bugs.
Having a layer of protection built into PG will undoubtedly be a good
thing and will be used by our users.

Thanks,

Stephen




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Craig Ringer
On 03/05/2013 04:48 AM, Jeff Davis wrote:
 We would still calculate the checksum and print the warning; and then
 pass it through the rest of the header checks. If the header checks
 pass, then it proceeds. If the header checks fail, and if
 zero_damaged_pages is off, then it would still generate an error (as
 today).

 So: ignore_checksum_failures = on|off ?
That seems reasonable to me. It would be important to document clearly
in postgresql.conf and on the docs for the option that enabling this
option can "launder" data corruption, so that blocks that we suspected
were damaged are marked clean on rewrite. So long as that's clearly
documented I'm personally quite comfortable with your suggestion, since
my focus is just making sure I can get a DB back to a fully operational
state as quickly as possible when that's necessary.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Greg Smith

On 3/4/13 3:13 PM, Heikki Linnakangas wrote:

This PostgreSQL patch hasn't seen any production use, either. In fact,
I'd consider btrfs to be more mature than this patch. Unless you think
that there will be some major changes for the worse in performance in
btrfs, it's perfectly valid and useful to compare the two.


I think my last message came out with a bit more hostile attitude about 
this than I intended it to; sorry about that.  My problem with this idea 
comes from looking at the history of how Linux has failed to work 
properly before.  The best example I can point at is the one I 
documented at 
http://www.postgresql.org/message-id/4b512d0d.4030...@2ndquadrant.com 
along with this handy pgbench chart: 
http://www.phoronix.com/scan.php?page=article&item=ubuntu_lucid_alpha2&num=3


TPS on pgbench dropped from 1102 to about 110 after a kernel bug fix. 
It was 10X as fast in some kernel versions because fsync wasn't working 
properly.  Kernel filesystem issues have regularly resulted in data not 
being written to disk when it should have been, inflating the results 
accordingly.  "Fake" writes due to lying drives, write barriers that 
only actually work on server-class hardware, write barriers that don't 
work on md volumes, and then this one; it's a recurring pattern.  It's 
not the fault of the kernel developers, it's a hard problem and drive 
manufacturers aren't making it easy for them.


My concern, then, is that if the comparison target is btrfs performance, 
how do we know it's working reliably?  The track record says that bugs 
in this area usually inflate results, compared with a correct 
implementation.  You are certainly right that this checksum code is less 
mature than btrfs; it's just over a year old after all.  I feel quite 
good that it's not benchmarking faster than it really is, especially 
when I can directly measure how the write volume is increasing in the 
worst result.


I can't say that btrfs is slower or faster than it will eventually be 
due to bugs; I can't tell you the right way to tune btrfs for 
PostgreSQL; and I haven't even had anyone asking the question yet. 
Right now, the main thing I know about testing performance on Linux 
kernels new enough to support btrfs is that they're just generally slow 
running PostgreSQL.  See the multiple confirmed regression issues at 
http://www.postgresql.org/message-id/60b572d9298d944580f7d51195dd30804357fa4...@vmbx125.ihostexchange.net 
for example.  That new kernel mess needs to get sorted out too one day. 
 Why does database performance suck on kernel 3.2?  I don't know yet, 
but it doesn't help me get excited about assuming btrfs results will be 
useful.


ZFS was supposed to save everyone from worrying about corruption issues. 
 That didn't work out, I think due to the commercial agenda behind its 
development.  Now we have btrfs coming in some number of years, a 
project still tied more than I would like to Oracle.  I'm not too 
optimistic about that one either.  It doesn't help that now the original 
project lead, Chris Mason, has left there and is working at 
FusionIO--and that company's filesystem plans don't include 
checksumming, either.  (See 
http://www.fusionio.com/blog/under-the-hood-of-the-iomemory-sdk/ for a 
quick intro to what they're doing right now, which includes bypassing 
the Linux filesystem layer with their own flash optimized but POSIX 
compliant directFS)


There is an optimistic future path I can envision where btrfs matures 
quickly and in a way that performs well for PostgreSQL.  Maybe we'll end 
up there, and if that happens everyone can look back and say this was a 
stupid idea.  But there are a lot of other outcomes I see as possible 
here, and in all the rest of them having some checksumming capabilities 
available is a win.


One of the areas PostgreSQL has a solid reputation on is being trusted 
to run as reliably as possible.  All of the deployment trends I'm seeing 
have people moving toward less reliable hardware.  VMs, cloud systems, 
regular drives instead of hardware RAID, etc.  A lot of people badly 
want to leave behind the era of the giant database server, and have a 
lot of replicas running on smaller/cheaper systems instead.  There's a 
useful advocacy win for the project if lower grade hardware can be used 
to hit a target reliability level, with software picking up some of the 
error detection job instead.  Yes, it costs something in terms of future 
maintenance on the codebase, as new features almost invariably do.  If I 
didn't see being able to make noise about the improved reliability of 
PostgreSQL as valuable enough to consider it anyway, I wouldn't even be 
working on this thing.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jim Nasby

On 3/4/13 5:20 PM, Craig Ringer wrote:

On 03/05/2013 04:48 AM, Jeff Davis wrote:

We would still calculate the checksum and print the warning; and then
pass it through the rest of the header checks. If the header checks
pass, then it proceeds. If the header checks fail, and if
zero_damaged_pages is off, then it would still generate an error (as
today).

So: ignore_checksum_failures = on|off ?

That seems reasonable to me. It would be important to document clearly
in postgresql.conf and on the docs for the option that enabling this
option can "launder" data corruption, so that blocks that we suspected
were damaged are marked clean on rewrite. So long as that's clearly
documented I'm personally quite comfortable with your suggestion, since
my focus is just making sure I can get a DB back to a fully operational
state as quickly as possible when that's necessary.


I replied to this somewhere else in the thread when I overlooked Jeff's 
original post, so sorry for the noise... :(

Would it be better to do checksum_logging_level = <valid elog levels>? That 
way someone could set the notification to anything from DEBUG up to PANIC. ISTM the 
default should be ERROR.




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Craig Ringer
On 03/05/2013 08:15 AM, Jim Nasby wrote:

 Would it be better to do checksum_logging_level = <valid elog levels>
 ? That way someone could set the notification to anything from DEBUG
 up to PANIC. ISTM the default should be ERROR. 
That seems nice at first brush, but I don't think it holds up.

All our other log_level parameters control only output. If I saw that
parameter, I would think aah, this is how we control the detail and
verbosity of messages regarding checksum checking and maintenance. I
would be totally astonished if I changed it and it actually affected the
system's data integrity checking and enforcement processes. Logging
control GUCs control what we show to what clients/log files, not what
log statements get executed; they're a filter and don't control the
behaviour of the emitting log point.

Control over whether checksum failures are an error or merely warned
about is reasonable, but I strongly disagree with the idea of making
this seem like it's just a logging parameter.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jim Nasby

On 3/4/13 6:22 PM, Craig Ringer wrote:

On 03/05/2013 08:15 AM, Jim Nasby wrote:


Would it be better to do checksum_logging_level = <valid elog levels>
? That way someone could set the notification to anything from DEBUG
up to PANIC. ISTM the default should be ERROR.

That seems nice at first brush, but I don't think it holds up.

All our other log_level parameters control only output. If I saw that
parameter, I would think "aah, this is how we control the detail and
verbosity of messages regarding checksum checking and maintenance." I
would be totally astonished if I changed it and it actually affected the
system's data integrity checking and enforcement processes. Logging
control GUCs control what we show to what clients/log files, not what
log statements get executed; they're a filter and don't control the
behaviour of the emitting log point.

Control over whether checksum failures are an error or merely warned
about is reasonable, but I strongly disagree with the idea of making
this seem like it's just a logging parameter.


Good point. I thought we actually had precedent for controlling the level that 
something gets logged at, but now that you mention it I guess we don't. And 
this could sure as hell cause confusion.

So yeah, your original idea sounds best.




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Josh Berkus
Heikki,

 Perhaps we should just wait a few years? If we suspect that this becomes
 obsolete in a few years, it's probably better to just wait, than add a
 feature we'll have to keep maintaining. Assuming it gets committed
 today, it's going to take a year or two for 9.3 to get released and all
 the bugs ironed out, anyway.

You are far more optimistic about FS development than I am:

* Windows and OSX are unlikely to ever have usable FS checksums
* BTRFS may be years away from being production-quality for DB server,
and (given the current dev priorities) may *never* be suitable for DB
servers.
* For various reasons, many users may stay with other filesystems, even
on Linux.
* All filesystems have bugs, and the FS may be itself causing the
corruption.
* FS checksums may not catch underlying driver bugs (i.e. better to have
two checks than one if you KNOW something is wrong)

We have people who could use PostgreSQL-level checksums *now* because
they are having data corruption issues *now* and need a tool to help
determine what layer the corruption is occurring at.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Daniel Farina
On Mon, Mar 4, 2013 at 1:22 PM, Heikki Linnakangas
hlinnakan...@vmware.com wrote:
 On 04.03.2013 23:00, Jeff Davis wrote:

 On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote:

 Yeah, fragmentation will certainly hurt some workloads. But how badly,
 and which workloads, and how does that compare with the work that
 PostgreSQL has to do to maintain the checksums? I'd like to see some
 data on those things.


 I think we all would. Btrfs will be a major filesystem in a few years,
 and we should be ready to support it.


 Perhaps we should just wait a few years? If we suspect that this becomes
 obsolete in a few years, it's probably better to just wait, than add a
 feature we'll have to keep maintaining. Assuming it gets committed today,
 it's going to take a year or two for 9.3 to get released and all the bugs
 ironed out, anyway.

Putting aside the not-so-rosy predictions seen elsewhere in this
thread about the availability of a high performance, reliable
checksumming file system available on common platforms, I'd like to
express what benefit this feature will have to me:

Corruption has easily occupied more than one person-month of time last
year for us.  This year to date I've burned two weeks, although
admittedly this was probably the result of statistical clustering.
Other colleagues of mine have probably put in a week or two in
aggregate in this year to date.  The ability to quickly, accurately,
and maybe at some later date proactively find good backups to run
WAL recovery from is one of the biggest strides we can make in the
operation of Postgres.  The especially ugly cases are where the page
header is not corrupt, so full page images can carry along malformed
tuples...basically, when the corruption works its way into the WAL,
we're in much worse shape.  Checksums would hopefully prevent this
case, converting them into corrupt pages that will not be modified.

It would be better yet if I could write tools to find the last-good
version of pages, and so I think tight integration with Postgres will
see a lot of benefits that would be quite difficult and non-portable
when relying on file system checksumming.

You are among the most well-positioned to make assessments of the cost
of the feature, but I thought you might appreciate a perspective of
the benefits, too.  I think they're large, and for me they are the
highest pole in the tent for what makes Postgres stressful to operate
as-is today.  It's a testament to the quality of the programming in
Postgres that Postgres programming error is not the largest problem.

For a sense of reference, I think the next largest operational problem
is the disruption caused by logical backups, e.g. pg_dump, and in
particular its long running transactions and sessions.




Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 14:57 -0600, Jim Nasby wrote:
 I suggest we paint that GUC along the lines of
 checksum_failure_log_level, defaulting to ERROR. That way if someone
 wanted to completely bury the elogs down at, say, DEBUG, they could.

The reason I didn't want to do that is because it's essentially a
recovery feature. A boolean seems more appropriate than a slider.

That's a good point about burying the messages with DEBUG, but I think
it might be slightly over-engineering it. I am willing to change it if
others want it, though.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 23:22 +0200, Heikki Linnakangas wrote:
 On 04.03.2013 23:00, Jeff Davis wrote:
  On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote:
  Yeah, fragmentation will certainly hurt some workloads. But how badly,
  and which workloads, and how does that compare with the work that
  PostgreSQL has to do to maintain the checksums? I'd like to see some
  data on those things.
 
  I think we all would. Btrfs will be a major filesystem in a few years,
  and we should be ready to support it.
 
 Perhaps we should just wait a few years? If we suspect that this becomes 
 obsolete in a few years

I do not expect it to be obsolete, even if btrfs is stable and fast
today.

Consider this hypothetical scenario: what if btrfs performs acceptably
well today, but they tune it away from our needs later and it tanks
performance? Then, when we complain, the btrfs people say for DB
workloads, you should turn off COW, or use ext4 or XFS. And then we say
but we want checksums. And then they tell us that real databases do
their own checksums.

Then what?

I don't think that scenario is very outlandish. Postgres is essentially
a COW system (for tuples), and stacking COW on top of COW does not seem
like a good idea (neither for filesystems nor actual cows). So it may be
within reason for the filesystem folks to say "we're doing the wrong
thing", and then checksums are our problem again. Additionally, I don't
have a lot of faith that linux will address all of our btrfs complaints
(even legitimate ones) in a reasonable amount of time, if ever.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Mon, 2013-03-04 at 23:11 +0200, Heikki Linnakangas wrote:
 Of course not. But if we can get away without checksums in Postgres, 
 that's better, because then we don't need to maintain that feature in 
 Postgres. If the patch gets committed, it's not mission accomplished. 
 There will be discussion and need for further development on things like 
 what to do if you get a checksum failure, patches to extend the 
 checksums to cover things like the clog and other non-data files and so 
 forth. And it's an extra complication that will need to be taken into 
 account when developing other new features; in particular, hint bit 
 updates need to write a WAL record. Even if you have all the current 
 hint bits covered, it's an extra hurdle for future patches that might 
 want to have hint bits in e.g new index access methods.

The example you chose of adding a hint bit is a little overstated -- as
far as I can tell, setting a hint bit follows pretty much the same
pattern as before, except that I renamed the function to
MarkBufferDirtyHint().
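
For illustration, the call pattern would look roughly like this (a
sketch; the exact signature is whatever the patch defines, and the old
function name is given only for orientation):

    /* Setting a hint bit, before and after the rename (illustrative). */
    tuple->t_infomask |= HEAP_XMIN_COMMITTED;   /* set the hint */
    MarkBufferDirtyHint(buffer);   /* formerly SetBufferCommitInfoNeedsSave(buffer) */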

But I agree in general. If complexity can be removed or avoided, that is
a very good thing. But right now, we have no answer to a real problem
that other databases do have an answer for. To me, the benefit is worth
the cost.

We aren't going down an irreversible path by adding checksums. If every
platform has a good checksumming filesystem and there is no demand for
the postgres code any more, we can deprecate it and remove it. But at
least users would have something between now and then.

 The PostgreSQL project would not be depending on it, any more than the 
 project depends on filesystem snapshots for backup purposes, or the OS 
 memory manager for caching.

I don't understand your analogies at all. We have WAL-protected base
backups so that users can get a consistent snapshot without filesystem
snapshots. To follow the analogy, we want postgres checksums so that the
user can be protected without filesystem checksums.

I would agree with you if we could point users somewhere and actually
recommend something and say "what you're doing now is wrong, do X
instead" (though if there is only one such X, we are dependent on it).
But even if we fast forward to three years from now: if someone shows up
saying that XFS gives him the best performance, but wants checksums,
will we really be able to say you are wrong to be using XFS; use
Btrfs?

One of the things I like about postgres is that we don't push a lot of
hard trade-offs on users. Several people (including you) put in effort
recently to support unlogged gist indexes. Are there some huge number of
users there that can't live without unlogged gist indexes? Probably not.
But that is one less thing that potential users have to trade away, and
one less thing to be confused or frustrated about.

I want to get to the point where checksums are the default, and only
advanced users would disable them. If that point comes in the form of
checksumming filesystems that are fast enough and enabled by default on
most of the platforms we support, that's fine with me. But I'm not very
sure that it will happen that way ever, and certainly not soon.

  If btrfs with checksums is 10% slower than ext4 with postgres checksums,
  does that mean we should commit the postgres checksums?
 
 In my opinion, a 10% gain would not be worth it, and we should not 
 commit in that case.
 
  On the other side of the coin, if btrfs with checksums is exactly the
  same speed as ext4 with no postgres checksums (i.e. checksums are free
  if we use btrfs), does that mean postgres checksums should be rejected?
 
 Yes, I think so. I'm sure at least some others will disagree; Greg 
 already made it quite clear that he doesn't care how the performance of 
 this compares with btrfs.

If all paths lead to rejection, what are these tests supposed to
accomplish, exactly?

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-03-04 Thread Jeff Davis
On Sun, 2013-03-03 at 18:05 -0500, Greg Smith wrote:
 = Test 1 - find worst-case overhead for the checksum calculation on write =
 
 This can hit 25% of runtime when you isolate it out.  I'm not sure if 
 the way I'm running this multiple times makes sense yet.  This one is so 
 much slower on my Mac that I can barely see a change at all.
 
 = Test 2 - worst-case overhead for calculating checksum while reading data =
 
 Jeff saw an 18% slowdown, I get 24 to 32%.  This one bothers me because 
 the hit is going to happen during the very common situation where data 
 is shuffling a lot between a larger OS cache and shared_buffers taking a 
 relatively small fraction.  If that issue were cracked, such that 
 shared_buffers could be 50% of RAM, I think the typical real-world 
 impact of this would be easier to take.

I believe that test 1 and test 2 can be improved a little, if there is a
need. Right now we copy the page and then calculate the checksum on the
copy. If we instead calculate as we're copying, I believe it will make
it significantly faster.

I decided against doing that, because it decreased the readability, and
we can always do that later as an optimization. That should mitigate the
case you have in mind, which is a very legitimate concern. I'll wait for
someone to ask for it, though.
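
For what it's worth, a fused loop might look something like this (a
self-contained sketch, not the patch's code; the final 16-bit folding
is only illustrative):

    #include <stdint.h>
    #include <stddef.h>

    /*
     * Sketch: accumulate a Fletcher-style sum in the same pass that
     * copies the page, instead of copying first and checksumming the copy.
     */
    static uint16_t
    copy_and_checksum(uint32_t *dst, const uint32_t *src, size_t nwords)
    {
        uint32_t sum1 = 0, sum2 = 0;
        size_t   i;

        for (i = 0; i < nwords; i++)
        {
            uint32_t w = src[i];

            dst[i] = w;     /* the copy */
            sum1 += w;      /* the checksum, in the same pass */
            sum2 += sum1;
        }
        /* Fold the two running sums down to 16 bits (illustrative only). */
        return (uint16_t) ((sum1 ^ sum2) & 0xFFFF);
    }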

 = Test 3 - worst-case WAL overhead =
 
 This is the really nasty one.  The 10,000,000 rows touched by the SELECT 
 statement here create no WAL in a non-checksum environment.  When 
 checksums are on, 368,513,656 bytes of WAL are written, so about 37 
 bytes per row.

Yeah, nothing we can do about this.

 Right now the whole hint bit mechanism and its overhead are treated as 
 an internal detail that isn't in the regular documentation.  I think 
 committing this sort of checksum patch will require exposing some of the 
 implementation to the user in the documentation, so people can 
 understand what the trouble cases are--either in advance or when trying 
 to puzzle out why they're hitting one of them.

Any particular sections that you think would be good to update?

Thank you for the test results.

Regards,
Jeff Davis






Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Craig Ringer
On 03/02/2013 12:48 AM, Daniel Farina wrote:
 On Sun, Feb 24, 2013 at 10:30 PM, Greg Smith g...@2ndquadrant.com wrote:
 Attached is some bit rot updates to the checksums patches.  The replace-tli
 one still works fine
 I rather badly want this feature, and if the open issues with the
 patch have hit zero, I'm thinking about applying it, shipping it, and
 turning it on.  Given that the heap format has not changed, the main
 affordance I may check for is if I can work in backwards compatibility
 (while not maintaining the checksums, of course) in case of an
 emergency.

Did you get a chance to see whether you can run it in
checksum-validation-and-update-off backward compatible mode? This seems
like an important thing to have working (and tested for) in case of
bugs, performance issues or other unforseen circumstances.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Greg Smith
And here's an updated version of the checksum corruption testing wrapper 
script already.  This includes an additional safety check that you've 
set PGDATA to a location that can be erased.  Presumably no one else 
would like to accidentally do this:


rm -rf /*

Like I just did.

--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


check-check.sh
Description: Bourne shell script



Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Greg Smith

On 12/19/12 6:30 PM, Jeff Davis wrote:

I ran a few tests.
Test 1 - find worst-case overhead for the checksum calculation on write:
Test 2 - worst-case overhead for calculating checksum while reading data
Test 3 - worst-case WAL overhead


What I've done is wrap all of these tests into a shell script that runs 
them 3 times each, with and without checksums.  That includes steps like 
the syncs at the spots where Jeff found they helped improve repeatability.
these manually before and didn't notice enough of a difference to argue 
with any of his results at the time.  Having them packaged up usefully 
means I can try some additional platforms too, and other testers should 
be easily able to take a crack at it.


On the last one, in addition to runtime I directly measure how many 
bytes of WAL are written.  It's 0 in the usual case, where the hint bit 
changes triggered by the first SELECT * FROM foo don't generate any WAL.
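
For reference, one way to measure that from SQL (a sketch assuming the
9.2-era functions pg_current_xlog_insert_location() and
pg_xlog_location_diff()):

    SELECT pg_current_xlog_insert_location();  -- note the location before
    SELECT * FROM foo;                         -- the workload under test
    SELECT pg_current_xlog_insert_location();  -- and after
    -- bytes of WAL generated between the two points, e.g.:
    SELECT pg_xlog_location_diff('1/B000060', '1/B000000');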


Detailed results with both my and Jeff's numbers are in the attached 
spreadsheet.  I did my tests on a Mac writing to SSD, to try and get 
some variety in the test platforms.  The main difference there is that 
Test 1 is much slower on my system, enough so that the slowdown isn't as 
pronounced.


Remember, these are a set of tests designed to magnify the worst case 
here.  I don't feel any of these results make the feature uncommittable. 
 The numbers I'm getting are not significantly different from the ones 
Jeff posted back in December, and those were acceptable to some of the 
early adopter candidates I've been surveying informally.  These numbers 
are amplifying overhead without doing much in the way of real disk I/O, 
which can easily be a lot more expensive than any of this.  I do think 
there needs to be a bit more documentation of the potential downsides to 
checksumming written though, since they are pretty hefty in some situations.


I'm going to get some pgbench results next, to try and put this into a 
more realistic context too.  The numbers for this round break down like 
this:


= Test 1 - find worst-case overhead for the checksum calculation on write =

This can hit 25% of runtime when you isolate it out.  I'm not sure if 
the way I'm running this multiple times makes sense yet.  This one is so 
much slower on my Mac that I can barely see a change at all.


= Test 2 - worst-case overhead for calculating checksum while reading data =

Jeff saw an 18% slowdown, I get 24 to 32%.  This one bothers me because 
the hit is going to happen during the very common situation where data 
is shuffling a lot between a larger OS cache and shared_buffers taking a 
relatively small fraction.  If that issue were cracked, such that 
shared_buffers could be 50% of RAM, I think the typical real-world 
impact of this would be easier to take.


= Test 3 - worst-case WAL overhead =

This is the really nasty one.  The 10,000,000 rows touched by the SELECT 
statement here create no WAL in a non-checksum environment.  When 
checksums are on, 368,513,656 bytes of WAL are written, so about 37 
bytes per row.
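
Roughly, the shape of that worst case (a sketch, not the exact
statements from the script):

    -- Load rows, checkpoint, then force hint-bit setting on every page:
    CREATE TABLE foo AS SELECT generate_series(1, 10000000) AS i;
    CHECKPOINT;
    SELECT * FROM foo;   -- first scan sets hint bits; with checksums on,
                         -- each page dirtied this way must be WAL-logged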


Jeff saw this increase runtime by 135%, going from 1000ms to 2350ms.  My 
multiple runs are jumping around in a way I also don't trust fully yet. 
 But the first and best of the ones I'm seeing goes from 1660ms to 
4013ms, which is a 140% increase.  The others are even worse.  I suspect 
I'm filling a cache that isn't cleared before the second and third run 
are over.  I'll know for sure when I switch back to Linux.


The really nasty case I can see making people really cranky is where 
someone has fsync on, a slowly rotating drive, and then discovers this 
slowing read statements.  There's already a decent share of "why is it 
writing when I do 'SELECT *'?" complaints around the block I/O, which is 
fully asynchronous in a lot of cases.


Right now the whole hint bit mechanism and its overhead are treated as 
an internal detail that isn't in the regular documentation.  I think 
committing this sort of checksum patch will require exposing some of the 
implementation to the user in the documentation, so people can 
understand what the trouble cases are--either in advance or when trying 
to puzzle out why they're hitting one of them.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


ChecksumPerfResults.xls
Description: MS-Excel spreadsheet
# Configuration for testing checksum performance.
#
# Primary goals:
# -Reasonable buffer size and checkpoint parameters
# -Prevent interference from the bgwriter or autovacuum.
# -Turn off fsync so that it's measuring the calculation
#  overhead, not the effort of actually writing to disk.
#
autovacuum = off
fsync = off
bgwriter_lru_maxpages = 0
shared_buffers = 1024MB
checkpoint_segments = 64



perf-checksum.sh
Description: Bourne shell script
--
-- Convert hex value to a decimal one.  It's possible to do this using
-- undocumented features of the bit type, such as:
--
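-- For illustration (one such bit-type cast trick; an assumed example,
-- not taken from the original script):
SELECT x'ff'::int;   -- 255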

Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Greg Smith

On 3/3/13 9:22 AM, Craig Ringer wrote:

Did you get a chance to see whether you can run it in
checksum-validation-and-update-off backward compatible mode? This seems
like an important thing to have working (and tested for) in case of
bugs, performance issues or other unforseen circumstances.


There isn't any way to do this in the current code.  The big 
simplification Jeff introduced here, to narrow complexity toward a 
commit candidate, was to make checksumming a cluster-level decision. 
You get it for everything or not at all.


The problem I posted about earlier today, where a header checksum error 
can block access to the entire relation, could be resolved with some 
sort of "ignore read checksums" GUC.  But that's impractical right now 
for the write side of things.  There have been a long list of metadata 
proposals to handle situations where part of a cluster is checksummed, 
but not all of it.  Once that sort of feature is implemented, it becomes 
a lot easier to talk about selectively disabling writes.


As for a design of a GUC that might be useful here, the option itself 
strikes me as being like archive_mode in its general use.  There is an 
element of parameters like wal_sync_method or enable_cassert though, 
where the options available vary depending on how you built the cluster. 
 Maybe name it checksum_level with options like this:


off:  only valid option if you didn't enable checksums with initdb
enforcing:  full checksum behavior as written right now.
unvalidated:  broken checksums on reads are ignored.

The main tricky case I see in that is where you read in a page with a 
busted checksum using "unvalidated".  Ideally you wouldn't write such a 
page back out again, because it's going to hide that it's corrupted in 
some way already.  How to enforce that though?  Perhaps "unvalidated" 
could only be allowed in a read-only transaction?
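
In postgresql.conf terms the idea would look something like this (purely
illustrative; no such setting exists today):

    # Hypothetical setting, for illustration only:
    #checksum_level = enforcing    # off | enforcing | unvalidated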


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Craig Ringer
On 03/04/2013 11:18 AM, Greg Smith wrote:
 On 3/3/13 9:22 AM, Craig Ringer wrote:
 Did you get a chance to see whether you can run it in
 checksum-validation-and-update-off backward compatible mode? This seems
 like an important thing to have working (and tested for) in case of
 bugs, performance issues or other unforseen circumstances.

 There isn't any way to do this in the current code.  The big
 simplification Jeff introduced here, to narrow complexity toward a
 commit candidate, was to make checksumming a cluster-level decision.
 You get it for everything or not at all.

 The problem I posted about earlier today, where a header checksum
 error can block access to the entire relation, could be resolved with
 some sort of "ignore read checksums" GUC.  But that's impractical
 right now for the write side of things.  There have been a long list
 of metadata proposals to handle situations where part of a cluster is
 checksummed, but not all of it.  Once that sort of feature is
 implemented, it becomes a lot easier to talk about selectively
 disabling writes.

 As for a design of a GUC that might be useful here, the option itself
 strikes me as being like archive_mode in its general use.  There is an
 element of parameters like wal_sync_method or enable_cassert though,
 where the options available vary depending on how you built the
 cluster.  Maybe name it checksum_level with options like this:

 off:  only valid option if you didn't enable checksums with initdb
 enforcing:  full checksum behavior as written right now.
 unvalidated:  broken checksums on reads are ignored.

 The main tricky case I see in that is where you read in a page with a
 busted checksum using "unvalidated".  Ideally you wouldn't write such
 a page back out again, because it's going to hide that it's corrupted
 in some way already.  How to enforce that though?  Perhaps
 "unvalidated" could only be allowed in a read-only transaction?
That sounds like a really good step for disaster recovery, yes.

I also suspect that at least in the first release it might be desirable
to have an option that essentially says "something's gone horribly wrong
and we no longer want to check or write checksums, we want a
non-checksummed DB that can still read our data from before we turned
checksumming off". Essentially, a way for someone who's trying
checksumming in production after their staging tests worked out OK to
abort and go back to the non-checksummed case without having to do a
full dump and reload.

Given that, I suspect we need a 4th state, like "disabled" or
"unvalidating_writable", where we ignore checksums completely and
maintain the checksum-enabled layout but just write padding to the
checksum fields and don't bother to check them on reading.

My key concern boils down to being able to get someone up and running
quickly and with minimal disruption if something we didn't think of goes
wrong. "Oh, you have to dump and reload your 1TB database before you can
start writing to it again" isn't going to cut it.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Greg Smith

On 3/3/13 10:52 PM, Craig Ringer wrote:

I also suspect that at least in the first release it might be desirable
 to have an option that essentially says "something's gone horribly wrong
 and we no longer want to check or write checksums, we want a
 non-checksummed DB that can still read our data from before we turned
 checksumming off".


I see that as being something that involves disabling the cluster-wide 
flag that turns checksumming on, the one that is reported by 
pg_controldata.  I think it would have to be a one-way, system down kind 
of change, which I think is fair given the ugly (but feasible) situation 
you're describing.  It would need to be something stronger than a GUC. 
Once you start writing out pages without checksums, you're back into the 
fuzzy state where some pages have them, others don't, and there's no 
good way to deal with that yet.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Craig Ringer
On 03/04/2013 12:19 PM, Greg Smith wrote:
 On 3/3/13 10:52 PM, Craig Ringer wrote:
 I also suspect that at least in the first release it might be desirable
 to have an option that essentially says "something's gone horribly wrong
 and we no longer want to check or write checksums, we want a
 non-checksummed DB that can still read our data from before we turned
 checksumming off".

 I see that as being something that involves disabling the cluster-wide
 flag that turns checksumming on, the one that is reported by
 pg_controldata.  I think it would have to be a one-way, system down
 kind of change, which I think is fair given the ugly (but feasible)
 situation you're describing.  It would need to be something stronger
 than a GUC. Once you start writing out pages without checksums, you're
 back into the fuzzy state where some pages have them, others don't,
 and there's no good way to deal with that yet.

Agreed, I was envisioning a one-way process where re-enabling checksums
would involve a re-initdb and reload. A DB restart seems perfectly
reasonable; it's just a full dump and reload before they can get running
again that I feel must be avoided.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services





Re: [HACKERS] Enabling Checksums

2013-03-03 Thread Simon Riggs
On 3 March 2013 18:24, Greg Smith g...@2ndquadrant.com wrote:

 The 16-bit checksum feature seems functional, with two sources of overhead.
 There's some CPU time burned to compute checksums when pages enter the
 system.  And there's extra overhead for WAL logging hint bits.  I'll
 quantify both of those better in another message.

It's crunch time. Do you and Jeff believe this patch should be
committed to Postgres core?

Are there objectors?

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-03-01 Thread Daniel Farina
On Sun, Feb 24, 2013 at 10:30 PM, Greg Smith g...@2ndquadrant.com wrote:
 Attached are some bit rot updates to the checksums patches.  The replace-tli
 one still works fine

I rather badly want this feature, and if the open issues with the
patch have hit zero, I'm thinking about applying it, shipping it, and
turning it on.  Given that the heap format has not changed, the main
affordance I may check for is whether I can work in backwards
compatibility (while not maintaining the checksums, of course) in case
of an emergency.

-- 
fdr




Re: [HACKERS] Enabling Checksums

2013-01-28 Thread Robert Haas
On Sun, Jan 27, 2013 at 5:28 PM, Jeff Davis pg...@j-davis.com wrote:
 There's a maximum of one FPI per page per cycle, and we need the FPI for
 any modified page in this design regardless.

 So, deferring the XLOG_HINT WAL record doesn't change the total number
 of FPIs emitted. The only savings would be on the trivial XLOG_HINT wal
 record itself, because we might notice that it's not necessary in the
 case where some other WAL action happened to the page.

OK, I see.  So the case where this really hurts is where a page is
updated for hint bits only and then not touched again for the
remainder of the checkpoint cycle.

  At first glance, it seems sound as long as the WAL FPI makes it to disk
  before the data. But to meet that requirement, it seems like we'd need
  to write an FPI and then immediately flush WAL before cleaning a page,
  and that doesn't seem like a win. Do you (or Simon) see an opportunity
  here that I'm missing?

 I am not sure that isn't a win.  After all, we can need to flush WAL
 before flushing a buffer anyway, so this is just adding another case -

 Right, but if we get the WAL record in earlier, there is a greater
 chance that it goes out with some unrelated WAL flush, and we don't need
 to flush the WAL to clean the buffer at all. Separating WAL insertions
 from WAL flushes seems like a fairly important goal, so I'm a little
 skeptical of a proposal to narrow that gap so drastically.

 It's hard to analyze without a specific proposal on the table. But if
 cleaning pages requires a WAL record followed immediately by a flush, it
 seems like that would increase the number of actual WAL flushes we need
 to do by a lot.

Yeah, maybe.  I think Simon had a good argument for not pursuing this
route, anyway.

 and the payoff is that if the initial access to a page, setting hint
 bits, is quickly followed by a write operation, we avoid the need for
 any extra WAL to cover the hint bit change.  I bet that's common,
 because if updating, you'll usually need to look at the tuples on the
 page and decide whether they are visible to your scan before, say,
 updating one of them.

 That's a good point, I'm just not sure how to avoid that problem without a
 lot of complexity or a big cost. It seems like we want to defer the
 XLOG_HINT WAL record for a short time; but not wait so long that we need
 to clean the buffer or miss a chance to piggyback on another WAL flush.

  By the way, the approach I took was to add the heap buffer to the WAL
  chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible.
  It seemed simpler to understand than trying to add a bunch of options to
  MarkBufferDirty.

 Unless I am mistaken, that's going to heavily penalize the case where
 the user vacuums an insert-only table.  It will emit much more WAL
 than currently.

 Yes, that's true, but I think that's pretty fundamental to this
 checksums design (and of course it only applies if checksums are
 enabled). We need to make sure an FPI is written and the LSN bumped
 before we write a page.

 That's why I was pushing a little on various proposals to either remove
 or mitigate the impact of hint bits (load path, remove PD_ALL_VISIBLE,
 cut down on the less-important hint bits, etc.). Maybe those aren't
 viable, but that's why I spent time on them.

 There are some other options, but I cringe a little bit thinking about
 them. One is to simply exclude the PD_ALL_VISIBLE bit from the checksum
 calculation, so that a torn page doesn't cause a problem (though
 obviously that one bit would be vulnerable to corruption). Another is to
 use a double-write buffer, but that didn't seem to go very far. Or, we
 could abandon the whole thing and tell people to use ZFS/btrfs/NAS/SAN.

I am inclined to think that we shouldn't do any of this stuff for now.
 I think it's OK if the first version of checksums is
not-that-flexible and/or not-that-performant.  We can optimize for
those things later.  Trying to monkey with this at the same time we're
trying to get checksums in risks diverting focus from getting
checksums done at all, and also risks introducing new data corruption
bugs.  We have a reputation of long standing for getting it right
first and then getting it to perform well later, so it shouldn't be a
total shock if we take that approach here, too.  I see no reason to
think that the performance problems must be solved up front or not at
all.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Enabling Checksums

2013-01-27 Thread Simon Riggs
On 25 January 2013 20:29, Robert Haas robertmh...@gmail.com wrote:

 The checksums patch also introduces another behavior into
 SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record
 if checksums are enabled (to avoid torn page hazards). That's only
 necessary for changes where the caller does not write WAL itself and
 doesn't bump the LSN of the data page. (There's a reason the caller
 can't easily write the XLOG_HINT WAL itself.) So, we could introduce
 another flag needsWAL that would control whether we write the
 XLOG_HINT WAL or not (only applies with checksums on, of course).

 I thought Simon had the idea, at some stage, of writing a WAL record
 to cover hint-bit changes only at the time we *write* the buffer and
 only if no FPI had already been emitted that checkpoint cycle.  I'm
 not sure whether that approach was sound, but if so it seems more
 efficient than this approach.

The requirement is that we ensure that a FPI is written to WAL before
any changes to the block are made.

The patch does that by inserting an XLOG_HINT_WAL record when we set a
hint. The insert is a no-op if we've already written the FPI in this
checkpoint cycle and we don't even reach there except when dirtying a
clean data block.
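
Put another way, the test reduces to comparing the page LSN against the
redo pointer of the running checkpoint. A toy version of that check
(Python, hypothetical names, LSNs as plain integers):

def xlog_hint_record_needed(page_lsn, redo_ptr):
    # If the page LSN is past the checkpoint's redo pointer, some WAL
    # record since the checkpoint already carried an FPI for this page,
    # and the XLOG_HINT insert can be skipped.
    return page_lsn <= redo_ptr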

If we attempted to defer the FPI last thing before write, we'd need to
cope with the case that writes at checkpoint occur after the logical
start of the checkpoint, and also with the overhead of additional
writes at checkpoint time.

I don't see any advantage in deferring the FPI, but I do see
disadvantage in complicating this.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-01-27 Thread Robert Haas
On Sun, Jan 27, 2013 at 3:50 AM, Simon Riggs si...@2ndquadrant.com wrote:
 If we attempted to defer the FPI last thing before write, we'd need to
 cope with the case that writes at checkpoint occur after the logical
 start of the checkpoint, and also with the overhead of additional
 writes at checkpoint time.

Oh, good point.  That's surely a good reason not to do it that way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Enabling Checksums

2013-01-27 Thread Jeff Davis
On Sat, 2013-01-26 at 23:23 -0500, Robert Haas wrote:
  If we were to try to defer writing the WAL until the page was being
  written, the most it would possibly save is the small XLOG_HINT WAL
  record; it would not save any FPIs.
 
 How is the XLOG_HINT_WAL record kept small and why does it not itself
 require an FPI?

There's a maximum of one FPI per page per cycle, and we need the FPI for
any modified page in this design regardless.

So, deferring the XLOG_HINT WAL record doesn't change the total number
of FPIs emitted. The only savings would be on the trivial XLOG_HINT wal
record itself, because we might notice that it's not necessary in the
case where some other WAL action happened to the page.

  At first glance, it seems sound as long as the WAL FPI makes it to disk
  before the data. But to meet that requirement, it seems like we'd need
  to write an FPI and then immediately flush WAL before cleaning a page,
  and that doesn't seem like a win. Do you (or Simon) see an opportunity
  here that I'm missing?
 
 I am not sure that isn't a win.  After all, we can need to flush WAL
 before flushing a buffer anyway, so this is just adding another case -

Right, but if we get the WAL record in earlier, there is a greater
chance that it goes out with some unrelated WAL flush, and we don't need
to flush the WAL to clean the buffer at all. Separating WAL insertions
from WAL flushes seems like a fairly important goal, so I'm a little
skeptical of a proposal to narrow that gap so drastically.

It's hard to analyze without a specific proposal on the table. But if
cleaning pages requires a WAL record followed immediately by a flush, it
seems like that would increase the number of actual WAL flushes we need
to do by a lot.

 and the payoff is that if the initial access to a page, setting hint
 bits, is quickly followed by a write operation, we avoid the need for
 any extra WAL to cover the hint bit change.  I bet that's common,
 because if updating, you'll usually need to look at the tuples on the
 page and decide whether they are visible to your scan before, say,
 updating one of them.

That's a good point, I'm just not sure how to avoid that problem without a
lot of complexity or a big cost. It seems like we want to defer the
XLOG_HINT WAL record for a short time; but not wait so long that we need
to clean the buffer or miss a chance to piggyback on another WAL flush.

  By the way, the approach I took was to add the heap buffer to the WAL
  chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible.
  It seemed simpler to understand than trying to add a bunch of options to
  MarkBufferDirty.
 
 Unless I am mistaken, that's going to heavily penalize the case where
 the user vacuums an insert-only table.  It will emit much more WAL
 than currently.

Yes, that's true, but I think that's pretty fundamental to this
checksums design (and of course it only applies if checksums are
enabled). We need to make sure an FPI is written and the LSN bumped
before we write a page.

That's why I was pushing a little on various proposals to either remove
or mitigate the impact of hint bits (load path, remove PD_ALL_VISIBLE,
cut down on the less-important hint bits, etc.). Maybe those aren't
viable, but that's why I spent time on them.

There are some other options, but I cringe a little bit thinking about
them. One is to simply exclude the PD_ALL_VISIBLE bit from the checksum
calculation, so that a torn page doesn't cause a problem (though
obviously that one bit would be vulnerable to corruption). Another is to
use a double-write buffer, but that didn't seem to go very far. Or, we
could abandon the whole thing and tell people to use ZFS/btrfs/NAS/SAN.
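
For what it's worth, the first of those options amounts to masking the
bit out on both the compute and verify sides. A toy illustration
(Python, zlib.crc32 as a stand-in checksum, bit value and offset
hypothetical):

import zlib

PD_ALL_VISIBLE = 0x04   # hypothetical flag bit
PD_FLAGS_OFFSET = 10    # hypothetical offset of pd_flags in the header

def checksum_ignoring_all_visible(page):
    buf = bytearray(page)
    # zero the bit before checksumming, so a torn write that flips only
    # this bit still verifies (at the cost of not protecting the bit)
    buf[PD_FLAGS_OFFSET] &= ~PD_ALL_VISIBLE
    return zlib.crc32(bytes(buf)) & 0xffff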

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-01-26 Thread Robert Haas
On Fri, Jan 25, 2013 at 9:35 PM, Jeff Davis pg...@j-davis.com wrote:
 On Fri, 2013-01-25 at 15:29 -0500, Robert Haas wrote:
 I thought Simon had the idea, at some stage, of writing a WAL record
 to cover hint-bit changes only at the time we *write* the buffer and
 only if no FPI had already been emitted that checkpoint cycle.  I'm
 not sure whether that approach was sound, but if so it seems more
 efficient than this approach.

 My patch is based on his original idea; although I've made quite a lot
 of changes, I believe that I have stuck to his same basic design w.r.t.
 WAL.

 This patch does not cause a new FPI to be emitted if one has already
 been emitted this cycle. It also does not emit a WAL record at all if an
 FPI has already been emitted.

 If we were to try to defer writing the WAL until the page was being
 written, the most it would possibly save is the small XLOG_HINT WAL
 record; it would not save any FPIs.

How is the XLOG_HINT_WAL record kept small and why does it not itself
require an FPI?

 At first glance, it seems sound as long as the WAL FPI makes it to disk
 before the data. But to meet that requirement, it seems like we'd need
 to write an FPI and then immediately flush WAL before cleaning a page,
 and that doesn't seem like a win. Do you (or Simon) see an opportunity
 here that I'm missing?

I am not sure that isn't a win.  After all, we can need to flush WAL
before flushing a buffer anyway, so this is just adding another case -
and the payoff is that if the initial access to a page, setting hint
bits, is quickly followed by a write operation, we avoid the need for
any extra WAL to cover the hint bit change.  I bet that's common,
because if updating, you'll usually need to look at the tuples on the
page and decide whether they are visible to your scan before, say,
updating one of them.

 By the way, the approach I took was to add the heap buffer to the WAL
 chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible.
 It seemed simpler to understand than trying to add a bunch of options to
 MarkBufferDirty.

Unless I am mistaken, that's going to heavily penalize the case where
the user vacuums an insert-only table.  It will emit much more WAL
than currently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Enabling Checksums

2013-01-25 Thread Robert Haas
On Thu, Jan 10, 2013 at 1:06 AM, Jeff Davis pg...@j-davis.com wrote:
 On Tue, 2012-12-04 at 01:03 -0800, Jeff Davis wrote:
 For now, I rebased the patches against master, and did some very minor
 cleanup.

 I think there is a problem here when setting PD_ALL_VISIBLE. I thought I
 had analyzed that before, but upon review, it doesn't look right.
 Setting PD_ALL_VISIBLE needs to be associated with a WAL action somehow,
 and a bumping of the LSN, otherwise there is a torn page hazard.

 The solution doesn't seem particularly difficult, but there are a few
 strange aspects and I'm not sure exactly which path I should take.

 First of all, the relationship between MarkBufferDirty and
 SetBufferCommitInfoNeedsSave is a little confusing. The comment over
 MarkBufferDirty is confusing because it says that the caller must have
 an exclusive lock, or else bad data could be written. But that doesn't
 have to do with marking the buffer dirty, that has to do with the data
 page change you make while you are marking it dirty -- if it's a single
 bit change, then there is no risk that I can see.

 In the current code, the only real code difference between the two is
 that SetBufferCommitInfoNeedsSave might fail to mark the buffer dirty if
 there is a race. So, in the current code, we could actually combine the
 two by passing a force flag (if true, behaves like MarkBufferDirty, if
 false, behaves like SetBufferCommitInfoNeedsSave).

 The checksums patch also introduces another behavior into
 SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record
 if checksums are enabled (to avoid torn page hazards). That's only
 necessary for changes where the caller does not write WAL itself and
 doesn't bump the LSN of the data page. (There's a reason the caller
 can't easily write the XLOG_HINT WAL itself.) So, we could introduce
 another flag needsWAL that would control whether we write the
 XLOG_HINT WAL or not (only applies with checksums on, of course).

I thought Simon had the idea, at some stage, of writing a WAL record
to cover hint-bit changes only at the time we *write* the buffer and
only if no FPI had already been emitted that checkpoint cycle.  I'm
not sure whether that approach was sound, but if so it seems more
efficient than this approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Enabling Checksums

2013-01-25 Thread Jeff Davis
On Fri, 2013-01-25 at 15:29 -0500, Robert Haas wrote:
 I thought Simon had the idea, at some stage, of writing a WAL record
 to cover hint-bit changes only at the time we *write* the buffer and
 only if no FPI had already been emitted that checkpoint cycle.  I'm
 not sure whether that approach was sound, but if so it seems more
 efficient than this approach.

My patch is based on his original idea; although I've made quite a lot
of changes, I believe that I have stuck to his same basic design w.r.t.
WAL.

This patch does not cause a new FPI to be emitted if one has already
been emitted this cycle. It also does not emit a WAL record at all if an
FPI has already been emitted.

If we were to try to defer writing the WAL until the page was being
written, the most it would possibly save is the small XLOG_HINT WAL
record; it would not save any FPIs.

At first glance, it seems sound as long as the WAL FPI makes it to disk
before the data. But to meet that requirement, it seems like we'd need
to write an FPI and then immediately flush WAL before cleaning a page,
and that doesn't seem like a win. Do you (or Simon) see an opportunity
here that I'm missing?

By the way, the approach I took was to add the heap buffer to the WAL
chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible.
It seemed simpler to understand than trying to add a bunch of options to
MarkBufferDirty.

Regards,
Jeff Davis






Re: [HACKERS] Enabling Checksums

2013-01-24 Thread Jeff Davis
On Wed, 2013-01-16 at 17:38 -0800, Jeff Davis wrote:
 New version of checksums patch.

And another new version of both patches.

Changes:
 * Rebased.
 * Rename SetBufferCommitInfoNeedsSave to MarkBufferDirtyHint. Now that
it's being used more places, it makes sense to give it a more generic
name.
 * My colleague, Yingjie He, noticed that the FSM doesn't write any WAL,
and therefore we must protect those operations against torn pages. That
seems simple enough: just use MarkBufferDirtyHint (formerly
SetBufferCommitInfoNeedsSave) instead of MarkBufferDirty. The FSM
changes are not critical, so the fact that we may lose the dirty bit is
OK.

Regards,
Jeff Davis


replace-tli-with-checksums-20130124.patch.gz


checksums-20130124.patch.gz



Re: [HACKERS] Enabling Checksums

2013-01-16 Thread Jeff Davis
New version of checksums patch.

Changes:
  * rebased
  * removed two duplicate lines; apparently the result of a bad merge
  * Added heap page to WAL chain when logging an XLOG_HEAP2_VISIBLE to
avoid torn page issues updating PD_ALL_VISIBLE. This is the most
significant change.
  * minor comment cleanup

No open issues that I'm aware of with the patch itself.

Greg appears to have made some progress on the automated corruption
tester.

Note to reviewers: I also have a patch out to remove PD_ALL_VISIBLE
entirely. The code around PD_ALL_VISIBLE is quite tricky (with or
without this patch), so if the PD_ALL_VISIBLE patch is committed first
then it will make reviewing this patch easier. Regardless, the second
patch to be committed will need to be rebased on top of the first.

Regards,
Jeff Davis


replace-tli-with-checksums-20130116.patch.gz


checksums-20130116.patch.gz



Re: [HACKERS] Enabling Checksums

2013-01-16 Thread Jeff Davis
On Tue, 2013-01-15 at 19:36 -0500, Greg Smith wrote:
 First rev of a simple corruption program is attached, in very C-ish 
 Python.

Great. Did you verify that my patch works as you expect at least in the
simple case?

  The parameters I settled on are to accept a relation name, byte 
 offset, byte value, and what sort of operation to do:  overwrite, AND, 
 OR, XOR.  I like XOR here because you can fix it just by running the 
 program again.

Oh, good idea.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-01-15 Thread Greg Smith
First rev of a simple corruption program is attached, in very C-ish 
Python.  The parameters I settled on are to accept a relation name, byte 
offset, byte value, and what sort of operation to do:  overwrite, AND, 
OR, XOR.  I like XOR here because you can fix it just by running the 
program again.  Rewriting this in C would not be terribly difficult, and 
most of the time spent on this version was figuring out what to do.
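
The XOR mode is its own inverse, which is why a second identical run
repairs the damage:

original = 0x80
mask = 0x01
corrupted = original ^ mask      # 0x81, as in the session below
restored = corrupted ^ mask      # back to 0x80
assert restored == original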


This follows Jeff's idea that the most subtle corruption is the hardest 
to spot, so testing should aim at the smallest unit of change.  If you 
can spot a one bit error in an unused byte of a page, presumably that 
will catch large errors like a byte swap.  I find some grim amusement 
that the checksum performance testing I've been trying to do got stuck 
behind a problem with a tiny, hard to detect single bit of corruption.


Here's pgbench_accounts being corrupted, the next to last byte on this line:

$ pgbench -i -s 1
$ ./pg_corrupt pgbench_accounts show
Reading byte 0 within file /usr/local/var/postgres/base/16384/25242
Current byte= 0 / $00
$ hexdump /usr/local/var/postgres/base/16384/25242 | head
000 00 00 00 00 00 00 00 00 00 00 04 00 0c 01 80 01
...
$ ./pg_corrupt pgbench_accounts 14 1
/usr/local/var/postgres base/16384/25242 8192 13434880 1640
Reading byte 14 within file /usr/local/var/postgres/base/16384/25242
Current byte= 128 / $80
Modified byte= 129 / $81
File modified successfully
$ hexdump /usr/local/var/postgres/base/16384/25242 | head
000 00 00 00 00 00 00 00 00 00 00 04 00 0c 01 81 01

That doesn't impact selecting all of the rows:

$ psql -c "select count(*) from pgbench_accounts"
 count
--------
 100000

And pg_dump works fine against the table too.  Tweaking this byte looks 
like a reasonable first test case for seeing if checksums can catch an 
error that query execution doesn't.


Next I'm going to test the functional part of the latest checksum patch; 
duplicate Jeff's targeted performance tests; and then run some of my 
own.  I wanted to get this little tool circulating now that it's useful 
first.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
#!/usr/bin/env python
#
# pg_corrupt
#
# Read in a byte of a PostgreSQL relation (a table or index) and allow writing
# it back with an altered value.
#
# Greg Smith g...@2ndquadrant.com
# Copyright (c) 2013, Heroku, Inc.
# Released under The PostgreSQL Licence
#

from os.path import join
from subprocess import Popen,PIPE
import sys
import psycopg2

class Operations:
    SHOW = 0
    WRITE = 1
    AND = 2
    OR = 3
    XOR = 4
    text = ['show', 'write', 'and', 'or', 'xor']

def controldata_blocks_per_segment(pgdata):
    # Default for a standard build: 1 GB segments of 8 KB blocks
    blocks_per_seg = 131072
    try:
        # TODO This doesn't work when called in an Emacs compile shell
        out, err = Popen("pg_controldata %s" % pgdata, stdout=PIPE,
                         shell=True).communicate()
        control_data = out.splitlines()
        for c in control_data:
            if c.startswith("Blocks per segment of large relation:"):
                blocks_per_seg = int(c.split(":")[1])
    except:
        print "Cannot determine blocks per segment, using default of", blocks_per_seg
    return blocks_per_seg

def get_table_info(conn, relation):
    cur = conn.cursor()
    q = """SELECT
           current_setting('data_directory') AS data_directory,
           pg_relation_filepath(oid),
           current_setting('block_size') AS block_size,
           pg_relation_size(oid),
           relpages
           FROM pg_class
           WHERE relname='%s'""" % relation
    cur.execute(q)
    if cur.rowcount != 1:
        print "Error: did not return 1 row from pg_class lookup of %s" % relation
        return None
    table_info = {}

    for i in cur:
        table_info['relation'] = relation
        table_info['pgdata'] = i[0]
        table_info['filepath'] = i[1]
        table_info['block_size'] = int(i[2])
        table_info['relation_size'] = i[3]
        table_info['relpages'] = i[4]
        table_info['base_file_name'] = join(i[0], i[1])
        table_info['blocks_per_seg'] = \
            controldata_blocks_per_segment(table_info['pgdata'])
        table_info['bytes_per_seg'] = table_info['block_size'] * \
            table_info['blocks_per_seg']

    cur.close()
    return table_info

def operate(table_info, byte_offset, operation, value):
    if byte_offset > table_info['relation_size']:
        print "Error:  trying to change byte %s but relation %s is only %s bytes" % \
            (byte_offset, table_info['relation'], table_info['relation_size'])
        return

    if byte_offset < 0:
        print "Error:  cannot use negative byte offsets"
        return

    # Relations larger than one segment are split across numbered files
    file_name = table_info['base_file_name']
    file_seq = int(byte_offset / table_info['bytes_per_seg'])
    if file_seq > 0:
        file_name = file_name + ".%s" % file_seq
        file_offset = byte_offset - file_seq * table_info['bytes_per_seg']
    else:
        file_offset = byte_offset

    print "Reading byte", file_offset, "within file", file_name

Re: [HACKERS] Enabling Checksums

2013-01-12 Thread Greg Smith

On 12/19/12 6:30 PM, Jeff Davis wrote:

The idea is to prevent interference from the bgwriter or autovacuum.
Also, I turn off fsync so that it's measuring the calculation overhead,
not the effort of actually writing to disk.


With my test server issues sorted, what I did was set up a single 7200RPM 
drive with a battery-backed write cache card.  That way fsync doesn't 
bottleneck things.  And I too realized that limit had to be cracked 
before anything useful could be done.  Having the BBWC card is a bit 
better than fsync=off, because we'll get something more like the 
production workload out of it. I/O will be realistic, but limited to 
what one drive can pull off.



Without checksums, it takes about 1000ms. With checksums, about 2350ms.
I also tested with checksums but without the CHECKPOINT commands above,
and it was also 1000ms.


I think we need to use lower checkpoint_segments to try and trigger more 
checkpoints.  My 10 minute pgbench-tools runs will normally have at most 
3 checkpoints.  I would think something like 10 would be more useful, to 
make sure we're spending enough time seeing extra WAL writes.



This test is more plausible than the other two, so it's more likely to
be a real problem. So, the biggest cost of checksums is, by far, the
extra full-page images in WAL, which matches our expectations.


What I've done with pgbench-tools is actually measure the amount of WAL 
from the start to the end of the test run.  To analyze it you need to 
scale it a bit; computing wal bytes / commit seems to work.
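
For reference, here's roughly how that per-run measurement can be
reproduced by hand with psycopg2 and the 9.2-era WAL location functions
(run_workload() is a made-up placeholder for the benchmark itself):

import psycopg2

conn = psycopg2.connect(dbname='postgres')
conn.autocommit = True
cur = conn.cursor()

def run_workload():
    # placeholder for the actual benchmark; returns number of commits
    cur.execute("SELECT 1")
    return 1

cur.execute("SELECT pg_current_xlog_location()")
start = cur.fetchone()[0]

n_commits = run_workload()

cur.execute("SELECT pg_xlog_location_diff(pg_current_xlog_location(), %s)",
            (start,))
wal_bytes = cur.fetchone()[0]
print(wal_bytes / float(n_commits))   # WAL bytes per commit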


pgbench-tools also launches vmstat and iostat in a way that it's 
possible to graph the values later.  The interesting results I'm seeing 
are when the disk is about 80% busy and when it's 100% busy.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2013-01-10 Thread Simon Riggs
On 10 January 2013 06:06, Jeff Davis pg...@j-davis.com wrote:

 The checksums patch also introduces another behavior into
 SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record
 if checksums are enabled (to avoid torn page hazards). That's only
 necessary for changes where the caller does not write WAL itself and
 doesn't bump the LSN of the data page. (There's a reason the caller
 can't easily write the XLOG_HINT WAL itself.) So, we could introduce
 another flag needsWAL that would control whether we write the
 XLOG_HINT WAL or not (only applies with checksums on, of course).

That wouldn't work because it can't know the exact answer to that, but
the way the patch does this is already correct.

XLOG_HINT_WAL doesn't always write a WAL record; it only does so when
necessary. See XLogInsert().

Didn't fully understand other comments. Do we need an answer now?
My head is somewhere else.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2013-01-10 Thread Jeff Davis
  The checksums patch also introduces another behavior into
  SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record
  if checksums are enabled (to avoid torn page hazards). That's only
  necessary for changes where the caller does not write WAL itself and
  doesn't bump the LSN of the data page. (There's a reason the caller
  can't easily write the XLOG_HINT WAL itself.) So, we could introduce
  another flag needsWAL that would control whether we write the
  XLOG_HINT WAL or not (only applies with checksums on, of course).
 
 That wouldn't work because it can't know the exact answer to that, but
 the way the patch does this is already correct.

The name I chose was poor, but the flag should mean "the caller does not
write WAL associated with this action". If that flag is true, and if
checksums are enabled, then it would do an XLogInsert, which may write
WAL (depending on the LSN check).

That part of the patch is correct currently, but the problem is with
updates to PD_ALL_VISIBLE. Let me try to explain again:

Calls to PageSetAllVisible are not directly associated with a WAL
action, but they are associated with a call to MarkBufferDirty and do
have an exclusive content lock on the buffer. There's a torn page hazard
there for checksums, because without any WAL action associated with the
data page, there is no backup page.
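
To spell the hazard out: an 8K page spans several disk sectors, so a
crash can persist the sector holding the new checksum without the sector
holding the changed bit, and verification then fails even though only
one bit was logically changed. A toy demonstration (Python, zlib.crc32
as a stand-in, 4K sectors assumed; the real patch truncates to 16 bits):

import zlib

PAGE = 8192
SECTOR = 4096

old = bytearray(b'\xaa' * PAGE)    # page contents before the change
new = bytearray(old)
new[5000] |= 0x01                  # hint-style single-bit change
stored = zlib.crc32(bytes(new))    # checksum written into the header

# torn write: the first sector (with the new header/checksum) reaches
# disk, but the sector containing the changed bit does not
torn = new[:SECTOR] + old[SECTOR:]
assert zlib.crc32(bytes(torn)) != stored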

One idea might be to use SetBufferCommitInfoNeedsSave (which will write
WAL if necessary) instead of MarkBufferDirty. But that is unsafe,
because it might not actually mark the buffer dirty due to a race
(documented in SetBufferCommitInfoNeedsSave). So that's why I wanted to
refactor MarkBufferDirty/SetBufferCommitInfoNeedsSave, to separate the
concept that it may need a WAL record from the concept that actually
dirtying the page is optional.

Another idea is to make the WAL action for visibilitymap_set have
another item in the chain pointing to the heap buffer, and bump the heap
LSN.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2013-01-09 Thread Jeff Davis
On Tue, 2012-12-04 at 01:03 -0800, Jeff Davis wrote:
 For now, I rebased the patches against master, and did some very minor
 cleanup.

I think there is a problem here when setting PD_ALL_VISIBLE. I thought I
had analyzed that before, but upon review, it doesn't look right.
Setting PD_ALL_VISIBLE needs to be associated with a WAL action somehow,
and a bumping of the LSN, otherwise there is a torn page hazard.

The solution doesn't seem particularly difficult, but there are a few
strange aspects and I'm not sure exactly which path I should take.

First of all, the relationship between MarkBufferDirty and
SetBufferCommitInfoNeedsSave is a little confusing. The comment over
MarkBufferDirty is confusing because it says that the caller must have
an exclusive lock, or else bad data could be written. But that doesn't
have to do with marking the buffer dirty, that has to do with the data
page change you make while you are marking it dirty -- if it's a single
bit change, then there is no risk that I can see.

In the current code, the only real code difference between the two is
that SetBufferCommitInfoNeedsSave might fail to mark the buffer dirty if
there is a race. So, in the current code, we could actually combine the
two by passing a force flag (if true, behaves like MarkBufferDirty, if
false, behaves like SetBufferCommitInfoNeedsSave).

The checksums patch also introduces another behavior into
SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record
if checksums are enabled (to avoid torn page hazards). That's only
necessary for changes where the caller does not write WAL itself and
doesn't bump the LSN of the data page. (There's a reason the caller
can't easily write the XLOG_HINT WAL itself.) So, we could introduce
another flag needsWAL that would control whether we write the
XLOG_HINT WAL or not (only applies with checksums on, of course).

The reason for all of this is because the setting of PD_ALL_VISIBLE does
not fit MarkBufferDirty, because MarkBufferDirty does not write the
XLOG_HINT WAL and neither does the caller. But it also doesn't fit
SetBufferCommitInfoNeedsSave, because that is subject to a race. If
MarkBufferDirty had the signature:

  MarkBufferDirty(Buffer buffer, bool force, bool needsWAL)

then normal page changes would look like:
  MarkBufferDirty(buffer, true, false)
setting PD_ALL_VISIBLE would look like:
  MarkBufferDirty(buffer, true, true)
and setting a hint would look like:
  MarkBufferDirty(buffer, false, true)
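
A toy model of those three call patterns (Python; the stubs are mine,
not from the patch):

checksums_enabled = True
redo_ptr = 100                       # current checkpoint's redo pointer

def xlog_insert_hint(buf):
    # stub: pretend we inserted XLOG_HINT (with its FPI) and got a new LSN
    return redo_ptr + 1

def try_header_lock(buf):
    # stub: in the real code this can fail under concurrency
    return True

def mark_buffer_dirty(buf, force, needs_wal):
    if needs_wal and checksums_enabled and buf['lsn'] <= redo_ptr:
        buf['lsn'] = xlog_insert_hint(buf)   # at most one FPI per cycle
    if force:
        buf['dirty'] = True                  # MarkBufferDirty: guaranteed
    elif try_header_lock(buf):
        buf['dirty'] = True                  # SetBufferCommitInfoNeedsSave:
                                             # may lose the race, stay clean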


Another approach would be for the caller who sets PD_ALL_VISIBLE to
write WAL. But that requires inventing a new WAL record or chaining the
heap block onto the wal entry when doing visibilitymap_set (only
necessary when checksums are on). That seems somewhat of a hack, but
perhaps it's not too bad.

Also, I have another patch posted that is removing PD_ALL_VISIBLE
entirely, which is dampening my enthusiasm to do too much work that
might be thrown away. So right now, I'm leaning toward just adding the
heap buffer to the WAL chain during visibilitymap_set.

Thoughts?

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2012-12-20 Thread Martijn van Oosterhout
On Tue, Dec 18, 2012 at 04:06:02AM -0500, Greg Smith wrote:
 On 12/18/12 3:17 AM, Simon Riggs wrote:
 Clearly part of the response could involve pg_dump on the damaged
 structure, at some point.
 
 This is the main thing I wanted to try out more, once I have a
 decent corruption generation tool.  If you've corrupted a single
 record but can still pg_dump the remainder, that seems the best we
 can do to help people recover from that.  Providing some
 documentation on how to figure out what rows are in that block,
 presumably by using the contrib inspection tools, would be helpful
 too.

FWIW, Postgres is pretty resilient against corruption. I've maintained
a postgres db on a server with bad memory (don't ask) and since most
scrambling was in text strings you just got funny output sometimes. The
most common failure was a memory allocation failure as postgres tried
to copy a datum whose length field was corrupted.

If things went really wonky you could identify the bad tuples by hand
and then delete them by ctid. Regular reindexing helped too.
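
For reference, that by-hand deletion is just a ctid-qualified DELETE
once the bad tuple has been narrowed down; e.g. with psycopg2 (database,
table name, and ctid are all made up):

import psycopg2

conn = psycopg2.connect(dbname='mydb')     # hypothetical database
cur = conn.cursor()
# ctid located earlier by bisecting which rows error out when read
cur.execute("DELETE FROM damaged_table WHERE ctid = '(1234,5)'")
conn.commit()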

All I'm saying is that a mode where you log a warning but proceed
anyway is useful.  It won't pin down the exact error, but it will tell
you where to look and help find the non-obvious corruption (so you can
possibly fix it by hand).

Have a nice day,
-- 
Martijn van Oosterhout   klep...@svana.org   http://svana.org/kleptog/
 He who writes carelessly confesses thereby at the very outset that he does
 not attach much importance to his own thoughts.
   -- Arthur Schopenhauer




Re: [HACKERS] Enabling Checksums

2012-12-19 Thread Jeff Davis
On Tue, 2012-12-04 at 01:03 -0800, Jeff Davis wrote:
  4. We need some general performance testing to show whether this is
  insane or not.

I ran a few tests.

Test 1 - find worst-case overhead for the checksum calculation on write:

   fsync = off
   bgwriter_lru_maxpages = 0
   shared_buffers = 1024MB
   checkpoint_segments = 64
   autovacuum = off

The idea is to prevent interference from the bgwriter or autovacuum.
Also, I turn off fsync so that it's measuring the calculation overhead,
not the effort of actually writing to disk.

drop table foo;
create table foo(i int, j int) with (fillfactor=50);
create index foo_idx on foo(i);
insert into foo select g%25, -1 from generate_series(1,1000) g;
checkpoint;
-- during the following sleep, issue an OS sync
-- to make test results more consistent
select pg_sleep(30);
\timing on
update foo set j=-1 where i = 0;
select pg_sleep(2);
checkpoint;
update foo set j=-1 where i = 0;
select pg_sleep(2);
checkpoint;
update foo set j=-1 where i = 0;
select pg_sleep(2);
checkpoint;
\timing off

I am measuring the time of the CHECKPOINT command, not the update. The
update is just to dirty all of the pages (they should all be HOT
updates). Without checksums, it takes about 400ms. With checksums, it
takes about 500ms. That overhead is quite low, considering that the
bottleneck is almost always somewhere else (like actually writing to
disk).

Test 2 - worst-case overhead for calculating checksum while reading data

Same configuration as above. This time, just load a big table:

drop table foo;
create table foo(i int, j int) with (fillfactor=50);
insert into foo select g%25, -1 from generate_series(1,1000) g;
-- make sure hint bits and PD_ALL_VISIBLE are set everywhere
select count(*) from foo;
vacuum;
vacuum;
vacuum;
select relfilenode from pg_class where relname='foo';

Then shut down the server and restart it. Then do a "cat
data/base/12055/* > /dev/null" to get the table loaded into the OS
buffer cache. Then do:

\timing on
SELECT COUNT(*) FROM foo;

So, shared buffers are cold, but OS cache is warm. This should test the
overhead of going from the OS to shared buffers, which requires the
checksum calculation. Without checksums is around 820ms; with checksums
around 970ms. Again, this is quite reasonable, because I would expect
the bottleneck to be reading from the disk rather than the calculation
itself.

Test 3 - worst-case WAL overhead

For this test, I also left fsync off, because I didn't want to test the
effort to flush WAL (which shouldn't really be required for this test,
anyway). This was simpler:

  drop table foo;
  create table foo(i int, j int) with (fillfactor=50);
  insert into foo select g%25, -1 from generate_series(1,1000) g;
  checkpoint;
  select pg_sleep(1);
  checkpoint;
  select pg_sleep(30); -- do an OS sync while this is running
  \timing on
  SELECT COUNT(*) FROM foo;

Without checksums, it takes about 1000ms. With checksums, about 2350ms.
I also tested with checksums but without the CHECKPOINT commands above,
and it was also 1000ms.

This test is more plausible than the other two, so it's more likely to
be a real problem. So, the biggest cost of checksums is, by far, the
extra full-page images in WAL, which matches our expectations.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2012-12-18 Thread Simon Riggs
On 18 December 2012 02:21, Jeff Davis pg...@j-davis.com wrote:
 On Mon, 2012-12-17 at 19:14 +, Simon Riggs wrote:
 We'll need a way of expressing some form of corruption tolerance.
 zero_damaged_pages is just insane,

 The main problem I see with zero_damaged_pages is that it could
 potentially write out the zero page, thereby really losing your data if
 it wasn't already lost. (Of course, we document that you should have a
 backup first, but it's still dangerous). I assume that this is the same
 problem you are talking about.

I think we should discuss whether we accept my premise? Checksums will
actually detect more errors than we see now, and people will want to
do something about that. Returning to backup is one way of handling
it, but on a busy production system with pressure on, there is
incentive to implement a workaround, not a fix. It's not an easy call
to say "we've got 3 corrupt blocks, so I'm going to take the whole
system offline while I restore from backup".

If you do restore from backup, and the backup also contains the 3
corrupt blocks, what then?

Clearly part of the response could involve pg_dump on the damaged
structure, at some point.

 I suppose we could have a new ReadBufferMaybe function that would only
 be used by a sequential scan; and then just skip over the page if it's
 corrupt, depending on a GUC. That would at least allow sequential scans
 to (partially) work, which might be good enough for some data recovery
 situations. If a catalog index is corrupted, that could just be rebuilt.
 Haven't thought about the details, though.

Not sure if you're being facetious here or not. Mild reworking of the
logic for heap page access could cope with a NULL buffer response and
subsequent looping, which would allow us to run pg_dump against a
damaged table to allow data to be saved, keeping the file intact for
further analysis.

I'm suggesting we work a little harder than "your block is corrupt"
and give some thought to what the user will do next. Indexes are a
good case, because we can/should report the block error, mark the
index as invalid and then hint that it should be rebuilt.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2012-12-18 Thread Greg Smith

On 12/18/12 3:17 AM, Simon Riggs wrote:

Clearly part of the response could involve pg_dump on the damaged
structure, at some point.


This is the main thing I wanted to try out more, once I have a decent 
corruption generation tool.  If you've corrupted a single record but can 
still pg_dump the remainder, that seems the best we can do to help 
people recover from that.  Providing some documentation on how to figure 
out what rows are in that block, presumably by using the contrib 
inspection tools, would be helpful too.



Indexes are a good case, because we can/should report the block error, mark the
index as invalid and then hint that it should be rebuilt.


Marking a whole index invalid because there's one bad entry has enough 
downsides that I'm not sure how much we'd want to automate that.  Not 
having that index available could easily result in an effectively down 
system due to low performance.  The choices are uglier if it's backing a 
unique constraint.


In general, what I hope people will be able to do is switch over to 
their standby server, and then investigate further.  I think it's 
unlikely that people willing to pay for block checksums will only have 
one server.  Having some way to nail down if the same block is bad on a 
given standby seems like a useful interface we should offer, and it 
shouldn't take too much work.  Ideally you won't find the same 
corruption there.  I'd like a way to check the entirety of a standby for 
checksum issues, ideally run right after it becomes current.  It seems 
the most likely way to see corruption on one of those is to replicate a 
corrupt block.


There is no good way to make the poor soul who has no standby server 
happy here.  You're just choosing between bad alternatives.  The first 
block error is often just that--the first one, to be joined by others 
soon afterward.  My experience at how drives fail says the second error 
is a lot more likely after you've seen one.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2012-12-18 Thread Kevin Grittner
Greg Smith wrote:

 In general, what I hope people will be able to do is switch over to 
 their standby server, and then investigate further. I think it's 
 unlikely that people willing to pay for block checksums will only have 
 one server. Having some way to nail down if the same block is bad on a 
 given standby seems like a useful interface we should offer, and it 
 shouldn't take too much work. Ideally you won't find the same 
 corruption there. I'd like a way to check the entirety of a standby for 
 checksum issues, ideally run right after it becomes current. It seems 
 the most likely way to see corruption on one of those is to replicate a 
 corrupt block.
 
 There is no good way to make the poor soul who has no standby server 
 happy here. You're just choosing between bad alternatives. The first 
 block error is often just that--the first one, to be joined by others 
 soon afterward. My experience at how drives fail says the second error 
 is a lot more likely after you've seen one.

+1 on all of that.

-Kevin




Re: [HACKERS] Enabling Checksums

2012-12-18 Thread Greg Stark
 There is no good way to make the poor soul who has no standby server
 happy here. You're just choosing between bad alternatives. The first
 block error is often just that--the first one, to be joined by others
 soon afterward. My experience at how drives fail says the second error
 is a lot more likely after you've seen one.

For what it's worth, Oracle allows you to recover a specific block from
backups, including replaying the archive logs for that one block.

-- 
greg




Re: [HACKERS] Enabling Checksums

2012-12-18 Thread Jeff Davis
On Tue, 2012-12-18 at 08:17 +, Simon Riggs wrote:
 I think we should discuss whether we accept my premise? Checksums will
 actually detect more errors than we see now, and people will want to
 do something about that. Returning to backup is one way of handling
 it, but on a busy production system with pressure on, there is
 incentive to implement a workaround, not a fix. It's not an easy call
 to say "we've got 3 corrupt blocks, so I'm going to take the whole
 system offline while I restore from backup".

Up until now, my assumption has generally been that, upon finding the
corruption, the primary course of action is taking that server down
(hopefully you have a good replica), and do some kind of restore or sync
a new replica.

It sounds like you are exploring other possibilities.

  I suppose we could have a new ReadBufferMaybe function that would only
  be used by a sequential scan; and then just skip over the page if it's
  corrupt, depending on a GUC. That would at least allow sequential scans
  to (partially) work, which might be good enough for some data recovery
  situations. If a catalog index is corrupted, that could just be rebuilt.
  Haven't thought about the details, though.
 
 Not sure if you're being facetious here or not.

No. It was an incomplete thought (as I said), but sincere.

 Mild reworking of the
 logic for heap page access could cope with a NULL buffer response and
 subsequent looping, which would allow us to run pg_dump against a
 damaged table to allow data to be saved, keeping file intact for
 further analysis.

Right.

 I'm suggesting we work a little harder than "your block is corrupt"
 and give some thought to what the user will do next. Indexes are a
 good case, because we can/should report the block error, mark the
 index as invalid and then hint that it should be rebuilt.

Agreed; this applies to any derived data.

I don't think it will be very practical to keep a server running in this
state forever, but it might give enough time to reach a suitable
maintenance window.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2012-12-18 Thread Jeff Davis
On Tue, 2012-12-18 at 04:06 -0500, Greg Smith wrote:
 Having some way to nail down if the same block is bad on a 
 given standby seems like a useful interface we should offer, and it 
 shouldn't take too much work.  Ideally you won't find the same 
 corruption there.  I'd like a way to check the entirety of a standby for 
 checksum issues, ideally run right after it becomes current.  It seems 
 the most likely way to see corruption on one of those is to replicate a 
 corrupt block.

Part of the design is that pg_basebackup would verify checksums during
replication, so we should not replicate corrupt blocks (of course,
that's not implemented yet, so it's still a concern for now).

And we can also have ways to do background/offline checksum verification
with a separate utility.
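
The rough shape such a utility might take, scanning one relation segment
block by block (Python; verify_page() is left as a placeholder, since
the on-disk checksum format is exactly what's being settled in this
thread):

BLOCK_SIZE = 8192

def verify_page(page, block_no):
    # placeholder: recompute the page checksum, compare with the header copy
    return True

def scan_segment(path):
    bad = []
    with open(path, 'rb') as f:
        block_no = 0
        while True:
            page = f.read(BLOCK_SIZE)
            if len(page) < BLOCK_SIZE:
                break                      # partial/final read: stop
            if not verify_page(page, block_no):
                bad.append(block_no)
            block_no += 1
    return bad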

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2012-12-17 Thread Dimitri Fontaine
Jeff Davis pg...@j-davis.com writes:
 -A relation name
 -Corruption type (an entry from this list)
 -How many blocks to touch
 
 I'll just loop based on the count, randomly selecting a block each time 
 and messing with it in that way.

For the "messing with it" part, did you consider zzuf?

  http://caca.zoy.org/wiki/zzuf

 Does it make sense to have a separate executable (pg_corrupt) just for
 corrupting the data as a test? Or should it be part of a
 corruption-testing harness (pg_corruptiontester?), that introduces the
 corruption and then verifies that it's properly detected?

Maybe we need our own zzuf implementation, though.

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] Enabling Checksums

2012-12-17 Thread Simon Riggs
On 14 December 2012 20:15, Greg Smith g...@2ndquadrant.com wrote:
 On 12/14/12 3:00 PM, Jeff Davis wrote:

 After some thought, I don't see much value in introducing multiple
 instances of corruption at a time. I would think that the smallest unit
 of corruption would be the hardest to detect, so introducing many of
 them in one pass makes it easier to detect.


 That seems reasonable.  It would eliminate a lot of issues with reproducing
 a fault too.  I can just print the impacted block number presuming it will
 show up in a log, and make it possible to override picking one at random
 with a command line input.


Discussing this makes me realise that we need a more useful response
than just "your data is corrupt", so the user can respond "yes, I know,
I'm trying to save what's left".

We'll need a way of expressing some form of corruption tolerance.
zero_damaged_pages is just insane; much better if we set
corruption_tolerance = N to allow us to skip N corrupt pages before
failing, with -1 meaning keep skipping forever. Settable by superuser
only.
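
As a toy model of the proposed semantics (Python, names hypothetical;
-1 keeps skipping forever):

corruption_tolerance = 3          # the proposed superuser-only GUC

def page_is_corrupt(page):
    return page is None           # stand-in for a real checksum check

def rows_on(page):
    return page                   # stand-in: treat a page as a list of rows

def heap_scan(pages):
    skipped = 0
    for page in pages:
        if page_is_corrupt(page):
            if corruption_tolerance == -1 or skipped < corruption_tolerance:
                skipped += 1
                continue          # answer becomes silently incomplete
            raise IOError('corrupt page, tolerance exhausted')
        for row in rows_on(page):
            yield row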

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2012-12-17 Thread Tom Lane
Simon Riggs si...@2ndquadrant.com writes:
 Discussing this makes me realise that we need a more useful response
 than just "your data is corrupt", so the user can respond "yes, I know,
 I'm trying to save what's left".

 We'll need a way of expressing some form of corruption tolerance.
 zero_damaged_pages is just insane; much better if we set
 corruption_tolerance = N to allow us to skip N corrupt pages before
 failing, with -1 meaning keep skipping forever. Settable by superuser
 only.

Define "skip".  Extra points if it makes sense for an index.  And what
about things like pg_clog pages?

regards, tom lane




Re: [HACKERS] Enabling Checksums

2012-12-17 Thread Simon Riggs
On 17 December 2012 19:29, Tom Lane t...@sss.pgh.pa.us wrote:
 Simon Riggs si...@2ndquadrant.com writes:
 Discussing this makes me realise that we need a more useful response
 than just "your data is corrupt", so the user can respond "yes, I know,
 I'm trying to save what's left".

 We'll need a way of expressing some form of corruption tolerance.
 zero_damaged_pages is just insane; much better if we set
 corruption_tolerance = N to allow us to skip N corrupt pages before
 failing, with -1 meaning keep skipping forever. Settable by superuser
 only.

 Define "skip".

Allow data access, but accept that the answer is silently incomplete.
Not really much difference from zero_damaged_pages, which just removes
the error by removing any chance of repair or recovery, and then
silently gives the wrong answer.

 Extra points if it makes sense for an index.

I guess not, but that's no barrier to it working on heap pages only,
in my suggested use case.

 And what about things like pg_clog pages?

SLRUs aren't checksummed because of their lack of header space.
Perhaps that is a major point against the patch.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2012-12-17 Thread Jeff Davis
On Mon, 2012-12-17 at 19:14 +, Simon Riggs wrote:
 We'll need a way of expressing some form of corruption tolerance.
 zero_damaged_pages is just insane,

The main problem I see with zero_damaged_pages is that it could
potentially write out the zero page, thereby really losing your data if
it wasn't already lost. (Of course, we document that you should have a
backup first, but it's still dangerous.) I assume that this is the same
problem you are talking about.

I suppose we could have a new ReadBufferMaybe function that would only
be used by a sequential scan, and then just skip over the page if it's
corrupt, depending on a GUC. That would at least allow sequential scans
to (partially) work, which might be good enough for some data recovery
situations. If a catalog index is corrupted, that could just be rebuilt.
Haven't thought about the details, though.

Regards,
Jeff Davis





Re: [HACKERS] Enabling Checksums

2012-12-14 Thread Jeff Davis
On Wed, 2012-12-12 at 17:52 -0500, Greg Smith wrote:
 I can take this on, as part of the QA around checksums working as 
 expected.  The result would be a Python program; I don't have quite 
 enough time to write this in C or re-learn Perl to do it right now.  But 
 this won't be a lot of code.  If it's tossed one day as simply a 
 prototype for something more permanent, I think it's still worth doing now.
 
 The UI I'm thinking of for what I'm going to call pg_corrupt is a CLI 
 that asks for:
 
 - A relation name
 - Corruption type (an entry from this list)
 - How many blocks to touch
 
 I'll just loop based on the count, randomly selecting a block each time 
 and messing with it in that way.
 
 The randomness seed should be printed as part of the output, so that 
 it's possible to re-create the damage exactly later.  If the server doesn't 
 handle it correctly, we'll want to be able to replicate the condition it 
 choked on exactly later, just based on the tool's log output.
 
 Any other requests?

After some thought, I don't see much value in introducing multiple
instances of corruption at a time. I would think that the smallest unit
of corruption would be the hardest to detect, so introducing many of
them in one pass makes it easier to detect.

For example, if we introduce an all-ones page, and also transpose two
pages, the all-ones error might be detected even if the transpose error
is not being detected properly. And we'd not know that the transpose
error was not being detected, because the error appears as soon as it
sees the all-ones page.

Does it make sense to have a separate executable (pg_corrupt) just for
corrupting the data as a test? Or should it be part of a
corruption-testing harness (pg_corruptiontester?), that introduces the
corruption and then verifies that it's properly detected?

Regards,
Jeff Davis






Re: [HACKERS] Enabling Checksums

2012-12-14 Thread Greg Smith

On 12/14/12 3:00 PM, Jeff Davis wrote:

After some thought, I don't see much value in introducing multiple
instances of corruption at a time. I would think that the smallest unit
of corruption would be the hardest to detect, so introducing many of
them in one pass makes it easier to detect.


That seems reasonable.  It would eliminate a lot of issues with 
reproducing a fault too.  I can just print the impacted block number 
presuming it will show up in a log, and make it possible to override 
picking one at random with a command line input.



Does it make sense to have a separate executable (pg_corrupt) just for
corrupting the data as a test? Or should it be part of a
corruption-testing harness (pg_corruptiontester?), that introduces the
corruption and then verifies that it's properly detected?


Let me see what falls out of the coding; I don't think this part needs 
to get nailed down yet.  Building a corruption testing harness is going 
to involve a lot of creating new clusters and test data to torture. 
It's a different style of problem than injecting faults in the first place.


--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2012-12-12 Thread Greg Smith

On 12/5/12 6:49 PM, Simon Riggs wrote:

* Zeroing pages, making pages all 1s
* Transposing pages
* Moving chunks of data sideways in a block
* Flipping bits randomly
* Flipping data endianness
* Destroying particular catalog tables or structures


I can take this on, as part of the QA around checksums working as 
expected.  The result would be a Python program; I don't have quite 
enough time to write this in C or re-learn Perl to do it right now.  But 
this won't be a lot of code.  If it's tossed one day as simply a 
prototype for something more permanent, I think it's still worth doing now.


The UI I'm thinking of for what I'm going to call pg_corrupt is a CLI 
that asks for:


- A relation name
- Corruption type (an entry from this list)
- How many blocks to touch

I'll just loop based on the count, randomly selecting a block each time 
and messing with it in that way.


The randomness seed should be printed as part of the output, so that 
it's possible to re-create the damage exactly later.  If the server doesn't 
handle it correctly, we'll want to be able to replicate the condition it 
choked on exactly later, just based on the tool's log output.
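
To make that concrete, here is a rough skeleton of what I have in
mind -- Python, with the option names, the corruption table, and the
direct writes to a relation's data file (server stopped) all being
illustrative assumptions rather than settled decisions:

    import argparse, os, random

    BLCKSZ = 8192                         # assumes default block size

    def zero_block(f, blkno):
        f.seek(blkno * BLCKSZ)
        f.write(b'\x00' * BLCKSZ)

    CORRUPTIONS = {'zero': zero_block}    # other types would go here

    def main():
        p = argparse.ArgumentParser(prog='pg_corrupt')
        p.add_argument('relfile')         # path to relation data file
        p.add_argument('corruption', choices=CORRUPTIONS)
        p.add_argument('--blocks', type=int, default=1)
        p.add_argument('--block', type=int)   # override the random pick
        p.add_argument('--seed', type=int)
        args = p.parse_args()

        # Print the seed so the exact damage can be re-created later.
        seed = args.seed if args.seed is not None else random.randrange(1 << 31)
        rng = random.Random(seed)
        print("pg_corrupt: seed=%d" % seed)

        nblocks = os.path.getsize(args.relfile) // BLCKSZ
        with open(args.relfile, 'r+b') as f:
            for _ in range(args.blocks):
                blkno = args.block if args.block is not None \
                        else rng.randrange(nblocks)
                print("pg_corrupt: %s block %d" % (args.corruption, blkno))
                CORRUPTIONS[args.corruption](f, blkno)

    if __name__ == '__main__':
        main()

Keeping each corruption type as a small function keyed by name should
make it easy to fill in the rest of the list later.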


Any other requests?

--
Greg Smith   2ndQuadrant USg...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com




Re: [HACKERS] Enabling Checksums

2012-12-06 Thread Kevin Grittner
Robert Haas wrote:
 Jeff Davis pg...@j-davis.com wrote:
 Or, I could write up a test framework in ruby or python, using
 the appropriate pg driver, and some not-so-portable shell
 commands to start and stop the server. Then, I can publish that
 on this list, and that would at least make it easier to test
 semi-manually and give greater confidence in pre-commit
 revisions.
 
 That latter approach is similar to what happened with SSI's
 isolation tester. It started out in Python, and then Heikki
 rewrote it in C.
 If Python/Ruby code is massively simpler to write than the C
 code, that might be a good way to start out. It'll be an aid to
 reviewers even if neither it nor any descendent gets committed.
 
 Frankly, I think some automated testing harness (written in C or
 Perl) that could do fault-injection tests as part of the
 buildfarm would be amazingly awesome. I'm drooling just thinking
 about it. But I guess that's getting ahead of myself.

There may be room for both.

My experience was that the dtester tool from Markus made it
relatively easy for me to hack up new tests which gave detailed
information about which permutations were behaving as desired,
which were known not to be covered, and which had regressions.
That speed of adding new tests and detail about improvements or
regressions allowed faster development than would have been
possible with the isolation tester that Heikki wrote in C.

On the other hand, dtester requires Python (in fact, I think it
requires Python version 2.x where x is 5 or greater), a requirement
which I don't think we want to add for builds. It wasn't very
compatible with the normal make check environment, either in how it
was run or in its output. And it was much slower than the isolation
test framework -- like by about an order of magnitude.

So for a completed product on which you want to test for
regressions, the isolation tester is much better. For a development
effort on the scale of SSI, I would want to have dtester or
something very like it available.

Neither one quite handles tests for all the types of concurrency
conditions that one might want. I had some idea how to add some
additional useful cases to dtester, and it didn't look outrageously
hard. I haven't really looked at how to do that in the isolation
tester, so I don't know how hard it would be there.

-Kevin




Re: [HACKERS] Enabling Checksums

2012-12-05 Thread Robert Haas
On Tue, Dec 4, 2012 at 6:17 PM, Jeff Davis pg...@j-davis.com wrote:
 Or, I could write up a test framework in ruby or python, using the
 appropriate pg driver, and some not-so-portable shell commands to start
 and stop the server. Then, I can publish that on this list, and that
 would at least make it easier to test semi-manually and give greater
 confidence in pre-commit revisions.

That latter approach is similar to what happened with SSI's isolation
tester.  It started out in Python, and then Heikki rewrote it in C.
If Python/Ruby code is massively simpler to write than the C code,
that might be a good way to start out.  It'll be an aid to reviewers
even if neither it nor any descendent gets committed.

Frankly, I think some automated testing harness (written in C or Perl)
that could do fault-injection tests as part of the buildfarm would be
amazingly awesome.  I'm drooling just thinking about it.  But I guess
that's getting ahead of myself.
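
Even the semi-manual harness Jeff describes is not much code. A
sketch, assuming psycopg2 as the driver and with the data directory,
relation file path, and table name all made up:

    import subprocess
    import psycopg2                        # "the appropriate pg driver"

    DATADIR = '/tmp/pgdata'                # hypothetical test cluster
    RELFILE = DATADIR + '/base/16384/16385'  # hypothetical relation file

    def pg_ctl(mode):
        # the "not-so-portable shell commands" to start/stop the server
        subprocess.check_call(['pg_ctl', '-D', DATADIR, '-w', mode])

    def corrupt_first_block():
        with open(RELFILE, 'r+b') as f:
            f.write(b'\xff' * 8192)        # stomp on block 0

    def test_detection():
        pg_ctl('stop')
        corrupt_first_block()
        pg_ctl('start')
        conn = psycopg2.connect(dbname='test')
        try:
            conn.cursor().execute('SELECT count(*) FROM some_table')
            print('FAIL: corruption was not detected')
        except psycopg2.Error as e:
            print('OK, detected: %s' % e)
        finally:
            conn.close()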

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Enabling Checksums

2012-12-05 Thread Simon Riggs
On 5 December 2012 23:40, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Dec 4, 2012 at 6:17 PM, Jeff Davis pg...@j-davis.com wrote:
 Or, I could write up a test framework in ruby or python, using the
 appropriate pg driver, and some not-so-portable shell commands to start
 and stop the server. Then, I can publish that on this list, and that
 would at least make it easier to test semi-manually and give greater
 confidence in pre-commit revisions.

 That latter approach is similar to what happened with SSI's isolation
 tester.  It started out in Python, and then Heikki rewrote it in C.
 If Python/Ruby code is massively simpler to write than the C code,
 that might be a good way to start out.  It'll be an aid to reviewers
 even if neither it nor any descendent gets committed.

 Frankly, I think some automated testing harness (written in C or Perl)
 that could do fault-injection tests as part of the buildfarm would be
 amazingly awesome.  I'm drooling just thinking about it.  But I guess
 that's getting ahead of myself.

Agreed, though we can restrict that to a few things at first.

* Zeroing pages, making pages all 1s
* Transposing pages
* Moving chunks of data sideways in a block
* Flipping bits randomly
* Flipping data endianness
* Destroying particular catalog tables or structures

etc

As a contrib module, so we can be sure to never install it. ;-)
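
For scale: the first few items on that list come out to a handful of
lines each, if the tool writes directly to a relation's data file with
the server shut down. A sketch, with the block size and all function
names being assumptions:

    BLCKSZ = 8192

    def make_all_ones(f, blkno):
        # "making pages all 1s"
        f.seek(blkno * BLCKSZ)
        f.write(b'\xff' * BLCKSZ)

    def transpose_blocks(f, blk_a, blk_b):
        # "Transposing pages": swap two whole blocks in place
        f.seek(blk_a * BLCKSZ); a = f.read(BLCKSZ)
        f.seek(blk_b * BLCKSZ); b = f.read(BLCKSZ)
        f.seek(blk_a * BLCKSZ); f.write(b)
        f.seek(blk_b * BLCKSZ); f.write(a)

    def shift_chunk(f, blkno, offset, length, delta):
        # "Moving chunks of data sideways in a block"
        f.seek(blkno * BLCKSZ)
        page = bytearray(f.read(BLCKSZ))
        page[offset + delta:offset + delta + length] = \
            page[offset:offset + length]
        f.seek(blkno * BLCKSZ)
        f.write(page[:BLCKSZ])    # clip in case the shift grew the page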

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Enabling Checksums

2012-12-04 Thread Jeff Davis
On Mon, 2012-12-03 at 13:16 +, Simon Riggs wrote:
 On 3 December 2012 09:56, Simon Riggs si...@2ndquadrant.com wrote:
 
  I think the way forwards for this is...
 
  1. Break out the changes around inCommit flag, since that is just
  uncontroversial refactoring. I can do that. That reduces the noise
  level in the patch and makes it easier to understand the meaningful
  changes.
 
 Done.

Thank you.

One minor thing I noticed: it looks like nwaits is a useless variable.
Your original checksums patch used it to generate a warning, but now
that is gone. It's not throwing a compiler warning for some reason.

Regards,
Jeff Davis





