Re: [HACKERS] Enabling Checksums
On 2013-03-06 13:34:21 +0200, Heikki Linnakangas wrote:
> On 06.03.2013 10:41, Simon Riggs wrote:
>> On 5 March 2013 18:02, Jeff Davis pg...@j-davis.com wrote:
>>> Fletcher is probably significantly faster than CRC-16, because I'm just doing int32 addition in a tight loop. Simon originally chose Fletcher, so perhaps he has more to say.
>> IIRC the research showed Fletcher was significantly faster for only a small loss in error detection rate. It was sufficient to make our error detection 1 million times better, possibly more. That seems sufficient to enable early detection of problems, since if we missed the first error, a second is very likely to be caught (etc). So I am assuming that we're trying to catch a pattern of errors early, rather than guarantee we can catch the very first error.
> Fletcher's checksum is good in general; I was mainly worried about truncating the Fletcher-64 into two 8-bit values. I can't spot any obvious weakness in it, but if it's indeed faster and as good as a straightforward Fletcher-16, I wonder why that method is not more widely used.

I personally am not that convinced that Fletcher is such a good choice, given that AFAIK it doesn't distinguish between all-zero and all-one runs that are long enough.

> Another thought is that perhaps something like CRC32C would be faster to calculate on modern hardware, and could be safely truncated to 16 bits using the same technique you're using to truncate the Fletcher checksum. Greg's tests showed that the overhead of CRC calculation is significant in some workloads, so it would be good to spend some time to optimize that. It'd be difficult to change the algorithm in a future release without breaking on-disk compatibility, so let's make sure we pick the best one.

I had implemented a noticeably faster CRC32 implementation somewhere around 201005202227.49990.and...@anarazel.de. I have since repeatedly seen pg's CRC32 implementation being a major limitation, so I think brushing up that patch would be a good idea.
We might think about switching the polynomial for WAL at the same time, given that, as you say, CRC32C is available in hardware. The bigger problem is probably stuff like the control file et al.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
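A minimal sketch of the Fletcher-style approach under discussion — two 32-bit running sums over the page, folded down to a 16-bit value — might look like the following. This is an illustration of the technique, not the patch's actual code; the fold-by-255 and the function name are choices made here for the example:

```c
#include <stdint.h>
#include <stddef.h>

/* Fletcher-style 16-bit page checksum: two 32-bit running sums, plain
 * int32 addition in a tight loop.  For real page-sized inputs the sums
 * would need periodic folding to avoid 32-bit overflow; that is omitted
 * here for brevity. */
uint16_t fletcher16_page(const uint8_t *page, size_t len)
{
    uint32_t sum1 = 0, sum2 = 0;
    for (size_t i = 0; i < len; i++) {
        sum1 += page[i];
        sum2 += sum1;
    }
    /* Fold each 32-bit sum down to 8 bits; the two bytes together form
     * the 16-bit checksum.  Note Andres's caveat: because 0xFF == 0
     * (mod 255), all-zero and all-one runs produce the same result. */
    return (uint16_t)(((sum2 % 255) << 8) | (sum1 % 255));
}
```

The modulo-255 fold makes the collision Andres describes easy to demonstrate: a buffer of 0x00 bytes and one of 0xFF bytes both checksum to zero.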
Re: [HACKERS] Enabling Checksums
On Wed, Mar 06, 2013 at 01:34:21PM +0200, Heikki Linnakangas wrote:
> On 06.03.2013 10:41, Simon Riggs wrote:
> [...]
> Fletcher's checksum is good in general; I was mainly worried about truncating the Fletcher-64 into two 8-bit values. I can't spot any obvious weakness in it, but if it's indeed faster and as good as a straightforward Fletcher-16, I wonder why that method is not more widely used.

I was wondering about the effectiveness of this resulting truncated hash function as well.

> Another thought is that perhaps something like CRC32C would be faster to calculate on modern hardware, and could be safely truncated to 16 bits using the same technique you're using to truncate the Fletcher checksum. Greg's tests showed that the overhead of CRC calculation is significant in some workloads, so it would be good to spend some time to optimize that. It'd be difficult to change the algorithm in a future release without breaking on-disk compatibility, so let's make sure we pick the best one.

If picking a CRC, why not a short optimal one rather than truncating CRC32C? I've been reading about optimal checksums for small messages for other reasons and found this paper quite good.
http://www.ece.cmu.edu/~koopman/roses/dsn04/koopman04_crc_poly_embedded.pdf

I was interested in small messages and small checksums, so this paper may not be as much help here. Other than CRCs and Fletcher sums, Pearson hashing with a 16-bit block might be worth considering. Either a Pearson hash or a 16-bit CRC is small enough to implement with a lookup table rather than a formula.

I've been wondering what kind of errors we expect. Single bit flips? Large swaths of bytes corrupted? Are we more worried about collisions (the odds that total garbage has the same checksum) or the odds that we detect a flip of n bits? I would think that since the message is large, and a write to the wrong location seems about as likely as a bit flip, a Pearson hash would be good. Any choice seems like it would be a nice improvement at noticing a storage stack problem. The difference would be subtle: can I accurately estimate the odds of undetected corruption since the condition was first detected, or does the checksum/hash perform poorly?

Garick
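A table-driven Pearson hash of the kind Garick mentions can be sketched as below. This is an illustration only: the affine permutation table is a placeholder (a real deployment would use a randomly generated permutation), and combining two seeded 8-bit lanes into a 16-bit result is a simplification of the widening trick from Pearson's original paper:

```c
#include <stdint.h>
#include <stddef.h>

static uint8_t pearson_table[256];
static int pearson_ready = 0;

static void pearson_init(void)
{
    /* Placeholder table: an affine map with an odd multiplier is a
     * bijection on 0..255, so this is a valid (if weak) permutation. */
    for (int i = 0; i < 256; i++)
        pearson_table[i] = (uint8_t)(i * 167 + 89);
    pearson_ready = 1;
}

/* Two 8-bit Pearson lanes with different seeds, one table lookup per
 * input byte per lane, concatenated into a 16-bit result. */
uint16_t pearson16(const uint8_t *data, size_t len)
{
    if (!pearson_ready)
        pearson_init();
    uint8_t h1 = 0, h2 = 1;
    for (size_t i = 0; i < len; i++) {
        h1 = pearson_table[h1 ^ data[i]];
        h2 = pearson_table[h2 ^ data[i]];
    }
    return (uint16_t)((h1 << 8) | h2);
}
```

Because the table is a permutation, any change to the final byte of a fixed-length input is guaranteed to change the result.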
Re: [HACKERS] Enabling Checksums
On 2013-03-06 11:21:21 -0500, Garick Hamlin wrote:
> If picking a CRC why not a short optimal one rather than truncate CRC32C?

CRC32C is available in hardware since SSE4.2.

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On Mon, Mar 4, 2013 at 3:13 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote:
> On 04.03.2013 20:58, Greg Smith wrote:
>> There is no such thing as a stable release of btrfs, and no timetable for when there will be one. I could do some benchmarks of that but I didn't think they were very relevant. Who cares how fast something might run when it may not work correctly? btrfs might as well be /dev/null to me right now--sure it's fast, but maybe the data won't be there at all.
> This PostgreSQL patch hasn't seen any production use, either. In fact, I'd consider btrfs to be more mature than this patch. Unless you think that there will be some major changes for the worse in btrfs performance, it's perfectly valid and useful to compare the two. A comparison with ZFS would be nice too. That's mature, and has checksums.

We've had a few EnterpriseDB customers who have had fantastically painful experiences with PostgreSQL + ZFS. Supposedly, aligning the ZFS block size to the PostgreSQL block size is supposed to make these problems go away, but in my experience it does not have that effect. So I think telling people who want checksums "go use ZFS" is a lot like telling them "oh, I see you have a hangnail, we recommend that you solve that by cutting your arm off with a rusty saw."

There may be good reasons to reject this patch. Or there may not. But I completely disagree with the idea that asking them to solve the problem at the filesystem level is sensible.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Enabling Checksums
> There may be good reasons to reject this patch. Or there may not. But I completely disagree with the idea that asking them to solve the problem at the filesystem level is sensible.

Yes, can we get back to the main issues with the patch?

1) argument over whether the checksum is sufficient to detect most errors, or if it will give users false confidence.
2) performance overhead.

Based on Smith's report, I consider (2) to be a deal-killer right now. The level of overhead reported by him would prevent the users I work with from ever employing checksums on production systems. Specifically, writing checksums for a read-only query is a defect I think is prohibitively bad. When we first talked about this feature for 9.2, we were going to exclude hint bits from checksums, in order to avoid this issue; what happened to that? (FWIW, I still support the idea of moving hint bits to a separate filehandle, as we do with the FSM, but clearly that's not happening for 9.3 ...)

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Enabling Checksums
Robert, We've had a few EnterpriseDB customers who have had fantastically painful experiences with PostgreSQL + ZFS. Supposedly, aligning the ZFS block size to the PostgreSQL block size is supposed to make these problems go away, but in my experience it does not have that effect. So I think telling people who want checksums go use ZFS is a lot like telling them oh, I see you have a hangnail, we recommend that you solve that by cutting your arm off with a rusty saw. Wow, what platform are you using ZFS on? (we have a half-dozen clients on ZFS ...) -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Enabling Checksums
On Wed, Mar 6, 2013 at 2:14 PM, Josh Berkus j...@agliodbs.com wrote:
> Based on Smith's report, I consider (2) to be a deal-killer right now.

I was pretty depressed by those numbers, too.

> The level of overhead reported by him would prevent the users I work with from ever employing checksums on production systems.

Agreed.

> Specifically, the writing checksums for a read-only query is a defect I think is prohibitively bad.

That particular part doesn't bother me so much as some of the others - but let's step back and look at the larger issue. I suspect we can all agree that the performance of this feature is terrible. The questions I think we should be asking are:

1. Are the problems fundamental, or things where we can reasonably foresee future improvement? The latter situation wouldn't bother me very much even if the current situation is pretty bad, but if there's no real hope of improvement, that's more of a problem.
2. Are the performance results sufficiently bad that we think this would be more of a liability than an asset?

> When we first talked about this feature for 9.2, we were going to exclude hint bits from checksums, in order to avoid this issue; what happened to that?

I don't think anyone ever thought that was a particularly practical design. I certainly don't.

> (FWIW, I still support the idea of moving hint bits to a separate filehandle, as we do with the FSM, but clearly that's not happening for 9.3 ...)

Or, most likely, ever. The whole benefit of hint bits is that the information you need is available in the same bytes you have to read anyway. Moving the information to another fork (not filehandle) would probably give up most of the benefit.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Enabling Checksums
On Wed, Mar 6, 2013 at 6:00 PM, Josh Berkus j...@agliodbs.com wrote:
>> We've had a few EnterpriseDB customers who have had fantastically painful experiences with PostgreSQL + ZFS. [...]
> Wow, what platform are you using ZFS on? (we have a half-dozen clients on ZFS ...)

Not us, customers. But as to platform, I have yet to run across anyone running ZFS on anything but Solaris. I'd be interested to hear your experiences. Mine rhyme with "sun a play dreaming."

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Enabling Checksums
On 03/06/2013 03:06 PM, Robert Haas wrote:
> On Wed, Mar 6, 2013 at 6:00 PM, Josh Berkus j...@agliodbs.com wrote:
>> Wow, what platform are you using ZFS on? (we have a half-dozen clients on ZFS ...)
> Not us, customers. But as to platform, I have yet to run across anyone running ZFS on anything but Solaris. I'd be interested to hear your experiences. Mine rhyme with "sun a play dreaming."

I would guess he meant on x86_64 or SPARC.

JD

-- Command Prompt, Inc. - http://www.commandprompt.com/ PostgreSQL Support, Training, Professional Services and Development High Availability, Oracle Conversion, Postgres-XC @cmdpromptinc - 509-416-6579
Re: [HACKERS] Enabling Checksums
Andres Freund and...@2ndquadrant.com writes:
> On 2013-03-06 11:21:21 -0500, Garick Hamlin wrote:
>> If picking a CRC why not a short optimal one rather than truncate CRC32C?
> CRC32C is available in hardware since SSE4.2.

I think that should be at most a fourth-order consideration, since we are not interested solely in Intel hardware, nor do we have any portable way of getting at such a feature even if the hardware has it.

regards, tom lane
Re: [HACKERS] Enabling Checksums
On 03/06/2013 07:34 PM, Heikki Linnakangas wrote:
> It'd be difficult to change the algorithm in a future release without breaking on-disk compatibility,

On-disk compatibility is broken with major releases anyway, so I don't see this as a huge barrier.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On 2013-03-07 08:37:40 +0800, Craig Ringer wrote:
> On 03/06/2013 07:34 PM, Heikki Linnakangas wrote:
>> It'd be difficult to change the algorithm in a future release without breaking on-disk compatibility,
> On-disk compatibility is broken with major releases anyway, so I don't see this as a huge barrier.

Uh, pg_upgrade?

Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On 3/6/13 1:34 PM, Robert Haas wrote:
> We've had a few EnterpriseDB customers who have had fantastically painful experiences with PostgreSQL + ZFS. Supposedly, aligning the ZFS block size to the PostgreSQL block size is supposed to make these problems go away, but in my experience it does not have that effect.

There are a couple of major tuning issues you have to get right for good ZFS performance, like its tendency to gobble more RAM than is necessarily appropriate for a PostgreSQL host. If you nail down all of those and carefully set everything up, it can work OK. When Sun had a bunch of good engineers working on the problem, they certainly pulled it off. I managed a 3TB database on a ZFS volume for a while myself. Being able to make filesystem snapshots cleanly and easily was very nice. As for the write performance implications of COW, though, at a couple of points I was only able to keep that system ingesting data fast enough if I turned fsync off :(

It's not as if even ZFS makes all the filesystem issues the database worries about go away, either. Take a look at http://www.c0t0d0s0.org/archives/6071-No,-ZFS-really-doesnt-need-a-fsck.html as an example. That should leave you with a healthy concern over ZFS's handling of power interruption and lying drives. [NTFS and ext3] have the same problem, but it has different effects that aren't as visible as in ZFS. ext4 actually fixed this for most hardware, though, and I believe ZFS still has the same uberblock concern. ZFS reliability and its page checksums are good, but they're not magic for eliminating torn page issues.

Normally I would agree with Heikki's "let's wait a few years and see if the filesystem will take care of it" theory. But for me, the "when do we get checksums?" clock started ticking in 2006 when ZFS popularized its implementation, and now it's gone off and it keeps ringing at new places. I would love it if FreeBSD had caught a massive popularity wave in the last few years, so that ZFS was running in a lot more places. Instead, what I keep seeing is Linux deployments with filesystem choices skewed toward the conservative. Forget about the leading edge--I'd be happy if I could get one large customer to migrate off of ext3...

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Enabling Checksums
On 03/07/2013 08:41 AM, Andres Freund wrote:
> On 2013-03-07 08:37:40 +0800, Craig Ringer wrote:
>> On-disk compatibility is broken with major releases anyway, so I don't see this as a huge barrier.
> Uh, pg_upgrade?

Yeah. I was thinking that pg_upgrade copes with a lot of incompatibilities already, but this is lower-level. Darn.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On 3/4/13 7:04 PM, Daniel Farina wrote:
> Corruption has easily occupied more than one person-month of time last year for us.

Just FYI for anyone that's experienced corruption... we've looked into doing row-level checksums at work. The only challenge we ran into was how to check them when reading data back. I don't remember the details, but there was an issue with doing this via SELECT rules. It would be possible if you were willing to put writable views on all your tables (which isn't actually as horrible as it sounds; it wouldn't be hard to write a function to automagically do that for you).
Re: [HACKERS] Enabling Checksums
On 3/6/13 1:14 PM, Josh Berkus wrote:
> Yes, can we get back to the main issues with the patch? 1) argument over whether the checksum is sufficient to detect most errors, or if it will give users false confidence. 2) performance overhead. Based on Smith's report, I consider (2) to be a deal-killer right now. The level of overhead reported by him would prevent the users I work with from ever employing checksums on production systems.

FWIW, the write workload most likely wouldn't be a problem for us. I am concerned about the reported 24-32% hit when reading back in from FS cache... that might kill this for us. I'm working on doing a test to see how bad it actually is for us... but getting stuff like that done at work is like pulling teeth, so we'll see...

> Specifically, the writing checksums for a read-only query is a defect I think is prohibitively bad. When we first talked about this feature for 9.2, we were going to exclude hint bits from checksums, in order to avoid this issue; what happened to that? (FWIW, I still support the idea of moving hint bits to a separate filehandle, as we do with the FSM, but clearly that's not happening for 9.3 ...)

+1
Re: [HACKERS] Enabling Checksums
On 3/6/13 6:34 AM, Heikki Linnakangas wrote:
> Another thought is that perhaps something like CRC32C would be faster to calculate on modern hardware, and could be safely truncated to 16 bits using the same technique you're using to truncate the Fletcher checksum. Greg's tests showed that the overhead of CRC calculation is significant in some workloads, so it would be good to spend some time to optimize that. It'd be difficult to change the algorithm in a future release without breaking on-disk compatibility, so let's make sure we pick the best one.

Simon sent over his first rev of this using a quick-to-compute 16-bit checksum as a reasonable trade-off, one that it's possible to do right now. It's not optimal in a few ways, but it catches single-bit errors that are missed right now, and Fletcher-16 computes quickly and without a large amount of code. It's worth double-checking that the code is using the best Fletcher-16 approach available. I've started on that, but I'm working on your general performance concerns first, with the implementation that's already there. From what I've read so far, I think picking Fletcher-16 instead of the main alternative, CRC-16-IBM AKA CRC-16-ANSI, is a reasonable choice. There's a good table showing the main possibilities at https://en.wikipedia.org/wiki/Cyclic_redundancy_check

One day I hope that in-place upgrade learns how to do page format upgrades, with the sort of background conversion tools and necessary tracking metadata we've discussed for that work. When that day comes, I would expect it to be straightforward to upgrade pages from 16-bit Fletcher checksums to 32-bit CRC-32C ones. Ideally we would be able to jump on the CRC-32C train today, but there's nowhere to put all 32 bits. Using a 16-bit Fletcher checksum for 9.3 doesn't prevent the project from going that way later, though, once page header expansion is a solved problem.

The problem with running CRC32C in software is that the standard fast approach uses a slicing technique that requires a chunk of pre-computed data to be around, a moderately large lookup table. I don't see that there's any advantage to having all that baggage around if you're just going to throw away half of the result anyway. More on CRC32Cs in my next message.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Enabling Checksums
On 3/6/13 1:24 PM, Tom Lane wrote:
> Andres Freund and...@2ndquadrant.com writes:
>> CRC32C is available in hardware since SSE4.2.
> I think that should be at most a fourth-order consideration, since we are not interested solely in Intel hardware, nor do we have any portable way of getting at such a feature even if the hardware has it.

True, but that situation might actually improve. The Castagnoli CRC-32C that's accelerated on the better Intel CPUs is also used to protect iSCSI and SCTP (a streaming protocol). And there is an active project to use a CRC32C to checksum ext4 metadata blocks on Linux:

https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums
https://groups.google.com/forum/?fromgroups=#!topic/linux.kernel/APKfoMzjgdY

Now, that project doesn't make the Postgres feature obsolete, because there's nowhere to put checksum data for every block on ext4 without whacking block alignment. The filesystem can't make an extra 32 bits appear on every block any more than we can. It's using a similar trick to the PG checksum feature, grabbing some empty space just for the metadata and then shoving the CRC32C into there. But the fact that this is going on means that there are already Linux kernel modules built with both software- and hardware-accelerated versions of the CRC32C function. And the iSCSI/SCTP use cases mean it's not out of the question this will show up in other useful forms one day. Maybe two years from now there will be a common Linux library that autoconf can find to compute the CRC for us--with hardware acceleration when available, in software if not.

The first of those ext4 links above even discusses the exact sort of issue we're facing. The author wonders if the easiest way to proceed for 16-bit checksums is to compute the CRC32C, then truncate it, simply because CRC32C creation is so likely to get hardware help one day. I think that logic doesn't apply as strongly to the PostgreSQL case, though, as the time before we can expect a hardware-accelerated version to be available to us is much further off than it is for a Linux kernel developer.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Enabling Checksums
On Wed, Mar 6, 2013 at 11:04 PM, Robert Haas robertmh...@gmail.com wrote:
>> When we first talked about this feature for 9.2, we were going to exclude hint bits from checksums, in order to avoid this issue; what happened to that?
> I don't think anyone ever thought that was a particularly practical design. I certainly don't.

Really? I thought it was pretty much the consensus for a good while. The main problem it ran into was that we kept turning up hint bits that we didn't realize we had. Index line pointers turned out to have hint bits, page headers have one, and so on. As long as it was just the per-tuple transaction hint bits on heap pages, it seemed plausible to just skip them or move them all to a contiguous block. Once it started to look like the checksumming code had to know about every data structure on every page, it seemed a bit daunting. But that wasn't something we realized for quite a long time.

-- greg
Re: [HACKERS] Enabling Checksums
TL;DR summary: on a system I thought was a fair middle-of-the-road server, pgbench tests are averaging about a 2% increase in WAL writes and a 2% slowdown when I turn on checksums. There are a small number of troublesome cases where that overhead rises to closer to 20%, an upper limit that's shown up in a few tests aiming to stress this feature now.

On 3/4/13 10:09 PM, Jeff Davis wrote:
>> = Test 2 - worst-case overhead for calculating checksum while reading data =
>> Jeff saw an 18% slowdown, I get 24 to 32%. This one bothers me because the hit is going to happen during the very common situation where data is shuffling a lot between a larger OS cache and shared_buffers taking a relatively small fraction.
> I believe that test 1 and test 2 can be improved a little, if there is a need. Right now we copy the page and then calculate the checksum on the copy. If we instead calculate as we're copying, I believe it will make it significantly faster.

It's good to know there are at least some ideas for optimizing this one further. I think the situation where shared_buffers < database < total RAM is fairly common for web applications. For people on Amazon EC2 instances, for example, giving out the performance tuning advice of "get a bigger instance until the database fits in RAM" works amazingly well. If the hotspot of that data set fits in shared_buffers, those people will still be in good shape even with checksums enabled. If the hot working set is spread out more randomly, though, it's not impossible to see how they could suffer regularly from this ~20% OS cache to shared_buffers movement penalty.

Regardless, Jeff's three cases are good synthetic exercises to see worst-case behavior, but they are magnifying small differences. To see a more general case, I ran through a series of pgbench tests in its standard write mode. In order to be useful, I ended up using a system with a battery-backed write cache, but with only a single drive attached. I needed fsync to be fast to keep that from being the bottleneck, but I wanted physical I/O to be slow. I ran three test sets at various size/client loads: one without the BBWC (which I kept here because it gives some useful scale to the graphs), one with the baseline 9.3 code, and one with checksums enabled on the cluster. I did only basic postgresql.conf tuning:

checkpoint_segments = 64
shared_buffers = 2GB

There are two graphs comparing sets attached; you can see that the slowdown of checksums for this test is pretty minor. There is a clear gap between the two plots, but it's not a very big one, especially if you note how much difference a BBWC makes. I put the numeric results into a spreadsheet, also attached. There's so much noise in pgbench results that I found it hard to get a single number for the difference; they bounce around about +/-5% here. Averaging across everything gives a solid 2% drop when checksums are on, which looked detectable above the noise.

Things are worse on the bigger data sets. At the highest size I tested, the drop was more like 7%. The two larger-size / low-client-count results I got were really bad, 25% and 16% drops. I think this is closing in on the range of things: perhaps only 2% when most of your data fits in shared_buffers, more like 10% if your database is bigger, and in the worst case 20% is possible. I don't completely trust those 25/16% numbers though; I'm going to revisit that configuration.

The other thing I track now in pgbench-tools is how many bytes of WAL are written. Since the total needs to be measured relative to work accomplished, the derived number that looks useful is average bytes of WAL per transaction. On smaller databases this is around 6K, while larger databases topped out for me at around 22K WAL bytes/transaction. Remember that the pgbench transaction is several statements. Updates touch different blocks in pgbench_accounts, index blocks, and the small tables. The WAL increase from checksumming is a bit more consistent than the TPS rates. Many cases were 3 to 5%. There was one ugly case where it hit 30%, and I want to dig into where that came from more. On average, again, it was a 2% increase over the baseline.

Cases where you spew hint bit WAL data where before none was written (Jeff's test #3) remain a far worse performer than any of these. Since pgbench does a VACUUM before starting, none of those cases were encountered here, though.

-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

[Attachments: clients-sets.png, scaling-sets.png, Checksum-pgbench.xls (MS-Excel spreadsheet)]
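The calculate-while-copying optimization Jeff suggests — folding the checksum into the buffer copy so the page is traversed once instead of twice — can be sketched as below. This is a hypothetical illustration, not the patch's code: a simple word-wise additive checksum stands in for whatever algorithm is actually chosen, and the function name is made up:

```c
#include <stdint.h>
#include <stddef.h>

/* Copy a buffer word by word while accumulating a checksum over the
 * same words.  Each word is loaded once and feeds both the copy and
 * the running sum, instead of memcpy() followed by a second pass. */
uint32_t copy_and_checksum(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++) {
        uint32_t w = src[i];  /* single load... */
        dst[i] = w;           /* ...feeds the copy */
        sum += w;             /* ...and the checksum */
    }
    return sum;
}
```

The point is purely about memory traffic: the checksum pass becomes nearly free because the data is already in registers for the copy, which is why this should help most in the OS cache to shared_buffers movement case measured above.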
Re: [HACKERS] Enabling Checksums
On Wed, Mar 6, 2013 at 8:17 PM, Greg Smith g...@2ndquadrant.com wrote: TL;DR summary: on a system I thought was a fair middle of the road server, pgbench tests are averaging about a 2% increase in WAL writes and a 2% slowdown when I turn on checksums. There are a small number of troublesome cases where that overhead rises to closer to 20%, an upper limit that's shown up in a few tests aiming to stress this feature now. I have only done some cursory research, but CPU time of 20% seems to be expected for InnoDB's CRC computation[0]. Although it is a galling number, this comparison with other systems may be a way to see how much of that overhead is avoidable or just the price of entry. It's unclear how this 20% CPU time compares to your above whole-system results, but it's enough to suggest that nothing comes for (nearly) free. [0]: http://mysqlha.blogspot.com/2009/05/innodb-checksum-performance.html -- fdr
Re: [HACKERS] Enabling Checksums
On 3/7/13 12:15 AM, Daniel Farina wrote: I have only done some cursory research, but CPU time of 20% seems to be expected for InnoDB's CRC computation[0]. Although it is a galling number, this comparison with other systems may be a way to see how much of that overhead is avoidable or just the price of entry. It's unclear how this 20% CPU time compares to your above whole-system results, but it's enough to suggest that nothing comes for (nearly) free. That does provide a useful measuring point: how long does the computation take compared to the memcpy that moves the buffer around. It looks like they started out with 3.2 memcpy worth of work, and with enough optimization ended up at 1.27 worth. The important thing to keep in mind is that shared_buffers works pretty well at holding on to the most frequently accessed information. A typical server I see will show pg_statio information suggesting 90%+ of block requests are coming from hits there, the rest misses suggesting a mix of OS cache and real disk reads. Let's say 90% are hits, 5% are fetches at this 20% penalty, and 5% are real reads where the checksum time is trivial compared to physical disk I/O. That works out to be a real average slowdown of 6%. I think way more deployments are going to be like that case, which matches most of my pgbench runs, than the worst-case workloads. -- Greg Smith, 2ndQuadrant US, g...@2ndquadrant.com, Baltimore, MD. PostgreSQL Training, Services, and 24x7 Support: www.2ndQuadrant.com
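The "memcpy worth of work" framing from the InnoDB writeup is easy to reproduce for any candidate algorithm: time the checksum over a page-sized block and divide by the time to copy the same block. A rough sketch, in Python with zlib's CRC-32 standing in for the candidate (the absolute numbers are machine-dependent; only the ratio is interesting):

```python
import time
import zlib

BLOCK = bytes(range(256)) * 32  # an 8 KB "page"
ITERATIONS = 20000

def cost(fn):
    """Wall-clock time for ITERATIONS calls of fn over the block."""
    start = time.perf_counter()
    for _ in range(ITERATIONS):
        fn(BLOCK)
    return time.perf_counter() - start

copy_time = cost(bytes)       # memcpy-equivalent: duplicate the page
crc_time = cost(zlib.crc32)   # checksum the page

# Checksum cost expressed in "memcpy units", as in the InnoDB numbers above.
print(f"CRC-32 costs {crc_time / copy_time:.2f} memcpy worth of work")
```

A real measurement for this patch would time the proposed Fletcher code in C against memcpy of a BLCKSZ buffer, but the methodology is the same.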
Re: [HACKERS] Enabling Checksums
On 5 March 2013 01:04, Daniel Farina dan...@heroku.com wrote: Corruption has easily occupied more than one person-month of time last year for us. This year to date I've burned two weeks, although admittedly this was probably the result of statistical clustering. Other colleagues of mine have probably put in a week or two in aggregate in this year to date. The ability to quickly, accurately, and maybe at some later date proactively find good backups to run WAL recovery from is one of the biggest strides we can make in the operation of Postgres. The especially ugly cases are where the page header is not corrupt, so full page images can carry along malformed tuples...basically, when the corruption works its way into the WAL, we're in much worse shape. Checksums would hopefully prevent this case, converting them into corrupt pages that will not be modified. It would be better yet if I could write tools to find the last-good version of pages, and so I think tight integration with Postgres will see a lot of benefits that would be quite difficult and non-portable when relying on file system checksumming. You are among the most well-positioned to make assessments of the cost of the feature, but I thought you might appreciate a perspective on the benefits, too. I think they're large, and for me they are the highest pole in the tent for what makes Postgres stressful to operate as-is today. It's a testament to the quality of the programming in Postgres that Postgres programming error is not the largest problem. That's good perspective. I think we all need to be clear that committing this patch also commits the community (via the committer) to significant work and responsibility around this, and my minimum assessment of it is 1 month per year for 3-5 years, much of that on the committer. In effect this will move time and annoyance experienced by users of Postgres back onto developers of Postgres.
That is where it should be, but the effect will be large and easily noticeable, IMHO. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On 04.03.2013 09:11, Simon Riggs wrote: On 3 March 2013 18:24, Greg Smith g...@2ndquadrant.com wrote: The 16-bit checksum feature seems functional, with two sources of overhead. There's some CPU time burned to compute checksums when pages enter the system. And there's extra overhead for WAL logging hint bits. I'll quantify both of those better in another message. It's crunch time. Do you and Jeff believe this patch should be committed to Postgres core? Are there objectors? In addition to my hostility towards this patch in general, there are some specifics in the patch I'd like to raise (read out in a grumpy voice): If you enable checksums, the free space map never gets updated in a standby. It will slowly drift to be completely out of sync with reality, which could lead to significant slowdown and bloat after failover. Since the checksums are an all-or-nothing cluster-wide setting, the three extra flags in the page header, PD_CHECKSUMS1, PD_CHECKSUM2 and PD_HEADERCHECK, are not needed. Let's leave them out. That keeps the code simpler, and leaves the bits free for future use. If we want to enable such a per-page setting in the future, we can add it later. For a per-relation scheme, they're not needed.

+ * The checksum algorithm is a modified Fletcher 64-bit (which is
+ * order-sensitive). The modification is because, at the end, we have two
+ * 64-bit sums, but we only have room for a 16-bit checksum. So, instead of
+ * using a modulus of 2^32 - 1, we use 2^8 - 1; making it also resemble a
+ * Fletcher 16-bit. We don't use Fletcher 16-bit directly, because processing
+ * single bytes at a time is slower.

How does the error detection rate of this compare with e.g. CRC-16? Is there any ill effect from truncating the Fletcher sums like this?

+ /*
+  * Store the sums as bytes in the checksum. We add one to shift the range
+  * from 0..255 to 1..256, to make zero invalid for checksum bytes (which
+  * seems wise).
+  */
+ p8Checksum[0] = (sum1 % 255) + 1;
+ p8Checksum[1] = (sum2 % 255) + 1;

That's a bit odd. We don't avoid zero in the WAL CRC, and I don't recall seeing that in other checksum implementations either. 16 bits is not very wide for a checksum, and this eats about 1% of the space of valid values. I can see that it might be a handy debugging aid to avoid 0. But there's probably no need to avoid 0 in both bytes; it seems enough to avoid a completely zero return value. XLogCheckBuffer() and XLogCheckBufferNeedsBackup() read the page LSN without a lock. That's not atomic, so it could incorrectly determine that a page doesn't need to be backed up. We used to always hold an exclusive lock on the buffer when it's called, which prevents modifications to the LSN, but that's no longer the case. Shouldn't SetBufferCommitInfoNeedsSave() check the BM_PERMANENT flag? I think it will generate WAL records for unlogged tables as it is. - Heikki
Re: [HACKERS] Enabling Checksums
Thank you for the review. On Tue, 2013-03-05 at 11:35 +0200, Heikki Linnakangas wrote: If you enable checksums, the free space map never gets updated in a standby. It will slowly drift to be completely out of sync with reality, which could lead to significant slowdown and bloat after failover. Will investigate. Since the checksums are an all-or-nothing cluster-wide setting, the three extra flags in the page header, PD_CHECKSUMS1, PD_CHECKSUM2 and PD_HEADERCHECK, are not needed. Let's leave them out. That keeps the code simpler, and leaves the bits free for future use. If we want to enable such per-page setting in the future, we can add it later. For a per-relation scheme, they're not needed. They don't really need to be there, I just put them there because it seemed wise if we ever want to allow online enabling/disabling of checksums. But I will remove them. How does the error detection rate of this compare with e.g CRC-16? Is there any ill effect from truncating the Fletcher sums like this? I don't recall if I published these results or not, but I loaded a table, and used pageinspect to get the checksums of the pages. I then did some various GROUP BY queries to see if I could find any clustering or stepping of the checksum values, and I could not. The distribution seemed very uniform across the 255^2 space. I tried to think of other problems, like missing errors in the high or low bits of a word or a page (similar to the issue with mod 256 described below), but I couldn't find any. I'm not enough of an expert to say more than that about the error detection rate. Fletcher is probably significantly faster than CRC-16, because I'm just doing int32 addition in a tight loop. Simon originally chose Fletcher, so perhaps he has more to say. That's a bit odd. We don't avoid zero in the WAL crc, and I don't recall seeing that in other checksum implementations either. 16-bits is not very wide for a checksum, and this eats about 1% of the space of valid values. 
I can see that it might be a handy debugging aid to avoid 0. But there's probably no need to avoid 0 in both bytes, it seems enough to avoid a completely zero return value. http://en.wikipedia.org/wiki/Fletcher%27s_checksum If you look at the section on Fletcher-16, it discusses the choice of the modulus. If we used 256, then an error anywhere except the lowest byte of a 4-byte word read from the page would be missed. Considering that I was using only 255 values anyway, I thought I might as well shift the values away from zero. We could get slightly better by using all combinations. I also considered chopping the 64-bit ints into 16-bit chunks and XORing them together. But when I saw the fact that we avoided zero with the other approach, I kind of liked it, and kept it. XLogCheckBuffer() and XLogCheckBufferNeedsBackup() read the page LSN without a lock. That's not atomic, so it could incorrectly determine that a page doesn't need to be backed up. We used to always hold an exclusive lock on the buffer when it's called, which prevents modifications to the LSN, but that's no longer the case. Will investigate, but it sounds like a buffer header lock will fix it. Shouldn't SetBufferCommitInfoNeedsSave() check the BM_PERMANENT flag? I think it will generate WAL records for unlogged tables as it is. Yes, thank you. Also, in FlushBuffer(), this patch moves the clearing of the BM_JUST_DIRTIED bit to before the WAL flush. That seems to expand the window during which a change to a page will prevent it from being marked clean. Do you see any performance problem with that? The alternative is to take the buffer header lock twice: once to get the LSN, then WAL flush, then another header lock to clear BM_JUST_DIRTIED. Not sure if that's better or worse. This goes back to Simon's patch, so he may have a comment here, as well. I'll post a new patch with these comments addressed, probably tomorrow so that I have some time to self-review and do some basic testing. 
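Jeff's point about the modulus is easy to demonstrate. When 32-bit words are summed, each word's contribution mod 256 is just its lowest byte (the higher bytes are multiples of 256), so with a 256 modulus an error in any other byte vanishes; with 255, every byte contributes, because 256 ≡ 1 (mod 255). A small sketch, in Python rather than the patch's C, using a simplified word-at-a-time Fletcher rather than the patch's exact code:

```python
def fletcher16_variant(data, modulus):
    """Simplified modified Fletcher over 32-bit little-endian words:
    two running 64-bit sums, truncated at the end with the given
    modulus (plus one, as in the patch, to keep zero out of the
    checksum bytes)."""
    sum1 = sum2 = 0
    for i in range(0, len(data), 4):
        word = int.from_bytes(data[i:i + 4], "little")
        sum1 = (sum1 + word) & 0xFFFFFFFFFFFFFFFF
        sum2 = (sum2 + sum1) & 0xFFFFFFFFFFFFFFFF
    return ((sum1 % modulus) + 1, (sum2 % modulus) + 1)

page = bytearray(8192)
page[100] = 0x42                     # some page content
good_255 = fletcher16_variant(page, 255)
good_256 = fletcher16_variant(page, 256)

corrupt = bytearray(page)
corrupt[101] ^= 0x01                 # flip a bit in a word's second byte

assert fletcher16_variant(corrupt, 255) != good_255  # mod 255 catches it
assert fletcher16_variant(corrupt, 256) == good_256  # mod 256 misses it
```

The second assertion is the failure mode the Wikipedia discussion warns about: the flipped bit changes the word by a multiple of 256, which is invisible mod 256 but shifts both sums mod 255.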
Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On 04.03.2013 09:11, Simon Riggs wrote: Are there objectors? FWIW, I still think that checksumming belongs in the filesystem, not PostgreSQL. If you go ahead with this anyway, at the very least I'd like to see some sort of a comparison with e.g. btrfs. How do performance, error-detection rate, and behavior on error compare? Any other metrics that are relevant here? - Heikki
Re: [HACKERS] Enabling Checksums
On Mon, 2013-03-04 at 10:36 +0200, Heikki Linnakangas wrote: On 04.03.2013 09:11, Simon Riggs wrote: Are there objectors? FWIW, I still think that checksumming belongs in the filesystem, not PostgreSQL. Doing checksums in the filesystem has some downsides. One is that you need to use a copy-on-write filesystem like btrfs or zfs, which (by design) will fragment the heap on random writes. If we're going to start pushing people toward those systems, we will probably need to spend some effort to mitigate this problem (aside: my patch to remove PD_ALL_VISIBLE might get some new wind behind it). There are also other issues, like what fraction of our users can freely move to btrfs, and when. If it doesn't happen to be already there, you need root to get it there, which has never been a requirement before. I don't fundamentally disagree. We probably need to perform reasonably well on btrfs in COW mode[1] regardless, because a lot of people will be using it a few years from now. But there are a lot of unknowns here, and I'm concerned about tying checksums to a series of things that will be resolved a few years from now, if ever. [1] Interestingly, you can turn off COW mode on btrfs, but you lose checksums if you do. If you go ahead with this anyway, at the very least I'd like to see some sort of a comparison with e.g btrfs. How do performance, error-detection rate, and behavior on error compare? Any other metrics that are relevant here? I suspect it will be hard to get an apples-to-apples comparison here because of the heap fragmentation, which means that a sequential scan is not so sequential. That may be acceptable for some workloads but not for others, so it would get tricky to compare. And any performance numbers from an experimental filesystem are somewhat suspect anyway. Also, it's a little more challenging to test corruption on a filesystem, because you need to find the location of the file you want to corrupt, and corrupt it out from underneath the filesystem. 
Greg may have more comments on this matter. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On 3/4/13 2:11 AM, Simon Riggs wrote: It's crunch time. Do you and Jeff believe this patch should be committed to Postgres core? I want to see a GUC to allow turning this off, to avoid the problem I saw where a non-critical header corruption problem can cause an entire page to be unreadable. A variation on that capable of turning this off altogether, as Craig suggested, is a good idea too. Those are both simple fixes, and I would be pleased to see this committed at that point. I'll write up a long discussion of filesystem trends and why I think this is more relevant than ever if that's the main objection now. There is no such thing as a stable release of btrfs, and no timetable for when there will be one. I could do some benchmarks of that but I didn't think they were very relevant. Who cares how fast something might run when it may not work correctly? btrfs might as well be /dev/null to me right now--sure it's fast, but maybe the data won't be there at all. How long has it taken the Linux kernel to reach the point where it handles write barriers and fsync correctly? It does not give me a lot of confidence that now is the time they'll suddenly start executing on database filesystem mechanics perfectly. -- Greg Smith, 2ndQuadrant US, g...@2ndquadrant.com, Baltimore, MD. PostgreSQL Training, Services, and 24x7 Support: www.2ndQuadrant.com
Re: [HACKERS] Enabling Checksums
On Mon, 2013-02-25 at 01:30 -0500, Greg Smith wrote: Attached are some bit-rot updates to the checksums patches. The replace-tli one still works fine. I fixed a number of conflicts in the larger patch. The one I've attached here isn't 100% to project standards--I don't have all the context diff tools set up yet, for example. I expect to revise this more now that I've got the whole week cleared to work on CF submissions. Thank you for the rebase. I redid the rebase myself and came up with essentially the same result, but there was an additional problem that needed fixing after the materialized view patch. I will post a new version tonight that includes those fixes as well as something to address these recent comments (probably just another GUC). Further comment in another reply. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On 04.03.2013 20:58, Greg Smith wrote: There is no such thing as a stable release of btrfs, and no timetable for when there will be one. I could do some benchmarks of that but I didn't think they were very relevant. Who cares how fast something might run when it may not work correctly? btrfs might as well be /dev/null to me right now--sure it's fast, but maybe the data won't be there at all. This PostgreSQL patch hasn't seen any production use, either. In fact, I'd consider btrfs to be more mature than this patch. Unless you think that there will be some major changes for the worse in btrfs performance, it's perfectly valid and useful to compare the two. A comparison with ZFS would be nice too. That's mature, and has checksums. - Heikki
Re: [HACKERS] Enabling Checksums
On Mon, 2013-03-04 at 11:52 +0800, Craig Ringer wrote: I also suspect that at least in the first release it might be desirable to have an option that essentially says something's gone horribly wrong and we no longer want to check or write checksums, we want a non-checksummed DB that can still read our data from before we turned checksumming off. Essentially, a way for someone who's trying checksumming in production after their staging tests worked out OK to abort and go back to the non-checksummed case without having to do a full dump and reload. A recovery option to extract data sounds like a good idea, but I don't want to go as far as you are suggesting here. An option to ignore checksum failures (while still printing the warnings) sounds like all we need here. I think Greg's point that the page might be written out again (hiding the corruption) is a very good one, but the same is true for zero_damaged_pages. So we can just still allow the writes to proceed (including setting the checksum on write), and the system should be as available as it would be without checksums. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On Mon, 2013-03-04 at 22:13 +0200, Heikki Linnakangas wrote: On 04.03.2013 20:58, Greg Smith wrote: There is no such thing as a stable release of btrfs, and no timetable for when there will be one. I could do some benchmarks of that but I didn't think they were very relevant. Who cares how fast something might run when it may not work correctly? btrfs might as well be /dev/null to me right now--sure it's fast, but maybe the data won't be there at all. This PostgreSQL patch hasn't seen any production use, either. In fact, I'd consider btrfs to be more mature than this patch. Unless you think that there will be some major changes for the worse in btrfs performance, it's perfectly valid and useful to compare the two. A comparison with ZFS would be nice too. That's mature, and has checksums. Is there any reason why we can't have both postgres and filesystem checksums? The same user might not want both (or might, if neither are entirely trustworthy yet), but I think it's too early to declare one as the right solution and the other not. Even with btrfs stable, I pointed out a number of reasons users might not want it, and reasons that the project should not depend on it. Numbers are always nice, but it takes a lot of effort to come up with them. What kind of numbers are you looking for, and how *specifically* will those numbers affect the decision? If btrfs with checksums is 10% slower than ext4 with postgres checksums, does that mean we should commit the postgres checksums? On the other side of the coin, if btrfs with checksums is exactly the same speed as ext4 with no postgres checksums (i.e. checksums are free if we use btrfs), does that mean postgres checksums should be rejected? Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On 04.03.2013 18:00, Jeff Davis wrote: On Mon, 2013-03-04 at 10:36 +0200, Heikki Linnakangas wrote: On 04.03.2013 09:11, Simon Riggs wrote: Are there objectors? FWIW, I still think that checksumming belongs in the filesystem, not PostgreSQL. Doing checksums in the filesystem has some downsides. One is that you need to use a copy-on-write filesystem like btrfs or zfs, which (by design) will fragment the heap on random writes. Yeah, fragmentation will certainly hurt some workloads. But how badly, and which workloads, and how does that compare with the work that PostgreSQL has to do to maintain the checksums? I'd like to see some data on those things. There are also other issues, like what fraction of our users can freely move to btrfs, and when. If it doesn't happen to be already there, you need root to get it there, which has never been a requirement before. If you're serious enough about your data that you want checksums, you should be able to choose your filesystem. If you go ahead with this anyway, at the very least I'd like to see some sort of a comparison with e.g btrfs. How do performance, error-detection rate, and behavior on error compare? Any other metrics that are relevant here? I suspect it will be hard to get an apples-to-apples comparison here because of the heap fragmentation, which means that a sequential scan is not so sequential. That may be acceptable for some workloads but not for others, so it would get tricky to compare. An apples-to-apples comparison is to run the benchmark and see what happens. If it gets fragmented as hell on btrfs, and performance tanks because of that, then that's your result. If avoiding fragmentation is critical to the workload, then with btrfs you'll want to run the defragmenter in the background to keep it in order, and factor that into the test case. I realize that performance testing is laborious. But we can't skip it and assume that the patch performs fine, because it's hard to benchmark. 
- Heikki
Re: [HACKERS] Enabling Checksums
On Sun, 2013-03-03 at 22:18 -0500, Greg Smith wrote: As for a design of a GUC that might be useful here, the option itself strikes me as being like archive_mode in its general use. There is an element of parameters like wal_sync_method or enable_cassert though, where the options available vary depending on how you built the cluster. Maybe name it checksum_level with options like this:

off: only valid option if you didn't enable checksums with initdb
enforcing: full checksum behavior as written right now
unvalidated: broken checksums on reads are ignored

I think GUCs should be orthogonal to initdb settings. If nothing else, it's extra effort to get initdb to write the right postgresql.conf. A single new GUC that prevents checksum failures from causing an error seems sufficient to address the concerns you, Dan, and Craig raised. We would still calculate the checksum and print the warning, and then pass it through the rest of the header checks. If the header checks pass, then it proceeds. If the header checks fail, and if zero_damaged_pages is off, then it would still generate an error (as today). So: ignore_checksum_failures = on|off ? The main tricky case I see in that is where you read in a page with a busted checksum using unvalidated. Ideally you wouldn't write such a page back out again, because it's going to hide that it's corrupted in some way already. How to enforce that though? Perhaps unvalidated should only be allowed in a read-only transaction? That's a good point. But we already have zero_damaged_pages, which does something similar. And it's supposed to be a recovery option to get the data out rather than something to run in online mode. It will still print the warning, so it won't completely hide the corruption. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On Mon, 2013-03-04 at 13:58 -0500, Greg Smith wrote: On 3/4/13 2:11 AM, Simon Riggs wrote: It's crunch time. Do you and Jeff believe this patch should be committed to Postgres core? I want to see a GUC to allow turning this off, to avoid the problem I saw where a non-critical header corruption problem can cause an entire page to be unreadable. A variation on that capable of turning this off altogether, as Craig suggested, is a good idea too. Based on your comments as well those of Dan and Craig, I am leaning toward a GUC that causes a checksum failure to be ignored. It will still emit the checksum failure warning, but proceed. That will then fall through to the normal header checks we've always had, and the same zero_damaged_pages option. So, to get past a really corrupt page, you'd need to set ignore_checksum_failure and zero_damaged_pages. I'll write up a long discussion of filesystem trends and why I think this is more relevant than ever if that's the main objection now. There is no such thing as a stable release of btrfs, and no timetable for when there will be one. I could do some benchmarks of that but I didn't think they were very relevant. Who cares how fast something might run when it may not work correctly? btrfs might as well be /dev/null to me right now--sure it's fast, but maybe the data won't be there at all. How long has it taken the Linux kernel to reach the point it handles write barriers and fsync correctly? It does not give me a lot of confidence that now is the time they'll suddenly start executing on database filesystem mechanics perfectly. I have a similar viewpoint here. It will take significant effort to come up with anything, and I'm not sure how meaningful the numbers would be. 
Even if btrfs is great, this feature is not mutually exclusive with btrfs:

* users might not have easy access to run the filesystem
* they might not trust it
* they might get poor performance numbers
* postgres checksums might provide a good test of btrfs checksums, and vice-versa, until both are stable

Additionally, I don't like the idea of depending so heavily on what linux is doing. If there are performance problems that affect postgres, will they fix them? Will they introduce new ones? Are there a zillion tunable options that a new user has to get right in order to run postgres efficiently, and will poor settings mean a bunch of "postgres is slow" blog posts? Regards, Jeff Davis
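The recovery combination Jeff describes above would look something like the following postgresql.conf fragment. This is only a sketch using the GUC names proposed in this thread, intended for a one-off data-extraction session rather than normal operation:

```
# Salvage session for a cluster with corrupt pages -- do not leave these on.
ignore_checksum_failure = on   # warn on bad checksums instead of erroring
zero_damaged_pages = on        # last resort: zero pages whose headers fail
```

With both set, a checksum mismatch degrades to a warning, and only pages whose header checks also fail get zeroed, so a dump can proceed past the damage while still logging every page it had to skip or zero.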
Re: [HACKERS] Enabling Checksums
On 3/4/13 10:00 AM, Jeff Davis wrote: On Mon, 2013-03-04 at 10:36 +0200, Heikki Linnakangas wrote: On 04.03.2013 09:11, Simon Riggs wrote: Are there objectors? FWIW, I still think that checksumming belongs in the filesystem, not PostgreSQL. Doing checksums in the filesystem has some downsides. Additionally, no filesystem I'm aware of checksums the data in the filesystem cache. A PG checksum would. I'll also mention that this debate has been had in the past. The time to object to the concept of a checksumming feature was a long time ago, before a ton of development effort went into this... :(
Re: [HACKERS] Enabling Checksums
On 3/4/13 2:48 PM, Jeff Davis wrote: On Mon, 2013-03-04 at 13:58 -0500, Greg Smith wrote: On 3/4/13 2:11 AM, Simon Riggs wrote: It's crunch time. Do you and Jeff believe this patch should be committed to Postgres core? I want to see a GUC to allow turning this off, to avoid the problem I saw where a non-critical header corruption problem can cause an entire page to be unreadable. A variation on that capable of turning this off altogether, as Craig suggested, is a good idea too. Based on your comments as well as those of Dan and Craig, I am leaning toward a GUC that causes a checksum failure to be ignored. It will still emit the checksum failure warning, but proceed. I suggest we paint that GUC along the lines of checksum_failure_log_level, defaulting to ERROR. That way if someone wanted to completely bury the elogs at something like DEBUG, they could. My $2.98 (inflation adjusted).
Re: [HACKERS] Enabling Checksums
On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote: Yeah, fragmentation will certainly hurt some workloads. But how badly, and which workloads, and how does that compare with the work that PostgreSQL has to do to maintain the checksums? I'd like to see some data on those things. I think we all would. Btrfs will be a major filesystem in a few years, and we should be ready to support it. Unfortunately, it's easier said than done. What you're talking about seems like a significant benchmark report that encompasses a lot of workloads. And there's a concern that a lot of it will be invalidated if they are still improving the performance of btrfs. If you're serious enough about your data that you want checksums, you should be able to choose your filesystem. I simply disagree. I am targeting my feature at casual users. They may not have a lot of data or a dedicated DBA, but the data they do have might be very important transactional data. And right now, if they take a backup of their data, it will contain all of the corruption from the original. And since corruption is silent today, then they would probably think the backup is fine, and may delete the previous good backups. An apples-to-apples comparison is to run the benchmark and see what happens. If it gets fragmented as hell on btrfs, and performance tanks because of that, then that's your result. If avoiding fragmentation is critical to the workload, then with btrfs you'll want to run the defragmenter in the background to keep it in order, and factor that into the test case. Again, easier said than done. To get real fragmentation problems, the data set needs to be huge, and we need to reach a steady state of this background defrag process, and a million other things. I realize that performance testing is laborious. But we can't skip it and assume that the patch performs fine, because it's hard to benchmark. You aren't asking me to benchmark the patch in question. 
You are asking me to benchmark a filesystem that very few people actually run postgres on in production. I don't think that's a reasonable requirement. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On 04.03.2013 22:51, Jim Nasby wrote: The time to object to the concept of a checksumming feature was a long time ago, before a ton of development effort went into this... :( I did. Development went ahead anyway. - Heikki
Re: [HACKERS] Enabling Checksums
On 04.03.2013 22:40, Jeff Davis wrote: Is there any reason why we can't have both postgres and filesystem checksums? Of course not. But if we can get away without checksums in Postgres, that's better, because then we don't need to maintain that feature in Postgres. If the patch gets committed, it's not mission accomplished. There will be discussion and need for further development on things like what to do if you get a checksum failure, patches to extend the checksums to cover things like the clog and other non-data files and so forth. And it's an extra complication that will need to be taken into account when developing other new features; in particular, hint bit updates need to write a WAL record. Even if you have all the current hint bits covered, it's an extra hurdle for future patches that might want to have hint bits in e.g. new index access methods. The same user might not want both (or might, if neither are entirely trustworthy yet), but I think it's too early to declare one as the right solution and the other not. Even with btrfs stable, I pointed out a number of reasons users might not want it, and reasons that the project should not depend on it. The PostgreSQL project would not be depending on it, any more than the project depends on filesystem snapshots for backup purposes, or the OS memory manager for caching. Numbers are always nice, but it takes a lot of effort to come up with them. What kind of numbers are you looking for, and how *specifically* will those numbers affect the decision? Benchmark of vanilla PostgreSQL, PostgreSQL + this patch, and PostgreSQL running on btrfs or ZFS with data checksums enabled. DBT-2 might be a good candidate, as it's I/O heavy. That would be a good general test; in addition it would be good to see a benchmark of the worst case scenario for the fragmentation you're expecting to see on btrfs, as well as a worst case scenario for the extra WAL traffic with the patch.
If btrfs with checksums is 10% slower than ext4 with postgres checksums, does that mean we should commit the postgres checksums? In my opinion, a 10% gain would not be worth it, and we should not commit in that case. On the other side of the coin, if btrfs with checksums is exactly the same speed as ext4 with no postgres checksums (i.e. checksums are free if we use btrfs), does that mean postgres checksums should be rejected? Yes, I think so. I'm sure at least some others will disagree; Greg already made it quite clear that he doesn't care how the performance of this compares with btrfs. - Heikki
Re: [HACKERS] Enabling Checksums
On Mon, Mar 04, 2013 at 01:00:09PM -0800, Jeff Davis wrote: On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote: If you're serious enough about your data that you want checksums, you should be able to choose your filesystem. I simply disagree. I am targeting my feature at casual users. They may not have a lot of data or a dedicated DBA, but the data they do have might be very important transactional data. And right now, if they take a backup of their data, it will contain all of the corruption from the original. And since corruption is silent today, then they would probably think the backup is fine, and may delete the previous good backups. +1 There is no reasonable availability of checksum capable filesystems across PostgreSQL's supported OSes. It really needs to be available in core. Regards, Ken
Re: [HACKERS] Enabling Checksums
On 04.03.2013 23:00, Jeff Davis wrote: On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote: Yeah, fragmentation will certainly hurt some workloads. But how badly, and which workloads, and how does that compare with the work that PostgreSQL has to do to maintain the checksums? I'd like to see some data on those things. I think we all would. Btrfs will be a major filesystem in a few years, and we should be ready to support it. Perhaps we should just wait a few years? If we suspect that this becomes obsolete in a few years, it's probably better to just wait, than add a feature we'll have to keep maintaining. Assuming it gets committed today, it's going to take a year or two for 9.3 to get released and all the bugs ironed out, anyway. - Heikki
Re: [HACKERS] Enabling Checksums
On 3/4/13 3:00 PM, Heikki Linnakangas wrote: On 04.03.2013 22:51, Jim Nasby wrote: The time to object to the concept of a checksumming feature was a long time ago, before a ton of development effort went into this... :( I did. Development went ahead anyway. Right, because the community felt that this was valuable enough to do regardless of things like FS checksumming. But now you're bringing the issue up yet again, this time after a large amount of time has been invested. I know that you're doing what you feel is best for the project, but in this case the community didn't agree with your view. Raising the same objection at this point is not productive.
Re: [HACKERS] Enabling Checksums
On 04.03.2013 22:51, Jim Nasby wrote: Additionally, no filesystem I'm aware of checksums the data in the filesystem cache. A PG checksum would. The patch says:

+ * IMPORTANT NOTE -
+ * The checksum is not valid at all times on a data page. We set it before we
+ * flush page/buffer, and implicitly invalidate the checksum when we modify the
+ * page. A heavily accessed buffer might then spend most of its life with an
+ * invalid page checksum, so testing random pages in the buffer pool will tell
+ * you nothing. The reason for this is that the checksum detects otherwise
+ * silent errors caused by the filesystems on which we rely. We do not protect
+ * buffers against uncorrectable memory errors, since these have a very low
+ * measured incidence according to research on large server farms,
+ * http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf, discussed 2010/12/22.

It's still true that it does in fact cover pages in the filesystem cache, but apparently that's not important. - Heikki
Re: [HACKERS] Enabling Checksums
* Heikki Linnakangas (hlinnakan...@vmware.com) wrote: Perhaps we should just wait a few years? If we suspect that this becomes obsolete in a few years, it's probably better to just wait, than add a feature we'll have to keep maintaining. Assuming it gets committed today, it's going to take a year or two for 9.3 to get released and all the bugs ironed out, anyway. For my 2c, I don't see it being obsolete in a few years, even if every existing FS out there gets checksumming (which won't happen, imv). It's quite clear that there is still ongoing development in the filesystem space and any new software will have its own set of bugs. Having a layer of protection built-in to PG will undoubtedly be a good thing and will be used by our users. Thanks, Stephen
Re: [HACKERS] Enabling Checksums
On 03/05/2013 04:48 AM, Jeff Davis wrote: We would still calculate the checksum and print the warning; and then pass it through the rest of the header checks. If the header checks pass, then it proceeds. If the header checks fail, and if zero_damaged_pages is off, then it would still generate an error (as today). So: ignore_checksum_failures = on|off ? That seems reasonable to me. It would be important to document clearly in postgresql.conf and on the docs for the option that enabling this option can launder data corruption, so that blocks that we suspected were damaged are marked clean on rewrite. So long as that's clearly documented I'm personally quite comfortable with your suggestion, since my focus is just making sure I can get a DB back to a fully operational state as quickly as possible when that's necessary. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On 3/4/13 3:13 PM, Heikki Linnakangas wrote: This PostgreSQL patch hasn't seen any production use, either. In fact, I'd consider btrfs to be more mature than this patch. Unless you think that there will be some major changes for the worse in performance in btrfs, it's perfectly valid and useful to compare the two. I think my last message came out with a bit more hostile attitude about this than I intended it to; sorry about that. My problem with this idea comes from looking at the history of how Linux has failed to work properly before. The best example I can point at is the one I documented at http://www.postgresql.org/message-id/4b512d0d.4030...@2ndquadrant.com along with this handy pgbench chart: http://www.phoronix.com/scan.php?page=article&item=ubuntu_lucid_alpha2&num=3 TPS on pgbench dropped from 1102 to about 110 after a kernel bug fix. It was 10X as fast in some kernel versions because fsync wasn't working properly. Kernel filesystem issues have regularly resulted in data not being written to disk when it should have been, inflating the results accordingly. Fake writes due to lying drives, write barriers that only actually work on server-class hardware, write barriers that don't work on md volumes, and then this one; it's a recurring pattern. It's not the fault of the kernel developers, it's a hard problem and drive manufacturers aren't making it easy for them. My concern, then, is that if the comparison target is btrfs performance, how do we know it's working reliably? The track record says that bugs in this area usually inflate results, compared with a correct implementation. You are certainly right that this checksum code is less mature than btrfs; it's just over a year old after all. I feel quite good that it's not benchmarking faster than it really is, especially when I can directly measure how the write volume is increasing in the worst result.
I can't say that btrfs is slower or faster than it will eventually be due to bugs; I can't tell you the right way to tune btrfs for PostgreSQL; and I haven't even had anyone asking the question yet. Right now, the main thing I know about testing performance on Linux kernels new enough to support btrfs is that they're just generally slow running PostgreSQL. See the multiple confirmed regression issues at http://www.postgresql.org/message-id/60b572d9298d944580f7d51195dd30804357fa4...@vmbx125.ihostexchange.net for example. That new kernel mess needs to get sorted out too one day. Why does database performance suck on kernel 3.2? I don't know yet, but it doesn't help me get excited about assuming btrfs results will be useful. ZFS was supposed to save everyone from worrying about corruption issues. That didn't work out, I think due to the commercial agenda behind its development. Now we have btrfs coming in some number of years, a project still tied more than I would like to Oracle. I'm not too optimistic about that one either. It doesn't help that now the original project lead, Chris Mason, has left there and is working at FusionIO--and that company's filesystem plans don't include checksumming, either. (See http://www.fusionio.com/blog/under-the-hood-of-the-iomemory-sdk/ for a quick intro to what they're doing right now, which includes bypassing the Linux filesystem layer with their own flash optimized but POSIX compliant directFS) There is an optimistic future path I can envision where btrfs matures quickly and in a way that performs well for PostgreSQL. Maybe we'll end up there, and if that happens everyone can look back and say this was a stupid idea. But there are a lot of other outcomes I see as possible here, and in all the rest of them having some checksumming capabilities available is a win. One of the areas PostgreSQL has a solid reputation on is being trusted to run as reliably as possible. 
All of the deployment trends I'm seeing have people moving toward less reliable hardware. VMs, cloud systems, regular drives instead of hardware RAID, etc. A lot of people badly want to leave behind the era of the giant database server, and have a lot of replicas running on smaller/cheaper systems instead. There's a useful advocacy win for the project if lower grade hardware can be used to hit a target reliability level, with software picking up some of the error detection job instead. Yes, it costs something in terms of future maintenance on the codebase, as new features almost invariably do. If I didn't see being able to make noise about the improved reliability of PostgreSQL as valuable enough to consider it anyway, I wouldn't even be working on this thing. -- Greg Smith 2ndQuadrant USg...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Enabling Checksums
On 3/4/13 5:20 PM, Craig Ringer wrote: On 03/05/2013 04:48 AM, Jeff Davis wrote: We would still calculate the checksum and print the warning; and then pass it through the rest of the header checks. If the header checks pass, then it proceeds. If the header checks fail, and if zero_damaged_pages is off, then it would still generate an error (as today). So: ignore_checksum_failures = on|off ? That seems reasonable to me. It would be important to document clearly in postgresql.conf and on the docs for the option that enabling this option can launder data corruption, so that blocks that we suspected were damaged are marked clean on rewrite. So long as that's clearly documented I'm personally quite comfortable with your suggestion, since my focus is just making sure I can get a DB back to a fully operational state as quickly as possible when that's necessary. I replied to this somewhere else in the thread when I overlooked Jeff's original post, so sorry for the noise... :( Would it be better to do checksum_logging_level = valid elog levels ? That way someone could set the notification to anything from DEBUG up to PANIC. ISTM the default should be ERROR.
Re: [HACKERS] Enabling Checksums
On 03/05/2013 08:15 AM, Jim Nasby wrote: Would it be better to do checksum_logging_level = valid elog levels ? That way someone could set the notification to anything from DEBUG up to PANIC. ISTM the default should be ERROR. That seems nice at first brush, but I don't think it holds up. All our other log_level parameters control only output. If I saw that parameter, I would think aah, this is how we control the detail and verbosity of messages regarding checksum checking and maintenance. I would be totally astonished if I changed it and it actually affected the system's data integrity checking and enforcement processes. Logging control GUCs control what we show to what clients/log files, not what log statements get executed; they're a filter and don't control the behaviour of the emitting log point. Control over whether checksum failures are an error or merely warned about is reasonable, but I strongly disagree with the idea of making this seem like it's just a logging parameter. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On 3/4/13 6:22 PM, Craig Ringer wrote: On 03/05/2013 08:15 AM, Jim Nasby wrote: Would it be better to do checksum_logging_level = valid elog levels ? That way someone could set the notification to anything from DEBUG up to PANIC. ISTM the default should be ERROR. That seems nice at first brush, but I don't think it holds up. All our other log_level parameters control only output. If I saw that parameter, I would think aah, this is how we control the detail and verbosity of messages regarding checksum checking and maintenance. I would be totally astonished if I changed it and it actually affected the system's data integrity checking and enforcement processes. Logging control GUCs control what we show to what clients/log files, not what log statements get executed; they're a filter and don't control the behaviour of the emitting log point. Control over whether checksum failures are an error or merely warned about is reasonable, but I strongly disagree with the idea of making this seem like it's just a logging parameter. Good point. I thought we actually had precedent for controlling the level that something gets logged at, but now that you mention it I guess we don't. And this could sure as hell cause confusion. So yeah, your original idea sounds best.
Re: [HACKERS] Enabling Checksums
Heikki, Perhaps we should just wait a few years? If we suspect that this becomes obsolete in a few years, it's probably better to just wait, than add a feature we'll have to keep maintaining. Assuming it gets committed today, it's going to take a year or two for 9.3 to get released and all the bugs ironed out, anyway. You are far more optimistic about FS development than I am:

* Windows and OSX are unlikely to ever have usable FS checksums
* BTRFS may be years away from being production-quality for DB servers, and (given the current dev priorities) may *never* be suitable for DB servers.
* For various reasons, many users may stay with other filesystems, even on Linux.
* All filesystems have bugs, and the FS may itself be causing the corruption.
* FS checksums may not catch underlying driver bugs (i.e. better to have two checks than one if you KNOW something is wrong)

We have people who could use PostgreSQL-level checksums *now* because they are having data corruption issues *now* and need a tool to help determine what layer the corruption is occurring at. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Re: [HACKERS] Enabling Checksums
On Mon, Mar 4, 2013 at 1:22 PM, Heikki Linnakangas hlinnakan...@vmware.com wrote: On 04.03.2013 23:00, Jeff Davis wrote: On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote: Yeah, fragmentation will certainly hurt some workloads. But how badly, and which workloads, and how does that compare with the work that PostgreSQL has to do to maintain the checksums? I'd like to see some data on those things. I think we all would. Btrfs will be a major filesystem in a few years, and we should be ready to support it. Perhaps we should just wait a few years? If we suspect that this becomes obsolete in a few years, it's probably better to just wait, than add a feature we'll have to keep maintaining. Assuming it gets committed today, it's going to take a year or two for 9.3 to get released and all the bugs ironed out, anyway. Putting aside the not-so-rosy predictions seen elsewhere in this thread about the availability of a high performance, reliable checksumming file system available on common platforms, I'd like to express what benefit this feature will have to me: Corruption has easily occupied more than one person-month of time last year for us. This year to date I've burned two weeks, although admittedly this was probably the result of statistical clustering. Other colleagues of mine have probably put in a week or two in aggregate in this year to date. The ability to quickly, accurately, and maybe at some later date proactively find good backups to run WAL recovery from is one of the biggest strides we can make in the operation of Postgres. The especially ugly cases are where the page header is not corrupt, so full page images can carry along malformed tuples...basically, when the corruption works its way into the WAL, we're in much worse shape. Checksums would hopefully prevent this case, converting them into corrupt pages that will not be modified.
It would be better yet if I could write tools to find the last-good version of pages, and so I think tight integration with Postgres will see a lot of benefits that would be quite difficult and non-portable when relying on file system checksumming. You are among the most well-positioned to make assessments of the cost of the feature, but I thought you might appreciate a perspective of the benefits, too. I think they're large, and for me they are the highest pole in the tent for what makes Postgres stressful to operate as-is today. It's a testament to the quality of the programming in Postgres that Postgres programming error is not the largest problem. For sense of reference, I think the next largest operational problem is the disruption caused by logical backups, e.g. pg_dump, and in particular its long running transactions and sessions.
Re: [HACKERS] Enabling Checksums
On Mon, 2013-03-04 at 14:57 -0600, Jim Nasby wrote: I suggest we paint that GUC along the lines of checksum_failure_log_level, defaulting to ERROR. That way if someone wanted to completely bury the elogs to like DEBUG they could. The reason I didn't want to do that is because it's essentially a recovery feature. A boolean seems more appropriate than a slider. That's a good point about burying the messages with DEBUG, but I think it might be slightly over-engineering it. I am willing to change it if others want it, though. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On Mon, 2013-03-04 at 23:22 +0200, Heikki Linnakangas wrote: On 04.03.2013 23:00, Jeff Davis wrote: On Mon, 2013-03-04 at 22:27 +0200, Heikki Linnakangas wrote: Yeah, fragmentation will certainly hurt some workloads. But how badly, and which workloads, and how does that compare with the work that PostgreSQL has to do to maintain the checksums? I'd like to see some data on those things. I think we all would. Btrfs will be a major filesystem in a few years, and we should be ready to support it. Perhaps we should just wait a few years? If we suspect that this becomes obsolete in a few years I do not expect it to be obsolete, even if btrfs is stable and fast today. Consider this hypothetical scenario: what if btrfs performs acceptably well today, but they tune it away from our needs later and it tanks performance? Then, when we complain, the btrfs people say for DB workloads, you should turn off COW, or use ext4 or XFS. And then we say but we want checksums. And then they tell us that real databases do their own checksums. Then what? I don't think that scenario is very outlandish. Postgres is essentially a COW system (for tuples), and stacking COW on top of COW does not seem like a good idea (neither for filesystems nor actual cows). So it may be within reason for the filesystem folks to say we're doing the wrong thing, and then checksums are our problem again. Additionally, I don't have a lot of faith that linux will address all of our btrfs complaints (even legitimate ones) in a reasonable amount of time, if ever. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On Mon, 2013-03-04 at 23:11 +0200, Heikki Linnakangas wrote: Of course not. But if we can get away without checksums in Postgres, that's better, because then we don't need to maintain that feature in Postgres. If the patch gets committed, it's not mission accomplished. There will be discussion and need for further development on things like what to do if you get a checksum failure, patches to extend the checksums to cover things like the clog and other non-data files and so forth. And it's an extra complication that will need to be taken into account when developing other new features; in particular, hint bit updates need to write a WAL record. Even if you have all the current hint bits covered, it's an extra hurdle for future patches that might want to have hint bits in e.g new index access methods. The example you chose of adding a hint bit is a little overstated -- as far as I can tell, setting a hint bit follows pretty much the same pattern as before, except that I renamed the function to MarkBufferDirtyHint(). But I agree in general. If complexity can be removed or avoided, that is a very good thing. But right now, we have no answer to a real problem that other databases do have an answer for. To me, the benefit is worth the cost. We aren't going down an irreversible path by adding checksums. If every platform has a good checksumming filesystem and there is no demand for the postgres code any more, we can deprecate it and remove it. But at least users would have something between now and then. The PostgreSQL project would not be depending on it, any more than the project depends on filesystem snapshots for backup purposes, or the OS memory manager for caching. I don't understand your analogies at all. We have WAL-protected base backups so that users can get a consistent snapshot without filesystem snapshots. To follow the analogy, we want postgres checksums so that the user can be protected without filesystem checksums. 
I would agree with you if we could point users somewhere and actually recommend something and say what you're doing now is wrong, do X instead (though if there is only one such X, we are dependent on it). But even if we fast forward to three years from now: if someone shows up saying that XFS gives him the best performance, but wants checksums, will we really be able to say you are wrong to be using XFS; use Btrfs? One of the things I like about postgres is that we don't push a lot of hard trade-offs on users. Several people (including you) put in effort recently to support unlogged gist indexes. Are there some huge number of users there that can't live without unlogged gist indexes? Probably not. But that is one less thing that potential users have to trade away, and one less thing to be confused or frustrated about. I want to get to the point where checksums are the default, and only advanced users would disable them. If that point comes in the form of checksumming filesystems that are fast enough and enabled by default on most of the platforms we support, that's fine with me. But I'm not very sure that it will happen that way ever, and certainly not soon. If btrfs with checksums is 10% slower than ext4 with postgres checksums, does that mean we should commit the postgres checksums? In my opinion, a 10% gain would not be worth it, and we should not commit in that case. On the other side of the coin, if btrfs with checksums is exactly the same speed as ext4 with no postgres checksums (i.e. checksums are free if we use btrfs), does that mean postgres checksums should be rejected? Yes, I think so. I'm sure at least some others will disagree; Greg already made it quite clear that he doesn't care how the performance of this compares with btrfs. If all paths lead to rejection, what are these tests supposed to accomplish, exactly? 
Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On Sun, 2013-03-03 at 18:05 -0500, Greg Smith wrote: = Test 1 - find worst-case overhead for the checksum calculation on write = This can hit 25% of runtime when you isolate it out. I'm not sure if the way I'm running this multiple times makes sense yet. This one is so much slower on my Mac that I can barely see a change at all. = Test 2 - worst-case overhead for calculating checksum while reading data = Jeff saw an 18% slowdown, I get 24 to 32%. This one bothers me because the hit is going to happen during the very common situation where data is shuffling a lot between a larger OS cache and shared_buffers taking a relatively small fraction. If that issue were cracked, such that shared_buffers could be 50% of RAM, I think the typical real-world impact of this would be easier to take. I believe that test 1 and test 2 can be improved a little, if there is a need. Right now we copy the page and then calculate the checksum on the copy. If we instead calculate as we're copying, I believe it will make it significantly faster. I decided against doing that, because it decreased the readability, and we can always do that later as an optimization. That should mitigate the case you have in mind, which is a very legitimate concern. I'll wait for someone to ask for it, though. = Test 3 - worst-case WAL overhead = This is the really nasty one. The 10,000,000 rows touched by the SELECT statement here create no WAL in a non-checksum environment. When checksums are on, 368,513,656 bytes of WAL are written, so about 37 bytes per row. Yeah, nothing we can do about this. Right now the whole hint bit mechanism and its overhead are treated as an internal detail that isn't in the regular documentation. I think committing this sort of checksum patch will require exposing some of the implementation to the user in the documentation, so people can understand what the trouble cases are--either in advance or when trying to puzzle out why they're hitting one of them.
Any particular sections that you think would be good to update? Thank you for the test results. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On 03/02/2013 12:48 AM, Daniel Farina wrote: On Sun, Feb 24, 2013 at 10:30 PM, Greg Smith g...@2ndquadrant.com wrote: Attached are some bit-rot updates to the checksums patches. The replace-tli one still works fine I rather badly want this feature, and if the open issues with the patch have hit zero, I'm thinking about applying it, shipping it, and turning it on. Given that the heap format has not changed, the main affordance I may check for is if I can work in backwards compatibility (while not maintaining the checksums, of course) in case of an emergency. Did you get a chance to see whether you can run it in checksum-validation-and-update-off backward compatible mode? This seems like an important thing to have working (and tested for) in case of bugs, performance issues or other unforeseen circumstances. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
And here's an updated version of the checksum corruption testing wrapper script. This includes an additional safety check that you've set PGDATA to a location that can be erased. Presumably no one else would like to accidentally do this: rm -rf /* Like I just did. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com check-check.sh Description: Bourne shell script
Re: [HACKERS] Enabling Checksums
On 12/19/12 6:30 PM, Jeff Davis wrote: I ran a few tests. Test 1 - find worst-case overhead for the checksum calculation on write: Test 2 - worst-case overhead for calculating checksum while reading data Test 3 - worst-case WAL overhead What I've done is wrap all of these tests into a shell script that runs them 3 times each, with and without checksums. That includes steps like the spots where Jeff found a sync helped improve repeatability. I ran these manually before and didn't notice enough of a difference to argue with any of his results at the time. Having them packaged up usefully means I can try some additional platforms too, and other testers should be easily able to take a crack at it. On the last one, in addition to runtime I directly measure how many bytes of WAL are written. It's 0 in the usual case, where the hint bit changes triggered by the first SELECT * FROM foo don't generate any WAL. Detailed results with both my and Jeff's numbers are in the attached spreadsheet. I did my tests on a Mac writing to SSD, to try and get some variety in the test platforms. The main difference there is that Test 1 is much slower on my system, enough so that the slowdown isn't as pronounced. Remember, these are a set of tests designed to magnify the worst case here. I don't feel any of these results make the feature uncommittable. The numbers I'm getting are not significantly different from the ones Jeff posted back in December, and those were acceptable to some of the early adopter candidates I've been surveying informally. These numbers are amplifying overhead without doing much in the way of real disk I/O, which can easily be a lot more expensive than any of this. I do think there needs to be a bit more documentation of the potential downsides to checksumming written though, since they are pretty hefty in some situations. I'm going to get some pgbench results next, to try and put this into a more realistic context too. 
The numbers for this round break down like this: = Test 1 - find worst-case overhead for the checksum calculation on write = This can hit 25% of runtime when you isolate it out. I'm not sure whether the way I'm running this multiple times makes sense yet. This one is so much slower on my Mac that I can barely see a change at all. = Test 2 - worst-case overhead for calculating checksum while reading data = Jeff saw an 18% slowdown, I get 24 to 32%. This one bothers me because the hit is going to happen during the very common situation where data is shuffling a lot between a larger OS cache and shared_buffers taking a relatively small fraction. If that issue were cracked, such that shared_buffers could be 50% of RAM, I think the typical real-world impact of this would be easier to take. = Test 3 - worst-case WAL overhead = This is the really nasty one. The 10,000,000 rows touched by the SELECT statement here create no WAL in a non-checksum environment. When checksums are on, 368,513,656 bytes of WAL are written, so about 37 bytes per row. Jeff saw this increase runtime by 135%, going from 1000ms to 2350ms. My multiple runs are jumping around in a way I also don't trust fully yet. But the first and best of the ones I'm seeing goes from 1660ms to 4013ms, which is a 140% increase. The others are even worse. I suspect I'm filling a cache that isn't cleared before the second and third run are over. I'll know for sure when I switch back to Linux. The really nasty case I can see making people really cranky is where someone has fsync on, a slowly rotating drive, and then discovers this slowing read statements. There's already a decent share of "why is it writing when I do SELECT *?" complaints around the block I/O, which is fully asynchronous in a lot of cases. Right now the whole hint bit mechanism and its overhead are treated as an internal detail that isn't in the regular documentation.
I think committing this sort of checksum patch will require exposing some of the implementation to the user in the documentation, so people can understand what the trouble cases are--either in advance or when trying to puzzle out why they're hitting one of them. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

ChecksumPerfResults.xls Description: MS-Excel spreadsheet

# Configuration for testing checksum performance.
#
# Primary goals:
# - Reasonable buffer size and checkpoint parameters
# - Prevent interference from the bgwriter or autovacuum.
# - Turn off fsync so that it's measuring the calculation
#   overhead, not the effort of actually writing to disk.
#
autovacuum = off
fsync = off
bgwriter_lru_maxpages = 0
shared_buffers = 1024MB
checkpoint_segments = 64

perf-checksum.sh Description: Bourne shell script

--
-- Convert hex value to a decimal one. It's possible to do this using
-- undocumented features of the bit type, such as:
--
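The Test 3 figures quoted above can be sanity-checked with a little arithmetic (a quick sketch; every constant below is taken directly from the numbers reported in this thread):

```python
# Test 3: SELECT over a freshly loaded table that generates hint-bit-only
# WAL when checksums are on. Values are from Greg's and Jeff's reports.
wal_bytes = 368_513_656        # WAL written with checksums enabled
rows = 10_000_000              # rows touched by the SELECT

bytes_per_row = wal_bytes / rows
print(round(bytes_per_row))    # "about 37 bytes per row", as stated

# Greg's best run went from 1660 ms (checksums off) to 4013 ms (on).
increase_pct = (4013 - 1660) / 1660 * 100
print(round(increase_pct))     # roughly 142%, consistent with "a 140% increase"
```

The same arithmetic on Jeff's run (1000 ms to 2350 ms) gives the 135% increase he reported, so the two sets of measurements line up.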
Re: [HACKERS] Enabling Checksums
On 3/3/13 9:22 AM, Craig Ringer wrote: Did you get a chance to see whether you can run it in checksum-validation-and-update-off backward compatible mode? This seems like an important thing to have working (and tested for) in case of bugs, performance issues or other unforeseen circumstances. There isn't any way to do this in the current code. The big simplification Jeff introduced here, to narrow complexity toward a commit candidate, was to make checksumming a cluster-level decision. You get it for everything or not at all. The problem I posted about earlier today, where a header checksum error can block access to the entire relation, could be resolved with some sort of "ignore read checksums" GUC. But that's impractical right now for the write side of things. There has been a long list of metadata proposals to handle situations where part of a cluster is checksummed, but not all of it. Once that sort of feature is implemented, it becomes a lot easier to talk about selectively disabling writes. As for a design of a GUC that might be useful here, the option itself strikes me as being like archive_mode in its general use. There is an element of parameters like wal_sync_method or enable_cassert though, where the options available vary depending on how you built the cluster. Maybe name it checksum_level with options like this:

off: the only valid option if you didn't enable checksums with initdb.
enforcing: full checksum behavior as written right now.
unvalidated: broken checksums on reads are ignored.

The main tricky case I see in that is where you read in a page with a busted checksum using unvalidated. Ideally you wouldn't write such a page back out again, because it's going to hide that it's corrupted in some way already. How to enforce that though? Perhaps unvalidated should only be allowed in a read-only transaction?
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
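The checksum_level proposal above can be sketched as read-path pseudocode. This is an illustration only, not code from the patch: the function names are invented here, and the checksum itself is a stand-in for the truncated Fletcher sum the patch actually uses.

```python
# Toy model of the proposed checksum_level GUC on the read path.
# "off"         : cluster was initdb'd without checksums; nothing to verify.
# "enforcing"   : a bad checksum is an error (the current patch behavior).
# "unvalidated" : broken checksums on reads are ignored (disaster recovery).

def page_checksum(page: bytes) -> int:
    # Stand-in checksum for illustration; not the real algorithm.
    return sum(page) & 0xFFFF

def read_page(page: bytes, stored_checksum: int, checksum_level: str) -> bytes:
    if checksum_level == "off":
        return page
    if page_checksum(page) != stored_checksum:
        if checksum_level == "enforcing":
            raise IOError("invalid page checksum")
        # "unvalidated": hand the page back even though the checksum is broken
    return page
```

The tricky case Greg raises shows up naturally in this model: in "unvalidated" mode a page with a bad checksum is returned as if nothing were wrong, so any later write of it would launder the corruption, which is why restricting that mode to read-only transactions is attractive.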
Re: [HACKERS] Enabling Checksums
On 03/04/2013 11:18 AM, Greg Smith wrote: On 3/3/13 9:22 AM, Craig Ringer wrote: Did you get a chance to see whether you can run it in checksum-validation-and-update-off backward compatible mode? This seems like an important thing to have working (and tested for) in case of bugs, performance issues or other unforeseen circumstances. There isn't any way to do this in the current code. The big simplification Jeff introduced here, to narrow complexity toward a commit candidate, was to make checksumming a cluster-level decision. You get it for everything or not at all. The problem I posted about earlier today, where a header checksum error can block access to the entire relation, could be resolved with some sort of "ignore read checksums" GUC. But that's impractical right now for the write side of things. There has been a long list of metadata proposals to handle situations where part of a cluster is checksummed, but not all of it. Once that sort of feature is implemented, it becomes a lot easier to talk about selectively disabling writes. As for a design of a GUC that might be useful here, the option itself strikes me as being like archive_mode in its general use. There is an element of parameters like wal_sync_method or enable_cassert though, where the options available vary depending on how you built the cluster. Maybe name it checksum_level with options like this:

off: the only valid option if you didn't enable checksums with initdb.
enforcing: full checksum behavior as written right now.
unvalidated: broken checksums on reads are ignored.

The main tricky case I see in that is where you read in a page with a busted checksum using unvalidated. Ideally you wouldn't write such a page back out again, because it's going to hide that it's corrupted in some way already. How to enforce that though? Perhaps unvalidated should only be allowed in a read-only transaction? That sounds like a really good step for disaster recovery, yes.
I also suspect that at least in the first release it might be desirable to have an option that essentially says "something's gone horribly wrong and we no longer want to check or write checksums; we want a non-checksummed DB that can still read our data from before we turned checksumming off." Essentially, a way for someone who's trying checksumming in production after their staging tests worked out OK to abort and go back to the non-checksummed case without having to do a full dump and reload. Given that, I suspect we need a 4th state, like disabled or unvalidating_writable, where we ignore checksums completely and maintain the checksum-enabled layout but just write padding to the checksum fields and don't bother to check them on reading. My key concern boils down to being able to get someone up and running quickly and with minimal disruption if something we didn't think of goes wrong. "Oh, you have to dump and reload your 1TB database before you can start writing to it again" isn't going to cut it. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On 3/3/13 10:52 PM, Craig Ringer wrote: I also suspect that at least in the first release it might be desirable to have an option that essentially says something's gone horribly wrong and we no longer want to check or write checksums, we want a non-checksummed DB that can still read our data from before we turned checksumming off. I see that as being something that involves disabling the cluster-wide flag that turns checksumming on, the one that is reported by pg_controldata. I think it would have to be a one-way, system down kind of change, which I think is fair given the ugly (but feasible) situation you're describing. It would need to be something stronger than a GUC. Once you start writing out pages without checksums, you're back into the fuzzy state where some pages have them, others don't, and there's no good way to deal with that yet. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Enabling Checksums
On 03/04/2013 12:19 PM, Greg Smith wrote: On 3/3/13 10:52 PM, Craig Ringer wrote: I also suspect that at least in the first release it might be desirable to have an option that essentially says something's gone horribly wrong and we no longer want to check or write checksums, we want a non-checksummed DB that can still read our data from before we turned checksumming off. I see that as being something that involves disabling the cluster-wide flag that turns checksumming on, the one that is reported by pg_controldata. I think it would have to be a one-way, system down kind of change, which I think is fair given the ugly (but feasible) situation you're describing. It would need to be something stronger than a GUC. Once you start writing out pages without checksums, you're back into the fuzzy state where some pages have them, others don't, and there's no good way to deal with that yet. Agreed, I was envisioning a one-way process where re-enabling checksums would involve a re-initdb and reload. A DB restart seems perfectly reasonable; it's just a full dump and reload before they can get running again that I feel must be avoided. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On 3 March 2013 18:24, Greg Smith g...@2ndquadrant.com wrote: The 16-bit checksum feature seems functional, with two sources of overhead. There's some CPU time burned to compute checksums when pages enter the system. And there's extra overhead for WAL logging hint bits. I'll quantify both of those better in another message. It's crunch time. Do you and Jeff believe this patch should be committed to Postgres core? Are there objectors? -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On Sun, Feb 24, 2013 at 10:30 PM, Greg Smith g...@2ndquadrant.com wrote: Attached are some bit-rot updates to the checksums patches. The replace-tli one still works fine. I rather badly want this feature, and if the open issues with the patch have hit zero, I'm thinking about applying it, shipping it, and turning it on. Given that the heap format has not changed, the main affordance I may check for is whether I can work in backwards compatibility (while not maintaining the checksums, of course) in case of an emergency. -- fdr
Re: [HACKERS] Enabling Checksums
On Sun, Jan 27, 2013 at 5:28 PM, Jeff Davis pg...@j-davis.com wrote: There's a maximum of one FPI per page per cycle, and we need the FPI for any modified page in this design regardless. So, deferring the XLOG_HINT WAL record doesn't change the total number of FPIs emitted. The only savings would be on the trivial XLOG_HINT wal record itself, because we might notice that it's not necessary in the case where some other WAL action happened to the page. OK, I see. So the case where this really hurts is where a page is updated for hint bits only and then not touched again for the remainder of the checkpoint cycle. At first glance, it seems sound as long as the WAL FPI makes it to disk before the data. But to meet that requirement, it seems like we'd need to write an FPI and then immediately flush WAL before cleaning a page, and that doesn't seem like a win. Do you (or Simon) see an opportunity here that I'm missing? I am not sure that isn't a win. After all, we may need to flush WAL before flushing a buffer anyway, so this is just adding another case - Right, but if we get the WAL record in earlier, there is a greater chance that it goes out with some unrelated WAL flush, and we don't need to flush the WAL to clean the buffer at all. Separating WAL insertions from WAL flushes seems like a fairly important goal, so I'm a little skeptical of a proposal to narrow that gap so drastically. It's hard to analyze without a specific proposal on the table. But if cleaning pages requires a WAL record followed immediately by a flush, it seems like that would increase the number of actual WAL flushes we need to do by a lot. Yeah, maybe. I think Simon had a good argument for not pursuing this route, anyway. and the payoff is that if the initial access to a page, setting hint bits, is quickly followed by a write operation, we avoid the need for any extra WAL to cover the hint bit change.
I bet that's common, because when updating you'll usually need to look at the tuples on the page and decide whether they are visible to your scan before, say, updating one of them. That's a good point, I'm just not sure how to avoid that problem without a lot of complexity or a big cost. It seems like we want to defer the XLOG_HINT WAL record for a short time; but not wait so long that we need to clean the buffer or miss a chance to piggyback on another WAL flush. By the way, the approach I took was to add the heap buffer to the WAL chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible. It seemed simpler to understand than trying to add a bunch of options to MarkBufferDirty. Unless I am mistaken, that's going to heavily penalize the case where the user vacuums an insert-only table. It will emit much more WAL than currently. Yes, that's true, but I think that's pretty fundamental to this checksums design (and of course it only applies if checksums are enabled). We need to make sure an FPI is written and the LSN bumped before we write a page. That's why I was pushing a little on various proposals to either remove or mitigate the impact of hint bits (load path, remove PD_ALL_VISIBLE, cut down on the less-important hint bits, etc.). Maybe those aren't viable, but that's why I spent time on them. There are some other options, but I cringe a little bit thinking about them. One is to simply exclude the PD_ALL_VISIBLE bit from the checksum calculation, so that a torn page doesn't cause a problem (though obviously that one bit would be vulnerable to corruption). Another is to use a double-write buffer, but that didn't seem to go very far. Or, we could abandon the whole thing and tell people to use ZFS/btrfs/NAS/SAN. I am inclined to think that we shouldn't do any of this stuff for now. I think it's OK if the first version of checksums is not-that-flexible and/or not-that-performant. We can optimize for those things later.
Trying to monkey with this at the same time we're trying to get checksums in risks diverting focus from getting checksums done at all, and risks introducing new data corruption bugs. We have a reputation of long standing for getting it right first and then getting it to perform well later, so it shouldn't be a total shock if we take that approach here, too. I see no reason to think that the performance problems must be solved up front or not at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Enabling Checksums
On 25 January 2013 20:29, Robert Haas robertmh...@gmail.com wrote: The checksums patch also introduces another behavior into SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record if checksums are enabled (to avoid torn page hazards). That's only necessary for changes where the caller does not write WAL itself and doesn't bump the LSN of the data page. (There's a reason the caller can't easily write the XLOG_HINT WAL itself.) So, we could introduce another flag needsWAL that would control whether we write the XLOG_HINT WAL or not (only applies with checksums on, of course). I thought Simon had the idea, at some stage, of writing a WAL record to cover hint-bit changes only at the time we *write* the buffer and only if no FPI had already been emitted that checkpoint cycle. I'm not sure whether that approach was sound, but if so it seems more efficient than this approach. The requirement is that we ensure that a FPI is written to WAL before any changes to the block are made. The patch does that by inserting an XLOG_HINT_WAL record when we set a hint. The insert is a no-op if we've already written the FPI in this checkpoint cycle and we don't even reach there except when dirtying a clean data block. If we attempted to defer the FPI last thing before write, we'd need to cope with the case that writes at checkpoint occur after the logical start of the checkpoint, and also with the overhead of additional writes at checkpoint time. I don't see any advantage in deferring the FPI, but I do see disadvantage in complicating this. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
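The mechanism Simon describes, at most one full-page image (FPI) per page per checkpoint cycle, with the hint-bit WAL record degrading to a no-op once the FPI exists, can be sketched as a toy model (an illustration only; class and method names are invented here, and real checkpoints are more involved than clearing a set):

```python
# Toy model of the XLOG_HINT decision: a hint-bit change on a clean page
# needs WAL protection against torn pages only if no FPI has been written
# for that page yet in the current checkpoint cycle.

class CheckpointCycle:
    def __init__(self):
        self.fpi_written = set()   # pages already covered by an FPI this cycle

    def set_hint_bit(self, page_id) -> str:
        """Return what WAL the hint-bit change must emit."""
        if page_id in self.fpi_written:
            return "none"          # record would be a no-op; skip it
        self.fpi_written.add(page_id)
        return "fpi"               # XLOG_HINT record carrying a full-page image

    def checkpoint(self):
        self.fpi_written.clear()   # new cycle: every page needs an FPI again
```

The worst case in the thread's Test 3 falls straight out of this model: a scan that sets hint bits on millions of previously untouched pages takes the "fpi" branch once per page, which is where the roughly 37 bytes of WAL per row comes from.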
Re: [HACKERS] Enabling Checksums
On Sun, Jan 27, 2013 at 3:50 AM, Simon Riggs si...@2ndquadrant.com wrote: If we attempted to defer the FPI last thing before write, we'd need to cope with the case that writes at checkpoint occur after the logical start of the checkpoint, and also with the overhead of additional writes at checkpoint time. Oh, good point. That's surely a good reason not to do it that way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Enabling Checksums
On Sat, 2013-01-26 at 23:23 -0500, Robert Haas wrote: If we were to try to defer writing the WAL until the page was being written, the most it would possibly save is the small XLOG_HINT WAL record; it would not save any FPIs. How is the XLOG_HINT_WAL record kept small and why does it not itself require an FPI? There's a maximum of one FPI per page per cycle, and we need the FPI for any modified page in this design regardless. So, deferring the XLOG_HINT WAL record doesn't change the total number of FPIs emitted. The only savings would be on the trivial XLOG_HINT wal record itself, because we might notice that it's not necessary in the case where some other WAL action happened to the page. At first glance, it seems sound as long as the WAL FPI makes it to disk before the data. But to meet that requirement, it seems like we'd need to write an FPI and then immediately flush WAL before cleaning a page, and that doesn't seem like a win. Do you (or Simon) see an opportunity here that I'm missing? I am not sure that isn't a win. After all, we may need to flush WAL before flushing a buffer anyway, so this is just adding another case - Right, but if we get the WAL record in earlier, there is a greater chance that it goes out with some unrelated WAL flush, and we don't need to flush the WAL to clean the buffer at all. Separating WAL insertions from WAL flushes seems like a fairly important goal, so I'm a little skeptical of a proposal to narrow that gap so drastically. It's hard to analyze without a specific proposal on the table. But if cleaning pages requires a WAL record followed immediately by a flush, it seems like that would increase the number of actual WAL flushes we need to do by a lot. and the payoff is that if the initial access to a page, setting hint bits, is quickly followed by a write operation, we avoid the need for any extra WAL to cover the hint bit change.
I bet that's common, because when updating you'll usually need to look at the tuples on the page and decide whether they are visible to your scan before, say, updating one of them. That's a good point, I'm just not sure how to avoid that problem without a lot of complexity or a big cost. It seems like we want to defer the XLOG_HINT WAL record for a short time; but not wait so long that we need to clean the buffer or miss a chance to piggyback on another WAL flush. By the way, the approach I took was to add the heap buffer to the WAL chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible. It seemed simpler to understand than trying to add a bunch of options to MarkBufferDirty. Unless I am mistaken, that's going to heavily penalize the case where the user vacuums an insert-only table. It will emit much more WAL than currently. Yes, that's true, but I think that's pretty fundamental to this checksums design (and of course it only applies if checksums are enabled). We need to make sure an FPI is written and the LSN bumped before we write a page. That's why I was pushing a little on various proposals to either remove or mitigate the impact of hint bits (load path, remove PD_ALL_VISIBLE, cut down on the less-important hint bits, etc.). Maybe those aren't viable, but that's why I spent time on them. There are some other options, but I cringe a little bit thinking about them. One is to simply exclude the PD_ALL_VISIBLE bit from the checksum calculation, so that a torn page doesn't cause a problem (though obviously that one bit would be vulnerable to corruption). Another is to use a double-write buffer, but that didn't seem to go very far. Or, we could abandon the whole thing and tell people to use ZFS/btrfs/NAS/SAN. Regards, Jeff Davis
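The first option Jeff mentions, excluding the PD_ALL_VISIBLE bit from the checksum so a torn hint-only write cannot produce a spurious checksum failure, amounts to masking that bit before summing. A rough sketch of the idea (the byte offset, bit mask, and checksum here are stand-ins for illustration, not the real page layout or algorithm):

```python
# Hypothetical offset/mask for the excluded flag; not the real page header.
PD_ALL_VISIBLE_BYTE = 20
PD_ALL_VISIBLE_MASK = 0x04

def checksum_excluding_all_visible(page: bytes) -> int:
    # Zero the excluded bit before checksumming, so flipping it on disk
    # (e.g. via a torn hint-only write) cannot invalidate the checksum.
    masked = bytearray(page)
    masked[PD_ALL_VISIBLE_BYTE] &= ~PD_ALL_VISIBLE_MASK & 0xFF
    return sum(masked) & 0xFFFF  # stand-in for the real checksum algorithm

page = bytearray(8192)
before = checksum_excluding_all_visible(bytes(page))
page[PD_ALL_VISIBLE_BYTE] |= PD_ALL_VISIBLE_MASK   # torn write flips the bit
after = checksum_excluding_all_visible(bytes(page))
assert before == after   # the excluded bit does not affect the checksum
```

This makes the trade-off Jeff notes concrete: the checksum stays valid across a torn update of that one bit, but by the same token corruption of that bit is invisible to the checksum.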
Re: [HACKERS] Enabling Checksums
On Fri, Jan 25, 2013 at 9:35 PM, Jeff Davis pg...@j-davis.com wrote: On Fri, 2013-01-25 at 15:29 -0500, Robert Haas wrote: I thought Simon had the idea, at some stage, of writing a WAL record to cover hint-bit changes only at the time we *write* the buffer and only if no FPI had already been emitted that checkpoint cycle. I'm not sure whether that approach was sound, but if so it seems more efficient than this approach. My patch is based on his original idea; although I've made quite a lot of changes, I believe that I have stuck to his same basic design w.r.t. WAL. This patch does not cause a new FPI to be emitted if one has already been emitted this cycle. It also does not emit a WAL record at all if an FPI has already been emitted. If we were to try to defer writing the WAL until the page was being written, the most it would possibly save is the small XLOG_HINT WAL record; it would not save any FPIs. How is the XLOG_HINT_WAL record kept small and why does it not itself require an FPI? At first glance, it seems sound as long as the WAL FPI makes it to disk before the data. But to meet that requirement, it seems like we'd need to write an FPI and then immediately flush WAL before cleaning a page, and that doesn't seem like a win. Do you (or Simon) see an opportunity here that I'm missing? I am not sure that isn't a win. After all, we may need to flush WAL before flushing a buffer anyway, so this is just adding another case - and the payoff is that if the initial access to a page, setting hint bits, is quickly followed by a write operation, we avoid the need for any extra WAL to cover the hint bit change. I bet that's common, because when updating you'll usually need to look at the tuples on the page and decide whether they are visible to your scan before, say, updating one of them. By the way, the approach I took was to add the heap buffer to the WAL chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible.
It seemed simpler to understand than trying to add a bunch of options to MarkBufferDirty. Unless I am mistaken, that's going to heavily penalize the case where the user vacuums an insert-only table. It will emit much more WAL than currently. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Enabling Checksums
On Thu, Jan 10, 2013 at 1:06 AM, Jeff Davis pg...@j-davis.com wrote: On Tue, 2012-12-04 at 01:03 -0800, Jeff Davis wrote: For now, I rebased the patches against master, and did some very minor cleanup. I think there is a problem here when setting PD_ALL_VISIBLE. I thought I had analyzed that before, but upon review, it doesn't look right. Setting PD_ALL_VISIBLE needs to be associated with a WAL action somehow, and a bumping of the LSN, otherwise there is a torn page hazard. The solution doesn't seem particularly difficult, but there are a few strange aspects and I'm not sure exactly which path I should take. First of all, the relationship between MarkBufferDirty and SetBufferCommitInfoNeedsSave is a little confusing. The comment over MarkBufferDirty is confusing because it says that the caller must have an exclusive lock, or else bad data could be written. But that doesn't have to do with marking the buffer dirty, that has to do with the data page change you make while you are marking it dirty -- if it's a single bit change, then there is no risk that I can see. In the current code, the only real code difference between the two is that SetBufferCommitInfoNeedsSave might fail to mark the buffer dirty if there is a race. So, in the current code, we could actually combine the two by passing a force flag (if true, behaves like MarkBufferDirty, if false, behaves like SetBufferCommitInfoNeedsSave). The checksums patch also introduces another behavior into SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record if checksums are enabled (to avoid torn page hazards). That's only necessary for changes where the caller does not write WAL itself and doesn't bump the LSN of the data page. (There's a reason the caller can't easily write the XLOG_HINT WAL itself.) So, we could introduce another flag needsWAL that would control whether we write the XLOG_HINT WAL or not (only applies with checksums on, of course). 
I thought Simon had the idea, at some stage, of writing a WAL record to cover hint-bit changes only at the time we *write* the buffer and only if no FPI had already been emitted that checkpoint cycle. I'm not sure whether that approach was sound, but if so it seems more efficient than this approach. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Enabling Checksums
On Fri, 2013-01-25 at 15:29 -0500, Robert Haas wrote: I thought Simon had the idea, at some stage, of writing a WAL record to cover hint-bit changes only at the time we *write* the buffer and only if no FPI had already been emitted that checkpoint cycle. I'm not sure whether that approach was sound, but if so it seems more efficient than this approach. My patch is based on his original idea; although I've made quite a lot of changes, I believe that I have stuck to his same basic design w.r.t. WAL. This patch does not cause a new FPI to be emitted if one has already been emitted this cycle. It also does not emit a WAL record at all if an FPI has already been emitted. If we were to try to defer writing the WAL until the page was being written, the most it would possibly save is the small XLOG_HINT WAL record; it would not save any FPIs. At first glance, it seems sound as long as the WAL FPI makes it to disk before the data. But to meet that requirement, it seems like we'd need to write an FPI and then immediately flush WAL before cleaning a page, and that doesn't seem like a win. Do you (or Simon) see an opportunity here that I'm missing? By the way, the approach I took was to add the heap buffer to the WAL chain of the XLOG_HEAP2_VISIBLE wal record when doing log_heap_visible. It seemed simpler to understand than trying to add a bunch of options to MarkBufferDirty. Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On Wed, 2013-01-16 at 17:38 -0800, Jeff Davis wrote: New version of checksums patch. And another new version of both patches. Changes: * Rebased. * Rename SetBufferCommitInfoNeedsSave to MarkBufferDirtyHint. Now that it's being used in more places, it makes sense to give it a more generic name. * My colleague, Yingjie He, noticed that the FSM doesn't write any WAL, and therefore we must protect those operations against torn pages. That seems simple enough: just use MarkBufferDirtyHint (formerly SetBufferCommitInfoNeedsSave) instead of MarkBufferDirty. The FSM changes are not critical, so the fact that we may lose the dirty bit is OK. Regards, Jeff Davis replace-tli-with-checksums-20130124.patch.gz Description: GNU Zip compressed data checksums-20130124.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] Enabling Checksums
New version of checksums patch. Changes: * rebased * removed two duplicate lines; apparently the result of a bad merge * Added heap page to WAL chain when logging an XLOG_HEAP2_VISIBLE to avoid torn page issues updating PD_ALL_VISIBLE. This is the most significant change. * minor comment cleanup No open issues that I'm aware of with the patch itself. Greg appears to have made some progress on the automated corruption tester. Note to reviewers: I also have a patch out to remove PD_ALL_VISIBLE entirely. The code around PD_ALL_VISIBLE is quite tricky (with or without this patch), so if the PD_ALL_VISIBLE patch is committed first then it will make reviewing this patch easier. Regardless, the second patch to be committed will need to be rebased on top of the first. Regards, Jeff Davis replace-tli-with-checksums-20130116.patch.gz Description: GNU Zip compressed data checksums-20130116.patch.gz Description: GNU Zip compressed data
Re: [HACKERS] Enabling Checksums
On Tue, 2013-01-15 at 19:36 -0500, Greg Smith wrote: First rev of a simple corruption program is attached, in very C-ish Python. Great. Did you verify that my patch works as you expect at least in the simple case? The parameters I settled on are to accept a relation name, byte offset, byte value, and what sort of operation to do: overwrite, AND, OR, XOR. I like XOR here because you can fix it just by running the program again. Oh, good idea. Regards, Jeff Davis
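The property that makes XOR convenient here is that it is its own inverse: applying the same mask twice restores the original byte. A two-line Python illustration:

```python
# XOR-based corruption can be undone by running the same operation again,
# because (x ^ m) ^ m == x for any byte x and mask m.
original = 0x80
mask = 0x01
corrupted = original ^ mask       # 0x81 -- flips the low bit
restored = corrupted ^ mask       # 0x80 -- running it again undoes it
assert restored == original
print(hex(corrupted), hex(restored))  # 0x81 0x80
```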
Re: [HACKERS] Enabling Checksums
First rev of a simple corruption program is attached, in very C-ish Python. The parameters I settled on are to accept a relation name, byte offset, byte value, and what sort of operation to do: overwrite, AND, OR, XOR. I like XOR here because you can fix it just by running the program again. Rewriting this in C would not be terribly difficult, and most of the time spent on this version was figuring out what to do. This follows Jeff's idea that the most subtle corruption is the hardest to spot, so testing should aim at the smallest unit of change. If you can spot a one bit error in an unused byte of a page, presumably that will catch large errors like a byte swap. I find some grim amusement that the checksum performance testing I've been trying to do got stuck behind a problem with a tiny, hard to detect single bit of corruption. Here's pgbench_accounts being corrupted, the next to last byte on this line: $ pgbench -i -s 1 $ ./pg_corrupt pgbench_accounts show Reading byte 0 within file /usr/local/var/postgres/base/16384/25242 Current byte= 0 / $00 $ hexdump /usr/local/var/postgres/base/16384/25242 | head 000 00 00 00 00 00 00 00 00 00 00 04 00 0c 01 80 01 ... $ ./pg_corrupt pgbench_accounts 14 1 /usr/local/var/postgres base/16384/25242 8192 13434880 1640 Reading byte 14 within file /usr/local/var/postgres/base/16384/25242 Current byte= 128 / $80 Modified byte= 129 / $81 File modified successfully $ hexdump /usr/local/var/postgres/base/16384/25242 | head 000 00 00 00 00 00 00 00 00 00 00 04 00 0c 01 81 01 That doesn't impact selecting all of the rows: $ psql -c "select count(*) from pgbench_accounts" count 10 And pg_dump works fine against the table too. Tweaking this byte looks like a reasonable first test case for seeing if checksums can catch an error that query execution doesn't. Next I'm going to test the functional part of the latest checksum patch; duplicate Jeff's targeted performance tests; and then run some of my own.
I wanted to get this little tool circulating now that it's useful first. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

#!/usr/bin/env python
#
# pg_corrupt
#
# Read in a byte of a PostgreSQL relation (a table or index) and allow writing
# it back with an altered value.
#
# Greg Smith g...@2ndquadrant.com
# Copyright (c) 2013, Heroku, Inc.
# Released under The PostgreSQL Licence
#

from os.path import join
from subprocess import Popen, PIPE
import sys
import psycopg2

class Operations:
    SHOW = 0
    WRITE = 1
    AND = 2
    OR = 3
    XOR = 4
    text = ['show', 'write', 'and', 'or', 'xor']

def controldata_blocks_per_segment(pgdata):
    blocks_per_seg = 131072
    try:
        # TODO This doesn't work when called in an Emacs compile shell
        out, err = Popen("pg_controldata %s" % pgdata, stdout=PIPE,
                         shell=True).communicate()
        control_data = out.splitlines()
        for c in control_data:
            if c.startswith("Blocks per segment of large relation:"):
                blocks_per_seg = int(c.split(":")[1])
    except:
        print "Cannot determine blocks per segment, using default of", blocks_per_seg
    return blocks_per_seg

def get_table_info(conn, relation):
    cur = conn.cursor()
    q = "SELECT \
           current_setting('data_directory') AS data_directory, \
           pg_relation_filepath(oid), \
           current_setting('block_size') AS block_size, \
           pg_relation_size(oid), \
           relpages \
         FROM pg_class \
         WHERE relname='%s'" % relation
    cur.execute(q)
    if cur.rowcount != 1:
        print "Error: did not return 1 row from pg_class lookup of %s" % relation
        return None
    table_info = {}
    for i in cur:
        table_info['relation'] = relation
        table_info['pgdata'] = i[0]
        table_info['filepath'] = i[1]
        table_info['block_size'] = int(i[2])
        table_info['relation_size'] = i[3]
        table_info['relpages'] = i[4]
        table_info['base_file_name'] = join(i[0], i[1])
    table_info['blocks_per_seg'] = controldata_blocks_per_segment(table_info['pgdata'])
    table_info['bytes_per_seg'] = table_info['block_size'] * table_info['blocks_per_seg']
    cur.close()
    return table_info

def operate(table_info, byte_offset, operation, value):
    if byte_offset > table_info['relation_size']:
        print "Error: trying to change byte %s but relation %s is only %s bytes" % \
            (byte_offset, table_info['relation'], table_info['relation_size'])
        return
    if byte_offset < 0:
        print "Error: cannot use negative byte offsets"
        return
    file_name = table_info['base_file_name']
    file_seq = int(byte_offset / table_info['bytes_per_seg'])
    if file_seq > 0:
        file_name = file_name + ".%s" % file_seq
        file_offset = byte_offset - file_seq * table_info['bytes_per_seg']
    else:
        file_offset = byte_offset
    print "Reading byte", file_offset, "within
Re: [HACKERS] Enabling Checksums
On 12/19/12 6:30 PM, Jeff Davis wrote: The idea is to prevent interference from the bgwriter or autovacuum. Also, I turn off fsync so that it's measuring the calculation overhead, not the effort of actually writing to disk. With my test server issues sorted, what I did was set up a single 7200RPM drive with a battery-backed write cache card. That way fsync doesn't bottleneck things. And I too realized that limit had to be cracked before anything useful could be done. Having the BBWC card is a bit better than fsync=off, because we'll get something more like the production workload out of it. I/O will be realistic, but limited to what one drive can pull off. Without checksums, it takes about 1000ms. With checksums, about 2350ms. I also tested with checksums but without the CHECKPOINT commands above, and it was also 1000ms. I think we need to use lower checkpoint_segments to try and trigger more checkpoints. My 10 minute pgbench-tools runs will normally have at most 3 checkpoints. I would think something like 10 would be more useful, to make sure we're spending enough time seeing extra WAL writes. This test is more plausible than the other two, so it's more likely to be a real problem. So, the biggest cost of checksums is, by far, the extra full-page images in WAL, which matches our expectations. What I've done with pgbench-tools is actually measure the amount of WAL from the start to the end of the test run. To analyze it you need to scale it a bit; computing WAL bytes / commit seems to work. pgbench-tools also launches vmstat and iostat in a way that it's possible to graph the values later. The interesting results I'm seeing are when the disk is about 80% busy and when it's 100% busy.
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Enabling Checksums
On 10 January 2013 06:06, Jeff Davis pg...@j-davis.com wrote: The checksums patch also introduces another behavior into SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record if checksums are enabled (to avoid torn page hazards). That's only necessary for changes where the caller does not write WAL itself and doesn't bump the LSN of the data page. (There's a reason the caller can't easily write the XLOG_HINT WAL itself.) So, we could introduce another flag needsWAL that would control whether we write the XLOG_HINT WAL or not (only applies with checksums on, of course). That wouldn't work because it can't know the exact answer to that, but the way the patch does this is already correct. XLOG_HINT doesn't always write a WAL record, it only does it when necessary. See XLogInsert(). Didn't fully understand other comments. Do we need an answer now? My head is somewhere else. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
The checksums patch also introduces another behavior into SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record if checksums are enabled (to avoid torn page hazards). That's only necessary for changes where the caller does not write WAL itself and doesn't bump the LSN of the data page. (There's a reason the caller can't easily write the XLOG_HINT WAL itself.) So, we could introduce another flag needsWAL that would control whether we write the XLOG_HINT WAL or not (only applies with checksums on, of course). That wouldn't work because it can't know the exact answer to that, but the way the patch does this is already correct. The name I chose was poor, but the flag should mean the caller does not write WAL associated with this action. If that flag is true, and if checksums are enabled, then it would do an XLogInsert, which may write WAL (depending on the LSN check). That part of the patch is correct currently, but the problem is with updates to PD_ALL_VISIBLE. Let me try to explain again: Calls to PageSetAllVisible are not directly associated with a WAL action, but they are associated with a call to MarkBufferDirty and do have an exclusive content lock on the buffer. There's a torn page hazard there for checksums, because without any WAL action associated with the data page, there is no backup page. One idea might be to use SetBufferCommitInfoNeedsSave (which will write WAL if necessary) instead of MarkBufferDirty. But that is unsafe, because it might not actually mark the buffer dirty due to a race (documented in SetBufferCommitInfoNeedsSave). So that's why I wanted to refactor MarkBufferDirty/SetBufferCommitInfoNeedsSave, to separate the concept that it may need a WAL record from the concept that actually dirtying the page is optional. Another idea is to make the WAL action for visibilitymap_set have another item in the chain pointing to the heap buffer, and bump the heap LSN. 
Regards, Jeff Davis
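The torn-page hazard Jeff describes can be made concrete with a small self-contained Python demonstration. It uses zlib.crc32 as a stand-in checksum and a made-up 4-byte checksum field at the front of the page; this is not PostgreSQL's page format or algorithm. A hint-style change plus a freshly stored checksum, followed by a write where only the first sector reaches disk, leaves a page whose contents no longer match its checksum, even though the logical change was a single bit:

```python
import zlib

PAGE = 8192
SECTOR = 4096

def set_checksum(page: bytearray) -> None:
    page[0:4] = (0).to_bytes(4, "little")   # zero the (hypothetical) checksum field
    csum = zlib.crc32(bytes(page))          # illustrative algorithm, not pg's
    page[0:4] = csum.to_bytes(4, "little")

def verify(page) -> bool:
    stored = int.from_bytes(page[0:4], "little")
    work = bytearray(page)
    work[0:4] = (0).to_bytes(4, "little")
    return zlib.crc32(bytes(work)) == stored

old = bytearray(PAGE)
set_checksum(old)
assert verify(old)

new = bytearray(old)
new[5000] |= 0x01            # a single hint-bit-style change in the second half
set_checksum(new)

# A torn write: only the first sector of the new image reaches disk, so the
# page carries the new checksum but the old second half.
torn = new[:SECTOR] + bytes(old[SECTOR:])
print(verify(new))           # True
print(verify(torn))          # False -- a benign one-bit change now reads as corruption
```

This is why a WAL-logged full-page image (or bumping the page LSN under WAL cover) is needed before such pages can be cleaned.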
Re: [HACKERS] Enabling Checksums
On Tue, 2012-12-04 at 01:03 -0800, Jeff Davis wrote: For now, I rebased the patches against master, and did some very minor cleanup. I think there is a problem here when setting PD_ALL_VISIBLE. I thought I had analyzed that before, but upon review, it doesn't look right. Setting PD_ALL_VISIBLE needs to be associated with a WAL action somehow, and a bumping of the LSN, otherwise there is a torn page hazard. The solution doesn't seem particularly difficult, but there are a few strange aspects and I'm not sure exactly which path I should take. First of all, the relationship between MarkBufferDirty and SetBufferCommitInfoNeedsSave is a little confusing. The comment over MarkBufferDirty is confusing because it says that the caller must have an exclusive lock, or else bad data could be written. But that doesn't have to do with marking the buffer dirty, that has to do with the data page change you make while you are marking it dirty -- if it's a single bit change, then there is no risk that I can see. In the current code, the only real code difference between the two is that SetBufferCommitInfoNeedsSave might fail to mark the buffer dirty if there is a race. So, in the current code, we could actually combine the two by passing a force flag (if true, behaves like MarkBufferDirty, if false, behaves like SetBufferCommitInfoNeedsSave). The checksums patch also introduces another behavior into SetBufferCommitInfoNeedsSave, which is to write an XLOG_HINT WAL record if checksums are enabled (to avoid torn page hazards). That's only necessary for changes where the caller does not write WAL itself and doesn't bump the LSN of the data page. (There's a reason the caller can't easily write the XLOG_HINT WAL itself.) So, we could introduce another flag needsWAL that would control whether we write the XLOG_HINT WAL or not (only applies with checksums on, of course). 
The reason for all of this is that the setting of PD_ALL_VISIBLE does not fit MarkBufferDirty, because MarkBufferDirty does not write the XLOG_HINT WAL and neither does the caller. But it also doesn't fit SetBufferCommitInfoNeedsSave, because that is subject to a race. If MarkBufferDirty had the signature: MarkBufferDirty(Buffer buffer, bool force, bool needsWAL) then normal page changes would look like: MarkBufferDirty(buffer, true, false) setting PD_ALL_VISIBLE would look like: MarkBufferDirty(buffer, true, true) and setting a hint would look like: MarkBufferDirty(buffer, false, true) Another approach would be for the caller who sets PD_ALL_VISIBLE to write WAL. But that requires inventing a new WAL record or chaining the heap block onto the WAL entry when doing visibilitymap_set (only necessary when checksums are on). That seems somewhat of a hack, but perhaps it's not too bad. Also, I have another patch posted that is removing PD_ALL_VISIBLE entirely, which is dampening my enthusiasm to do too much work that might be thrown away. So right now, I'm leaning toward just adding the heap buffer to the WAL chain during visibilitymap_set. Thoughts? Regards, Jeff Davis
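The proposed two-flag signature can be modelled in a few lines of Python. This is a toy model of the call patterns Jeff lists above; `mark_buffer_dirty`, `buf`, and the returned action strings are all invented, and the "maybe-XLOG_HINT" action stands in for an XLogInsert call that still applies its own LSN check:

```python
# Toy dispatch model: `force` decides whether dirtying may be skipped on
# a race; `needs_wal` decides whether an XLOG_HINT record may be needed
# (only when checksums are on).
def mark_buffer_dirty(buf, force, needs_wal, checksums_on=True):
    actions = []
    if needs_wal and checksums_on:
        actions.append("maybe-XLOG_HINT")
    if force or not buf.get("race", False):
        buf["dirty"] = True
        actions.append("dirtied")
    return actions

buf = {}
print(mark_buffer_dirty(buf, True, False))   # normal page change
print(mark_buffer_dirty(buf, True, True))    # setting PD_ALL_VISIBLE
print(mark_buffer_dirty(buf, False, True))   # setting a hint bit (dirty may be lost on a race)
```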
Re: [HACKERS] Enabling Checksums
On Tue, Dec 18, 2012 at 04:06:02AM -0500, Greg Smith wrote: On 12/18/12 3:17 AM, Simon Riggs wrote: Clearly part of the response could involve pg_dump on the damaged structure, at some point. This is the main thing I wanted to try out more, once I have a decent corruption generation tool. If you've corrupted a single record but can still pg_dump the remainder, that seems the best we can do to help people recover from that. Providing some documentation on how to figure out what rows are in that block, presumably by using the contrib inspection tools, would be helpful too. FWIW, Postgres is pretty resilient against corruption. I've maintained a postgres db on a server with bad memory (don't ask) and since most scrambling was in text strings you just got funny output sometimes. The most common failure was a memory allocation failure as postgres tried to copy a datum whose length field was corrupted. If things went really wonky you could identify the bad tuples by hand and then delete them by ctid. Regular reindexing helped too. All I'm saying is that a mode where you log a warning but proceed anyway is useful. It won't pin down the exact error, but it will tell you where to look and help find the non-obvious corruption (so you can possibly fix it by hand). Have a nice day, -- Martijn van Oosterhout klep...@svana.org http://svana.org/kleptog/ He who writes carelessly confesses thereby at the very outset that he does not attach much importance to his own thoughts. -- Arthur Schopenhauer
Re: [HACKERS] Enabling Checksums
On Tue, 2012-12-04 at 01:03 -0800, Jeff Davis wrote: 4. We need some general performance testing to show whether this is insane or not. I ran a few tests. Test 1 - find worst-case overhead for the checksum calculation on write: fsync = off bgwriter_lru_maxpages = 0 shared_buffers = 1024MB checkpoint_segments = 64 autovacuum = off The idea is to prevent interference from the bgwriter or autovacuum. Also, I turn off fsync so that it's measuring the calculation overhead, not the effort of actually writing to disk. drop table foo; create table foo(i int, j int) with (fillfactor=50); create index foo_idx on foo(i); insert into foo select g%25, -1 from generate_series(1,1000) g; checkpoint; -- during the following sleep, issue an OS sync -- to make test results more consistent select pg_sleep(30); \timing on update foo set j=-1 where i = 0; select pg_sleep(2); checkpoint; update foo set j=-1 where i = 0; select pg_sleep(2); checkpoint; update foo set j=-1 where i = 0; select pg_sleep(2); checkpoint; \timing off I am measuring the time of the CHECKPOINT command, not the update. The update is just to dirty all of the pages (they should all be HOT updates). Without checksums, it takes about 400ms. With checksums, it takes about 500ms. That overhead is quite low, considering that the bottleneck is almost always somewhere else (like actually writing to disk). Test 2 - worst-case overhead for calculating checksum while reading data Same configuration as above. This time, just load a big table: drop table foo; create table foo(i int, j int) with (fillfactor=50); insert into foo select g%25, -1 from generate_series(1,1000) g; -- make sure hint bits and PD_ALL_VISIBLE are set everywhere select count(*) from foo; vacuum; vacuum; vacuum; select relfilenode from pg_class where relname='foo'; Then shut down the server and restart it. Then do a cat data/base/12055/* > /dev/null to get the table loaded into the OS buffer cache.
Then do: \timing on SELECT COUNT(*) FROM foo; So, shared buffers are cold, but OS cache is warm. This should test the overhead of going from the OS to shared buffers, which requires the checksum calculation. Without checksums is around 820ms; with checksums around 970ms. Again, this is quite reasonable, because I would expect the bottleneck to be reading from the disk rather than the calculation itself. Test 3 - worst-case WAL overhead For this test, I also left fsync off, because I didn't want to test the effort to flush WAL (which shouldn't really be required for this test, anyway). This was simpler: drop table foo; create table foo(i int, j int) with (fillfactor=50); insert into foo select g%25, -1 from generate_series(1,1000) g; checkpoint; select pg_sleep(1); checkpoint; select pg_sleep(30); -- do an OS sync while this is running \timing on SELECT COUNT(*) FROM foo; Without checksums, it takes about 1000ms. With checksums, about 2350ms. I also tested with checksums but without the CHECKPOINT commands above, and it was also 1000ms. This test is more plausible than the other two, so it's more likely to be a real problem. So, the biggest cost of checksums is, by far, the extra full-page images in WAL, which matches our expectations. Regards, Jeff Davis
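For readers following the timing numbers: the per-page calculation being measured is a tight two-accumulator loop. The sketch below is a plain Fletcher-16 in Python, for illustration only; it is not the algorithm in the patch (which truncates a Fletcher-64 differently), but it shows the shape of the per-byte work, along with a known Fletcher property that an all-zeros page and an all-ones page hash identically under mod-255 sums:

```python
# Illustrative Fletcher-16-style checksum: two running sums folded into
# one 16-bit value. Not the patch's actual algorithm.
def fletcher16ish(data: bytes) -> int:
    sum1 = sum2 = 0
    for b in data:
        sum1 = (sum1 + b) % 255
        sum2 = (sum2 + sum1) % 255
    return (sum2 << 8) | sum1

page = bytes(range(256)) * 32            # a fake 8 kB page
print(hex(fletcher16ish(page)))

# The mod-255 sums cannot tell 0x00 from 0xFF, so an all-zeros page and an
# all-ones page collide:
print(fletcher16ish(bytes(8192)) == fletcher16ish(b"\xff" * 8192))  # True
```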
Re: [HACKERS] Enabling Checksums
On 18 December 2012 02:21, Jeff Davis pg...@j-davis.com wrote: On Mon, 2012-12-17 at 19:14 +, Simon Riggs wrote: We'll need a way of expressing some form of corruption tolerance. zero_damaged_pages is just insane, The main problem I see with zero_damaged_pages is that it could potentially write out the zero page, thereby really losing your data if it wasn't already lost. (Of course, we document that you should have a backup first, but it's still dangerous). I assume that this is the same problem you are talking about. I think we should discuss whether we accept my premise? Checksums will actually detect more errors than we see now, and people will want to do something about that. Returning to backup is one way of handling it, but on a busy production system with pressure on, there is incentive to implement a workaround, not a fix. It's not an easy call to say we've got 3 corrupt blocks, so I'm going to take the whole system offline while I restore from backup. If you do restore from backup, and the backup also contains the 3 corrupt blocks, what then? Clearly part of the response could involve pg_dump on the damaged structure, at some point. I suppose we could have a new ReadBufferMaybe function that would only be used by a sequential scan; and then just skip over the page if it's corrupt, depending on a GUC. That would at least allow sequential scans to (partially) work, which might be good enough for some data recovery situations. If a catalog index is corrupted, that could just be rebuilt. Haven't thought about the details, though. Not sure if you're being facetious here or not. Mild reworking of the logic for heap page access could cope with a NULL buffer response and subsequent looping, which would allow us to run pg_dump against a damaged table to allow data to be saved, keeping file intact for further analysis. I'm suggesting we work a little harder than your block is corrupt and give some thought to what the user will do next. 
Indexes are a good case, because we can/should report the block error, mark the index as invalid and then hint that it should be rebuilt. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On 12/18/12 3:17 AM, Simon Riggs wrote: Clearly part of the response could involve pg_dump on the damaged structure, at some point. This is the main thing I wanted to try out more, once I have a decent corruption generation tool. If you've corrupted a single record but can still pg_dump the remainder, that seems the best we can do to help people recover from that. Providing some documentation on how to figure out what rows are in that block, presumably by using the contrib inspection tools, would be helpful too. Indexes are a good case, because we can/should report the block error, mark the index as invalid and then hint that it should be rebuilt. Marking a whole index invalid because there's one bad entry has enough downsides that I'm not sure how much we'd want to automate that. Not having that index available could easily result in an effectively down system due to low performance. The choices are uglier if it's backing a unique constraint. In general, what I hope people will be able to do is switch over to their standby server, and then investigate further. I think it's unlikely that people willing to pay for block checksums will only have one server. Having some way to nail down if the same block is bad on a given standby seems like a useful interface we should offer, and it shouldn't take too much work. Ideally you won't find the same corruption there. I'd like a way to check the entirety of a standby for checksum issues, ideally run right after it becomes current. It seems the most likely way to see corruption on one of those is to replicate a corrupt block. There is no good way to make the poor soul who has no standby server happy here. You're just choosing between bad alternatives. The first block error is often just that--the first one, to be joined by others soon afterward. My experience at how drives fail says the second error is a lot more likely after you've seen one. 
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Enabling Checksums
Greg Smith wrote: In general, what I hope people will be able to do is switch over to their standby server, and then investigate further. I think it's unlikely that people willing to pay for block checksums will only have one server. Having some way to nail down if the same block is bad on a given standby seems like a useful interface we should offer, and it shouldn't take too much work. Ideally you won't find the same corruption there. I'd like a way to check the entirety of a standby for checksum issues, ideally run right after it becomes current. It seems the most likely way to see corruption on one of those is to replicate a corrupt block. There is no good way to make the poor soul who has no standby server happy here. You're just choosing between bad alternatives. The first block error is often just that--the first one, to be joined by others soon afterward. My experience at how drives fail says the second error is a lot more likely after you've seen one. +1 on all of that. -Kevin
Re: [HACKERS] Enabling Checksums
There is no good way to make the poor soul who has no standby server happy here. You're just choosing between bad alternatives. The first block error is often just that--the first one, to be joined by others soon afterward. My experience at how drives fail says the second error is a lot more likely after you've seen one. For what it's worth Oracle allows you to recover a specific block from backups including replaying the archive logs for that one block. -- greg
Re: [HACKERS] Enabling Checksums
On Tue, 2012-12-18 at 08:17 +, Simon Riggs wrote: I think we should discuss whether we accept my premise? Checksums will actually detect more errors than we see now, and people will want to do something about that. Returning to backup is one way of handling it, but on a busy production system with pressure on, there is incentive to implement a workaround, not a fix. It's not an easy call to say we've got 3 corrupt blocks, so I'm going to take the whole system offline while I restore from backup. Up until now, my assumption has generally been that, upon finding the corruption, the primary course of action is taking that server down (hopefully you have a good replica), and do some kind of restore or sync a new replica. It sounds like you are exploring other possibilities. I suppose we could have a new ReadBufferMaybe function that would only be used by a sequential scan; and then just skip over the page if it's corrupt, depending on a GUC. That would at least allow sequential scans to (partially) work, which might be good enough for some data recovery situations. If a catalog index is corrupted, that could just be rebuilt. Haven't thought about the details, though. Not sure if you're being facetious here or not. No. It was an incomplete thought (as I said), but sincere. Mild reworking of the logic for heap page access could cope with a NULL buffer response and subsequent looping, which would allow us to run pg_dump against a damaged table to allow data to be saved, keeping file intact for further analysis. Right. I'm suggesting we work a little harder than your block is corrupt and give some thought to what the user will do next. Indexes are a good case, because we can/should report the block error, mark the index as invalid and then hint that it should be rebuilt. Agreed; this applies to any derived data. I don't think it will be very practical to keep a server running in this state forever, but it might give enough time to reach a suitable maintenance window. 
Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On Tue, 2012-12-18 at 04:06 -0500, Greg Smith wrote: Having some way to nail down if the same block is bad on a given standby seems like a useful interface we should offer, and it shouldn't take too much work. Ideally you won't find the same corruption there. I'd like a way to check the entirety of a standby for checksum issues, ideally run right after it becomes current. It seems the most likely way to see corruption on one of those is to replicate a corrupt block. Part of the design is that pg_basebackup would verify checksums during replication, so we should not replicate corrupt blocks (of course, that's not implemented yet, so it's still a concern for now). And we can also have ways to do background/offline checksum verification with a separate utility. Regards, Jeff Davis
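Such an offline verification utility could be quite small. A minimal Python sketch, under stated assumptions (a hypothetical page layout with a 4-byte little-endian checksum at the front of each 8 kB block, and zlib.crc32 standing in for the real algorithm; not PostgreSQL's on-disk format):

```python
import zlib

BLCKSZ = 8192

def page_checksum(page: bytes) -> int:
    work = bytearray(page)
    work[0:4] = b"\x00\x00\x00\x00"          # mask the checksum field itself
    return zlib.crc32(bytes(work))

def verify_file(path):
    """Walk a relation file page by page; return block numbers that fail."""
    bad = []
    with open(path, "rb") as f:
        blkno = 0
        while True:
            page = f.read(BLCKSZ)
            if not page:
                break
            stored = int.from_bytes(page[0:4], "little")
            if page_checksum(page) != stored:
                bad.append(blkno)
            blkno += 1
    return bad

# Demo: a two-page file where the second page has single-bit corruption.
import tempfile, os
good = bytearray(BLCKSZ)
good[0:4] = page_checksum(bytes(good)).to_bytes(4, "little")
bad = bytearray(good)
bad[100] ^= 0x01
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(good) + bytes(bad))
    path = f.name
print(verify_file(path))                     # [1]
os.unlink(path)
```

Run against a base backup or a stopped standby's data files, this is the "separate utility" shape; doing it against a live cluster would additionally have to cope with pages being rewritten mid-read.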
Re: [HACKERS] Enabling Checksums
Jeff Davis pg...@j-davis.com writes: -A relation name -Corruption type (an entry from this list) -How many blocks to touch I'll just loop based on the count, randomly selecting a block each time and messing with it in that way. For the messing with it part, did you consider zzuf? http://caca.zoy.org/wiki/zzuf Does it make sense to have a separate executable (pg_corrupt) just for corrupting the data as a test? Or should it be part of a corruption-testing harness (pg_corruptiontester?), that introduces the corruption and then verifies that it's properly detected? Maybe we need our own zzuf implementation, though. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Enabling Checksums
On 14 December 2012 20:15, Greg Smith g...@2ndquadrant.com wrote: On 12/14/12 3:00 PM, Jeff Davis wrote: After some thought, I don't see much value in introducing multiple instances of corruption at a time. I would think that the smallest unit of corruption would be the hardest to detect, so introducing many of them in one pass makes it easier to detect. That seems reasonable. It would eliminate a lot of issues with reproducing a fault too. I can just print the impacted block number presuming it will show up in a log, and make it possible to override picking one at random with a command line input. Discussing this makes me realise that we need a more useful response than just "your data is corrupt", so the user can respond "yes, I know, I'm trying to save what's left". We'll need a way of expressing some form of corruption tolerance. zero_damaged_pages is just insane, much better if we set corruption_tolerance = N to allow us to skip N corrupt pages before failing, with -1 meaning keep skipping for ever. Settable by superuser only. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
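The semantics Simon proposes -- skip up to N pages that fail verification, with -1 meaning skip forever -- can be sketched as a toy sequential scan in Python. All names are invented; a real implementation would also log a warning per skipped block rather than silently continuing:

```python
# Toy model of a sequential scan honouring a corruption_tolerance setting.
def scan(pages, verify, corruption_tolerance=0):
    skipped = 0
    for blkno, page in enumerate(pages):
        if not verify(page):
            skipped += 1
            if corruption_tolerance != -1 and skipped > corruption_tolerance:
                raise RuntimeError("block %d: checksum failure" % blkno)
            continue          # skip: the answer becomes silently incomplete
        yield page

pages = ["ok1", "BAD", "ok2", "BAD"]
ok = lambda p: p != "BAD"
print(list(scan(pages, ok, corruption_tolerance=2)))   # ['ok1', 'ok2']
print(list(scan(pages, ok, corruption_tolerance=-1)))  # ['ok1', 'ok2']
try:
    list(scan(pages, ok, corruption_tolerance=1))
except RuntimeError as e:
    print(e)                                           # block 3: checksum failure
```

The model also makes Tom's objection below visible: "skip" is only well-defined for a heap scan yielding whatever rows survive; for an index or pg_clog there is no equivalent incomplete-but-usable answer.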
Re: [HACKERS] Enabling Checksums
Simon Riggs si...@2ndquadrant.com writes: Discussing this makes me realise that we need a more useful response than just "your data is corrupt", so the user can respond "yes, I know, I'm trying to save what's left". We'll need a way of expressing some form of corruption tolerance. zero_damaged_pages is just insane; much better if we set corruption_tolerance = N to allow us to skip N corrupt pages before failing, with -1 meaning keep skipping forever. Settable by superuser only. Define "skip". Extra points if it makes sense for an index. And what about things like pg_clog pages? regards, tom lane
Re: [HACKERS] Enabling Checksums
On 17 December 2012 19:29, Tom Lane t...@sss.pgh.pa.us wrote: Simon Riggs si...@2ndquadrant.com writes: Discussing this makes me realise that we need a more useful response than just "your data is corrupt", so the user can respond "yes, I know, I'm trying to save what's left". We'll need a way of expressing some form of corruption tolerance. zero_damaged_pages is just insane; much better if we set corruption_tolerance = N to allow us to skip N corrupt pages before failing, with -1 meaning keep skipping forever. Settable by superuser only. Define "skip". Allow data access, but accept that the answer is silently incomplete. Not really much difference from zero_damaged_pages, which just removes the error by removing any chance of repair or recovery, and then silently gives the wrong answer. Extra points if it makes sense for an index. I guess not, but that's no barrier to it working on heap pages only, in my suggested use case. And what about things like pg_clog pages? SLRUs aren't checksummed because of their lack of header space. Perhaps that is a major point against the patch. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
Re: [HACKERS] Enabling Checksums
On Mon, 2012-12-17 at 19:14 +, Simon Riggs wrote: We'll need a way of expressing some form of corruption tolerance. zero_damaged_pages is just insane, The main problem I see with zero_damaged_pages is that it could potentially write out the zero page, thereby really losing your data if it wasn't already lost. (Of course, we document that you should have a backup first, but it's still dangerous). I assume that this is the same problem you are talking about. I suppose we could have a new ReadBufferMaybe function that would only be used by a sequential scan; and then just skip over the page if it's corrupt, depending on a GUC. That would at least allow sequential scans to (partially) work, which might be good enough for some data recovery situations. If a catalog index is corrupted, that could just be rebuilt. Haven't thought about the details, though. Regards, Jeff Davis
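The combination of a skip-over read and a corruption_tolerance limit can be sketched in a few lines. This is plain Python over an in-memory list of pages, with a stand-in verify_page rather than the real checksum check, and it models the proposed semantics, not anything PostgreSQL actually does:

```python
PAGE_SIZE = 8192

def verify_page(page: bytes) -> bool:
    """Stand-in for a real checksum check: here a page counts as
    'corrupt' if its first byte is 0xFF (purely for illustration)."""
    return page[:1] != b'\xff'

def tolerant_seqscan(pages, corruption_tolerance=0):
    """Yield readable pages, skipping up to corruption_tolerance corrupt
    ones (-1 = keep skipping forever) before failing hard."""
    skipped = 0
    for blkno, page in enumerate(pages):
        if verify_page(page):
            yield blkno, page
            continue
        skipped += 1
        if corruption_tolerance != -1 and skipped > corruption_tolerance:
            raise IOError("block %d: corrupt page, tolerance exhausted" % blkno)
        # otherwise: skip silently; the answer is knowingly incomplete

pages = [bytes(PAGE_SIZE), b'\xff' + bytes(PAGE_SIZE - 1), bytes(PAGE_SIZE)]
good = list(tolerant_seqscan(pages, corruption_tolerance=1))
print(len(good))   # -> 2
```

With corruption_tolerance = 0 the same scan raises on the first bad page, which is the current hard-failure behaviour; the GUC only widens how far "keep going" extends.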
Re: [HACKERS] Enabling Checksums
On Wed, 2012-12-12 at 17:52 -0500, Greg Smith wrote: I can take this on, as part of the QA around checksums working as expected. The result would be a Python program; I don't have quite enough time to write this in C or re-learn Perl to do it right now. But this won't be a lot of code. If it's tossed one day as simply a prototype for something more permanent, I think it's still worth doing now. The UI I'm thinking of for what I'm going to call pg_corrupt is a CLI that asks for:
- A relation name
- Corruption type (an entry from this list)
- How many blocks to touch
I'll just loop based on the count, randomly selecting a block each time and messing with it in that way. The randomness seed should be printed as part of the output, so that it's possible to re-create the damage exactly later. If the server doesn't handle it correctly, we'll want to be able to replicate the condition it choked on exactly, just based on the tool's log output. Any other requests? After some thought, I don't see much value in introducing multiple instances of corruption at a time. I would think that the smallest unit of corruption would be the hardest to detect, so introducing many of them in one pass makes it easier to detect. For example, if we introduce an all-ones page and also transpose two pages, the all-ones error might be detected even if the transpose error is not being detected properly. And we'd not know that the transpose error was not being detected, because the error appears as soon as it sees the all-ones page. Does it make sense to have a separate executable (pg_corrupt) just for corrupting the data as a test? Or should it be part of a corruption-testing harness (pg_corruptiontester?) that introduces the corruption and then verifies that it's properly detected? Regards, Jeff Davis
Re: [HACKERS] Enabling Checksums
On 12/14/12 3:00 PM, Jeff Davis wrote: After some thought, I don't see much value in introducing multiple instances of corruption at a time. I would think that the smallest unit of corruption would be the hardest to detect, so introducing many of them in one pass makes it easier to detect. That seems reasonable. It would eliminate a lot of issues with reproducing a fault too. I can just print the impacted block number, presuming it will show up in a log, and make it possible to override picking one at random with a command-line input. Does it make sense to have a separate executable (pg_corrupt) just for corrupting the data as a test? Or should it be part of a corruption-testing harness (pg_corruptiontester?) that introduces the corruption and then verifies that it's properly detected? Let me see what falls out of the coding; I don't think this part needs to get nailed down yet. Building a corruption-testing harness is going to involve a lot of creating new clusters and test data to torture. It's a different style of problem from injecting faults in the first place. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Re: [HACKERS] Enabling Checksums
On 12/5/12 6:49 PM, Simon Riggs wrote:
* Zeroing pages, making pages all 1s
* Transposing pages
* Moving chunks of data sideways in a block
* Flipping bits randomly
* Flipping data endianness
* Destroying particular catalog tables or structures
I can take this on, as part of the QA around checksums working as expected. The result would be a Python program; I don't have quite enough time to write this in C or re-learn Perl to do it right now. But this won't be a lot of code. If it's tossed one day as simply a prototype for something more permanent, I think it's still worth doing now. The UI I'm thinking of for what I'm going to call pg_corrupt is a CLI that asks for:
- A relation name
- Corruption type (an entry from this list)
- How many blocks to touch
I'll just loop based on the count, randomly selecting a block each time and messing with it in that way. The randomness seed should be printed as part of the output, so that it's possible to re-create the damage exactly later. If the server doesn't handle it correctly, we'll want to be able to replicate the condition it choked on exactly, just based on the tool's log output. Any other requests? -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
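That loop really is only a handful of lines. A hedged sketch of the core of it (pg_corrupt is only a proposed name; the all-ones damage, the 8 kB page size, and writing directly to an ordinary file rather than through the server are all assumptions of this illustration):

```python
import os
import random
import tempfile

PAGE_SIZE = 8192  # assumed default block size

def corrupt_blocks(path, seed=None, nblocks=1):
    """Overwrite nblocks randomly chosen pages of a file with all-ones,
    logging the seed and block numbers so the damage can be re-created
    exactly later from the tool's output alone."""
    if seed is None:
        seed = random.randrange(2 ** 31)
    rng = random.Random(seed)
    with open(path, 'r+b') as f:
        total = os.fstat(f.fileno()).st_size // PAGE_SIZE
        for _ in range(nblocks):
            blkno = rng.randrange(total)
            f.seek(blkno * PAGE_SIZE)
            f.write(b'\xff' * PAGE_SIZE)   # "all ones" corruption
            print("pg_corrupt: seed=%d block=%d" % (seed, blkno))
    return seed

# demo against a scratch four-page "relation"
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(bytes(PAGE_SIZE * 4))
corrupt_blocks(path, seed=7)
os.unlink(path)
```

Re-running with the printed seed hits the same blocks in the same order, which is the replayability property asked for above.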
Re: [HACKERS] Enabling Checksums
Robert Haas wrote: Jeff Davis pg...@j-davis.com wrote: Or, I could write up a test framework in ruby or python, using the appropriate pg driver, and some not-so-portable shell commands to start and stop the server. Then, I can publish that on this list, and that would at least make it easier to test semi-manually and give greater confidence in pre-commit revisions. That latter approach is similar to what happened with SSI's isolation tester. It started out in Python, and then Heikki rewrote it in C. If Python/Ruby code is massively simpler to write than the C code, that might be a good way to start out. It'll be an aid to reviewers even if neither it nor any descendent gets committed. Frankly, I think some automated testing harness (written in C or Perl) that could do fault-injection tests as part of the buildfarm would be amazingly awesome. I'm drooling just thinking about it. But I guess that's getting ahead of myself. There may be room for both. My experience was that the dtester tool from Markus made it relatively easy for me to hack up new tests which gave detailed information about which permutations were behaving as desired, which were known not to be covered, and which had regressions. That speed of adding new tests and detail about improvements or regressions allowed faster development than would have been possible with the isolation tester that Heikki wrote in C. On the other hand, dtester requires Python (in fact, I think it requires Python version 2.x where x is 5 or greater), a requirement which I don't think we want to add for builds. It wasn't very compatible with the normal make check environment, either in how it was run or in its output. And it was much slower than the isolation test framework -- like by about an order of magnitude. So for a completed product on which you want to test for regressions, the isolation tester is much better. For a development effort on the scale of SSI, I would want to have dtester or something very like it available.
Neither one quite handles tests for all the types of concurrency conditions that one might want. I had some idea how to add some additional useful cases to dtester, and it didn't look outrageously hard. I haven't really looked at how to do that in the isolation tester, so I don't know how hard it would be there. -Kevin
Re: [HACKERS] Enabling Checksums
On Tue, Dec 4, 2012 at 6:17 PM, Jeff Davis pg...@j-davis.com wrote: Or, I could write up a test framework in ruby or python, using the appropriate pg driver, and some not-so-portable shell commands to start and stop the server. Then, I can publish that on this list, and that would at least make it easier to test semi-manually and give greater confidence in pre-commit revisions. That latter approach is similar to what happened with SSI's isolation tester. It started out in Python, and then Heikki rewrote it in C. If Python/Ruby code is massively simpler to write than the C code, that might be a good way to start out. It'll be an aid to reviewers even if neither it nor any descendent gets committed. Frankly, I think some automated testing harness (written in C or Perl) that could do fault-injection tests as part of the buildfarm would be amazingly awesome. I'm drooling just thinking about it. But I guess that's getting ahead of myself. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Enabling Checksums
On 5 December 2012 23:40, Robert Haas robertmh...@gmail.com wrote: On Tue, Dec 4, 2012 at 6:17 PM, Jeff Davis pg...@j-davis.com wrote: Or, I could write up a test framework in ruby or python, using the appropriate pg driver, and some not-so-portable shell commands to start and stop the server. Then, I can publish that on this list, and that would at least make it easier to test semi-manually and give greater confidence in pre-commit revisions. That latter approach is similar to what happened with SSI's isolation tester. It started out in Python, and then Heikki rewrote it in C. If Python/Ruby code is massively simpler to write than the C code, that might be a good way to start out. It'll be an aid to reviewers even if neither it nor any descendent gets committed. Frankly, I think some automated testing harness (written in C or Perl) that could do fault-injection tests as part of the buildfarm would be amazingly awesome. I'm drooling just thinking about it. But I guess that's getting ahead of myself. Agreed, though we can restrict that to a few things at first.
* Zeroing pages, making pages all 1s
* Transposing pages
* Moving chunks of data sideways in a block
* Flipping bits randomly
* Flipping data endianness
* Destroying particular catalog tables or structures
etc. As a contrib module, so we can be sure to never install it. ;-) -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services
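Several of the listed damage types are only a few lines each. A sketch of how a contrib-style tool might dispatch a subset of them (illustration only, operating on an in-memory list of pages rather than real relation files; endianness flips and catalog-targeted damage are omitted):

```python
import random

PAGE_SIZE = 8192

def zero_page(pages, blkno, rng):
    pages[blkno] = bytearray(PAGE_SIZE)              # all zeros

def ones_page(pages, blkno, rng):
    pages[blkno] = bytearray(b'\xff' * PAGE_SIZE)    # all ones

def transpose_pages(pages, blkno, rng):
    other = rng.randrange(len(pages))                # swap with a random page
    pages[blkno], pages[other] = pages[other], pages[blkno]

def flip_bits(pages, blkno, rng, nbits=8):
    for _ in range(nbits):                           # flip random bits in place
        pos = rng.randrange(PAGE_SIZE)
        pages[blkno][pos] ^= 1 << rng.randrange(8)

CORRUPTION_TYPES = {
    'zero': zero_page,
    'ones': ones_page,
    'transpose': transpose_pages,
    'bitflip': flip_bits,
}

rng = random.Random(0)
pages = [bytearray([i] * PAGE_SIZE) for i in range(4)]
CORRUPTION_TYPES['ones'](pages, 2, rng)
print(pages[2][0])   # -> 255
```

A single rng shared by all operations keeps a multi-step corruption run replayable from one logged seed, matching the reproducibility requirement discussed earlier in the thread.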
Re: [HACKERS] Enabling Checksums
On Mon, 2012-12-03 at 13:16 +, Simon Riggs wrote: On 3 December 2012 09:56, Simon Riggs si...@2ndquadrant.com wrote: I think the way forwards for this is... 1. Break out the changes around inCommit flag, since that is just uncontroversial refactoring. I can do that. That reduces the noise level in the patch and makes it easier to understand the meaningful changes. Done. Thank you. One minor thing I noticed: it looks like nwaits is a useless variable. Your original checksums patch used it to generate a warning, but now that is gone. It's not throwing a compiler warning for some reason. Regards, Jeff Davis