On 3/4/13 3:13 PM, Heikki Linnakangas wrote:
> This PostgreSQL patch hasn't seen any production use, either. In fact,
> I'd consider btrfs to be more mature than this patch. Unless you think
> that there will be some major changes to the worse in performance in
> btrfs, it's perfectly valid and useful to compare the two.

I think my last message came across with a more hostile attitude than I intended; sorry about that. My problem with this idea comes from looking at the history of times Linux has failed to work properly in this area before. The best example I can point at is the one I documented at http://www.postgresql.org/message-id/4b512d0d.4030...@2ndquadrant.com along with this handy pgbench chart: http://www.phoronix.com/scan.php?page=article&item=ubuntu_lucid_alpha2&num=3

TPS on pgbench dropped from 1102 to about 110 after a kernel bug fix; it had been 10X as fast in some kernel versions because fsync wasn't working properly. Kernel filesystem issues have regularly resulted in data not being written to disk when it should have been, inflating benchmark results accordingly. Fake writes due to "lying drives", write barriers that only actually work on server-class hardware, write barriers that don't work on md volumes, and then this one; it's a recurring pattern. It's not the fault of the kernel developers; it's a hard problem, and drive manufacturers aren't making it easy for them.
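
To make the "is fsync real?" question concrete, the check is simple; pg_test_fsync in contrib is the proper tool, but even a rough sketch like this one of mine (Python, with the path and round count as placeholders) shows the idea: time a run of write+fsync calls on the filesystem under test. A plain 7200RPM drive tops out somewhere around 100-120 real flushes per second, so rates in the thousands usually mean the data never actually made it to the platters.

    import os, time

    # Quick-and-dirty probe, not pg_test_fsync: time repeated 8kB
    # write+fsync calls.  PATH and ROUNDS are placeholders to adjust.
    PATH = "/var/tmp/fsync_probe.dat"
    ROUNDS = 200

    fd = os.open(PATH, os.O_CREAT | os.O_WRONLY, 0o600)
    start = time.monotonic()
    for _ in range(ROUNDS):
        os.pwrite(fd, b"x" * 8192, 0)   # rewrite the same 8kB block
        os.fsync(fd)                    # ask for it to be made durable
    elapsed = time.monotonic() - start
    os.close(fd)
    os.unlink(PATH)

    # ~100-120/sec is believable for a single spinning disk; thousands/sec
    # means something in the stack is acknowledging writes it hasn't done.
    print("%.0f fsyncs/sec" % (ROUNDS / elapsed))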

My concern, then, is that if the comparison target is btrfs performance, how do we know it's working reliably? The track record says that bugs in this area usually inflate results, compared with a correct implementation. You are certainly right that this checksum code is less mature than btrfs; it's just over a year old after all. I feel quite good that it's not benchmarking faster than it really is, especially when I can directly measure how the write volume is increasing in the worst result.
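
For reference, "directly measure the write volume" doesn't require anything fancy. Something along these lines is enough; this is a quick sketch of my own that just diffs /proc/diskstats around a run, with "sda" standing in for whatever device actually holds the data directory and WAL:

    # Sketch of measuring write volume around a benchmark run by diffing
    # /proc/diskstats.  "sda" is an assumption; substitute the real device.
    DEV = "sda"

    def sectors_written(dev):
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    return int(fields[9])   # sectors-written counter
        raise ValueError("device %s not found" % dev)

    before = sectors_written(DEV)
    input("run pgbench now, then press Enter... ")
    after = sectors_written(DEV)
    # diskstats counts 512-byte sectors regardless of the drive's sector size
    print("%.1f MiB written during the run" % ((after - before) * 512 / 1048576))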

I can't say whether btrfs is slower or faster than it will eventually be due to bugs; I can't tell you the right way to tune btrfs for PostgreSQL; and I haven't even had anyone ask the question yet. Right now, the main thing I know about testing performance on Linux kernels new enough to support btrfs is that they're just generally slow at running PostgreSQL. See the multiple confirmed regression issues at http://www.postgresql.org/message-id/60b572d9298d944580f7d51195dd30804357fa4...@vmbx125.ihostexchange.net for example. That newer kernel mess needs to get sorted out one day too. Why does database performance suck on kernel 3.2? I don't know yet, but it doesn't help me get excited about assuming btrfs results will be useful.

ZFS was supposed to save everyone from worrying about corruption issues. That didn't work out, I think due to the commercial agenda behind its development. Now we have btrfs coming in some number of years, a project still tied more closely to Oracle than I would like. I'm not too optimistic about that one either. It doesn't help that the original project lead, Chris Mason, has since left there and is working at FusionIO, and that company's filesystem plans don't include checksumming either. (See http://www.fusionio.com/blog/under-the-hood-of-the-iomemory-sdk/ for a quick intro to what they're doing right now, which includes bypassing the Linux filesystem layer with their own flash-optimized but POSIX-compliant directFS.)

There is an optimistic future path I can envision where btrfs matures quickly and in a way that performs well for PostgreSQL. Maybe we'll end up there, and if that happens everyone can look back and say this was a stupid idea. But there are a lot of other outcomes I see as possible here, and in all the rest of them having some checksumming capabilities available is a win.

One of the areas where PostgreSQL has a solid reputation is being trusted to run as reliably as possible. All of the deployment trends I'm seeing have people moving toward less reliable hardware: VMs, cloud systems, regular drives instead of hardware RAID, etc. A lot of people badly want to leave behind the era of the giant database server and have a lot of replicas running on smaller/cheaper systems instead. There's a useful advocacy win for the project if lower-grade hardware can be used to hit a target reliability level, with software picking up some of the error detection job instead. Yes, it costs something in terms of future maintenance on the codebase, as new features almost invariably do. If I didn't see being able to make noise about the improved reliability of PostgreSQL as valuable enough to consider it anyway, I wouldn't even be working on this thing.
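
In case "software picking up some of the error detection job" sounds abstract, the underlying idea is nothing more exotic than this toy illustration of my own (the actual patch works per 8kB page inside the server and uses its own checksum algorithm, not this one): stamp a checksum onto each page as it's written and verify it when the page is read back, so silently corrupted storage turns into a loud error instead of wrong answers.

    import struct, zlib

    PAGE_SIZE = 8192   # PostgreSQL's default block size

    def stamp(body: bytes) -> bytes:
        # Prepend a CRC of the page contents before it goes to storage.
        return struct.pack("<I", zlib.crc32(body)) + body

    def verify(page: bytes) -> bytes:
        # On read, recompute the CRC and compare with what was stored.
        stored, body = struct.unpack("<I", page[:4])[0], page[4:]
        if zlib.crc32(body) != stored:
            raise IOError("page checksum mismatch: storage returned different bytes")
        return body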

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

