On 3/4/13 3:13 PM, Heikki Linnakangas wrote:
> This PostgreSQL patch hasn't seen any production use, either. In fact,
> I'd consider btrfs to be more mature than this patch. Unless you think
> that there will be some major changes to the worse in performance in
> btrfs, it's perfectly valid and useful to compare the two.

I think my last message came across with a more hostile attitude than I intended; sorry about that. My problem with this idea comes from looking at the history of times Linux has failed to work properly in this area before. The best example I can point at is the one I documented at http://www.postgresql.org/message-id/4b512d0d.4030...@2ndquadrant.com along with this handy pgbench chart: http://www.phoronix.com/scan.php?page=article&item=ubuntu_lucid_alpha2&num=3

TPS on pgbench dropped from 1102 to about 110 after a kernel bug fix; it had been 10X as fast in some kernel versions because fsync wasn't working properly. Kernel filesystem issues have regularly resulted in data not being written to disk when it should have been, inflating benchmark results accordingly. Fake writes due to "lying drives", write barriers that only actually work on server-class hardware, write barriers that don't work on md volumes, and then this one; it's a recurring pattern. It's not the fault of the kernel developers; it's a hard problem, and drive manufacturers aren't making it easy for them.
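
To make the "is fsync real?" question concrete, the check is simple; pg_test_fsync in contrib is the proper tool, but even a rough sketch like this one of mine (Python, with the path and round count as placeholders) shows the idea: time a run of write+fsync calls on the filesystem under test. A plain 7200RPM drive tops out somewhere around 100-120 real flushes per second, so rates in the thousands usually mean the data never actually made it to the platters.

    import os, time

    # Quick-and-dirty probe, not pg_test_fsync: time repeated 8kB
    # write+fsync calls.  PATH and ROUNDS are placeholders to adjust.
    PATH = "/var/tmp/fsync_probe.dat"
    ROUNDS = 200

    fd = os.open(PATH, os.O_CREAT | os.O_WRONLY, 0o600)
    start = time.monotonic()
    for _ in range(ROUNDS):
        os.pwrite(fd, b"x" * 8192, 0)   # rewrite the same 8kB block
        os.fsync(fd)                    # ask for it to be made durable
    elapsed = time.monotonic() - start
    os.close(fd)
    os.unlink(PATH)

    # ~100-120/sec is believable for a single spinning disk; thousands/sec
    # means something in the stack is acknowledging writes it hasn't done.
    print("%.0f fsyncs/sec" % (ROUNDS / elapsed))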

My concern, then, is that if the comparison target is btrfs performance, how do we know it's working reliably? The track record says that bugs in this area usually inflate results, compared with a correct implementation. You are certainly right that this checksum code is less mature than btrfs; it's just over a year old after all. I feel quite good that it's not benchmarking faster than it really is, especially when I can directly measure how the write volume is increasing in the worst result.
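
For reference, "directly measure the write volume" doesn't require anything fancy. Something along these lines is enough; this is a quick sketch of my own that just diffs /proc/diskstats around a run, with "sda" standing in for whatever device actually holds the data directory and WAL:

    # Sketch of measuring write volume around a benchmark run by diffing
    # /proc/diskstats.  "sda" is an assumption; substitute the real device.
    DEV = "sda"

    def sectors_written(dev):
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    return int(fields[9])   # sectors-written counter
        raise ValueError("device %s not found" % dev)

    before = sectors_written(DEV)
    input("run pgbench now, then press Enter... ")
    after = sectors_written(DEV)
    # diskstats counts 512-byte sectors regardless of the drive's sector size
    print("%.1f MiB written during the run" % ((after - before) * 512 / 1048576))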

I can't say whether btrfs is slower or faster than it will eventually be due to bugs; I can't tell you the right way to tune btrfs for PostgreSQL; and I haven't even had anyone ask the question yet. Right now, the main thing I know about testing performance on Linux kernels new enough to support btrfs is that they're just generally slow at running PostgreSQL. See the multiple confirmed regression issues at http://www.postgresql.org/message-id/60b572d9298d944580f7d51195dd30804357fa4...@vmbx125.ihostexchange.net for example. That newer kernel mess needs to get sorted out one day too. Why does database performance suck on kernel 3.2? I don't know yet, but it doesn't help me get excited about assuming btrfs results will be useful.

ZFS was supposed to save everyone from worrying about corruption issues. That didn't work out, I think due to the commercial agenda behind its development. Now we have btrfs coming in some number of years, a project still tied more closely to Oracle than I would like. I'm not too optimistic about that one either. It doesn't help that the original project lead, Chris Mason, has since left there and is working at FusionIO, and that company's filesystem plans don't include checksumming either. (See http://www.fusionio.com/blog/under-the-hood-of-the-iomemory-sdk/ for a quick intro to what they're doing right now, which includes bypassing the Linux filesystem layer with their own flash-optimized but POSIX-compliant directFS.)

There is an optimistic future path I can envision where btrfs matures quickly and in a way that performs well for PostgreSQL. Maybe we'll end up there, and if that happens everyone can look back and say this was a stupid idea. But there are a lot of other outcomes I see as possible here, and in all the rest of them having some checksumming capabilities available is a win.

One of the areas where PostgreSQL has a solid reputation is being trusted to run as reliably as possible. All of the deployment trends I'm seeing have people moving toward less reliable hardware: VMs, cloud systems, regular drives instead of hardware RAID, etc. A lot of people badly want to leave behind the era of the giant database server and have a lot of replicas running on smaller/cheaper systems instead. There's a useful advocacy win for the project if lower-grade hardware can be used to hit a target reliability level, with software picking up some of the error detection job instead. Yes, it costs something in terms of future maintenance on the codebase, as new features almost invariably do. If I didn't see being able to make noise about the improved reliability of PostgreSQL as valuable enough to consider it anyway, I wouldn't even be working on this thing.
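
In case "software picking up some of the error detection job" sounds abstract, the underlying idea is nothing more exotic than this toy illustration of my own (the actual patch works per 8kB page inside the server and uses its own checksum algorithm, not this one): stamp a checksum onto each page as it's written and verify it when the page is read back, so silently corrupted storage turns into a loud error instead of wrong answers.

    import struct, zlib

    PAGE_SIZE = 8192   # PostgreSQL's default block size

    def stamp(body: bytes) -> bytes:
        # Prepend a CRC of the page contents before it goes to storage.
        return struct.pack("<I", zlib.crc32(body)) + body

    def verify(page: bytes) -> bytes:
        # On read, recompute the CRC and compare with what was stored.
        stored, body = struct.unpack("<I", page[:4])[0], page[4:]
        if zlib.crc32(body) != stored:
            raise IOError("page checksum mismatch: storage returned different bytes")
        return body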

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

