On Wed, Jan 25, 2017 at 2:23 PM, Stephen Frost <sfr...@snowman.net> wrote:
>> Sure. If the database runs fast enough with checksums enabled,
>> there's basically no reason to have them turned off. The issue is
>> when it doesn't.
>
> I don't believe we're talking about forcing every user to have checksums
> enabled. We are discussing the default.

I never said otherwise.

> Would you say that most users' databases run fast enough with checksums
> enabled? Or more than most, maybe 70%? 80%? In today's environment,
> I'd probably say that it's more like 90+%.

I don't have statistics on that, but I'd certainly agree that it's
over 90%. However, I estimate that the percentage of people who
wouldn't be helped by checksums is also over 90%. I don't think it's
easy to say whether there are more people who would benefit from
checksums than would be hurt by the performance penalty or vice versa.
My own feeling is the second, but I understand that yours is the
first.

> Yet, our default is to have them disabled and *really* hard to enable.

First of all, that could be fixed by further development.

Second, really hard to enable is a relative term. I accept that
enabling checksums is not a pleasant process. Right now, you'd have
to do a dump/restore, or use logical replication to replicate the data
to a new cluster and then switch over. On the other hand, if
checksums are really a critical feature, how are people getting to the
point where they've got a mission-critical production system and only
then discovering that they want to enable checksums? If you tell
somebody "we have an optional feature called checksums and you should
really use it" and they respond "well, I'd like to, but I already put
my system into critical production use and it's not worth it to me to
take downtime to get them enabled", that sounds to me like the feature
is nice-to-have, not absolutely essential. When something is
essential, you find a way to get it done, whether it's painful or not,
because that's what essential means. And if checksums are not
essential, then they shouldn't be enabled by default unless they're
very cheap -- and I think we already know that's not true in all
workloads.

> I agree that it's unfortunate that we haven't put more effort into
> fixing that- I'm all for it, but it's disappointing to see that people
> are not in favor of changing the default as I believe it would both help
> our users and encourage more development of the feature.

I think it would help some users and hurt others. I do agree that it
would encourage more development of the feature -- almost of
necessity. In particular, I bet it would spur development of an
efficient way of turning checksums off -- but I'd rather see us
approach it from the other direction: let's develop an efficient way
of turning the feature on and off FIRST. Deciding that the feature
has to be on for everyone because turning it on later is too hard for
the people who later decide they want it is letting the tail wag the
dog.

Also, I think that one of the big problems with the way checksums work
is that you don't find problems with your archived data until it's too
late. Suppose that in February bits get flipped in a block. You
don't access the data until July[1]. Well, it's nice to have the
system tell you that the data is corrupted, but what are you going to
do about it? By that point, all of your backups are probably
corrupted. So it's basically:

ERROR: you're screwed

It's nice to know that (maybe?) but without a recovery strategy a
whole lot of people who get that message are going to immediately
start asking "How do I ignore the fact that I'm screwed and try to
read the data anyway?". And then you wonder what the point of having
the feature turned on is, especially if it's costly. It's almost an
attractive nuisance at that point - nobody wants to be the user that
turns off checksums because they sound good on paper, but when you
actually have a problem an awful lot of people are NOT going to want
to try to restore from backup and maybe lose recent transactions.
They're going to want to ignore the checksum failures. That's kind of
awful.
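
To make the timing problem concrete, here is a toy sketch in Python. It is
not PostgreSQL code: the ToyStore class and its methods are hypothetical,
and CRC-32 is only a stand-in for the real page checksum algorithm. It
assumes nothing more than a checksum stored next to each page. What it
shows is that a flipped bit goes unnoticed until somebody reads that page,
while an explicit verification pass over all pages finds it right away.

```python
import zlib

PAGE_SIZE = 8192  # assumption: page size chosen only for illustration


def checksum(data: bytes) -> int:
    """Stand-in checksum (CRC-32); not PostgreSQL's actual algorithm."""
    return zlib.crc32(data)


class ToyStore:
    """Hypothetical page store that keeps a checksum next to each page."""

    def __init__(self):
        self.pages = {}  # page number -> (stored checksum, page bytes)

    def write(self, pageno, data):
        self.pages[pageno] = (checksum(data), bytearray(data))

    def flip_bit(self, pageno, byte_offset, bit):
        """Simulate silent corruption: data changes, stored checksum does not."""
        _, data = self.pages[pageno]
        data[byte_offset] ^= 1 << bit

    def read(self, pageno):
        """Verify on access; this is the only point where damage is noticed."""
        stored, data = self.pages[pageno]
        if checksum(bytes(data)) != stored:
            raise IOError(f"checksum mismatch on page {pageno}")
        return bytes(data)

    def scan(self):
        """As-needed verification pass: return the page numbers that fail."""
        bad = []
        for pageno in sorted(self.pages):
            try:
                self.read(pageno)
            except IOError:
                bad.append(pageno)
        return bad


store = ToyStore()
for n in range(4):
    store.write(n, bytes([n]) * PAGE_SIZE)

store.flip_bit(2, byte_offset=100, bit=3)  # "February": nothing complains
untouched_ok = store.read(0)               # reads of other pages still succeed
bad_pages = store.scan()                   # an explicit scan finds the damage
print(bad_pages)
```

In this toy, only the explicit scan() call surfaces the damage before page 2
is actually needed, which is the window in which an older, uncorrupted backup
might still exist.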

Peter's comments upthread get at this: "We need to invest in
corruption detection/verification tools that are run on an as-needed
basis." Exactly. If we could verify that our data is good before
throwing away our old backups, that'd be good. If we could verify
that our indexes were structurally sane, that would be superior to
anything checksums can ever give us because it catches not only
storage failures but also software failures within PostgreSQL itself
and user malfeasance above the PostgreSQL layer (e.g. redefining the
supposedly-immutable function to give different answers) and damage
inflicted inadvertently by environmental changes (e.g. upgrading glibc
and having strcoll() change its mind). If we could verify that every
XID and MXID in the heap points to a clog or multixact record that
still exists, that'd catch more than just bit flips.

I'm not trying to downplay the usefulness of checksums *in a certain
context*. It's a good feature, and I'm glad we have it. But I think
you're somewhat inflating the utility of it while discounting the very
real costs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] of the following year, maybe.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers