On Wed, Jan 25, 2017 at 2:23 PM, Stephen Frost <sfr...@snowman.net> wrote:
>> Sure.  If the database runs fast enough with checksums enabled,
>> there's basically no reason to have them turned off.  The issue is
>> when it doesn't.
>
> I don't believe we're talking about forcing every user to have checksums
> enabled.  We are discussing the default.

I never said otherwise.

> Would you say that most users' databases run fast enough with checksums
> enabled?  Or more than most, maybe 70%?  80%?  In today's environment,
> I'd probably say that it's more like 90+%.

I don't have statistics on that, but I'd certainly agree that it's
over 90%.  However, I estimate that the percentage of people who
wouldn't be helped by checksums is also over 90%.  I don't think it's
easy to say whether there are more people who would benefit from
checksums than would be hurt by the performance penalty or vice
versa.  My own feeling is the latter, but I understand that yours is
the former.

> Yet, our default is to have them disabled and *really* hard to enable.

First of all, that could be fixed by further development.

Second, "really hard to enable" is a relative term.  I accept that
enabling checksums is not a pleasant process.  Right now, you'd have
to do a dump/restore, or use logical replication to replicate the data
to a new cluster and then switch over.  On the other hand, if
checksums are really a critical feature, how are people getting to the
point where they've got a mission-critical production system and only
then discovering that they want to enable checksums?  If you tell
somebody "we have an optional feature called checksums and you should
really use it" and they respond "well, I'd like to, but I already put
my system into critical production use and it's not worth it to me to
take downtime to get them enabled", that sounds to me like the feature
is nice-to-have, not absolutely essential.  When something is
essential, you find a way to get it done, whether it's painful or not,
because that's what essential means.  And if checksums are not
essential, then they shouldn't be enabled by default unless they're
very cheap -- and I think we already know that's not true in all
workloads.
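
For concreteness, the dump/restore route is roughly this today (the
paths, database, and port numbers below are just placeholders, and
checksums have to be switched on when the new cluster is initdb'd):

    # paths and ports are placeholders
    initdb --data-checksums -D /path/to/new_cluster
    pg_ctl -D /path/to/new_cluster -o '-p 5433' start
    pg_dumpall --port 5432 | psql --port 5433 -d postgres
    # ...then repoint applications at the new cluster and retire the old one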

> I agree that it's unfortunate that we haven't put more effort into
> fixing that- I'm all for it, but it's disappointing to see that people
> are not in favor of changing the default as I believe it would both help
> our users and encourage more development of the feature.

I think it would help some users and hurt others.  I do agree that it
would encourage more development of the feature -- almost of
necessity.  In particular, I bet it would spur development of an
efficient way of turning checksums off -- but I'd rather see us
approach it from the other direction: let's develop an efficient way
of turning the feature on and off FIRST.  Deciding that the feature
has to be on for everyone because turning it on later is too hard for
the people who later decide they want it is letting the tail wag the
dog.

Also, I think that one of the big problems with the way checksums work
is that you don't find problems with your archived data until it's too
late.  Suppose that in February bits get flipped in a block.  You
don't access the data until July[1].  Well, it's nice to have the
system tell you that the data is corrupted, but what are you going to
do about it?  By that point, all of your backups are probably
corrupted.  So it's basically:

ERROR: you're screwed

It's nice to know that (maybe?), but without a recovery strategy, a
whole lot of people who get that message are going to immediately
start asking "How do I ignore the fact that I'm screwed and try to
read the data anyway?"  And then you wonder what the point of having
the feature turned on is, especially if it's costly.  It's almost an
attractive nuisance at that point: nobody wants to be the user who
turns checksums off, because they sound good on paper, but when a
problem actually shows up, an awful lot of people are NOT going to
want to restore from backup and maybe lose recent transactions.
They're going to want to ignore the checksum failures.  That's kind
of awful.
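
And the knob to do exactly that already exists; roughly (superuser
only, "mydb" and "damaged_table" are placeholders, and of course the
data you read back may be garbage):

    # mydb / damaged_table are placeholders; needs superuser
    psql -d mydb -c "SET ignore_checksum_failure = on;
                     SELECT * FROM damaged_table;"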

Peter's comments upthread get at this: "We need to invest in
corruption detection/verification tools that are run on an as-needed
basis."  Exactly.  If we could verify that our data is good before
throwing away our old backups, that'd be good.  If we could verify
that our indexes were structurally sane, that would be superior to
anything checksums can ever give us because it catches not only
storage failures but also software failures within PostgreSQL itself
and user malfeasance above the PostgreSQL layer (e.g. redefining a
supposedly-immutable function to give different answers) and damage
inflicted inadvertently by environmental changes (e.g. upgrading glibc
and having strcoll() change its mind).  If we could verify that every
XID and MXID in the heap points to a clog or multixact record that
still exists, that'd catch more than just bit flips.
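
(As a concrete example of the index case, the out-of-core amcheck
extension provides btree structural checks along these lines.
Assuming it's installed -- that, and "mydb", are assumptions on my
part -- a rough invocation over every permanent btree index would be:

    # assumes the (currently out-of-core) amcheck extension is available
    psql -d mydb -c "CREATE EXTENSION IF NOT EXISTS amcheck;"
    psql -d mydb -c "SELECT bt_index_check(c.oid)
                       FROM pg_class c
                       JOIN pg_am am ON am.oid = c.relam
                      WHERE am.amname = 'btree' AND c.relkind = 'i'
                        AND c.relpersistence = 'p';"

That catches structural damage in the indexes themselves, though not
the heap or clog problems mentioned above.)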

I'm not trying to downplay the usefulness of checksums *in a certain
context*.  It's a good feature, and I'm glad we have it.  But I think
you're somewhat inflating the utility of it while discounting the very
real costs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] of the following year, maybe.

