On Mon, Sep 14, 2020 at 3:00 PM Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> FWIW I agree with Andres' stance on this. The current system is *very*
> complicated and bugs are obscure already. If we hide them, what we'll
> be getting is a system where data can become corrupted for no apparent
> reason.
I think I might have to give up on this proposal given the level of opposition to it, but the nature of the opposition doesn't make any sense to me on a technical level. Suppose a tuple with tid A has been updated, producing a new version at tid B. The argument now being offered is that if A has been found to be corrupt, we'd better stop vacuuming the table altogether lest we reach B and vacuum it too, further corrupting the table and destroying forensic evidence. But even ignoring the fact that many users want to get the database running again more than they want to do forensics, it's entirely possible that B < A, in which case the damage has already been done. (A quick sketch of how B < A can arise is at the end of this mail.) Therefore, I can't see any argument that this patch creates a scenario that can't happen already.

It seems entirely reasonable to me to say, as a review comment, hey, you haven't sufficiently considered this particular scenario, that still needs work. But the argument here is much more about whether this is a reasonable thing to do in general and under any circumstances, and it feels to me like you guys are saying "no" without offering any really convincing evidence that there are unfixable problems here. IOW, I agree that having a GUC corrupt_my_tables_more=true would not be a reasonable thing, but I disagree that the proposal on the table is tantamount to that.

The big picture here is that people have terabyte-scale tables, 1 or 2 tuples get corrupted, and right now the only real fix is to dump and restore the whole table, which leads to prolonged downtime. The pg_surgery stuff should help with that, and the work to make VACUUM report the exact TID will also help, and if we can get the heapcheck stuff Mark Dilger is working on committed, that will provide an alternative and probably better way of finding this kind of corruption, which is all to the good. (The second sketch at the end of this mail shows the kind of repair workflow I have in mind.) However, I disagree with the idea that a typical user who has a 2TB table with one corrupted tuple on page 0 wants VACUUM to fail over and over again, letting the table bloat like crazy, instead of bleating loudly but still vacuuming the other 99.9999% of the table. I mean, somebody probably wants that, and that's fine. But I have a hard time imagining it as a typical view. Am I just lacking in imagination?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
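
P.S. Here is the first sketch, a minimal made-up example (the table and
column names are hypothetical) of how an UPDATE can put the new tuple
version at a *lower* TID than the old one, once VACUUM has recorded free
space on an earlier page:

CREATE TABLE t (id int, filler text);
INSERT INTO t SELECT g, repeat('x', 500) FROM generate_series(1, 100) g;

-- Empty out page 0 and let VACUUM record the space in the free space map.
DELETE FROM t WHERE (ctid::text::point)[0] = 0;
VACUUM t;

-- id 50 sits on a full page, so its new version can't stay on that page
-- and is likely to land in the space reclaimed on page 0, i.e. B < A.
SELECT ctid FROM t WHERE id = 50;   -- old version A, e.g. (3,5)
UPDATE t SET filler = filler WHERE id = 50;
SELECT ctid FROM t WHERE id = 50;   -- new version B, quite possibly (0,1)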
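
And the second sketch, the sort of repair workflow I have in mind for the
big-picture case, assuming the interfaces from Mark's heapcheck patch
(verify_heapam in amcheck) and from pg_surgery (heap_force_kill) roughly
as currently proposed; the table name and the TID are illustrative:

-- Find the damaged TIDs instead of waiting for VACUUM to trip over them.
CREATE EXTENSION amcheck;
SELECT blkno, offnum, msg FROM verify_heapam('my_big_table');

-- Surgically remove the corrupt tuple, then vacuum the rest normally.
CREATE EXTENSION pg_surgery;
SELECT heap_force_kill('my_big_table'::regclass, ARRAY['(0,7)']::tid[]);
VACUUM my_big_table;

That takes seconds, versus the hours of downtime that dumping and
restoring a 2TB table would cost.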