On Wed, Feb 25, 2015 at 3:26 AM, Andres Freund <and...@2ndquadrant.com> wrote: > I'm pretty sure this will entirely fail if you have a transaction that's > large enough to spill to disk. Calling ReorderBufferIterTXNNext() will > reuse the memory for the in memory changes. > > It's also borked because it skips a large number of checks for the next > change. You're entirely disregarding what the routine does otherwise - > it could be a toast record, it could be a change to a system table, it > could be a new snapshot... > > Also, this seems awfully fragile in th presence of toast rows and such? > Can we be entirely sure that the next WAL record logged by that session > would be the internal super deletion?
toast_save_datum() is called with a heap_insert() call before heap insertion for the tuple proper. We're relying on the assumption that if there is no immediate super deletion record, things are fine. We cannot speculatively insert into catalogs, BTW. Why should we see a REORDER_BUFFER_CHANGE_INTERNAL_COMMAND_ID case next, in the case that we need to care about (when the tuple was super deleted immediately afterwards)? Now, I fully admit that as written, the quick sketch patch is unacceptably fragile (I think it might accidentally work despite this, leaving aside concerns about resource lifespans). The "peek ahead" needs to be more robust, for sure, most obviously by not failing when we spill to disk, but also by being paranoid about super deletion records. Basically, by "more paranoid", I mean the next thing immediately following a speculatively inserted tuple might not be a super deletion record, but we still look until we conclude that there won't be one (based on something else - some "terminating condition" - happening, including the xact committing). Other examples of terminating conditions would be new, unrelated REORDER_BUFFER_CHANGE_INSERT/ REORDER_BUFFER_CHANGE_UPDATE/ REORDER_BUFFER_CHANGE_DELETE changes. We can make some assumptions about when it's safe to conclude that there was no super deletion that are not fragile. As for the idea of writing WAL separately, I don't think it's worth it. We'd need to go exclusive lock the buffer again, which seems like a non-starter. While what I have here *is* very ugly, I see no better alternative. By doing what you suggest, we'd be finding a special case way of safely violating the general "WAL log-before-data" rule. Performance aside, that seems like a worse, less isolated kludge...no? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers