On Mon, Jul 29, 2019 at 9:35 AM Robert Haas <robertmh...@gmail.com> wrote:
> I mean, essentially any well-designed framework intended for any sort
> of task whatsoever is going to have a design center where one can
> foresee that it will work well, and then as a result of working well
> for the thing for which it was designed, it will also work well for
> other things that are sufficiently similar. So, I think you're
> correct, but I also don't think that's really saying very much.
I agree that it's quite unclear how important this is. I don't necessarily think it matters if zheap doesn't do that well with GIN indexes. I think it's probably going to be useful to imagine how GIN indexing might work for zheap, because it clarifies the strengths and weaknesses of your design. It's perfectly fine for there to be weaknesses, provided that they are well understood.

> The
> trick is to figure out whether and how the ideas you have could be
> generalized with reasonable effort to handle other cases, and that's
> easier with some projects than others. I think when it comes to UNDO,
> it's actually really hard.

I agree.

> Unfortunately, we haven't had many takers so far -- thanks for chiming
> in.

I don't have the ability to express my general concerns here in a very crisp way. This is complicated stuff. Thanks for tolerating the hand-wavy nature of my feedback about this.

> I don't really understand your comments about GIN. My simplistic
> understanding of GIN is that it's not very different from btree in
> this regard.

GIN is quite similar to btree from a Postgres point of view -- GIN is simply a btree that is good at storing duplicates (and has higher-level infrastructure to make things like FTS work). So I'd say that your understanding is fairly complete, at least as far as traditional Postgres goes. But if we imagine a system in which we have to roll back in indexes, it's quite a different story. See my remarks to Thomas just now about that.

> Suppose we insert a row, and then the insert aborts;
> suppose also that the index wants to use UNDO. In the case of a btree
> index, we're going to go insert an index entry for the new row; upon
> abort, we should undo the index insertion by removing that index tuple
> or at least marking it dead.
> Unless a page split has happened,
> triggered either by the insertion itself or by subsequent activity,
> this puts the index in a state that is almost perfectly equivalent to
> where we were before the now-aborted transaction did any work. If a
> page split has occurred, trying to undo the index insertion is going
> to run into two problems. One, we probably can't undo the page split,
> so the index will be logically equivalent but not physically
> equivalent after we get rid of the new tuple. Two, if the page split
> happened after the insertion of the new tuple rather than at the same
> time, the index tuple may not be on the page where we left it.

Actually, page splits are the archetypal case where undo cannot restore the original physical state. In general, we cannot expect the undo process to reverse page splits. Undo might be able to merge the pages back together, but it also might not be able to. It won't be terribly different from the situation with deletes where the transaction commits, most likely.

Some other systems have something called "system transactions" for things like page splits. System transactions don't need to have their commit record flushed synchronously, and they occur in the foreground of the xact that needs to split the page. That way, rollback doesn't have to concern itself with rolling back things that are pretty much impossible to roll back, like page splits.

> Now, my mental model of a GIN index is that you go find N>=0 index
> keys inside each value and do basically the same thing as you would
> for a btree index for each one of them. Therefore it seems to me,
> possibly stupidly, that you're going to have basically the same
> problems, except each problem will now potentially happen up to N
> times instead of up to 1 time. I assume here that in either case -
> GIN or btree - you would tentatively record where you left the tuple
> that now needs to be zapped and that you can jump to that place
> directly to try to zap it.
> Possibly those assumptions are bad and
> maybe that's where you're seeing a concurrency/deadlock problem; if
> so, a more detailed explanation would be very helpful.

Imagine a world in which zheap cannot just create a new TID (or HOT chain) for the same logical tuple, which is something that I believe should be an important goal for zheap (again, see my remarks to Thomas). Simplicity for rollbacks in access methods like GIN demands that you lock the entire index tuple, which may point to hundreds of logical rows (or TIDs, since they have a 1:1 correspondence with logical rows in this imaginary world).

Rolling back with more granular locking seems very hard, for the same reason that rolling back a page split would be very hard -- you cannot possibly have enough bookkeeping information to make that work in a sane way in the face of concurrent insertions that may also commit or abort unpredictably. It seems necessary to bake concurrency control into rollback at the index access method level in order to get significant benefits from a design like zheap.

Now, maybe zheap should be permitted to not work particularly well with GIN, while teaching btree to take advantage of the common case where we can roll everything back, even in indexes (so zheap behaves much more like heapam when you have a GIN index, which is hopefully not that common). That could be a perfectly reasonable restriction. But ISTM that you need to make heap TIDs completely stable for the case that zheap is expected to excel at. You also need to teach nbtree to take advantage of this by rolling back if and when it's safe to do so (when we know that heap TIDs are stable for the indexed table).

In general, the only way that rolling back changes to indexes can work is by making heap TIDs completely stable. Any design for rollback in nbtree that allows there to be multiple entries for the same logical row in the index seems like a disaster to me.
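To make that concrete, here is a toy model of logical undo against a GIN-style posting list -- plain Python, nothing like the real GIN code, with every name (ToyGinIndex and so on) invented for illustration. The point is that the rollback step only makes sense because each undo record's TID still identifies the same logical row at undo time:

```python
# Toy model (not PostgreSQL code): why logical index rollback presumes
# stable heap TIDs. A GIN-style "index tuple" maps one key to a posting
# list of many TIDs; an undo record remembers which (key, tid) to remove.

class ToyGinIndex:
    def __init__(self):
        self.posting_lists = {}  # key -> set of TIDs (many rows per key)
        self.undo_log = []       # undo records for the current transaction

    def insert(self, key, tid):
        self.posting_lists.setdefault(key, set()).add(tid)
        # Logical undo record: just enough to re-find and remove the
        # entry later, wherever the posting list physically lives by then.
        self.undo_log.append(("remove", key, tid))

    def rollback(self):
        # Apply undo records in reverse order.
        while self.undo_log:
            _, key, tid = self.undo_log.pop()
            # This only works if `tid` still names the same logical row.
            # If the table had assigned the row a new TID in the
            # meantime, this record would now dangle.
            self.posting_lists[key].discard(tid)
            if not self.posting_lists[key]:
                del self.posting_lists[key]

idx = ToyGinIndex()
idx.insert("quick", (1, 1))  # imagine this committed earlier...
idx.undo_log.clear()         # ...so it is not subject to rollback
idx.insert("quick", (1, 2))  # the aborting transaction's entries
idx.insert("brown", (1, 2))
idx.rollback()
print(idx.posting_lists)     # -> {'quick': {(1, 1)}}
```

Note that rollback here quietly mutates posting lists that are shared with committed rows, which is exactly where the granular-locking problem comes from once concurrent inserters can also commit or abort unpredictably.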
Are you really going to put forwarding information in the index that mirrors what has happened in the table?

> To me, based on my more limited knowledge of indexing, I'm not really
> seeing a concurrency/deadlock issue, but I do see that there's going
> to be a horrid efficiency problem if page splits are common.

I'm not worried about rolling back page splits. That seems to present us with exactly the same issues as rolling back in GIN indexes reliably (i.e. problems that are practically impossible to solve, or at least don't seem worth solving).

> This seems to be a very
> general problem with making undo and indexes work nicely together:
> almost any index type has to sometimes move tuple around to different
> pages, which makes finding them a lot more expensive than re-finding a
> heap tuple.

Right. That's why undo is totally logical in indexes. And it's why you cannot expect to roll back page splits.

> I think that most of the above is a bit of a diversion from the
> original topic of the thread. I think I see the connection you're
> making between the two topics: the more likely undo application is to
> fail, the more worrying a hard limit is, and deadlocks are a way for
> undo application to fail, and if that's likely to be common when undo
> is applied to indexes, then undo failure will be common and a hard
> limit is bad.

This is an awkward thing to discuss, because it involves so many interrelated moving parts. And because I know that I could easily miss quite a bit about the zheap design. Forgive me if I've hijacked the thread.

> However, I think the solution to that problem is
> page-at-a-time undo: if foreground process needs to modify a page with
> pending undo, and if the modification it wants to make can't be done
> sensibly unless the undo is applied first, it should be prepared to
> apply that undo itself - just for that page - rather than wait for
> somebody else to get it done.
> That's important not only for deadlock
> avoidance - though deadlock avoidance is certainly a legitimate
> concern - but also because the change might be part of some gigantic
> rollback that's going to take an hour, and waiting for the undo to hit
> all the other pages before it gets to this one will make users very
> sad.

It's something that users in certain other systems (though certainly not all other systems) have had to live with for some time. SQL Server 2019 has something called "instantaneous transaction rollback", which seems to make SQL Server optionally behave a lot more like Postgres [1], apparently with many of the same disadvantages as Postgres.

I agree that there is probably a middle way that more or less has the advantages of both approaches. I don't really know what that should look like, though.

[1] https://www.microsoft.com/en-us/research/uploads/prod/2019/06/p700-antonopoulos.pdf
--
Peter Geoghegan