On Sat, Jul 15, 2017 at 3:36 PM, Alexander Korotkov <a.korot...@postgrespro.ru> wrote: > I think that pruning and vacuum are artifacts of not only current heap > formats, but they are also artifacts of current index AM API. And this is > more significant circumstance given that we're going to preserve > compatibility of new storage engines with current index AMs. Our current > index AM API assumes that we can delete from index only in bulk manner. Our > payload to index key is TID, not arbitrary piece of data. And that payload > can't be updated.
I agree that this is a big set of problems. This is where I think we can get the most benefit. One nice thing about having a fully logical TID is that you don't have to bloat indexes just because a HOT update was not possible. You bloat something else instead, certainly, but that can be optimized for garbage collection. Not all bloat is equal. MVCC based on UNDO can be very fast because UNDO is very well optimized for garbage collection, and so can be bloated with no long term consequences, and minor short term consequences. Still, it isn't fair to blame the Postgres VACUUM design for the fact that Postgres bloats indexes so easily. Nothing stops us from adding a new layer of indirection, so that bloating an index degrades into something that is not even as bad as bloating the heap [1]. We may just have a data structure problem, which is not nearly the same thing as a fundamental design problem in the storage layer. This approach could pay off soon if we start with unique indexes, where there is "queue" of row versions that can be pruned with simple logic, temporal locality helps a lot, only zero or one versions can be visible to your MVCC snapshot, etc. This might require only minimal revisions the index AM API, to help nbtree. We could later improve this so that you bloat UNDO instead of bloating a heap-like structure, both for indexes and for the actual heap. That seems less urgent. To repeat myself, for emphasis: *Not all bloat is equal*. Index bloat makes the way a B-Tree's keyspace is split up far too granular, making pages sparely packed, a problem that is more or less *irreversible* by VACUUM or any garbage collection process [2]. That's how B-Trees work -- they're optimized for concurrency, not for garbage collection. >> InnoDB isn't much like the PostgreSQL heap, and >> neither is SQL Server, IIUC. If we're creating a heap format that can >> only be different in trivial ways from what we have now, and anything >> that really changes the paradigm is off-limits, well, that's not >> really interesting enough to justify the work of creating a heap >> storage API. > > > My concern is that we probably can't do anything that really changes > paradigm while preserving compatibility with index AM API. If you don't > agree with that, it would be good to provide some examples. It seems > unlikely for me that we're going to have something like InnoDB or SQL Server > table with our current index AM API. InnoDB utilizes index-organized tables > where primary and secondary indexes are versioned independently. SQL Server > utilizes flat data structure similar to our heap, but MVCC implementation > also seems very different. I strongly agree. I simply don't understand how you can adopt UNDO for MVCC, and yet expect to get a benefit commensurate with the effort without also implementing "retail index tuple deletion" first. Pursuing UNDO this way has the same problem that WARM likely has -- it doesn't really help with the worst case, where users get big, unpleasant surprises. Postgres is probably the only major database system that doesn't support retail index tuple deletion. It's a basic thing, that has nothing to do with MVCC. Really, what do we have to lose? The biggest weakness of the current design is IMV how it fails to prevent index bloat in the first place, but avoiding bloating index leaf pages in the first place doesn't seem incompatible with how VACUUM works. Or at least, let's not assume that it is. We should avoid throwing the baby out with the bathwater. > I think in general there are two ways dealing with out index AM API > limitation. One of them is to extend index AM API. At least, we would need > a method for deletion of individual index tuple (for sure, we already have > kill_prior_tuple but it's just a hint for now). kill_prior_tuple can work well, but, like HOT, it works inconsistently, in a way that is hard to predict. > Also, it would be nice to > have arbitrary payload to index tuples instead of TID, and method to update > that payload. But that would be quite big amount of work. Alternatively, > we could allow pluggable storages to have their own index AMs, and that will > move this amount of work to the pluggable storage side. I agree with Robert that being able to store an arbitrary payload as a TID is probably not going to ever work very well. However, I don't think that that's a reason to give up on the underlying idea: creating a new layer of indirection for secondary indexes, that allows updates that now have to create new index tuples to instead just update the indirection layer metadata. You can create a mapping of a TID-like 6 byte integer to a primary key value. Techniques exist. That seems a lot more practical. Of course, TID is what is sometimes called a "physiological" identifier -- it has a "physical" component (block number) and "logical" component (item offset). Nothing I can think of prevents us from creating an alternative, entirely logical identifier that fits in the same 6 bytes. It can map to a versioning indirection layer, for unique indexes, or to a primary key value, for secondary indirect indexes. [1] postgr.es/m/CAH2-Wzmf6intNY1ggiNzOziiO5Eq=DsXfeptODGxO=2j-i1...@mail.gmail.com [2] https://wiki.postgresql.org/wiki/Key_normalization#VACUUM_and_nbtree_page_deletion -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers