On Thu, Jul 5, 2018 at 7:17 AM, Andrey Borodin <x4...@yandex-team.ru> wrote:
> Some years ago I observed a viable performance improvement (around 8% in
> inserts and 5% in selects) from covering indexes in a series of
> experiments [0].
Covering indexes are definitely a closely related idea. It kind of makes sense that we got covering indexes first, even though I think most other DB systems had suffix truncation quite early on. Suffix truncation (and being smarter about the point at which pages are split, both in general and to help suffix truncation) is important for bloat.

Index bloat, LP_DEAD killing, heap bloat, and heap pruning all seem to be related, and all benefit from the better locality of access that the TID key thing adds. The same patch that helped a lot with the composite index also seems to help an awful lot with pgbench workloads that are I/O bound and take hours to run -- I saw a 30%+ increase in transaction throughput on a benchmark that just finished up, after running for almost 24 hours. My latest work seems very promising for OLTP workloads that we've traditionally not done so well with.

I hope to be able to post a new version of my patch in the next few days, or maybe next week. I just need to do a bit more validation and clean-up. V3 will be very different from V2 and V1, since it won't just be an idea for a new architectural direction. It addresses several problems at once. This work seems to have a kind of "inventor's paradox" quality to it -- it's somehow easier to fix all the problems at once than it is to fix just one, which is very counter-intuitive.

> Indeed, but prefix truncation can reduce the whole index size dramatically,
> not by a matter of a few percent.

I made the example composite index about 15% smaller after random insertions, and I did so with code that mostly just pays close attention to the existing rules. It saves us the cost of many comparisons, and it has benefits for I/O and for bloat control. The optimization has no downside that I'm aware of. That's why I'm starting there.

Look at it like this: it's already true that pivot tuples may have values that came from non-pivot tuples that were killed long ago, since the job of pivot tuples is simply to separate the key space; the fact that a particular index tuple happened to have the exact values used for a pivot makes no difference to that index tuple. VACUUM may have removed the index tuple a year ago, without caring about the pivot tuple. As far as anybody can tell, the pivot could hold a fictitious value that never existed in the first place.

Why not *actually* start with a truly fictitious value for pivot tuples when we're doing leaf page splits? It's obvious that it's safe. We just need to be sure that the value separates the tuples on each side of the original split, and be a bit smarter about which point on the page to take as the split point. Everything else follows automatically.

Right now, the suffix truncation is not very aggressive, since it just truncates whole attributes. Some day, we should be able to truncate down to a single character within a text string. I think we need key normalization for that, though.
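To illustrate the whole-attribute case, here is a toy sketch of the idea (purely illustrative, not the patch itself -- the FakeTuple type and string-only attributes are made-up stand-ins for real index tuples and datatype-aware comparisons):

#include <string.h>

/*
 * Toy sketch: at a leaf page split, the new pivot only needs as many
 * leading attributes as it takes to tell the last tuple on the left
 * half apart from the first tuple on the right half.  Trailing
 * attributes past that point do nothing to separate the key space,
 * and can be truncated away.
 */
#define MAX_ATTS 4

typedef struct
{
    const char *atts[MAX_ATTS];
    int         natts;
} FakeTuple;

static int
pivot_attrs_needed(const FakeTuple *lastleft, const FakeTuple *firstright)
{
    int         attnum;

    for (attnum = 0; attnum < lastleft->natts; attnum++)
    {
        /* the first attribute that differs already separates the halves */
        if (strcmp(lastleft->atts[attnum], firstright->atts[attnum]) != 0)
            return attnum + 1;
    }

    /* equal on every attribute; the pivot must keep them all */
    return lastleft->natts;
}

So if the split falls between ("Smith", "accounting", "2017") on the left and ("Smith", "sales", "2018") on the right, the new pivot only needs ("Smith", "sales"); the third attribute is truncated away, and as far as anyone can tell later, it never existed.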
> If we have a foreign key from table 1 with 1M tuples to table 2 with 1K
> rows, the index size can be reduced by an order of magnitude. And this
> optimization seems very straightforward: trim the tuple's prefix if it
> matches the hikey's prefix (or some other tuple's, in the case of a leaf
> page).

Prefix compression is way more complicated than you seem to think. With key normalization, it's fairly easy to do it for pivot tuples in internal pages, because the normalized key representation is easy to reason about during page splits -- we need to pay very close attention to the use of space.

But at the leaf level, we have to use the original representation (for index-only scans), and so the space management is hard. You really need to add a "low key" on the leaf, because we need to think about the universe of possible values that can go on the page, not just the values that happen to be present right now. Otherwise, the free space management is a nightmare. (There's a toy sketch of the low key idea at the end of this mail.)

In MySQL, prefix truncation is configurable, and is often not used. I'd expect to have to make it configurable. That's a very different kind of feature.

> I see code complexity as somewhat of a downside. B-tree is kind of rocket
> science.

I know what you mean, but FWIW I see it as a matter of developing a good mental model. If you look at the code without a mental model, the complexity is overwhelming. If you develop a high-level understanding first, and try to visualize the structure, then eventually it starts to make sense, and it seems way less complicated (though still very complicated).

> Chances are you have some kind of roadmap of B-tree things to implement in
> the coming years?

Kind of. That's almost what this page is:

https://wiki.postgresql.org/wiki/Key_normalization

That page is really just about the representation of pivot tuples and the representation of individual pages, and yet that's almost the only area that I'd like to make changes to. I have a very high opinion of the nbtree code. I'm not really interested in changing the high-level rules. For example, I'm not interested in adding merging of pages that are not yet completely empty. It's way too much complexity for way too little gain. The same problem can be addressed more effectively by suffix truncation -- why not just make it so that those pages don't become half empty to begin with?

I could perhaps work on prefix compression. But probably not in the next 3 years. It's way down the list for me.
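To make the low key point concrete, here is a toy sketch in the same spirit as the one above (again purely illustrative -- the names are made up, and keys are modeled as plain strings):

#include <stddef.h>

/*
 * Toy sketch: the prefix that is safe to strip from every tuple on a
 * leaf page is the common prefix of the page's boundary keys -- the
 * low key and the high key -- because those bound every value that
 * could ever be inserted onto the page.  Computing it from the tuples
 * that happen to be present instead would mean re-expanding the whole
 * page whenever a new insertion shrinks their common prefix.
 */
static size_t
safe_prefix_len(const char *lowkey, const char *hikey)
{
    size_t      len = 0;

    while (lowkey[len] != '\0' && lowkey[len] == hikey[len])
        len++;
    return len;
}

For example, with a low key of "user_1000" and a high key of "user_1100", every tuple that can ever land on the page must share the prefix "user_1", so those bytes only need to be stored once for the page. Without the low key, there is no such guarantee about future insertions.

--
Peter Geoghegan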