On Wed, Jul 24, 2019 at 3:06 PM Peter Geoghegan <p...@bowt.ie> wrote:
> There seems to be a kind of "synergy" between the nbtsplitloc.c
> handling of pages that have lots of duplicates and posting list
> compression. It seems as if the former mechanism "sets up the bowling
> pins", while the latter mechanism "knocks them down", which is really
> cool. We should try to gain a better understanding of how that works,
> because it's possible that it could be even more effective in some
> cases.
I found another important way in which this synergy can fail to take place, which I can fix. By removing the BT_COMPRESS_THRESHOLD limit entirely, certain indexes from my test suite become much smaller, while most are unaffected. These are indexes that were not helped much by the patch before. For example, the TPC-E i_t_st_id index becomes 50% smaller. It is entirely full of duplicates of a single value (that's how it appears after an initial TPC-E bulk load), as are a couple of other TPC-E indexes. TPC-H's idx_partsupp_partkey index becomes ~18% smaller, while its idx_lineitem_orderkey index becomes ~15% smaller.

I believe that this happened because rightmost page splits were an inefficient case for compression. But indexes with lots of duplicates that split heavily on the rightmost page are not that uncommon. Think of any index with many NULL values, for example.

I don't know for sure if BT_COMPRESS_THRESHOLD should be removed; I'm not sure what the idea behind it is. My sense is that we're likely to benefit by delaying page splits, no matter what, though I am still looking at it purely from a space utilization point of view, at least for now.

--
Peter Geoghegan
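To make the question concrete, here is a toy standalone sketch of the kind of gate I assume BT_COMPRESS_THRESHOLD implements: a minimum number of items on a page before compression is attempted at all. The function name, the threshold value of 10, and the gate semantics are guesses made purely for illustration, not code from the patch.

    #include <stdbool.h>
    #include <stdio.h>

    /* Assumed value, purely for illustration */
    #define BT_COMPRESS_THRESHOLD   10

    /*
     * Decide whether to attempt posting list compression before giving
     * up and splitting the page.  With a threshold in place, pages
     * holding fewer than BT_COMPRESS_THRESHOLD items are split without
     * trying to compress; with the threshold removed, compression is
     * always tried first.
     */
    static bool
    should_try_compress(int nitems, bool use_threshold)
    {
        if (use_threshold && nitems < BT_COMPRESS_THRESHOLD)
            return false;
        return true;
    }

    int
    main(void)
    {
        for (int nitems = 1; nitems <= 12; nitems++)
            printf("nitems=%2d  with threshold: %d  without: %d\n",
                   nitems,
                   (int) should_try_compress(nitems, true),
                   (int) should_try_compress(nitems, false));
        return 0;
    }

The sketch only shows the decision, not the space accounting; the point is just that with the limit gone, the attempt to delay a split happens unconditionally.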