This email has no patch yet -- it's more of a placeholder to gather some issues into one place. During previous work on replacing vacuum's bsearch'd array for TID storage with a radix tree, Andres suggested [1] that the hash table in tidbitmap.c should also be replaced. This will hopefully illuminate how to get there.
* Current limitations Page table entries are of fixed-size. At PGCon, Mark Dilger and Andres discussed the problem that bitmap scans assume a hard-coded limit on the offset of a TID, one that's particular to heap AM. That's not a requirement of hash tables in general, but that's the current state of the implementation. Using a radix tree would smooth the way to allowing variable-length page table entries. I have briefly talked about it with colleagues offlist as well. The radix tree implementation that Masahiko and I have worked on does still currently assume fixed-sized values. (not absolutely, but fixed at compile time for a given use), but last month I did some refactoring that would make variable-sized values fairly straightforward, at least no big challenge. There would of course also be some extra complexity in doing TBM union/intersection operations etc. Recent work I did also went in the direction of storing small-enough values in the last-level pointer, saving memory (as well as time spent accessing it). That seems important, since trees do have some space overhead compared to arrays. * Iteration/ordering There are also now some unpleasant consequences that stem from hashed blocknumbers: - To get them ready for the executor the entries need to be sorted by blocknumber, and "random" is a strenuous sorting case, because of cache misses and branch mispredicts. - Pages get lossified (when necessary) more-or-less at random Radix trees maintain logical ordering, allowing for ordered iteration, so that solves the sorting problem, and should help give a performance boost. One hurdle is that shared iteration must work so that each worker can have a disjoint subset of the input. The radix tree does have shared memory support, but not yet shared iteration since there hasn't been a concrete use case. Also, DSA has a noticeable performance cost. A good interim development step is to use a local-mem radix tree for the index scan, and then move everything out to the current array for the executor, in shmem if the heap scan will be parallel. (I have coded some steps in that direction, not ready to share.) That keeps that part of the interface the same, simplifying testing. It's possible this much would work even for varlen bitmaps: the iteration array could use a "tall skinny" page table entry format, like { blockno; <metadata>; wordnum; bitmapword; } ...which would save space in many cases. Long term, we will want to move to shared memory for the radix tree, at least as a prerequisite for parallel bitmap index scan. The concurrency scheme is likely too coarse to make that worthwhile now, but that will hopefully change at some point. * Possible simplification Some of the above adds complexity, but I see a possible simplification: Many places in tidbitmap.c need to know if we have a single entry, to keep from creating the hash table. That was added before simplehash.h existed. I suspect most of the overhead now in creating the hash table is in zeroing the backing array (correct me if I'm wrong). The radix tree wouldn't do that, but it would create about half a dozen memory contexts, and inserting a single entry would allocate one or two context blocks. Neither of these are free either. If the single-entry case is still worth optimizing, it could be pushed down inside inside the radix tree as a template option that lazily creates memory contexts etc. * Multiple use Vacuum concerns started this in the first place, so it'll have to be kept in mind as we proceed. At the very least, vacuum will need a boolean to disallow lossifying pages, but the rest should work about the same. There are some other things left out, like memory management and lossy entries to work out, but this is enough to give a sense of what's involved. [1] https://www.postgresql.org/message-id/20230216164408.bcatntzzxj3jqn3q%40awork3.anarazel.de -- John Naylor EDB: http://www.enterprisedb.com