On Wed, Jan 11, 2023 at 12:13 PM John Naylor
<john.nay...@enterprisedb.com> wrote:
>
> On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> > It looks no problem in terms of vacuum integration, although I've not
> > fully tested yet. TID store uses the radix tree as the main storage,
> > and with the template radix tree, the data types for shared and
> > non-shared will be different. TID store can have a union for the
> > radix tree, and the structure would look like this:
> >
> > /* Storage for Tids */
> > union tree
> > {
> >     local_radix_tree    *local;
> >     shared_radix_tree   *shared;
> > };
>
> We could possibly go back to using a common data type for this, but with
> unused fields in each setting, as before. We would have to be more careful of
> things like the 32-bit crash from a few weeks ago.
One idea to have a common data type without unused fields is to use
radix_tree as a base class. We cast it to radix_tree_shared or
radix_tree_local depending on the is_shared flag in radix_tree. For
instance, we could have something like the following (based on the
non-template version):

struct radix_tree
{
    bool            is_shared;
    MemoryContext   context;
};

typedef struct rt_shared
{
    rt_handle   handle;
    uint32      magic;

    /* Root node */
    dsa_pointer root;

    uint64      max_val;
    uint64      num_keys;

    /* need a lwlock */

    /* statistics */
#ifdef RT_DEBUG
    int32       cnt[RT_SIZE_CLASS_COUNT];
#endif
} rt_shared;

typedef struct radix_tree_shared
{
    radix_tree  rt;

    rt_shared   *shared;
    dsa_area    *area;
} radix_tree_shared;

typedef struct radix_tree_local
{
    radix_tree  rt;

    uint64      max_val;
    uint64      num_keys;

    rt_node     *root;

    /* used only when the radix tree is private */
    MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
    MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];

    /* statistics */
#ifdef RT_DEBUG
    int32       cnt[RT_SIZE_CLASS_COUNT];
#endif
} radix_tree_local;

> > In the functions of TID store, we need to call either local or shared
> > radix tree functions depending on whether TID store is shared or not.
> > We need an if-branch for each key-value pair insertion, but I think it
> > would not be a big performance problem in TID store use cases, since
> > vacuum is an I/O intensive operation in many cases.
>
> Also, the branch will be easily predicted. That was still true in earlier
> patches, but with many more branches and fatter code paths.
>
> > Overall, I think
> > there is no problem and I'll investigate it in depth.
>
> Okay, great. If the separate-functions approach turns out to be ugly, we can
> always go back to the branching approach for shared memory. I think we'll
> want to keep this as a template overall, at least to allow different value
> types and to ease adding variable-length keys if someone finds a need.

I agree to keep this as a template.
From the vacuum integration perspective, it would be better if we could
use a common data type for shared and local. It would make sense to
have different data types if the radix trees had different value types.

> > Apart from that, I've been considering the lock support for shared
> > radix tree. As we discussed before, the current usage (i.e., only
> > parallel index vacuum) doesn't require locking support at all, so it
> > would be enough to have a single lock for simplicity.
>
> Right, that should be enough for PG16.
>
> > If we want to
> > use the shared radix tree for other use cases such as the parallel
> > heap vacuum or the replacement of the hash table for shared buffers,
> > we would need better lock support.
>
> For future parallel pruning, I still think a global lock is "probably" fine
> if the workers buffer in local arrays. Highly concurrent applications will
> need additional work, of course.
>
> > For example, if we want to support
> > Optimistic Lock Coupling[1],
>
> Interesting, from the same authors! +1
>
> > we would need to change not only the node
> > structure but also the logic, which would probably widen the gap
> > between the code for the non-shared and shared radix trees. In that
> > case, once we have a better radix tree optimized for the shared case,
> > perhaps we can replace the templated shared radix tree with it. I'd
> > like to hear your opinion on this line.
>
> I'm not in a position to speculate on how best to do scalable concurrency,
> much less how it should coexist with the local implementation. It's
> interesting that their "ROWEX" scheme gives up maintaining order in the
> linear nodes.
>
> > > One review point I'll mention: Somehow I didn't notice there is no use
> > > for the "chunk" field in the rt_node type -- it's only set to zero and
> > > copied when growing. What is the purpose? Removing it would allow the
> > > smallest node to take up only 32 bytes with a fanout of 3, by
> > > eliminating padding.
> > Oh, I didn't notice that. The chunk field was originally used when
> > redirecting the child pointer in the parent node from the old to the
> > new (grown) node. When redirecting the pointer, since the corresponding
> > chunk surely exists in the parent, we can skip existence checks.
> > Currently we use RT_NODE_UPDATE_INNER() for that (see
> > RT_REPLACE_NODE()), but having a dedicated function to update the
> > existing chunk and child pointer might improve performance. Or
> > reducing the node size by getting rid of the chunk field might be
> > better.
>
> I see. IIUC from a brief re-reading of the code, saving that chunk would only
> save us from re-loading "parent->shift" from L1 cache and shifting the key.
> The cycles spent doing that seem small compared to the rest of the work
> involved in growing a node. Expressions like "if (idx < 0) return false;"
> refer to an asserts-only variable, so in production builds, I would hope
> that branch gets elided (I haven't checked).
>
> I'm quite keen on making the smallest node padding-free (since we don't yet
> have path compression or lazy path expansion), and this seems the way to get
> there.

Okay, let's get rid of that in v18.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com