On Wed, Jan 11, 2023 at 12:13 PM John Naylor
<john.nay...@enterprisedb.com> wrote:
>
> On Tue, Jan 10, 2023 at 7:08 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> > It looks no problem in terms of vacuum integration, although I've not
> > fully tested yet. TID store uses the radix tree as the main storage,
> > and with the template radix tree, the data types for shared and
> > non-shared will be different. TID store can have an union for the
> > radix tree and the structure would be like follows:
>
> >     /* Storage for Tids */
> >     union tree
> >     {
> >         local_radix_tree    *local;
> >         shared_radix_tree   *shared;
> >     };
>
> We could possibly go back to using a common data type for this, but with 
> unused fields in each setting, as before. We would have to be more careful of 
> things like the 32-bit crash from a few weeks ago.

One idea for having a common data type without unused fields is to use
radix_tree as a base class, and cast it to radix_tree_shared or
radix_tree_local depending on the is_shared flag in radix_tree. For
instance, we could have something like the following (based on the
non-template version):

typedef struct radix_tree
{
    bool    is_shared;
    MemoryContext context;
} radix_tree;

typedef struct rt_shared
{
    rt_handle   handle;
    uint32      magic;

    /* Root node */
    dsa_pointer root;

    uint64      max_val;
    uint64      num_keys;

    /* need a lwlock */

    /* statistics */
#ifdef RT_DEBUG
    int32       cnt[RT_SIZE_CLASS_COUNT];
#endif
} rt_shared;

typedef struct radix_tree_shared
{
    radix_tree rt;

    rt_shared *shared;
    dsa_area *area;
} radix_tree_shared;

typedef struct radix_tree_local
{
    radix_tree rt;

    uint64  max_val;
    uint64  num_keys;

    rt_node *root;

    /* used only when the radix tree is private */
    MemoryContextData *inner_slabs[RT_SIZE_CLASS_COUNT];
    MemoryContextData *leaf_slabs[RT_SIZE_CLASS_COUNT];

    /* statistics */
#ifdef RT_DEBUG
    int32       cnt[RT_SIZE_CLASS_COUNT];
#endif
} radix_tree_local;
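
With this, each entry point would check is_shared and downcast to the
right variant, something like the sketch below (rt_set_shared() and
rt_set_local() are just illustrative names here, not functions in the
patch):

bool
rt_set(radix_tree *tree, uint64 key, uint64 value)
{
    if (tree->is_shared)
        return rt_set_shared((radix_tree_shared *) tree, key, value);
    else
        return rt_set_local((radix_tree_local *) tree, key, value);
}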

>
> > In the functions of TID store, we need to call either local or shared
> > radix tree functions depending on whether TID store is shared or not.
> > We need if-branch for each key-value pair insertion, but I think it
> > would not be a big performance problem in TID store use cases, since
> > vacuum is an I/O intensive operation in many cases.
>
> Also, the branch will be easily predicted. That was still true in earlier 
> patches, but with many more branches and fatter code paths.
>
> > Overall, I think
> > there is no problem and I'll investigate it in depth.
>
> Okay, great. If the separate-functions approach turns out to be ugly, we can 
> always go back to the branching approach for shared memory. I think we'll 
> want to keep this as a template overall, at least to allow different value 
> types and to ease adding variable-length keys if someone finds a need.

I agree with keeping this as a template. From the vacuum integration
perspective, it would be better if we could use a common data type for
the shared and local cases. It does make sense to have different data
types if the radix trees have different value types.

>
> > Apart from that, I've been considering the lock support for shared
> > radix tree. As we discussed before, the current usage (i.e, only
> > parallel index vacuum) doesn't require locking support at all, so it
> > would be enough to have a single lock for simplicity.
>
> Right, that should be enough for PG16.
>
> > If we want to
> > use the shared radix tree for other use cases such as the parallel
> > heap vacuum or the replacement of the hash table for shared buffers,
> > we would need better lock support.
>
> For future parallel pruning, I still think a global lock is "probably" fine 
> if the workers buffer in local arrays. Highly concurrent applications will 
> need additional work, of course.
>
> > For example, if we want to support
> > Optimistic Lock Coupling[1],
>
> Interesting, from the same authors!

+1

>
> > we would need to change not only the node
> > structure but also the logic. Which probably leads to widen the gap
> > between the code for non-shared and shared radix tree. In this case,
> > once we have a better radix tree optimized for shared case, perhaps we
> > can replace the templated shared radix tree with it. I'd like to hear
> > your opinion on this line.
>
> I'm not in a position to speculate on how best to do scalable concurrency, 
> much less how it should coexist with the local implementation. It's 
> interesting that their "ROWEX" scheme gives up maintaining order in the 
> linear nodes.

>
> > > One review point I'll mention: Somehow I didn't notice there is no use 
> > > for the "chunk" field in the rt_node type -- it's only set to zero and 
> > > copied when growing. What is the purpose? Removing it would allow the 
> > > smallest node to take up only 32 bytes with a fanout of 3, by eliminating 
> > > padding.
> >
> > Oh, I didn't notice that. The chunk field was originally used when
> > redirecting the child pointer in the parent node from old to new
> > (grown) node. When redirecting the pointer, since the corresponding
> > chunk surely exists on the parent we can skip existence checks.
> > Currently we use RT_NODE_UPDATE_INNER() for that (see
> > RT_REPLACE_NODE()) but having a dedicated function to update the
> > existing chunk and child pointer might improve the performance. Or
> > reducing the node size by getting rid of the chunk field might be
> > better.
>
> I see. IIUC from a brief re-reading of the code, saving that chunk would only 
> save us from re-loading "parent->shift" from L1 cache and shifting the key. 
> The cycles spent doing that seem small compared to the rest of the work 
> involved in growing a node. Expressions like "if (idx < 0) return false;" 
> return to an asserts-only variable, so in production builds, I would hope 
> that branch gets elided (I haven't checked).
>
> I'm quite keen on making the smallest node padding-free, (since we don't yet 
> have path compression or lazy path expansion), and this seems the way to get 
> there.

Okay, let's get rid of that in the v18 patch.
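
Just to see where the 32 bytes would go with a fanout of 3, a flattened
byte budget (purely illustrative, not the actual struct definitions,
which nest a common node header):

    uint16      count;          /* 2 bytes */
    uint8       shift;          /* 1 byte */
    uint8       kind;           /* 1 byte */
    uint8       fanout;         /* 1 byte */
    uint8       chunks[3];      /* 3 bytes -> 8 so far, no padding */
    rt_node    *children[3];    /* 24 bytes -> 32 bytes in total */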

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

