Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-05 Thread John Naylor
On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada  wrote:
>
> On Tue, Mar 5, 2024 at 6:41 PM John Naylor  wrote:

> > Done in v66-0009. I'd be curious to hear any feedback. I like the
> > aspect that the random numbers come from a different seed every time
> > the test runs.
>
> The new tests look good. Here are some comments:
>
> ---
> +   expected = keys[i];
> +   iterval = rt_iterate_next(iter, &iterkey);
>
> -   ndeleted++;
> +   EXPECT_TRUE(iterval != NULL);
> +   EXPECT_EQ_U64(iterkey, expected);
> +   EXPECT_EQ_U64(*iterval, expected);
>
> Can we verify that the iteration returns keys in ascending order?

We get the "expected" value from the keys we saved in the now-sorted
array, so we do already. Unless I misunderstand you.

> ---
> + /* reset random number generator for deletion */
> + pg_prng_seed(, seed);
>
> Why is resetting the seed required here?

Good catch -- my intention was to delete in the same random order we
inserted with. We still have the keys in the array, but they're sorted
by now. I forgot to go the extra step and use the prng when generating
the keys for deletion -- will fix.

> ---
> The radix tree (and dsa in TSET_SHARED_RT case) should be freed at the end.

Will fix.

> ---
> radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
>   "test_radix_tree",
>   ALLOCSET_DEFAULT_SIZES);
>
> We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I
> think it's better to use either one for consistency.

Will change to "small", since 32-bit platforms will use slab for leaves.

I'll look at the memory usage and estimate what 32-bit platforms will
use, and maybe adjust the number of keys. A few megabytes is fine, but
not many megabytes.

> > I'd like to push 0001 and 0002 shortly, and then do another sweep over
> > 0003, with remaining feedback, and get that in so we get some
> > buildfarm testing before the remaining polishing work on
> > tidstore/vacuum.
>
> Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to
> polish tidstore and vacuum patches and update commit messages.

Sounds good.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-05 Thread Masahiko Sawada
On Tue, Mar 5, 2024 at 6:41 PM John Naylor  wrote:
>
> On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada  wrote:
> >
> > On Fri, Feb 2, 2024 at 8:47 PM John Naylor  wrote:
>
> > > It's pretty hard to see what test_pattern() is doing, or why it's
> > > useful. I wonder if instead the test could use something like the
> > > benchmark where random integers are masked off. That seems simpler. I
> > > can work on that, but I'd like to hear your side about test_pattern().
> >
> > Yeah, test_pattern() was originally created for the integerset, so it
> > doesn't necessarily fit the radixtree. I agree with using some tests
> > from the benchmarks.
>
> Done in v66-0009. I'd be curious to hear any feedback. I like the
> aspect that the random numbers come from a different seed every time
> the test runs.

The new tests look good. Here are some comments:

---
+   expected = keys[i];
+   iterval = rt_iterate_next(iter, &iterkey);

-   ndeleted++;
+   EXPECT_TRUE(iterval != NULL);
+   EXPECT_EQ_U64(iterkey, expected);
+   EXPECT_EQ_U64(*iterval, expected);

Can we verify that the iteration returns keys in ascending order?

---
+ /* reset random number generator for deletion */
+ pg_prng_seed(, seed);

Why is resetting the seed required here?

---
The radix tree (and dsa in TSET_SHARED_RT case) should be freed at the end.

---
radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
  "test_radix_tree",
  ALLOCSET_DEFAULT_SIZES);

We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I
think it's better to use either one for consistency.

> I'd like to push 0001 and 0002 shortly, and then do another sweep over
> 0003, with remaining feedback, and get that in so we get some
> buildfarm testing before the remaining polishing work on
> tidstore/vacuum.

Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to
polish tidstore and vacuum patches and update commit messages.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-05 Thread John Naylor
On Tue, Mar 5, 2024 at 8:27 AM Masahiko Sawada  wrote:
>
> On Mon, Mar 4, 2024 at 8:48 PM John Naylor  wrote:
> >
> > On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada  
> > wrote:

> > > Resetting the tree->context seems to work. But I think we should note
> > > for callers that the dsa_area passed to RT_CREATE should be created in
> > > a different context than the context passed to RT_CREATE because
> > > otherwise RT_FREE() will also free the dsa_area. For example, the
> > > following code in test_radixtree.c will no longer work:

I've added a comment in v66-0004, which contains a number of other
small corrections and edits.

On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada  wrote:
> > Thirdly, cosmetic: With the introduction of single-value leaves, it
> > seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think?
>
> Agreed.

Done in v66-0005.

v66-0006 removes outdated tests for invalid root that somehow got left over.

> > I wrote:
> > > > Secondly, I thought about my recent work to skip checking if we first
> > > > need to create a root node, and that has a harmless (for vacuum at
> > > > least) but slightly untidy behavior: When RT_SET is first called, and
> > > > the key is bigger than 255, new nodes will go on top of the root node.
> > > > These have chunk '0'. If all subsequent keys are big enough, the
> > > > original root node will stay empty. If all keys are deleted, there will
> > > > be a chain of empty nodes remaining. Again, I believe this is
> > > > harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to
> > > > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > > > on this, but likely not today.

> > I put a little more work into this, and got it working, just needs a
> > small amount of finicky coding. I'll share tomorrow.

Done in v66-0007. I'm a bit disappointed in the extra messiness this
adds, although it's not a lot.

> > + check_stack_depth();
> > + CHECK_FOR_INTERRUPTS();
> >
> > I'm not sure why these are here: The first seems overly paranoid,
> > although harmless, but the second is probably a bad idea. Why should
> > the user be able to interrupt the freeing of memory?
>
> Good catch. We should not check the interruption there.

Removed in v66-0008.

> > Also, I'm not quite happy that RT_ITER has a copy of a pointer to the
> > tree, leading to coding like "iter->tree->ctl->root". I *think* it
> > would be easier to read if the tree was a parameter to these iteration
> > functions. That would require an API change, so the tests/tidstore
> > would have some churn. I can do that, but before trying I wanted to
> > see what you think -- is there some reason to keep the current way?
>
> I considered both usages; there are two reasons for the current style.
> I'm concerned that if we pass both the tree and RT_ITER to iteration
> functions, the caller could mistakenly pass a different tree than the
> one that was specified to create the RT_ITER. And the second reason is
> just to make it consistent with other data structures such as
> dynahash.c and dshash.c, but I now realized that in simplehash.h we
> pass both the hash table and the iterator.

Okay, then I don't think it's worth messing with at this point.

On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada  wrote:
>
> On Fri, Feb 2, 2024 at 8:47 PM John Naylor  wrote:

> > It's pretty hard to see what test_pattern() is doing, or why it's
> > useful. I wonder if instead the test could use something like the
> > benchmark where random integers are masked off. That seems simpler. I
> > can work on that, but I'd like to hear your side about test_pattern().
>
> Yeah, test_pattern() was originally created for the integerset, so it
> doesn't necessarily fit the radixtree. I agree with using some tests
> from the benchmarks.

Done in v66-0009. I'd be curious to hear any feedback. I like the
aspect that the random numbers come from a different seed every time
the test runs.

v66-0010/0011 run pgindent, the latter with one typedef added for the
test module. 0012 - 0017 are copied from v65, and I haven't done any
work on tidstore or vacuum, except for squashing most v65 follow-up
patches.

I'd like to push 0001 and 0002 shortly, and then do another sweep over
0003, with remaining feedback, and get that in so we get some
buildfarm testing before the remaining polishing work on
tidstore/vacuum.


v66-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-04 Thread Masahiko Sawada
On Mon, Mar 4, 2024 at 8:48 PM John Naylor  wrote:
>
> On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada  wrote:
> >
> > On Sun, Mar 3, 2024 at 2:43 PM John Naylor  wrote:
>
> > > Right, I should have said "reset". Resetting a context will delete
> > > its children as well, and it seems like it should work to reset the tree
> > > context, and we don't have to know whether that context actually
> > > contains leaves at all. That should allow copying "tree context" to
> > > "leaf context" in the case where we have no special context for
> > > leaves.
> >
> > Resetting the tree->context seems to work. But I think we should note
> > for callers that the dsa_area passed to RT_CREATE should be created in
> > a different context than the context passed to RT_CREATE because
> > otherwise RT_FREE() will also free the dsa_area. For example, the
> > following code in test_radixtree.c will no longer work:
> >
> > dsa = dsa_create(tranche_id);
> > radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
> > :
> > rt_free(radixtree);
> > dsa_detach(dsa); // dsa is already freed.
> >
> > So I think that a practical usage of the radix tree will be that the
> > caller  creates a memory context for a radix tree and passes it to
> > RT_CREATE().
>
> That sounds workable to me.
>
> > I've attached an update patch set:
> >
> > - 0008 updates RT_FREE_RECURSE().
>
> Thanks!
>
> > - 0009 patch is an updated version of cleanup radix tree memory handling.
>
> Looks pretty good, as does the rest. I'm going through again,
> squashing and making tiny adjustments to the template. The only thing
> not done is changing the test with many values to resemble the perf
> test more.
>
> I wrote:
> > > Secondly, I thought about my recent work to skip checking if we first
> > > need to create a root node, and that has a harmless (for vacuum at
> > > least) but slightly untidy behavior: When RT_SET is first called, and
> > > the key is bigger than 255, new nodes will go on top of the root node.
> > > These have chunk '0'. If all subsequent keys are big enough, the
> > > original root node will stay empty. If all keys are deleted, there will
> > > be a chain of empty nodes remaining. Again, I believe this is
> > > harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to
> > > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > > on this, but likely not today.
> >
> > This turns out to be a lot trickier than it looked, so it seems best
> > to allow a trivial amount of waste, as long as it's documented
> > somewhere. It also wouldn't be terrible to re-add those branches,
> > since they're highly predictable.
>
> I put a little more work into this, and got it working, just needs a
> small amount of finicky coding. I'll share tomorrow.
>
> I have a question about RT_FREE_RECURSE:
>
> + check_stack_depth();
> + CHECK_FOR_INTERRUPTS();
>
> I'm not sure why these are here: The first seems overly paranoid,
> although harmless, but the second is probably a bad idea. Why should
> the user be able to interrupt the freeing of memory?

Good catch. We should not check the interruption there.

> Also, I'm not quite happy that RT_ITER has a copy of a pointer to the
> tree, leading to coding like "iter->tree->ctl->root". I *think* it
> would be easier to read if the tree was a parameter to these iteration
> functions. That would require an API change, so the tests/tidstore
> would have some churn. I can do that, but before trying I wanted to
> see what you think -- is there some reason to keep the current way?

I considered both usages; there are two reasons for the current style.
I'm concerned that if we pass both the tree and RT_ITER to iteration
functions, the caller could mistakenly pass a different tree than the
one that was specified to create the RT_ITER. And the second reason is
just to make it consistent with other data structures such as
dynahash.c and dshash.c, but I now realized that in simplehash.h we
pass both the hash table and the iterator.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-04 Thread John Naylor
On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada  wrote:
>
> On Sun, Mar 3, 2024 at 2:43 PM John Naylor  wrote:

> > Right, I should have said "reset". Resetting a context will delete
> > its children as well, and it seems like it should work to reset the tree
> > context, and we don't have to know whether that context actually
> > contains leaves at all. That should allow copying "tree context" to
> > "leaf context" in the case where we have no special context for
> > leaves.
>
> Resetting the tree->context seems to work. But I think we should note
> for callers that the dsa_area passed to RT_CREATE should be created in
> a different context than the context passed to RT_CREATE because
> otherwise RT_FREE() will also free the dsa_area. For example, the
> following code in test_radixtree.c will no longer work:
>
> dsa = dsa_create(tranche_id);
> radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
> :
> rt_free(radixtree);
> dsa_detach(dsa); // dsa is already freed.
>
> So I think that a practical usage of the radix tree will be that the
> caller  creates a memory context for a radix tree and passes it to
> RT_CREATE().

That sounds workable to me.

> I've attached an update patch set:
>
> - 0008 updates RT_FREE_RECURSE().

Thanks!

> - 0009 patch is an updated version of cleanup radix tree memory handling.

Looks pretty good, as does the rest. I'm going through again,
squashing and making tiny adjustments to the template. The only thing
not done is changing the test with many values to resemble the perf
test more.

I wrote:
> > Secondly, I thought about my recent work to skip checking if we first
> > need to create a root node, and that has a harmless (for vacuum at
> > least) but slightly untidy behavior: When RT_SET is first called, and
> > the key is bigger than 255, new nodes will go on top of the root node.
> > These have chunk '0'. If all subsequent keys are big enough, the
> > original root node will stay empty. If all keys are deleted, there will
> > be a chain of empty nodes remaining. Again, I believe this is
> > harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to
> > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > on this, but likely not today.
>
> This turns out to be a lot trickier than it looked, so it seems best
> to allow a trivial amount of waste, as long as it's documented
> somewhere. It also wouldn't be terrible to re-add those branches,
> since they're highly predictable.

I put a little more work into this, and got it working, just needs a
small amount of finicky coding. I'll share tomorrow.

I have a question about RT_FREE_RECURSE:

+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();

I'm not sure why these are here: The first seems overly paranoid,
although harmless, but the second is probably a bad idea. Why should
the user be able to interrupt the freeing of memory?

Also, I'm not quite happy that RT_ITER has a copy of a pointer to the
tree, leading to coding like "iter->tree->ctl->root". I *think* it
would be easier to read if the tree was a parameter to these iteration
functions. That would require an API change, so the tests/tidstore
would have some churn. I can do that, but before trying I wanted to
see what you think -- is there some reason to keep the current way?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-03 Thread Masahiko Sawada
On Sun, Mar 3, 2024 at 2:43 PM John Naylor  wrote:
>
> On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada  wrote:
> >
> > On Thu, Feb 29, 2024 at 8:43 PM John Naylor  wrote:
>
> > > + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
> > > +"tidstore storage",
> > >
> > > "tidstore storage" sounds a bit strange -- maybe look at some other
> > > context names for ideas.
> >
> > Agreed. How about "tidstore's radix tree"?
>
> That might be okay. I'm now thinking "TID storage". On that note, one
> improvement needed when we polish tidstore.c is to make sure it's
> spelled "TID" in comments, like other files do already.
>
> > > - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, 
> > > allocsize);
> > > + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
> > > +? tree->leaf_ctx
> > > +: tree->context, allocsize);
> > >
> > > Instead of branching here, can we copy "context" to "leaf_ctx" when
> > > necessary (those names should look more like each other, btw)? I think
> > > that means anything not covered by this case:
> > >
> > > +#ifndef RT_VARLEN_VALUE_SIZE
> > > + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
> > > + tree->leaf_ctx = SlabContextCreate(ctx,
> > > +RT_STR(RT_PREFIX) "radix_tree leaf contex",
> > > +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> > > +sizeof(RT_VALUE_TYPE));
> > > +#endif /* !RT_VARLEN_VALUE_SIZE */
> > >
> > > ...also, we should document why we're using slab here. On that, I
> > > don't recall why we are? We've never had a fixed-length type test case
> > > on 64-bit, so it wasn't because it won through benchmarking. It seems
> > > a hold-over from the days of "multi-value leaves". Is it to avoid the
> > > possibility of space wastage with non-power-of-two size types?
> >
> > Yes, it matches my understanding.
>
> There are two issues quoted here, so not sure if you mean both or only
> the last one...

I meant only the last one.

>
> For the latter, I'm not sure it makes sense to have code and #ifdef's
> to force slab for large-enough fixed-length values just because we
> can. There may never be such a use-case anyway. I'm also not against
> it, either, but it seems like a premature optimization.

Reading the old threads, using a slab context for leaves originally
came from Andres's prototype patch, and was meant to avoid aset.c
rounding allocations up to the next power of 2. It makes sense to me
to use a slab context for this case. To measure the effect of using a
slab, I've updated bench_radix_tree so it uses a large fixed-length
value. The struct I used is:

typedef struct mytype
{
    uint64  a;
    uint64  b;
    uint64  c;
    uint64  d;
    char    e[100];
} mytype;

The struct size is 136 bytes with padding, just above a power-of-2.
The simple benchmark test showed that using a slab context for leaves
is more space-efficient. The results are:

slab:
=# select * from bench_load_random_int(100);
 mem_allocated | load_ms
---------------+---------
     405643264 |     560
(1 row)

aset:
=# select * from bench_load_random_int(100);
 mem_allocated | load_ms
---------------+---------
         52792 |     576
(1 row)

>
> > > For this stanza that remains unchanged:
> > >
> > > for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
> > > {
> > >   MemoryContextDelete(tree->node_slabs[i]);
> > > }
> > >
> > > if (tree->leaf_ctx)
> > > {
> > >   MemoryContextDelete(tree->leaf_ctx);
> > > }
> > >
> > > ...is there a reason we can't just delete tree->ctx, and let that
> > > recursively delete child contexts?
> >
> > I thought that considering the RT_CREATE doesn't create its own memory
> > context but just uses the passed context, it might be a bit unsuitable
> > to delete the passed context in the radix tree code. For example, if a
> > caller creates a radix tree (or tidstore) on a memory context and
> > wants to recreate it again and again, he also needs to re-create the
> > memory context together. It might be okay if we leave comments on
> > RT_CREATE as a side effect, though. This is the same reason why we
> > don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(),
>
> Right, I should have said "reset". Resetting a context will delete
> its children as well, and it seems like it should work to reset the tree
> context, and we don't have to know whether that context actually
> contains leaves at all. That should allow copying "tree context" to
> "leaf context" in the case where we have no special context for
> leaves.

Resetting the tree->context seems to work. But I think we should note
for callers that the dsa_area passed to RT_CREATE should be created in
a different context than the context passed to RT_CREATE because
otherwise RT_FREE() will also free the dsa_area. For example, the
following code in test_radixtree.c will no longer work:

dsa = dsa_create(tranche_id);
radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
:
rt_free(radixtree);
dsa_detach(dsa); // dsa is already freed.

So I think that a 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-02 Thread John Naylor
On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada  wrote:
>
> On Thu, Feb 29, 2024 at 8:43 PM John Naylor  wrote:

> > + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
> > +"tidstore storage",
> >
> > "tidstore storage" sounds a bit strange -- maybe look at some other
> > context names for ideas.
>
> Agreed. How about "tidstore's radix tree"?

That might be okay. I'm now thinking "TID storage". On that note, one
improvement needed when we polish tidstore.c is to make sure it's
spelled "TID" in comments, like other files do already.

> > - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
> > + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
> > +? tree->leaf_ctx
> > +: tree->context, allocsize);
> >
> > Instead of branching here, can we copy "context" to "leaf_ctx" when
> > necessary (those names should look more like each other, btw)? I think
> > that means anything not covered by this case:
> >
> > +#ifndef RT_VARLEN_VALUE_SIZE
> > + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
> > + tree->leaf_ctx = SlabContextCreate(ctx,
> > +RT_STR(RT_PREFIX) "radix_tree leaf contex",
> > +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> > +sizeof(RT_VALUE_TYPE));
> > +#endif /* !RT_VARLEN_VALUE_SIZE */
> >
> > ...also, we should document why we're using slab here. On that, I
> > don't recall why we are? We've never had a fixed-length type test case
> > on 64-bit, so it wasn't because it won through benchmarking. It seems
> > a hold-over from the days of "multi-value leaves". Is it to avoid the
> > possibility of space wastage with non-power-of-two size types?
>
> Yes, it matches my understanding.

There are two issues quoted here, so not sure if you mean both or only
the last one...

For the latter, I'm not sure it makes sense to have code and #ifdef's
to force slab for large-enough fixed-length values just because we
can. There may never be such a use-case anyway. I'm also not against
it, either, but it seems like a premature optimization.

> > For this stanza that remains unchanged:
> >
> > for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
> > {
> >   MemoryContextDelete(tree->node_slabs[i]);
> > }
> >
> > if (tree->leaf_ctx)
> > {
> >   MemoryContextDelete(tree->leaf_ctx);
> > }
> >
> > ...is there a reason we can't just delete tree->ctx, and let that
> > recursively delete child contexts?
>
> I thought that considering the RT_CREATE doesn't create its own memory
> context but just uses the passed context, it might be a bit unsuitable
> to delete the passed context in the radix tree code. For example, if a
> caller creates a radix tree (or tidstore) on a memory context and
> wants to recreate it again and again, he also needs to re-create the
> memory context together. It might be okay if we leave comments on
> RT_CREATE as a side effect, though. This is the same reason why we
> don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(),

Right, I should have said "reset". Resetting a context will delete
its children as well, and it seems like it should work to reset the tree
context, and we don't have to know whether that context actually
contains leaves at all. That should allow copying "tree context" to
"leaf context" in the case where we have no special context for
leaves.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-03-01 Thread Masahiko Sawada
On Thu, Feb 29, 2024 at 8:43 PM John Naylor  wrote:
>
> On Tue, Feb 20, 2024 at 1:59 PM Masahiko Sawada  wrote:
>
> > - v63-0008 patch fixes a bug in tidstore.
>
> - page->nwords = wordnum + 1;
> - Assert(page->nwords = WORDS_PER_PAGE(offsets[num_offsets - 1]));
> + page->nwords = wordnum;
> + Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1]));
>
> Yikes, I'm guessing this failed in non-assert builds? I wonder why
> my compiler didn't yell at me... Have you tried a tidstore-debug build
> without asserts?

Yes. I didn't get any failures.

>
> > - v63-0009 patch is a draft idea of cleanup memory context handling.
>
> Thanks, looks pretty good!
>
> + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
> +"tidstore storage",
>
> "tidstore storage" sounds a bit strange -- maybe look at some other
> context names for ideas.

Agreed. How about "tidstore's radix tree"?

>
> - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
> + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
> +? tree->leaf_ctx
> +: tree->context, allocsize);
>
> Instead of branching here, can we copy "context" to "leaf_ctx" when
> necessary (those names should look more like each other, btw)? I think
> that means anything not covered by this case:
>
> +#ifndef RT_VARLEN_VALUE_SIZE
> + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
> + tree->leaf_ctx = SlabContextCreate(ctx,
> +RT_STR(RT_PREFIX) "radix_tree leaf contex",
> +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> +sizeof(RT_VALUE_TYPE));
> +#endif /* !RT_VARLEN_VALUE_SIZE */
>
> ...also, we should document why we're using slab here. On that, I
> don't recall why we are? We've never had a fixed-length type test case
> on 64-bit, so it wasn't because it won through benchmarking. It seems
> a hold-over from the days of "multi-value leaves". Is it to avoid the
> possibility of space wastage with non-power-of-two size types?

Yes, it matches my understanding.

>
> For this stanza that remains unchanged:
>
> for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
> {
>   MemoryContextDelete(tree->node_slabs[i]);
> }
>
> if (tree->leaf_ctx)
> {
>   MemoryContextDelete(tree->leaf_ctx);
> }
>
> ...is there a reason we can't just delete tree->ctx, and let that
> recursively delete child contexts?

I thought that considering the RT_CREATE doesn't create its own memory
context but just uses the passed context, it might be a bit unsuitable
to delete the passed context in the radix tree code. For example, if a
caller creates a radix tree (or tidstore) on a memory context and
wants to recreate it again and again, he also needs to re-create the
memory context together. It might be okay if we leave comments on
RT_CREATE as a side effect, though. This is the same reason why we
don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(),

On Fri, Mar 1, 2024 at 1:15 PM John Naylor  wrote:
>
> I'm looking at RT_FREE_RECURSE again (only used for DSA memory), and
> I'm not convinced it's freeing all the memory. It's been many months
> since we discussed this last, but IIRC we cannot just tell DSA to free
> all its segments, right?

Right.

>  Is there currently anything preventing us
> from destroying the whole DSA area at once?

When it comes to tidstore and parallel vacuum, we initialize DSA and
create a tidstore there at the beginning of the lazy vacuum, and
recreate the tidstore again after the heap vacuum. So I don't want to
destroy the whole DSA when destroying the tidstore. Otherwise, we will
need to create a new DSA and pass its handle somehow.

Probably the bitmap scan case is similar. Given that bitmap scan
(re)creates tidbitmap in the same DSA multiple times, it's better to
avoid freeing the whole DSA.

>
> + /* The last level node has pointers to values */
> + if (shift == 0)
> + {
> +   dsa_free(tree->dsa, ptr);
> +   return;
> + }
>
> IIUC, this doesn't actually free leaves, it only frees the last-level
> node. And, this function is unaware of whether children could be
> embedded values. I'm thinking we need to get rid of the above
> pre-check and instead, each node kind to have something like (e.g.
> node4):
>
> RT_PTR_ALLOC child = n4->children[i];
>
> if (shift > 0)
>   RT_FREE_RECURSE(tree, child, shift - RT_SPAN);
> else if (!RT_CHILDPTR_IS_VALUE(child))
>   dsa_free(tree->dsa, child);
>
> ...or am I missing something?

You're not missing anything. RT_FREE_RECURSE() has not been updated
for a long time. If we still need to use RT_FREE_RECURSE(), it should
be updated.

> Thirdly, cosmetic: With the introduction of single-value leaves, it
> seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think?

Agreed.

On Fri, Mar 1, 2024 at 3:58 PM John Naylor  wrote:
>
> I wrote:
>
> > Secondly, I thought about my recent work to skip checking if we first
> > need to create a root node, and that has a harmless (for vacuum at
> > least) but slightly untidy behavior: When RT_SET is first called, and
> > 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-29 Thread John Naylor
I wrote:

> Secondly, I thought about my recent work to skip checking if we first
> need to create a root node, and that has a harmless (for vacuum at
> least) but slightly untidy behavior: When RT_SET is first called, and
> the key is bigger than 255, new nodes will go on top of the root node.
> These have chunk '0'. If all subsequent keys are big enough, the
> original root node will stay empty. If all keys are deleted, there will
> be a chain of empty nodes remaining. Again, I believe this is
> harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to
> call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> on this, but likely not today.

This turns out to be a lot trickier than it looked, so it seems best
to allow a trivial amount of waste, as long as it's documented
somewhere. It also wouldn't be terrible to re-add those branches,
since they're highly predictable.

I just noticed there are a lot of unused function parameters
(referring to parent slots) leftover from a few weeks ago. Those are
removed in v64-0009. 0010 makes the obvious name change in those
remaining to "parent_slot". 0011 is a simplification in two places
regarding reserving slots. This should be a bit easier to read and
possibly makes it easier on the compiler.


v64-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-29 Thread John Naylor
I'm looking at RT_FREE_RECURSE again (only used for DSA memory), and
I'm not convinced it's freeing all the memory. It's been many months
since we discussed this last, but IIRC we cannot just tell DSA to free
all its segments, right? Is there currently anything preventing us
from destroying the whole DSA area at once?

+ /* The last level node has pointers to values */
+ if (shift == 0)
+ {
+   dsa_free(tree->dsa, ptr);
+   return;
+ }

IIUC, this doesn't actually free leaves, it only frees the last-level
node. And, this function is unaware of whether children could be
embedded values. I'm thinking we need to get rid of the above
pre-check and instead, each node kind to have something like (e.g.
node4):

RT_PTR_ALLOC child = n4->children[i];

if (shift > 0)
  RT_FREE_RECURSE(tree, child, shift - RT_SPAN);
else if (!RT_CHILDPTR_IS_VALUE(child))
  dsa_free(tree->dsa, child);

...or am I missing something?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-29 Thread John Naylor
On Tue, Feb 20, 2024 at 1:59 PM Masahiko Sawada  wrote:

> - v63-0008 patch fixes a bug in tidstore.

- page->nwords = wordnum + 1;
- Assert(page->nwords = WORDS_PER_PAGE(offsets[num_offsets - 1]));
+ page->nwords = wordnum;
+ Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1]));

Yikes, I'm guessing this failed in non-assert builds? I wonder why
my compiler didn't yell at me... Have you tried a tidstore-debug build
without asserts?

> - v63-0009 patch is a draft idea of cleanup memory context handling.

Thanks, looks pretty good!

+ ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
+"tidstore storage",

"tidstore storage" sounds a bit strange -- maybe look at some other
context names for ideas.

- leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
+ leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
+? tree->leaf_ctx
+: tree->context, allocsize);

Instead of branching here, can we copy "context" to "leaf_ctx" when
necessary (those names should look more like each other, btw)? I think
that means anything not covered by this case:

+#ifndef RT_VARLEN_VALUE_SIZE
+ if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
+ tree->leaf_ctx = SlabContextCreate(ctx,
+RT_STR(RT_PREFIX) "radix_tree leaf contex",
+RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
+sizeof(RT_VALUE_TYPE));
+#endif /* !RT_VARLEN_VALUE_SIZE */

...also, we should document why we're using slab here. On that, I
don't recall why we are? We've never had a fixed-length type test case
on 64-bit, so it wasn't because it won through benchmarking. It seems
a hold-over from the days of "multi-value leaves". Is it to avoid the
possibility of space wastage with non-power-of-two size types?
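One way to avoid the branch at the allocation site would be a fallback
assignment at create time; a pseudocode sketch, not the actual patch:

```c
/* at create time: */
tree->leaf_ctx = ctx;			/* default: leaves share the main context */
#ifndef RT_VARLEN_VALUE_SIZE
if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
	tree->leaf_ctx = SlabContextCreate(ctx, ...);
#endif

/* the allocation site then stays branch-free: */
leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
```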

For this stanza that remains unchanged:

for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
{
  MemoryContextDelete(tree->node_slabs[i]);
}

if (tree->leaf_ctx)
{
  MemoryContextDelete(tree->leaf_ctx);
}

...is there a reason we can't just delete tree->ctx, and let that
recursively delete child contexts?

Secondly, I thought about my recent work to skip checking if we first
need to create a root node, and that has a harmless (for vacuum at
least) but slightly untidy behavior: When RT_SET is first called, and
the key is bigger than 255, new nodes will go on top of the root node.
These have chunk '0'. If all subsequent keys are big enough, the
original root node will stay empty. If all keys are deleted, there will
be a chain of empty nodes remaining. Again, I believe this is
harmless, but to make things tidy, it should be easy to teach RT_EXTEND_UP to
call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
on this, but likely not today.

Thirdly, cosmetic: With the introduction of single-value leaves, it
seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-19 Thread Masahiko Sawada
On Mon, Feb 19, 2024 at 7:47 PM John Naylor  wrote:
>
> On Mon, Feb 19, 2024 at 9:02 AM Masahiko Sawada  wrote:
> >
> > I think that vacuum and tidbitmap (and future users) would end up
> > having the same max block size calculation. And it seems slightly odd
> > layering to me that max-block-size-specified context is created on
> > vacuum (or tidbitmap) layer, a varlen-value radix tree is created by
> > tidstore layer, and the passed context is used for leaves (if
> > varlen-value is used) on radix tree layer.
>
> That sounds slightly more complicated than I was thinking of, but we
> could actually be talking about the same thing: I'm drawing a
> distinction between "used = must be detected / #ifdef'd" and "used =
> actually happens to call allocation". I meant that the passed context
> would _always_ be used for leaves, regardless of varlen or not. So
> with fixed-length values short enough to live in child pointer slots,
> that context would still be used for iteration etc.
>
> > Another idea is to create a
> > max-block-size-specified context on the tidstore layer. That is,
> > vacuum and tidbitmap pass a work_mem and a flag indicating whether the
> tidstore can use the bump context, and tidstore creates an (aset or
> bump) memory context with the calculated max block size and passes it
> > to the radix tree.
>
> That might be a better abstraction since both uses have some memory limit.

I've drafted this idea, and fixed a bug in tidstore.c. Here is the
summary of updates from v62:

- removed v62-0007 patch as we discussed
- squashed v62-0006 and v62-0008 patches into 0003 patch
- v63-0008 patch fixes a bug in tidstore.
- v63-0009 patch is a draft idea of cleanup memory context handling.

>
> > As for using the bump memory context, I feel that we need to store
> > iterator struct in aset context at least as it can be individually
> > freed and re-created. Or it might not be necessary to allocate the
> > iterator struct in the same context as radix tree.
>
> Okay, that's one thing I was concerned about. Since we don't actually
> have a bump context yet, it seems simple to assume aset for non-nodes,
> and if we do get it, we can adjust slightly. Anyway, this seems like a
> good thing to try to clean up, but it's also not a show-stopper.
>
> On that note: I will be going on honeymoon shortly, and then to PGConf
> India, so I will have sporadic connectivity for the next 10 days and
> won't be doing any hacking during that time.

Thank you for letting us know. Enjoy yourself!

Regards

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v63-ART.tar.gz
Description: GNU Zip compressed data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-19 Thread John Naylor
On Mon, Feb 19, 2024 at 9:02 AM Masahiko Sawada  wrote:
>
> I think that vacuum and tidbitmap (and future users) would end up
> having the same max block size calculation. And it seems slightly odd
> layering to me that max-block-size-specified context is created on
> vacuum (or tidbitmap) layer, a varlen-value radix tree is created by
> tidstore layer, and the passed context is used for leaves (if
> varlen-value is used) on radix tree layer.

That sounds slightly more complicated than I was thinking of, but we
could actually be talking about the same thing: I'm drawing a
distinction between "used = must be detected / #ifdef'd" and "used =
actually happens to call allocation". I meant that the passed context
would _always_ be used for leaves, regardless of varlen or not. So
with fixed-length values short enough to live in child pointer slots,
that context would still be used for iteration etc.

> Another idea is to create a
> max-block-size-specified context on the tidstore layer. That is,
> vacuum and tidbitmap pass a work_mem and a flag indicating whether the
> tidstore can use the bump context, and tidstore creates an (aset or
> bump) memory context with the calculated max block size and passes it
> to the radix tree.

That might be a better abstraction since both uses have some memory limit.

> As for using the bump memory context, I feel that we need to store
> iterator struct in aset context at least as it can be individually
> freed and re-created. Or it might not be necessary to allocate the
> iterator struct in the same context as radix tree.

Okay, that's one thing I was concerned about. Since we don't actually
have a bump context yet, it seems simple to assume aset for non-nodes,
and if we do get it, we can adjust slightly. Anyway, this seems like a
good thing to try to clean up, but it's also not a show-stopper.

On that note: I will be going on honeymoon shortly, and then to PGConf
India, so I will have sporadic connectivity for the next 10 days and
won't be doing any hacking during that time.

Andres, did you want to take a look at the radix tree patch 0003?
Aside from the above possible cleanup, most of it should be stable.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-18 Thread Masahiko Sawada
On Fri, Feb 16, 2024 at 12:41 PM John Naylor  wrote:
>
> On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada  
> wrote:
> > > v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> > > reduce memory regressions, so should be considered. Up to 3 tids can
> > > be stored in the last level child pointer. It's not polished, but I'll
> > only proceed with that if we think we need this. "flags" is called
> > > that because it could hold tidbitmap.c booleans (recheck, lossy) in
> > > the future, in addition to reserving space for the pointer tag. Note:
> > > I hacked the tests to only have 2 offsets per block to demo, but of
> > > course both paths should be tested.
> >
> > Interesting. I've run the same benchmark tests we did[1][2] (the
> > median of 3 runs):
> >
> > monotonically ordered int column index:
> >
> > master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
> > v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
> > v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s
>
> Hmm, that's strange -- this test is intended to delete all records
> from the last 20% of the blocks, so I wouldn't expect any improvement
> here, only in the sparse case. Maybe something is wrong. All the more
> reason to put it off...

Okay, let's dig into it more deeply later.

>
> > I'm happy to see a huge improvement. While it's really fascinating to
> > me, I'm concerned about the time left until the feature freeze. We
> > need to polish both tidstore and vacuum integration patches in 5
> > weeks. Personally I'd like to have it as a separate patch for now, and
> > focus on completing the main three patches since we might face some
> > issues after pushing these patches. I think with 0007 patch it's a big
> > win but it's still a win even without 0007 patch.
>
> Agreed to not consider it for initial commit. I'll hold on to it for
> some future time.
>
> > > 2. Management of memory contexts. It's pretty verbose and messy. I
> > > think the abstraction could be better:
> > > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
> > > can't destroy or reset it. That means we have to do a lot of manual
> > > work.
> > > B: Passing "max_bytes" to the radix tree was my idea, I believe, but
> > > it seems the wrong responsibility. Not all uses will have a
> > > work_mem-type limit, I'm guessing. We only use it for limiting the max
> > > block size, and aset's default 8MB is already plenty small for
> > > vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so
> > > smaller, and there it makes sense to limit the max blocksize this way.
> > > C: The context for values has complex #ifdefs based on the value
> > > length/varlen, but it's both too much and not enough. If we get a bump
> > > context, how would we shoehorn that in for values for vacuum but not
> > > for tidbitmap?
> > >
> > > Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to
> > > TidStoreCreate(), and then to RT_CREATE. That context will contain the
> > > values (for local mem), and the node slabs will be children of the
> > > value context. That way, measuring memory usage and free-ing can just
> > > call with this parent context, and let recursion handle the rest.
> > > Perhaps the passed context can also hold the radix-tree struct, but
> > > I'm not sure since I haven't tried it. What do you think?
> >
> > If I understand your idea correctly, RT_CREATE() creates the context
> > for values as a child of the passed context and the node slabs as
> > children of the value context. That way, measuring memory usage can
> > just call with the value context. It sounds like a good idea. But it
> > was not clear to me how to address point B and C.
>
> For B & C, vacuum would create a context to pass to TidStoreCreate,
> and it wouldn't need to bother changing max block size. RT_CREATE
> would use that directly for leaves (if any), and would only create
> child slab contexts under it. It would not need to know about
> max_bytes. Modifying your diagram a bit, something like:
>
> - caller-supplied radix tree memory context (the 3 structs -- and
> leaves, if any) (aset (or future bump?))
> - node slab contexts
>
> This might only be workable with aset, if we need to individually free
> the structs. (I haven't studied this, it was a recent idea)
> It's simpler, because with small fixed length values, we don't need to
> detect that and avoid creating a leaf context. All leaves would live
> in the same context as the structs.

Thank you for the explanation.

I think that vacuum and tidbitmap (and future users) would end up
having the same max block size calculation. And it seems slightly odd
layering to me that max-block-size-specified context is created on
vacuum (or tidbitmap) layer, a varlen-value radix tree is created by
tidstore layer, and the passed context is used for leaves (if
varlen-value is used) on radix tree layer. Another idea is to create a
max-block-size-specified 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-15 Thread John Naylor
On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada  wrote:
> > v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> > reduce memory regressions, so should be considered. Up to 3 tids can
> > be stored in the last level child pointer. It's not polished, but I'll
> > only proceed with that if we think we need this. "flags" is called
> > that because it could hold tidbitmap.c booleans (recheck, lossy) in
> > the future, in addition to reserving space for the pointer tag. Note:
> > I hacked the tests to only have 2 offsets per block to demo, but of
> > course both paths should be tested.
>
> Interesting. I've run the same benchmark tests we did[1][2] (the
> median of 3 runs):
>
> monotonically ordered int column index:
>
> master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
> v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
> v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s

Hmm, that's strange -- this test is intended to delete all records
from the last 20% of the blocks, so I wouldn't expect any improvement
here, only in the sparse case. Maybe something is wrong. All the more
reason to put it off...

> I'm happy to see a huge improvement. While it's really fascinating to
> me, I'm concerned about the time left until the feature freeze. We
> need to polish both tidstore and vacuum integration patches in 5
> weeks. Personally I'd like to have it as a separate patch for now, and
> focus on completing the main three patches since we might face some
> issues after pushing these patches. I think with 0007 patch it's a big
> win but it's still a win even without 0007 patch.

Agreed to not consider it for initial commit. I'll hold on to it for
some future time.

> > 2. Management of memory contexts. It's pretty verbose and messy. I
> > think the abstraction could be better:
> > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
> > can't destroy or reset it. That means we have to do a lot of manual
> > work.
> > B: Passing "max_bytes" to the radix tree was my idea, I believe, but
> > it seems the wrong responsibility. Not all uses will have a
> > work_mem-type limit, I'm guessing. We only use it for limiting the max
> > block size, and aset's default 8MB is already plenty small for
> > vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so
> > smaller, and there it makes sense to limit the max blocksize this way.
> > C: The context for values has complex #ifdefs based on the value
> > length/varlen, but it's both too much and not enough. If we get a bump
> > context, how would we shoehorn that in for values for vacuum but not
> > for tidbitmap?
> >
> > Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to
> > TidStoreCreate(), and then to RT_CREATE. That context will contain the
> > values (for local mem), and the node slabs will be children of the
> > value context. That way, measuring memory usage and free-ing can just
> > call with this parent context, and let recursion handle the rest.
> > Perhaps the passed context can also hold the radix-tree struct, but
> > I'm not sure since I haven't tried it. What do you think?
>
> If I understand your idea correctly, RT_CREATE() creates the context
> for values as a child of the passed context and the node slabs as
> children of the value context. That way, measuring memory usage can
> just call with the value context. It sounds like a good idea. But it
> was not clear to me how to address point B and C.

For B & C, vacuum would create a context to pass to TidStoreCreate,
and it wouldn't need to bother changing max block size. RT_CREATE
would use that directly for leaves (if any), and would only create
child slab contexts under it. It would not need to know about
max_bytes. Modifying your diagram a bit, something like:

- caller-supplied radix tree memory context (the 3 structs -- and
leaves, if any) (aset (or future bump?))
- node slab contexts

This might only be workable with aset, if we need to individually free
the structs. (I haven't studied this, it was a recent idea)
It's simpler, because with small fixed length values, we don't need to
detect that and avoid creating a leaf context. All leaves would live
in the same context as the structs.

> Another variant of this idea would be that RT_CREATE() creates the
> parent context of the value context to store radix-tree struct. That
> is, the hierarchy would be like:
>
> A MemoryContext (passed by vacuum through tidstore)
> - radix tree memory context (store radx-tree struct, control
> struct, and iterator)
> - value context (aset, slab, or bump)
> - node slab contexts

The template handling the value context here is complex, and is what I
meant by 'C' above. Most fixed length allocations in all of the
backend are aset, so it seems fine to use it always.

> Freeing can just call with the radix tree memory context. And perhaps
> it works even if tidstore passes 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-15 Thread Masahiko Sawada
On Thu, Feb 15, 2024 at 8:26 PM John Naylor  wrote:
>
> On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada  
> wrote:
> >
> > On Sat, Feb 10, 2024 at 9:29 PM John Naylor  wrote:
>
> > I've also run the same scripts in my environment just in case and got
> > similar results:
>
> Thanks for testing, looks good as well.
>
> > > There are still some micro-benchmarks we could do on tidstore, and
> > > it'd be good to find out worst-case memory use (1 dead tuple each on
> > > spread-out pages), but this is decent demonstration.
> >
> > I've tested a simple case where vacuum removes 33k dead tuples spread
> > about every 10 pages.
> >
> > master:
> > 198,000 bytes (=33000 * 6)
> > system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s
> >
> > v-59:
> > 2,834,432 bytes (reported by TidStoreMemoryUsage())
> > system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s
>
> The memory usage for the sparse case may be a concern, although it's
> not bad -- a multiple of something small is probably not huge in
> practice. See below for an option we have for this.
>
> > > > > I'm pretty sure there's an
> > > > > accidental memset call that crept in there, but I'm running out of
> > > > > steam today.
> > >
> > > I have just a little bit of work to add for v59:
> > >
> > > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero
> > > any bitmapwords. That can only happen if e.g. there is an offset > 128
> > > and there are none between 64 and 128, so not a huge deal but I think
> > > it's a bit nicer in this patch.
> >
> > LGTM.
>
> Okay, I've squashed this.
>
> > I've drafted the commit message.
>
> Thanks, this is a good start.
>
> > I've run regression tests with valgrind and run the coverity scan, and
> > I don't see critical issues.
>
> Great!
>
> Now, I think we're in pretty good shape. There are a couple of things
> that might be objectionable, so I want to try to improve them in the
> little time we have:
>
> 1. Memory use for the sparse case. I shared an idea a few months ago
> of how runtime-embeddable values (true combined pointer-value slots)
> could work for tids. I don't think this is a must-have, but it's not a
> lot of code, and I have this working:
>
> v61-0006: Preparatory refactoring -- I think we should do this anyway,
> since the intent seems more clear to me.

The refactoring looks good to me.

> v61-0007: Runtime-embeddable tids -- Optional for v17, but should
> reduce memory regressions, so should be considered. Up to 3 tids can
> be stored in the last level child pointer. It's not polished, but I'll
> only proceed with that if we think we need this. "flags" is called
> that because it could hold tidbitmap.c booleans (recheck, lossy) in
> the future, in addition to reserving space for the pointer tag. Note:
> I hacked the tests to only have 2 offsets per block to demo, but of
> course both paths should be tested.

Interesting. I've run the same benchmark tests we did[1][2] (the
median of 3 runs):

monotonically ordered int column index:

master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s

uuid column index:

master: system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s
v-59: system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s
v-62: system usage: CPU: user: 4.06 s, system: 0.98 s, elapsed: 5.06 s

int & uuid indexes in parallel:

master: system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s
v-59: system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s
v-62: system usage: CPU: user: 2.54 s, system: 0.94 s, elapsed: 6.00 s

sparse case:

master:
198,000 bytes (=33000 * 6)
system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s

v-59:
2,834,432 bytes (reported by TidStoreMemoryUsage())
system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s

v-62:
729,088 bytes (reported by TidStoreMemoryUsage())
system usage: CPU: user: 4.63 s, system: 0.86 s, elapsed: 5.50 s

I'm happy to see a huge improvement. While it's really fascinating to
me, I'm concerned about the time left until the feature freeze. We
need to polish both tidstore and vacuum integration patches in 5
weeks. Personally I'd like to have it as a separate patch for now, and
focus on completing the main three patches since we might face some
issues after pushing these patches. I think with 0007 patch it's a big
win but it's still a win even without 0007 patch.

>
> 2. Management of memory contexts. It's pretty verbose and messy. I
> think the abstraction could be better:
> A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
> can't destroy or reset it. That means we have to do a lot of manual
> work.
> B: Passing "max_bytes" to the radix tree was my idea, I believe, but
> it seems the wrong responsibility. Not all uses will have a
> work_mem-type limit, I'm 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-15 Thread John Naylor
v61 had a brown-paper-bag bug in the embedded-tids patch that didn't
show up in the tidstore test but caused vacuum to fail; fixed in v62.


v62-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-15 Thread John Naylor
On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada  wrote:
>
> On Sat, Feb 10, 2024 at 9:29 PM John Naylor  wrote:

> I've also run the same scripts in my environment just in case and got
> similar results:

Thanks for testing, looks good as well.

> > There are still some micro-benchmarks we could do on tidstore, and
> > it'd be good to find out worst-case memory use (1 dead tuple each on
> > spread-out pages), but this is decent demonstration.
>
> I've tested a simple case where vacuum removes 33k dead tuples spread
> about every 10 pages.
>
> master:
> 198,000 bytes (=33000 * 6)
> system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s
>
> v-59:
> 2,834,432 bytes (reported by TidStoreMemoryUsage())
> system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s

The memory usage for the sparse case may be a concern, although it's
not bad -- a multiple of something small is probably not huge in
practice. See below for an option we have for this.

> > > > I'm pretty sure there's an
> > > > accidental memset call that crept in there, but I'm running out of
> > > > steam today.
> >
> > I have just a little bit of work to add for v59:
> >
> > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero
> > any bitmapwords. That can only happen if e.g. there is an offset > 128
> > and there are none between 64 and 128, so not a huge deal but I think
> > it's a bit nicer in this patch.
>
> LGTM.

Okay, I've squashed this.

> I've drafted the commit message.

Thanks, this is a good start.

> I've run regression tests with valgrind and run the coverity scan, and
> I don't see critical issues.

Great!

Now, I think we're in pretty good shape. There are a couple of things
that might be objectionable, so I want to try to improve them in the
little time we have:

1. Memory use for the sparse case. I shared an idea a few months ago
of how runtime-embeddable values (true combined pointer-value slots)
could work for tids. I don't think this is a must-have, but it's not a
lot of code, and I have this working:

v61-0006: Preparatory refactoring -- I think we should do this anyway,
since the intent seems more clear to me.
v61-0007: Runtime-embeddable tids -- Optional for v17, but should
reduce memory regressions, so should be considered. Up to 3 tids can
be stored in the last level child pointer. It's not polished, but I'll
only proceed with that if we think we need this. "flags" is called
that because it could hold tidbitmap.c booleans (recheck, lossy) in
the future, in addition to reserving space for the pointer tag. Note:
I hacked the tests to only have 2 offsets per block to demo, but of
course both paths should be tested.

2. Management of memory contexts. It's pretty verbose and messy. I
think the abstraction could be better:
A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we
can't destroy or reset it. That means we have to do a lot of manual
work.
B: Passing "max_bytes" to the radix tree was my idea, I believe, but
it seems the wrong responsibility. Not all uses will have a
work_mem-type limit, I'm guessing. We only use it for limiting the max
block size, and aset's default 8MB is already plenty small for
vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so
smaller, and there it makes sense to limit the max blocksize this way.
C: The context for values has complex #ifdefs based on the value
length/varlen, but it's both too much and not enough. If we get a bump
context, how would we shoehorn that in for values for vacuum but not
for tidbitmap?

Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to
TidStoreCreate(), and then to RT_CREATE. That context will contain the
values (for local mem), and the node slabs will be children of the
value context. That way, measuring memory usage and free-ing can just
call with this parent context, and let recursion handle the rest.
Perhaps the passed context can also hold the radix-tree struct, but
I'm not sure since I haven't tried it. What do you think?
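In rough pseudocode (the call shapes here are hypothetical), the ownership
would look like:

```c
/*
 * caller-supplied context   -- values + radix-tree structs live here
 *   `- node slab contexts   -- created by RT_CREATE as children
 */
ctx = AllocSetContextCreate(CurrentMemoryContext, "tidstore", ...);
ts  = TidStoreCreate(ctx, ...);			/* passed through to RT_CREATE */

MemoryContextMemAllocated(ctx, true);	/* recursion measures everything */
MemoryContextDelete(ctx);				/* ...and frees everything */
```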

With this resolved, I think the radix tree is pretty close to
committable. The tid store will likely need some polish yet, but no
major issues I know of.

(And, finally, a small thing I that I wanted to share just so I don't
forget, but maybe not worth the attention: In Andres's prototype,
there is a comment wondering if an update can skip checking if it
first needs to create a root node. This is pretty easy, and done in
v61-0008.)


v61-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-14 Thread Masahiko Sawada
On Sat, Feb 10, 2024 at 9:29 PM John Naylor  wrote:
>
> On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada  wrote:
> >
> > On Fri, Feb 2, 2024 at 8:47 PM John Naylor  wrote:
>
> > > My todo:
> > > - benchmark tid store / vacuum again, since we haven't since varlen
> > > types and removing unnecessary locks.
>
> I ran a vacuum benchmark similar to the one in [1] (unlogged tables
> for reproducibility), but smaller tables (100 million records),
> deleting only the last 20% of the table, and including a parallel
> vacuum test. Scripts attached.
>
> monotonically ordered int column index:
>
> master:
> system usage: CPU: user: 4.27 s, system: 0.41 s, elapsed: 4.70 s
> system usage: CPU: user: 4.23 s, system: 0.44 s, elapsed: 4.69 s
> system usage: CPU: user: 4.26 s, system: 0.39 s, elapsed: 4.66 s
>
> v-59:
> system usage: CPU: user: 3.10 s, system: 0.44 s, elapsed: 3.56 s
> system usage: CPU: user: 3.07 s, system: 0.35 s, elapsed: 3.43 s
> system usage: CPU: user: 3.07 s, system: 0.36 s, elapsed: 3.44 s
>
> uuid column index:
>
> master:
> system usage: CPU: user: 18.22 s, system: 1.70 s, elapsed: 20.01 s
> system usage: CPU: user: 17.70 s, system: 1.70 s, elapsed: 19.48 s
> system usage: CPU: user: 18.48 s, system: 1.59 s, elapsed: 20.43 s
>
> v-59:
> system usage: CPU: user: 5.18 s, system: 1.18 s, elapsed: 6.45 s
> system usage: CPU: user: 6.56 s, system: 1.39 s, elapsed: 7.99 s
> system usage: CPU: user: 6.51 s, system: 1.44 s, elapsed: 8.05 s
>
> int & uuid indexes in parallel:
>
> master:
> system usage: CPU: user: 4.53 s, system: 1.22 s, elapsed: 20.43 s
> system usage: CPU: user: 4.49 s, system: 1.29 s, elapsed: 20.98 s
> system usage: CPU: user: 4.46 s, system: 1.33 s, elapsed: 20.50 s
>
> v59:
> system usage: CPU: user: 2.09 s, system: 0.32 s, elapsed: 4.86 s
> system usage: CPU: user: 3.76 s, system: 0.51 s, elapsed: 8.92 s
> system usage: CPU: user: 3.83 s, system: 0.54 s, elapsed: 9.09 s
>
> Over all, I'm pleased with these results, although I'm confused why
> sometimes with the patch the first run reports running faster than the
> others. I'm curious what others get. Traversing a tree that lives in
> DSA has some overhead, as expected, but still comes out way ahead of
> master.

Thanks! That's a great improvement.

I've also run the same scripts in my environment just in case and got
similar results:

monotonically ordered int column index:

master:
system usage: CPU: user: 14.81 s, system: 0.90 s, elapsed: 15.74 s
system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s
system usage: CPU: user: 14.85 s, system: 0.70 s, elapsed: 15.57 s

v-59:
system usage: CPU: user: 9.47 s, system: 1.04 s, elapsed: 10.53 s
system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s
system usage: CPU: user: 9.59 s, system: 0.86 s, elapsed: 10.47 s

uuid column index:

master:
system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s
system usage: CPU: user: 28.05 s, system: 1.37 s, elapsed: 29.47 s
system usage: CPU: user: 28.46 s, system: 1.36 s, elapsed: 29.88 s

v-59:
system usage: CPU: user: 14.87 s, system: 1.13 s, elapsed: 16.02 s
system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s
system usage: CPU: user: 10.96 s, system: 1.24 s, elapsed: 12.22 s

int & uuid indexes in parallel:

master:
system usage: CPU: user: 15.81 s, system: 1.43 s, elapsed: 34.31 s
system usage: CPU: user: 15.84 s, system: 1.41 s, elapsed: 34.34 s
system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s

v-59:
system usage: CPU: user: 10.93 s, system: 0.92 s, elapsed: 17.59 s
system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s
system usage: CPU: user: 10.90 s, system: 1.01 s, elapsed: 17.45 s

>
> There are still some micro-benchmarks we could do on tidstore, and
> it'd be good to find out worst-case memory use (1 dead tuple each on
> spread-out pages), but this is decent demonstration.

I've tested a simple case where vacuum removes 33k dead tuples that
are spread out, roughly one every 10 pages.

master:
198,000 bytes (=33000 * 6)
system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s

v-59:
2,834,432 bytes (reported by TidStoreMemoryUsage())
system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s

>
> > > I'm not sure what the test_node_types_* functions are testing that
> > > test_basic doesn't. They have a different, and confusing, way to stop
> > > at every size class and check the keys/values. It seems we can replace
> > > all that with two more calls (asc/desc) to test_basic, with the
> > > maximum level.
>
> v58-0008:
>
> + /* borrowed from RT_MAX_SHIFT */
> + const int max_shift = (pg_leftmost_one_pos64(UINT64_MAX) /
> BITS_PER_BYTE) * BITS_PER_BYTE;
>
> This is harder to read than "64 - 8", and doesn't really help
> maintainability either.
> Maybe "(sizeof(uint64) - 1) * BITS_PER_BYTE" is a good compromise.
>
> + /* leaf nodes */
> + test_basic(test_info, 0);
>
> + /* internal nodes */
> + test_basic(test_info, 8);
> +
> + /* max-level nodes */
> + 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-10 Thread John Naylor
On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada  wrote:
>
> On Fri, Feb 2, 2024 at 8:47 PM John Naylor  wrote:

> > My todo:
> > - benchmark tid store / vacuum again, since we haven't since varlen
> > types and removing unnecessary locks.

I ran a vacuum benchmark similar to the one in [1] (unlogged tables
for reproducibility), but smaller tables (100 million records),
deleting only the last 20% of the table, and including a parallel
vacuum test. Scripts attached.

monotonically ordered int column index:

master:
system usage: CPU: user: 4.27 s, system: 0.41 s, elapsed: 4.70 s
system usage: CPU: user: 4.23 s, system: 0.44 s, elapsed: 4.69 s
system usage: CPU: user: 4.26 s, system: 0.39 s, elapsed: 4.66 s

v-59:
system usage: CPU: user: 3.10 s, system: 0.44 s, elapsed: 3.56 s
system usage: CPU: user: 3.07 s, system: 0.35 s, elapsed: 3.43 s
system usage: CPU: user: 3.07 s, system: 0.36 s, elapsed: 3.44 s

uuid column index:

master:
system usage: CPU: user: 18.22 s, system: 1.70 s, elapsed: 20.01 s
system usage: CPU: user: 17.70 s, system: 1.70 s, elapsed: 19.48 s
system usage: CPU: user: 18.48 s, system: 1.59 s, elapsed: 20.43 s

v-59:
system usage: CPU: user: 5.18 s, system: 1.18 s, elapsed: 6.45 s
system usage: CPU: user: 6.56 s, system: 1.39 s, elapsed: 7.99 s
system usage: CPU: user: 6.51 s, system: 1.44 s, elapsed: 8.05 s

int & uuid indexes in parallel:

master:
system usage: CPU: user: 4.53 s, system: 1.22 s, elapsed: 20.43 s
system usage: CPU: user: 4.49 s, system: 1.29 s, elapsed: 20.98 s
system usage: CPU: user: 4.46 s, system: 1.33 s, elapsed: 20.50 s

v59:
system usage: CPU: user: 2.09 s, system: 0.32 s, elapsed: 4.86 s
system usage: CPU: user: 3.76 s, system: 0.51 s, elapsed: 8.92 s
system usage: CPU: user: 3.83 s, system: 0.54 s, elapsed: 9.09 s

Overall, I'm pleased with these results, although I'm confused why
sometimes with the patch the first run reports running faster than the
others. I'm curious what others get. Traversing a tree that lives in
DSA has some overhead, as expected, but still comes out way ahead of
master.

There are still some micro-benchmarks we could do on tidstore, and
it'd be good to find out worst-case memory use (1 dead tuple each on
spread-out pages), but this is a decent demonstration.

> > I'm not sure what the test_node_types_* functions are testing that
> > test_basic doesn't. They have a different, and confusing, way to stop
> > at every size class and check the keys/values. It seems we can replace
> > all that with two more calls (asc/desc) to test_basic, with the
> > maximum level.

v58-0008:

+ /* borrowed from RT_MAX_SHIFT */
+ const int max_shift = (pg_leftmost_one_pos64(UINT64_MAX) /
BITS_PER_BYTE) * BITS_PER_BYTE;

This is harder to read than "64 - 8", and doesn't really help
maintainability either.
Maybe "(sizeof(uint64) - 1) * BITS_PER_BYTE" is a good compromise.

+ /* leaf nodes */
+ test_basic(test_info, 0);

+ /* internal nodes */
+ test_basic(test_info, 8);
+
+ /* max-level nodes */
+ test_basic(test_info, max_shift);

This three-way terminology is not very informative. How about:

+   /* a tree with one level, i.e. a single node under the root node. */
 ...
+   /* a tree with two levels */
 ...
+   /* a tree with the maximum number of levels */

+static void
+test_basic(rt_node_class_test_elem *test_info, int shift)
+{
+ elog(NOTICE, "testing node %s with shift %d", test_info->class_name, shift);
+
+ /* Test nodes while changing the key insertion order */
+ do_test_basic(test_info->nkeys, shift, false);
+ do_test_basic(test_info->nkeys, shift, true);

Adding a level of indirection makes this harder to read, and do we
still know whether a test failed with ascending or descending keys?

> > My earlier opinion was that "handle" was a nicer variable name, but
> > this brings back the typedef and also keeps the variable name I didn't
> > like, but pushes it down into the function. I'm a bit confused, so
> > I've kept these not-squashed for now.
>
> I misunderstood your comment. I've changed it to use the variable name
> rt_handle and removed the TidStoreHandle type (0013 patch).

(diff against an earlier version)
-   pvs->shared->dead_items_handle = TidStoreGetHandle(dead_items);
+   pvs->shared->dead_items_dp = TidStoreGetHandle(dead_items);

Shall we use "handle" in vacuum_parallel.c as well?

> > I'm pretty sure there's an
> > accidental memset call that crept in there, but I'm running out of
> > steam today.

I have just a little bit of work to add for v59:

v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero
any bitmapwords. That can only happen if e.g. there is an offset > 128
and there are none between 64 and 128, so not a huge deal but I think
it's a bit nicer in this patch.

> > >  * WIP: notes about traditional radix tree trading off span vs height...
> > >
> > > Are you going to write it?
> >
> > Yes, when I draft a rough commit message, (for next time).

I haven't gotten to the commit message, but:

v59-0004 - I did some 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-05 Thread Masahiko Sawada
On Fri, Feb 2, 2024 at 8:47 PM John Naylor  wrote:
>
> On Wed, Jan 31, 2024 at 12:50 PM Masahiko Sawada  
> wrote:
> > I've attached the new patch set (v56). I've squashed previous updates
> > and addressed review comments on v55 in separate patches. Here is the
> > update summary:
> >
> > 0004: fix compiler warning caught by ci test.
> > 0005-0008: address review comments on radix tree codes.
> > 0009: cleanup #define and #undef
> > 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is
> > undefined after including radixtree.h so we should not use it in test
> > code.
>
> Great, thanks!
>
> I have a few questions and comments on v56, then I'll address yours
> below with the attached v57, which is mostly cosmetic adjustments.

Thank you for the comments! I've squashed previous updates and your changes.

>
> v56-0003:
>
> (Looking closer at tests)
>
> +static const bool rt_test_stats = false;
>
> I'm thinking we should just remove everything that depends on this,
> and keep this module entirely about correctness.

Agreed. Removed in 0006 patch.

>
> + for (int shift = 0; shift <= (64 - 8); shift += 8)
> + test_node_types(shift);
>
> I'm not sure what the test_node_types_* functions are testing that
> test_basic doesn't. They have a different, and confusing, way to stop
> at every size class and check the keys/values. It seems we can replace
> all that with two more calls (asc/desc) to test_basic, with the
> maximum level.

Agreed, addressed in 0007 patch.

>
> It's pretty hard to see what test_pattern() is doing, or why it's
> useful. I wonder if instead the test could use something like the
> benchmark where random integers are masked off. That seems simpler. I
> can work on that, but I'd like to hear your side about test_pattern().

Yeah, test_pattern() was originally created for the integerset, so it
doesn't necessarily fit the radix tree. I agree with using some tests
from the benchmarks.

>
> v56-0007:
>
> + *
> + * Since we can rely on DSA_AREA_LOCK to get the total amount of DSA memory,
> + * the caller doesn't need to take a lock.
>
> Maybe something like "Since dsa_get_total_size() does appropriate locking 
> ..."?

Agreed. Fixed in 0005 patch.

>
> v56-0008
>
> Thanks, I like how the tests look now.
>
> -NOTICE:  testing node   4 with height 0 and  ascending keys
> ...
> +NOTICE:  testing node   1 with height 0 and  ascending keys
>
> Now that the number is not intended to match a size class, "node X"
> seems out of place. Maybe we could have a separate array with strings?
>
> + 1, /* RT_CLASS_4 */
>
> This should be more than one, so that the basic test still exercises
> paths that shift elements around.
>
> + 100, /* RT_CLASS_48 */
>
> This node currently holds 64 for local memory.
>
> + 255 /* RT_CLASS_256 */
>
> This is the only one where we know exactly how many it can take, so
> may as well keep it at 256.

Fixed in 0008 patch.

>
> v56-0012:
>
> The test module for tidstore could use a few more comments.

Addressed in 0012 patch.

>
> v56-0015:
>
> +typedef dsa_pointer TidStoreHandle;
> +
>
> -TidStoreAttach(dsa_area *area, dsa_pointer rt_dp)
> +TidStoreAttach(dsa_area *area, TidStoreHandle handle)
>  {
>   TidStore *ts;
> + dsa_pointer rt_dp = handle;
>
> My earlier opinion was that "handle" was a nicer variable name, but
> this brings back the typedef and also keeps the variable name I didn't
> like, but pushes it down into the function. I'm a bit confused, so
> I've kept these not-squashed for now.

I misunderstood your comment. I've changed it to use the variable name
rt_handle and removed the TidStoreHandle type (0013 patch).

>
> ---
>
> Now, for v57:
>
> > Looking at overall changes, there are still XXX and TODO comments in
> > radixtree.h:
>
> That's fine, as long as it's intentional as a message to readers. That
> said, we can get rid of some:
>
> > ---
> >  * XXX There are 4 node kinds, and this should never be increased,
> >  * for several reasons:
> >  * 1. With 5 or more kinds, gcc tends to use a jump table for switch
> >  *statements.
> >  * 2. The 4 kinds can be represented with 2 bits, so we have the option
> >  *in the future to tag the node pointer with the kind, even on
> >  *platforms with 32-bit pointers. This might speed up node traversal
> >  *in trees with highly random node kinds.
> >  * 3. We can have multiple size classes per node kind.
> >
> > Can we just remove "XXX"?
>
> How about "NOTE"?

Agreed.

>
> > ---
> >  * WIP: notes about traditional radix tree trading off span vs height...
> >
> > Are you going to write it?
>
> Yes, when I draft a rough commit message, (for next time).

Thanks!

>
> > ---
> > #ifdef RT_SHMEM
> > /*  WIP: do we really need this? */
> > typedef dsa_pointer RT_HANDLE;
> > #endif
> >
> > I think it's worth having it.
>
> Okay, removed WIP in v57-0004.
>
> > ---
> >  * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-02-02 Thread John Naylor
On Wed, Jan 31, 2024 at 12:50 PM Masahiko Sawada  wrote:
> I've attached the new patch set (v56). I've squashed previous updates
> and addressed review comments on v55 in separate patches. Here is the
> update summary:
>
> 0004: fix compiler warning caught by ci test.
> 0005-0008: address review comments on radix tree codes.
> 0009: cleanup #define and #undef
> 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is
> undefined after including radixtree.h so we should not use it in test
> code.

Great, thanks!

I have a few questions and comments on v56, then I'll address yours
below with the attached v57, which is mostly cosmetic adjustments.

v56-0003:

(Looking closer at tests)

+static const bool rt_test_stats = false;

I'm thinking we should just remove everything that depends on this,
and keep this module entirely about correctness.

+ for (int shift = 0; shift <= (64 - 8); shift += 8)
+ test_node_types(shift);

I'm not sure what the test_node_types_* functions are testing that
test_basic doesn't. They have a different, and confusing, way to stop
at every size class and check the keys/values. It seems we can replace
all that with two more calls (asc/desc) to test_basic, with the
maximum level.

It's pretty hard to see what test_pattern() is doing, or why it's
useful. I wonder if instead the test could use something like the
benchmark where random integers are masked off. That seems simpler. I
can work on that, but I'd like to hear your side about test_pattern().

v56-0007:

+ *
+ * Since we can rely on DSA_AREA_LOCK to get the total amount of DSA memory,
+ * the caller doesn't need to take a lock.

Maybe something like "Since dsa_get_total_size() does appropriate locking ..."?

v56-0008

Thanks, I like how the tests look now.

-NOTICE:  testing node   4 with height 0 and  ascending keys
...
+NOTICE:  testing node   1 with height 0 and  ascending keys

Now that the number is not intended to match a size class, "node X"
seems out of place. Maybe we could have a separate array with strings?

+ 1, /* RT_CLASS_4 */

This should be more than one, so that the basic test still exercises
paths that shift elements around.

+ 100, /* RT_CLASS_48 */

This node currently holds 64 for local memory.

+ 255 /* RT_CLASS_256 */

This is the only one where we know exactly how many it can take, so
may as well keep it at 256.

v56-0012:

The test module for tidstore could use a few more comments.

v56-0015:

+typedef dsa_pointer TidStoreHandle;
+

-TidStoreAttach(dsa_area *area, dsa_pointer rt_dp)
+TidStoreAttach(dsa_area *area, TidStoreHandle handle)
 {
  TidStore *ts;
+ dsa_pointer rt_dp = handle;

My earlier opinion was that "handle" was a nicer variable name, but
this brings back the typedef and also keeps the variable name I didn't
like, but pushes it down into the function. I'm a bit confused, so
I've kept these not-squashed for now.

---

Now, for v57:

> Looking at overall changes, there are still XXX and TODO comments in
> radixtree.h:

That's fine, as long as it's intentional as a message to readers. That
said, we can get rid of some:

> ---
>  * XXX There are 4 node kinds, and this should never be increased,
>  * for several reasons:
>  * 1. With 5 or more kinds, gcc tends to use a jump table for switch
>  *statements.
>  * 2. The 4 kinds can be represented with 2 bits, so we have the option
>  *in the future to tag the node pointer with the kind, even on
>  *platforms with 32-bit pointers. This might speed up node traversal
>  *in trees with highly random node kinds.
>  * 3. We can have multiple size classes per node kind.
>
> Can we just remove "XXX"?

How about "NOTE"?

> ---
>  * WIP: notes about traditional radix tree trading off span vs height...
>
> Are you going to write it?

Yes, when I draft a rough commit message, (for next time).

> ---
> #ifdef RT_SHMEM
> /*  WIP: do we really need this? */
> typedef dsa_pointer RT_HANDLE;
> #endif
>
> I think it's worth having it.

Okay, removed WIP in v57-0004.

> ---
>  * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
>  * inside a single bitmapword on most platforms, so it's a good starting
>  * point. We can make it higher if we need to.
>  */
> #define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4)
>
> Are you going to work something on this?

Hard-coded 64 for readability, and changed this paragraph to explain
the current rationale more clearly:

"The paper uses at most 64 for this node kind, and one advantage for us
is that "isset" is a single bitmapword on most platforms, rather than
an array, allowing the compiler to get rid of loops."

> ---
> /* WIP: We could go first to the higher node16 size class */
> newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO);
>
> Does it mean to go to RT_CLASS_16_HI and then further go to
> RT_CLASS_16_LO upon further deletion?

Yes. It wouldn't be much work to make 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-30 Thread Masahiko Sawada
On Tue, Jan 30, 2024 at 7:20 PM John Naylor  wrote:
>
> On Tue, Jan 30, 2024 at 7:56 AM Masahiko Sawada  wrote:
> >
> > On Mon, Jan 29, 2024 at 8:48 PM John Naylor  wrote:
>
> > > I meant the macro could probably be
> > >
> > > Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N)
> > >
> > > (Right now N=32). I also realize I didn't answer your question earlier
> > > about block sizes being powers of two. I was talking about PG in
> > > general -- I was thinking all block sizes were powers of two. If
> > > that's true, I'm not sure if it's because programmers find the macro
> > > calculations easy to reason about, or if there was an implementation
> > > reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or
> > > just above a power of two, so if we did  round that up it would be
> > > 128kB.
> >
> > Thank you for your explanation. It might be better to follow other
> > codes. Does the calculation below make sense to you?
> >
> > RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
> > Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE;
> > while (inner_blocksize < 32 * size_class.allocsize)
> >  inner_blocksize <<= 1;
>
> It does make sense, but we can do it more simply:
>
> Max(SLAB_DEFAULT_BLOCK_SIZE, pg_nextpower2_32(size * 32))

Thanks!

I've attached the new patch set (v56). I've squashed previous updates
and addressed review comments on v55 in separate patches. Here is the
update summary:

0004: fix compiler warning caught by ci test.
0005-0008: address review comments on radix tree codes.
0009: cleanup #define and #undef
0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is
undefined after including radixtree.h so we should not use it in test
code.
0013-0015: address review comments on tidstore codes.
0017-0018: address review comments on vacuum integration codes.

Looking at overall changes, there are still XXX and TODO comments in
radixtree.h:

---
 * XXX There are 4 node kinds, and this should never be increased,
 * for several reasons:
 * 1. With 5 or more kinds, gcc tends to use a jump table for switch
 *statements.
 * 2. The 4 kinds can be represented with 2 bits, so we have the option
 *in the future to tag the node pointer with the kind, even on
 *platforms with 32-bit pointers. This might speed up node traversal
 *in trees with highly random node kinds.
 * 3. We can have multiple size classes per node kind.

Can we just remove "XXX"?

---
 * WIP: notes about traditional radix tree trading off span vs height...

Are you going to write it?

---
#ifdef RT_SHMEM
/*  WIP: do we really need this? */
typedef dsa_pointer RT_HANDLE;
#endif

I think it's worth having it.

---
 * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
 * inside a single bitmapword on most platforms, so it's a good starting
 * point. We can make it higher if we need to.
 */
#define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4)

Are you going to work something on this?

---
/* WIP: We could go first to the higher node16 size class */
newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO);

Does it mean to go to RT_CLASS_16_HI and then further go to
RT_CLASS_16_LO upon further deletion?

---
 * TODO: The current locking mechanism is not optimized for high concurrency
 * with mixed read-write workloads. In the future it might be worthwhile
 * to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
 * the paper "The ART of Practical Synchronization" by the same authors as
 * the ART paper, 2016.

I think it's not TODO for now, but a future improvement. We can remove it.

---
/* TODO: consider 5 with subclass 1 or 2. */
#define RT_FANOUT_4 4

Is there something we need to do here?

---
/*
 * Return index of the chunk and slot arrays for inserting into the node,
 * such that the chunk array remains ordered.
 * TODO: Improve performance for non-SIMD platforms.
 */

Are you going to work on this?

---
/* Delete the element at 'idx' */
/*  TODO: replace slow memmove's */

Are you going to work on this?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v56-ART.tar.gz
Description: GNU Zip compressed data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-30 Thread John Naylor
On Tue, Jan 30, 2024 at 7:56 AM Masahiko Sawada  wrote:
>
> On Mon, Jan 29, 2024 at 8:48 PM John Naylor  wrote:

> > I meant the macro could probably be
> >
> > Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N)
> >
> > (Right now N=32). I also realize I didn't answer your question earlier
> > about block sizes being powers of two. I was talking about PG in
> > general -- I was thinking all block sizes were powers of two. If
> > that's true, I'm not sure if it's because programmers find the macro
> > calculations easy to reason about, or if there was an implementation
> > reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or
> > just above a power of two, so if we did  round that up it would be
> > 128kB.
>
> Thank you for your explanation. It might be better to follow other
> codes. Does the calculation below make sense to you?
>
> RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
> Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE;
> while (inner_blocksize < 32 * size_class.allocsize)
>  inner_blocksize <<= 1;

It does make sense, but we can do it more simply:

Max(SLAB_DEFAULT_BLOCK_SIZE, pg_nextpower2_32(size * 32))




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-29 Thread Masahiko Sawada
On Mon, Jan 29, 2024 at 8:48 PM John Naylor  wrote:
>
> On Mon, Jan 29, 2024 at 2:29 PM Masahiko Sawada  wrote:
>
> > > > +/*
> > > > + * Calculate the slab blocksize so that we can allocate at least 32 
> > > > chunks
> > > > + * from the block.
> > > > + */
> > > > +#define RT_SLAB_BLOCK_SIZE(size) \
> > > > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
> > > >
> > > > The first parameter seems to be trying to make the block size exact,
> > > > but that's not right, because of the chunk header, and maybe
> > > > alignment. If the default block size is big enough to waste only a
> > > > tiny amount of space, let's just use that as-is.
>
> > If we use SLAB_DEFAULT_BLOCK_SIZE (8kB) for each node class, we waste
> > [snip]
> > We might want to calculate a better slab block size for node256 at least.
>
> I meant the macro could probably be
>
> Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N)
>
> (Right now N=32). I also realize I didn't answer your question earlier
> about block sizes being powers of two. I was talking about PG in
> general -- I was thinking all block sizes were powers of two. If
> that's true, I'm not sure if it's because programmers find the macro
> calculations easy to reason about, or if there was an implementation
> reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or
> just above a power of two, so if we did  round that up it would be
> 128kB.

Thank you for your explanation. It might be better to follow other
codes. Does the calculation below make sense to you?

RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i];
Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE;
while (inner_blocksize < 32 * size_class.allocsize)
 inner_blocksize <<= 1;

As for the lock mode in dsa.c, I've posted a question[1].

Regards,

[1] 
https://www.postgresql.org/message-id/CAD21AoALgrU2sGWzgq%2B6G9X0ynqyVOjMR5_k4HgsGRWae1j%3DwQ%40mail.gmail.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-29 Thread John Naylor
On Mon, Jan 29, 2024 at 2:29 PM Masahiko Sawada  wrote:

> > > +/*
> > > + * Calculate the slab blocksize so that we can allocate at least 32 
> > > chunks
> > > + * from the block.
> > > + */
> > > +#define RT_SLAB_BLOCK_SIZE(size) \
> > > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
> > >
> > > The first parameter seems to be trying to make the block size exact,
> > > but that's not right, because of the chunk header, and maybe
> > > alignment. If the default block size is big enough to waste only a
> > > tiny amount of space, let's just use that as-is.

> If we use SLAB_DEFAULT_BLOCK_SIZE (8kB) for each node class, we waste
> [snip]
> We might want to calculate a better slab block size for node256 at least.

I meant the macro could probably be

Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N)

(Right now N=32). I also realize I didn't answer your question earlier
about block sizes being powers of two. I was talking about PG in
general -- I was thinking all block sizes were powers of two. If
that's true, I'm not sure if it's because programmers find the macro
calculations easy to reason about, or if there was an implementation
reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or
just above a power of two, so if we did round that up it would be
128kB.

> > > + * TODO: The caller must be certain that no other backend will attempt to
> > > + * access the TidStore before calling this function. Other backend must
> > > + * explicitly call TidStoreDetach() to free up backend-local memory 
> > > associated
> > > + * with the TidStore. The backend that calls TidStoreDestroy() must not 
> > > call
> > > + * TidStoreDetach().
> > >
> > > Do we need to do anything now?
> >
> > No, will remove it.
> >
>
> I misunderstood something. I think the above statement is still true,
> but we don't need to do anything at this stage. It's typical usage
> for the leader to destroy the shared data after confirming that all
> workers are detached. It's not a TODO but probably a NOTE.

Okay.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-28 Thread Masahiko Sawada
On Fri, Jan 26, 2024 at 11:05 PM Masahiko Sawada  wrote:
>
> On Wed, Jan 24, 2024 at 3:42 PM John Naylor  wrote:
> >
> > On Tue, Jan 23, 2024 at 10:58 AM Masahiko Sawada  
> > wrote:
> > >
> > > The new patches probably need to be polished but the VacDeadItemInfo
> > > idea looks good to me.
> >
> > That idea looks good to me, too. Since you already likely know what
> > you'd like to polish, I don't have much to say except for a few
> > questions below. I also did a quick sweep through every patch, so some
> > of these comments are unrelated to recent changes:
>
> Thank you!
>
> >
> > +/*
> > + * Calculate the slab blocksize so that we can allocate at least 32 chunks
> > + * from the block.
> > + */
> > +#define RT_SLAB_BLOCK_SIZE(size) \
> > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
> >
> > The first parameter seems to be trying to make the block size exact,
> > but that's not right, because of the chunk header, and maybe
> > alignment. If the default block size is big enough to waste only a
> > tiny amount of space, let's just use that as-is.
>
> Agreed.
>

As of v55 patch, the sizes of each node class are:

- node4: 40 bytes
- node16_lo: 168 bytes
- node16_hi: 296 bytes
- node48: 784 bytes
- node256: 2088 bytes

If we use SLAB_DEFAULT_BLOCK_SIZE (8kB) for each node class, we waste
(approximately):

- node4: 32 bytes
- node16_lo: 128 bytes
- node16_hi: 200 bytes
- node48: 352 bytes
- node256: 1928 bytes

We might want to calculate a better slab block size for node256 at least.

> >
> > + * TODO: The caller must be certain that no other backend will attempt to
> > + * access the TidStore before calling this function. Other backend must
> > + * explicitly call TidStoreDetach() to free up backend-local memory 
> > associated
> > + * with the TidStore. The backend that calls TidStoreDestroy() must not 
> > call
> > + * TidStoreDetach().
> >
> > Do we need to do anything now?
>
> No, will remove it.
>

I misunderstood something. I think the above statement is still true,
but we don't need to do anything at this stage. It's typical usage
for the leader to destroy the shared data after confirming that all
workers are detached. It's not a TODO but probably a NOTE.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-26 Thread Masahiko Sawada
On Wed, Jan 24, 2024 at 3:42 PM John Naylor  wrote:
>
> On Tue, Jan 23, 2024 at 10:58 AM Masahiko Sawada  
> wrote:
> >
> > The new patches probably need to be polished but the VacDeadItemInfo
> > idea looks good to me.
>
> That idea looks good to me, too. Since you already likely know what
> you'd like to polish, I don't have much to say except for a few
> questions below. I also did a quick sweep through every patch, so some
> of these comments are unrelated to recent changes:

Thank you!

>
> v55-0003:
>
> +size_t
> +dsa_get_total_size(dsa_area *area)
> +{
> + size_t size;
> +
> + LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
> + size = area->control->total_segment_size;
> + LWLockRelease(DSA_AREA_LOCK(area));
>
> I looked and found dsa.c doesn't already use shared locks in HEAD,
> even dsa_dump. Not sure why that is...

Oh, the dsa_dump part seems to be a bug. But I'll keep it consistent
with the others.

>
> +/*
> + * Calculate the slab blocksize so that we can allocate at least 32 chunks
> + * from the block.
> + */
> +#define RT_SLAB_BLOCK_SIZE(size) \
> + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)
>
> The first parameter seems to be trying to make the block size exact,
> but that's not right, because of the chunk header, and maybe
> alignment. If the default block size is big enough to waste only a
> tiny amount of space, let's just use that as-is.

Agreed.

> Also, I think all
> block sizes in the code base have been a power of two, but I'm not
> sure how much that matters.

Did you mean all slab block sizes we use in radixtree.h?

>
> +#ifdef RT_SHMEM
> + fprintf(stderr, "  [%d] chunk %x slot " DSA_POINTER_FORMAT "\n",
> + i, n4->chunks[i], n4->children[i]);
> +#else
> + fprintf(stderr, "  [%d] chunk %x slot %p\n",
> + i, n4->chunks[i], n4->children[i]);
> +#endif
>
> Maybe we could invent a child pointer format, so we only #ifdef in one place.

Will change.

>
> --- /dev/null
> +++ b/src/test/modules/test_radixtree/meson.build
> @@ -0,0 +1,35 @@
> +# FIXME: prevent install during main install, but not during test :/
>
> Can you look into this?

Okay, I'll look at it.

>
> test_radixtree.c:
>
> +/*
> + * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
> + */
> +static int rt_node_class_fanouts[] = {
> + 4, /* RT_CLASS_3 */
> + 15, /* RT_CLASS_32_MIN */
> + 32, /* RT_CLASS_32_MAX */
> + 125, /* RT_CLASS_125 */
> + 256 /* RT_CLASS_256 */
> +};
>
> These numbers have been wrong a long time, too, but that only matters
> for figuring out where things went wrong when something is broken. And for the
> XXX, instead of trying to use the largest number that should fit (it's
> obviously not testing that the expected node can actually hold that
> number anyway), it seems we can just use a "big enough" number to
> cause growing into the desired size class.
>
> As far as cleaning up the tests, I always wondered why these didn't
> use EXPECT_TRUE, EXPECT_FALSE, etc. as in Andres's prototype where
> where convenient, and leave comments above the tests. That seemed like
> a good idea to me -- was there a reason to have hand-written branches
> and elog messages everywhere?

The current test is based on test_integerset. I agree that we can
improve it by using EXPECT_TRUE etc.

>
> --- a/src/tools/pginclude/cpluspluscheck
> +++ b/src/tools/pginclude/cpluspluscheck
> @@ -101,6 +101,12 @@ do
>   test "$f" = src/include/nodes/nodetags.h && continue
>   test "$f" = src/backend/nodes/nodetags.h && continue
>
> + # radixtree_*_impl.h cannot be included standalone: they are just
> code fragments.
> + test "$f" = src/include/lib/radixtree_delete_impl.h && continue
> + test "$f" = src/include/lib/radixtree_insert_impl.h && continue
> + test "$f" = src/include/lib/radixtree_iter_impl.h && continue
> + test "$f" = src/include/lib/radixtree_search_impl.h && continue
>
> Ha! I'd forgotten about these -- they're long outdated.

Will remove.

>
> v55-0005:
>
> - * The radix tree is locked in shared mode during the iteration, so
> - * RT_END_ITERATE needs to be called when finished to release the lock.
> + * The caller needs to acquire a lock in shared mode during the iteration
> + * if necessary.
>
> "need if necessary" is maybe better phrased as "is the caller's 
> responsibility"

Will fix.

>
> + /*
> + * We can rely on DSA_AREA_LOCK to get the total amount of DSA memory.
> + */
>   total = dsa_get_total_size(tree->dsa);
>
> Maybe better to have a header comment for RT_MEMORY_USAGE that the
> caller doesn't need to take a lock.

Will fix.

>
> v55-0006:
>
> "WIP: Not built, since some benchmarks have broken" -- I'll work on
> this when I re-run some benchmarks.

Thanks!

>
> v55-0007:
>
> + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
> + * and stored in the radix tree.
>
> This hasn't been true for a few months now, and I thought we fixed
> this in some earlier version?

Yeah, I'll fix it.

>
> + * TODO: The caller must be certain that no other backend will attempt to
> 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-23 Thread John Naylor
On Tue, Jan 23, 2024 at 10:58 AM Masahiko Sawada  wrote:
>
> The new patches probably need to be polished but the VacDeadItemsInfo
> idea looks good to me.

That idea looks good to me, too. Since you already likely know what
you'd like to polish, I don't have much to say except for a few
questions below. I also did a quick sweep through every patch, so some
of these comments are unrelated to recent changes:

v55-0003:

+size_t
+dsa_get_total_size(dsa_area *area)
+{
+ size_t size;
+
+ LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED);
+ size = area->control->total_segment_size;
+ LWLockRelease(DSA_AREA_LOCK(area));

I looked and found dsa.c doesn't already use shared locks in HEAD,
even dsa_dump. Not sure why that is...

+/*
+ * Calculate the slab blocksize so that we can allocate at least 32 chunks
+ * from the block.
+ */
+#define RT_SLAB_BLOCK_SIZE(size) \
+ Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32)

The first parameter seems to be trying to make the block size exact,
but that's not right, because of the chunk header, and maybe
alignment. If the default block size is big enough to waste only a
tiny amount of space, let's just use that as-is. Also, I think all
block sizes in the code base have been a power of two, but I'm not
sure how much that matters.

+#ifdef RT_SHMEM
+ fprintf(stderr, "  [%d] chunk %x slot " DSA_POINTER_FORMAT "\n",
+ i, n4->chunks[i], n4->children[i]);
+#else
+ fprintf(stderr, "  [%d] chunk %x slot %p\n",
+ i, n4->chunks[i], n4->children[i]);
+#endif

Maybe we could invent a child pointer format, so we only #ifdef in one place.

--- /dev/null
+++ b/src/test/modules/test_radixtree/meson.build
@@ -0,0 +1,35 @@
+# FIXME: prevent install during main install, but not during test :/

Can you look into this?

test_radixtree.c:

+/*
+ * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO?
+ */
+static int rt_node_class_fanouts[] = {
+ 4, /* RT_CLASS_3 */
+ 15, /* RT_CLASS_32_MIN */
+ 32, /* RT_CLASS_32_MAX */
+ 125, /* RT_CLASS_125 */
+ 256 /* RT_CLASS_256 */
+};

These numbers have been wrong a long time, too, but only matters for
figuring out where it went wrong when something is broken. And for the
XXX, instead of trying to use the largest number that should fit (it's
obviously not testing that the expected node can actually hold that
number anyway), it seems we can just use a "big enough" number to
cause growing into the desired size class.

As far as cleaning up the tests, I always wondered why these didn't
use EXPECT_TRUE, EXPECT_FALSE, etc. as in Andres's prototype where
convenient, and leave comments above the tests. That seemed like
a good idea to me -- was there a reason to have hand-written branches
and elog messages everywhere?

--- a/src/tools/pginclude/cpluspluscheck
+++ b/src/tools/pginclude/cpluspluscheck
@@ -101,6 +101,12 @@ do
  test "$f" = src/include/nodes/nodetags.h && continue
  test "$f" = src/backend/nodes/nodetags.h && continue

+ # radixtree_*_impl.h cannot be included standalone: they are just
code fragments.
+ test "$f" = src/include/lib/radixtree_delete_impl.h && continue
+ test "$f" = src/include/lib/radixtree_insert_impl.h && continue
+ test "$f" = src/include/lib/radixtree_iter_impl.h && continue
+ test "$f" = src/include/lib/radixtree_search_impl.h && continue

Ha! I'd forgotten about these -- they're long outdated.

v55-0005:

- * The radix tree is locked in shared mode during the iteration, so
- * RT_END_ITERATE needs to be called when finished to release the lock.
+ * The caller needs to acquire a lock in shared mode during the iteration
+ * if necessary.

"need if necessary" is maybe better phrased as "is the caller's responsibility"

+ /*
+ * We can rely on DSA_AREA_LOCK to get the total amount of DSA memory.
+ */
  total = dsa_get_total_size(tree->dsa);

Maybe better to have a header comment for RT_MEMORY_USAGE that the
caller doesn't need to take a lock.

v55-0006:

"WIP: Not built, since some benchmarks have broken" -- I'll work on
this when I re-run some benchmarks.

v55-0007:

+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.

This hasn't been true for a few months now, and I thought we fixed
this in some earlier version?

+ * TODO: The caller must be certain that no other backend will attempt to
+ * access the TidStore before calling this function. Other backend must
+ * explicitly call TidStoreDetach() to free up backend-local memory associated
+ * with the TidStore. The backend that calls TidStoreDestroy() must not call
+ * TidStoreDetach().

Do we need to do anything now?

v55-0008:

-TidStoreAttach(dsa_area *area, TidStoreHandle handle)
+TidStoreAttach(dsa_area *area, dsa_pointer rt_dp)

"handle" seemed like a fine name. Is that not the case anymore? The
new one is kind of cryptic. The commit message just says "remove
control object" -- does that imply that we need to think of this
parameter differently, or is it unrelated? (Same with

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-22 Thread Masahiko Sawada
On Tue, Jan 23, 2024 at 12:58 PM Masahiko Sawada  wrote:
>
> On Mon, Jan 22, 2024 at 5:18 PM John Naylor  wrote:
> >
> > On Mon, Jan 22, 2024 at 2:24 PM Masahiko Sawada  
> > wrote:
> > >
> > > For the next version patch, I'll work on this idea and try to clean up
> > > locking stuff both in tidstore and radix tree. Or if you're already
> > > working on some of them, please let me know. I'll review it.
> >
> > Okay go ahead, sounds good. I plan to look at the tests since they
> > haven't been looked at in a while.
>
> I've attached the latest patch set. Here are updates from v54 patch:
>
> 0005 - Expose radix tree lock functions and remove all locks taken
> internally in radixtree.h.
> 0008 - Remove tidstore's control object.
> 0009 - Add tidstore lock functions.
> 0011 - Add VacDeadItemsInfo to store "max_bytes" and "num_items"
> separate from TidStore. Also make lazy vacuum and parallel vacuum use
> it.

John pointed out offlist the tarball includes only the patches up to
0009. I've attached the correct one.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v55-ART.tar.gz
Description: GNU Zip compressed data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-22 Thread Masahiko Sawada
On Mon, Jan 22, 2024 at 5:18 PM John Naylor  wrote:
>
> On Mon, Jan 22, 2024 at 2:24 PM Masahiko Sawada  wrote:
> >
> > For the next version patch, I'll work on this idea and try to clean up
> > locking stuff both in tidstore and radix tree. Or if you're already
> > working on some of them, please let me know. I'll review it.
>
> Okay go ahead, sounds good. I plan to look at the tests since they
> haven't been looked at in a while.

I've attached the latest patch set. Here are updates from v54 patch:

0005 - Expose radix tree lock functions and remove all locks taken
internally in radixtree.h.
0008 - Remove tidstore's control object.
0009 - Add tidstore lock functions.
0011 - Add VacDeadItemsInfo to store "max_bytes" and "num_items"
separate from TidStore. Also make lazy vacuum and parallel vacuum use
it.

The new patches probably need to be polished but the VacDeadItemInfo
idea looks good to me.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v55-ART.tar.gz
Description: GNU Zip compressed data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-22 Thread John Naylor
On Mon, Jan 22, 2024 at 2:24 PM Masahiko Sawada  wrote:
>
> For the next version patch, I'll work on this idea and try to clean up
> locking stuff both in tidstore and radix tree. Or if you're already
> working on some of them, please let me know. I'll review it.

Okay go ahead, sounds good. I plan to look at the tests since they
haven't been looked at in a while.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-21 Thread Masahiko Sawada
On Mon, Jan 22, 2024 at 2:36 PM John Naylor  wrote:
>
> On Mon, Jan 22, 2024 at 10:28 AM Masahiko Sawada  
> wrote:
> >
> > On further thought, as you pointed out before, "num_tids" should not
> > be in tidstore in terms of integration with tidbitmap.c, because
> > tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no
> > longer accurate and useful. Similarly, looking at tidbitmap.c, it has
> > npages and nchunks but they will not be necessary in lazy vacuum use
> > case. Also, assuming that we support parallel heap pruning, probably
> > we need to somehow lock the tidstore while adding tids to the tidstore
> > concurrently by parallel vacuum worker. But in tidbitmap use case, we
> > don't need to lock the tidstore since it doesn't have multiple
> > writers.
>
> Not currently, and it does seem bad to require locking where it's not 
> required.
>
> (That would be a prerequisite for parallel index scan. It's been tried
> before with the hash table, but concurrency didn't scale well with the
> hash table. I have no reason to think that the radix tree would scale
> significantly better with the same global LW lock, but as you know
> there are other locking schemes possible.)
>
> > Given these facts, different statistics and different lock
> > strategies are required by different use case. So I think there are 3
> > options:
> >
> > 1. expose lock functions for tidstore and the caller manages the
> > statistics in the outside of tidstore. For example, in lazyvacuum.c we
> > would have a TidStore for tid storage as well as VacDeadItemsInfo that
> > has num_tids and max_bytes. Both are in LVRelState. For parallel
> > vacuum, we pass both to the workers via DSM and pass both to function
> > where the statistics are required. As for the exposed lock functions,
> > when adding tids to the tidstore, the caller would need to call
> > something like TidStoreLockExclusive(ts) that further calls
> > LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally.
>
> The advantage here is that vacuum can avoid locking entirely while
> using shared memory, just like it does now, and has the option to add
> it later.

True.

> IIUC, the radix tree struct would have a lock member, but wouldn't
> take any locks internally? Maybe we still need one for
> RT_MEMORY_USAGE? For that, I see dsa_get_total_size() takes its own
> DSA_AREA_LOCK -- maybe that's enough?

I think that's a good point. So there will be no place where the radix
tree takes any locks internally.

>
> That seems simplest, and is not very far from what we do now. If we do
> this, then the lock functions should be where we branch for is_shared.

Agreed.

>
> > 2. add callback functions to tidstore so that the caller can do its
> > work while holding a lock on the tidstore. This is like the idea we
> > just discussed for radix tree. The caller passes a callback function
> > and user data to TidStoreSetBlockOffsets(), and the callback is called
> > after setting tids. Similar to option 1, the statistics need to be
> > stored in a different area.
>
> I think we'll have to move to something like this eventually, but it
> seems like overkill right now.

Right.

>
> > 3. keep tidstore.c and tidbitmap.c separate implementations but use
> > radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its
> > control object and doesn't have any lossy page support. On the other
> > hand, in tidbitmap.c we replace simplehash with radix tree. This makes
> > tidstore.c simple but we would end up having different data structures
> > for similar usage.
>
> They have so much in common that it's worth it to use the same
> interface and (eventually) value type. They just need separate paths
> for adding tids, as we've discussed.

Agreed.

>
> > I think it's worth trying option 1. What do you think, John?
>
> +1

Thanks!

Before working on this idea, since the latest patches conflict with
the current HEAD, I share the latest patch set (v54). Here is the
summary:

- As for radix tree part, it's based on v53 patch. I've squashed most
of cleanups and changes in v53 except for "DRAFT: Stop using invalid
pointers as placeholders." as I thought you might want to still work
on it. BTW it includes "#undef RT_SHMEM".
- As for tidstore, it's based on v51. That is, it still has the
control object and num_tids there.
- As for vacuum integration, it's also based on v51. But we no longer
need to change has_lpdead_items and LVPagePruneState thanks to the
recent commits c120550edb8 and e313a61137.

For the next version patch, I'll work on this idea and try to clean up
locking stuff both in tidstore and radix tree. Or if you're already
working on some of them, please let me know. I'll review it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v54-ART.tar.gz
Description: GNU Zip compressed data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-21 Thread Masahiko Sawada
On Wed, Jan 17, 2024 at 12:32 PM John Naylor  wrote:
>
> I wrote:
>
> > > Hmm, I wonder if that's a side-effect of the "create" functions doing
> > > their own allocations and returning a pointer. Would it be less tricky
> > > if the structs were declared where we need them and passed to "init"
> > > functions?
>
> If this is a possibility, I thought I'd first send the last (I hope)
> large-ish set of radix tree cleanups to avoid rebasing issues. I'm not
> including tidstore/vacuum here, because recent discussion has some
> up-in-the-air work.

Thank you for updating the patches! These updates look good to me.

>
> Should be self-explanatory, but some thing are worth calling out:
> 0012 and 0013: Some time ago I started passing insertpos as a
> parameter, but now see that is not ideal -- when growing from node16
> to node48 we don't need it at all, so it's a wasted calculation. While
> reverting that, I found that this also allows passing constants in
> some cases.
> 0014 makes a cleaner separation between adding a child and growing a
> node, resulting in more compact-looking functions.
> 0019 is a bit unpolished, but I realized that it's pointless to assign
> a zero child when further up the call stack we overwrite it anyway
> with the actual value. With this, that assignment is skipped. This
> makes some comments and names strange, so needs a bit of polish, but
> wanted to get it out there anyway.

Cool.

I'll merge these patches in the next version v54 patch set.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-21 Thread John Naylor
On Mon, Jan 22, 2024 at 10:28 AM Masahiko Sawada  wrote:
>
> On further thought, as you pointed out before, "num_tids" should not
> be in tidstore in terms of integration with tidbitmap.c, because
> tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no
> longer accurate and useful. Similarly, looking at tidbitmap.c, it has
> npages and nchunks but they will not be necessary in lazy vacuum use
> case. Also, assuming that we support parallel heap pruning, probably
> we need to somehow lock the tidstore while adding tids to the tidstore
> concurrently by parallel vacuum worker. But in tidbitmap use case, we
> don't need to lock the tidstore since it doesn't have multiple
> writers.

Not currently, and it does seem bad to require locking where it's not required.

(That would be a prerequisite for parallel index scan. It's been tried
before with the hash table, but concurrency didn't scale well with the
hash table. I have no reason to think that the radix tree would scale
significantly better with the same global LW lock, but as you know
there are other locking schemes possible.)

> Given these facts, different statistics and different lock
> strategies are required by different use case. So I think there are 3
> options:
>
> 1. expose lock functions for tidstore and the caller manages the
> statistics in the outside of tidstore. For example, in lazyvacuum.c we
> would have a TidStore for tid storage as well as VacDeadItemsInfo that
> has num_tids and max_bytes. Both are in LVRelState. For parallel
> vacuum, we pass both to the workers via DSM and pass both to function
> where the statistics are required. As for the exposed lock functions,
> when adding tids to the tidstore, the caller would need to call
> something like TidStoreLockExclusive(ts) that further calls
> LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally.

The advantage here is that vacuum can avoid locking entirely while
using shared memory, just like it does now, and has the option to add
it later.
IIUC, the radix tree struct would have a lock member, but wouldn't
take any locks internally? Maybe we still need one for
RT_MEMORY_USAGE? For that, I see dsa_get_total_size() takes its own
DSA_AREA_LOCK -- maybe that's enough?

That seems simplest, and is not very far from what we do now. If we do
this, then the lock functions should be where we branch for is_shared.

> 2. add callback functions to tidstore so that the caller can do its
> work while holding a lock on the tidstore. This is like the idea we
> just discussed for radix tree. The caller passes a callback function
> and user data to TidStoreSetBlockOffsets(), and the callback is called
> after setting tids. Similar to option 1, the statistics need to be
> stored in a different area.

I think we'll have to move to something like this eventually, but it
seems like overkill right now.

> 3. keep tidstore.c and tidbitmap.c separate implementations but use
> radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its
> control object and doesn't have any lossy page support. On the other
> hand, in tidbitmap.c we replace simplehash with radix tree. This makes
> tidstore.c simple but we would end up having different data structures
> for similar usage.

They have so much in common that it's worth it to use the same
interface and (eventually) value type. They just need separate paths
for adding tids, as we've discussed.

> I think it's worth trying option 1. What do you think, John?

+1




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-21 Thread Masahiko Sawada
On Fri, Jan 19, 2024 at 6:48 PM John Naylor  wrote:
>
> On Fri, Jan 19, 2024 at 2:26 PM Masahiko Sawada  wrote:
> >
> > On Thu, Jan 18, 2024 at 1:30 PM John Naylor  wrote:
> > > I'm not quite sure what the point of "num_items" is anymore, because
> > > it was really tied to the array in VacDeadItems. dead_items->num_items
> > > is essential to reading/writing the array correctly. If this number is
> > > wrong, the array is corrupt. There is no such requirement for the
> > > radix tree. We don't need to know the number of tids to add to it or
> > > do a lookup, or anything.
> >
> > True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking
> > we need to have the number of TIDs in a tidstore, especially in the
> > tidstore's control object.
>
> Hmm, it would be kind of sad to require explicit locking in tidstore.c
> only for maintaining that one number at all times. Aside from the
> two ereports after an index scan / second heap pass, the only
> non-assert place where it's used is
>
> @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel)
>   * Do index vacuuming (call each index's ambulkdelete routine), then do
>   * related heap vacuuming
>   */
> - if (dead_items->num_items > 0)
> + if (TidStoreNumTids(dead_items) > 0)
>   lazy_vacuum(vacrel);
>
> ...and that condition can be checked by doing a single step of
> iteration to see if it shows anything. But for the ereport, my idea
> for iteration + popcount is probably quite slow.

Right.

On further thought, as you pointed out before, "num_tids" should not
be in tidstore in terms of integration with tidbitmap.c, because
tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no
longer accurate and useful. Similarly, looking at tidbitmap.c, it has
npages and nchunks but they will not be necessary in lazy vacuum use
case. Also, assuming that we support parallel heap pruning, probably
we need to somehow lock the tidstore while adding tids to the tidstore
concurrently by parallel vacuum worker. But in tidbitmap use case, we
don't need to lock the tidstore since it doesn't have multiple
writers. Given these facts, different statistics and different lock
strategies are required by different use case. So I think there are 3
options:

1. expose lock functions for tidstore and the caller manages the
statistics in the outside of tidstore. For example, in lazyvacuum.c we
would have a TidStore for tid storage as well as VacDeadItemsInfo that
has num_tids and max_bytes. Both are in LVRelState. For parallel
vacuum, we pass both to the workers via DSM and pass both to function
where the statistics are required. As for the exposed lock functions,
when adding tids to the tidstore, the caller would need to call
something like TidStoreLockExclusive(ts) that further calls
LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally.

2. add callback functions to tidstore so that the caller can do its
work while holding a lock on the tidstore. This is like the idea we
just discussed for radix tree. The caller passes a callback function
and user data to TidStoreSetBlockOffsets(), and the callback is called
after setting tids. Similar to option 1, the statistics need to be
stored in a different area.

3. keep tidstore.c and tidbitmap.c separate implementations but use
radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its
control object and doesn't have any lossy page support. On the other
hand, in tidbitmap.c we replace simplehash with radix tree. This makes
tidstore.c simple but we would end up having different data structures
for similar usage.

I think it's worth trying option 1. What do you think, John?

>
> > IIUC lpdead_items is the total number of LP_DEAD items vacuumed during
> > the whole lazy vacuum operation whereas num_items is the number of
> > LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle.
> > That is, after heap vacuum, the latter counter is reset while the
> > former counter is not.
> >
> > The latter counter is used in lazyvacuum.c as well as the ereport in
> > vac_bulkdel_one_index().
>
> Ah, of course.
>
> > Putting a copy of the key in BlocktableEntry's header is an
> > interesting idea. But the current debug code in the tidstore also
> > makes sure that the tidstore returns TIDs in the correct order during
> > an iterate operation. I think it still has a value and you can disable
> > it by removing the "#define TIDSTORE_DEBUG" line.
>
> Fair enough. I just thought it'd be less work to leave this out in
> case we change how locking is called.
>
> > > This week I tried an idea to use a callback there so that after
> > > internal unlocking, the caller received the value (or whatever else
> > > needs to happen, such as lookup an offset in the tid bitmap). I've
> > > attached a draft for that that passes radix tree tests. It's a bit
> > > awkward, but I'm guessing this would more closely match future
> > > internal atomic locking. Let me know what you think of the concept,
> > and then do whichever way you think is best. (using v53 as the basis)

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-19 Thread John Naylor
I wrote:

> On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada  wrote:
> > During trying this idea, I realized that there is a visibility problem
> > in the radix tree template
>
> If it's broken even without the embedding I'll look into this (I don't
> know if this configuration has ever been tested). I think a good test
> is putting the shared tid tree in it's own translation unit, to see if
> anything needs to be fixed. I'll go try that.

Here's a quick test that this works. The only thing that really needed
fixing in the template was failure to un-define one symbol. The rest
was just moving some things around.


v51-addendum-Put-shared-radix-tree-for-tidstore-in-its-own-tr.patch.nocfbot
Description: Binary data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-19 Thread John Naylor
On Fri, Jan 19, 2024 at 2:26 PM Masahiko Sawada  wrote:
>
> On Thu, Jan 18, 2024 at 1:30 PM John Naylor  wrote:
> > I'm not quite sure what the point of "num_items" is anymore, because
> > it was really tied to the array in VacDeadItems. dead_items->num_items
> > is essential to reading/writing the array correctly. If this number is
> > wrong, the array is corrupt. There is no such requirement for the
> > radix tree. We don't need to know the number of tids to add to it or
> > do a lookup, or anything.
>
> True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking
> we need to have the number of TIDs in a tidstore, especially in the
> tidstore's control object.

Hmm, it would be kind of sad to require explicit locking in tidstore.c
only for maintaining that one number at all times. Aside from the
two ereports after an index scan / second heap pass, the only
non-assert place where it's used is

@@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel)
  * Do index vacuuming (call each index's ambulkdelete routine), then do
  * related heap vacuuming
  */
- if (dead_items->num_items > 0)
+ if (TidStoreNumTids(dead_items) > 0)
  lazy_vacuum(vacrel);

...and that condition can be checked by doing a single step of
iteration to see if it shows anything. But for the ereport, my idea
for iteration + popcount is probably quite slow.

> IIUC lpdead_items is the total number of LP_DEAD items vacuumed during
> the whole lazy vacuum operation whereas num_items is the number of
> LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle.
> That is, after heap vacuum, the latter counter is reset while the
> former counter is not.
>
> The latter counter is used in lazyvacuum.c as well as the ereport in
> vac_bulkdel_one_index().

Ah, of course.

> Putting a copy of the key in BlocktableEntry's header is an
> interesting idea. But the current debug code in the tidstore also
> makes sure that the tidstore returns TIDs in the correct order during
> an iterate operation. I think it still has a value and you can disable
> it by removing the "#define TIDSTORE_DEBUG" line.

Fair enough. I just thought it'd be less work to leave this out in
case we change how locking is called.

> > This week I tried an idea to use a callback there so that after
> > internal unlocking, the caller received the value (or whatever else
> > needs to happen, such as lookup an offset in the tid bitmap). I've
> > attached a draft for that that passes radix tree tests. It's a bit
> > awkward, but I'm guessing this would more closely match future
> > internal atomic locking. Let me know what you think of the concept,
> > and then do whichever way you think is best. (using v53 as the basis)
>
> Thank you for verifying this idea! Interesting. While it's promising
> in terms of future atomic locking, I'm concerned it might not be easy
> to use if radix tree APIs supports only such callback style.

Yeah, it's quite awkward. It could be helped by only exposing it for
varlen types. For simply returning "present or not" (used a lot in the
regression tests), we could skip the callback if the data is null.
That is all also extra stuff.

> I believe
> the caller would like to pass one more data along with val_data. For

That's trivial, however, if I understand you correctly. With "void *",
a callback can receive anything, including a struct containing
additional pointers to elsewhere.

> example, considering tidstore that has num_tids internally, it wants
> to pass both a pointer to BlocktableEntry and a pointer to TidStore
> itself so that it increments the counter while holding a lock.

Hmm, so a callback to RT_SET also. That's interesting!

Anyway, I agree it needs to be simple, since the first use doesn't
even have multiple writers.

> BTW in radixtree.h pg_attribute_unused() is used for some functions,
> but is it for debugging purposes? I don't see why it's used only for
> some functions.

It was there to silence warnings about unused functions. I only see
one remaining, and it's already behind a debug symbol, so we might not
need this attribute anymore.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-18 Thread Masahiko Sawada
On Thu, Jan 18, 2024 at 1:30 PM John Naylor  wrote:
>
> On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada  wrote:
> > It seems we cannot make this work nicely. IIUC VacDeadItems is
> > allocated in DSM and TidStore is embedded there. However,
> > dead_items->items.area is a local pointer to dsa_area. So we cannot
> > include dsa_area in neither TidStore nor RT_RADIX_TREE. Instead we
> > would need to pass dsa_area to each interface by callers.
>
> Thanks again for exploring this line of thinking! Okay, it seems even
> if there's a way to make this work, it would be too invasive to
> justify when compared with the advantage I was hoping for.
>
> > > If we can't make this work nicely, I'd be okay with keeping the tid
> > > store control object. My biggest concern is unnecessary
> > > double-locking.
> >
> > If we don't do any locking stuff in radix tree APIs and it's the
> > user's responsibility at all, probably we don't need a lock for
> > tidstore? That is, we expose lock functions as you mentioned and the
> > user (like tidstore) acquires/releases the lock before/after accessing
> > the radix tree and num_items.
>
> I'm not quite sure what the point of "num_items" is anymore, because
> it was really tied to the array in VacDeadItems. dead_items->num_items
> is essential to reading/writing the array correctly. If this number is
> wrong, the array is corrupt. There is no such requirement for the
> radix tree. We don't need to know the number of tids to add to it or
> do a lookup, or anything.

True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking
we need to have the number of TIDs in a tidstore, especially in the
tidstore's control object.

>
> There are a number of places where we assert "the running count of the
> dead items" is the same as "the length of the dead items array", like
> here:
>
> @@ -2214,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel)
>   BlockNumber threshold;
>
>   Assert(vacrel->num_index_scans == 0);
> - Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
> + Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));
>
> As such, in HEAD I'm guessing it's arbitrary which one is used for
> control flow. Correct me if I'm mistaken. If I am wrong for some part
> of the code, it'd be good to understand when that invariant can't be
> maintained.
>
> @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel)
>   * Do index vacuuming (call each index's ambulkdelete routine), then do
>   * related heap vacuuming
>   */
> - if (dead_items->num_items > 0)
> + if (TidStoreNumTids(dead_items) > 0)
>   lazy_vacuum(vacrel);
>
> Like here. In HEAD, could this have used vacrel->dead_items?
>
> @@ -2479,14 +2473,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
>   * We set all LP_DEAD items from the first heap pass to LP_UNUSED during
>   * the second heap pass.  No more, no less.
>   */
> - Assert(index > 0);
>   Assert(vacrel->num_index_scans > 1 ||
> -(index == vacrel->lpdead_items &&
> +(TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
>   vacuumed_pages == vacrel->lpdead_item_pages));
>
>   ereport(DEBUG2,
> - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
> - vacrel->relname, (long long) index, vacuumed_pages)));
> + (errmsg("table \"%s\": removed " INT64_FORMAT " dead item identifiers
> in %u pages",
> + vacrel->relname, TidStoreNumTids(vacrel->dead_items),
> + vacuumed_pages)));
>
> We assert that vacrel->lpdead_items has the expected value, and then
> the ereport repeats the function call (with a lock) to read the value
> we just consulted to pass the assert.
>
> If we *really* want to compare counts, maybe we could invent a
> debugging-only function that iterates over the tree and popcounts the
> bitmaps. That seems too expensive for regular assert builds, though.

IIUC lpdead_items is the total number of LP_DEAD items vacuumed during
the whole lazy vacuum operation, whereas num_items is the number of
LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle.
That is, after a heap vacuum, the latter counter is reset while the
former is not.

The latter counter is used in vacuumlazy.c as well as in the ereport
in vac_bulkdel_one_index().

>
> On the subject of debugging builds, I think it no longer makes sense
> to have the array for debug checking in tid store, even during
> development. A few months ago, we had an encoding scheme that looked
> simple on paper, but its code was fiendishly difficult to follow (at
> least for me). That's gone. In addition to the debugging count above,
> we could also put a copy of the key in the BlockTableEntry's header,
> in debug builds. We don't yet need to care about the key size, since
> we don't (yet) have runtime-embeddable values.

Putting a copy of the key in BlocktableEntry's header is an
interesting idea. But the current debug code in the tidstore also
makes sure that the tidstore returns TIDs in the correct order during
iteration. I think it still 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-17 Thread John Naylor
On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada  wrote:
> It seems we cannot make this work nicely. IIUC VacDeadItems is
> allocated in DSM and TidStore is embedded there. However,
> dead_items->items.area is a local pointer to a dsa_area, so we cannot
> include dsa_area in either TidStore or RT_RADIX_TREE. Instead,
> callers would need to pass the dsa_area to each interface.

Thanks again for exploring this line of thinking! Okay, it seems even
if there's a way to make this work, it would be too invasive to
justify when compared with the advantage I was hoping for.

> > If we can't make this work nicely, I'd be okay with keeping the tid
> > store control object. My biggest concern is unnecessary
> > double-locking.
>
> If we don't do any locking stuff in radix tree APIs and it's the
> user's responsibility at all, probably we don't need a lock for
> tidstore? That is, we expose lock functions as you mentioned and the
> user (like tidstore) acquires/releases the lock before/after accessing
> the radix tree and num_items.

I'm not quite sure what the point of "num_items" is anymore, because
it was really tied to the array in VacDeadItems. dead_items->num_items
is essential to reading/writing the array correctly. If this number is
wrong, the array is corrupt. There is no such requirement for the
radix tree: we don't need to know the number of tids in order to add
to it, do a lookup, or anything else.

There are a number of places where we assert "the running count of the
dead items" is the same as "the length of the dead items array", like
here:

@@ -2214,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel)
  BlockNumber threshold;

  Assert(vacrel->num_index_scans == 0);
- Assert(vacrel->lpdead_items == vacrel->dead_items->num_items);
+ Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items));

As such, in HEAD I'm guessing it's arbitrary which one is used for
control flow. Correct me if I'm mistaken. If I am wrong for some part
of the code, it'd be good to understand when that invariant can't be
maintained.

@@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel)
  * Do index vacuuming (call each index's ambulkdelete routine), then do
  * related heap vacuuming
  */
- if (dead_items->num_items > 0)
+ if (TidStoreNumTids(dead_items) > 0)
  lazy_vacuum(vacrel);

Like here. In HEAD, could this have used vacrel->dead_items?

@@ -2479,14 +2473,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel)
  * We set all LP_DEAD items from the first heap pass to LP_UNUSED during
  * the second heap pass.  No more, no less.
  */
- Assert(index > 0);
  Assert(vacrel->num_index_scans > 1 ||
-(index == vacrel->lpdead_items &&
+(TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items &&
  vacuumed_pages == vacrel->lpdead_item_pages));

  ereport(DEBUG2,
- (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages",
- vacrel->relname, (long long) index, vacuumed_pages)));
+ (errmsg("table \"%s\": removed " INT64_FORMAT " dead item identifiers
in %u pages",
+ vacrel->relname, TidStoreNumTids(vacrel->dead_items),
+ vacuumed_pages)));

We assert that vacrel->lpdead_items has the expected value, and then
the ereport repeats the function call (with a lock) to read the value
we just consulted to pass the assert.

If we *really* want to compare counts, maybe we could invent a
debugging-only function that iterates over the tree and popcounts the
bitmaps. That seems too expensive for regular assert builds, though.

On the subject of debugging builds, I think it no longer makes sense
to have the array for debug checking in tid store, even during
development. A few months ago, we had an encoding scheme that looked
simple on paper, but its code was fiendishly difficult to follow (at
least for me). That's gone. In addition to the debugging count above,
we could also put a copy of the key in the BlocktableEntry's header,
in debug builds. We don't yet need to care about the key size, since
we don't (yet) have runtime-embeddable values.

> Currently (as of v52 patch) RT_FIND is
> doing so,

[meaning: there has been no internal "automatic" locking here since we
switched to variable-length types, an outstanding TODO]
Maybe it's okay to expose global locking for v17. I have one possible
alternative:

This week I tried an idea: use a callback there so that, after
internal unlocking, the caller receives the value (or whatever else
needs to happen, such as looking up an offset in the tid bitmap). I've
attached a draft that passes the radix tree tests. It's a bit
awkward, but I'm guessing this would more closely match future
internal atomic locking. Let me know what you think of the concept,
and then do whichever way you think is best. (using v53 as the basis)

I believe this is the only open question remaining. The rest is just
polish and testing.

> During trying this idea, I realized that there is a visibility problem
> in the radix tree template

If it's broken even without the embedding I'll look into this (I don't
know if this 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-17 Thread Masahiko Sawada
On Wed, Jan 17, 2024 at 11:37 AM John Naylor  wrote:
>
> On Wed, Jan 17, 2024 at 8:39 AM Masahiko Sawada  wrote:
> >
> > On Wed, Jan 17, 2024 at 9:20 AM John Naylor  wrote:
> > >
> > > On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada  
> > > wrote:
> > > > Just changing "items" to be the local tidstore struct could make the
> > > > code a bit tricky, since max_bytes and num_items are in shared
> > > > memory while "items" is a local pointer to the shared tidstore.
> > >
> > > Thanks for trying it this way! I like the overall simplification but
> > > this aspect is not great.
> > > Hmm, I wonder if that's a side-effect of the "create" functions doing
> > > their own allocations and returning a pointer. Would it be less tricky
> > > if the structs were declared where we need them and passed to "init"
> > > functions?
> >
> > Seems worth trying. The current RT_CREATE() API is also convenient, as
> > other data structures such as simplehash.h and dshash.c support a
> > similar API.
>
> I don't happen to know if these paths had to solve similar trickiness
> with some values being local, and some shared.
>
> > > That may be a good idea for other reasons. It's awkward that the
> > > create function is declared like this:
> > >
> > > #ifdef RT_SHMEM
> > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes,
> > > dsa_area *dsa,
> > > int tranche_id);
> > > #else
> > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes);
> > > #endif
> > >
> > > An init function wouldn't need these parameters: it could look at the
> > > passed struct to know what to do.
> >
> > But the init function would initialize leaf_ctx etc., no? Initializing
> > leaf_ctx needs max_bytes, which is not stored in RT_RADIX_TREE.
>
> I was more referring to the parameters that were different above
> depending on shared memory. My first thought was that the tricky part
> is because of the allocation in local memory, but it's certainly
> possible I've misunderstood the problem.
>
> > The same
> > is true for dsa. I imagined that an init function would allocate a DSA
> > memory for the control object.
>
> Yes:
>
> ...
> //  embedded in VacDeadItems
>   TidStore items;
> };
>
> // NULL DSA in local case, etc
> dead_items->items.area = dead_items_dsa;
> dead_items->items.tranche_id = FOO_ID;
>
> TidStoreInit(_items->items, vac_work_mem);
>
> That's how I imagined it would work (leaving out some details). I
> haven't tried it, so not sure how much it helps. Maybe it has other
> problems, but I'm hoping it's just a matter of programming.

It seems we cannot make this work nicely. IIUC VacDeadItems is
allocated in DSM and TidStore is embedded there. However,
dead_items->items.area is a local pointer to a dsa_area, so we cannot
include dsa_area in either TidStore or RT_RADIX_TREE. Instead,
callers would need to pass the dsa_area to each interface.

>
> If we can't make this work nicely, I'd be okay with keeping the tid
> store control object. My biggest concern is unnecessary
> double-locking.

If we don't do any locking stuff in radix tree APIs and it's the
user's responsibility at all, probably we don't need a lock for
tidstore? That is, we expose lock functions as you mentioned and the
user (like tidstore) acquires/releases the lock before/after accessing
the radix tree and num_items. Currently (as of v52 patch) RT_FIND is
doing so, but we would need to change RT_SET() and iteration functions
as well.

While trying this idea, I realized that there is a visibility problem
in the radix tree template, especially if we want to embed the radix
tree in a struct. Considering a use case where we want to use a radix
tree in an exposed struct, we would declare only the interfaces in a
.h file and define the actual implementation in a .c file (FYI,
TupleHashTableData does a similar thing with simplehash.h). The .c
file and .h file would look like:

in .h file:
#define RT_PREFIX local_rt
#define RT_SCOPE extern
#define RT_DECLARE
#define RT_VALUE_TYPE BlocktableEntry
#define RT_VARLEN_VALUE
#include "lib/radixtree.h"

typedef struct TidStore
{
:
local_rt_radix_tree tree; /* embedded */
:
} TidStore;

in .c file:

#define RT_PREFIX local_rt
#define RT_SCOPE extern
#define RT_DEFINE
#define RT_VALUE_TYPE BlocktableEntry
#define RT_VARLEN_VALUE
#include "lib/radixtree.h"

But this doesn't work, as the compiler doesn't know the actual
definition of local_rt_radix_tree. If 'tree' is a pointer
(*local_rt_radix_tree), it works. The reason is that with RT_DECLARE
but without RT_DEFINE, the radix tree template generates only forward
declarations:

#ifdef RT_DECLARE

typedef struct RT_RADIX_TREE RT_RADIX_TREE;
typedef struct RT_ITER RT_ITER;

In order to make it work, we need to move the definitions required to
expose the RT_RADIX_TREE struct into the RT_DECLARE part, which
actually requires moving RT_NODE, RT_HANDLE, RT_NODE_PTR,
RT_SIZE_CLASS_COUNT, RT_RADIX_TREE_CONTROL, etc. However,
RT_SIZE_CLASS_COUNT, used in RT_RADIX_TREE, could be bothersome. Since
it refers to

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-16 Thread John Naylor
I wrote:

> > Hmm, I wonder if that's a side-effect of the "create" functions doing
> > their own allocations and returning a pointer. Would it be less tricky
> > if the structs were declared where we need them and passed to "init"
> > functions?

If this is a possibility, I thought I'd first send the last (I hope)
large-ish set of radix tree cleanups to avoid rebasing issues. I'm not
including tidstore/vacuum here, because recent discussion has some
up-in-the-air work.

Should be self-explanatory, but some things are worth calling out:
0012 and 0013: Some time ago I started passing insertpos as a
parameter, but now see that is not ideal -- when growing from node16
to node48 we don't need it at all, so it's a wasted calculation. While
reverting that, I found that this also allows passing constants in
some cases.
0014 makes a cleaner separation between adding a child and growing a
node, resulting in more compact-looking functions.
0019 is a bit unpolished, but I realized that it's pointless to assign
a zero child when further up the call stack we overwrite it anyway
with the actual value. With this, that assignment is skipped. This
makes some comments and names strange, so needs a bit of polish, but
wanted to get it out there anyway.


v53-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-16 Thread John Naylor
On Wed, Jan 17, 2024 at 8:39 AM Masahiko Sawada  wrote:
>
> On Wed, Jan 17, 2024 at 9:20 AM John Naylor  wrote:
> >
> > On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada  
> > wrote:
> > > Just changing "items" to be the local tidstore struct could make the
> > > code a bit tricky, since max_bytes and num_items are in shared
> > > memory while "items" is a local pointer to the shared tidstore.
> >
> > Thanks for trying it this way! I like the overall simplification but
> > this aspect is not great.
> > Hmm, I wonder if that's a side-effect of the "create" functions doing
> > their own allocations and returning a pointer. Would it be less tricky
> > if the structs were declared where we need them and passed to "init"
> > functions?
>
> Seems worth trying. The current RT_CREATE() API is also convenient, as
> other data structures such as simplehash.h and dshash.c support a
> similar API.

I don't happen to know if these paths had to solve similar trickiness
with some values being local, and some shared.

> > That may be a good idea for other reasons. It's awkward that the
> > create function is declared like this:
> >
> > #ifdef RT_SHMEM
> > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes,
> > dsa_area *dsa,
> > int tranche_id);
> > #else
> > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes);
> > #endif
> >
> > An init function wouldn't need these parameters: it could look at the
> > passed struct to know what to do.
>
> But the init function would initialize leaf_ctx etc., no? Initializing
> leaf_ctx needs max_bytes, which is not stored in RT_RADIX_TREE.

I was more referring to the parameters that were different above
depending on shared memory. My first thought was that the tricky part
is because of the allocation in local memory, but it's certainly
possible I've misunderstood the problem.

> The same
> is true for dsa. I imagined that an init function would allocate a DSA
> memory for the control object.

Yes:

...
//  embedded in VacDeadItems
  TidStore items;
};

// NULL DSA in local case, etc
dead_items->items.area = dead_items_dsa;
dead_items->items.tranche_id = FOO_ID;

TidStoreInit(_items->items, vac_work_mem);

That's how I imagined it would work (leaving out some details). I
haven't tried it, so not sure how much it helps. Maybe it has other
problems, but I'm hoping it's just a matter of programming.

If we can't make this work nicely, I'd be okay with keeping the tid
store control object. My biggest concern is unnecessary
double-locking.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-16 Thread Masahiko Sawada
On Wed, Jan 17, 2024 at 9:20 AM John Naylor  wrote:
>
> On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada  wrote:
> > Just changing "items" to be the local tidstore struct could make the
> > code a bit tricky, since max_bytes and num_items are in shared
> > memory while "items" is a local pointer to the shared tidstore.
>
> Thanks for trying it this way! I like the overall simplification but
> this aspect is not great.
> Hmm, I wonder if that's a side-effect of the "create" functions doing
> their own allocations and returning a pointer. Would it be less tricky
> if the structs were declared where we need them and passed to "init"
> functions?

Seems worth trying. The current RT_CREATE() API is also convenient, as
other data structures such as simplehash.h and dshash.c support a
similar API.

>
> That may be a good idea for other reasons. It's awkward that the
> create function is declared like this:
>
> #ifdef RT_SHMEM
> RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes,
> dsa_area *dsa,
> int tranche_id);
> #else
> RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes);
> #endif
>
> An init function wouldn't need these parameters: it could look at the
> passed struct to know what to do.

But the init function would initialize leaf_ctx etc., no? Initializing
leaf_ctx needs max_bytes, which is not stored in RT_RADIX_TREE. The
same is true for the dsa. I imagined that an init function would
allocate DSA memory for the control object, so I imagine we will still
end up requiring some of those parameters.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-16 Thread John Naylor
On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada  wrote:
> Just changing "items" to be the local tidstore struct could make the
> code a bit tricky, since max_bytes and num_items are in shared
> memory while "items" is a local pointer to the shared tidstore.

Thanks for trying it this way! I like the overall simplification but
this aspect is not great.
Hmm, I wonder if that's a side-effect of the "create" functions doing
their own allocations and returning a pointer. Would it be less tricky
if the structs were declared where we need them and passed to "init"
functions?

That may be a good idea for other reasons. It's awkward that the
create function is declared like this:

#ifdef RT_SHMEM
RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes,
dsa_area *dsa,
int tranche_id);
#else
RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes);
#endif

An init function wouldn't need these parameters: it could look at the
passed struct to know what to do.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-15 Thread Masahiko Sawada
On Sun, Jan 14, 2024 at 10:43 PM John Naylor  wrote:
>
> On Fri, Jan 12, 2024 at 3:49 PM Masahiko Sawada  wrote:
> >
> > On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada  
> > wrote:
> > > So I agree to remove both max_bytes and num_items from the control
> > > object. Also, as you mentioned, we can remove the tidstore control
> > > object itself. TidStoreGetHandle() returns a radix tree handle, and we
> > > can pass it to TidStoreAttach().  I'll try it.
>
> Thanks. It's worth looking closely here.
>
> > I realized that if we remove the whole tidstore control object
> > including max_bytes, processes that attached to the shared tidstore
> > cannot actually use TidStoreIsFull(), as it always returns true.
>
> I imagine that we'd replace that with a function (maybe an earlier
> version had it?) to report the memory usage to the caller, which
> should know where to find max_bytes.
>
> > Also they
> > cannot use TidStoreReset() as well since it needs to pass max_bytes to
> > RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it
> > could be problematic for general use.
>
> HEAD has no problem finding the necessary values, and I don't think
> it'd be difficult to maintain that ability. I'm not actually sure what
> "general use" needs to have, and I'm not sure anyone can guess.
> There's the future possibility of parallel heap-scanning, but I'm
> guessing a *lot* more needs to happen for that to work, so I'm not
> sure how much it buys us to immediately start putting those two fields
> in a special abstraction. The only other concrete use case mentioned
> in this thread that I remember is bitmap heap scan, and I believe that
> would never need to reset, only free the whole thing when finished.
>
> I spent some more time studying parallel vacuum, and have some
> thoughts. In HEAD, we have
>
> -/*
> - * VacDeadItems stores TIDs whose index tuples are deleted by index 
> vacuuming.
> - */
> -typedef struct VacDeadItems
> -{
> - int max_items; /* # slots allocated in array */
> - int num_items; /* current # of entries */
> -
> - /* Sorted array of TIDs to delete from indexes */
> - ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
> -} VacDeadItems;
>
> ...which has the tids, plus two fields that function _very similarly_
> to the two extra fields in the tidstore control object. It's a bit
> strange to me that the patch doesn't have this struct anymore.
>
> I suspect if we keep it around (just change "items" to be the local
> tidstore struct), the patch would have a bit less churn and look/work
> more like the current code. I think it might be easier to read if the
> v17 commits are suited to the current needs of vacuum, rather than try
> to anticipate all uses. Richer abstractions can come later if needed.

Just changing "items" to be the local tidstore struct could make the
code a bit tricky, since max_bytes and num_items are in shared
memory while "items" is a local pointer to the shared tidstore. This
is the reason why I abstracted them behind TidStore. However, IIUC the
current parallel vacuum can fortunately work with such VacDeadItems
fields. The leader process can use a VacDeadItems allocated in DSM,
and worker processes can use a local VacDeadItems whose max_bytes
and num_items are copied from the shared one and whose "items" is a
local pointer.

If a parallel heap scan requires both the leader and workers to update
the shared VacDeadItems concurrently, we may need such richer
abstractions.

I've implemented this idea in the v52 patch set. Here is the summary
of the updates:

0008: Remove the control object from tidstore. Also removes some
now-unsupported functions such as TidStoreNumTids().
0009: Adjust the lazy vacuum integration patch for the control object
removal.

I've not updated any locking code yet. Once we confirm this direction,
I'll update the locking code too.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v52-ART.tar.gz
Description: GNU Zip compressed data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-14 Thread John Naylor
On Fri, Jan 12, 2024 at 3:49 PM Masahiko Sawada  wrote:
>
> On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada  wrote:
> > So I agree to remove both max_bytes and num_items from the control
> > object. Also, as you mentioned, we can remove the tidstore control
> > object itself. TidStoreGetHandle() returns a radix tree handle, and we
> > can pass it to TidStoreAttach().  I'll try it.

Thanks. It's worth looking closely here.

> I realized that if we remove the whole tidstore control object
> including max_bytes, processes that attached to the shared tidstore
> cannot actually use TidStoreIsFull(), as it always returns true.

I imagine that we'd replace that with a function (maybe an earlier
version had it?) to report the memory usage to the caller, which
should know where to find max_bytes.

> Also they
> cannot use TidStoreReset() as well since it needs to pass max_bytes to
> RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it
> could be problematic for general use.

HEAD has no problem finding the necessary values, and I don't think
it'd be difficult to maintain that ability. I'm not actually sure what
"general use" needs to have, and I'm not sure anyone can guess.
There's the future possibility of parallel heap-scanning, but I'm
guessing a *lot* more needs to happen for that to work, so I'm not
sure how much it buys us to immediately start putting those two fields
in a special abstraction. The only other concrete use case mentioned
in this thread that I remember is bitmap heap scan, and I believe that
would never need to reset, only free the whole thing when finished.

I spent some more time studying parallel vacuum, and have some
thoughts. In HEAD, we have

-/*
- * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming.
- */
-typedef struct VacDeadItems
-{
- int max_items; /* # slots allocated in array */
- int num_items; /* current # of entries */
-
- /* Sorted array of TIDs to delete from indexes */
- ItemPointerData items[FLEXIBLE_ARRAY_MEMBER];
-} VacDeadItems;

...which has the tids, plus two fields that function _very similarly_
to the two extra fields in the tidstore control object. It's a bit
strange to me that the patch doesn't have this struct anymore.

I suspect if we keep it around (just change "items" to be the local
tidstore struct), the patch would have a bit less churn and look/work
more like the current code. I think it might be easier to read if the
v17 commits are suited to the current needs of vacuum, rather than try
to anticipate all uses. Richer abstractions can come later if needed.
Another stanza:

- /* Prepare the dead_items space */
- dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc,
-est_dead_items_len);
- dead_items->max_items = max_items;
- dead_items->num_items = 0;
- MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items);
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items);
- pvs->dead_items = dead_items;

With s/max_items/max_bytes/, I wonder if we can still use some of
this, and parallel workers would have no problem getting the necessary
info, as they do today. If not, I don't really understand why. I'm not
very familiar with working with shared memory, and I know the tree
itself needs some different setup, so it's quite possible I'm missing
something.

I find it difficult to keep these four things straight:

- radix tree
- radix tree control object
- tidstore
- tidstore control object

Even with the code in front of me, it's hard to reason about how these
concepts fit together. It'd be much more readable if this was
simplified.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-12 Thread Masahiko Sawada
On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada  wrote:
>
> On Mon, Jan 8, 2024 at 8:35 PM John Naylor  wrote:
> >
> > On Wed, Jan 3, 2024 at 9:10 PM John Naylor  wrote:
> > >
> > > On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada  
> > > wrote:
> > >
> > > > I agree that we expose RT_LOCK_* functions and have tidstore use them,
> > > > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)"
> > > > calls part. I think that even if we expose them, we will still need to
> > > > do something like "if (TidStoreIsShared(ts))
> > > > shared_rt_lock_share(ts->tree.shared)", no?
> > >
> > > I'll come back to this topic separately.
> >
> > To answer your question, sure, but that "if (TidStoreIsShared(ts))"
> > part would be pushed down into a function so that only one place has
> > to care about it.
> >
> > However, I'm starting to question whether we even need that. Meaning,
> > lock the tidstore separately. To "lock the tidstore" means to take a
> > lock, _separate_ from the radix tree's internal lock, to control
> > access to two fields in a separate "control object":
> >
> > +typedef struct TidStoreControl
> > +{
> > + /* the number of tids in the store */
> > + int64 num_tids;
> > +
> > + /* the maximum bytes a TidStore can use */
> > + size_t max_bytes;
> >
> > I'm pretty sure max_bytes does not need to be in shared memory, and
> > certainly not under a lock: Thinking of a hypothetical
> > parallel-prune-phase scenario, one way would be for a leader process
> > to pass out ranges of blocks to workers, and when the limit is
> > exceeded, stop passing out blocks and wait for all the workers to
> > finish.
>
> True. I agreed that it doesn't need to be under a lock anyway, as it's
> read-only.
>
> >
> > As for num_tids, vacuum previously put the similar count in
> >
> > @@ -176,7 +179,8 @@ struct ParallelVacuumState
> >   PVIndStats *indstats;
> >
> >   /* Shared dead items space among parallel vacuum workers */
> > - VacDeadItems *dead_items;
> > + TidStore *dead_items;
> >
> > VacDeadItems contained "num_items". What was the reason to have new
> > infrastructure for that count? And it doesn't seem like access to it
> > was controlled by a lock -- can you confirm? If we did get parallel
> > pruning, maybe the count would belong inside PVShared?
>
> I thought that since the tidstore is a general-purpose data structure
> the shared counter should be protected by a lock. One thing I'm
> concerned about is that we might need to update both the radix tree
> and the counter atomically in some cases. But that's true we don't
> need it for lazy vacuum at least for now. Even given the parallel scan
> phase, probably we won't need to have workers check the total number
> of stored tuples during a parallel scan.
>
> >
> > The number of tids is not that tightly bound to the tidstore's job. I
> > believe tidbitmap.c (a possible future client) doesn't care about the
> > global number of tids -- not only that, but AND/OR operations can
> > change the number in a non-obvious way, so it would not be convenient
> > to keep an accurate number anyway. But the lock would still be
> > mandatory with this patch.
>
> Very good point.
>
> >
> > If we can make vacuum work a bit closer to how it does now, it'd be a
> > big step up in readability, I think. Namely, getting rid of all the
> > locking logic inside tidstore.c and let the radix tree's locking do
> > the right thing. We'd need to make that work correctly when receiving
> > pointers to values upon lookup, and I already shared ideas for that.
> > But I want to see if there is any obstacle in the way of removing the
> > tidstore control object and it's separate lock.
>
> So I agree to remove both max_bytes and num_items from the control
> object. Also, as you mentioned, we can remove the tidstore control
> object itself. TidStoreGetHandle() returns a radix tree handle, and we
> can pass it to TidStoreAttach().  I'll try it.
>

I realized that if we remove the whole tidstore control object,
including max_bytes, processes that attached to the shared tidstore
cannot actually use TidStoreIsFull(), as it always returns true. They
also cannot use TidStoreReset(), since it needs to pass max_bytes to
RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it
could be problematic for general use. If we remove it, we probably
need a safeguard to prevent backends that attached to the tidstore
from calling these functions. Or we can keep the control object but
remove the lock and num_tids.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-10 Thread Masahiko Sawada
On Mon, Jan 8, 2024 at 8:35 PM John Naylor  wrote:
>
> On Wed, Jan 3, 2024 at 9:10 PM John Naylor  wrote:
> >
> > On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada  
> > wrote:
> >
> > > I agree that we expose RT_LOCK_* functions and have tidstore use them,
> > but am not sure about the "if (TidStoreIsShared(ts)) LWLockAcquire(..., ...)"
> > > calls part. I think that even if we expose them, we will still need to
> > > do something like "if (TidStoreIsShared(ts))
> > > shared_rt_lock_share(ts->tree.shared)", no?
> >
> > I'll come back to this topic separately.
>
> To answer your question, sure, but that "if (TidStoreIsShared(ts))"
> part would be pushed down into a function so that only one place has
> to care about it.
>
> However, I'm starting to question whether we even need that. Meaning,
> lock the tidstore separately. To "lock the tidstore" means to take a
> lock, _separate_ from the radix tree's internal lock, to control
> access to two fields in a separate "control object":
>
> +typedef struct TidStoreControl
> +{
> + /* the number of tids in the store */
> + int64 num_tids;
> +
> + /* the maximum bytes a TidStore can use */
> + size_t max_bytes;
>
> I'm pretty sure max_bytes does not need to be in shared memory, and
> certainly not under a lock: Thinking of a hypothetical
> parallel-prune-phase scenario, one way would be for a leader process
> to pass out ranges of blocks to workers, and when the limit is
> exceeded, stop passing out blocks and wait for all the workers to
> finish.

True. I agree that it doesn't need to be under a lock anyway, as it's
read-only.

>
> As for num_tids, vacuum previously put a similar count in
>
> @@ -176,7 +179,8 @@ struct ParallelVacuumState
>   PVIndStats *indstats;
>
>   /* Shared dead items space among parallel vacuum workers */
> - VacDeadItems *dead_items;
> + TidStore *dead_items;
>
> VacDeadItems contained "num_items". What was the reason to have new
> infrastructure for that count? And it doesn't seem like access to it
> was controlled by a lock -- can you confirm? If we did get parallel
> pruning, maybe the count would belong inside PVShared?

I thought that since the tidstore is a general-purpose data structure,
the shared counter should be protected by a lock. One thing I'm
concerned about is that we might need to update both the radix tree
and the counter atomically in some cases. But it's true that we don't
need that for lazy vacuum, at least for now. Even with a parallel scan
phase, workers probably won't need to check the total number of stored
tuples during the scan.

>
> The number of tids is not that tightly bound to the tidstore's job. I
> believe tidbitmap.c (a possible future client) doesn't care about the
> global number of tids -- not only that, but AND/OR operations can
> change the number in a non-obvious way, so it would not be convenient
> to keep an accurate number anyway. But the lock would still be
> mandatory with this patch.

Very good point.

>
> If we can make vacuum work a bit closer to how it does now, it'd be a
> big step up in readability, I think. Namely, getting rid of all the
> locking logic inside tidstore.c and let the radix tree's locking do
> the right thing. We'd need to make that work correctly when receiving
> pointers to values upon lookup, and I already shared ideas for that.
> But I want to see if there is any obstacle in the way of removing the
> tidstore control object and its separate lock.

So I agree to remove both max_bytes and num_items from the control
object. Also, as you mentioned, we can remove the tidstore control
object itself. TidStoreGetHandle() returns a radix tree handle, and we
can pass it to TidStoreAttach().  I'll try it.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-09 Thread John Naylor
On Wed, Jan 10, 2024 at 9:05 AM Masahiko Sawada  wrote:
>
> I've done it in the 0010 patch in the v51 patch set. Whereas the
> RT_NODE_4 and RT_NODE_16 struct declarations need RT_FANOUT_4_HI and
> RT_FANOUT_16_HI respectively, RT_FANOUT_16_LO and RT_FANOUT_48 need
> the RT_NODE_16 and RT_NODE_48 struct declarations. So the fanout
> declarations are now spread before and after the RT_NODE_XXX struct
> declarations. It's a bit less readable, but I'm not sure of a better way.

They were before and after the *_BASE types, so it's not really worse,
I think. I did notice that RT_SLOT_IDX_LIMIT has been considered
special for a very long time, since before we even had size classes,
so it's the same situation but even farther away. I have an idea to
introduce *_MAX macros, allowing us to turn RT_SLOT_IDX_LIMIT into
RT_FANOUT_48_MAX, so that everything is in the same spot and this area
is more consistent. I also noticed that I'd been assuming that
RT_FANOUT_16_HI fits easily into a DSA size class, but that's only
true on 64-bit, and in any case we don't want to assume it. I've
attached an addendum .txt to demo this idea.
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index bde6916184..ffb0b58826 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -287,19 +287,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE * tree);
 /* Tree level the radix tree uses */
 #define RT_MAX_LEVEL   ((sizeof(uint64) * BITS_PER_BYTE) / RT_SPAN)
 
-/*
- * Number of bits necessary for isset array in node48.
- * Since bitmapword can be 64 bits, the only values that make sense
- * here are 64 and 128.
- * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
- * inside a single bitmapword on most platforms, so it's a good starting
- * point. We can make it higher if we need to.
- */
-#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 4)
-
-/* Invalid index used in node48 */
-#define RT_INVALID_SLOT_IDX0xFF
-
 /* Get a chunk from the key */
 #define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & 
RT_CHUNK_MASK))
 
@@ -426,20 +413,29 @@ typedef union RT_NODE_PTR
 }  RT_NODE_PTR;
 
 /*
- * fanout values for each size class.
- *
- * RT_FANOUT_4_HI and RT_FANOUT_16_HI are declared here as they are
+ * Symbols for maximum possible fanout are declared first as they are
  * required to declare each node kind. The declarations of other fanout
  * values are followed as they need the struct sizes of each node kind.
- *
- * TODO: consider 5 with subclass 1 or 2.
  */
 
 /* max possible key chunks without struct padding */
-#define RT_FANOUT_4_HI (8 - sizeof(RT_NODE))
+#define RT_FANOUT_4_MAX (8 - sizeof(RT_NODE))
 
 /* equal to two 128-bit SIMD registers, regardless of availability */
-#define RT_FANOUT_16_HI32
+#define RT_FANOUT_16_MAX   32
+
+/*
+ * This also determines the number of bits necessary for the isset array,
+ * so we need to be mindful of the size of bitmapword.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
+ * inside a single bitmapword on most platforms, so it's a good starting
+ * point. We can make it higher if we need to.
+ */
+#define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4)
+
+#define RT_FANOUT_256   RT_NODE_MAX_SLOTS
 
 /*
  * Node structs, one for each "kind"
@@ -448,7 +444,7 @@ typedef struct RT_NODE_4
 {
RT_NODE base;
 
-   uint8   chunks[RT_FANOUT_4_HI];
+   uint8   chunks[RT_FANOUT_4_MAX];
 
/* number of children depends on size class */
RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
@@ -458,7 +454,7 @@ typedef struct RT_NODE_16
 {
RT_NODE  base;
 
-   uint8   chunks[RT_FANOUT_16_HI];
+   uint8   chunks[RT_FANOUT_16_MAX];
 
/* number of children depends on size class */
RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
@@ -476,8 +472,11 @@ typedef struct RT_NODE_48
/* The index of slots for each fanout */
uint8   slot_idxs[RT_NODE_MAX_SLOTS];
 
+/* Invalid index */
+#define RT_INVALID_SLOT_IDX0xFF
+
/* bitmap to track which slots are in use */
-   bitmapword  isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+   bitmapword  isset[RT_BM_IDX(RT_FANOUT_48_MAX)];
 
/* number of children depends on size class */
RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
@@ -493,28 +492,29 @@ typedef struct RT_NODE_256
 {
RT_NODE  base;
 
-   /*
-* Zero is a valid value for embedded values, so we use a bitmap to 
track
-* which slots are in use.
-*/
-   bitmapword  isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+   /* bitmap to track which slots are in use */
+   bitmapword  isset[RT_BM_IDX(RT_FANOUT_256)];
 
/* Slots for 256 children */
-   RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+   RT_PTR_ALLOC 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-09 Thread Masahiko Sawada
On Tue, Jan 9, 2024 at 8:19 PM John Naylor  wrote:
>
> On Tue, Jan 9, 2024 at 9:40 AM Masahiko Sawada  wrote:
> > In addition, I've made some changes and cleanups:
>
> These look good to me, although I have not tried dumping a node in a while.
>
> > 0011 - simplify the radix tree iteration code. I hope it makes the
> > code clear and readable. Also I removed RT_UPDATE_ITER_STACK().
>
> I'm very pleased with how much simpler it is now!
>
> > 0013 - In RT_SHMEM case, we use SIZEOF_VOID_P for
> > RT_VALUE_IS_EMBEDDABLE check, but I think it's not correct. Because
> > DSA has its own pointer size, SIZEOF_DSA_POINTER, it could be 4 bytes
> > even if SIZEOF_VOID_P is 8 bytes, for example in a case where
> > !defined(PG_HAVE_ATOMIC_U64_SUPPORT). Please refer to dsa.h for
> > details.
>
> Thanks for the pointer. ;-)
>
> > BTW, now that the inner and leaf nodes use the same structure, do we
> > still need RT_NODE_BASE_XXX types? Most places where we use
> > RT_NODE_BASE_XXX types can be replaced with RT_NODE_XXX types.
>
> That's been in the back of my mind as well. Maybe the common header
> should be the new "base" member? At least, something other than "n".

Agreed.

>
> > Exceptions are RT_FANOUT_XX calculations:
> >
> > #if SIZEOF_VOID_P < 8
> > #define RT_FANOUT_16_LO ((96 - sizeof(RT_NODE_BASE_16)) / 
> > sizeof(RT_PTR_ALLOC))
> > #define RT_FANOUT_48((512 - sizeof(RT_NODE_BASE_48)) / 
> > sizeof(RT_PTR_ALLOC))
> > #else
> > #define RT_FANOUT_16_LO ((160 - sizeof(RT_NODE_BASE_16)) / 
> > sizeof(RT_PTR_ALLOC))
> > #define RT_FANOUT_48((768 - sizeof(RT_NODE_BASE_48)) / 
> > sizeof(RT_PTR_ALLOC))
> > #endif  /* SIZEOF_VOID_P < 8 */
> >
> > But I think we can replace them with offsetof(RT_NODE_16, children) etc.
>
> That makes sense. Do you want to have a go at it, or shall I?

I've done it in the 0010 patch in the v51 patch set. Whereas the
RT_NODE_4 and RT_NODE_16 struct declarations need RT_FANOUT_4_HI and
RT_FANOUT_16_HI respectively, RT_FANOUT_16_LO and RT_FANOUT_48 need
the RT_NODE_16 and RT_NODE_48 struct declarations. So the fanout
declarations are now spread before and after the RT_NODE_XXX struct
declarations. It's a bit less readable, but I'm not sure of a better way.

The previous updates are merged into the main radix tree patch and
tidstore patch. Nothing changes in other patches from v50.

>
> I think after that, the only big cleanup needed is putting things in a
> more readable order. I can do that at a later date, and other
> opportunities for beautification are pretty minor and localized.

Agreed.

>
> Rationalizing locking is the only thing left that requires a bit of thought.

Right, I'll send a reply soon.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v51-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-09 Thread John Naylor
On Tue, Jan 9, 2024 at 9:40 AM Masahiko Sawada  wrote:
> In addition, I've made some changes and cleanups:

These look good to me, although I have not tried dumping a node in a while.

> 0011 - simplify the radix tree iteration code. I hope it makes the
> code clear and readable. Also I removed RT_UPDATE_ITER_STACK().

I'm very pleased with how much simpler it is now!

> 0013 - In RT_SHMEM case, we use SIZEOF_VOID_P for
> RT_VALUE_IS_EMBEDDABLE check, but I think it's not correct. Because
> DSA has its own pointer size, SIZEOF_DSA_POINTER, it could be 4 bytes
> even if SIZEOF_VOID_P is 8 bytes, for example in a case where
> !defined(PG_HAVE_ATOMIC_U64_SUPPORT). Please refer to dsa.h for
> details.

Thanks for the pointer. ;-)

> BTW, now that the inner and leaf nodes use the same structure, do we
> still need RT_NODE_BASE_XXX types? Most places where we use
> RT_NODE_BASE_XXX types can be replaced with RT_NODE_XXX types.

That's been in the back of my mind as well. Maybe the common header
should be the new "base" member? At least, something other than "n".

> Exceptions are RT_FANOUT_XX calculations:
>
> #if SIZEOF_VOID_P < 8
> #define RT_FANOUT_16_LO ((96 - sizeof(RT_NODE_BASE_16)) / 
> sizeof(RT_PTR_ALLOC))
> #define RT_FANOUT_48((512 - sizeof(RT_NODE_BASE_48)) / 
> sizeof(RT_PTR_ALLOC))
> #else
> #define RT_FANOUT_16_LO ((160 - sizeof(RT_NODE_BASE_16)) / 
> sizeof(RT_PTR_ALLOC))
> #define RT_FANOUT_48((768 - sizeof(RT_NODE_BASE_48)) / 
> sizeof(RT_PTR_ALLOC))
> #endif  /* SIZEOF_VOID_P < 8 */
>
> But I think we can replace them with offsetof(RT_NODE_16, children) etc.

That makes sense. Do you want to have a go at it, or shall I?

I think after that, the only big cleanup needed is putting things in a
more readable order. I can do that at a later date, and other
opportunities for beautification are pretty minor and localized.

Rationalizing locking is the only thing left that requires a bit of thought.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-08 Thread Masahiko Sawada
On Wed, Jan 3, 2024 at 11:10 PM John Naylor  wrote:
>
> On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada  wrote:
>
> > I agree that we expose RT_LOCK_* functions and have tidstore use them,
> > but am not sure about the "if (TidStoreIsShared(ts)) LWLockAcquire(..., ...)"
> > calls part. I think that even if we expose them, we will still need to
> > do something like "if (TidStoreIsShared(ts))
> > shared_rt_lock_share(ts->tree.shared)", no?
>
> I'll come back to this topic separately.
>
> > I've attached a new patch set. From v47 patch, I've merged your
> > changes for radix tree, and split the vacuum integration patch into 3
> > patches: simply replaces VacDeadItems with TidStore (0007 patch), and
> > use a simple TID array for one-pass strategy (0008 patch), and replace
> > has_lpdead_items with "num_offsets > 0" (0009 patch), while
> > incorporating your review comments on the vacuum integration patch
>
> Nice!
>
> > (sorry for making it difficult to see the changes from v47 patch).
>
> It's actually pretty clear. I just have a couple comments before
> sharing my latest cleanups:
>
> (diff'ing between v47 and v48):
>
> --   /*
> -* In the shared case, TidStoreControl and radix_tree are backed by 
> the
> -* same DSA area and rt_memory_usage() returns the value including 
> both.
> -* So we don't need to add the size of TidStoreControl separately.
> -*/
> if (TidStoreIsShared(ts))
> -   return sizeof(TidStore) +
> shared_rt_memory_usage(ts->tree.shared);
> +   rt_mem = shared_rt_memory_usage(ts->tree.shared);
> +   else
> +   rt_mem = local_rt_memory_usage(ts->tree.local);
>
> -   return sizeof(TidStore) + sizeof(TidStore) +
> local_rt_memory_usage(ts->tree.local);
> +   return sizeof(TidStore) + sizeof(TidStoreControl) + rt_mem;
>
> Upthread, I meant that I don't see the need to include the size of
> these structs *at all*. They're tiny, and the blocks/segments will
> almost certainly have some empty space counted in the total anyway.
> The returned size is already overestimated, so this extra code is just
> a distraction.

Agreed.

>
> - if (result->num_offsets + bmw_popcount(w) > result->max_offset)
> + if (result->num_offsets + (sizeof(bitmapword) * BITS_PER_BITMAPWORD)
> >= result->max_offset)
>
> I believe this math is wrong. We care about "result->num_offsets +
> BITS_PER_BITMAPWORD", right?
> Also, it seems if the condition evaluates to equal, we still have
> enough space, in which case ">" the max is the right condition.

Oops, you're right. Fixed.

>
> - if (off < 1 || off > MAX_TUPLES_PER_PAGE)
> + if (off < 1 || off > MaxOffsetNumber)
>
> This can now use OffsetNumberIsValid().

Fixed.

>
> > 0013 to 0015 patches are also updates from v47 patch.
>
> > I'm thinking that we should change the order of the patches so that
> > tidstore patch requires the patch for changing DSA segment sizes. That
> > way, we can remove the complex max memory calculation part that we no
> > longer use from the tidstore patch.
>
> I don't think there is any reason to have those calculations at all at
> this point. Every patch in every version should at least *work
> correctly*, without kludging m_w_m and without constraining max
> segment size. I'm fine with the latter remaining in its own thread,
> and I hope we can consider it an enhancement that respects the admin's
> configured limits more effectively, and not a pre-requisite for not
> breaking. I *think* we're there now, but it's hard to tell since 0015
> was at the very end. As I said recently, if something still fails, I'd
> like to know why. So for v49, I took the liberty of removing the DSA
> max segment patches for now, and squashing v48-0015.

Fair enough.

>
> In addition for v49, I have quite a few cleanups:
>
> 0001 - This hasn't been touched in a very long time, but I ran
> pgindent and clarified a comment
> 0002 - We no longer need to isolate the rightmost bit anywhere, so
> removed that part and revised the commit message accordingly.

Thanks.

>
> radix tree:
> 0003 - v48 plus squashed v48-0013
> 0004 - Removed or adjusted WIP, FIXME, TODO items. Some were outdated,
> and I fixed most of the rest.
> 0005 - Remove the RT_PTR_LOCAL macro, since it's not really useful anymore.
> 0006 - RT_FREE_LEAF only needs the allocated pointer, so pass that. A
> bit simpler.
> 0007 - Uses the same idea from a previous cleanup of RT_SET, for RT_DELETE.
> 0008 - Removes a holdover from the multi-value leaves era.
> 0009 - It occurred to me that we need to have unique names for memory
> contexts for different instantiations of the template. This is one way
> to do it, by using the configured RT_PREFIX in the context name. I
> also took an extra step to make the size class fanout show up
> correctly on different platforms, but that's probably overkill and
> undesirable, and I'll probably use only the class name next time.
> 0010/11 - Make the array functions less surprising and with more

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-08 Thread John Naylor
On Wed, Jan 3, 2024 at 9:10 PM John Naylor  wrote:
>
> On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada  wrote:
>
> > I agree that we expose RT_LOCK_* functions and have tidstore use them,
> > but am not sure about the "if (TidStoreIsShared(ts)) LWLockAcquire(..., ...)"
> > calls part. I think that even if we expose them, we will still need to
> > do something like "if (TidStoreIsShared(ts))
> > shared_rt_lock_share(ts->tree.shared)", no?
>
> I'll come back to this topic separately.

To answer your question, sure, but that "if (TidStoreIsShared(ts))"
part would be pushed down into a function so that only one place has
to care about it.

However, I'm starting to question whether we even need that. Meaning,
lock the tidstore separately. To "lock the tidstore" means to take a
lock, _separate_ from the radix tree's internal lock, to control
access to two fields in a separate "control object":

+typedef struct TidStoreControl
+{
+ /* the number of tids in the store */
+ int64 num_tids;
+
+ /* the maximum bytes a TidStore can use */
+ size_t max_bytes;

I'm pretty sure max_bytes does not need to be in shared memory, and
certainly not under a lock: Thinking of a hypothetical
parallel-prune-phase scenario, one way would be for a leader process
to pass out ranges of blocks to workers, and when the limit is
exceeded, stop passing out blocks and wait for all the workers to
finish.

As for num_tids, vacuum previously put a similar count in

@@ -176,7 +179,8 @@ struct ParallelVacuumState
  PVIndStats *indstats;

  /* Shared dead items space among parallel vacuum workers */
- VacDeadItems *dead_items;
+ TidStore *dead_items;

VacDeadItems contained "num_items". What was the reason to have new
infrastructure for that count? And it doesn't seem like access to it
was controlled by a lock -- can you confirm? If we did get parallel
pruning, maybe the count would belong inside PVShared?

The number of tids is not that tightly bound to the tidstore's job. I
believe tidbitmap.c (a possible future client) doesn't care about the
global number of tids -- not only that, but AND/OR operations can
change the number in a non-obvious way, so it would not be convenient
to keep an accurate number anyway. But the lock would still be
mandatory with this patch.

If we can make vacuum work a bit closer to how it does now, it'd be a
big step up in readability, I think. Namely, getting rid of all the
locking logic inside tidstore.c and let the radix tree's locking do
the right thing. We'd need to make that work correctly when receiving
pointers to values upon lookup, and I already shared ideas for that.
But I want to see if there is any obstacle in the way of removing the
tidstore control object and its separate lock.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-03 Thread John Naylor
On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada  wrote:

> I agree that we expose RT_LOCK_* functions and have tidstore use them,
> but am not sure about the "if (TidStoreIsShared(ts)) LWLockAcquire(..., ...)"
> calls part. I think that even if we expose them, we will still need to
> do something like "if (TidStoreIsShared(ts))
> shared_rt_lock_share(ts->tree.shared)", no?

I'll come back to this topic separately.

> I've attached a new patch set. From v47 patch, I've merged your
> changes for radix tree, and split the vacuum integration patch into 3
> patches: simply replaces VacDeadItems with TidStore (0007 patch), and
> use a simple TID array for one-pass strategy (0008 patch), and replace
> has_lpdead_items with "num_offsets > 0" (0009 patch), while
> incorporating your review comments on the vacuum integration patch

Nice!

> (sorry for making it difficult to see the changes from v47 patch).

It's actually pretty clear. I just have a couple comments before
sharing my latest cleanups:

(diff'ing between v47 and v48):

--   /*
-* In the shared case, TidStoreControl and radix_tree are backed by the
-* same DSA area and rt_memory_usage() returns the value including both.
-* So we don't need to add the size of TidStoreControl separately.
-*/
if (TidStoreIsShared(ts))
-   return sizeof(TidStore) +
shared_rt_memory_usage(ts->tree.shared);
+   rt_mem = shared_rt_memory_usage(ts->tree.shared);
+   else
+   rt_mem = local_rt_memory_usage(ts->tree.local);

-   return sizeof(TidStore) + sizeof(TidStore) +
local_rt_memory_usage(ts->tree.local);
+   return sizeof(TidStore) + sizeof(TidStoreControl) + rt_mem;

Upthread, I meant that I don't see the need to include the size of
these structs *at all*. They're tiny, and the blocks/segments will
almost certainly have some empty space counted in the total anyway.
The returned size is already overestimated, so this extra code is just
a distraction.

- if (result->num_offsets + bmw_popcount(w) > result->max_offset)
+ if (result->num_offsets + (sizeof(bitmapword) * BITS_PER_BITMAPWORD)
>= result->max_offset)

I believe this math is wrong. We care about "result->num_offsets +
BITS_PER_BITMAPWORD", right?
Also, it seems if the condition evaluates to equal, we still have
enough space, in which case ">" the max is the right condition.

- if (off < 1 || off > MAX_TUPLES_PER_PAGE)
+ if (off < 1 || off > MaxOffsetNumber)

This can now use OffsetNumberIsValid().

> 0013 to 0015 patches are also updates from v47 patch.

> I'm thinking that we should change the order of the patches so that
> tidstore patch requires the patch for changing DSA segment sizes. That
> way, we can remove the complex max memory calculation part that we no
> longer use from the tidstore patch.

I don't think there is any reason to have those calculations at all at
this point. Every patch in every version should at least *work
correctly*, without kludging m_w_m and without constraining max
segment size. I'm fine with the latter remaining in its own thread,
and I hope we can consider it an enhancement that respects the admin's
configured limits more effectively, and not a pre-requisite for not
breaking. I *think* we're there now, but it's hard to tell since 0015
was at the very end. As I said recently, if something still fails, I'd
like to know why. So for v49, I took the liberty of removing the DSA
max segment patches for now, and squashing v48-0015.

In addition for v49, I have quite a few cleanups:

0001 - This hasn't been touched in a very long time, but I ran
pgindent and clarified a comment
0002 - We no longer need to isolate the rightmost bit anywhere, so
removed that part and revised the commit message accordingly.

radix tree:
0003 - v48 plus squashed v48-0013
0004 - Removed or adjusted WIP, FIXME, TODO items. Some were outdated,
and I fixed most of the rest.
0005 - Remove the RT_PTR_LOCAL macro, since it's not really useful anymore.
0006 - RT_FREE_LEAF only needs the allocated pointer, so pass that. A
bit simpler.
0007 - Uses the same idea from a previous cleanup of RT_SET, for RT_DELETE.
0008 - Removes a holdover from the multi-value leaves era.
0009 - It occurred to me that we need to have unique names for memory
contexts for different instantiations of the template. This is one way
to do it, by using the configured RT_PREFIX in the context name. I
also took an extra step to make the size class fanout show up
correctly on different platforms, but that's probably overkill and
undesirable, and I'll probably use only the class name next time.
0010/11 - Make the array functions less surprising and with more
informative names.
0012 - Restore a useful technique from Andres's prototype. This part
has been slow for a long time, so much so that it showed up in a profile
where this path wasn't even taken much.

tid store / vacuum:
0013/14 - Same as v48 TID store, with review squashed
0015 - Rationalize comment and starting 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2024-01-02 Thread Masahiko Sawada
On Wed, Dec 27, 2023 at 12:08 PM John Naylor  wrote:
>
> On Tue, Dec 26, 2023 at 12:43 PM Masahiko Sawada  
> wrote:
> >
> > On Thu, Dec 21, 2023 at 4:41 PM John Naylor  wrote:
>
> > > +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber 
> > > *offsets,
> > > + int num_offsets)
> > > +{
> > > + char buf[MaxBlocktableEntrySize];
> > > + BlocktableEntry *page = (BlocktableEntry *) buf;
> > >
> > > I'm not sure this is safe with alignment. Maybe rather than plain
> > > "char", it needs to be a union with BlocktableEntry, or something.
> >
> > I tried it in the new patch set but could you explain why it could not
> > be safe with alignment?
>
> I was thinking because "buf" is just an array of bytes. But, since the
> next declaration is a cast to a pointer to the actual type, maybe we
> can rely on the compiler to do the right thing. (It seems to on my
> machine in any case)

Okay, I kept it.

>
> > > About separation of responsibilities for locking: The only thing
> > > currently where the tid store is not locked is tree iteration. That's
> > > a strange exception. Also, we've recently made RT_FIND return a
> > > pointer, so the caller must somehow hold a share lock, but I think we
> > > haven't exposed callers the ability to do that, and we rely on the tid
> > > store lock for that. We have a mix of tree locking and tid store
> > > locking. We will need to consider carefully how to make this more
> > > clear, maintainable, and understandable.
> >
> > Yes, tidstore should be locked during the iteration.
> >
> > One simple direction about locking is that the radix tree has the lock
> > but no APIs hold/release it. It's the caller's responsibility. If a
> > data structure using a radix tree for its storage has its own lock
> > (like tidstore), it can use it instead of the radix tree's one. A
>
> It looks like the only reason tidstore has its own lock is because it
> has no way to delegate locking to the tree's lock. Instead of working
> around the limitations of the thing we've designed, let's make it work
> for the one use case we have. I think we need to expose RT_LOCK_*
> functions to the outside, and have tid store use them. That would
> allow us to simplify all those "if (TidStoreIsShared(ts))
> LWLockAcquire(..., ...)" calls, which are complex and often redundant.

I agree that we expose RT_LOCK_* functions and have tidstore use them,
but am not sure about the "if (TidStoreIsShared(ts)) LWLockAcquire(..., ...)"
calls part. I think that even if we expose them, we will still need to
do something like "if (TidStoreIsShared(ts))
shared_rt_lock_share(ts->tree.shared)", no?

>
> At some point, we'll probably want to keep locking inside, at least to
> smooth the way for fine-grained locking you mentioned.
>
> > > In a green field, it'd be fine to replace these with an expression of
> > > "num_offsets", but it adds a bit of noise for reviewers and the git
> > > log. Is it really necessary?
> >
> > I see your point. I think we can live with having both
> > has_lpdead_items and num_offsets. But we will have to check if these
> > values are consistent, which could be less maintainable.
>
> It would be clearer if that removal was split out into a separate patch.

Agreed.

>
> > >  I'm also not quite sure why "deadoffsets" and "lpdead_items" got
> > > moved to the PruneState. The latter was renamed in a way that makes
> > > more sense, but I don't see why the churn is necessary.
> ...
> > > I guess it was added here, 800 lines away? If so, why?
> >
> > The above changes are related. The idea is not to use tidstore in a
> > one-pass strategy. If the table doesn't have any indexes, in
> > lazy_scan_prune() we collect offset numbers of dead tuples on the page
> > and vacuum the page using them. In this case, we don't need to use
> > tidstore so we pass the offsets array to lazy_vacuum_heap_page(). The
> > LVPagePruneState is a convenient place to store collected offset
> > numbers.
>
> Okay, that makes sense, but if it was ever explained, I don't
> remember, and there is nothing in the commit message either.
>
> I'm not sure this can be split up easily, but if so it might help reviewing.

Agreed.

>
> This change also leads to a weird-looking control flow:
>
> if (vacrel->nindexes == 0)
> {
>   if (prunestate.num_offsets > 0)
>   {
> ...
>   }
> }
> else if (prunestate.num_offsets > 0)
> {
>   ...
> }

Fixed.

I've attached a new patch set. Starting from the v47 patch, I've
merged your changes for the radix tree, and split the vacuum
integration patch into 3 patches: one simply replaces VacDeadItems
with TidStore (0007 patch), one uses a simple TID array for the
one-pass strategy (0008 patch), and one replaces has_lpdead_items with
"num_offsets > 0" (0009 patch), while incorporating your review
comments on the vacuum integration patch (sorry for making it
difficult to see the changes from the v47 patch).
0013 to 0015 patches are also updates from v47 patch.

I'm thinking that we should change the order of the patches so that
tidstore patch 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-26 Thread John Naylor
On Tue, Dec 26, 2023 at 12:43 PM Masahiko Sawada  wrote:
>
> On Thu, Dec 21, 2023 at 4:41 PM John Naylor  wrote:

> > +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber 
> > *offsets,
> > + int num_offsets)
> > +{
> > + char buf[MaxBlocktableEntrySize];
> > + BlocktableEntry *page = (BlocktableEntry *) buf;
> >
> > I'm not sure this is safe with alignment. Maybe rather than plain
> > "char", it needs to be a union with BlocktableEntry, or something.
>
> I tried it in the new patch set but could you explain why it could not
> be safe with alignment?

I was thinking because "buf" is just an array of bytes. But, since the
next declaration is a cast to a pointer to the actual type, maybe we
can rely on the compiler to do the right thing. (It seems to on my
machine in any case)

> > About separation of responsibilities for locking: The only thing
> > currently where the tid store is not locked is tree iteration. That's
> > a strange exception. Also, we've recently made RT_FIND return a
> > pointer, so the caller must somehow hold a share lock, but I think we
> > haven't exposed callers the ability to do that, and we rely on the tid
> > store lock for that. We have a mix of tree locking and tid store
> > locking. We will need to consider carefully how to make this more
> > clear, maintainable, and understandable.
>
> Yes, tidstore should be locked during the iteration.
>
> One simple direction about locking is that the radix tree has the lock
> but no APIs hold/release it. It's the caller's responsibility. If a
> data structure using a radix tree for its storage has its own lock
> (like tidstore), it can use it instead of the radix tree's one. A

It looks like the only reason tidstore has its own lock is because it
has no way to delegate locking to the tree's lock. Instead of working
around the limitations of the thing we've designed, let's make it work
for the one use case we have. I think we need to expose RT_LOCK_*
functions to the outside, and have tid store use them. That would
allow us to simplify all those "if (TidStoreIsShared(ts))
LWLockAcquire(..., ...)" calls, which are complex and often redundant.

At some point, we'll probably want to keep locking inside, at least to
smooth the way for the fine-grained locking you mentioned.

> > In a green field, it'd be fine to replace these with an expression of
> > "num_offsets", but it adds a bit of noise for reviewers and the git
> > log. Is it really necessary?
>
> I see your point. I think we can live with having both
> has_lpdead_items and num_offsets. But we will have to check if these
> values are consistent, which could be less maintainable.

It would be clearer if that removal was split out into a separate patch.

> >  I'm also not quite sure why "deadoffsets" and "lpdead_items" got
> > moved to the PruneState. The latter was renamed in a way that makes
> > more sense, but I don't see why the churn is necessary.
...
> > I guess it was added here, 800 lines away? If so, why?
>
> The above changes are related. The idea is not to use tidstore in a
> one-pass strategy. If the table doesn't have any indexes, in
> lazy_scan_prune() we collect offset numbers of dead tuples on the page
> and vacuum the page using them. In this case, we don't need to use
> tidstore so we pass the offsets array to lazy_vacuum_heap_page(). The
> LVPagePruneState is a convenient place to store collected offset
> numbers.

Okay, that makes sense, but if it was ever explained, I don't
remember, and there is nothing in the commit message either.

I'm not sure this can be split up easily, but if so it might help reviewing.

This change also leads to a weird-looking control flow:

if (vacrel->nindexes == 0)
{
  if (prunestate.num_offsets > 0)
  {
...
  }
}
else if (prunestate.num_offsets > 0)
{
  ...
}




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-25 Thread Masahiko Sawada
On Thu, Dec 21, 2023 at 4:41 PM John Naylor  wrote:
>
> On Thu, Dec 21, 2023 at 8:33 AM Masahiko Sawada  wrote:
> >
> > I found the following comment and wanted to discuss:
> >
> > // this might be better as "iterate over nodes", plus a callback to
> > RT_DUMP_NODE,
> > // which should really only concern itself with single nodes
> > RT_SCOPE void
> > RT_DUMP(RT_RADIX_TREE *tree)
> >
> > If it means we need to somehow use the iteration functions also for
> > dumping the whole tree, it would probably need to refactor the
> > iteration codes so that the RT_DUMP() can use them while dumping
> > visited nodes. But we need to be careful of not adding overheads to
> > the iteration performance.
>
> Yeah, some months ago I thought a callback interface would make some
> things easier. I don't think we need that at the moment (possibly
> never), so that comment can be just removed. As far as these debug
> functions, I only found useful the stats and dumping a single node,
> FWIW.
>
> I've attached v47, which is v46 plus some fixes for radix tree.
>
> 0004 - moves everything for "delete" to the end -- gradually other
> things will be grouped together in a sensible order
>
> 0005 - trivial

LGTM.

>
> 0006 - shrink nodes -- still needs testing, but nothing crashes yet.

Cool. The coverage test results showed the shrink codes are also covered.

> This shows some renaming might be good: Previously we had
> RT_CHUNK_CHILDREN_ARRAY_COPY for growing nodes, but for shrinking I've
> added RT_COPY_ARRAYS_AND_DELETE, since the deletion happens by simply
> not copying the slot to be deleted. This means when growing it would
> be more clear to call the former RT_COPY_ARRAYS_FOR_INSERT, since that
> reserves a new slot for the caller in the new node, but the caller
> must do the insert itself.

Agreed.

> Note that there are some practical
> restrictions/best-practices on whether shrinking should happen after
> deletion or vice versa. Hopefully it's clear, but let me know if the
> description can be improved. Also, it doesn't yet shrink from size
> class 32 to 16, but it could with a bit of work.

Sounds reasonable.

>
> 0007 - trivial, but could use a better comment. I also need to make
> sure stats reporting works (may also need some cleanup work).
>
> 0008 - fixes RT_FREE_RECURSE -- I believe you wondered some months ago
> if DSA could just free all our allocated segments without throwing
> away the DSA, and that's still a good question.

LGTM.

>
> 0009 - fixes the assert in RT_ITER_SET_NODE_FROM (btw, I don't think
> this name is better than RT_UPDATE_ITER_STACK, so maybe we should go
> back to that).

Will rename it.

> The assert doesn't fire, so I guess it does what it's
> supposed to?

Yes.

> For me, the iteration logic is still the most confusing
> piece out of the whole radix tree. Maybe that could be helped with
> some better variable names, but I wonder if it needs more invasive
> work.

True. Maybe more comments would also help.

>
> 0010 - some fixes for number of children accounting in node256
>
> 0011 - Long overdue pgindent of radixtree.h, without trying to fix up
> afterwards. Feel free to throw out and redo if this interferes with
> ongoing work.
>

LGTM.

I'm working on the below review comments and most of them are already
incorporated on the local branch:

> The rest are from your v46. The bench doesn't work for tid store
> anymore, so I squashed "disable bench for CI" until we get back to
> that. Some more review comments (note: patch numbers are for v47, but
> I changed nothing from v46 in this area):
>
> 0013:
>
> + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
> + * and stored in the radix tree.
>
> Recently outdated. The variable-length values seem to work, so let's
> make everything match.
>
> +#define MAX_TUPLES_PER_PAGE  MaxOffsetNumber
>
> Maybe we don't need this macro anymore? The name no longer fits, in any case.

Removed.

>
> +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber 
> *offsets,
> + int num_offsets)
> +{
> + char buf[MaxBlocktableEntrySize];
> + BlocktableEntry *page = (BlocktableEntry *) buf;
>
> I'm not sure this is safe with alignment. Maybe rather than plain
> "char", it needs to be a union with BlocktableEntry, or something.

I tried it in the new patch set but could you explain why it could not
be safe with alignment?

>
> +static inline BlocktableEntry *
> +tidstore_iter_kv(TidStoreIter *iter, uint64 *key)
> +{
> + if (TidStoreIsShared(iter->ts))
> + return shared_rt_iterate_next(iter->tree_iter.shared, key);
> +
> + return local_rt_iterate_next(iter->tree_iter.local, key);
> +}
>
> In the old encoding scheme, this function did something important, but
> now it's a useless wrapper with one caller.

Removed.

>
> + /*
> + * In the shared case, TidStoreControl and radix_tree are backed by the
> + * same DSA area and rt_memory_usage() returns the value including both.
> + * So we don't need to add the size of TidStoreControl 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-22 Thread John Naylor
On Thu, Dec 21, 2023 at 6:27 PM Andres Freund  wrote:
>
> Could either of you summarize what the design changes you've made in the last
> months are and why you've done them? Unfortunately this thread is very long,
> and the comments in the file just say "FIXME" in places that apparently are
> affected by design changes.  This makes it hard to catch up here.

I'd be happy to try, since we are about due for a summary. I was also
hoping to reach a coherent-enough state sometime in early January to
request your feedback, so good timing. Not sure how much detail to go
into, but here goes:

Back in May [1], the method of value storage shifted towards "combined
pointer-value slots", which was described and recommended in the
paper. There were some other changes for simplicity and efficiency,
but none as far-reaching as this.

This is enabled by using the template architecture that we adopted
long ago for different reasons. Fixed length values are either stored
in the slot of the last-level node (if the value fits into the
platform's pointer), or are a "single-value" leaf (otherwise).

For tid store, we want to eventually support bitmap heap scans (in
addition to vacuum), and in doing so make it independent of heap AM.
That means value types similar to PageTableEntry in tidbitmap.c, but
with a variable number of bitmapwords.

That required radix tree to support variable length values. That has
been the main focus in the last several months, and it basically works
now.

To my mind, the biggest architectural issues in the patch today are:

- Variable-length values mean that pointers are passed around in
places. This will require some shifting responsibility for locking to
the caller, or longer-term maybe a callback interface. (This is new,
the below are pre-existing issues.)
- The tid store has its own "control object" (when shared memory is
needed) with its own lock, in addition to the same for the associated
radix tree. This leads to unnecessary double-locking. This area needs
some attention.
- Memory accounting is still unsettled. The current thinking is to cap
max block/segment size, scaled to a fraction of m_w_m, but there are
still open questions.

There has been some recent effort toward finishing work started
earlier, like shrinking nodes. There are a couple of places that can still
use either simplification or optimization, but otherwise work fine.
Most of the remaining fixmes/todos/wips are trivial; a few are
actually outdated now that I look again, and will be removed shortly.
The regression tests could use some tidying up.

-John

[1] 
https://www.postgresql.org/message-id/CAFBsxsFyWLxweHVDtKb7otOCR4XdQGYR4b%2B9svxpVFnJs08BmQ%40mail.gmail.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-21 Thread Andres Freund
Hi,

On 2023-12-21 14:41:37 +0700, John Naylor wrote:
> I've attached v47, which is v46 plus some fixes for radix tree.

Could either of you summarize what the design changes you've made in the last
months are and why you've done them? Unfortunately this thread is very long,
and the comments in the file just say "FIXME" in places that apparently are
affected by design changes.  This makes it hard to catch up here.

Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-20 Thread John Naylor
On Thu, Dec 21, 2023 at 8:33 AM Masahiko Sawada  wrote:
>
> I found the following comment and wanted to discuss:
>
> // this might be better as "iterate over nodes", plus a callback to
> RT_DUMP_NODE,
> // which should really only concern itself with single nodes
> RT_SCOPE void
> RT_DUMP(RT_RADIX_TREE *tree)
>
> If it means we need to somehow use the iteration functions also for
> dumping the whole tree, it would probably need to refactor the
> iteration codes so that the RT_DUMP() can use them while dumping
> visited nodes. But we need to be careful of not adding overheads to
> the iteration performance.

Yeah, some months ago I thought a callback interface would make some
things easier. I don't think we need that at the moment (possibly
never), so that comment can be just removed. As far as these debug
functions, I only found useful the stats and dumping a single node,
FWIW.

I've attached v47, which is v46 plus some fixes for radix tree.

0004 - moves everything for "delete" to the end -- gradually other
things will be grouped together in a sensible order

0005 - trivial

0006 - shrink nodes -- still needs testing, but nothing crashes yet.
This shows some renaming might be good: Previously we had
RT_CHUNK_CHILDREN_ARRAY_COPY for growing nodes, but for shrinking I've
added RT_COPY_ARRAYS_AND_DELETE, since the deletion happens by simply
not copying the slot to be deleted. This means when growing it would
be more clear to call the former RT_COPY_ARRAYS_FOR_INSERT, since that
reserves a new slot for the caller in the new node, but the caller
must do the insert itself. Note that there are some practical
restrictions/best-practices on whether shrinking should happen after
deletion or vice versa. Hopefully it's clear, but let me know if the
description can be improved. Also, it doesn't yet shrink from size
class 32 to 16, but it could with a bit of work.

0007 - trivial, but could use a better comment. I also need to make
sure stats reporting works (may also need some cleanup work).

0008 - fixes RT_FREE_RECURSE -- I believe you wondered some months ago
if DSA could just free all our allocated segments without throwing
away the DSA, and that's still a good question.

0009 - fixes the assert in RT_ITER_SET_NODE_FROM (btw, I don't think
this name is better than RT_UPDATE_ITER_STACK, so maybe we should go
back to that). The assert doesn't fire, so I guess it does what it's
supposed to? For me, the iteration logic is still the most confusing
piece out of the whole radix tree. Maybe that could be helped with
some better variable names, but I wonder if it needs more invasive
work. I confess I don't have better ideas for how it would work
differently.

0010 - some fixes for number of children accounting in node256

0011 - Long overdue pgindent of radixtree.h, without trying to fix up
afterwards. Feel free to throw out and redo if this interferes with
ongoing work.

The rest are from your v46. The bench doesn't work for tid store
anymore, so I squashed "disable bench for CI" until we get back to
that. Some more review comments (note: patch numbers are for v47, but
I changed nothing from v46 in this area):

0013:

+ * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value,
+ * and stored in the radix tree.

Recently outdated. The variable-length values seem to work, so let's
make everything match.

+#define MAX_TUPLES_PER_PAGE  MaxOffsetNumber

Maybe we don't need this macro anymore? The name no longer fits, in any case.

+TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+ int num_offsets)
+{
+ char buf[MaxBlocktableEntrySize];
+ BlocktableEntry *page = (BlocktableEntry *) buf;

I'm not sure this is safe with alignment. Maybe rather than plain
"char", it needs to be a union with BlocktableEntry, or something.

+static inline BlocktableEntry *
+tidstore_iter_kv(TidStoreIter *iter, uint64 *key)
+{
+ if (TidStoreIsShared(iter->ts))
+ return shared_rt_iterate_next(iter->tree_iter.shared, key);
+
+ return local_rt_iterate_next(iter->tree_iter.local, key);
+}

In the old encoding scheme, this function did something important, but
now it's a useless wrapper with one caller.

+ /*
+ * In the shared case, TidStoreControl and radix_tree are backed by the
+ * same DSA area and rt_memory_usage() returns the value including both.
+ * So we don't need to add the size of TidStoreControl separately.
+ */
+ if (TidStoreIsShared(ts))
+ return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+
+ return sizeof(TidStore) + sizeof(TidStore) +
local_rt_memory_usage(ts->tree.local);

I don't see the point in including these tiny structs, since we will
always blow past the limit by a number of kilobytes (at least, often
megabytes or more) at the time it happens.

+ iter->output.max_offset = 64;

Maybe needs a comment that this is just some starting size and not
anything particular.

+ iter->output.offsets = palloc(sizeof(OffsetNumber) * iter->output.max_offset);

+ /* Make sure there 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-20 Thread Masahiko Sawada
On Thu, Dec 21, 2023 at 10:19 AM John Naylor  wrote:
>
> On Wed, Dec 20, 2023 at 6:36 PM Masahiko Sawada  wrote:
> >
> > I've updated the new patch set that incorporated comments I got so
> > far. 0007, 0008, and 0012 patches are updates from the v45 patch set.
> > In addition to the review comments, I made some changes in tidstore to
> > make it independent from heap. Specifically, it uses MaxOffsetNumber
> > instead of MaxHeapTuplesPerPage. Now we don't need to include
> > htup_details.h. It enlarged MaxBlocktableEntrySize but it's still 272
> > bytes.
>
> That's a good idea.
>
> > BTW regarding the previous comment I got before:
> >
> > > - RT_PTR_ALLOC *slot;
> > > + RT_PTR_ALLOC *slot = NULL;
> > >
> > > We have a macro for invalid pointer because of DSA.
> >
> > I think that since *slot is a pointer to a RT_PTR_ALLOC it's okay to set 
> > NULL.
>
> Ah right, it's the address of the slot.
>
> > I'm going to update RT_DUMP() and RT_DUMP_NODE() codes for the next step.
>
> That could probably use some discussion. A few months ago, I found the
> debugging functions only worked when everything else worked. When
> things weren't working, I had to rip one of these functions apart so
> it only looked at one node. If something is broken, we can't count on
> recursion or iteration working, because we won't get that far. I don't
> remember how things are in the current patch.

Agreed.

I found the following comment and wanted to discuss:

// this might be better as "iterate over nodes", plus a callback to
RT_DUMP_NODE,
// which should really only concern itself with single nodes
RT_SCOPE void
RT_DUMP(RT_RADIX_TREE *tree)

If it means we need to somehow use the iteration functions also for
dumping the whole tree, it would probably need to refactor the
iteration codes so that the RT_DUMP() can use them while dumping
visited nodes. But we need to be careful of not adding overheads to
the iteration performance.

>
> I've finished the node shrinking and addressed some fixme/todo areas
> -- can I share these and squash your v46 changes first?

Cool! Yes, please do so.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-20 Thread John Naylor
On Wed, Dec 20, 2023 at 6:36 PM Masahiko Sawada  wrote:
>
> I've updated the new patch set that incorporated comments I got so
> far. 0007, 0008, and 0012 patches are updates from the v45 patch set.
> In addition to the review comments, I made some changes in tidstore to
> make it independent from heap. Specifically, it uses MaxOffsetNumber
> instead of MaxHeapTuplesPerPage. Now we don't need to include
> htup_details.h. It enlarged MaxBlocktableEntrySize but it's still 272
> bytes.

That's a good idea.

> BTW regarding the previous comment I got before:
>
> > - RT_PTR_ALLOC *slot;
> > + RT_PTR_ALLOC *slot = NULL;
> >
> > We have a macro for invalid pointer because of DSA.
>
> I think that since *slot is a pointer to a RT_PTR_ALLOC it's okay to set NULL.

Ah right, it's the address of the slot.

> I'm going to update RT_DUMP() and RT_DUMP_NODE() codes for the next step.

That could probably use some discussion. A few months ago, I found the
debugging functions only worked when everything else worked. When
things weren't working, I had to rip one of these functions apart so
it only looked at one node. If something is broken, we can't count on
recursion or iteration working, because we won't get that far. I don't
remember how things are in the current patch.

I've finished the node shrinking and addressed some fixme/todo areas
-- can I share these and squash your v46 changes first?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-20 Thread Masahiko Sawada
On Tue, Dec 19, 2023 at 4:37 PM John Naylor  wrote:
>
> On Tue, Dec 19, 2023 at 12:37 PM Masahiko Sawada  
> wrote:
> >
> > On Mon, Dec 18, 2023 at 3:41 PM John Naylor  wrote:
> > > Let's do it in just one place. In TidStoreCreate(), do
> > >
> > > /* clamp max_bytes to at least the size of the empty tree with
> > > allocated blocks, so it doesn't immediately appear full */
> > > ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage);
> > >
> > > Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all 
> > > that.
> >
> > But doesn't it mean that even if we create a shared tidstore with
> > small memory, say 64kB, it actually uses 1MB?
>
> This sounds like an argument for controlling the minimum DSA segment
> size. (I'm not really in favor of that, but open to others' opinion)
>
> I wasn't talking about that above -- I was saying we should have only
> one place where we clamp max_bytes so that the tree doesn't
> immediately appear full.

Thank you for your clarification. Understood.

I've updated the new patch set that incorporated comments I got so
far. 0007, 0008, and 0012 patches are updates from the v45 patch set.
In addition to the review comments, I made some changes in tidstore to
make it independent from heap. Specifically, it uses MaxOffsetNumber
instead of MaxHeapTuplesPerPage. Now we don't need to include
htup_details.h. It enlarged MaxBlocktableEntrySize but it's still 272
bytes.

BTW regarding the previous comment I got before:

> - RT_PTR_ALLOC *slot;
> + RT_PTR_ALLOC *slot = NULL;
>
> We have a macro for invalid pointer because of DSA.

I think that since *slot is a pointer to a RT_PTR_ALLOC it's okay to set NULL.

As for the initial and maximum DSA segment sizes, I've sent a summary
on that thread:

https://www.postgresql.org/message-id/CAD21AoCVMw6DSmgZY9h%2BxfzKtzJeqWiwxaUD2T-FztVcV-XibQ%40mail.gmail.com

I'm going to update RT_DUMP() and RT_DUMP_NODE() codes for the next step.


Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v46-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-18 Thread John Naylor
On Tue, Dec 19, 2023 at 12:37 PM Masahiko Sawada  wrote:
>
> On Mon, Dec 18, 2023 at 3:41 PM John Naylor  wrote:
> > Let's do it in just one place. In TidStoreCreate(), do
> >
> > /* clamp max_bytes to at least the size of the empty tree with
> > allocated blocks, so it doesn't immediately appear full */
> > ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage);
> >
> > Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that.
>
> But doesn't it mean that even if we create a shared tidstore with
> small memory, say 64kB, it actually uses 1MB?

This sounds like an argument for controlling the minimum DSA segment
size. (I'm not really in favor of that, but open to others' opinion)

I wasn't talking about that above -- I was saying we should have only
one place where we clamp max_bytes so that the tree doesn't
immediately appear full.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-18 Thread Masahiko Sawada
On Mon, Dec 18, 2023 at 3:41 PM John Naylor  wrote:
>
> On Fri, Dec 15, 2023 at 3:15 PM Masahiko Sawada  wrote:
> >
> > On Fri, Dec 15, 2023 at 10:30 AM John Naylor  
> > wrote:
>
> > >  parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
> > > - int nrequested_workers, int max_items,
> > > - int elevel, BufferAccessStrategy bstrategy)
> > > + int nrequested_workers, int vac_work_mem,
> > > + int max_offset, int elevel,
> > > + BufferAccessStrategy bstrategy)
> > >
> > > It seems very strange to me that this function has to pass the
> > > max_offset. In general, it's been simpler to assume we have a constant
> > > max_offset, but in this case that fact is not helping. Something to
> > > think about for later.
> >
> > max_offset was previously used in old TID encoding in tidstore. Since
> > tidstore has entries for each block, I think we no longer need it.
>
> It's needed now to properly size the allocation of TidStoreIter which
> contains...
>
> +/* Result struct for TidStoreIterateNext */
> +typedef struct TidStoreIterResult
> +{
> + BlockNumber blkno;
> + int num_offsets;
> + OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
> +} TidStoreIterResult;
>
> Maybe we can palloc the offset array to "almost always" big enough,
> with logic to resize if needed? If not too hard, seems worth it to
> avoid churn in the parameter list.

Yes, I was thinking of that.

>
> > > v45-0010:
> > >
> > > Thinking about this some more, I'm not sure we need to do anything
> > > different for the *starting* segment size. (Controlling *max* size
> > > does seem important, however.) For the corner case of m_w_m = 1MB,
> > > it's fine if vacuum quits pruning immediately after (in effect) it
> > > finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If
> > > the memory accounting starts >1MB because we're adding the trivial
> > > size of some struct, let's just stop doing that. The segment
> > > allocations are what we care about.
> >
> > IIUC it's for work_mem, whose minimum value is 64kB.
> >
> > >
> > > v45-0011:
> > >
> > > + /*
> > > + * max_bytes is forced to be at least 64kB, the current minimum valid
> > > + * value for the work_mem GUC.
> > > + */
> > > + max_bytes = Max(64 * 1024L, max_bytes);
> > >
> > > Why?
> >
> > This is to avoid creating a radix tree within very small memory. The
> > minimum work_mem value is a reasonable lower bound that PostgreSQL
> > uses internally. It's actually copied from tuplesort.c.
>
> There is no explanation for why it should be done like tuplesort.c. Also...
>
> - tree->leaf_ctx = SlabContextCreate(ctx,
> -"radix tree leaves",
> -RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> -sizeof(RT_VALUE_TYPE));
> + tree->leaf_ctx = SlabContextCreate(ctx,
> +"radix tree leaves",
> +Min(RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> +work_mem),
> +sizeof(RT_VALUE_TYPE));
>
> At first, my eyes skipped over this apparent re-indent, but hidden
> inside here is another (undocumented) attempt to clamp the size of
> something. There are too many of these sprinkled in various places,
> and they're already a maintenance hazard -- a different one was left
> behind in v45-0011:
>
> @@ -201,6 +183,7 @@ TidStoreCreate(size_t max_bytes, int max_off,
> dsa_area *area)
> ts->control->max_bytes = max_bytes - (70 * 1024);
>   }
>
> Let's do it in just one place. In TidStoreCreate(), do
>
> /* clamp max_bytes to at least the size of the empty tree with
> allocated blocks, so it doesn't immediately appear full */
> ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage);
>
> Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that.

But doesn't it mean that even if we create a shared tidstore with
small memory, say 64kB, it actually uses 1MB?

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-17 Thread John Naylor
On Fri, Dec 15, 2023 at 3:15 PM Masahiko Sawada  wrote:
>
> On Fri, Dec 15, 2023 at 10:30 AM John Naylor  wrote:

> >  parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
> > - int nrequested_workers, int max_items,
> > - int elevel, BufferAccessStrategy bstrategy)
> > + int nrequested_workers, int vac_work_mem,
> > + int max_offset, int elevel,
> > + BufferAccessStrategy bstrategy)
> >
> > It seems very strange to me that this function has to pass the
> > max_offset. In general, it's been simpler to assume we have a constant
> > max_offset, but in this case that fact is not helping. Something to
> > think about for later.
>
> max_offset was previously used in old TID encoding in tidstore. Since
> tidstore has entries for each block, I think we no longer need it.

It's needed now to properly size the allocation of TidStoreIter which
contains...

+/* Result struct for TidStoreIterateNext */
+typedef struct TidStoreIterResult
+{
+ BlockNumber blkno;
+ int num_offsets;
+ OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER];
+} TidStoreIterResult;

Maybe we can palloc the offset array to "almost always" big enough,
with logic to resize if needed? If not too hard, seems worth it to
avoid churn in the parameter list.

> > v45-0010:
> >
> > Thinking about this some more, I'm not sure we need to do anything
> > different for the *starting* segment size. (Controlling *max* size
> > does seem important, however.) For the corner case of m_w_m = 1MB,
> > it's fine if vacuum quits pruning immediately after (in effect) it
> > finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If
> > the memory accounting starts >1MB because we're adding the trivial
> > size of some struct, let's just stop doing that. The segment
> > allocations are what we care about.
>
> IIUC it's for work_mem, whose minimum value is 64kB.
>
> >
> > v45-0011:
> >
> > + /*
> > + * max_bytes is forced to be at least 64kB, the current minimum valid
> > + * value for the work_mem GUC.
> > + */
> > + max_bytes = Max(64 * 1024L, max_bytes);
> >
> > Why?
>
> This is to avoid creating a radix tree within very small memory. The
> minimum work_mem value is a reasonable lower bound that PostgreSQL
> uses internally. It's actually copied from tuplesort.c.

There is no explanation for why it should be done like tuplesort.c. Also...

- tree->leaf_ctx = SlabContextCreate(ctx,
-"radix tree leaves",
-RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
-sizeof(RT_VALUE_TYPE));
+ tree->leaf_ctx = SlabContextCreate(ctx,
+"radix tree leaves",
+Min(RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
+work_mem),
+sizeof(RT_VALUE_TYPE));

At first, my eyes skipped over this apparent re-indent, but hidden
inside here is another (undocumented) attempt to clamp the size of
something. There are too many of these sprinkled in various places,
and they're already a maintenance hazard -- a different one was left
behind in v45-0011:

@@ -201,6 +183,7 @@ TidStoreCreate(size_t max_bytes, int max_off,
dsa_area *area)
ts->control->max_bytes = max_bytes - (70 * 1024);
  }

Let's do it in just one place. In TidStoreCreate(), do

/* clamp max_bytes to at least the size of the empty tree with
allocated blocks, so it doesn't immediately appear full */
ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage);

Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that.

I may not recall everything while writing this, but it seems the only
other thing we should be clamping is the max aset block size (solved)
/ max DSM segment size (in progress).




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-15 Thread Masahiko Sawada
On Fri, Dec 15, 2023 at 10:30 AM John Naylor  wrote:
>
> On Thu, Dec 14, 2023 at 7:22 AM Masahiko Sawada  wrote:
> > In v45, 0001 - 0006 are from earlier versions but I've merged previous
> > updates. So the radix tree now has RT_SET() and RT_FIND() but not
> > RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous
> > versions that incorporated the above comments. 0009 patch integrates
> > tidstore with lazy vacuum.
>
> Excellent! I repeated a quick run of the small "test 1" with very low m_w_m 
> from
>
> https://www.postgresql.org/message-id/CAFBsxsHrvTPUK%3DC1%3DxweJjGujja4Xjfgva3C8jnW3Shz6RBnFg%40mail.gmail.com
>
> ...and got similar results, so we still have good space-efficiency on this 
> test:
>
> master:
> INFO:  finished vacuuming "john.public.test": index scans: 9
> system usage: CPU: user: 56.83 s, system: 9.36 s, elapsed: 119.62 s
>
> v45:
> INFO:  finished vacuuming "john.public.test": index scans: 1
> system usage: CPU: user: 6.82 s, system: 2.05 s, elapsed: 10.89 s

Thank you for testing it again. That's a very good result.

> For my next steps, I will finish the node-shrinking behavior and save
> for a later patchset. Not needed for tid store, but needs to happen
> because of assumptions in the code. Also, some time ago, I think I
> commented out RT_FREE_RECURSE to get something working, so I'll fix
> it, and look at other fixmes and todos.

Great!

>
> > Note that DSA segment problem is not
> > resolved yet in this patch.
>
> I remember you started a separate thread about this, but I don't think
> it got any attention. Maybe reply with a "TLDR;" and share a patch to
> allow controlling max segment size.

Yeah, I remember that thread. Will send a reply.

>
> Some more comments:
>
> v45-0003:
>
> Since RT_ITERATE_NEXT_PTR works for tid store, do we even need
> RT_ITERATE_NEXT anymore? The former should handle fixed-length values
> just fine? If so, we should rename it to match the latter.

Agreed to rename it.

>
> + * The caller is responsible for locking/unlocking the tree in shared mode.
>
> This is not new to v45, but this will come up again below. This needs
> more explanation: Since we're returning a pointer (to support
> variable-length values), the caller needs to maintain control until
> it's finished with the value.

Will fix.

>
> v45-0005:
>
> + * Regarding the concurrency support, we use a single LWLock for the 
> TidStore.
> + * The TidStore is exclusively locked when inserting encoded tids to the
> + * radix tree or when resetting itself. When searching on the TidStore or
> + * doing the iteration, it is not locked but the underlying radix tree is
> + * locked in shared mode.
>
> This is just stating facts without giving any reasons. Readers are
> going to wonder why it's inconsistent. The "why" is much more
> important than the "what". Even with that, this comment is also far
> from the relevant parts, and so will get out of date. Maybe we can
> just make sure each relevant function is explained individually.

Right, I'll fix it.

>
> v45-0007:
>
> -RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
> +RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, Size work_mem);
>
> Tid store calls this max_bytes -- can we use that name here, too?
> "work_mem" is highly specific.

While I agree that "work_mem" is highly specific, I avoided using
"max_bytes" in the radix tree because "max_bytes" sounds to me like there
is a memory limitation, but the radix tree doesn't actually have one. It
might be sufficient to mention that in a comment, though.

>
> - RT_PTR_ALLOC *slot;
> + RT_PTR_ALLOC *slot = NULL;
>
> We have a macro for invalid pointer because of DSA.

Will fix.

>
> v45-0008:
>
> - if (off < 1 || off > MAX_TUPLES_PER_PAGE)
> + if (unlikely(off < 1 || off > MAX_TUPLES_PER_PAGE))
>   elog(ERROR, "tuple offset out of range: %u", off);
>
> This is a superfluous distraction, since the error path is located way
> off in the cold segment of the binary.

Okay, will remove it.

>
> v45-0009:
>
> (just a few small things for now)
>
> - * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
> - *   vacrel->dead_items array.
> + * lazy_vacuum_heap_page() -- free page's LP_DEAD items.
>
> I think we can keep as "listed in the TID store".
>
> - * Allocate dead_items (either using palloc, or in dynamic shared memory).
> - * Sets dead_items in vacrel for caller.
> + * Allocate a (local or shared) TidStore for storing dead TIDs. Sets 
> dead_items
> + * in vacrel for caller.
>
> I think we want to keep "in dynamic shared memory". It's still true.
> I'm not sure anything needs to change here, actually.

Agreed with above comments. Will fix them.

>
>  parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
> - int nrequested_workers, int max_items,
> - int elevel, BufferAccessStrategy bstrategy)
> + int nrequested_workers, int vac_work_mem,
> + int max_offset, int elevel,
> + BufferAccessStrategy bstrategy)
>
> It seems very strange to me that this function 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-14 Thread John Naylor
On Thu, Dec 14, 2023 at 7:22 AM Masahiko Sawada  wrote:
> In v45, 0001 - 0006 are from earlier versions but I've merged previous
> updates. So the radix tree now has RT_SET() and RT_FIND() but not
> RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous
> versions that incorporated the above comments. 0009 patch integrates
> tidstore with lazy vacuum.

Excellent! I repeated a quick run of the small "test 1" with very low m_w_m from

https://www.postgresql.org/message-id/CAFBsxsHrvTPUK%3DC1%3DxweJjGujja4Xjfgva3C8jnW3Shz6RBnFg%40mail.gmail.com

...and got similar results, so we still have good space-efficiency on this test:

master:
INFO:  finished vacuuming "john.public.test": index scans: 9
system usage: CPU: user: 56.83 s, system: 9.36 s, elapsed: 119.62 s

v45:
INFO:  finished vacuuming "john.public.test": index scans: 1
system usage: CPU: user: 6.82 s, system: 2.05 s, elapsed: 10.89 s

More sparse TID distributions won't be as favorable, but we have ideas
to improve that in the future.

For my next steps, I will finish the node-shrinking behavior and save
for a later patchset. Not needed for tid store, but needs to happen
because of assumptions in the code. Also, some time ago, I think I
commented out RT_FREE_RECURSE to get something working, so I'll fix
it, and look at other fixmes and todos.

> Note that DSA segment problem is not
> resolved yet in this patch.

I remember you started a separate thread about this, but I don't think
it got any attention. Maybe reply with a "TLDR;" and share a patch to
allow controlling max segment size.

Some more comments:

v45-0003:

Since RT_ITERATE_NEXT_PTR works for tid store, do we even need
RT_ITERATE_NEXT anymore? The former should handle fixed-length values
just fine? If so, we should rename it to match the latter.

+ * The caller is responsible for locking/unlocking the tree in shared mode.

This is not new to v45, but this will come up again below. This needs
more explanation: Since we're returning a pointer (to support
variable-length values), the caller needs to maintain control until
it's finished with the value.

v45-0005:

+ * Regarding the concurrency support, we use a single LWLock for the TidStore.
+ * The TidStore is exclusively locked when inserting encoded tids to the
+ * radix tree or when resetting itself. When searching on the TidStore or
+ * doing the iteration, it is not locked but the underlying radix tree is
+ * locked in shared mode.

This is just stating facts without giving any reasons. Readers are
going to wonder why it's inconsistent. The "why" is much more
important than the "what". Even with that, this comment is also far
from the relevant parts, and so will get out of date. Maybe we can
just make sure each relevant function is explained individually.

v45-0007:

-RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx);
+RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, Size work_mem);

Tid store calls this max_bytes -- can we use that name here, too?
"work_mem" is highly specific.

- RT_PTR_ALLOC *slot;
+ RT_PTR_ALLOC *slot = NULL;

We have a macro for invalid pointer because of DSA.

v45-0008:

- if (off < 1 || off > MAX_TUPLES_PER_PAGE)
+ if (unlikely(off < 1 || off > MAX_TUPLES_PER_PAGE))
  elog(ERROR, "tuple offset out of range: %u", off);

This is a superfluous distraction, since the error path is located way
off in the cold segment of the binary.

v45-0009:

(just a few small things for now)

- * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the
- *   vacrel->dead_items array.
+ * lazy_vacuum_heap_page() -- free page's LP_DEAD items.

I think we can keep as "listed in the TID store".

- * Allocate dead_items (either using palloc, or in dynamic shared memory).
- * Sets dead_items in vacrel for caller.
+ * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items
+ * in vacrel for caller.

I think we want to keep "in dynamic shared memory". It's still true.
I'm not sure anything needs to change here, actually.

 parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
- int nrequested_workers, int max_items,
- int elevel, BufferAccessStrategy bstrategy)
+ int nrequested_workers, int vac_work_mem,
+ int max_offset, int elevel,
+ BufferAccessStrategy bstrategy)

It seems very strange to me that this function has to pass the
max_offset. In general, it's been simpler to assume we have a constant
max_offset, but in this case that fact is not helping. Something to
think about for later.

- (errmsg("scanned index \"%s\" to remove %d row versions",
+ (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions",

This should be signed int64.

v45-0010:

Thinking about this some more, I'm not sure we need to do anything
different for the *starting* segment size. (Controlling *max* size
does seem important, however.) For the corner case of m_w_m = 1MB,
it's fine if vacuum quits pruning immediately after (in effect) it
finds the DSA has gone to 2MB. It's not worth 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-13 Thread Masahiko Sawada
On Tue, Dec 12, 2023 at 11:53 AM John Naylor  wrote:
>
> On Mon, Dec 11, 2023 at 1:18 PM Masahiko Sawada  wrote:
>
> > I've attached the updated patch set. From the previous patch set, I've
> > merged patches 0007 to 0010. The other changes such as adding RT_GET()
> > still are unmerged for now, for discussion. Probably we can make them
> > as follow-up patches as we discussed. 0011 to 0015 patches are new
> > changes for v44 patch set, which removes RT_SEARCH() and RT_SET() and
> > support variable-length values.
>
> This looks like the right direction, and I'm pleased it's not much
> additional code on top of my last patch.
>
> v44-0014:
>
> +#ifdef RT_VARLEN_VALUE
> + /* XXX: need to choose block sizes? */
> + tree->leaf_ctx = AllocSetContextCreate(ctx,
> +"radix tree leaves",
> +ALLOCSET_DEFAULT_SIZES);
> +#else
> + tree->leaf_ctx = SlabContextCreate(ctx,
> +"radix tree leaves",
> +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
> +sizeof(RT_VALUE_TYPE));
> +#endif /* RT_VARLEN_VALUE */
>
> Choosing block size: Similar to what we've discussed previously around
> DSA segments, we might model this on CreateWorkExprContext() in
> src/backend/executor/execUtils.c. Maybe tid store can pass maint_w_m /
> autovac_w_m (later work_mem for bitmap scan). RT_CREATE could set the
> max block size to 1/16 of that, or less.
>
> Also, it occurred to me that compile-time embeddable values don't need
> a leaf context. I'm not sure how many places assume that there is
> always a leaf context. If not many, it may be worth not creating one
> here, just to be tidy.
>
> + size_t copysize;
>
> - memcpy(leaf.local, value_p, sizeof(RT_VALUE_TYPE));
> + copysize = sizeof(RT_VALUE_TYPE);
> +#endif
> +
> + memcpy(leaf.local, value_p, copysize);
>
> I'm not sure this indirection adds clarity. I guess the intent was to
> keep from saying "memcpy" twice, but now the code has to say "copysize
> = foo" twice.
>
> For varlen case, we need to watch out for slowness because of memcpy.
> Let's put that off for later testing, though. We may someday want to
> avoid a memcpy call for the varlen case, so let's keep it flexible
> here.
>
> v44-0015:
>
> +#define SizeOfBlocktableEntry (offsetof(
>
> Unused.
>
> + char buf[MaxBlocktableEntrySize] = {0};
>
> Zeroing this buffer is probably going to be expensive. Also see this
> pre-existing comment:
> /* WIP: slow, since it writes to memory for every bit */
> page->words[wordnum] |= ((bitmapword) 1 << bitnum);
>
> For this function (which will be vacuum-only, so we can assume
> ordering), in the loop we can:
> * declare the local bitmapword variable to be zero
> * set the bits on it
> * write it out to the right location when done.
>
> Let's fix both of these at once.
>
> + if (TidStoreIsShared(ts))
> + shared_rt_set(ts->tree.shared, blkno, (void *) page, page_len);
> + else
> + local_rt_set(ts->tree.local, blkno, (void *) page, page_len);
>
> Is there a reason for "void *"? The declared parameter is
> "RT_VALUE_TYPE *value_p" in 0014.
> Also, since this function is for vacuum (and other uses will need a
> new function), let's assert the returned bool is false.
>
> Does iteration still work? If so, it's not too early to re-wire this
> up with vacuum and see how it behaves.
>
> Lastly, my compiler has a warning that CI doesn't have:
>
> In file included from ../src/test/modules/test_radixtree/test_radixtree.c:121:
> ../src/include/lib/radixtree.h: In function ‘rt_find.isra’:
> ../src/include/lib/radixtree.h:2142:24: warning: ‘slot’ may be used
> uninitialized [-Wmaybe-uninitialized]
>  2142 | return (RT_VALUE_TYPE*) slot;
>   |^
> ../src/include/lib/radixtree.h:2112:23: note: ‘slot’ was declared here
>  2112 | RT_PTR_ALLOC *slot;
>   |   ^~~~

Thank you for the comments! I agreed with all of them and incorporated
them into the attached latest patch set, v45.

In v45, 0001 - 0006 are from earlier versions but I've merged previous
updates. So the radix tree now has RT_SET() and RT_FIND() but not
RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous
versions that incorporated the above comments. 0009 patch integrates
tidstore with lazy vacuum. Note that DSA segment problem is not
resolved yet in this patch. 0010 and 0011 makes DSA initial/max
segment size configurable and make parallel vacuum specify both in
proportion to maintenance_work_mem. 0012 is a development-purpose
patch to make it easy to investigate bugs in tidstore. I'd like to
keep it in the patch set at least during the development.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v45-ART.tar.gz
Description: GNU Zip compressed data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-11 Thread John Naylor
On Mon, Dec 11, 2023 at 1:18 PM Masahiko Sawada  wrote:

> I've attached the updated patch set. From the previous patch set, I've
> merged patches 0007 to 0010. The other changes such as adding RT_GET()
> still are unmerged for now, for discussion. Probably we can make them
> as follow-up patches as we discussed. 0011 to 0015 patches are new
> changes for v44 patch set, which removes RT_SEARCH() and RT_SET() and
> support variable-length values.

This looks like the right direction, and I'm pleased it's not much
additional code on top of my last patch.

v44-0014:

+#ifdef RT_VARLEN_VALUE
+ /* XXX: need to choose block sizes? */
+ tree->leaf_ctx = AllocSetContextCreate(ctx,
+"radix tree leaves",
+ALLOCSET_DEFAULT_SIZES);
+#else
+ tree->leaf_ctx = SlabContextCreate(ctx,
+"radix tree leaves",
+RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
+sizeof(RT_VALUE_TYPE));
+#endif /* RT_VARLEN_VALUE */

Choosing block size: Similar to what we've discussed previously around
DSA segments, we might model this on CreateWorkExprContext() in
src/backend/executor/execUtils.c. Maybe tid store can pass maint_w_m /
autovac_w_m (later work_mem for bitmap scan). RT_CREATE could set the
max block size to 1/16 of that, or less.
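The policy in CreateWorkExprContext() could be sketched roughly like this
(the helper below reimplements the rounding with a simple loop; PostgreSQL
itself uses pg_prevpower2_size_t):

```c
#include <assert.h>
#include <stddef.h>

#define ALLOCSET_DEFAULT_INITSIZE (8 * 1024)
#define ALLOCSET_DEFAULT_MAXSIZE  (8 * 1024 * 1024)

/* Round down to a power of two. */
static size_t
prev_power2(size_t v)
{
	size_t		p = 1;

	while (p * 2 <= v)
		p *= 2;
	return p;
}

/*
 * Cap the max allocation block at roughly 1/16 of the memory limit,
 * clamped to the allocator's usual bounds, so a small m_w_m doesn't
 * overshoot by a whole large block.
 */
static size_t
choose_max_block_size(size_t max_bytes)
{
	size_t		max_block = prev_power2(max_bytes / 16);

	if (max_block < ALLOCSET_DEFAULT_INITSIZE)
		max_block = ALLOCSET_DEFAULT_INITSIZE;
	if (max_block > ALLOCSET_DEFAULT_MAXSIZE)
		max_block = ALLOCSET_DEFAULT_MAXSIZE;
	return max_block;
}
```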

Also, it occurred to me that compile-time embeddable values don't need
a leaf context. I'm not sure how many places assume that there is
always a leaf context. If not many, it may be worth not creating one
here, just to be tidy.

+ size_t copysize;

- memcpy(leaf.local, value_p, sizeof(RT_VALUE_TYPE));
+ copysize = sizeof(RT_VALUE_TYPE);
+#endif
+
+ memcpy(leaf.local, value_p, copysize);

I'm not sure this indirection adds clarity. I guess the intent was to
keep from saying "memcpy" twice, but now the code has to say "copysize
= foo" twice.

For varlen case, we need to watch out for slowness because of memcpy.
Let's put that off for later testing, though. We may someday want to
avoid a memcpy call for the varlen case, so let's keep it flexible
here.

v44-0015:

+#define SizeOfBlocktableEntry (offsetof(

Unused.

+ char buf[MaxBlocktableEntrySize] = {0};

Zeroing this buffer is probably going to be expensive. Also see this
pre-existing comment:
/* WIP: slow, since it writes to memory for every bit */
page->words[wordnum] |= ((bitmapword) 1 << bitnum);

For this function (which will be vacuum-only, so we can assume
ordering), in the loop we can:
* declare the local bitmapword variable to be zero
* set the bits on it
* write it out to the right location when done.

Let's fix both of these at once.
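The accumulate-then-flush loop could look something like this (a
self-contained sketch with simplified types, not the patch itself; it
relies on vacuum supplying offsets in ascending order):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t bitmapword;

#define BITS_PER_BITMAPWORD 64
#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
#define BITNUM(x)  ((x) % BITS_PER_BITMAPWORD)

/*
 * Build each bitmapword in a local variable and store it once per word,
 * rather than read-modify-writing memory for every bit.  Skipped words
 * are zeroed on the way, so the caller needn't pre-zero the buffer.
 */
static void
set_offsets(bitmapword *words, const uint16_t *offsets, int num_offsets)
{
	bitmapword	word = 0;
	int			wordnum = 0;

	for (int i = 0; i < num_offsets; i++)
	{
		int			w = WORDNUM(offsets[i]);

		if (w != wordnum)
		{
			words[wordnum] = word;	/* flush the finished word */
			while (++wordnum < w)
				words[wordnum] = 0; /* zero any skipped words */
			word = 0;
		}
		word |= (bitmapword) 1 << BITNUM(offsets[i]);
	}
	words[wordnum] = word;			/* flush the last word */
}
```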

+ if (TidStoreIsShared(ts))
+ shared_rt_set(ts->tree.shared, blkno, (void *) page, page_len);
+ else
+ local_rt_set(ts->tree.local, blkno, (void *) page, page_len);

Is there a reason for "void *"? The declared parameter is
"RT_VALUE_TYPE *value_p" in 0014.
Also, since this function is for vacuum (and other uses will need a
new function), let's assert the returned bool is false.

Does iteration still work? If so, it's not too early to re-wire this
up with vacuum and see how it behaves.

Lastly, my compiler has a warning that CI doesn't have:

In file included from ../src/test/modules/test_radixtree/test_radixtree.c:121:
../src/include/lib/radixtree.h: In function ‘rt_find.isra’:
../src/include/lib/radixtree.h:2142:24: warning: ‘slot’ may be used
uninitialized [-Wmaybe-uninitialized]
 2142 | return (RT_VALUE_TYPE*) slot;
  |^
../src/include/lib/radixtree.h:2112:23: note: ‘slot’ was declared here
 2112 | RT_PTR_ALLOC *slot;
  |   ^~~~




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-10 Thread Masahiko Sawada
On Fri, Dec 8, 2023 at 9:44 PM Masahiko Sawada  wrote:
>
> On Fri, Dec 8, 2023 at 7:46 PM John Naylor  wrote:
> >
> > On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada  
> > wrote:
> > >
> > > BTW Given that the actual value size can be calculated only by the
> > > caller, how does the tree know if the value is embedded or not? It's
> > > probably related to how to store combined pointer/value slots.
> >
> > Right, this is future work. At first, variable-length types will have
> > to be single-value leaves. In fact, the idea for storing up to 3
> > offsets in the bitmap header could be done this way -- it would just
> > be a (small) single-value leaf.
>
> Agreed.
>
> >
> > (Reminder: Currently, fixed-length values are compile-time embeddable
> > if the platform pointer size is big enough.)
> >
> > > If leaf
> > > nodes have a bitmap array that indicates the corresponding slot is an
> > > embedded value or a pointer to a value, it would be easy.
> >
> > That's the most general way to do it. We could do it much more easily
> > with a pointer tag, although for the above idea it may require some
> > endian-aware coding. Both were mentioned in the paper, I recall.
>
> True. Probably we can use the combined pointer/value slots approach
> only if the tree is able to use the pointer tagging. That is, if the
> caller allows the tree to use one bit of the value.
>
> I'm going to update the patch based on the recent discussion (RT_SET()
> and variable-length values) etc., and post the patch set early next
> week.

I've attached the updated patch set. From the previous patch set, I've
merged patches 0007 to 0010. The other changes such as adding RT_GET()
still are unmerged for now, for discussion. Probably we can make them
as follow-up patches as we discussed. 0011 to 0015 patches are new
changes for v44 patch set, which removes RT_SEARCH() and RT_SET() and
support variable-length values.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v44-ART.tar.gz
Description: GNU Zip compressed data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-08 Thread Masahiko Sawada
On Fri, Dec 8, 2023 at 7:46 PM John Naylor  wrote:
>
> On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada  wrote:
> >
> > BTW Given that the actual value size can be calculated only by the
> > caller, how does the tree know if the value is embedded or not? It's
> > probably related to how to store combined pointer/value slots.
>
> Right, this is future work. At first, variable-length types will have
> to be single-value leaves. In fact, the idea for storing up to 3
> offsets in the bitmap header could be done this way -- it would just
> be a (small) single-value leaf.

Agreed.

>
> (Reminder: Currently, fixed-length values are compile-time embeddable
> if the platform pointer size is big enough.)
>
> > If leaf
> > nodes have a bitmap array that indicates the corresponding slot is an
> > embedded value or a pointer to a value, it would be easy.
>
> That's the most general way to do it. We could do it much more easily
> with a pointer tag, although for the above idea it may require some
> endian-aware coding. Both were mentioned in the paper, I recall.

True. Probably we can use the combined pointer/value slots approach
only if the tree is able to use the pointer tagging. That is, if the
caller allows the tree to use one bit of the value.

I'm going to update the patch based on the recent discussion (RT_SET()
and variable-length values) etc., and post the patch set early next
week.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-08 Thread John Naylor
On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada  wrote:
>
> BTW Given that the actual value size can be calculated only by the
> caller, how does the tree know if the value is embedded or not? It's
> probably related to how to store combined pointer/value slots.

Right, this is future work. At first, variable-length types will have
to be single-value leaves. In fact, the idea for storing up to 3
offsets in the bitmap header could be done this way -- it would just
be a (small) single-value leaf.

(Reminder: Currently, fixed-length values are compile-time embeddable
if the platform pointer size is big enough.)

> If leaf
> nodes have a bitmap array that indicates the corresponding slot is an
> embedded value or a pointer to a value, it would be easy.

That's the most general way to do it. We could do it much more easily
with a pointer tag, although for the above idea it may require some
endian-aware coding. Both were mentioned in the paper, I recall.

> But since
> the bitmap array is needed only in the leaf nodes, internal nodes and
> leaf nodes will no longer be identical structure, which is not a bad
> thing to me, though.

Absolutely no way we are going back to double everything: double
types, double functions, double memory contexts. Plus, that bitmap in
inner nodes could indicate a pointer to a leaf that got there by "lazy
expansion".




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-08 Thread Masahiko Sawada
On Fri, Dec 8, 2023 at 3:45 PM Masahiko Sawada  wrote:
>
> On Fri, Dec 8, 2023 at 1:37 PM John Naylor  wrote:
> >
> > On Fri, Dec 8, 2023 at 8:57 AM Masahiko Sawada  
> > wrote:
> >
> > > It's still unclear to me why the value doesn't need to contain the size.
> > >
> > > If I understand you correctly, in RT_GET(), the tree allocs a new
> > > memory and updates the slot where the value is embedded with the
> > > pointer to the allocated memory, and returns the pointer to the
> > > caller. Since the returned value, newly allocated memory, is still
> > > empty, the caller needs to copy the contents of the old value to the
> > > new value and do whatever else it needs to.
> > >
> > > If the value is already a single-value leaf and RT_GET() is called
> > > with a larger size, the slot is always replaced with the newly
> > > allocated area and the caller needs to copy the contents? If the tree
> > > does realloc the value with a new size, how does the tree know the new
> > > value is larger than the existing value? It seems like the caller
> > > needs to provide a function to calculate the size of the value based
> > > on the length.
> >
> > Right. My brief description mentioned one thing without details: The
> > caller would need to control whether to re-alloc. RT_GET would pass
> > the size. If nothing is found, the tree would allocate. If there is a
> > value already, just return it. That means both the address of the
> > slot, and the local pointer to the value (with embedded, would be the
> > same address).

BTW Given that the actual value size can be calculated only by the
caller, how does the tree know if the value is embedded or not? It's
probably related to how to store combined pointer/value slots. If leaf
nodes have a bitmap array that indicates the corresponding slot is an
embedded value or a pointer to a value, it would be easy. But since
the bitmap array is needed only in the leaf nodes, internal nodes and
leaf nodes will no longer be identical structure, which is not a bad
thing to me, though.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-07 Thread Masahiko Sawada
On Fri, Dec 8, 2023 at 1:37 PM John Naylor  wrote:
>
> On Fri, Dec 8, 2023 at 8:57 AM Masahiko Sawada  wrote:
>
> > It's still unclear to me why the value doesn't need to contain the size.
> >
> > If I understand you correctly, in RT_GET(), the tree allocs a new
> > memory and updates the slot where the value is embedded with the
> > pointer to the allocated memory, and returns the pointer to the
> > caller. Since the returned value, newly allocated memory, is still
> > empty, the caller needs to copy the contents of the old value to the
> > new value and do whatever else it needs to.
> >
> > If the value is already a single-value leaf and RT_GET() is called
> > with a larger size, the slot is always replaced with the newly
> > allocated area and the caller needs to copy the contents? If the tree
> > does realloc the value with a new size, how does the tree know the new
> > value is larger than the existing value? It seems like the caller
> > needs to provide a function to calculate the size of the value based
> > on the length.
>
> Right. My brief description mentioned one thing without details: The
> caller would need to control whether to re-alloc. RT_GET would pass
> the size. If nothing is found, the tree would allocate. If there is a
> value already, just return it. That means both the address of the
> slot, and the local pointer to the value (with embedded, would be the
> same address). The caller checks if the array is long enough. If not,
> call a new function that takes the new size, the address of the slot,
> and the pointer to the old value. The tree would re-alloc, put the
> alloc pointer in the slot and return the new local pointer. But as we
> agreed, that is all follow-up work.

Thank you for the detailed explanation. That makes sense to me. We
will address it as follow-up work.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-07 Thread John Naylor
On Fri, Dec 8, 2023 at 8:57 AM Masahiko Sawada  wrote:

> It's still unclear to me why the value doesn't need to contain the size.
>
> If I understand you correctly, in RT_GET(), the tree allocs a new
> memory and updates the slot where the value is embedded with the
> pointer to the allocated memory, and returns the pointer to the
> caller. Since the returned value, newly allocated memory, is still
> empty, the caller needs to copy the contents of the old value to the
> new value and do whatever else it needs to.
>
> If the value is already a single-value leaf and RT_GET() is called
> with a larger size, the slot is always replaced with the newly
> allocated area and the caller needs to copy the contents? If the tree
> does realloc the value with a new size, how does the tree know the new
> value is larger than the existing value? It seems like the caller
> needs to provide a function to calculate the size of the value based
> on the length.

Right. My brief description mentioned one thing without details: The
caller would need to control whether to re-alloc. RT_GET would pass
the size. If nothing is found, the tree would allocate. If there is a
value already, just return it. That means both the address of the
slot, and the local pointer to the value (with embedded, would be the
same address). The caller checks if the array is long enough. If not,
call a new function that takes the new size, the address of the slot,
and the pointer to the old value. The tree would re-alloc, put the
alloc pointer in the slot and return the new local pointer. But as we
agreed, that is all follow-up work.
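The caller-controlled re-alloc protocol described above could be sketched
as follows. All names here are made up for illustration (grow_value stands
in for the proposed new tree function); the real interface would take the
slot address handed back by RT_GET:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable-length value: an offsets array of known capacity. */
typedef struct Value
{
	int			capacity;		/* how many offsets fit */
	int			num;			/* how many are in use */
	uint16_t	offsets[];
} Value;

/*
 * The tree re-allocs, preserves the old contents, puts the new pointer in
 * the slot, and returns the new local pointer.  The caller decides when
 * to call this, after checking that the current value is too small.
 */
static Value *
grow_value(Value **slot, int new_capacity)
{
	Value	   *old = *slot;
	Value	   *new_val = malloc(sizeof(Value) + new_capacity * sizeof(uint16_t));

	new_val->capacity = new_capacity;
	new_val->num = old ? old->num : 0;
	if (old)
	{
		memcpy(new_val->offsets, old->offsets, old->num * sizeof(uint16_t));
		free(old);
	}
	*slot = new_val;			/* tree updates its slot */
	return new_val;
}
```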




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-07 Thread Masahiko Sawada
On Thu, Dec 7, 2023 at 12:27 PM John Naylor  wrote:
>
> On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada  wrote:
> >
> > On Sat, Oct 28, 2023 at 5:56 PM John Naylor  wrote:
>
> > bool
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> > or for variable-length value support,
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
> >
> > If an entry already exists, update its value to 'value_p' and return
> > true. Otherwise set the value and return false.
>
> > RT_VALUE_TYPE
> > RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
> >
> > If the entry already exists, replace the value with a new empty value
> > with sz bytes and set "found" to true. Otherwise, insert an empty
> > value, return its pointer, and set "found" to false.
> >
> > We probably will find a better name but I use RT_INSERT() for
> > discussion. RT_INSERT() returns an empty slot regardless of existing
> > values. It can be used to insert a new value or to replace the value
> > with a larger value.
>
> Looking at TidStoreSetBlockOffsets again (in particular how it works
> with RT_GET), and thinking about issues we've discussed, I think
> RT_SET is sufficient for vacuum. Here's how it could work:
>
> TidStoreSetBlockOffsets could have a stack variable that's "almost
> always" large enough. When not, it can allocate in its own context. It
> sets the necessary bits there. Then, it passes the pointer to RT_SET
> with the number of bytes to copy. That seems very simple.

Right.
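The stack-buffer pattern described above could be sketched like this
(names and the buffer size are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define STACK_WORDS 16			/* "almost always" large enough (made-up) */

/*
 * Build the value in a fixed-size local buffer when it fits, and fall
 * back to a heap allocation only for unusually large bitmaps.  The
 * returned byte count is what would be passed to RT_SET along with the
 * pointer.
 */
static size_t
build_value(unsigned long *stack_buf, unsigned long **value_p, int nwords)
{
	if (nwords <= STACK_WORDS)
		*value_p = stack_buf;	/* common case: no allocation */
	else
		*value_p = malloc(nwords * sizeof(unsigned long));
	memset(*value_p, 0, nwords * sizeof(unsigned long));
	return nwords * sizeof(unsigned long);
}
```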

>
> At some future time, we can add a new function with the complex
> business about getting the current value to modify it, with the
> re-alloc'ing that it might require.
>
> In other words, from both an API perspective and a performance
> perspective, it makes sense for tid store to have a simple "set"
> interface for vacuum that can be optimized for its characteristics
> (insert only, ordered offsets). And also a more complex one for bitmap
> scan (setting/unsetting bits of existing values, in any order). They
> can share the same iteration interface, key types, and value types.
>
> What do you think, Masahiko?

Good point. RT_SET() would be faster than RT_GET() and updating the
value because RT_SET() would not need to take care of the existing
value (its size, embedded or not, realloc etc).

I think that we can separate the radix tree patch into two parts: the
main implementation with RT_SET(), and more complex APIs such as
RT_GET() etc. That way, it would probably make it easy to complete the
radix tree and tidstore first.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-07 Thread Masahiko Sawada
On Wed, Dec 6, 2023 at 3:39 PM John Naylor  wrote:
>
> On Wed, Dec 6, 2023 at 4:34 AM Masahiko Sawada  wrote:
> >
> > On Mon, Dec 4, 2023 at 5:21 PM John Naylor  wrote:
>
> > > > Given variable-length value support, RT_GET() would have to do
> > > > repalloc() if the existing value size is not big enough for the new
> > > > value, but it cannot as the radix tree doesn't know the size of each
> > > > stored value.
> > >
> > > I think we have two choices:
> > >
> > > - the value stores the "length". The caller would need to specify a
> > > function to compute size from the "length" member. Note this assumes
> > > there is an array. I think both aspects are not great.
> > > - the value stores the "size". Callers that store an array (as
> > > PageTableEntry's do) would compute length when they need to. This
> > > sounds easier.
> >
> > As for the second idea, do we always need to require the value to have
> > the "size" (e.g. int32) in the first field of its struct? If so, the
> > caller will be able to use only 4 bytes in embedded value cases (or
> > won't be able to use it at all if the pointer size is 4 bytes).
>
> We could have an RT_SIZE_TYPE for varlen value types. That's easy.
> There is another way, though: (This is a digression into embedded
> values, but it does illuminate some issues even aside from that)
>
> My thinking a while ago was that an embedded value had no explicit
> length/size, but could be "expanded" into a conventional value for the
> caller. For bitmaps, the smallest full value would have length 1 and
> whatever size (For tid store maybe 16 bytes). This would happen
> automatically via a template function.
>
> Now I think that could be too complicated (especially for page table
> entries, which have more bookkeeping than vacuum needs) and slow.
> Imagine this as an embedded value:
>
> typedef struct BlocktableEntry
> {
>   uint16 size;
>
>   /* later: uint8 flags; for bitmap scan */
>
>   /* 64 bit: 3 elements , 32-bit: 1 element */
>   OffsetNumber offsets[( sizeof(Pointer) - sizeof(int16) ) /
> sizeof(OffsetNumber)];
>
>   /* end of embeddable value */
>
>   bitmapword words[FLEXIBLE_ARRAY_MEMBER];
> } BlocktableEntry;
>
> Here we can use a slot to store up to 3 offsets, no matter how big
> they are. That's great because a bitmap could be mostly wasted space.

Interesting idea.

> But now the caller can't know up front how many bytes it needs until
> it retrieves the value and sees what's already there. If there are
> already three values, the caller needs to tell the tree "alloc this
> much, update this slot you just gave me with the alloc (maybe DSA)
> pointer, and return the local pointer". Then copy the 3 offsets into
> set bits, and set whatever else it needs to.  With normal values, same
> thing, but with realloc.
>
> This is a bit complex, but I see an advantage: The tree doesn't need to
> care so much about the size, so the value doesn't need to contain the
> size. For our case, we can use length (number of bitmapwords) without
> the disadvantages I mentioned above, with length zero (or maybe -1)
> meaning "no bitmapword array, the offsets are all in this small
> array".

It's still unclear to me why the value doesn't need to contain the size.

If I understand you correctly, in RT_GET(), the tree allocs a new
memory and updates the slot where the value is embedded with the
pointer to the allocated memory, and returns the pointer to the
caller. Since the returned value, newly allocated memory, is still
empty, the caller needs to copy the contents of the old value to the
new value and do whatever else it needs to.

If the value is already a single-leaf value and RT_GET() is called
with a larger size, the slot is always replaced with the newly
allocated area and the caller needs to copy the contents? If the tree
does realloc the value with a new size, how does the tree know the new
value is larger than the existing value? It seems like the caller
needs to provide a function to calculate the size of the value based
on the length.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-06 Thread John Naylor
On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada  wrote:
>
> On Sat, Oct 28, 2023 at 5:56 PM John Naylor  wrote:

> bool
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> or for variable-length value support,
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
>
> If an entry already exists, update its value to 'value_p' and return
> true. Otherwise set the value and return false.

> RT_VALUE_TYPE
> RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
>
> If the entry already exists, replace the value with a new empty value
> with sz bytes and set "found" to true. Otherwise, insert an empty
> value, return its pointer, and set "found" to false.
>
> We probably will find a better name but I use RT_INSERT() for
> discussion. RT_INSERT() returns an empty slot regardless of existing
> values. It can be used to insert a new value or to replace the value
> with a larger value.

Looking at TidStoreSetBlockOffsets again (in particular how it works
with RT_GET), and thinking about issues we've discussed, I think
RT_SET is sufficient for vacuum. Here's how it could work:

TidStoreSetBlockOffsets could have a stack variable that's "almost
always" large enough. When not, it can allocate in its own context. It
sets the necessary bits there. Then, it passes the pointer to RT_SET
with the number of bytes to copy. That seems very simple.
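As a standalone illustration of that flow, here is a hedged sketch. Everything below is a stand-in: the entry layout, the `rt_set_stub()` function, and `tidstore_set_block_offsets()` are hypothetical simplifications, not code from the patch set, and the real RT_SET is a template function operating on a tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t OffsetNumber;   /* stand-in for PostgreSQL's OffsetNumber */

/* hypothetical fixed-size entry, large enough for "almost all" blocks */
typedef struct BlocktableEntry
{
    uint16_t nwords;             /* number of bitmap words in use */
    uint64_t words[4];           /* offset bitmap */
} BlocktableEntry;

/* stub standing in for the radix tree's RT_SET(tree, key, value, size) */
static size_t last_set_size;

static void
rt_set_stub(uint64_t blkno, const void *value, size_t sz)
{
    (void) blkno;
    (void) value;
    last_set_size = sz;          /* a real tree would copy sz bytes */
}

/* build the bitmap on the stack, then pass only the bytes actually used */
static void
tidstore_set_block_offsets(uint64_t blkno,
                           const OffsetNumber *offsets, int noffs)
{
    BlocktableEntry entry;       /* stack variable, "almost always" enough */

    memset(&entry, 0, sizeof(entry));
    for (int i = 0; i < noffs; i++)
    {
        int     wordnum = offsets[i] / 64;

        entry.words[wordnum] |= UINT64_C(1) << (offsets[i] % 64);
        if (wordnum + 1 > entry.nwords)
            entry.nwords = (uint16_t) (wordnum + 1);
    }
    rt_set_stub(blkno, &entry,
                offsetof(BlocktableEntry, words) +
                entry.nwords * sizeof(uint64_t));
}
```

The point of the sketch is only that the tree is handed a pointer and a byte count, so it never needs to understand the value's internal structure.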

At some future time, we can add a new function with the complex
business about getting the current value to modify it, with the
re-alloc'ing that it might require.

In other words, from both an API perspective and a performance
perspective, it makes sense for tid store to have a simple "set"
interface for vacuum that can be optimized for its characteristics
(insert only, ordered offsets). And also a more complex one for bitmap
scan (setting/unsetting bits of existing values, in any order). They
can share the same iteration interface, key types, and value types.

What do you think, Masahiko?




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-05 Thread John Naylor
On Wed, Dec 6, 2023 at 4:34 AM Masahiko Sawada  wrote:
>
> On Mon, Dec 4, 2023 at 5:21 PM John Naylor  wrote:

> > > Given variable-length value support, RT_GET() would have to do
> > > repalloc() if the existing value size is not big enough for the new
> > > value, but it cannot as the radix tree doesn't know the size of each
> > > stored value.
> >
> > I think we have two choices:
> >
> > - the value stores the "length". The caller would need to specify a
> > function to compute size from the "length" member. Note this assumes
> > there is an array. I think both aspects are not great.
> > - the value stores the "size". Callers that store an array (as
> > PageTableEntry's do) would compute length when they need to. This
> > sounds easier.
>
> As for the second idea, do we always need to require the value to have
> the "size" (e.g. int32) in the first field of its struct? If so, the
> caller will be able to use only 4 bytes in embedded value cases (or
> won't be able to use it at all if the pointer size is 4 bytes).

We could have an RT_SIZE_TYPE for varlen value types. That's easy.
There is another way, though: (This is a digression into embedded
values, but it does illuminate some issues even aside from that)

My thinking a while ago was that an embedded value had no explicit
length/size, but could be "expanded" into a conventional value for the
caller. For bitmaps, the smallest full value would have length 1 and
whatever size (For tid store maybe 16 bytes). This would happen
automatically via a template function.

Now I think that could be too complicated (especially for page table
entries, which have more bookkeeping than vacuum needs) and slow.
Imagine this as an embedded value:

typedef struct BlocktableEntry
{
  uint16 size;

  /* later: uint8 flags; for bitmap scan */

  /* 64 bit: 3 elements , 32-bit: 1 element */
  OffsetNumber offsets[( sizeof(Pointer) - sizeof(int16) ) /
sizeof(OffsetNumber)];

  /* end of embeddable value */

  bitmapword words[FLEXIBLE_ARRAY_MEMBER];
} BlocktableEntry;

Here we can use a slot to store up to 3 offsets, no matter how big
they are. That's great because a bitmap could be mostly wasted space.
But now the caller can't know up front how many bytes it needs until
it retrieves the value and sees what's already there. If there are
already three values, the caller needs to tell the tree "alloc this
much, update this slot you just gave me with the alloc (maybe DSA)
pointer, and return the local pointer". Then copy the 3 offsets into
set bits, and set whatever else it needs to. With normal values, same
thing, but with realloc.

This is a bit complex, but I see an advantage: The tree doesn't need to
care so much about the size, so the value doesn't need to contain the
size. For our case, we can use length (number of bitmapwords) without
the disadvantages I mentioned above, with length zero (or maybe -1)
meaning "no bitmapword array, the offsets are all in this small
array".

> > > Another idea is that the radix tree returns the pointer
> > > to the slot and the caller updates the value accordingly.
> >
> > I did exactly this in v43 TidStore if I understood you correctly. If I
> > misunderstood you, can you clarify?
>
> I meant to expose RT_GET_SLOT_RECURSIVE() so that the caller updates
> the value as they want.

Did my sketch above get closer to that? Side note: I don't think we
can expose that directly (e.g. need to check for create or extend
upwards), but some functionality can be a thin wrapper around it.

> > However, for vacuum, we have all values that we need up front. That
> > gives me an idea: Something like this insert API could be optimized
> > for "insert-only": If we only free values when we free the whole tree
> > at the end, that's a clear use case for David Rowley's proposed "bump
> > context", which would save 8 bytes per allocation and be a bit faster.
> > [1] (RT_GET for varlen values would use an aset context, to allow
> > repalloc, and nodes would continue to use slab).
>
> Interesting idea and worth trying. Do we need to protect the whole
> tree as insert-only for safety? It's problematic if the user mixes
> RT_INSERT() and RT_GET().

You're right, but I'm not sure what the policy should be.




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-05 Thread Masahiko Sawada
On Mon, Dec 4, 2023 at 5:21 PM John Naylor  wrote:
>
> On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada  wrote:
>
> > Since the variable-length values support is a big deal and would be
> > related to API design I'd like to discuss the API design first.
>
> Thanks for the fine summary of the issues here.
>
> [Swapping this back in my head]
>
> > RT_VALUE_TYPE
> > RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found);
> > or for variable-length value support,
> > RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
> >
> > If an entry already exists, return its pointer and set "found" to
> > true. Otherwise, insert an empty value with sz bytes, return its
> > pointer, and set "found" to false.
>
> > ---
> > bool
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> > or for variable-length value support,
> > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
> >
> > If an entry already exists, update its value to 'value_p' and return
> > true. Otherwise set the value and return false.
>
> I'd have to double-check, but I think RT_SET is vestigial and I'm not
> sure it has any advantage over RT_GET as I've sketched it out. I'm
> pretty sure it's only there now because changing the radix tree
> regression tests is much harder than changing TID store.

Agreed.

>
> > Given variable-length value support, RT_GET() would have to do
> > repalloc() if the existing value size is not big enough for the new
> > value, but it cannot as the radix tree doesn't know the size of each
> > stored value.
>
> I think we have two choices:
>
> - the value stores the "length". The caller would need to specify a
> function to compute size from the "length" member. Note this assumes
> there is an array. I think both aspects are not great.
> - the value stores the "size". Callers that store an array (as
> PageTableEntry's do) would compute length when they need to. This
> sounds easier.

As for the second idea, do we always need to require the value to have
the "size" (e.g. int32) in the first field of its struct? If so, the
caller will be able to use only 4 bytes in embedded value cases (or
won't be able to use it at all if the pointer size is 4 bytes).

>
> > Another idea is that the radix tree returns the pointer
> > to the slot and the caller updates the value accordingly.
>
> I did exactly this in v43 TidStore if I understood you correctly. If I
> misunderstood you, can you clarify?

I meant to expose RT_GET_SLOT_RECURSIVE() so that the caller updates
the value as they want.

>
> > But it means
> > that the caller has to update the slot properly while considering the
> > value size (embedded vs. single-leaf value), which seems not a good
> > idea.
>
> For this optimization, callers will have to know about pointer-sized
> values and treat them differently, but they don't need to know the
> details about where they are stored.
>
> While we want to keep embedded values in the back of our minds, I
> really think the details should be postponed to a follow-up commit.

Agreed.

>
> > To deal with this problem, I think we can somewhat change RT_GET() API
> > as follow:
> >
> > RT_VALUE_TYPE
> > RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
> >
> > If the entry already exists, replace the value with a new empty value
> > with sz bytes and set "found" to true. Otherwise, insert an empty
> > value, return its pointer, and set "found" to false.
> >
> > We probably will find a better name but I use RT_INSERT() for
> > discussion. RT_INSERT() returns an empty slot regardless of existing
> > values. It can be used to insert a new value or to replace the value
> > with a larger value.
>
> For the case we are discussing, bitmaps, updating an existing value is
> a bit tricky. We need the existing value to properly update it with
> set or unset bits. This can't work in general without a lot of work
> for the caller.

True.

>
> However, for vacuum, we have all values that we need up front. That
> gives me an idea: Something like this insert API could be optimized
> for "insert-only": If we only free values when we free the whole tree
> at the end, that's a clear use case for David Rowley's proposed "bump
> context", which would save 8 bytes per allocation and be a bit faster.
> [1] (RT_GET for varlen values would use an aset context, to allow
> repalloc, and nodes would continue to use slab).

Interesting idea and worth trying. Do we need to protect the whole
tree as insert-only for safety? It's problematic if the user mixes
RT_INSERT() and RT_GET().

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-12-04 Thread John Naylor
On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada  wrote:

> Since the variable-length values support is a big deal and would be
> related to API design I'd like to discuss the API design first.

Thanks for the fine summary of the issues here.

[Swapping this back in my head]

> RT_VALUE_TYPE
> RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found);
> or for variable-length value support,
> RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
>
> If an entry already exists, return its pointer and set "found" to
> true. Otherwise, insert an empty value with sz bytes, return its
> pointer, and set "found" to false.

> ---
> bool
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
> or for variable-length value support,
> RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);
>
> If an entry already exists, update its value to 'value_p' and return
> true. Otherwise set the value and return false.

I'd have to double-check, but I think RT_SET is vestigial and I'm not
sure it has any advantage over RT_GET as I've sketched it out. I'm
pretty sure it's only there now because changing the radix tree
regression tests is much harder than changing TID store.

> Given variable-length value support, RT_GET() would have to do
> repalloc() if the existing value size is not big enough for the new
> value, but it cannot as the radix tree doesn't know the size of each
> stored value.

I think we have two choices:

- the value stores the "length". The caller would need to specify a
function to compute size from the "length" member. Note this assumes
there is an array. I think both aspects are not great.
- the value stores the "size". Callers that store an array (as
PageTableEntry's do) would compute length when they need to. This
sounds easier.
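A hedged sketch of this second choice — the value carrying its own "size" so the tree can grow it without a caller-supplied callback. The struct and function names here are invented for illustration; repalloc() in PostgreSQL is modeled with plain realloc(), and error handling is elided.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* convention under discussion: a varlen value begins with its own size */
typedef struct RTVarlenValue
{
    uint32_t size;               /* allocated payload size in bytes */
    char     data[];
} RTVarlenValue;

/*
 * Grow-or-reuse step the tree could apply without knowing the payload
 * type, because the required information is in the value's first field.
 */
static RTVarlenValue *
rt_value_ensure_size(RTVarlenValue *old, uint32_t needed)
{
    RTVarlenValue *val;

    if (old != NULL && old->size >= needed)
        return old;              /* existing allocation is big enough */

    /* repalloc() in PostgreSQL; plain realloc() here */
    val = realloc(old, offsetof(RTVarlenValue, data) + needed);
    val->size = needed;
    return val;
}
```

With the "length" choice instead, this function could not exist inside the tree: the tree would need a caller-provided function to turn length into bytes, which is the downside noted above.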

> Another idea is that the radix tree returns the pointer
> to the slot and the caller updates the value accordingly.

I did exactly this in v43 TidStore if I understood you correctly. If I
misunderstood you, can you clarify?

> But it means
> that the caller has to update the slot properly while considering the
> value size (embedded vs. single-leaf value), which seems not a good
> idea.

For this optimization, callers will have to know about pointer-sized
values and treat them differently, but they don't need to know the
details about where they are stored.

While we want to keep embedded values in the back of our minds, I
really think the details should be postponed to a follow-up commit.

> To deal with this problem, I think we can somewhat change RT_GET() API
> as follow:
>
> RT_VALUE_TYPE
> RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);
>
> If the entry already exists, replace the value with a new empty value
> with sz bytes and set "found" to true. Otherwise, insert an empty
> value, return its pointer, and set "found" to false.
>
> We probably will find a better name but I use RT_INSERT() for
> discussion. RT_INSERT() returns an empty slot regardless of existing
> values. It can be used to insert a new value or to replace the value
> with a larger value.

For the case we are discussing, bitmaps, updating an existing value is
a bit tricky. We need the existing value to properly update it with
set or unset bits. This can't work in general without a lot of work
for the caller.

However, for vacuum, we have all values that we need up front. That
gives me an idea: Something like this insert API could be optimized
for "insert-only": If we only free values when we free the whole tree
at the end, that's a clear use case for David Rowley's proposed "bump
context", which would save 8 bytes per allocation and be a bit faster.
[1] (RT_GET for varlen values would use an aset context, to allow
repalloc, and nodes would continue to use slab).

[1] 
https://www.postgresql.org/message-id/flat/CAApHDvqGSpCU95TmM=Bp=6xjl_nlys4zdzopfnywbk97xrd...@mail.gmail.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-11-26 Thread Masahiko Sawada
On Sat, Oct 28, 2023 at 5:56 PM John Naylor  wrote:
>
> I wrote:
>
> > Seems fine at a glance, thanks. I will build on this to implement 
> > variable-length values. I have already finished one prerequisite which is: 
> > public APIs passing pointers to values.
>
> Since my publishing schedule has not kept up, I'm just going to share
> something similar to what I mentioned earlier, just to get things
> moving again.

Thanks for sharing the updates. I've returned to work today and will
resume working on this feature.

>
> 0001-0009 are from earlier versions, except for 0007 which makes a
> bunch of superficial naming updates, similar to those done in a recent
> other version. Somewhere along the way I fixed long-standing git
> whitespace warnings, but I don't remember if that's new here. In any
> case, let's try to preserve that.
>
> 0010 is some minor refactoring to reduce duplication
>
> 0011-0014 add public functions that give the caller more control over
> the input and responsibility for locking. They are not named well, but
> I plan these to be temporary: They are currently used for the tidstore
> only, since that has much simpler tests than the standard radix tree
> tests. One thing to note: since the tidstore has always done its own
> locking within a larger structure, these patches don't bother to do
> locking at the radix tree level. Locking twice seems...not great.
> These patches are the main prerequisite for variable-length values.
> Once that is working well, we can switch the standard tests to the new
> APIs.

Since the variable-length values support is a big deal and would be
related to API design I'd like to discuss the API design first.
Currently, we have the following APIs:

---
RT_VALUE_TYPE
RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found);
or for variable-length value support,
RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);

If an entry already exists, return its pointer and set "found" to
true. Otherwise, insert an empty value with sz bytes, return its
pointer, and set "found" to false.

---
RT_VALUE_TYPE
RT_FIND(RT_RADIX_TREE *tree, uint64 key);

If an entry exists, return the pointer to the value, otherwise return NULL.

(I omitted RT_SEARCH() as it's essentially the same as RT_FIND() and
will probably get removed.)

---
bool
RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p);
or for variable-length value support,
RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz);

If an entry already exists, update its value to 'value_p' and return
true. Otherwise set the value and return false.

Given variable-length value support, RT_GET() would have to do
repalloc() if the existing value size is not big enough for the new
value, but it cannot as the radix tree doesn't know the size of each
stored value. Another idea is that the radix tree returns the pointer
to the slot and the caller updates the value accordingly. But it means
that the caller has to update the slot properly while considering the
value size (embedded vs. single-leaf value), which does not seem like
a good idea.

To deal with this problem, I think we can somewhat change RT_GET() API
as follow:

RT_VALUE_TYPE
RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found);

If the entry already exists, replace the value with a new empty value
with sz bytes and set "found" to true. Otherwise, insert an empty
value, return its pointer, and set "found" to false.

We probably will find a better name but I use RT_INSERT() for
discussion. RT_INSERT() returns an empty slot regardless of existing
values. It can be used to insert a new value or to replace the value
with a larger value.
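The semantic difference between RT_GET() and the proposed RT_INSERT() can be sketched with a toy single-slot "tree" (entirely hypothetical code, kept deliberately trivial — the real functions are templated and key-based):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* toy one-entry store, just to contrast the two proposed behaviors */
static void  *slot = NULL;
static size_t slot_size = 0;

/* RT_GET-like: return the existing value if present, else insert empty */
static void *
toy_get(size_t sz, bool *found)
{
    *found = (slot != NULL);
    if (!*found)
    {
        slot = calloc(1, sz);
        slot_size = sz;
    }
    return slot;
}

/* RT_INSERT-like: always hand back a fresh empty value of sz bytes */
static void *
toy_insert(size_t sz, bool *found)
{
    *found = (slot != NULL);
    free(slot);
    slot = calloc(1, sz);
    slot_size = sz;
    return slot;
}
```

The contrast: toy_get() preserves whatever was already stored (so a larger sz cannot grow it), while toy_insert() discards the old value and guarantees the requested size, which is what makes it usable for replacing a value with a larger one.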

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-10-28 Thread John Naylor
I wrote:

> Seems fine at a glance, thanks. I will build on this to implement 
> variable-length values. I have already finished one prerequisite which is: 
> public APIs passing pointers to values.

Since my publishing schedule has not kept up, I'm just going to share
something similar to what I mentioned earlier, just to get things
moving again.

0001-0009 are from earlier versions, except for 0007 which makes a
bunch of superficial naming updates, similar to those done in a recent
other version. Somewhere along the way I fixed long-standing git
whitespace warnings, but I don't remember if that's new here. In any
case, let's try to preserve that.

0010 is some minor refactoring to reduce duplication

0011-0014 add public functions that give the caller more control over
the input and responsibility for locking. They are not named well, but
I plan these to be temporary: They are currently used for the tidstore
only, since that has much simpler tests than the standard radix tree
tests. One thing to note: since the tidstore has always done its own
locking within a larger structure, these patches don't bother to do
locking at the radix tree level. Locking twice seems...not great.
These patches are the main prerequisite for variable-length values.
Once that is working well, we can switch the standard tests to the new
APIs.

Next steps include (some of these were briefly discussed off-list with
Sawada-san):

- template parameter for varlen values
- some callers to pass length in bytes
- block entries to have num_elems for # of bitmap words
- a way for updates to re-alloc values when needed
- aset allocation for values when appropriate


v43-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-09-16 Thread Masahiko Sawada
On Sat, Sep 16, 2023 at 9:03 AM Andres Freund  wrote:
>
> Hi,
>
> On 2023-08-28 23:43:22 +0900, Masahiko Sawada wrote:
> > I've attached v42 patch set. I improved the tidstore regression test code
> > in addition to incorporating the above comments.
>
> Why did you need to disable the benchmark module for CI?

I didn't want to unnecessarily make cfbot unhappy, since the benchmark
module is not going to be committed to core and is sometimes not
up-to-date.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-09-15 Thread Andres Freund
Hi,

On 2023-08-28 23:43:22 +0900, Masahiko Sawada wrote:
> I've attached v42 patch set. I improved the tidstore regression test code
> in addition to incorporating the above comments.

Why did you need to disable the benchmark module for CI?

Greetings,

Andres Freund




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-09-06 Thread Masahiko Sawada
On Wed, Sep 6, 2023 at 3:23 PM John Naylor  wrote:
>
>
> On Mon, Aug 28, 2023 at 9:44 PM Masahiko Sawada  wrote:
> >
> > I've attached v42 patch set. I improved the tidstore regression test code
> > in addition to incorporating the above comments.
>
> Seems fine at a glance, thanks. I will build on this to implement 
> variable-length values.

Thanks.

> I have already finished one prerequisite which is: public APIs passing 
> pointers to values.

Great!

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-09-06 Thread John Naylor
On Mon, Aug 28, 2023 at 9:44 PM Masahiko Sawada 
wrote:
>
> I've attached v42 patch set. I improved the tidstore regression test code
> in addition to incorporating the above comments.

Seems fine at a glance, thanks. I will build on this to implement
variable-length values. I have already finished one prerequisite which is:
public APIs passing pointers to values.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-08-28 Thread Masahiko Sawada
On Mon, Aug 28, 2023 at 4:20 PM John Naylor
 wrote:
>
> On Sun, Aug 27, 2023 at 7:53 PM Masahiko Sawada  wrote:
> >
> > I've updated the regression tests for tidstore so that they use SQL
> > functions to add blocks/offsets and dump the store's contents. The new
> > test provides the same coverage but is executed using SQL functions
> > instead of running all tests in one SQL function.
>
> This is much nicer and more flexible, thanks! A few questions/comments:
>
> tidstore_dump_tids() returns a string -- is it difficult to turn this into a 
> SRF, or is it just a bit more work?

It's not difficult. I've changed it in v42 patch.

>
> The lookup test seems fine for now. The output would look nicer with an 
> "order by tid".

Agreed.

>
> I think we could have the SQL function tidstore_create() take a boolean for 
> shared memory. That would allow ad-hoc testing without a recompile, if I'm 
> not mistaken.

Agreed.

>
> +SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
> +  FROM blocks, offsets
> +  GROUP BY blk;
> + tidstore_set_block_offsets
> +
> +
> +
> +
> +
> +
> +(5 rows)
>
> Calling a void function multiple times leads to vertical whitespace, which 
> looks a bit strange and may look better with some output, even if irrelevant:
>
> -SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
> +SELECT row_number() over(order by blk), tidstore_set_block_offsets(blk, 
> array_agg(offsets.off)::int2[])
>
>  row_number | tidstore_set_block_offsets
> +
>   1 |
>   2 |
>   3 |
>   4 |
>   5 |
> (5 rows)

Yes, it looks better.

I've attached v42 patch set. I improved the tidstore regression test code
in addition to incorporating the above comments.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v42-ART.tar.gz
Description: GNU Zip compressed data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-08-28 Thread John Naylor
On Sun, Aug 27, 2023 at 7:53 PM Masahiko Sawada 
wrote:
>
> I've updated the regression tests for tidstore so that they use SQL
> functions to add blocks/offsets and dump the store's contents. The new
> test provides the same coverage but is executed using SQL functions
> instead of running all tests in one SQL function.

This is much nicer and more flexible, thanks! A few questions/comments:

tidstore_dump_tids() returns a string -- is it difficult to turn this into
an SRF, or is it just a bit more work?

The lookup test seems fine for now. The output would look nicer with an
"order by tid".

I think we could have the SQL function tidstore_create() take a boolean for
shared memory. That would allow ad-hoc testing without a recompile, if I'm
not mistaken.

+SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
+  FROM blocks, offsets
+  GROUP BY blk;
+ tidstore_set_block_offsets
+
+
+
+
+
+
+(5 rows)

Calling a void function multiple times leads to vertical whitespace, which
looks a bit strange and may look better with some output, even if
irrelevant:

-SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[])
+SELECT row_number() over(order by blk), tidstore_set_block_offsets(blk,
array_agg(offsets.off)::int2[])

 row_number | tidstore_set_block_offsets
+
  1 |
  2 |
  3 |
  4 |
  5 |
(5 rows)

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-08-27 Thread Masahiko Sawada
On Wed, Aug 16, 2023 at 8:04 PM John Naylor
 wrote:
>
>
> On Tue, Aug 15, 2023 at 6:53 PM John Naylor  
> wrote:
> >
> > On Tue, Aug 15, 2023 at 9:34 AM Masahiko Sawada  
> > wrote:
> >
> > > BTW cfbot reported that some regression tests failed due to OOM. I've
> > > attached the patch to fix it.
> >
> > Seems worth doing now rather than later, so added this and squashed most of 
> > the rest together.
>
> This segfaults because of a mistake fixing a rebase conflict, so v40 attached.
>

Thank you for updating the patch set.

On Tue, Aug 15, 2023 at 11:33 AM Masahiko Sawada  wrote:
> On Mon, Aug 14, 2023 at 8:05 PM John Naylor
>  wrote:
> > Looking at the tidstore tests again after some months, I'm not particularly 
> > pleased with the amount of code required for how little it seems to be 
> > testing, nor the output when something fails. (I wonder how hard it would 
> > be to have SQL functions that add blocks/offsets to the tid store, and emit 
> > tuples of tids found in the store.)
>
> It would not be hard to have such SQL functions. I'll try it.

I've updated the regression tests for tidstore so that they use SQL
functions to add blocks/offsets and dump the store's contents. The new
test provides the same coverage but is executed using SQL functions
instead of running all tests in one SQL function.

The 0008 patch fixes a bug in tidstore that I found during this work:
we didn't recreate the radix tree in the same memory context in
TidStoreReset().

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v41-ART.tar.gz
Description: GNU Zip compressed data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-08-16 Thread John Naylor
On Tue, Aug 15, 2023 at 6:53 PM John Naylor 
wrote:
>
> On Tue, Aug 15, 2023 at 9:34 AM Masahiko Sawada 
wrote:
>
> > BTW cfbot reported that some regression tests failed due to OOM. I've
> > attached the patch to fix it.
>
> Seems worth doing now rather than later, so added this and squashed most
of the rest together.

This segfaults because of a mistake fixing a rebase conflict, so v40
attached.

--
John Naylor
EDB: http://www.enterprisedb.com


v40-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-08-15 Thread John Naylor
On Tue, Aug 15, 2023 at 9:34 AM Masahiko Sawada 
wrote:

> BTW cfbot reported that some regression tests failed due to OOM. I've
> attached the patch to fix it.

Seems worth doing now rather than later, so added this and squashed most of
the rest together. I wonder if that test uses too much memory in general.
Maybe using the full uint64 is too much.

> On Mon, Aug 14, 2023 at 8:05 PM John Naylor
>  wrote:

> > This is possibly faster than v38-0010, but looking like not worth the
complexity, assuming the other way avoids the bug going forward.
>
> I prefer 0010 but is it worth testing with other compilers such as clang?

Okay, keeping 0010 with a comment, and leaving out 0011 for now. Clang is
aggressive about unrolling loops, so may be worth looking globally at some
point.

> > v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch

> > Within noise level of v38-0011, but it's small and simple, so I like
it, at least for small arrays.
>
> Agreed.

Keeping 0012 and not 0013.

--
John Naylor
EDB: http://www.enterprisedb.com


v39-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-08-14 Thread Masahiko Sawada
Hi,

On Mon, Aug 14, 2023 at 8:05 PM John Naylor
 wrote:
>
> On Thu, Jul 13, 2023 at 3:09 PM Masahiko Sawada  wrote:
> >
> > 0007, 0008, 0010, and 0011 are straightforward and agree to merge them.

Thank you for updating the patch!

>
> [Part 1 - clear the deck of earlier performance work etc]
>
> Thanks for taking a look! I've merged 0007 and 0008. The others need a 
> performance test to justify them -- an eyeball check is not enough. I've now 
> made the time to do that.
>
>  sparse loads
>
> v38 0001-0006 (still using node3 for this test only):
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from 
> bench_load_random_int(100 * 1000 * (1+x-x))) a;
>  avg
> -
>  27.1000
>
> select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from 
> bench_load_random_int(500 * 1000 * (1+x-x))) a;
>  avg
> --
>  165.6333
>
> v38-0007-Optimize-RT_EXTEND_DOWN.patch
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from 
> bench_load_random_int(100 * 1000 * (1+x-x))) a;
>  avg
> -
>  25.0900
>
> select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from 
> bench_load_random_int(500 * 1000 * (1+x-x))) a;
>  avg
> --
>  157.3667
>
> That seems worth doing.
>
> v38-0008-Use-4-children-for-node-4-also-attempt-portable-.patch
>
> This combines two things because I messed up a rebase: Use fanout of 4, and 
> try some macros for shmem sizes, both 32- and 64-bit. Looking at this much, I 
> no longer have a goal to have a separate set of size-classes for non-SIMD 
> platforms, because that would cause global maintenance problems -- it's 
> probably better to reduce worst-case search time where necessary. That would 
> be much more localized.
>
> > I have some questions on 0009 patch:
>
> > According to the comment, this optimization is for only gcc?
>
> No, not at all. That tells me the comment is misleading.
>
> > I think this change reduces
> > readability and maintainability.
>
> Well, that much is obvious. What is not obvious is how much it gains us over 
> the alternatives. I do have a simpler idea, though...
>
>  load mostly node4
>
> select * from bench_search_random_nodes(250*1000, '0xFF');
> n4 = 42626, n16 = 21492, n32 = 0, n64 = 0, n256 = 257
>  mem_allocated | load_ms | search_ms
> ---+-+---
>7352384 |  25 | 0
>
> v38-0009-TEMP-take-out-search-time-from-bench.patch
>
> This is just to allow LATERAL queries for better measurements.
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from 
> bench_search_random_nodes(250*1000 * (1+x-x), '0xFF')) a;
>
>  avg
> -
>  24.8333

0007, 0008, and 0009 look good to me.

>
> v38-0010-Try-a-simpler-way-to-avoid-memmove.patch
>
> This slightly rewrites the standard loop so that gcc doesn't turn it into a 
> memmove(). Unlike the patch you didn't like, this *is* gcc-specific. (needs a 
> comment, which I forgot)
>
>  avg
> -
>  21.9600
>
> So, that's not a trivial difference. I wasn't a big fan of Andres' __asm("") 
> workaround, but that may be just my ignorance about it. We need something 
> like either of the two.
>
> v38-0011-Optimize-add_child_4-take-2.patch
>  avg
> -
>  21.3500
>
> This is possibly faster than v38-0010, but looking like not worth the 
> complexity, assuming the other way avoids the bug going forward.

I prefer 0010, but is it worth testing with other compilers such as clang?

>
> > According to the bugzilla ticket
> > referred to in the comment, it's realized as a bug in the community,
> > so once the gcc bug fixes, we might no longer need this trick, no?
>
> No comment in two years...
>
> v38-0013-Use-constant-for-initial-copy-of-chunks-and-chil.patch
>
> This is the same as v37-0011. I wasn't quite satisfied with it since it still 
> has two memcpy() calls, but it actually seems to regress:
>
>  avg
> -
>  22.0900
>
> v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch
>
> This patch uses a single loop for the copy.
>
>  avg
> -
>  21.0300
>
> Within noise level of v38-0011, but it's small and simple, so I like it, at 
> least for small arrays.

Agreed.

>
> v38-0014-node48-Remove-need-for-RIGHTMOST_ONE-in-radix-tr.patch
> v38-0015-node48-Remove-dead-code-by-using-loop-local-var.patch
>
> Just small cleanups.
>
> v38-0016-Use-memcpy-for-children-when-growing-into-node48.patch
>
> Makes sense, but untested.

Agreed.

BTW cfbot reported that some regression tests failed due to OOM. I've
attached the patch to fix it.

>
> ===
> [Part 2]
>
> Per off-list discussion with Masahiko, it makes sense to take some of the 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-08-14 Thread John Naylor
On Thu, Jul 13, 2023 at 3:09 PM Masahiko Sawada 
wrote:
>
> 0007, 0008, 0010, and 0011 are straightforward and agree to merge them.

[Part 1 - clear the deck of earlier performance work etc]

Thanks for taking a look! I've merged 0007 and 0008. The others need a
performance test to justify them -- an eyeball check is not enough. I've
now made the time to do that.

 sparse loads

v38 0001-0006 (still using node3 for this test only):

select avg(load_ms) from generate_series(1,100) x(x), lateral (select *
from bench_load_random_int(100 * 1000 * (1+x-x))) a;
 avg
-
 27.1000

select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from
bench_load_random_int(500 * 1000 * (1+x-x))) a;
 avg
--
 165.6333

v38-0007-Optimize-RT_EXTEND_DOWN.patch

select avg(load_ms) from generate_series(1,100) x(x), lateral (select *
from bench_load_random_int(100 * 1000 * (1+x-x))) a;
 avg
-
 25.0900

select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from
bench_load_random_int(500 * 1000 * (1+x-x))) a;
 avg
--
 157.3667

That seems worth doing.

v38-0008-Use-4-children-for-node-4-also-attempt-portable-.patch

This combines two things because I messed up a rebase: Use fanout of 4, and
try some macros for shmem sizes, both 32- and 64-bit. Looking at this much,
I no longer have a goal to have a separate set of size-classes for non-SIMD
platforms, because that would cause global maintenance problems -- it's
probably better to reduce worst-case search time where necessary. That
would be much more localized.

> I have some questions on 0009 patch:

> According to the comment, this optimization is for only gcc?

No, not at all. That tells me the comment is misleading.

> I think this change reduces
> readability and maintainability.

Well, that much is obvious. What is not obvious is how much it gains us
over the alternatives. I do have a simpler idea, though...

 load mostly node4

select * from bench_search_random_nodes(250*1000, '0xFF');
n4 = 42626, n16 = 21492, n32 = 0, n64 = 0, n256 = 257
 mem_allocated | load_ms | search_ms
---+-+---
   7352384 |  25 | 0

v38-0009-TEMP-take-out-search-time-from-bench.patch

This is just to allow LATERAL queries for better measurements.

select avg(load_ms) from generate_series(1,100) x(x), lateral (select *
from bench_search_random_nodes(250*1000 * (1+x-x), '0xFF')) a;

 avg
-
 24.8333

v38-0010-Try-a-simpler-way-to-avoid-memmove.patch

This slightly rewrites the standard loop so that gcc doesn't turn it into a
memmove(). Unlike the patch you didn't like, this *is* gcc-specific. (needs
a comment, which I forgot)

 avg
-
 21.9600

So, that's not a trivial difference. I wasn't a big fan of Andres'
__asm("") workaround, but that may be just my ignorance about it. We need
something like either of the two.

v38-0011-Optimize-add_child_4-take-2.patch
 avg
-
 21.3500

This is possibly faster than v38-0010, but looking like not worth the
complexity, assuming the other way avoids the bug going forward.

> According to the bugzilla ticket
> referred to in the comment, it's realized as a bug in the community,
> so once the gcc bug fixes, we might no longer need this trick, no?

No comment in two years...

v38-0013-Use-constant-for-initial-copy-of-chunks-and-chil.patch

This is the same as v37-0011. I wasn't quite satisfied with it since it
still has two memcpy() calls, but it actually seems to regress:

 avg
-
 22.0900

v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch

This patch uses a single loop for the copy.

 avg
-
 21.0300

Within noise level of v38-0011, but it's small and simple, so I like it, at
least for small arrays.

v38-0014-node48-Remove-need-for-RIGHTMOST_ONE-in-radix-tr.patch
v38-0015-node48-Remove-dead-code-by-using-loop-local-var.patch

Just small cleanups.

v38-0016-Use-memcpy-for-children-when-growing-into-node48.patch

Makes sense, but untested.

===
[Part 2]

Per off-list discussion with Masahiko, it makes sense to take some of the
ideas I've used locally on tidbitmap, and start incorporating them into
earlier vacuum work to get that out the door faster. With that in mind...

v38-0017-Make-tidstore-more-similar-to-tidbitmap.patch

This uses a simplified PagetableEntry (unimaginatively called
BlocktableEntry just to avoid confusion), to be replaced with the real
thing at a later date. This is still fixed size, to be replaced with a
varlen type.

Looking at the tidstore tests again after some months, I'm not particularly
pleased with the amount of code required for how little it seems to be
testing, nor the output 

Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-08-01 Thread Masahiko Sawada
Hi,

On Thu, Jul 13, 2023 at 5:08 PM Masahiko Sawada  wrote:
>
> On Sat, Jul 8, 2023 at 11:54 AM John Naylor
>  wrote:
> >
> >
> > On Fri, Jul 7, 2023 at 2:19 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Wed, Jul 5, 2023 at 8:21 PM John Naylor  
> > > wrote:
> > > > Well, it's going to be a bit of a mess until I can demonstrate it 
> > > > working (and working well) with bitmap heap scan. Fixing that now is 
> > > > just going to create conflicts. I do have a couple small older patches 
> > > > laying around that were quick experiments -- I think at least some of 
> > > > them should give a performance boost in loading speed, but haven't had 
> > > > time to test. Would you like to take a look?
> > >
> > > Yes, I can experiment with these patches in the meantime.
> >
> > Okay, here it is in v36. 0001-6 are same as v35.
> >
> > 0007 removes a wasted extra computation newly introduced by refactoring 
> > growing nodes. 0008 just makes 0011 nicer. Not worth testing by themselves, 
> > but better to be tidy.
> > 0009 is an experiment to get rid of slow memmoves in node4, addressing a 
> > long-standing inefficiency. It looks a bit tricky, but I think it's 
> > actually straightforward after drawing out the cases with pen and paper. It 
> > works if the fanout is either 4 or 5, so we have some wiggle room. This may 
> > give a noticeable boost if the input is reversed or random.
> > 0010 allows RT_EXTEND_DOWN to reduce function calls, so should help with 
> > sparse trees.
> > 0011 reduces function calls when growing the smaller nodes. Not sure about 
> > this one -- possibly worth it for node4 only?
> >
> > If these help, it'll show up more easily in smaller inputs. Large inputs 
> > tend to be more dominated by RAM latency.

cfbot reported some failures[1], and the v36 patch set no longer applies
cleanly to the current HEAD. I've attached updated patches to make cfbot
happy.

Regards,

[1] http://cfbot.cputube.org/highlights/all.html#3687

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


v37-ART.tar.gz
Description: GNU Zip compressed data


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-07-13 Thread Masahiko Sawada
On Sat, Jul 8, 2023 at 11:54 AM John Naylor
 wrote:
>
>
> On Fri, Jul 7, 2023 at 2:19 PM Masahiko Sawada  wrote:
> >
> > On Wed, Jul 5, 2023 at 8:21 PM John Naylor  
> > wrote:
> > > Well, it's going to be a bit of a mess until I can demonstrate it working 
> > > (and working well) with bitmap heap scan. Fixing that now is just going 
> > > to create conflicts. I do have a couple small older patches laying around 
> > > that were quick experiments -- I think at least some of them should give 
> > > a performance boost in loading speed, but haven't had time to test. Would 
> > > you like to take a look?
> >
> > Yes, I can experiment with these patches in the meantime.
>
> Okay, here it is in v36. 0001-6 are same as v35.
>
> 0007 removes a wasted extra computation newly introduced by refactoring 
> growing nodes. 0008 just makes 0011 nicer. Not worth testing by themselves, 
> but better to be tidy.
> 0009 is an experiment to get rid of slow memmoves in node4, addressing a 
> long-standing inefficiency. It looks a bit tricky, but I think it's actually 
> straightforward after drawing out the cases with pen and paper. It works if 
> the fanout is either 4 or 5, so we have some wiggle room. This may give a 
> noticeable boost if the input is reversed or random.
> 0010 allows RT_EXTEND_DOWN to reduce function calls, so should help with 
> sparse trees.
> 0011 reduces function calls when growing the smaller nodes. Not sure about 
> this one -- possibly worth it for node4 only?
>
> If these help, it'll show up more easily in smaller inputs. Large inputs tend 
> to be more dominated by RAM latency.

Thanks for sharing the patches!

0007, 0008, 0010, and 0011 are straightforward, and I agree to merge them.

I have some questions on 0009 patch:

+   /* shift chunks and children
+
+  Unfortunately, gcc has gotten too aggressive in turning simple loops
+  into slow memmove's, so we have to be a bit more clever.
+  See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101481
+
+  We take advantage of the fact that a good compiler can turn a
+  memmove of a small constant power-of-two number of bytes into a
+  single load/store.
+   */

According to the comment, is this optimization only for gcc? And does
this change have any negative impact when building with other compilers
such as clang?

I'm not sure that hand-optimizing the code this much to generate better
instructions on gcc is a good approach. I think this change reduces
readability and maintainability. According to the bugzilla ticket
referred to in the comment, it's recognized as a bug in the community,
so once the gcc bug is fixed, we might no longer need this trick, no?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-07-07 Thread John Naylor
On Fri, Jul 7, 2023 at 2:19 PM Masahiko Sawada 
wrote:
>
> On Wed, Jul 5, 2023 at 8:21 PM John Naylor 
wrote:
> > Well, it's going to be a bit of a mess until I can demonstrate it
working (and working well) with bitmap heap scan. Fixing that now is just
going to create conflicts. I do have a couple small older patches laying
around that were quick experiments -- I think at least some of them should
give a performance boost in loading speed, but haven't had time to test.
Would you like to take a look?
>
> Yes, I can experiment with these patches in the meantime.

Okay, here it is in v36. 0001-6 are same as v35.

0007 removes a wasted extra computation newly introduced by refactoring
growing nodes. 0008 just makes 0011 nicer. Not worth testing by themselves,
but better to be tidy.
0009 is an experiment to get rid of slow memmoves in node4, addressing a
long-standing inefficiency. It looks a bit tricky, but I think it's
actually straightforward after drawing out the cases with pen and paper. It
works if the fanout is either 4 or 5, so we have some wiggle room. This may
give a noticeable boost if the input is reversed or random.
0010 allows RT_EXTEND_DOWN to reduce function calls, so should help with
sparse trees.
0011 reduces function calls when growing the smaller nodes. Not sure about
this one -- possibly worth it for node4 only?

If these help, it'll show up more easily in smaller inputs. Large inputs
tend to be more dominated by RAM latency.

--
John Naylor
EDB: http://www.enterprisedb.com


v36-ART.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2023-07-07 Thread Masahiko Sawada
On Wed, Jul 5, 2023 at 8:21 PM John Naylor  wrote:
>
> On Tue, Jul 4, 2023 at 12:49 PM Masahiko Sawada  wrote:
> >
> > As you mentioned, the 1-byte value is embedded into 8 byte so 7 bytes
> > are unused, but we use less memory since we use less slab contexts and
> > save fragmentations.
>
> Thanks for testing. This tree is sparse enough that most of the space is 
> taken up by small inner nodes, and not by leaves. So, it's encouraging to see 
> a small space savings even here.
>
> > I've also tested some large value cases (e.g. the value is 80-bytes)
> > and got a similar result.
>
> Interesting. With a separate allocation per value the overhead would be 8 
> bytes, or 10% here. It's plausible that savings elsewhere can hide that, 
> globally.
>
> > Regarding the codes, there are many todo and fixme comments so it
> > seems to me that your recent work is still in-progress. What is the
> > current status? Can I start reviewing the code or should I wait for a
> > while until your recent work completes?
>
> Well, it's going to be a bit of a mess until I can demonstrate it working 
> (and working well) with bitmap heap scan. Fixing that now is just going to 
> create conflicts. I do have a couple small older patches laying around that 
> were quick experiments -- I think at least some of them should give a 
> performance boost in loading speed, but haven't had time to test. Would you 
> like to take a look?

Yes, I can experiment with these patches in the meantime.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



