Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada wrote:
>
> On Tue, Mar 5, 2024 at 6:41 PM John Naylor wrote:
> > Done in v66-0009. I'd be curious to hear any feedback. I like the
> > aspect that the random numbers come from a different seed every time
> > the test runs.
>
> The new tests look good. Here are some comments:
>
> ---
> + expected = keys[i];
> + iterval = rt_iterate_next(iter, &iterkey);
>
> - ndeleted++;
> + EXPECT_TRUE(iterval != NULL);
> + EXPECT_EQ_U64(iterkey, expected);
> + EXPECT_EQ_U64(*iterval, expected);
>
> Can we verify that the iteration returns keys in ascending order?

We get the "expected" value from the keys we saved in the now-sorted
array, so we do already. Unless I misunderstand you.

> ---
> + /* reset random number generator for deletion */
> + pg_prng_seed(&state, seed);
>
> Why is resetting the seed required here?

Good catch - my intention was to delete in the same random order we
inserted with. We still have the keys in the array, but they're sorted
by now. I forgot to go the extra step and use the prng when generating
the keys for deletion -- will fix.

> ---
> The radix tree (and dsa in TEST_SHARED_RT case) should be freed at the end.

Will fix.

> ---
> radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
>                                       "test_radix_tree",
>                                       ALLOCSET_DEFAULT_SIZES);
>
> We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I
> think it's better to use either one for consistency.

Will change to "small", since 32-bit platforms will use slab for
leaves. I'll look at the memory usage and estimate what 32-bit
platforms will use, and maybe adjust the number of keys. A few
megabytes is fine, but not many megabytes.

> > I'd like to push 0001 and 0002 shortly, and then do another sweep over
> > 0003, with remaining feedback, and get that in so we get some
> > buildfarm testing before the remaining polishing work on
> > tidstore/vacuum.
>
> Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to
> polish tidstore and vacuum patches and update commit messages.

Sounds good.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 5, 2024 at 6:41 PM John Naylor wrote:
>
> On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada wrote:
> >
> > On Fri, Feb 2, 2024 at 8:47 PM John Naylor wrote:
> > > It's pretty hard to see what test_pattern() is doing, or why it's
> > > useful. I wonder if instead the test could use something like the
> > > benchmark where random integers are masked off. That seems simpler. I
> > > can work on that, but I'd like to hear your side about test_pattern().
> >
> > Yeah, test_pattern() is originally created for the integerset so it
> > doesn't necessarily fit the radixtree. I agree to use some tests from
> > benchmarks.
>
> Done in v66-0009. I'd be curious to hear any feedback. I like the
> aspect that the random numbers come from a different seed every time
> the test runs.

The new tests look good. Here are some comments:

---
+ expected = keys[i];
+ iterval = rt_iterate_next(iter, &iterkey);

- ndeleted++;
+ EXPECT_TRUE(iterval != NULL);
+ EXPECT_EQ_U64(iterkey, expected);
+ EXPECT_EQ_U64(*iterval, expected);

Can we verify that the iteration returns keys in ascending order?

---
+ /* reset random number generator for deletion */
+ pg_prng_seed(&state, seed);

Why is resetting the seed required here?

---
The radix tree (and dsa in TEST_SHARED_RT case) should be freed at the end.

---
radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext,
                                      "test_radix_tree",
                                      ALLOCSET_DEFAULT_SIZES);

We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I
think it's better to use either one for consistency.

> I'd like to push 0001 and 0002 shortly, and then do another sweep over
> 0003, with remaining feedback, and get that in so we get some
> buildfarm testing before the remaining polishing work on
> tidstore/vacuum.

Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to
polish tidstore and vacuum patches and update commit messages.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 5, 2024 at 8:27 AM Masahiko Sawada wrote: > > On Mon, Mar 4, 2024 at 8:48 PM John Naylor wrote: > > > > On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada > > wrote: > > > Resetting the tree->context seems to work. But I think we should note > > > for callers that the dsa_area passed to RT_CREATE should be created in > > > a different context than the context passed to RT_CREATE because > > > otherwise RT_FREE() will also free the dsa_area. For example, the > > > following code in test_radixtree.c will no longer work: I've added a comment in v66-0004, which contains a number of other small corrections and edits. On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada wrote: > > Thirdly, cosmetic: With the introduction of single-value leaves, it > > seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think? > > Agreed. Done in v66-0005. v66-0006 removes outdated tests for invalid root that somehow got left over. > > I wrote: > > > > Secondly, I thought about my recent work to skip checking if we first > > > > need to create a root node, and that has a harmless (for vacuum at > > > > least) but slightly untidy behavior: When RT_SET is first called, and > > > > the key is bigger than 255, new nodes will go on top of the root node. > > > > These have chunk '0'. If all subsequent keys are big enough, the > > > > orginal root node will stay empty. If all keys are deleted, there will > > > > be a chain of empty nodes remaining. Again, I believe this is > > > > harmless, but to make tidy, it should easy to teach RT_EXTEND_UP to > > > > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work > > > > on this, but likely not today. > > I put a little more work into this, and got it working, just needs a > > small amount of finicky coding. I'll share tomorrow. Done in v66-0007. I'm a bit disappointed in the extra messiness this adds, although it's not a lot. 
> > + check_stack_depth(); > > + CHECK_FOR_INTERRUPTS(); > > > > I'm not sure why these are here: The first seems overly paranoid, > > although harmless, but the second is probably a bad idea. Why should > > the user be able to to interrupt the freeing of memory? > > Good catch. We should not check the interruption there. Removed in v66-0008. > > Also, I'm not quite happy that RT_ITER has a copy of a pointer to the > > tree, leading to coding like "iter->tree->ctl->root". I *think* it > > would be easier to read if the tree was a parameter to these iteration > > functions. That would require an API change, so the tests/tidstore > > would have some churn. I can do that, but before trying I wanted to > > see what you think -- is there some reason to keep the current way? > > I considered both usages, there are two reasons for the current style. > I'm concerned that if we pass both the tree and RT_ITER to iteration > functions, the caller could mistakenly pass a different tree than the > one that was specified to create the RT_ITER. And the second reason is > just to make it consistent with other data structures such as > dynahash.c and dshash.c, but I now realized that in simplehash.h we > pass both the hash table and the iterator. Okay, then I don't think it's worth messing with at this point. On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada wrote: > > On Fri, Feb 2, 2024 at 8:47 PM John Naylor wrote: > > It's pretty hard to see what test_pattern() is doing, or why it's > > useful. I wonder if instead the test could use something like the > > benchmark where random integers are masked off. That seems simpler. I > > can work on that, but I'd like to hear your side about test_pattern(). > > Yeah, test_pattern() is originally created for the integerset so it > doesn't necessarily fit the radixtree. I agree to use some tests from > benchmarks. Done in v66-0009. I'd be curious to hear any feedback. 
I like the aspect that the random numbers come from a different seed every time the test runs. v66-0010/0011 run pgindent, the latter with one typedef added for the test module. 0012 - 0017 are copied from v65, and I haven't done any work on tidstore or vacuum, except for squashing most v65 follow-up patches. I'd like to push 0001 and 0002 shortly, and then do another sweep over 0003, with remaining feedback, and get that in so we get some buildfarm testing before the remaining polishing work on tidstore/vacuum. v66-ART.tar.gz Description: application/gzip
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 4, 2024 at 8:48 PM John Naylor wrote: > > On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada wrote: > > > > On Sun, Mar 3, 2024 at 2:43 PM John Naylor wrote: > > > > Right, I should have said "reset". Resetting a context will delete > > > it's children as well, and seems like it should work to reset the tree > > > context, and we don't have to know whether that context actually > > > contains leaves at all. That should allow copying "tree context" to > > > "leaf context" in the case where we have no special context for > > > leaves. > > > > Resetting the tree->context seems to work. But I think we should note > > for callers that the dsa_area passed to RT_CREATE should be created in > > a different context than the context passed to RT_CREATE because > > otherwise RT_FREE() will also free the dsa_area. For example, the > > following code in test_radixtree.c will no longer work: > > > > dsa = dsa_create(tranche_id); > > radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id); > > : > > rt_free(radixtree); > > dsa_detach(dsa); // dsa is already freed. > > > > So I think that a practical usage of the radix tree will be that the > > caller creates a memory context for a radix tree and passes it to > > RT_CREATE(). > > That sounds workable to me. > > > I've attached an update patch set: > > > > - 0008 updates RT_FREE_RECURSE(). > > Thanks! > > > - 0009 patch is an updated version of cleanup radix tree memory handling. > > Looks pretty good, as does the rest. I'm going through again, > squashing and making tiny adjustments to the template. The only thing > not done is changing the test with many values to resemble the perf > test more. > > I wrote: > > > Secondly, I thought about my recent work to skip checking if we first > > > need to create a root node, and that has a harmless (for vacuum at > > > least) but slightly untidy behavior: When RT_SET is first called, and > > > the key is bigger than 255, new nodes will go on top of the root node. 
> > > These have chunk '0'. If all subsequent keys are big enough, the > > > orginal root node will stay empty. If all keys are deleted, there will > > > be a chain of empty nodes remaining. Again, I believe this is > > > harmless, but to make tidy, it should easy to teach RT_EXTEND_UP to > > > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work > > > on this, but likely not today. > > > > This turns out to be a lot trickier than it looked, so it seems best > > to allow a trivial amount of waste, as long as it's documented > > somewhere. It also wouldn't be terrible to re-add those branches, > > since they're highly predictable. > > I put a little more work into this, and got it working, just needs a > small amount of finicky coding. I'll share tomorrow. > > I have a question about RT_FREE_RECURSE: > > + check_stack_depth(); > + CHECK_FOR_INTERRUPTS(); > > I'm not sure why these are here: The first seems overly paranoid, > although harmless, but the second is probably a bad idea. Why should > the user be able to to interrupt the freeing of memory? Good catch. We should not check the interruption there. > Also, I'm not quite happy that RT_ITER has a copy of a pointer to the > tree, leading to coding like "iter->tree->ctl->root". I *think* it > would be easier to read if the tree was a parameter to these iteration > functions. That would require an API change, so the tests/tidstore > would have some churn. I can do that, but before trying I wanted to > see what you think -- is there some reason to keep the current way? I considered both usages, there are two reasons for the current style. I'm concerned that if we pass both the tree and RT_ITER to iteration functions, the caller could mistakenly pass a different tree than the one that was specified to create the RT_ITER. 
And the second reason is just to make it consistent with other data structures such as dynahash.c and dshash.c, but I now realized that in simplehash.h we pass both the hash table and the iterator. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 4, 2024 at 1:05 PM Masahiko Sawada wrote:
>
> On Sun, Mar 3, 2024 at 2:43 PM John Naylor wrote:
> > Right, I should have said "reset". Resetting a context will delete
> > its children as well, and seems like it should work to reset the tree
> > context, and we don't have to know whether that context actually
> > contains leaves at all. That should allow copying "tree context" to
> > "leaf context" in the case where we have no special context for
> > leaves.
>
> Resetting the tree->context seems to work. But I think we should note
> for callers that the dsa_area passed to RT_CREATE should be created in
> a different context than the context passed to RT_CREATE because
> otherwise RT_FREE() will also free the dsa_area. For example, the
> following code in test_radixtree.c will no longer work:
>
> dsa = dsa_create(tranche_id);
> radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
> :
> rt_free(radixtree);
> dsa_detach(dsa); // dsa is already freed.
>
> So I think that a practical usage of the radix tree will be that the
> caller creates a memory context for a radix tree and passes it to
> RT_CREATE().

That sounds workable to me.

> I've attached an updated patch set:
>
> - 0008 updates RT_FREE_RECURSE().

Thanks!

> - 0009 patch is an updated version of cleanup radix tree memory handling.

Looks pretty good, as does the rest. I'm going through again,
squashing and making tiny adjustments to the template. The only thing
not done is changing the test with many values to resemble the perf
test more.

I wrote:
> > Secondly, I thought about my recent work to skip checking if we first
> > need to create a root node, and that has a harmless (for vacuum at
> > least) but slightly untidy behavior: When RT_SET is first called, and
> > the key is bigger than 255, new nodes will go on top of the root node.
> > These have chunk '0'. If all subsequent keys are big enough, the
> > original root node will stay empty. If all keys are deleted, there will
> > be a chain of empty nodes remaining. Again, I believe this is
> > harmless, but to make tidy, it should be easy to teach RT_EXTEND_UP to
> > call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> > on this, but likely not today.
>
> This turns out to be a lot trickier than it looked, so it seems best
> to allow a trivial amount of waste, as long as it's documented
> somewhere. It also wouldn't be terrible to re-add those branches,
> since they're highly predictable.

I put a little more work into this, and got it working; it just needs a
small amount of finicky coding. I'll share tomorrow.

I have a question about RT_FREE_RECURSE:

+ check_stack_depth();
+ CHECK_FOR_INTERRUPTS();

I'm not sure why these are here: The first seems overly paranoid,
although harmless, but the second is probably a bad idea. Why should
the user be able to interrupt the freeing of memory?

Also, I'm not quite happy that RT_ITER has a copy of a pointer to the
tree, leading to coding like "iter->tree->ctl->root". I *think* it
would be easier to read if the tree was a parameter to these iteration
functions. That would require an API change, so the tests/tidstore
would have some churn. I can do that, but before trying I wanted to
see what you think -- is there some reason to keep the current way?
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sun, Mar 3, 2024 at 2:43 PM John Naylor wrote: > > On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada wrote: > > > > On Thu, Feb 29, 2024 at 8:43 PM John Naylor wrote: > > > > + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext, > > > +"tidstore storage", > > > > > > "tidstore storage" sounds a bit strange -- maybe look at some other > > > context names for ideas. > > > > Agreed. How about "tidstore's radix tree"? > > That might be okay. I'm now thinking "TID storage". On that note, one > improvement needed when we polish tidstore.c is to make sure it's > spelled "TID" in comments, like other files do already. > > > > - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, > > > allocsize); > > > + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL > > > +? tree->leaf_ctx > > > +: tree->context, allocsize); > > > > > > Instead of branching here, can we copy "context" to "leaf_ctx" when > > > necessary (those names should look more like eachother, btw)? I think > > > that means anything not covered by this case: > > > > > > +#ifndef RT_VARLEN_VALUE_SIZE > > > + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC)) > > > + tree->leaf_ctx = SlabContextCreate(ctx, > > > +RT_STR(RT_PREFIX) "radix_tree leaf contex", > > > +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > > > +sizeof(RT_VALUE_TYPE)); > > > +#endif /* !RT_VARLEN_VALUE_SIZE */ > > > > > > ...also, we should document why we're using slab here. On that, I > > > don't recall why we are? We've never had a fixed-length type test case > > > on 64-bit, so it wasn't because it won through benchmarking. It seems > > > a hold-over from the days of "multi-value leaves". Is it to avoid the > > > possibility of space wastage with non-power-of-two size types? > > > > Yes, it matches my understanding. > > There are two issues quoted here, so not sure if you mean both or only > the last one... I meant only the last one. 
> For the latter, I'm not sure it makes sense to have code and #ifdef's
> to force slab for large-enough fixed-length values just because we
> can. There may never be such a use-case anyway. I'm also not against
> it, either, but it seems like a premature optimization.

Reading the old threads, using a slab context for leaves originally
came from Andres's prototype patch, and was there to avoid aset.c
rounding the bytes up to a power of 2. It makes sense to me to use a
slab context for this case.

To measure the effect of using a slab, I've updated bench_radix_tree
so it uses a large fixed-length value. The struct I used is:

typedef struct mytype
{
    uint64  a;
    uint64  b;
    uint64  c;
    uint64  d;
    char    e[100];
} mytype;

The struct size is 136 bytes with padding, just above a power-of-2.
The simple benchmark test showed using a slab context for leaves is
more space efficient. The results are:

slab:
=# select * from bench_load_random_int(100);
 mem_allocated | load_ms
---------------+---------
     405643264 |     560
(1 row)

aset:
=# select * from bench_load_random_int(100);
 mem_allocated | load_ms
---------------+---------
         52792 |     576
(1 row)

> > > For this stanza that remains unchanged:
> > >
> > > for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
> > > {
> > >     MemoryContextDelete(tree->node_slabs[i]);
> > > }
> > >
> > > if (tree->leaf_ctx)
> > > {
> > >     MemoryContextDelete(tree->leaf_ctx);
> > > }
> > >
> > > ...is there a reason we can't just delete tree->ctx, and let that
> > > recursively delete child contexts?
> >
> > I thought that considering the RT_CREATE doesn't create its own memory
> > context but just uses the passed context, it might be a bit unusable
> > to delete the passed context in the radix tree code. For example, if a
> > caller creates a radix tree (or tidstore) on a memory context and
> > wants to recreate it again and again, he also needs to re-create the
> > memory context together. It might be okay if we leave comments on
> > RT_CREATE as a side effect, though. This is the same reason why we
> > don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(),
>
> Right, I should have said "reset". Resetting a context will delete
> its children as well, and seems like it should work to reset the tree
> context, and we don't have to know whether that context actually
> contains leaves at all. That should allow copying "tree context" to
> "leaf context" in the case where we have no special context for
> leaves.

Resetting the tree->context seems to work. But I think we should note
for callers that the dsa_area passed to RT_CREATE should be created in
a different context than the context passed to RT_CREATE because
otherwise RT_FREE() will also free the dsa_area. For example, the
following code in test_radixtree.c will no longer work:

dsa = dsa_create(tranche_id);
radixtree = rt_create(CurrentMemoryContext, dsa, tranche_id);
:
rt_free(radixtree);
dsa_detach(dsa); // dsa is already freed.

So I think that a practical usage of the radix tree will be that the
caller creates a memory context for a radix tree and passes it to
RT_CREATE().
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 1, 2024 at 3:01 PM Masahiko Sawada wrote: > > On Thu, Feb 29, 2024 at 8:43 PM John Naylor wrote: > > + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext, > > +"tidstore storage", > > > > "tidstore storage" sounds a bit strange -- maybe look at some other > > context names for ideas. > > Agreed. How about "tidstore's radix tree"? That might be okay. I'm now thinking "TID storage". On that note, one improvement needed when we polish tidstore.c is to make sure it's spelled "TID" in comments, like other files do already. > > - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize); > > + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL > > +? tree->leaf_ctx > > +: tree->context, allocsize); > > > > Instead of branching here, can we copy "context" to "leaf_ctx" when > > necessary (those names should look more like eachother, btw)? I think > > that means anything not covered by this case: > > > > +#ifndef RT_VARLEN_VALUE_SIZE > > + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC)) > > + tree->leaf_ctx = SlabContextCreate(ctx, > > +RT_STR(RT_PREFIX) "radix_tree leaf contex", > > +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > > +sizeof(RT_VALUE_TYPE)); > > +#endif /* !RT_VARLEN_VALUE_SIZE */ > > > > ...also, we should document why we're using slab here. On that, I > > don't recall why we are? We've never had a fixed-length type test case > > on 64-bit, so it wasn't because it won through benchmarking. It seems > > a hold-over from the days of "multi-value leaves". Is it to avoid the > > possibility of space wastage with non-power-of-two size types? > > Yes, it matches my understanding. There are two issues quoted here, so not sure if you mean both or only the last one... For the latter, I'm not sure it makes sense to have code and #ifdef's to force slab for large-enough fixed-length values just because we can. There may never be such a use-case anyway. 
I'm also not against it, either, but it seems like a premature optimization.

> > For this stanza that remains unchanged:
> >
> > for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
> > {
> >     MemoryContextDelete(tree->node_slabs[i]);
> > }
> >
> > if (tree->leaf_ctx)
> > {
> >     MemoryContextDelete(tree->leaf_ctx);
> > }
> >
> > ...is there a reason we can't just delete tree->ctx, and let that
> > recursively delete child contexts?
>
> I thought that considering the RT_CREATE doesn't create its own memory
> context but just uses the passed context, it might be a bit unusable
> to delete the passed context in the radix tree code. For example, if a
> caller creates a radix tree (or tidstore) on a memory context and
> wants to recreate it again and again, he also needs to re-create the
> memory context together. It might be okay if we leave comments on
> RT_CREATE as a side effect, though. This is the same reason why we
> don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(),

Right, I should have said "reset". Resetting a context will delete
its children as well, and seems like it should work to reset the tree
context, and we don't have to know whether that context actually
contains leaves at all. That should allow copying "tree context" to
"leaf context" in the case where we have no special context for
leaves.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Feb 29, 2024 at 8:43 PM John Naylor wrote: > > On Tue, Feb 20, 2024 at 1:59 PM Masahiko Sawada wrote: > > > - v63-0008 patch fixes a bug in tidstore. > > - page->nwords = wordnum + 1; > - Assert(page->nwords = WORDS_PER_PAGE(offsets[num_offsets - 1])); > + page->nwords = wordnum; > + Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1])); > > Yikes, I'm guessing this failed in a non-assert builds? I wonder why > my compiler didn't yell at me... Have you tried a tidstore-debug build > without asserts? Yes. I didn't get any failures. > > > - v63-0009 patch is a draft idea of cleanup memory context handling. > > Thanks, looks pretty good! > > + ts->rt_context = AllocSetContextCreate(CurrentMemoryContext, > +"tidstore storage", > > "tidstore storage" sounds a bit strange -- maybe look at some other > context names for ideas. Agreed. How about "tidstore's radix tree"? > > - leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize); > + leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL > +? tree->leaf_ctx > +: tree->context, allocsize); > > Instead of branching here, can we copy "context" to "leaf_ctx" when > necessary (those names should look more like eachother, btw)? I think > that means anything not covered by this case: > > +#ifndef RT_VARLEN_VALUE_SIZE > + if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC)) > + tree->leaf_ctx = SlabContextCreate(ctx, > +RT_STR(RT_PREFIX) "radix_tree leaf contex", > +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > +sizeof(RT_VALUE_TYPE)); > +#endif /* !RT_VARLEN_VALUE_SIZE */ > > ...also, we should document why we're using slab here. On that, I > don't recall why we are? We've never had a fixed-length type test case > on 64-bit, so it wasn't because it won through benchmarking. It seems > a hold-over from the days of "multi-value leaves". Is it to avoid the > possibility of space wastage with non-power-of-two size types? Yes, it matches my understanding. 
> > For this stanza that remains unchanged: > > for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++) > { > MemoryContextDelete(tree->node_slabs[i]); > } > > if (tree->leaf_ctx) > { > MemoryContextDelete(tree->leaf_ctx); > } > > ...is there a reason we can't just delete tree->ctx, and let that > recursively delete child contexts? I thought that considering the RT_CREATE doesn't create its own memory context but just uses the passed context, it might be a bit unusable to delete the passed context in the radix tree code. For example, if a caller creates a radix tree (or tidstore) on a memory context and wants to recreate it again and again, he also needs to re-create the memory context together. It might be okay if we leave comments on RT_CREATE as a side effect, though. This is the same reason why we don't destroy tree->dsa in RT_FREE(). And, as for RT_FREE_RECURSE(), On Fri, Mar 1, 2024 at 1:15 PM John Naylor wrote: > > I'm looking at RT_FREE_RECURSE again (only used for DSA memory), and > I'm not convinced it's freeing all the memory. It's been many months > since we discussed this last, but IIRC we cannot just tell DSA to free > all its segments, right? Right. > Is there currently anything preventing us > from destroying the whole DSA area at once? When it comes to tidstore and parallel vacuum, we initialize DSA and create a tidstore there at the beginning of the lazy vacuum, and recreate the tidstore again after the heap vacuum. So I don't want to destroy the whole DSA when destroying the tidstore. Otherwise, we will need to create a new DSA and pass its handle somehow. Probably the bitmap scan case is similar. Given that bitmap scan (re)creates tidbitmap in the same DSA multiple times, it's better to avoid freeing the whole DSA. > > + /* The last level node has pointers to values */ > + if (shift == 0) > + { > + dsa_free(tree->dsa, ptr); > + return; > + } > > IIUC, this doesn't actually free leaves, it only frees the last-level > node. 
And, this function is unaware of whether children could be > embedded values. I'm thinking we need to get rid of the above > pre-check and instead, each node kind to have something like (e.g. > node4): > > RT_PTR_ALLOC child = n4->children[i]; > > if (shift > 0) > RT_FREE_RECURSE(tree, child, shift - RT_SPAN); > else if (!RT_CHILDPTR_IS_VALUE(child)) > dsa_free(tree->dsa, child); > > ...or am I missing something? You're not missing anything. RT_FREE_RECURSE() has not been updated for a long time. If we still need to use RT_FREE_RECURSE(), it should be updated. > Thirdly, cosmetic: With the introduction of single-value leaves, it > seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think? Agreed. On Fri, Mar 1, 2024 at 3:58 PM John Naylor wrote: > > I wrote: > > > Secondly, I thought about my recent work to skip checking if we first > > need to create a root node, and that has a harmless (for vacuum at > > least) but slightly untidy behavior: When RT_SET is first called, and > >
Re: [PoC] Improve dead tuple storage for lazy vacuum
I wrote:
> Secondly, I thought about my recent work to skip checking if we first
> need to create a root node, and that has a harmless (for vacuum at
> least) but slightly untidy behavior: When RT_SET is first called, and
> the key is bigger than 255, new nodes will go on top of the root node.
> These have chunk '0'. If all subsequent keys are big enough, the
> original root node will stay empty. If all keys are deleted, there will
> be a chain of empty nodes remaining. Again, I believe this is
> harmless, but to make tidy, it should be easy to teach RT_EXTEND_UP to
> call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work
> on this, but likely not today.

This turns out to be a lot trickier than it looked, so it seems best
to allow a trivial amount of waste, as long as it's documented
somewhere. It also wouldn't be terrible to re-add those branches,
since they're highly predictable.

I just noticed there are a lot of unused function parameters (referring
to parent slots) left over from a few weeks ago. Those are removed in
v64-0009. 0010 makes the obvious name change in those remaining to
"parent_slot". 0011 is a simplification in two places regarding
reserving slots. This should be a bit easier to read and possibly makes
it easier on the compiler.

v64-ART.tar.gz
Description: application/gzip
Re: [PoC] Improve dead tuple storage for lazy vacuum
I'm looking at RT_FREE_RECURSE again (only used for DSA memory), and
I'm not convinced it's freeing all the memory. It's been many months
since we discussed this last, but IIRC we cannot just tell DSA to free
all its segments, right? Is there currently anything preventing us
from destroying the whole DSA area at once?

+ /* The last level node has pointers to values */
+ if (shift == 0)
+ {
+     dsa_free(tree->dsa, ptr);
+     return;
+ }

IIUC, this doesn't actually free leaves, it only frees the last-level
node. And, this function is unaware of whether children could be
embedded values. I'm thinking we need to get rid of the above
pre-check and instead have each node kind do something like (e.g.
node4):

RT_PTR_ALLOC child = n4->children[i];

if (shift > 0)
    RT_FREE_RECURSE(tree, child, shift - RT_SPAN);
else if (!RT_CHILDPTR_IS_VALUE(child))
    dsa_free(tree->dsa, child);

...or am I missing something?
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Feb 20, 2024 at 1:59 PM Masahiko Sawada wrote:

> - v63-0008 patch fixes a bug in tidstore.

- page->nwords = wordnum + 1;
- Assert(page->nwords = WORDS_PER_PAGE(offsets[num_offsets - 1]));
+ page->nwords = wordnum;
+ Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1]));

Yikes, I'm guessing this failed in non-assert builds? I wonder why
my compiler didn't yell at me... Have you tried a tidstore-debug build
without asserts?

> - v63-0009 patch is a draft idea of cleanup memory context handling.

Thanks, looks pretty good!

+ ts->rt_context = AllocSetContextCreate(CurrentMemoryContext,
+                                        "tidstore storage",

"tidstore storage" sounds a bit strange -- maybe look at some other
context names for ideas.

- leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx, allocsize);
+ leaf.alloc = (RT_PTR_ALLOC) MemoryContextAlloc(tree->leaf_ctx != NULL
+                                                ? tree->leaf_ctx
+                                                : tree->context, allocsize);

Instead of branching here, can we copy "context" to "leaf_ctx" when
necessary (those names should look more like each other, btw)? I think
that means anything not covered by this case:

+#ifndef RT_VARLEN_VALUE_SIZE
+ if (sizeof(RT_VALUE_TYPE) > sizeof(RT_PTR_ALLOC))
+     tree->leaf_ctx = SlabContextCreate(ctx,
+                                        RT_STR(RT_PREFIX) "radix_tree leaf contex",
+                                        RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)),
+                                        sizeof(RT_VALUE_TYPE));
+#endif /* !RT_VARLEN_VALUE_SIZE */

...also, we should document why we're using slab here. On that, I
don't recall why we are? We've never had a fixed-length type test case
on 64-bit, so it wasn't because it won through benchmarking. It seems
a hold-over from the days of "multi-value leaves". Is it to avoid the
possibility of space wastage with non-power-of-two size types?
For this stanza that remains unchanged:

 for (int i = 0; i < RT_SIZE_CLASS_COUNT; i++)
 {
  MemoryContextDelete(tree->node_slabs[i]);
 }

 if (tree->leaf_ctx)
 {
  MemoryContextDelete(tree->leaf_ctx);
 }

...is there a reason we can't just delete tree->ctx, and let that recursively delete child contexts?

Secondly, I thought about my recent work to skip checking if we first need to create a root node, and that has a harmless (for vacuum at least) but slightly untidy behavior: When RT_SET is first called, and the key is bigger than 255, new nodes will go on top of the root node. These have chunk '0'. If all subsequent keys are big enough, the original root node will stay empty. If all keys are deleted, there will be a chain of empty nodes remaining. Again, I believe this is harmless, but to make it tidy, it should be easy to teach RT_EXTEND_UP to call out to RT_EXTEND_DOWN if it finds the tree is empty. I can work on this, but likely not today.

Thirdly, cosmetic: With the introduction of single-value leaves, it seems we should do s/RT_NODE_PTR/RT_CHILD_PTR/ -- what do you think?
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Feb 19, 2024 at 7:47 PM John Naylor wrote: > > On Mon, Feb 19, 2024 at 9:02 AM Masahiko Sawada wrote: > > > > I think that vacuum and tidbitmap (and future users) would end up > > having the same max block size calculation. And it seems slightly odd > > layering to me that max-block-size-specified context is created on > > vacuum (or tidbitmap) layer, a varlen-value radix tree is created by > > tidstore layer, and the passed context is used for leaves (if > > varlen-value is used) on radix tree layer. > > That sounds slightly more complicated than I was thinking of, but we > could actually be talking about the same thing: I'm drawing a > distinction between "used = must be detected / #ifdef'd" and "used = > actually happens to call allocation". I meant that the passed context > would _always_ be used for leaves, regardless of varlen or not. So > with fixed-length values short enough to live in child pointer slots, > that context would still be used for iteration etc. > > > Another idea is to create a > > max-block-size-specified context on the tidstore layer. That is, > > vacuum and tidbitmap pass a work_mem and a flag indicating whether the > > tidstore can use the bump context, and tidstore creates a (aset of > > bump) memory context with the calculated max block size and passes it > > to the radix tree. > > That might be a better abstraction since both uses have some memory limit. I've drafted this idea, and fixed a bug in tidstore.c. Here is the summary of updates from v62: - removed v62-0007 patch as we discussed - squashed v62-0006 and v62-0008 patches into 0003 patch - v63-0008 patch fixes a bug in tidstore. - v63-0009 patch is a draft idea of cleanup memory context handling. > > > As for using the bump memory context, I feel that we need to store > > iterator struct in aset context at least as it can be individually > > freed and re-created. Or it might not be necessary to allocate the > > iterator struct in the same context as radix tree. 
> > Okay, that's one thing I was concerned about. Since we don't actually > have a bump context yet, it seems simple to assume aset for non-nodes, > and if we do get it, we can adjust slightly. Anyway, this seems like a > good thing to try to clean up, but it's also not a show-stopper. > > On that note: I will be going on honeymoon shortly, and then to PGConf > India, so I will have sporadic connectivity for the next 10 days and > won't be doing any hacking during that time. Thank you for letting us know. Enjoy yourself! Regards -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v63-ART.tar.gz Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Feb 19, 2024 at 9:02 AM Masahiko Sawada wrote: > > I think that vacuum and tidbitmap (and future users) would end up > having the same max block size calculation. And it seems slightly odd > layering to me that max-block-size-specified context is created on > vacuum (or tidbitmap) layer, a varlen-value radix tree is created by > tidstore layer, and the passed context is used for leaves (if > varlen-value is used) on radix tree layer. That sounds slightly more complicated than I was thinking of, but we could actually be talking about the same thing: I'm drawing a distinction between "used = must be detected / #ifdef'd" and "used = actually happens to call allocation". I meant that the passed context would _always_ be used for leaves, regardless of varlen or not. So with fixed-length values short enough to live in child pointer slots, that context would still be used for iteration etc. > Another idea is to create a > max-block-size-specified context on the tidstore layer. That is, > vacuum and tidbitmap pass a work_mem and a flag indicating whether the > tidstore can use the bump context, and tidstore creates a (aset of > bump) memory context with the calculated max block size and passes it > to the radix tree. That might be a better abstraction since both uses have some memory limit. > As for using the bump memory context, I feel that we need to store > iterator struct in aset context at least as it can be individually > freed and re-created. Or it might not be necessary to allocate the > iterator struct in the same context as radix tree. Okay, that's one thing I was concerned about. Since we don't actually have a bump context yet, it seems simple to assume aset for non-nodes, and if we do get it, we can adjust slightly. Anyway, this seems like a good thing to try to clean up, but it's also not a show-stopper. 
On that note: I will be going on honeymoon shortly, and then to PGConf India, so I will have sporadic connectivity for the next 10 days and won't be doing any hacking during that time. Andres, did you want to take a look at the radix tree patch 0003? Aside from the above possible cleanup, most of it should be stable.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Feb 16, 2024 at 12:41 PM John Naylor wrote: > > On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada > wrote: > > > v61-0007: Runtime-embeddable tids -- Optional for v17, but should > > > reduce memory regressions, so should be considered. Up to 3 tids can > > > be stored in the last level child pointer. It's not polished, but I'll > > > only proceed with that if we think we need this. "flags" iis called > > > that because it could hold tidbitmap.c booleans (recheck, lossy) in > > > the future, in addition to reserving space for the pointer tag. Note: > > > I hacked the tests to only have 2 offsets per block to demo, but of > > > course both paths should be tested. > > > > Interesting. I've run the same benchmark tests we did[1][2] (the > > median of 3 runs): > > > > monotonically ordered int column index: > > > > master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s > > v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s > > v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s > > Hmm, that's strange -- this test is intended to delete all records > from the last 20% of the blocks, so I wouldn't expect any improvement > here, only in the sparse case. Maybe something is wrong. All the more > reason to put it off... Okay, let's dig it deeper later. > > > I'm happy to see a huge improvement. While it's really fascinating to > > me, I'm concerned about the time left until the feature freeze. We > > need to polish both tidstore and vacuum integration patches in 5 > > weeks. Personally I'd like to have it as a separate patch for now, and > > focus on completing the main three patches since we might face some > > issues after pushing these patches. I think with 0007 patch it's a big > > win but it's still a win even without 0007 patch. > > Agreed to not consider it for initial commit. I'll hold on to it for > some future time. > > > > 2. Management of memory contexts. It's pretty verbose and messy. 
I > > > think the abstraction could be better: > > > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we > > > can't destroy or reset it. That means we have to do a lot of manual > > > work. > > > B: Passing "max_bytes" to the radix tree was my idea, I believe, but > > > it seems the wrong responsibility. Not all uses will have a > > > work_mem-type limit, I'm guessing. We only use it for limiting the max > > > block size, and aset's default 8MB is already plenty small for > > > vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so > > > smaller, and there it makes sense to limit the max blocksize this way. > > > C: The context for values has complex #ifdefs based on the value > > > length/varlen, but it's both too much and not enough. If we get a bump > > > context, how would we shoehorn that in for values for vacuum but not > > > for tidbitmap? > > > > > > Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to > > > TidStoreCreate(), and then to RT_CREATE. That context will contain the > > > values (for local mem), and the node slabs will be children of the > > > value context. That way, measuring memory usage and free-ing can just > > > call with this parent context, and let recursion handle the rest. > > > Perhaps the passed context can also hold the radix-tree struct, but > > > I'm not sure since I haven't tried it. What do you think? > > > > If I understand your idea correctly, RT_CREATE() creates the context > > for values as a child of the passed context and the node slabs as > > children of the value context. That way, measuring memory usage can > > just call with the value context. It sounds like a good idea. But it > > was not clear to me how to address point B and C. > > For B & C, vacuum would create a context to pass to TidStoreCreate, > and it wouldn't need to bother changing max block size. RT_CREATE > would use that directly for leaves (if any), and would only create > child slab contexts under it. 
It would not need to know about > max_bytes. Modifyng your diagram a bit, something like: > > - caller-supplied radix tree memory context (the 3 structs -- and > leaves, if any) (aset (or future bump?)) > - node slab contexts > > This might only be workable with aset, if we need to individually free > the structs. (I haven't studied this, it was a recent idea) > It's simpler, because with small fixed length values, we don't need to > detect that and avoid creating a leaf context. All leaves would live > in the same context as the structs. Thank you for the explanation. I think that vacuum and tidbitmap (and future users) would end up having the same max block size calculation. And it seems slightly odd layering to me that max-block-size-specified context is created on vacuum (or tidbitmap) layer, a varlen-value radix tree is created by tidstore layer, and the passed context is used for leaves (if varlen-value is used) on radix tree layer. Another idea is to create a max-block-size-specified
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada wrote: > > v61-0007: Runtime-embeddable tids -- Optional for v17, but should > > reduce memory regressions, so should be considered. Up to 3 tids can > > be stored in the last level child pointer. It's not polished, but I'll > > only proceed with that if we think we need this. "flags" iis called > > that because it could hold tidbitmap.c booleans (recheck, lossy) in > > the future, in addition to reserving space for the pointer tag. Note: > > I hacked the tests to only have 2 offsets per block to demo, but of > > course both paths should be tested. > > Interesting. I've run the same benchmark tests we did[1][2] (the > median of 3 runs): > > monotonically ordered int column index: > > master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s > v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s > v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s Hmm, that's strange -- this test is intended to delete all records from the last 20% of the blocks, so I wouldn't expect any improvement here, only in the sparse case. Maybe something is wrong. All the more reason to put it off... > I'm happy to see a huge improvement. While it's really fascinating to > me, I'm concerned about the time left until the feature freeze. We > need to polish both tidstore and vacuum integration patches in 5 > weeks. Personally I'd like to have it as a separate patch for now, and > focus on completing the main three patches since we might face some > issues after pushing these patches. I think with 0007 patch it's a big > win but it's still a win even without 0007 patch. Agreed to not consider it for initial commit. I'll hold on to it for some future time. > > 2. Management of memory contexts. It's pretty verbose and messy. I > > think the abstraction could be better: > > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we > > can't destroy or reset it. 
That means we have to do a lot of manual > > work. > > B: Passing "max_bytes" to the radix tree was my idea, I believe, but > > it seems the wrong responsibility. Not all uses will have a > > work_mem-type limit, I'm guessing. We only use it for limiting the max > > block size, and aset's default 8MB is already plenty small for > > vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so > > smaller, and there it makes sense to limit the max blocksize this way. > > C: The context for values has complex #ifdefs based on the value > > length/varlen, but it's both too much and not enough. If we get a bump > > context, how would we shoehorn that in for values for vacuum but not > > for tidbitmap? > > > > Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to > > TidStoreCreate(), and then to RT_CREATE. That context will contain the > > values (for local mem), and the node slabs will be children of the > > value context. That way, measuring memory usage and free-ing can just > > call with this parent context, and let recursion handle the rest. > > Perhaps the passed context can also hold the radix-tree struct, but > > I'm not sure since I haven't tried it. What do you think? > > If I understand your idea correctly, RT_CREATE() creates the context > for values as a child of the passed context and the node slabs as > children of the value context. That way, measuring memory usage can > just call with the value context. It sounds like a good idea. But it > was not clear to me how to address point B and C. For B & C, vacuum would create a context to pass to TidStoreCreate, and it wouldn't need to bother changing max block size. RT_CREATE would use that directly for leaves (if any), and would only create child slab contexts under it. It would not need to know about max_bytes. 
Modifying your diagram a bit, something like:

- caller-supplied radix tree memory context (the 3 structs -- and leaves, if any) (aset (or future bump?))
  - node slab contexts

This might only be workable with aset, if we need to individually free the structs. (I haven't studied this, it was a recent idea.) It's simpler, because with small fixed-length values, we don't need to detect that and avoid creating a leaf context. All leaves would live in the same context as the structs.

> Another variant of this idea would be that RT_CREATE() creates the
> parent context of the value context to store radix-tree struct. That
> is, the hierarchy would be like:
>
> A MemoryContext (passed by vacuum through tidstore)
>  - radix tree memory context (stores radix-tree struct, control
>    struct, and iterator)
>    - value context (aset, slab, or bump)
>    - node slab contexts

The template handling the value context here is complex, and is what I meant by 'C' above. Most fixed-length allocations in all of the backend are aset, so it seems fine to use it always.

> Freeing can just call with the radix tree memory context. And perhaps
> it works even if tidstore passes
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Feb 15, 2024 at 8:26 PM John Naylor wrote: > > On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada > wrote: > > > > On Sat, Feb 10, 2024 at 9:29 PM John Naylor wrote: > > > I've also run the same scripts in my environment just in case and got > > similar results: > > Thanks for testing, looks good as well. > > > > There are still some micro-benchmarks we could do on tidstore, and > > > it'd be good to find out worse-case memory use (1 dead tuple each on > > > spread-out pages), but this is decent demonstration. > > > > I've tested a simple case where vacuum removes 33k dead tuples spread > > about every 10 pages. > > > > master: > > 198,000 bytes (=33000 * 6) > > system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s > > > > v-59: > > 2,834,432 bytes (reported by TidStoreMemoryUsage()) > > system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s > > The memory usage for the sparse case may be a concern, although it's > not bad -- a multiple of something small is probably not huge in > practice. See below for an option we have for this. > > > > > > I'm pretty sure there's an > > > > > accidental memset call that crept in there, but I'm running out of > > > > > steam today. > > > > > > I have just a little bit of work to add for v59: > > > > > > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero > > > any bitmapwords. That can only happen if e.g. there is an offset > 128 > > > and there are none between 64 and 128, so not a huge deal but I think > > > it's a bit nicer in this patch. > > > > LGTM. > > Okay, I've squashed this. > > > I've drafted the commit message. > > Thanks, this is a good start. > > > I've run regression tests with valgrind and run the coverity scan, and > > I don't see critical issues. > > Great! > > Now, I think we're in pretty good shape. There are a couple of things > that might be objectionable, so I want to try to improve them in the > little time we have: > > 1. Memory use for the sparse case. 
I shared an idea a few months ago > of how runtime-embeddable values (true combined pointer-value slots) > could work for tids. I don't think this is a must-have, but it's not a > lot of code, and I have this working: > > v61-0006: Preparatory refactoring -- I think we should do this anyway, > since the intent seems more clear to me. Looks good refactoring to me. > v61-0007: Runtime-embeddable tids -- Optional for v17, but should > reduce memory regressions, so should be considered. Up to 3 tids can > be stored in the last level child pointer. It's not polished, but I'll > only proceed with that if we think we need this. "flags" iis called > that because it could hold tidbitmap.c booleans (recheck, lossy) in > the future, in addition to reserving space for the pointer tag. Note: > I hacked the tests to only have 2 offsets per block to demo, but of > course both paths should be tested. Interesting. I've run the same benchmark tests we did[1][2] (the median of 3 runs): monotonically ordered int column index: master: system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s v-59: system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s v-62: system usage: CPU: user: 1.94 s, system: 0.69 s, elapsed: 2.64 s uuid column index: master: system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s v-59: system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s v-62: system usage: CPU: user: 4.06 s, system: 0.98 s, elapsed: 5.06 s int & uuid indexes in parallel: master: system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s v-59: system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s v-62: system usage: CPU: user: 2.54 s, system: 0.94 s, elapsed: 6.00 s sparse case: master: 198,000 bytes (=33000 * 6) system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s v-59: 2,834,432 bytes (reported by TidStoreMemoryUsage()) system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s v-62: 729,088 bytes (reported 
by TidStoreMemoryUsage()) system usage: CPU: user: 4.63 s, system: 0.86 s, elapsed: 5.50 s I'm happy to see a huge improvement. While it's really fascinating to me, I'm concerned about the time left until the feature freeze. We need to polish both tidstore and vacuum integration patches in 5 weeks. Personally I'd like to have it as a separate patch for now, and focus on completing the main three patches since we might face some issues after pushing these patches. I think with 0007 patch it's a big win but it's still a win even without 0007 patch. > > 2. Management of memory contexts. It's pretty verbose and messy. I > think the abstraction could be better: > A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we > can't destroy or reset it. That means we have to do a lot of manual > work. > B: Passing "max_bytes" to the radix tree was my idea, I believe, but > it seems the wrong responsibility. Not all uses will have a > work_mem-type limit, I'm
Re: [PoC] Improve dead tuple storage for lazy vacuum
v61 had a brown-paper-bag bug in the embedded tids patch that didn't present in the tidstore test, but caused vacuum to fail, fixed in v62. v62-ART.tar.gz Description: application/gzip
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Feb 15, 2024 at 10:21 AM Masahiko Sawada wrote: > > On Sat, Feb 10, 2024 at 9:29 PM John Naylor wrote: > I've also run the same scripts in my environment just in case and got > similar results: Thanks for testing, looks good as well. > > There are still some micro-benchmarks we could do on tidstore, and > > it'd be good to find out worse-case memory use (1 dead tuple each on > > spread-out pages), but this is decent demonstration. > > I've tested a simple case where vacuum removes 33k dead tuples spread > about every 10 pages. > > master: > 198,000 bytes (=33000 * 6) > system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s > > v-59: > 2,834,432 bytes (reported by TidStoreMemoryUsage()) > system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s The memory usage for the sparse case may be a concern, although it's not bad -- a multiple of something small is probably not huge in practice. See below for an option we have for this. > > > > I'm pretty sure there's an > > > > accidental memset call that crept in there, but I'm running out of > > > > steam today. > > > > I have just a little bit of work to add for v59: > > > > v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero > > any bitmapwords. That can only happen if e.g. there is an offset > 128 > > and there are none between 64 and 128, so not a huge deal but I think > > it's a bit nicer in this patch. > > LGTM. Okay, I've squashed this. > I've drafted the commit message. Thanks, this is a good start. > I've run regression tests with valgrind and run the coverity scan, and > I don't see critical issues. Great! Now, I think we're in pretty good shape. There are a couple of things that might be objectionable, so I want to try to improve them in the little time we have: 1. Memory use for the sparse case. I shared an idea a few months ago of how runtime-embeddable values (true combined pointer-value slots) could work for tids. 
I don't think this is a must-have, but it's not a lot of code, and I have this working:

v61-0006: Preparatory refactoring -- I think we should do this anyway, since the intent seems clearer to me.

v61-0007: Runtime-embeddable tids -- Optional for v17, but should reduce memory regressions, so should be considered. Up to 3 tids can be stored in the last-level child pointer. It's not polished, but I'll only proceed with that if we think we need this. "flags" is called that because it could hold tidbitmap.c booleans (recheck, lossy) in the future, in addition to reserving space for the pointer tag. Note: I hacked the tests to only have 2 offsets per block to demo, but of course both paths should be tested.

2. Management of memory contexts. It's pretty verbose and messy. I think the abstraction could be better:

A: tidstore currently passes CurrentMemoryContext to RT_CREATE, so we can't destroy or reset it. That means we have to do a lot of manual work.

B: Passing "max_bytes" to the radix tree was my idea, I believe, but it seems the wrong responsibility. Not all uses will have a work_mem-type limit, I'm guessing. We only use it for limiting the max block size, and aset's default 8MB is already plenty small for vacuum's large limit anyway. tidbitmap.c's limit is work_mem, so smaller, and there it makes sense to limit the max block size this way.

C: The context for values has complex #ifdefs based on the value length/varlen, but it's both too much and not enough. If we get a bump context, how would we shoehorn that in for values for vacuum but not for tidbitmap?

Here's an idea: Have vacuum (or tidbitmap etc.) pass a context to TidStoreCreate(), and then to RT_CREATE. That context will contain the values (for local mem), and the node slabs will be children of the value context. That way, measuring memory usage and freeing can just call with this parent context, and let recursion handle the rest.
Perhaps the passed context can also hold the radix-tree struct, but I'm not sure since I haven't tried it. What do you think? With this resolved, I think the radix tree is pretty close to committable. The tid store will likely need some polish yet, but no major issues I know of. (And, finally, a small thing I that I wanted to share just so I don't forget, but maybe not worth the attention: In Andres's prototype, there is a comment wondering if an update can skip checking if it first need to create a root node. This is pretty easy, and done in v61-0008.) v61-ART.tar.gz Description: application/gzip
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sat, Feb 10, 2024 at 9:29 PM John Naylor wrote: > > On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada wrote: > > > > On Fri, Feb 2, 2024 at 8:47 PM John Naylor wrote: > > > > My todo: > > > - benchmark tid store / vacuum again, since we haven't since varlen > > > types and removing unnecessary locks. > > I ran a vacuum benchmark similar to the one in [1] (unlogged tables > for reproducibility), but smaller tables (100 million records), > deleting only the last 20% of the table, and including a parallel > vacuum test. Scripts attached. > > monotonically ordered int column index: > > master: > system usage: CPU: user: 4.27 s, system: 0.41 s, elapsed: 4.70 s > system usage: CPU: user: 4.23 s, system: 0.44 s, elapsed: 4.69 s > system usage: CPU: user: 4.26 s, system: 0.39 s, elapsed: 4.66 s > > v-59: > system usage: CPU: user: 3.10 s, system: 0.44 s, elapsed: 3.56 s > system usage: CPU: user: 3.07 s, system: 0.35 s, elapsed: 3.43 s > system usage: CPU: user: 3.07 s, system: 0.36 s, elapsed: 3.44 s > > uuid column index: > > master: > system usage: CPU: user: 18.22 s, system: 1.70 s, elapsed: 20.01 s > system usage: CPU: user: 17.70 s, system: 1.70 s, elapsed: 19.48 s > system usage: CPU: user: 18.48 s, system: 1.59 s, elapsed: 20.43 s > > v-59: > system usage: CPU: user: 5.18 s, system: 1.18 s, elapsed: 6.45 s > system usage: CPU: user: 6.56 s, system: 1.39 s, elapsed: 7.99 s > system usage: CPU: user: 6.51 s, system: 1.44 s, elapsed: 8.05 s > > int & uuid indexes in parallel: > > master: > system usage: CPU: user: 4.53 s, system: 1.22 s, elapsed: 20.43 s > system usage: CPU: user: 4.49 s, system: 1.29 s, elapsed: 20.98 s > system usage: CPU: user: 4.46 s, system: 1.33 s, elapsed: 20.50 s > > v59: > system usage: CPU: user: 2.09 s, system: 0.32 s, elapsed: 4.86 s > system usage: CPU: user: 3.76 s, system: 0.51 s, elapsed: 8.92 s > system usage: CPU: user: 3.83 s, system: 0.54 s, elapsed: 9.09 s > > Over all, I'm pleased with these results, although I'm confused why > 
sometimes with the patch the first run reports running faster than the > others. I'm curious what others get. Traversing a tree that lives in > DSA has some overhead, as expected, but still comes out way ahead of > master. Thanks! That's a great improvement. I've also run the same scripts in my environment just in case and got similar results: monotonically ordered int column index: master: system usage: CPU: user: 14.81 s, system: 0.90 s, elapsed: 15.74 s system usage: CPU: user: 14.91 s, system: 0.80 s, elapsed: 15.73 s system usage: CPU: user: 14.85 s, system: 0.70 s, elapsed: 15.57 s v-59: system usage: CPU: user: 9.47 s, system: 1.04 s, elapsed: 10.53 s system usage: CPU: user: 9.67 s, system: 0.81 s, elapsed: 10.50 s system usage: CPU: user: 9.59 s, system: 0.86 s, elapsed: 10.47 s uuid column index: master: system usage: CPU: user: 28.37 s, system: 1.38 s, elapsed: 29.81 s system usage: CPU: user: 28.05 s, system: 1.37 s, elapsed: 29.47 s system usage: CPU: user: 28.46 s, system: 1.36 s, elapsed: 29.88 s v-59: system usage: CPU: user: 14.87 s, system: 1.13 s, elapsed: 16.02 s system usage: CPU: user: 14.84 s, system: 1.31 s, elapsed: 16.18 s system usage: CPU: user: 10.96 s, system: 1.24 s, elapsed: 12.22 s int & uuid indexes in parallel: master: system usage: CPU: user: 15.81 s, system: 1.43 s, elapsed: 34.31 s system usage: CPU: user: 15.84 s, system: 1.41 s, elapsed: 34.34 s system usage: CPU: user: 15.92 s, system: 1.39 s, elapsed: 34.33 s v-59: system usage: CPU: user: 10.93 s, system: 0.92 s, elapsed: 17.59 s system usage: CPU: user: 10.92 s, system: 1.20 s, elapsed: 17.58 s system usage: CPU: user: 10.90 s, system: 1.01 s, elapsed: 17.45 s > > There are still some micro-benchmarks we could do on tidstore, and > it'd be good to find out worse-case memory use (1 dead tuple each on > spread-out pages), but this is decent demonstration. I've tested a simple case where vacuum removes 33k dead tuples spread about every 10 pages. 
master: 198,000 bytes (=33000 * 6) system usage: CPU: user: 29.49 s, system: 0.88 s, elapsed: 30.40 s v-59: 2,834,432 bytes (reported by TidStoreMemoryUsage()) system usage: CPU: user: 15.96 s, system: 0.89 s, elapsed: 16.88 s > > > > I'm not sure what the test_node_types_* functions are testing that > > > test_basic doesn't. They have a different, and confusing, way to stop > > > at every size class and check the keys/values. It seems we can replace > > > all that with two more calls (asc/desc) to test_basic, with the > > > maximum level. > > v58-0008: > > + /* borrowed from RT_MAX_SHIFT */ > + const int max_shift = (pg_leftmost_one_pos64(UINT64_MAX) / > BITS_PER_BYTE) * BITS_PER_BYTE; > > This is harder to read than "64 - 8", and doesn't really help > maintainability either. > Maybe "(sizeof(uint64) - 1) * BITS_PER_BYTE" is a good compromise. > > + /* leaf nodes */ > + test_basic(test_info, 0); > > + /* internal nodes */ > + test_basic(test_info, 8); > + > + /* max-level nodes */ > +
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Feb 6, 2024 at 9:58 AM Masahiko Sawada wrote: > > On Fri, Feb 2, 2024 at 8:47 PM John Naylor wrote: > > My todo: > > - benchmark tid store / vacuum again, since we haven't since varlen > > types and removing unnecessary locks. I ran a vacuum benchmark similar to the one in [1] (unlogged tables for reproducibility), but smaller tables (100 million records), deleting only the last 20% of the table, and including a parallel vacuum test. Scripts attached. monotonically ordered int column index: master: system usage: CPU: user: 4.27 s, system: 0.41 s, elapsed: 4.70 s system usage: CPU: user: 4.23 s, system: 0.44 s, elapsed: 4.69 s system usage: CPU: user: 4.26 s, system: 0.39 s, elapsed: 4.66 s v-59: system usage: CPU: user: 3.10 s, system: 0.44 s, elapsed: 3.56 s system usage: CPU: user: 3.07 s, system: 0.35 s, elapsed: 3.43 s system usage: CPU: user: 3.07 s, system: 0.36 s, elapsed: 3.44 s uuid column index: master: system usage: CPU: user: 18.22 s, system: 1.70 s, elapsed: 20.01 s system usage: CPU: user: 17.70 s, system: 1.70 s, elapsed: 19.48 s system usage: CPU: user: 18.48 s, system: 1.59 s, elapsed: 20.43 s v-59: system usage: CPU: user: 5.18 s, system: 1.18 s, elapsed: 6.45 s system usage: CPU: user: 6.56 s, system: 1.39 s, elapsed: 7.99 s system usage: CPU: user: 6.51 s, system: 1.44 s, elapsed: 8.05 s int & uuid indexes in parallel: master: system usage: CPU: user: 4.53 s, system: 1.22 s, elapsed: 20.43 s system usage: CPU: user: 4.49 s, system: 1.29 s, elapsed: 20.98 s system usage: CPU: user: 4.46 s, system: 1.33 s, elapsed: 20.50 s v59: system usage: CPU: user: 2.09 s, system: 0.32 s, elapsed: 4.86 s system usage: CPU: user: 3.76 s, system: 0.51 s, elapsed: 8.92 s system usage: CPU: user: 3.83 s, system: 0.54 s, elapsed: 9.09 s Over all, I'm pleased with these results, although I'm confused why sometimes with the patch the first run reports running faster than the others. I'm curious what others get. 
Traversing a tree that lives in DSA has some overhead, as expected, but it still comes out way ahead of master. There are still some micro-benchmarks we could do on tidstore, and it'd be good to find out worst-case memory use (1 dead tuple each on spread-out pages), but this is a decent demonstration.

> > I'm not sure what the test_node_types_* functions are testing that
> > test_basic doesn't. They have a different, and confusing, way to stop
> > at every size class and check the keys/values. It seems we can replace
> > all that with two more calls (asc/desc) to test_basic, with the
> > maximum level.

v58-0008:

+ /* borrowed from RT_MAX_SHIFT */
+ const int max_shift = (pg_leftmost_one_pos64(UINT64_MAX) / BITS_PER_BYTE) * BITS_PER_BYTE;

This is harder to read than "64 - 8", and doesn't really help maintainability either. Maybe "(sizeof(uint64) - 1) * BITS_PER_BYTE" is a good compromise.

+ /* leaf nodes */
+ test_basic(test_info, 0);
+ /* internal nodes */
+ test_basic(test_info, 8);
+
+ /* max-level nodes */
+ test_basic(test_info, max_shift);

This three-way terminology is not very informative. How about:

+ /* a tree with one level, i.e. a single node under the root node. */
...
+ /* a tree with two levels */
...
+ /* a tree with the maximum number of levels */

+static void
+test_basic(rt_node_class_test_elem *test_info, int shift)
+{
+	elog(NOTICE, "testing node %s with shift %d", test_info->class_name, shift);
+
+	/* Test nodes while changing the key insertion order */
+	do_test_basic(test_info->nkeys, shift, false);
+	do_test_basic(test_info->nkeys, shift, true);

Adding a level of indirection makes this harder to read, and do we still know whether a test failed with asc or desc keys?

> > My earlier opinion was that "handle" was a nicer variable name, but
> > this brings back the typedef and also keeps the variable name I didn't
> > like, but pushes it down into the function. I'm a bit confused, so
> > I've kept these not-squashed for now.
> > I misunderstood your comment. I've changed to use a variable name
> > rt_handle and removed the TidStoreHandle type. 0013 patch.

(diff against an earlier version)

- pvs->shared->dead_items_handle = TidStoreGetHandle(dead_items);
+ pvs->shared->dead_items_dp = TidStoreGetHandle(dead_items);

Shall we use "handle" in vacuum_parallel.c as well?

> > I'm pretty sure there's an
> > accidental memset call that crept in there, but I'm running out of
> > steam today.

I have just a little bit of work to add for v59:

v59-0009 - set_offset_bitmap_at() will call memset if it needs to zero any bitmapwords. That can only happen if e.g. there is an offset > 128 and there are none between 64 and 128, so not a huge deal, but I think it's a bit nicer in this patch.

> > > * WIP: notes about traditional radix tree trading off span vs height...
> > >
> > > Are you going to write it?
> >
> > Yes, when I draft a rough commit message, (for next time).

I haven't gotten to the commit message, but:

v59-0004 - I did some
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Feb 2, 2024 at 8:47 PM John Naylor wrote: > > On Wed, Jan 31, 2024 at 12:50 PM Masahiko Sawada > wrote: > > I've attached the new patch set (v56). I've squashed previous updates > > and addressed review comments on v55 in separate patches. Here are the > > update summary: > > > > 0004: fix compiler warning caught by ci test. > > 0005-0008: address review comments on radix tree codes. > > 0009: cleanup #define and #undef > > 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is > > undefined after including radixtree.h so we should not use it in test > > code. > > Great, thanks! > > I have a few questions and comments on v56, then I'll address yours > below with the attached v57, which is mostly cosmetic adjustments. Thank you for the comments! I've squashed previous updates and your changes. > > v56-0003: > > (Looking closer at tests) > > +static const bool rt_test_stats = false; > > I'm thinking we should just remove everything that depends on this, > and keep this module entirely about correctness. Agreed. Removed in 0006 patch. > > + for (int shift = 0; shift <= (64 - 8); shift += 8) > + test_node_types(shift); > > I'm not sure what the test_node_types_* functions are testing that > test_basic doesn't. They have a different, and confusing, way to stop > at every size class and check the keys/values. It seems we can replace > all that with two more calls (asc/desc) to test_basic, with the > maximum level. Agreed, addressed in 0007 patch. > > It's pretty hard to see what test_pattern() is doing, or why it's > useful. I wonder if instead the test could use something like the > benchmark where random integers are masked off. That seems simpler. I > can work on that, but I'd like to hear your side about test_pattern(). Yeah, test_pattern() is originally created for the integerset so it doesn't necessarily fit the radixtree. I agree to use some tests from benchmarks. 
> > v56-0007: > > + * > + * Since we can rely on DSA_AREA_LOCK to get the total amount of DSA memory, > + * the caller doesn't need to take a lock. > > Maybe something like "Since dsa_get_total_size() does appropriate locking > ..."? Agreed. Fixed in 0005 patch. > > v56-0008 > > Thanks, I like how the tests look now. > > -NOTICE: testing node 4 with height 0 and ascending keys > ... > +NOTICE: testing node 1 with height 0 and ascending keys > > Now that the number is not intended to match a size class, "node X" > seems out of place. Maybe we could have a separate array with strings? > > + 1, /* RT_CLASS_4 */ > > This should be more than one, so that the basic test still exercises > paths that shift elements around. > > + 100, /* RT_CLASS_48 */ > > This node currently holds 64 for local memory. > > + 255 /* RT_CLASS_256 */ > > This is the only one where we know exactly how many it can take, so > may as well keep it at 256. Fixed in 0008 patch. > > v56-0012: > > The test module for tidstore could use a few more comments. Addressed in 0012 patch. > > v56-0015: > > +typedef dsa_pointer TidStoreHandle; > + > > -TidStoreAttach(dsa_area *area, dsa_pointer rt_dp) > +TidStoreAttach(dsa_area *area, TidStoreHandle handle) > { > TidStore *ts; > + dsa_pointer rt_dp = handle; > > My earlier opinion was that "handle" was a nicer variable name, but > this brings back the typedef and also keeps the variable name I didn't > like, but pushes it down into the function. I'm a bit confused, so > I've kept these not-squashed for now. I misunderstood your comment. I've changed to use a variable name rt_handle and removed the TidStoreHandle type. 0013 patch. > > --- > > Now, for v57: > > > Looking at overall changes, there are still XXX and TODO comments in > > radixtree.h: > > That's fine, as long as it's intentional as a message to readers. 
That > said, we can get rid of some: > > > --- > > * XXX There are 4 node kinds, and this should never be increased, > > * for several reasons: > > * 1. With 5 or more kinds, gcc tends to use a jump table for switch > > *statements. > > * 2. The 4 kinds can be represented with 2 bits, so we have the option > > *in the future to tag the node pointer with the kind, even on > > *platforms with 32-bit pointers. This might speed up node traversal > > *in trees with highly random node kinds. > > * 3. We can have multiple size classes per node kind. > > > > Can we just remove "XXX"? > > How about "NOTE"? Agreed. > > > --- > > * WIP: notes about traditional radix tree trading off span vs height... > > > > Are you going to write it? > > Yes, when I draft a rough commit message, (for next time). Thanks! > > > --- > > #ifdef RT_SHMEM > > /* WIP: do we really need this? */ > > typedef dsa_pointer RT_HANDLE; > > #endif > > > > I think it's worth having it. > > Okay, removed WIP in v57-0004. > > > --- > > * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Jan 31, 2024 at 12:50 PM Masahiko Sawada wrote: > I've attached the new patch set (v56). I've squashed previous updates > and addressed review comments on v55 in separate patches. Here are the > update summary: > > 0004: fix compiler warning caught by ci test. > 0005-0008: address review comments on radix tree codes. > 0009: cleanup #define and #undef > 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is > undefined after including radixtree.h so we should not use it in test > code. Great, thanks! I have a few questions and comments on v56, then I'll address yours below with the attached v57, which is mostly cosmetic adjustments. v56-0003: (Looking closer at tests) +static const bool rt_test_stats = false; I'm thinking we should just remove everything that depends on this, and keep this module entirely about correctness. + for (int shift = 0; shift <= (64 - 8); shift += 8) + test_node_types(shift); I'm not sure what the test_node_types_* functions are testing that test_basic doesn't. They have a different, and confusing, way to stop at every size class and check the keys/values. It seems we can replace all that with two more calls (asc/desc) to test_basic, with the maximum level. It's pretty hard to see what test_pattern() is doing, or why it's useful. I wonder if instead the test could use something like the benchmark where random integers are masked off. That seems simpler. I can work on that, but I'd like to hear your side about test_pattern(). v56-0007: + * + * Since we can rely on DSA_AREA_LOCK to get the total amount of DSA memory, + * the caller doesn't need to take a lock. Maybe something like "Since dsa_get_total_size() does appropriate locking ..."? v56-0008 Thanks, I like how the tests look now. -NOTICE: testing node 4 with height 0 and ascending keys ... +NOTICE: testing node 1 with height 0 and ascending keys Now that the number is not intended to match a size class, "node X" seems out of place. 
Maybe we could have a separate array with strings? + 1, /* RT_CLASS_4 */ This should be more than one, so that the basic test still exercises paths that shift elements around. + 100, /* RT_CLASS_48 */ This node currently holds 64 for local memory. + 255 /* RT_CLASS_256 */ This is the only one where we know exactly how many it can take, so may as well keep it at 256. v56-0012: The test module for tidstore could use a few more comments. v56-0015: +typedef dsa_pointer TidStoreHandle; + -TidStoreAttach(dsa_area *area, dsa_pointer rt_dp) +TidStoreAttach(dsa_area *area, TidStoreHandle handle) { TidStore *ts; + dsa_pointer rt_dp = handle; My earlier opinion was that "handle" was a nicer variable name, but this brings back the typedef and also keeps the variable name I didn't like, but pushes it down into the function. I'm a bit confused, so I've kept these not-squashed for now. --- Now, for v57: > Looking at overall changes, there are still XXX and TODO comments in > radixtree.h: That's fine, as long as it's intentional as a message to readers. That said, we can get rid of some: > --- > * XXX There are 4 node kinds, and this should never be increased, > * for several reasons: > * 1. With 5 or more kinds, gcc tends to use a jump table for switch > *statements. > * 2. The 4 kinds can be represented with 2 bits, so we have the option > *in the future to tag the node pointer with the kind, even on > *platforms with 32-bit pointers. This might speed up node traversal > *in trees with highly random node kinds. > * 3. We can have multiple size classes per node kind. > > Can we just remove "XXX"? How about "NOTE"? > --- > * WIP: notes about traditional radix tree trading off span vs height... > > Are you going to write it? Yes, when I draft a rough commit message, (for next time). > --- > #ifdef RT_SHMEM > /* WIP: do we really need this? */ > typedef dsa_pointer RT_HANDLE; > #endif > > I think it's worth having it. Okay, removed WIP in v57-0004. 
> --- > * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit > * inside a single bitmapword on most platforms, so it's a good starting > * point. We can make it higher if we need to. > */ > #define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4) > > Are you going to work something on this? Hard-coded 64 for readability, and changed this paragraph to explain the current rationale more clearly: "The paper uses at most 64 for this node kind, and one advantage for us is that "isset" is a single bitmapword on most platforms, rather than an array, allowing the compiler to get rid of loops." > --- > /* WIP: We could go first to the higher node16 size class */ > newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO); > > Does it mean to go to RT_CLASS_16_HI and then further go to > RT_CLASS_16_LO upon further deletion? Yes. It wouldn't be much work to make
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Jan 30, 2024 at 7:20 PM John Naylor wrote: > > On Tue, Jan 30, 2024 at 7:56 AM Masahiko Sawada wrote: > > > > On Mon, Jan 29, 2024 at 8:48 PM John Naylor wrote: > > > > I meant the macro could probably be > > > > > > Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N) > > > > > > (Right now N=32). I also realize I didn't answer your question earlier > > > about block sizes being powers of two. I was talking about PG in > > > general -- I was thinking all block sizes were powers of two. If > > > that's true, I'm not sure if it's because programmers find the macro > > > calculations easy to reason about, or if there was an implementation > > > reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or > > > just above a power of two, so if we did round that up it would be > > > 128kB. > > > > Thank you for your explanation. It might be better to follow other > > codes. Does the calculation below make sense to you? > > > > RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i]; > > Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE; > > while (inner_blocksize < 32 * size_class.allocsize) > > inner_blocksize <<= 1; > > It does make sense, but we can do it more simply: > > Max(SLAB_DEFAULT_BLOCK_SIZE, pg_nextpower2_32(size * 32)) Thanks! I've attached the new patch set (v56). I've squashed previous updates and addressed review comments on v55 in separate patches. Here are the update summary: 0004: fix compiler warning caught by ci test. 0005-0008: address review comments on radix tree codes. 0009: cleanup #define and #undef 0010: use TEST_SHARED_RT macro for shared radix tree test. RT_SHMEM is undefined after including radixtree.h so we should not use it in test code. 0013-0015: address review comments on tidstore codes. 0017-0018: address review comments on vacuum integration codes. Looking at overall changes, there are still XXX and TODO comments in radixtree.h: --- * XXX There are 4 node kinds, and this should never be increased, * for several reasons: * 1. 
With 5 or more kinds, gcc tends to use a jump table for switch
* statements.
* 2. The 4 kinds can be represented with 2 bits, so we have the option
* in the future to tag the node pointer with the kind, even on
* platforms with 32-bit pointers. This might speed up node traversal
* in trees with highly random node kinds.
* 3. We can have multiple size classes per node kind.

Can we just remove "XXX"?

---
* WIP: notes about traditional radix tree trading off span vs height...

Are you going to write it?

---
#ifdef RT_SHMEM
/* WIP: do we really need this? */
typedef dsa_pointer RT_HANDLE;
#endif

I think it's worth having it.

---
* WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
* inside a single bitmapword on most platforms, so it's a good starting
* point. We can make it higher if we need to.
*/
#define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4)

Are you going to work on this?

---
/* WIP: We could go first to the higher node16 size class */
newnode = RT_ALLOC_NODE(tree, RT_NODE_KIND_16, RT_CLASS_16_LO);

Does it mean to go to RT_CLASS_16_HI and then further go to RT_CLASS_16_LO upon further deletion?

---
* TODO: The current locking mechanism is not optimized for high concurrency
* with mixed read-write workloads. In the future it might be worthwhile
* to replace it with the Optimistic Lock Coupling or ROWEX mentioned in
* the paper "The ART of Practical Synchronization" by the same authors as
* the ART paper, 2016.

I think it's not a TODO for now, but a future improvement. We can remove it.

---
/* TODO: consider 5 with subclass 1 or 2. */
#define RT_FANOUT_4 4

Is there something we need to do here?

---
/*
* Return index of the chunk and slot arrays for inserting into the node,
* such that the chunk array remains ordered.
* TODO: Improve performance for non-SIMD platforms.
*/

Are you going to work on this?

---
/* Delete the element at 'idx' */
/* TODO: replace slow memmove's */

Are you going to work on this?
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v56-ART.tar.gz Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Jan 30, 2024 at 7:56 AM Masahiko Sawada wrote: > > On Mon, Jan 29, 2024 at 8:48 PM John Naylor wrote: > > I meant the macro could probably be > > > > Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N) > > > > (Right now N=32). I also realize I didn't answer your question earlier > > about block sizes being powers of two. I was talking about PG in > > general -- I was thinking all block sizes were powers of two. If > > that's true, I'm not sure if it's because programmers find the macro > > calculations easy to reason about, or if there was an implementation > > reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or > > just above a power of two, so if we did round that up it would be > > 128kB. > > Thank you for your explanation. It might be better to follow other > codes. Does the calculation below make sense to you? > > RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i]; > Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE; > while (inner_blocksize < 32 * size_class.allocsize) > inner_blocksize <<= 1; It does make sense, but we can do it more simply: Max(SLAB_DEFAULT_BLOCK_SIZE, pg_nextpower2_32(size * 32))
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jan 29, 2024 at 8:48 PM John Naylor wrote: > > On Mon, Jan 29, 2024 at 2:29 PM Masahiko Sawada wrote: > > > > > +/* > > > > + * Calculate the slab blocksize so that we can allocate at least 32 > > > > chunks > > > > + * from the block. > > > > + */ > > > > +#define RT_SLAB_BLOCK_SIZE(size) \ > > > > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32) > > > > > > > > The first parameter seems to be trying to make the block size exact, > > > > but that's not right, because of the chunk header, and maybe > > > > alignment. If the default block size is big enough to waste only a > > > > tiny amount of space, let's just use that as-is. > > > If we use SLAB_DEFAULT_BLOCK_SIZE (8kB) for each node class, we waste > > [snip] > > We might want to calculate a better slab block size for node256 at least. > > I meant the macro could probably be > > Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N) > > (Right now N=32). I also realize I didn't answer your question earlier > about block sizes being powers of two. I was talking about PG in > general -- I was thinking all block sizes were powers of two. If > that's true, I'm not sure if it's because programmers find the macro > calculations easy to reason about, or if there was an implementation > reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or > just above a power of two, so if we did round that up it would be > 128kB. Thank you for your explanation. It might be better to follow other codes. Does the calculation below make sense to you? RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i]; Size inner_blocksize = SLAB_DEFAULT_BLOCK_SIZE; while (inner_blocksize < 32 * size_class.allocsize) inner_blocksize <<= 1; As for the lock mode in dsa.c, I've posted a question[1]. Regards, [1] https://www.postgresql.org/message-id/CAD21AoALgrU2sGWzgq%2B6G9X0ynqyVOjMR5_k4HgsGRWae1j%3DwQ%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jan 29, 2024 at 2:29 PM Masahiko Sawada wrote: > > > +/* > > > + * Calculate the slab blocksize so that we can allocate at least 32 > > > chunks > > > + * from the block. > > > + */ > > > +#define RT_SLAB_BLOCK_SIZE(size) \ > > > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32) > > > > > > The first parameter seems to be trying to make the block size exact, > > > but that's not right, because of the chunk header, and maybe > > > alignment. If the default block size is big enough to waste only a > > > tiny amount of space, let's just use that as-is. > If we use SLAB_DEFAULT_BLOCK_SIZE (8kB) for each node class, we waste > [snip] > We might want to calculate a better slab block size for node256 at least. I meant the macro could probably be Max(SLAB_DEFAULT_BLOCK_SIZE, (size) * N) (Right now N=32). I also realize I didn't answer your question earlier about block sizes being powers of two. I was talking about PG in general -- I was thinking all block sizes were powers of two. If that's true, I'm not sure if it's because programmers find the macro calculations easy to reason about, or if there was an implementation reason for it (e.g. libc behavior). 32*2088 bytes is about 65kB, or just above a power of two, so if we did round that up it would be 128kB. > > > + * TODO: The caller must be certain that no other backend will attempt to > > > + * access the TidStore before calling this function. Other backend must > > > + * explicitly call TidStoreDetach() to free up backend-local memory > > > associated > > > + * with the TidStore. The backend that calls TidStoreDestroy() must not > > > call > > > + * TidStoreDetach(). > > > > > > Do we need to do anything now? > > > > No, will remove it. > > > > I misunderstood something. I think the above statement is still true > but we don't need to do anything at this stage. It's a typical usage > that the leader destroys the shared data after confirming all workers > are detached. 
It's not a TODO but probably a NOTE. Okay.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Jan 26, 2024 at 11:05 PM Masahiko Sawada wrote: > > On Wed, Jan 24, 2024 at 3:42 PM John Naylor wrote: > > > > On Tue, Jan 23, 2024 at 10:58 AM Masahiko Sawada > > wrote: > > > > > > The new patches probably need to be polished but the VacDeadItemInfo > > > idea looks good to me. > > > > That idea looks good to me, too. Since you already likely know what > > you'd like to polish, I don't have much to say except for a few > > questions below. I also did a quick sweep through every patch, so some > > of these comments are unrelated to recent changes: > > Thank you! > > > > > +/* > > + * Calculate the slab blocksize so that we can allocate at least 32 chunks > > + * from the block. > > + */ > > +#define RT_SLAB_BLOCK_SIZE(size) \ > > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32) > > > > The first parameter seems to be trying to make the block size exact, > > but that's not right, because of the chunk header, and maybe > > alignment. If the default block size is big enough to waste only a > > tiny amount of space, let's just use that as-is. > > Agreed. > As of v55 patch, the sizes of each node class are: - node4: 40 bytes - node16_lo: 168 bytes - node16_hi: 296 bytes - node48: 784 bytes - node256: 2088 bytes If we use SLAB_DEFAULT_BLOCK_SIZE (8kB) for each node class, we waste (approximately): - node4: 32 bytes - node16_lo: 128 bytes - node16_hi: 200 bytes - node48: 352 bytes - node256: 1928 bytes We might want to calculate a better slab block size for node256 at least. > > > > + * TODO: The caller must be certain that no other backend will attempt to > > + * access the TidStore before calling this function. Other backend must > > + * explicitly call TidStoreDetach() to free up backend-local memory > > associated > > + * with the TidStore. The backend that calls TidStoreDestroy() must not > > call > > + * TidStoreDetach(). > > > > Do we need to do anything now? > > No, will remove it. > I misunderstood something. 
I think the above statement is still true but we don't need to do anything at this stage. It's a typical usage that the leader destroys the shared data after confirming all workers are detached. It's not a TODO but probably a NOTE. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Jan 24, 2024 at 3:42 PM John Naylor wrote: > > On Tue, Jan 23, 2024 at 10:58 AM Masahiko Sawada > wrote: > > > > The new patches probably need to be polished but the VacDeadItemInfo > > idea looks good to me. > > That idea looks good to me, too. Since you already likely know what > you'd like to polish, I don't have much to say except for a few > questions below. I also did a quick sweep through every patch, so some > of these comments are unrelated to recent changes: Thank you! > > v55-0003: > > +size_t > +dsa_get_total_size(dsa_area *area) > +{ > + size_t size; > + > + LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED); > + size = area->control->total_segment_size; > + LWLockRelease(DSA_AREA_LOCK(area)); > > I looked and found dsa.c doesn't already use shared locks in HEAD, > even dsa_dump. Not sure why that is... Oh, the dsa_dump part seems to be a bug. But it'll keep it consistent with others. > > +/* > + * Calculate the slab blocksize so that we can allocate at least 32 chunks > + * from the block. > + */ > +#define RT_SLAB_BLOCK_SIZE(size) \ > + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32) > > The first parameter seems to be trying to make the block size exact, > but that's not right, because of the chunk header, and maybe > alignment. If the default block size is big enough to waste only a > tiny amount of space, let's just use that as-is. Agreed. > Also, I think all > block sizes in the code base have been a power of two, but I'm not > sure how much that matters. Did you mean all slab block sizes we use in radixtree.h? > > +#ifdef RT_SHMEM > + fprintf(stderr, " [%d] chunk %x slot " DSA_POINTER_FORMAT "\n", > + i, n4->chunks[i], n4->children[i]); > +#else > + fprintf(stderr, " [%d] chunk %x slot %p\n", > + i, n4->chunks[i], n4->children[i]); > +#endif > > Maybe we could invent a child pointer format, so we only #ifdef in one place. WIll change. 
> > --- /dev/null > +++ b/src/test/modules/test_radixtree/meson.build > @@ -0,0 +1,35 @@ > +# FIXME: prevent install during main install, but not during test :/ > > Can you look into this? Okay, I'll look at it. > > test_radixtree.c: > > +/* > + * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO? > + */ > +static int rt_node_class_fanouts[] = { > + 4, /* RT_CLASS_3 */ > + 15, /* RT_CLASS_32_MIN */ > + 32, /* RT_CLASS_32_MAX */ > + 125, /* RT_CLASS_125 */ > + 256 /* RT_CLASS_256 */ > +}; > > These numbers have been wrong a long time, too, but only matters for > figuring out where it went wrong when something is broken. And for the > XXX, instead of trying to use the largest number that should fit (it's > obviously not testing that the expected node can actually hold that > number anyway), it seems we can just use a "big enough" number to > cause growing into the desired size class. > > As far as cleaning up the tests, I always wondered why these didn't > use EXPECT_TRUE, EXPECT_FALSE, etc. as in Andres's prototype where > where convenient, and leave comments above the tests. That seemed like > a good idea to me -- was there a reason to have hand-written branches > and elog messages everywhere? The current test is based on test_integerset. I agree that we can improve it by using EXPECT_TRUE etc. > > --- a/src/tools/pginclude/cpluspluscheck > +++ b/src/tools/pginclude/cpluspluscheck > @@ -101,6 +101,12 @@ do > test "$f" = src/include/nodes/nodetags.h && continue > test "$f" = src/backend/nodes/nodetags.h && continue > > + # radixtree_*_impl.h cannot be included standalone: they are just > code fragments. > + test "$f" = src/include/lib/radixtree_delete_impl.h && continue > + test "$f" = src/include/lib/radixtree_insert_impl.h && continue > + test "$f" = src/include/lib/radixtree_iter_impl.h && continue > + test "$f" = src/include/lib/radixtree_search_impl.h && continue > > Ha! I'd forgotten about these -- they're long outdated. Will remove. 
> > v55-0005: > > - * The radix tree is locked in shared mode during the iteration, so > - * RT_END_ITERATE needs to be called when finished to release the lock. > + * The caller needs to acquire a lock in shared mode during the iteration > + * if necessary. > > "need if necessary" is maybe better phrased as "is the caller's > responsibility" Will fix. > > + /* > + * We can rely on DSA_AREA_LOCK to get the total amount of DSA memory. > + */ > total = dsa_get_total_size(tree->dsa); > > Maybe better to have a header comment for RT_MEMORY_USAGE that the > caller doesn't need to take a lock. Will fix. > > v55-0006: > > "WIP: Not built, since some benchmarks have broken" -- I'll work on > this when I re-run some benchmarks. Thanks! > > v55-0007: > > + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, > + * and stored in the radix tree. > > This hasn't been true for a few months now, and I thought we fixed > this in some earlier version? Yeah, I'll fix it. > > + * TODO: The caller must be certain that no other backend will attempt to >
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Jan 23, 2024 at 10:58 AM Masahiko Sawada wrote: > > The new patches probably need to be polished but the VacDeadItemInfo > idea looks good to me. That idea looks good to me, too. Since you already likely know what you'd like to polish, I don't have much to say except for a few questions below. I also did a quick sweep through every patch, so some of these comments are unrelated to recent changes: v55-0003: +size_t +dsa_get_total_size(dsa_area *area) +{ + size_t size; + + LWLockAcquire(DSA_AREA_LOCK(area), LW_SHARED); + size = area->control->total_segment_size; + LWLockRelease(DSA_AREA_LOCK(area)); I looked and found dsa.c doesn't already use shared locks in HEAD, even dsa_dump. Not sure why that is... +/* + * Calculate the slab blocksize so that we can allocate at least 32 chunks + * from the block. + */ +#define RT_SLAB_BLOCK_SIZE(size) \ + Max((SLAB_DEFAULT_BLOCK_SIZE / (size)) * (size), (size) * 32) The first parameter seems to be trying to make the block size exact, but that's not right, because of the chunk header, and maybe alignment. If the default block size is big enough to waste only a tiny amount of space, let's just use that as-is. Also, I think all block sizes in the code base have been a power of two, but I'm not sure how much that matters. +#ifdef RT_SHMEM + fprintf(stderr, " [%d] chunk %x slot " DSA_POINTER_FORMAT "\n", + i, n4->chunks[i], n4->children[i]); +#else + fprintf(stderr, " [%d] chunk %x slot %p\n", + i, n4->chunks[i], n4->children[i]); +#endif Maybe we could invent a child pointer format, so we only #ifdef in one place. --- /dev/null +++ b/src/test/modules/test_radixtree/meson.build @@ -0,0 +1,35 @@ +# FIXME: prevent install during main install, but not during test :/ Can you look into this? test_radixtree.c: +/* + * XXX: should we expose and use RT_SIZE_CLASS and RT_SIZE_CLASS_INFO? 
+ */ +static int rt_node_class_fanouts[] = { + 4, /* RT_CLASS_3 */ + 15, /* RT_CLASS_32_MIN */ + 32, /* RT_CLASS_32_MAX */ + 125, /* RT_CLASS_125 */ + 256 /* RT_CLASS_256 */ +}; These numbers have been wrong a long time, too, but only matters for figuring out where it went wrong when something is broken. And for the XXX, instead of trying to use the largest number that should fit (it's obviously not testing that the expected node can actually hold that number anyway), it seems we can just use a "big enough" number to cause growing into the desired size class. As far as cleaning up the tests, I always wondered why these didn't use EXPECT_TRUE, EXPECT_FALSE, etc. as in Andres's prototype where where convenient, and leave comments above the tests. That seemed like a good idea to me -- was there a reason to have hand-written branches and elog messages everywhere? --- a/src/tools/pginclude/cpluspluscheck +++ b/src/tools/pginclude/cpluspluscheck @@ -101,6 +101,12 @@ do test "$f" = src/include/nodes/nodetags.h && continue test "$f" = src/backend/nodes/nodetags.h && continue + # radixtree_*_impl.h cannot be included standalone: they are just code fragments. + test "$f" = src/include/lib/radixtree_delete_impl.h && continue + test "$f" = src/include/lib/radixtree_insert_impl.h && continue + test "$f" = src/include/lib/radixtree_iter_impl.h && continue + test "$f" = src/include/lib/radixtree_search_impl.h && continue Ha! I'd forgotten about these -- they're long outdated. v55-0005: - * The radix tree is locked in shared mode during the iteration, so - * RT_END_ITERATE needs to be called when finished to release the lock. + * The caller needs to acquire a lock in shared mode during the iteration + * if necessary. "need if necessary" is maybe better phrased as "is the caller's responsibility" + /* + * We can rely on DSA_AREA_LOCK to get the total amount of DSA memory. 
+ */ total = dsa_get_total_size(tree->dsa); Maybe better to have a header comment for RT_MEMORY_USAGE that the caller doesn't need to take a lock. v55-0006: "WIP: Not built, since some benchmarks have broken" -- I'll work on this when I re-run some benchmarks. v55-0007: + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, + * and stored in the radix tree. This hasn't been true for a few months now, and I thought we fixed this in some earlier version? + * TODO: The caller must be certain that no other backend will attempt to + * access the TidStore before calling this function. Other backend must + * explicitly call TidStoreDetach() to free up backend-local memory associated + * with the TidStore. The backend that calls TidStoreDestroy() must not call + * TidStoreDetach(). Do we need to do anything now? v55-0008: -TidStoreAttach(dsa_area *area, TidStoreHandle handle) +TidStoreAttach(dsa_area *area, dsa_pointer rt_dp) "handle" seemed like a fine name. Is that not the case anymore? The new one is kind of cryptic. The commit message just says "remove control object" -- does that imply that we need to think of this parameter differently, or is it unrelated? (Same with
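The power-of-two block-size point from the v55-0003 review above can be illustrated with a small sketch. This is not the RT_SLAB_BLOCK_SIZE macro from the patch: the helper name and the local SLAB_DEFAULT_BLOCK_SIZE definition are stand-ins for illustration only.

```c
#include <stddef.h>

/* Local stand-in for PostgreSQL's default slab block size (8 kB). */
#define SLAB_DEFAULT_BLOCK_SIZE ((size_t) 8 * 1024)

/*
 * Hypothetical block-size calculation: start from the default block
 * size and double until at least 32 chunks of the given size fit, so
 * the result is always a power of two (assuming the default is one).
 * This sketches the "keep block sizes a power of two" idea discussed
 * above, rather than trying to make the block size exact.
 */
static size_t
rt_slab_block_size(size_t chunk_size)
{
	size_t		need = chunk_size * 32; /* want at least 32 chunks */
	size_t		block = SLAB_DEFAULT_BLOCK_SIZE;

	while (block < need)
		block *= 2;

	return block;
}
```

With an 8 kB default, small chunk sizes simply get the default block, and only chunks larger than 256 bytes force the block size to grow.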
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Jan 23, 2024 at 12:58 PM Masahiko Sawada wrote: > > On Mon, Jan 22, 2024 at 5:18 PM John Naylor wrote: > > > > On Mon, Jan 22, 2024 at 2:24 PM Masahiko Sawada > > wrote: > > > > > > For the next version patch, I'll work on this idea and try to clean up > > > locking stuff both in tidstore and radix tree. Or if you're already > > > working on some of them, please let me know. I'll review it. > > > > Okay go ahead, sounds good. I plan to look at the tests since they > > haven't been looked at in a while. > > I've attached the latest patch set. Here are updates from v54 patch: > > 0005 - Expose radix tree lock functions and remove all locks taken > internally in radixtree.h. > 0008 - Remove tidstore's control object. > 0009 - Add tidstore lock functions. > 0011 - Add VacDeadItemsInfo to store "max_bytes" and "num_items" > separate from TidStore. Also make lazy vacuum and parallel vacuum use > it. John pointed out offlist the tarball includes only the patches up to 0009. I've attached the correct one. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v55-ART.tar.gz Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jan 22, 2024 at 5:18 PM John Naylor wrote: > > On Mon, Jan 22, 2024 at 2:24 PM Masahiko Sawada wrote: > > > > For the next version patch, I'll work on this idea and try to clean up > > locking stuff both in tidstore and radix tree. Or if you're already > > working on some of them, please let me know. I'll review it. > > Okay go ahead, sounds good. I plan to look at the tests since they > haven't been looked at in a while. I've attached the latest patch set. Here are updates from v54 patch: 0005 - Expose radix tree lock functions and remove all locks taken internally in radixtree.h. 0008 - Remove tidstore's control object. 0009 - Add tidstore lock functions. 0011 - Add VacDeadItemsInfo to store "max_bytes" and "num_items" separate from TidStore. Also make lazy vacuum and parallel vacuum use it. The new patches probably need to be polished but the VacDeadItemInfo idea looks good to me. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v55-ART.tar.gz Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jan 22, 2024 at 2:24 PM Masahiko Sawada wrote: > > For the next version patch, I'll work on this idea and try to clean up > locking stuff both in tidstore and radix tree. Or if you're already > working on some of them, please let me know. I'll review it. Okay go ahead, sounds good. I plan to look at the tests since they haven't been looked at in a while.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jan 22, 2024 at 2:36 PM John Naylor wrote: > > On Mon, Jan 22, 2024 at 10:28 AM Masahiko Sawada > wrote: > > > > On further thought, as you pointed out before, "num_tids" should not > > be in tidstore in terms of integration with tidbitmap.c, because > > tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no > > longer accurate and useful. Similarly, looking at tidbitmap.c, it has > > npages and nchunks but they will not be necessary in lazy vacuum use > > case. Also, assuming that we support parallel heap pruning, probably > > we need to somehow lock the tidstore while adding tids to the tidstore > > concurrently by parallel vacuum worker. But in tidbitmap use case, we > > don't need to lock the tidstore since it doesn't have multiple > > writers. > > Not currently, and it does seem bad to require locking where it's not > required. > > (That would be a prerequisite for parallel index scan. It's been tried > before with the hash table, but concurrency didn't scale well with the > hash table. I have no reason to think that the radix tree would scale > significantly better with the same global LW lock, but as you know > there are other locking schemes possible.) > > > Given these facts, different statistics and different lock > > strategies are required by different use case. So I think there are 3 > > options: > > > > 1. expose lock functions for tidstore and the caller manages the > > statistics in the outside of tidstore. For example, in lazyvacuum.c we > > would have a TidStore for tid storage as well as VacDeadItemsInfo that > > has num_tids and max_bytes. Both are in LVRelState. For parallel > > vacuum, we pass both to the workers via DSM and pass both to function > > where the statistics are required. As for the exposed lock functions, > > when adding tids to the tidstore, the caller would need to call > > something like TidStoreLockExclusive(ts) that further calls > > LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally. 
> > The advantage here is that vacuum can avoid locking entirely while > using shared memory, just like it does now, and has the option to add > it later. True. > IIUC, the radix tree struct would have a lock member, but wouldn't > take any locks internally? Maybe we still need one for > RT_MEMORY_USAGE? For that, I see dsa_get_total_size() takes its own > DSA_AREA_LOCK -- maybe that's enough? I think that's a good point. So there will be no place where the radix tree takes any locks internally. > > That seems simplest, and is not very far from what we do now. If we do > this, then the lock functions should be where we branch for is_shared. Agreed. > > > 2. add callback functions to tidstore so that the caller can do its > > work while holding a lock on the tidstore. This is like the idea we > > just discussed for radix tree. The caller passes a callback function > > and user data to TidStoreSetBlockOffsets(), and the callback is called > > after setting tids. Similar to option 1, the statistics need to be > > stored in a different area. > > I think we'll have to move to something like this eventually, but it > seems like overkill right now. Right. > > > 3. keep tidstore.c and tidbitmap.c separate implementations but use > > radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its > > control object and doesn't have any lossy page support. On the other > > hand, in tidbitmap.c we replace simplehash with radix tree. This makes > > tidstore.c simple but we would end up having different data structures > > for similar usage. > > They have so much in common that it's worth it to use the same > interface and (eventually) value type. They just need separate paths > for adding tids, as we've discussed. Agreed. > > > I think it's worth trying option 1. What do you think, John? > > +1 Thanks! Before working on this idea, since the latest patches conflict with the current HEAD, I share the latest patch set (v54). 
Here is the summary: - As for the radix tree part, it's based on the v53 patch. I've squashed most of the cleanups and changes in v53 except for "DRAFT: Stop using invalid pointers as placeholders." as I thought you might want to still work on it. BTW it includes "#undef RT_SHMEM". - As for tidstore, it's based on v51. That is, it still has the control object and num_tids there. - As for vacuum integration, it's also based on v51. But we no longer need to change has_lpdead_items and LVPagePruneState thanks to the recent commits c120550edb8 and e313a61137. For the next version patch, I'll work on this idea and try to clean up locking stuff both in tidstore and radix tree. Or if you're already working on some of them, please let me know. I'll review it. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v54-ART.tar.gz Description: GNU Zip compressed data
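Option 1 discussed above (exposed lock functions plus caller-managed statistics) can be sketched in miniature. The struct and function names below are invented for illustration, and a simple depth counter stands in for LWLockAcquire/LWLockRelease; this is not the actual patch API.

```c
#include <stdint.h>

/* Toy store: exposes lock functions but keeps no statistics itself. */
typedef struct TidStoreSketch
{
	int			lock_depth;		/* stand-in for an LWLock in shared memory */
	/* ... radix tree data would live here ... */
} TidStoreSketch;

/* Caller-side statistics kept outside the store, like VacDeadItemsInfo. */
typedef struct DeadItemsInfoSketch
{
	uint64_t	num_items;		/* maintained by the caller, not the store */
	uint64_t	max_bytes;
} DeadItemsInfoSketch;

static void
tidstore_lock_exclusive(TidStoreSketch *ts)
{
	ts->lock_depth++;			/* would be LWLockAcquire(..., LW_EXCLUSIVE) */
}

static void
tidstore_unlock(TidStoreSketch *ts)
{
	ts->lock_depth--;			/* would be LWLockRelease(...) */
}

/* The caller wraps the insert and its own bookkeeping in one critical section. */
static void
caller_add_tids(TidStoreSketch *ts, DeadItemsInfoSketch *info, int ntids)
{
	tidstore_lock_exclusive(ts);
	/* ... TidStoreSetBlockOffsets()-style insertion would happen here ... */
	info->num_items += ntids;
	tidstore_unlock(ts);
}
```

The key property is that num_items lives next to max_bytes in the caller's struct (e.g. in LVRelState, shared with parallel workers via DSM), so the store itself never needs to maintain a count.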
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Jan 17, 2024 at 12:32 PM John Naylor wrote: > > I wrote: > > > > Hmm, I wonder if that's a side-effect of the "create" functions doing > > > their own allocations and returning a pointer. Would it be less tricky > > > if the structs were declared where we need them and passed to "init" > > > functions? > > If this is a possibility, I thought I'd first send the last (I hope) > large-ish set of radix tree cleanups to avoid rebasing issues. I'm not > including tidstore/vacuum here, because recent discussion has some > up-in-the-air work. Thank you for updating the patches! These updates look good to me. > > Should be self-explanatory, but some things are worth calling out: > 0012 and 0013: Some time ago I started passing insertpos as a > parameter, but now see that is not ideal -- when growing from node16 > to node48 we don't need it at all, so it's a wasted calculation. While > reverting that, I found that this also allows passing constants in > some cases. > 0014 makes a cleaner separation between adding a child and growing a > node, resulting in more compact-looking functions. > 0019 is a bit unpolished, but I realized that it's pointless to assign > a zero child when further up the call stack we overwrite it anyway > with the actual value. With this, that assignment is skipped. This > makes some comments and names strange, so needs a bit of polish, but > wanted to get it out there anyway. Cool. I'll merge these patches in the next version v54 patch set. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jan 22, 2024 at 10:28 AM Masahiko Sawada wrote: > > On further thought, as you pointed out before, "num_tids" should not > be in tidstore in terms of integration with tidbitmap.c, because > tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no > longer accurate and useful. Similarly, looking at tidbitmap.c, it has > npages and nchunks but they will not be necessary in lazy vacuum use > case. Also, assuming that we support parallel heap pruning, probably > we need to somehow lock the tidstore while adding tids to the tidstore > concurrently by parallel vacuum worker. But in tidbitmap use case, we > don't need to lock the tidstore since it doesn't have multiple > writers. Not currently, and it does seem bad to require locking where it's not required. (That would be a prerequisite for parallel index scan. It's been tried before with the hash table, but concurrency didn't scale well with the hash table. I have no reason to think that the radix tree would scale significantly better with the same global LW lock, but as you know there are other locking schemes possible.) > Given these facts, different statistics and different lock > strategies are required by different use case. So I think there are 3 > options: > > 1. expose lock functions for tidstore and the caller manages the > statistics in the outside of tidstore. For example, in lazyvacuum.c we > would have a TidStore for tid storage as well as VacDeadItemsInfo that > has num_tids and max_bytes. Both are in LVRelState. For parallel > vacuum, we pass both to the workers via DSM and pass both to function > where the statistics are required. As for the exposed lock functions, > when adding tids to the tidstore, the caller would need to call > something like TidStoreLockExclusive(ts) that further calls > LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally. 
The advantage here is that vacuum can avoid locking entirely while using shared memory, just like it does now, and has the option to add it later. IIUC, the radix tree struct would have a lock member, but wouldn't take any locks internally? Maybe we still need one for RT_MEMORY_USAGE? For that, I see dsa_get_total_size() takes its own DSA_AREA_LOCK -- maybe that's enough? That seems simplest, and is not very far from what we do now. If we do this, then the lock functions should be where we branch for is_shared. > 2. add callback functions to tidstore so that the caller can do its > work while holding a lock on the tidstore. This is like the idea we > just discussed for radix tree. The caller passes a callback function > and user data to TidStoreSetBlockOffsets(), and the callback is called > after setting tids. Similar to option 1, the statistics need to be > stored in a different area. I think we'll have to move to something like this eventually, but it seems like overkill right now. > 3. keep tidstore.c and tidbitmap.c separate implementations but use > radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its > control object and doesn't have any lossy page support. On the other > hand, in tidbitmap.c we replace simplehash with radix tree. This makes > tidstore.c simple but we would end up having different data structures > for similar usage. They have so much in common that it's worth it to use the same interface and (eventually) value type. They just need separate paths for adding tids, as we've discussed. > I think it's worth trying option 1. What do you think, John? +1
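The point above that "the lock functions should be where we branch for is_shared" can be sketched as follows. A counter stands in for a real LWLock so the behavior is observable, and all names are invented for illustration.

```c
#include <stdbool.h>

/*
 * Sketch: branching on is_shared inside the exposed lock function means
 * a local (single-backend) store pays no locking cost at all, as vacuum
 * does today, while a shared store takes the (stand-in) lock.
 */
typedef struct StoreSketch
{
	bool		is_shared;
	long		nlock_calls;	/* observable stand-in for LWLockAcquire */
} StoreSketch;

static void
store_lock_exclusive(StoreSketch *s)
{
	if (s->is_shared)
		s->nlock_calls++;		/* would be LWLockAcquire(..., LW_EXCLUSIVE) */
	/* local store: no locking needed, there is only one writer */
}
```

This keeps the is_shared branch in one place rather than scattering it through every tidstore operation.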
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Jan 19, 2024 at 6:48 PM John Naylor wrote: > > On Fri, Jan 19, 2024 at 2:26 PM Masahiko Sawada wrote: > > > > On Thu, Jan 18, 2024 at 1:30 PM John Naylor wrote: > > > I'm not quite sure what the point of "num_items" is anymore, because > > > it was really tied to the array in VacDeadItems. dead_items->num_items > > > is essential to reading/writing the array correctly. If this number is > > > wrong, the array is corrupt. There is no such requirement for the > > > radix tree. We don't need to know the number of tids to add to it or > > > do a lookup, or anything. > > > > True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking > > we need to have the number of TIDs in a tidstore, especially in the > > tidstore's control object. > > Hmm, it would be kind of sad to require explicit locking in tidstore.c > is only for maintaining that one number at all times. Aside from the > two ereports after an index scan / second heap pass, the only > non-assert place where it's used is > > @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel) > * Do index vacuuming (call each index's ambulkdelete routine), then do > * related heap vacuuming > */ > - if (dead_items->num_items > 0) > + if (TidStoreNumTids(dead_items) > 0) > lazy_vacuum(vacrel); > > ...and that condition can be checked by doing a single step of > iteration to see if it shows anything. But for the ereport, my idea > for iteration + popcount is probably quite slow. Right. On further thought, as you pointed out before, "num_tids" should not be in tidstore in terms of integration with tidbitmap.c, because tidbitmap.c has "lossy pages". With lossy pages, "num_tids" is no longer accurate and useful. Similarly, looking at tidbitmap.c, it has npages and nchunks but they will not be necessary in lazy vacuum use case. Also, assuming that we support parallel heap pruning, probably we need to somehow lock the tidstore while adding tids to the tidstore concurrently by parallel vacuum worker. 
But in the tidbitmap use case, we don't need to lock the tidstore since it doesn't have multiple writers. Given these facts, different statistics and different lock strategies are required by different use cases. So I think there are 3 options: 1. expose lock functions for tidstore and the caller manages the statistics outside of the tidstore. For example, in lazyvacuum.c we would have a TidStore for tid storage as well as VacDeadItemsInfo that has num_tids and max_bytes. Both are in LVRelState. For parallel vacuum, we pass both to the workers via DSM and pass both to the functions where the statistics are required. As for the exposed lock functions, when adding tids to the tidstore, the caller would need to call something like TidStoreLockExclusive(ts) that further calls LWLockAcquire(ts->tree.shared->ctl.lock, LW_EXCLUSIVE) internally. 2. add callback functions to tidstore so that the caller can do its work while holding a lock on the tidstore. This is like the idea we just discussed for the radix tree. The caller passes a callback function and user data to TidStoreSetBlockOffsets(), and the callback is called after setting tids. Similar to option 1, the statistics need to be stored in a different area. 3. keep tidstore.c and tidbitmap.c as separate implementations but use the radix tree in tidbitmap.c. tidstore.c would have "num_tids" in its control object and doesn't have any lossy page support. On the other hand, in tidbitmap.c we replace simplehash with the radix tree. This makes tidstore.c simple but we would end up having different data structures for similar usage. I think it's worth trying option 1. What do you think, John? > > > IIUC lpdead_items is the total number of LP_DEAD items vacuumed during > > the whole lazy vacuum operation whereas num_items is the number of > > LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle. > > That is, after heap vacuum, the latter counter is reset while the > > former counter is not.
> > > > The latter counter is used in lazyvacuum.c as well as the ereport in > > vac_bulkdel_one_index(). > > Ah, of course. > > > Putting a copy of the key in BlocktableEntry's header is an > > interesting idea. But the current debug code in the tidstore also > > makes sure that the tidstore returns TIDs in the correct order during > > an iterate operation. I think it still has a value and you can disable > > it by removing the "#define TIDSTORE_DEBUG" line. > > Fair enough. I just thought it'd be less work to leave this out in > case we change how locking is called. > > > > This week I tried an idea to use a callback there so that after > > > internal unlocking, the caller received the value (or whatever else > > > needs to happen, such as lookup an offset in the tid bitmap). I've > > > attached a draft for that that passes radix tree tests. It's a bit > > > awkward, but I'm guessing this would more closely match future > > > internal atomic locking. Let me know what you think of the concept, > > > and then do
Re: [PoC] Improve dead tuple storage for lazy vacuum
I wrote: > On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada wrote: > > During trying this idea, I realized that there is a visibility problem > > in the radix tree template > > If it's broken even without the embedding I'll look into this (I don't > know if this configuration has ever been tested). I think a good test > is putting the shared tid tree in it's own translation unit, to see if > anything needs to be fixed. I'll go try that. Here's a quick test that this works. The only thing that really needed fixing in the template was failure to un-define one symbol. The rest was just moving some things around. v51-addendum-Put-shared-radix-tree-for-tidstore-in-its-own-tr.patch.nocfbot Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Jan 19, 2024 at 2:26 PM Masahiko Sawada wrote: > > On Thu, Jan 18, 2024 at 1:30 PM John Naylor wrote: > > I'm not quite sure what the point of "num_items" is anymore, because > > it was really tied to the array in VacDeadItems. dead_items->num_items > > is essential to reading/writing the array correctly. If this number is > > wrong, the array is corrupt. There is no such requirement for the > > radix tree. We don't need to know the number of tids to add to it or > > do a lookup, or anything. > > True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking > we need to have the number of TIDs in a tidstore, especially in the > tidstore's control object. Hmm, it would be kind of sad to require explicit locking in tidstore.c if it's only for maintaining that one number at all times. Aside from the two ereports after an index scan / second heap pass, the only non-assert place where it's used is @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel) * Do index vacuuming (call each index's ambulkdelete routine), then do * related heap vacuuming */ - if (dead_items->num_items > 0) + if (TidStoreNumTids(dead_items) > 0) lazy_vacuum(vacrel); ...and that condition can be checked by doing a single step of iteration to see if it shows anything. But for the ereport, my idea for iteration + popcount is probably quite slow. > IIUC lpdead_items is the total number of LP_DEAD items vacuumed during > the whole lazy vacuum operation whereas num_items is the number of > LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle. > That is, after heap vacuum, the latter counter is reset while the > former counter is not. > > The latter counter is used in lazyvacuum.c as well as the ereport in > vac_bulkdel_one_index(). Ah, of course. > Putting a copy of the key in BlocktableEntry's header is an > interesting idea. But the current debug code in the tidstore also > makes sure that the tidstore returns TIDs in the correct order during > an iterate operation.
I think it still has a value and you can disable > it by removing the "#define TIDSTORE_DEBUG" line. Fair enough. I just thought it'd be less work to leave this out in case we change how locking is called. > > This week I tried an idea to use a callback there so that after > > internal unlocking, the caller received the value (or whatever else > > needs to happen, such as lookup an offset in the tid bitmap). I've > > attached a draft for that that passes radix tree tests. It's a bit > > awkward, but I'm guessing this would more closely match future > > internal atomic locking. Let me know what you think of the concept, > > and then do whichever way you think is best. (using v53 as the basis) > > Thank you for verifying this idea! Interesting. While it's promising > in terms of future atomic locking, I'm concerned it might not be easy > to use if radix tree APIs supports only such callback style. Yeah, it's quite awkward. It could be helped by only exposing it for varlen types. For simply returning "present or not" (used a lot in the regression tests), we could skip the callback if the data is null. That is all also extra stuff. > I believe > the caller would like to pass one more data along with val_data. For That's trivial, however, if I understand you correctly. With "void *", a callback can receive anything, including a struct containing additional pointers to elsewhere. > example, considering tidstore that has num_tids internally, it wants > to pass both a pointer to BlocktableEntry and a pointer to TidStore > itself so that it increments the counter while holding a lock. Hmm, so a callback to RT_SET also. That's interesting! Anyway, I agree it needs to be simple, since the first use doesn't even have multiple writers. > BTW in radixtree.h pg_attribute_unused() is used for some functions, > but is it for debugging purposes? I don't see why it's used only for > some functions. It was there to silence warnings about unused functions. 
I only see one remaining, and it's already behind a debug symbol, so we might not need this attribute anymore.
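The callback-style find discussed above, where the value is handed to the caller while the (hypothetical) internal lock is still held rather than returning a pointer that could be dereferenced after unlock, can be sketched like this. The toy array, function names, and callback signature are all invented for illustration; they are not the draft patch's API.

```c
#include <stdbool.h>
#include <stddef.h>

/* A toy "tree": a fixed array standing in for the radix tree's leaves. */
#define NKEYS 4
static int	toy_values[NKEYS] = {10, 20, 30, 40};

typedef void (*rt_found_cb) (void *value, void *caller_data);

/*
 * Callback-style find: invoke the callback while still "locked", so no
 * pointer into the tree escapes the critical section.  Returns true if
 * the key was found.
 */
static bool
toy_find(int key, rt_found_cb cb, void *caller_data)
{
	if (key < 0 || key >= NKEYS)
		return false;			/* lock would not even be needed here */
	/* internal lock would be taken here */
	if (cb != NULL)
		cb(&toy_values[key], caller_data);
	/* ... and released here, after the caller is done with the value */
	return true;
}

/* Example callback: copy the value out, as a tidstore-like caller might. */
static void
copy_out(void *value, void *caller_data)
{
	*(int *) caller_data = *(int *) value;
}
```

Since caller_data is a void pointer, a caller can pass a struct bundling several things (e.g. a BlocktableEntry pointer plus the store itself), which addresses the "one more data along with val_data" concern above.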
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Jan 18, 2024 at 1:30 PM John Naylor wrote: > > On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada wrote: > > It seems we cannot make this work nicely. IIUC VacDeadItems is > > allocated in DSM and TidStore is embedded there. However, > > dead_items->items.area is a local pointer to dsa_area. So we cannot > > include dsa_area in neither TidStore nor RT_RADIX_TREE. Instead we > > would need to pass dsa_area to each interface by callers. > > Thanks again for exploring this line of thinking! Okay, it seems even > if there's a way to make this work, it would be too invasive to > justify when compared with the advantage I was hoping for. > > > > If we can't make this work nicely, I'd be okay with keeping the tid > > > store control object. My biggest concern is unnecessary > > > double-locking. > > > > If we don't do any locking stuff in radix tree APIs and it's the > > user's responsibility at all, probably we don't need a lock for > > tidstore? That is, we expose lock functions as you mentioned and the > > user (like tidstore) acquires/releases the lock before/after accessing > > the radix tree and num_items. > > I'm not quite sure what the point of "num_items" is anymore, because > it was really tied to the array in VacDeadItems. dead_items->num_items > is essential to reading/writing the array correctly. If this number is > wrong, the array is corrupt. There is no such requirement for the > radix tree. We don't need to know the number of tids to add to it or > do a lookup, or anything. True. Sorry I wanted to say "num_tids" of TidStore. I'm still thinking we need to have the number of TIDs in a tidstore, especially in the tidstore's control object. 
> > There are a number of places where we assert "the running count of the > dead items" is the same as "the length of the dead items array", like > here: > > @@ -2214,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel) > BlockNumber threshold; > > Assert(vacrel->num_index_scans == 0); > - Assert(vacrel->lpdead_items == vacrel->dead_items->num_items); > + Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items)); > > As such, in HEAD I'm guessing it's arbitrary which one is used for > control flow. Correct me if I'm mistaken. If I am wrong for some part > of the code, it'd be good to understand when that invariant can't be > maintained. > > @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel) > * Do index vacuuming (call each index's ambulkdelete routine), then do > * related heap vacuuming > */ > - if (dead_items->num_items > 0) > + if (TidStoreNumTids(dead_items) > 0) > lazy_vacuum(vacrel); > > Like here. In HEAD, could this have used vacrel->dead_items? > > @@ -2479,14 +2473,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) > * We set all LP_DEAD items from the first heap pass to LP_UNUSED during > * the second heap pass. No more, no less. > */ > - Assert(index > 0); > Assert(vacrel->num_index_scans > 1 || > -(index == vacrel->lpdead_items && > +(TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items && > vacuumed_pages == vacrel->lpdead_item_pages)); > > ereport(DEBUG2, > - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", > - vacrel->relname, (long long) index, vacuumed_pages))); > + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers > in %u pages", > + vacrel->relname, TidStoreNumTids(vacrel->dead_items), > + vacuumed_pages))); > > We assert that vacrel->lpdead_items has the expected value, and then > the ereport repeats the function call (with a lock) to read the value > we just consulted to pass the assert. 
> > If we *really* want to compare counts, maybe we could invent a > debugging-only function that iterates over the tree and popcounts the > bitmaps. That seems too expensive for regular assert builds, though. IIUC lpdead_items is the total number of LP_DEAD items vacuumed during the whole lazy vacuum operation whereas num_items is the number of LP_DEAD items vacuumed within one index vacuum and heap vacuum cycle. That is, after heap vacuum, the latter counter is reset while the former counter is not. The latter counter is used in lazyvacuum.c as well as the ereport in vac_bulkdel_one_index(). > > On the subject of debugging builds, I think it no longer makes sense > to have the array for debug checking in tid store, even during > development. A few months ago, we had an encoding scheme that looked > simple on paper, but its code was fiendishly difficult to follow (at > least for me). That's gone. In addition to the debugging count above, > we could also put a copy of the key in the BlockTableEntry's header, > in debug builds. We don't yet need to care about the key size, since > we don't (yet) have runtime-embeddable values. Putting a copy of the key in BlocktableEntry's header is an interesting idea. But the current debug code in the tidstore also makes sure that the tidstore returns TIDs in the correct order during an iterate operation. I think it still
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Jan 18, 2024 at 8:31 AM Masahiko Sawada wrote: > It seems we cannot make this work nicely. IIUC VacDeadItems is > allocated in DSM and TidStore is embedded there. However, > dead_items->items.area is a local pointer to dsa_area. So we cannot > include dsa_area in neither TidStore nor RT_RADIX_TREE. Instead we > would need to pass dsa_area to each interface by callers. Thanks again for exploring this line of thinking! Okay, it seems even if there's a way to make this work, it would be too invasive to justify when compared with the advantage I was hoping for. > > If we can't make this work nicely, I'd be okay with keeping the tid > > store control object. My biggest concern is unnecessary > > double-locking. > > If we don't do any locking stuff in radix tree APIs and it's the > user's responsibility at all, probably we don't need a lock for > tidstore? That is, we expose lock functions as you mentioned and the > user (like tidstore) acquires/releases the lock before/after accessing > the radix tree and num_items. I'm not quite sure what the point of "num_items" is anymore, because it was really tied to the array in VacDeadItems. dead_items->num_items is essential to reading/writing the array correctly. If this number is wrong, the array is corrupt. There is no such requirement for the radix tree. We don't need to know the number of tids to add to it or do a lookup, or anything. There are a number of places where we assert "the running count of the dead items" is the same as "the length of the dead items array", like here: @@ -2214,7 +2205,7 @@ lazy_vacuum(LVRelState *vacrel) BlockNumber threshold; Assert(vacrel->num_index_scans == 0); - Assert(vacrel->lpdead_items == vacrel->dead_items->num_items); + Assert(vacrel->lpdead_items == TidStoreNumTids(vacrel->dead_items)); As such, in HEAD I'm guessing it's arbitrary which one is used for control flow. Correct me if I'm mistaken. 
If I am wrong for some part of the code, it'd be good to understand when that invariant can't be maintained. @@ -1258,7 +1265,7 @@ lazy_scan_heap(LVRelState *vacrel) * Do index vacuuming (call each index's ambulkdelete routine), then do * related heap vacuuming */ - if (dead_items->num_items > 0) + if (TidStoreNumTids(dead_items) > 0) lazy_vacuum(vacrel); Like here. In HEAD, could this have used vacrel->dead_items? @@ -2479,14 +2473,14 @@ lazy_vacuum_heap_rel(LVRelState *vacrel) * We set all LP_DEAD items from the first heap pass to LP_UNUSED during * the second heap pass. No more, no less. */ - Assert(index > 0); Assert(vacrel->num_index_scans > 1 || -(index == vacrel->lpdead_items && +(TidStoreNumTids(vacrel->dead_items) == vacrel->lpdead_items && vacuumed_pages == vacrel->lpdead_item_pages)); ereport(DEBUG2, - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", - vacrel->relname, (long long) index, vacuumed_pages))); + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers in %u pages", + vacrel->relname, TidStoreNumTids(vacrel->dead_items), + vacuumed_pages))); We assert that vacrel->lpdead_items has the expected value, and then the ereport repeats the function call (with a lock) to read the value we just consulted to pass the assert. If we *really* want to compare counts, maybe we could invent a debugging-only function that iterates over the tree and popcounts the bitmaps. That seems too expensive for regular assert builds, though. On the subject of debugging builds, I think it no longer makes sense to have the array for debug checking in tid store, even during development. A few months ago, we had an encoding scheme that looked simple on paper, but its code was fiendishly difficult to follow (at least for me). That's gone. In addition to the debugging count above, we could also put a copy of the key in the BlockTableEntry's header, in debug builds. 
We don't yet need to care about the key size, since we don't (yet) have runtime-embeddable values.

> Currently (as of v52 patch) RT_FIND is
> doing so,

[meaning, there is no internal "automatic" locking here; ever since we switched to variable-length types, this has been an outstanding TODO]

Maybe it's okay to expose global locking for v17, but I have one possible alternative: this week I tried an idea to use a callback there, so that by the time of internal unlocking, the caller has received the value (or whatever else needs to happen, such as looking up an offset in the tid bitmap). I've attached a draft for that which passes the radix tree tests. It's a bit awkward, but I'm guessing this would more closely match future internal atomic locking. Let me know what you think of the concept, and then do whichever way you think is best. (The draft uses v53 as the basis.)

I believe this is the only open question remaining. The rest is just polish and testing.

> During trying this idea, I realized that there is a visibility problem
> in the radix tree template

If it's broken even without the embedding I'll look into this (I don't know if this
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Jan 17, 2024 at 11:37 AM John Naylor wrote: > > On Wed, Jan 17, 2024 at 8:39 AM Masahiko Sawada wrote: > > > > On Wed, Jan 17, 2024 at 9:20 AM John Naylor wrote: > > > > > > On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada > > > wrote: > > > > Just changing "items" to be the local tidstore struct could make the > > > > code tricky a bit, since max_bytes and num_items are on the shared > > > > memory while "items" is a local pointer to the shared tidstore. > > > > > > Thanks for trying it this way! I like the overall simplification but > > > this aspect is not great. > > > Hmm, I wonder if that's a side-effect of the "create" functions doing > > > their own allocations and returning a pointer. Would it be less tricky > > > if the structs were declared where we need them and passed to "init" > > > functions? > > > > Seems worth trying. The current RT_CREATE() API is also convenient as > > other data structure such as simplehash.h and dshash.c supports a > > similar > > I don't happen to know if these paths had to solve similar trickiness > with some values being local, and some shared. > > > > That may be a good idea for other reasons. It's awkward that the > > > create function is declared like this: > > > > > > #ifdef RT_SHMEM > > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes, > > > dsa_area *dsa, > > > int tranche_id); > > > #else > > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes); > > > #endif > > > > > > An init function wouldn't need these parameters: it could look at the > > > passed struct to know what to do. > > > > But the init function would initialize leaf_ctx etc,no? Initializing > > leaf_ctx needs max_bytes that is not stored in RT_RADIX_TREE. > > I was more referring to the parameters that were different above > depending on shared memory. My first thought was that the tricky part > is because of the allocation in local memory, but it's certainly > possible I've misunderstood the problem. 
> > > The same > > is true for dsa. I imagined that an init function would allocate a DSA > > memory for the control object. > > Yes: > > ... > // embedded in VacDeadItems > TidStore items; > }; > > // NULL DSA in local case, etc > dead_items->items.area = dead_items_dsa; > dead_items->items.tranche_id = FOO_ID; > > TidStoreInit(_items->items, vac_work_mem); > > That's how I imagined it would work (leaving out some details). I > haven't tried it, so not sure how much it helps. Maybe it has other > problems, but I'm hoping it's just a matter of programming. It seems we cannot make this work nicely. IIUC VacDeadItems is allocated in DSM and TidStore is embedded there. However, dead_items->items.area is a local pointer to dsa_area. So we cannot include dsa_area in neither TidStore nor RT_RADIX_TREE. Instead we would need to pass dsa_area to each interface by callers. > > If we can't make this work nicely, I'd be okay with keeping the tid > store control object. My biggest concern is unnecessary > double-locking. If we don't do any locking stuff in radix tree APIs and it's the user's responsibility at all, probably we don't need a lock for tidstore? That is, we expose lock functions as you mentioned and the user (like tidstore) acquires/releases the lock before/after accessing the radix tree and num_items. Currently (as of v52 patch) RT_FIND is doing so, but we would need to change RT_SET() and iteration functions as well. During trying this idea, I realized that there is a visibility problem in the radix tree template especially if we want to embed the radix tree in a struct. Considering a use case where we want to use a radix tree in an exposed struct, we would declare only interfaces in a .h file and define actual implementation in a .c file (FYI TupleHashTableData does a similar thing with simplehash.h). 
The .c file and .h file would look like this. In the .h file:

#define RT_PREFIX local_rt
#define RT_SCOPE extern
#define RT_DECLARE
#define RT_VALUE_TYPE BlocktableEntry
#define RT_VARLEN_VALUE
#include "lib/radixtree.h"

typedef struct TidStore
{
	:
	local_rt_radix_tree tree;	/* embedded */
	:
} TidStore;

In the .c file:

#define RT_PREFIX local_rt
#define RT_SCOPE extern
#define RT_DEFINE
#define RT_VALUE_TYPE BlocktableEntry
#define RT_VARLEN_VALUE
#include "lib/radixtree.h"

But this doesn't work, as the compiler doesn't know the actual definition of local_rt_radix_tree. If 'tree' is instead a *local_rt_radix_tree, it works. The reason is that with RT_DECLARE but without RT_DEFINE, the radix tree template generates only forward declarations:

#ifdef RT_DECLARE
typedef struct RT_RADIX_TREE RT_RADIX_TREE;
typedef struct RT_ITER RT_ITER;

In order to make it work, we need to move the definitions required to expose the RT_RADIX_TREE struct into the RT_DECLARE part, which actually requires moving RT_NODE, RT_HANDLE, RT_NODE_PTR, RT_SIZE_CLASS_COUNT, RT_RADIX_TREE_CONTROL, etc. However, RT_SIZE_CLASS_COUNT, used in RT_RADIX_TREE, could be bothersome. Since it refers to
Re: [PoC] Improve dead tuple storage for lazy vacuum
I wrote:
> > Hmm, I wonder if that's a side-effect of the "create" functions doing
> > their own allocations and returning a pointer. Would it be less tricky
> > if the structs were declared where we need them and passed to "init"
> > functions?

If this is a possibility, I thought I'd first send the last (I hope) large-ish set of radix tree cleanups to avoid rebasing issues. I'm not including tidstore/vacuum here, because recent discussion has some up-in-the-air work.

These should be self-explanatory, but some things are worth calling out:

0012 and 0013: Some time ago I started passing insertpos as a parameter, but now see that is not ideal -- when growing from node16 to node48 we don't need it at all, so it's a wasted calculation. While reverting that, I found that this also allows passing constants in some cases.

0014 makes a cleaner separation between adding a child and growing a node, resulting in more compact-looking functions.

0019 is a bit unpolished, but I realized that it's pointless to assign a zero child when further up the call stack we overwrite it anyway with the actual value. With this, that assignment is skipped. This makes some comments and names strange, so it needs a bit of polish, but I wanted to get it out there anyway.

v53-ART.tar.gz Description: application/gzip
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Jan 17, 2024 at 8:39 AM Masahiko Sawada wrote: > > On Wed, Jan 17, 2024 at 9:20 AM John Naylor wrote: > > > > On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada > > wrote: > > > Just changing "items" to be the local tidstore struct could make the > > > code tricky a bit, since max_bytes and num_items are on the shared > > > memory while "items" is a local pointer to the shared tidstore. > > > > Thanks for trying it this way! I like the overall simplification but > > this aspect is not great. > > Hmm, I wonder if that's a side-effect of the "create" functions doing > > their own allocations and returning a pointer. Would it be less tricky > > if the structs were declared where we need them and passed to "init" > > functions? > > Seems worth trying. The current RT_CREATE() API is also convenient as > other data structure such as simplehash.h and dshash.c supports a > similar I don't happen to know if these paths had to solve similar trickiness with some values being local, and some shared. > > That may be a good idea for other reasons. It's awkward that the > > create function is declared like this: > > > > #ifdef RT_SHMEM > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes, > > dsa_area *dsa, > > int tranche_id); > > #else > > RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes); > > #endif > > > > An init function wouldn't need these parameters: it could look at the > > passed struct to know what to do. > > But the init function would initialize leaf_ctx etc,no? Initializing > leaf_ctx needs max_bytes that is not stored in RT_RADIX_TREE. I was more referring to the parameters that were different above depending on shared memory. My first thought was that the tricky part is because of the allocation in local memory, but it's certainly possible I've misunderstood the problem. > The same > is true for dsa. I imagined that an init function would allocate a DSA > memory for the control object. Yes: ... 
// embedded in VacDeadItems TidStore items; }; // NULL DSA in local case, etc dead_items->items.area = dead_items_dsa; dead_items->items.tranche_id = FOO_ID; TidStoreInit(_items->items, vac_work_mem); That's how I imagined it would work (leaving out some details). I haven't tried it, so not sure how much it helps. Maybe it has other problems, but I'm hoping it's just a matter of programming. If we can't make this work nicely, I'd be okay with keeping the tid store control object. My biggest concern is unnecessary double-locking.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Jan 17, 2024 at 9:20 AM John Naylor wrote:
>
> On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada wrote:
> > Just changing "items" to be the local tidstore struct could make the
> > code tricky a bit, since max_bytes and num_items are on the shared
> > memory while "items" is a local pointer to the shared tidstore.
>
> Thanks for trying it this way! I like the overall simplification but
> this aspect is not great.
> Hmm, I wonder if that's a side-effect of the "create" functions doing
> their own allocations and returning a pointer. Would it be less tricky
> if the structs were declared where we need them and passed to "init"
> functions?

Seems worth trying. The current RT_CREATE() API is also convenient, as other data structures such as simplehash.h and dshash.c support a similar one.

> That may be a good idea for other reasons. It's awkward that the
> create function is declared like this:
>
> #ifdef RT_SHMEM
> RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes,
>                                   dsa_area *dsa,
>                                   int tranche_id);
> #else
> RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes);
> #endif
>
> An init function wouldn't need these parameters: it could look at the
> passed struct to know what to do.

But the init function would initialize leaf_ctx etc., no? Initializing leaf_ctx needs max_bytes, which is not stored in RT_RADIX_TREE. The same is true for dsa. I imagined that an init function would allocate DSA memory for the control object. So I imagine we will end up still requiring some of them.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Jan 16, 2024 at 1:18 PM Masahiko Sawada wrote: > Just changing "items" to be the local tidstore struct could make the > code tricky a bit, since max_bytes and num_items are on the shared > memory while "items" is a local pointer to the shared tidstore. Thanks for trying it this way! I like the overall simplification but this aspect is not great. Hmm, I wonder if that's a side-effect of the "create" functions doing their own allocations and returning a pointer. Would it be less tricky if the structs were declared where we need them and passed to "init" functions? That may be a good idea for other reasons. It's awkward that the create function is declared like this: #ifdef RT_SHMEM RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes, dsa_area *dsa, int tranche_id); #else RT_SCOPE RT_RADIX_TREE *RT_CREATE(MemoryContext ctx, Size max_bytes); #endif An init function wouldn't need these parameters: it could look at the passed struct to know what to do.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sun, Jan 14, 2024 at 10:43 PM John Naylor wrote: > > On Fri, Jan 12, 2024 at 3:49 PM Masahiko Sawada wrote: > > > > On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada > > wrote: > > > So I agree to remove both max_bytes and num_items from the control > > > object.Also, as you mentioned, we can remove the tidstore control > > > object itself. TidStoreGetHandle() returns a radix tree handle, and we > > > can pass it to TidStoreAttach(). I'll try it. > > Thanks. It's worth looking closely here. > > > I realized that if we remove the whole tidstore control object > > including max_bytes, processes who attached the shared tidstore cannot > > use TidStoreIsFull() actually as it always returns true. > > I imagine that we'd replace that with a function (maybe an earlier > version had it?) to report the memory usage to the caller, which > should know where to find max_bytes. > > > Also they > > cannot use TidStoreReset() as well since it needs to pass max_bytes to > > RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it > > could be problematic for general use. > > HEAD has no problem finding the necessary values, and I don't think > it'd be difficult to maintain that ability. I'm not actually sure what > "general use" needs to have, and I'm not sure anyone can guess. > There's the future possibility of parallel heap-scanning, but I'm > guessing a *lot* more needs to happen for that to work, so I'm not > sure how much it buys us to immediately start putting those two fields > in a special abstraction. The only other concrete use case mentioned > in this thread that I remember is bitmap heap scan, and I believe that > would never need to reset, only free the whole thing when finished. > > I spent some more time studying parallel vacuum, and have some > thoughts. In HEAD, we have > > -/* > - * VacDeadItems stores TIDs whose index tuples are deleted by index > vacuuming. 
> - */ > -typedef struct VacDeadItems > -{ > - int max_items; /* # slots allocated in array */ > - int num_items; /* current # of entries */ > - > - /* Sorted array of TIDs to delete from indexes */ > - ItemPointerData items[FLEXIBLE_ARRAY_MEMBER]; > -} VacDeadItems; > > ...which has the tids, plus two fields that function _very similarly_ > to the two extra fields in the tidstore control object. It's a bit > strange to me that the patch doesn't have this struct anymore. > > I suspect if we keep it around (just change "items" to be the local > tidstore struct), the patch would have a bit less churn and look/work > more like the current code. I think it might be easier to read if the > v17 commits are suited to the current needs of vacuum, rather than try > to anticipate all uses. Richer abstractions can come later if needed. Just changing "items" to be the local tidstore struct could make the code tricky a bit, since max_bytes and num_items are on the shared memory while "items" is a local pointer to the shared tidstore. This is a reason why I abstract them behind TidStore. However, IIUC the current parallel vacuum can work with such VacDeadItems fields, fortunately. The leader process can use VacDeadItems allocated on DSM, and worker processes can use a local VacDeadItems of which max_bytes and num_items are copied from the shared one and "items" is a local pointer. Assuming parallel heap scan requires for both the leader and workers to update the shared VacDeadItems concurrently, we may need such richer abstractions. I've implemented this idea in the v52 patch set. Here is the summary of the updates: 0008: Remove the control object from tidstore. Also removed some unsupported functions such as TidStoreNumTids() 0009: Adjust lazy vacuum integration patch with the control object removal. I've not updated any locking code yet. Once we confirm this direction, I'll update the locking code too. 
Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

v52-ART.tar.gz Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Jan 12, 2024 at 3:49 PM Masahiko Sawada wrote: > > On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada wrote: > > So I agree to remove both max_bytes and num_items from the control > > object.Also, as you mentioned, we can remove the tidstore control > > object itself. TidStoreGetHandle() returns a radix tree handle, and we > > can pass it to TidStoreAttach(). I'll try it. Thanks. It's worth looking closely here. > I realized that if we remove the whole tidstore control object > including max_bytes, processes who attached the shared tidstore cannot > use TidStoreIsFull() actually as it always returns true. I imagine that we'd replace that with a function (maybe an earlier version had it?) to report the memory usage to the caller, which should know where to find max_bytes. > Also they > cannot use TidStoreReset() as well since it needs to pass max_bytes to > RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it > could be problematic for general use. HEAD has no problem finding the necessary values, and I don't think it'd be difficult to maintain that ability. I'm not actually sure what "general use" needs to have, and I'm not sure anyone can guess. There's the future possibility of parallel heap-scanning, but I'm guessing a *lot* more needs to happen for that to work, so I'm not sure how much it buys us to immediately start putting those two fields in a special abstraction. The only other concrete use case mentioned in this thread that I remember is bitmap heap scan, and I believe that would never need to reset, only free the whole thing when finished. I spent some more time studying parallel vacuum, and have some thoughts. In HEAD, we have -/* - * VacDeadItems stores TIDs whose index tuples are deleted by index vacuuming. 
- */ -typedef struct VacDeadItems -{ - int max_items; /* # slots allocated in array */ - int num_items; /* current # of entries */ - - /* Sorted array of TIDs to delete from indexes */ - ItemPointerData items[FLEXIBLE_ARRAY_MEMBER]; -} VacDeadItems; ...which has the tids, plus two fields that function _very similarly_ to the two extra fields in the tidstore control object. It's a bit strange to me that the patch doesn't have this struct anymore. I suspect if we keep it around (just change "items" to be the local tidstore struct), the patch would have a bit less churn and look/work more like the current code. I think it might be easier to read if the v17 commits are suited to the current needs of vacuum, rather than try to anticipate all uses. Richer abstractions can come later if needed. Another stanza: - /* Prepare the dead_items space */ - dead_items = (VacDeadItems *) shm_toc_allocate(pcxt->toc, -est_dead_items_len); - dead_items->max_items = max_items; - dead_items->num_items = 0; - MemSet(dead_items->items, 0, sizeof(ItemPointerData) * max_items); - shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_DEAD_ITEMS, dead_items); - pvs->dead_items = dead_items; With s/max_items/max_bytes/, I wonder if we can still use some of this, and parallel workers would have no problem getting the necessary info, as they do today. If not, I don't really understand why. I'm not very familiar with working with shared memory, and I know the tree itself needs some different setup, so it's quite possible I'm missing something. I find it difficult to kept straight these four things: - radix tree - radix tree control object - tidstore - tidstore control object Even with the code in front of me, it's hard to reason about how these concepts fit together. It'd be much more readable if this was simplified.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Jan 11, 2024 at 9:28 AM Masahiko Sawada wrote: > > On Mon, Jan 8, 2024 at 8:35 PM John Naylor wrote: > > > > On Wed, Jan 3, 2024 at 9:10 PM John Naylor wrote: > > > > > > On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada > > > wrote: > > > > > > > I agree that we expose RT_LOCK_* functions and have tidstore use them, > > > > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)" > > > > calls part. I think that even if we expose them, we will still need to > > > > do something like "if (TidStoreIsShared(ts)) > > > > shared_rt_lock_share(ts->tree.shared)", no? > > > > > > I'll come back to this topic separately. > > > > To answer your question, sure, but that "if (TidStoreIsShared(ts))" > > part would be pushed down into a function so that only one place has > > to care about it. > > > > However, I'm starting to question whether we even need that. Meaning, > > lock the tidstore separately. To "lock the tidstore" means to take a > > lock, _separate_ from the radix tree's internal lock, to control > > access to two fields in a separate "control object": > > > > +typedef struct TidStoreControl > > +{ > > + /* the number of tids in the store */ > > + int64 num_tids; > > + > > + /* the maximum bytes a TidStore can use */ > > + size_t max_bytes; > > > > I'm pretty sure max_bytes does not need to be in shared memory, and > > certainly not under a lock: Thinking of a hypothetical > > parallel-prune-phase scenario, one way would be for a leader process > > to pass out ranges of blocks to workers, and when the limit is > > exceeded, stop passing out blocks and wait for all the workers to > > finish. > > True. I agreed that it doesn't need to be under a lock anyway, as it's > read-only. 
> > > > > As for num_tids, vacuum previously put the similar count in > > > > @@ -176,7 +179,8 @@ struct ParallelVacuumState > > PVIndStats *indstats; > > > > /* Shared dead items space among parallel vacuum workers */ > > - VacDeadItems *dead_items; > > + TidStore *dead_items; > > > > VacDeadItems contained "num_items". What was the reason to have new > > infrastructure for that count? And it doesn't seem like access to it > > was controlled by a lock -- can you confirm? If we did get parallel > > pruning, maybe the count would belong inside PVShared? > > I thought that since the tidstore is a general-purpose data structure > the shared counter should be protected by a lock. One thing I'm > concerned about is that we might need to update both the radix tree > and the counter atomically in some cases. But that's true we don't > need it for lazy vacuum at least for now. Even given the parallel scan > phase, probably we won't need to have workers check the total number > of stored tuples during a parallel scan. > > > > > The number of tids is not that tightly bound to the tidstore's job. I > > believe tidbitmap.c (a possible future client) doesn't care about the > > global number of tids -- not only that, but AND/OR operations can > > change the number in a non-obvious way, so it would not be convenient > > to keep an accurate number anyway. But the lock would still be > > mandatory with this patch. > > Very good point. > > > > > If we can make vacuum work a bit closer to how it does now, it'd be a > > big step up in readability, I think. Namely, getting rid of all the > > locking logic inside tidstore.c and let the radix tree's locking do > > the right thing. We'd need to make that work correctly when receiving > > pointers to values upon lookup, and I already shared ideas for that. > > But I want to see if there is any obstacle in the way of removing the > > tidstore control object and it's separate lock. 
> So I agree to remove both max_bytes and num_items from the control
> object. Also, as you mentioned, we can remove the tidstore control
> object itself. TidStoreGetHandle() returns a radix tree handle, and we
> can pass it to TidStoreAttach(). I'll try it.

I realized that if we remove the whole tidstore control object, including max_bytes, processes that attached to the shared tidstore cannot actually use TidStoreIsFull(), as it always returns true. They also cannot use TidStoreReset(), since it needs to pass max_bytes to RT_CREATE(). It might not be a problem in terms of lazy vacuum, but it could be problematic for general use. If we remove it, we probably need a safeguard to prevent those who attached to the tidstore from calling these functions. Or we can keep the control object but remove the lock and num_tids.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Jan 8, 2024 at 8:35 PM John Naylor wrote: > > On Wed, Jan 3, 2024 at 9:10 PM John Naylor wrote: > > > > On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada > > wrote: > > > > > I agree that we expose RT_LOCK_* functions and have tidstore use them, > > > but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)" > > > calls part. I think that even if we expose them, we will still need to > > > do something like "if (TidStoreIsShared(ts)) > > > shared_rt_lock_share(ts->tree.shared)", no? > > > > I'll come back to this topic separately. > > To answer your question, sure, but that "if (TidStoreIsShared(ts))" > part would be pushed down into a function so that only one place has > to care about it. > > However, I'm starting to question whether we even need that. Meaning, > lock the tidstore separately. To "lock the tidstore" means to take a > lock, _separate_ from the radix tree's internal lock, to control > access to two fields in a separate "control object": > > +typedef struct TidStoreControl > +{ > + /* the number of tids in the store */ > + int64 num_tids; > + > + /* the maximum bytes a TidStore can use */ > + size_t max_bytes; > > I'm pretty sure max_bytes does not need to be in shared memory, and > certainly not under a lock: Thinking of a hypothetical > parallel-prune-phase scenario, one way would be for a leader process > to pass out ranges of blocks to workers, and when the limit is > exceeded, stop passing out blocks and wait for all the workers to > finish. True. I agreed that it doesn't need to be under a lock anyway, as it's read-only. > > As for num_tids, vacuum previously put the similar count in > > @@ -176,7 +179,8 @@ struct ParallelVacuumState > PVIndStats *indstats; > > /* Shared dead items space among parallel vacuum workers */ > - VacDeadItems *dead_items; > + TidStore *dead_items; > > VacDeadItems contained "num_items". What was the reason to have new > infrastructure for that count? 
> And it doesn't seem like access to it
> was controlled by a lock -- can you confirm? If we did get parallel
> pruning, maybe the count would belong inside PVShared?

I thought that since the tidstore is a general-purpose data structure, the shared counter should be protected by a lock. One thing I'm concerned about is that we might need to update both the radix tree and the counter atomically in some cases. But it's true that we don't need that for lazy vacuum, at least for now. Even given the parallel scan phase, we probably won't need workers to check the total number of stored tuples during a parallel scan.

> The number of tids is not that tightly bound to the tidstore's job. I
> believe tidbitmap.c (a possible future client) doesn't care about the
> global number of tids -- not only that, but AND/OR operations can
> change the number in a non-obvious way, so it would not be convenient
> to keep an accurate number anyway. But the lock would still be
> mandatory with this patch.

Very good point.

> If we can make vacuum work a bit closer to how it does now, it'd be a
> big step up in readability, I think. Namely, getting rid of all the
> locking logic inside tidstore.c and let the radix tree's locking do
> the right thing. We'd need to make that work correctly when receiving
> pointers to values upon lookup, and I already shared ideas for that.
> But I want to see if there is any obstacle in the way of removing the
> tidstore control object and its separate lock.

So I agree to remove both max_bytes and num_items from the control object. Also, as you mentioned, we can remove the tidstore control object itself. TidStoreGetHandle() returns a radix tree handle, and we can pass it to TidStoreAttach(). I'll try it.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Jan 10, 2024 at 9:05 AM Masahiko Sawada wrote:
>
> I've done it in the 0010 patch in the v51 patch set. Whereas the RT_NODE_4 and
> RT_NODE_16 struct declarations need RT_FANOUT_4_HI and RT_FANOUT_16_HI
> respectively, RT_FANOUT_16_LO and RT_FANOUT_48 need the RT_NODE_16 and
> RT_NODE_48 struct declarations. So the fanout declarations are now spread
> before and after the RT_NODE_XXX struct declarations. It's a bit less
> readable, but I'm not sure of a better way.

They were before and after the *_BASE types, so it's not really worse, I think. I did notice that RT_SLOT_IDX_LIMIT has been considered special for a very long time, since before we even had size classes, so it's the same thing but even farther away. I have an idea to introduce *_MAX macros, allowing us to turn RT_SLOT_IDX_LIMIT into RT_FANOUT_48_MAX, so that everything is in the same spot, and to make this area more consistent. I also noticed that I'd been assuming that RT_FANOUT_16_HI fits easily into a DSA size class, but that's only true on 64-bit, and in any case we don't want to assume it. I've attached an addendum .txt to demo this idea.

diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index bde6916184..ffb0b58826 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -287,19 +287,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE * tree);
 /* Tree level the radix tree uses */
 #define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_SPAN)
 
-/*
- * Number of bits necessary for isset array in node48.
- * Since bitmapword can be 64 bits, the only values that make sense
- * here are 64 and 128.
- * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
- * inside a single bitmapword on most platforms, so it's a good starting
- * point. We can make it higher if we need to.
- */
-#define RT_SLOT_IDX_LIMIT (RT_NODE_MAX_SLOTS / 4)
-
-/* Invalid index used in node48 */
-#define RT_INVALID_SLOT_IDX 0xFF
-
 /* Get a chunk from the key */
 #define RT_GET_KEY_CHUNK(key, shift) ((uint8) (((key) >> (shift)) & RT_CHUNK_MASK))
 
@@ -426,20 +413,29 @@ typedef union RT_NODE_PTR
 } RT_NODE_PTR;
 
 /*
- * fanout values for each size class.
- *
- * RT_FANOUT_4_HI and RT_FANOUT_16_HI are declared here as they are
+ * Symbols for maximum possible fanout are declared first as they are
  * required to declare each node kind. The declarations of other fanout
  * values are followed as they need the struct sizes of each node kind.
- *
- * TODO: consider 5 with subclass 1 or 2.
  */
 
 /* max possible key chunks without struct padding */
-#define RT_FANOUT_4_HI (8 - sizeof(RT_NODE))
+#define RT_FANOUT_4_MAX (8 - sizeof(RT_NODE))
 
 /* equal to two 128-bit SIMD registers, regardless of availability */
-#define RT_FANOUT_16_HI 32
+#define RT_FANOUT_16_MAX 32
+
+/*
+ * This also determines the number of bits necessary for the isset array,
+ * so we need to be mindful of the size of bitmapword.
+ * Since bitmapword can be 64 bits, the only values that make sense
+ * here are 64 and 128.
+ * WIP: The paper uses at most 64 for this node kind. "isset" happens to fit
+ * inside a single bitmapword on most platforms, so it's a good starting
+ * point. We can make it higher if we need to.
+ */
+#define RT_FANOUT_48_MAX (RT_NODE_MAX_SLOTS / 4)
+
+#define RT_FANOUT_256 RT_NODE_MAX_SLOTS
 
 /*
  * Node structs, one for each "kind"
@@ -448,7 +444,7 @@ typedef struct RT_NODE_4
 {
 	RT_NODE base;
 
-	uint8 chunks[RT_FANOUT_4_HI];
+	uint8 chunks[RT_FANOUT_4_MAX];
 
 	/* number of children depends on size class */
 	RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
@@ -458,7 +454,7 @@ typedef struct RT_NODE_16
 {
 	RT_NODE base;
 
-	uint8 chunks[RT_FANOUT_16_HI];
+	uint8 chunks[RT_FANOUT_16_MAX];
 
 	/* number of children depends on size class */
 	RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
@@ -476,8 +472,11 @@ typedef struct RT_NODE_48
 	/* The index of slots for each fanout */
 	uint8 slot_idxs[RT_NODE_MAX_SLOTS];
 
+/* Invalid index */
+#define RT_INVALID_SLOT_IDX 0xFF
+
 	/* bitmap to track which slots are in use */
-	bitmapword isset[RT_BM_IDX(RT_SLOT_IDX_LIMIT)];
+	bitmapword isset[RT_BM_IDX(RT_FANOUT_48_MAX)];
 
 	/* number of children depends on size class */
 	RT_PTR_ALLOC children[FLEXIBLE_ARRAY_MEMBER];
@@ -493,28 +492,29 @@ typedef struct RT_NODE_256
 {
 	RT_NODE base;
 
-	/*
-	 * Zero is a valid value for embedded values, so we use a bitmap to track
-	 * which slots are in use.
-	 */
-	bitmapword isset[RT_BM_IDX(RT_NODE_MAX_SLOTS)];
+	/* bitmap to track which slots are in use */
+	bitmapword isset[RT_BM_IDX(RT_FANOUT_256)];
 
 	/* Slots for 256 children */
-	RT_PTR_ALLOC children[RT_NODE_MAX_SLOTS];
+	RT_PTR_ALLOC
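The slot_idxs indirection that RT_INVALID_SLOT_IDX and RT_FANOUT_48_MAX serve can be illustrated with a small self-contained sketch. The names and fixed sizes below are stand-ins for illustration, not the template's actual definitions: a 256-entry index array maps each key chunk to a slot in a much smaller children array, with 0xFF marking unused chunks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FANOUT_48_MAX 64		/* stand-in for RT_FANOUT_48_MAX */
#define INVALID_SLOT_IDX 0xFF	/* stand-in for RT_INVALID_SLOT_IDX */

typedef struct Node48
{
	/* maps each of the 256 possible key chunks to a slot, or INVALID_SLOT_IDX */
	uint8_t		slot_idxs[256];
	uintptr_t	children[FANOUT_48_MAX];
} Node48;

static void
node48_init(Node48 *node)
{
	/* 0xFF in every byte marks all chunks as unused */
	memset(node->slot_idxs, INVALID_SLOT_IDX, sizeof(node->slot_idxs));
}

static uintptr_t *
node48_find(Node48 *node, uint8_t chunk)
{
	uint8_t		slot = node->slot_idxs[chunk];

	if (slot == INVALID_SLOT_IDX)
		return NULL;
	return &node->children[slot];
}
```

Because INVALID_SLOT_IDX is a valid uint8_t but not a valid slot, the usable fanout tops out below 256, which is why the maximum is defined alongside the other fanout limits.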
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Jan 9, 2024 at 8:19 PM John Naylor wrote: > > On Tue, Jan 9, 2024 at 9:40 AM Masahiko Sawada wrote: > > In addition, I've made some changes and cleanups: > > These look good to me, although I have not tried dumping a node in a while. > > > 0011 - simplify the radix tree iteration code. I hope it makes the > > code clear and readable. Also I removed RT_UPDATE_ITER_STACK(). > > I'm very pleased with how much simpler it is now! > > > 0013 - In RT_SHMEM case, we use SIZEOF_VOID_P for > > RT_VALUE_IS_EMBEDDABLE check, but I think it's not correct. Because > > DSA has its own pointer size, SIZEOF_DSA_POINTER, it could be 4 bytes > > even if SIZEOF_VOID_P is 8 bytes, for example in a case where > > !defined(PG_HAVE_ATOMIC_U64_SUPPORT). Please refer to dsa.h for > > details. > > Thanks for the pointer. ;-) > > > BTW, now that the inner and leaf nodes use the same structure, do we > > still need RT_NODE_BASE_XXX types? Most places where we use > > RT_NODE_BASE_XXX types can be replaced with RT_NODE_XXX types. > > That's been in the back of my mind as well. Maybe the common header > should be the new "base" member? At least, something other than "n". Agreed. > > > Exceptions are RT_FANOUT_XX calculations: > > > > #if SIZEOF_VOID_P < 8 > > #define RT_FANOUT_16_LO ((96 - sizeof(RT_NODE_BASE_16)) / > > sizeof(RT_PTR_ALLOC)) > > #define RT_FANOUT_48((512 - sizeof(RT_NODE_BASE_48)) / > > sizeof(RT_PTR_ALLOC)) > > #else > > #define RT_FANOUT_16_LO ((160 - sizeof(RT_NODE_BASE_16)) / > > sizeof(RT_PTR_ALLOC)) > > #define RT_FANOUT_48((768 - sizeof(RT_NODE_BASE_48)) / > > sizeof(RT_PTR_ALLOC)) > > #endif /* SIZEOF_VOID_P < 8 */ > > > > But I think we can replace them with offsetof(RT_NODE_16, children) etc. > > That makes sense. Do you want to have a go at it, or shall I? I've done in 0010 patch in v51 patch set. 
Whereas the RT_NODE_4 and RT_NODE_16 struct declarations need RT_FANOUT_4_HI and RT_FANOUT_16_HI respectively, RT_FANOUT_16_LO and RT_FANOUT_48 need the RT_NODE_16 and RT_NODE_48 struct declarations. So the fanout declarations are now spread before and after the RT_NODE_XXX struct declarations. It's a bit less readable, but I'm not sure of a better way.

The previous updates are merged into the main radix tree patch and the tidstore patch. Nothing changes in the other patches from v50.

> I think after that, the only big cleanup needed is putting things in a
> more readable order. I can do that at a later date, and other
> opportunities for beautification are pretty minor and localized.

Agreed.

> Rationalizing locking is the only thing left that requires a bit of thought.

Right, I'll send a reply soon.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

v51-ART.tar.gz
Description: application/gzip
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Jan 9, 2024 at 9:40 AM Masahiko Sawada wrote:
> In addition, I've made some changes and cleanups:

These look good to me, although I have not tried dumping a node in a while.

> 0011 - simplify the radix tree iteration code. I hope it makes the
> code clear and readable. Also I removed RT_UPDATE_ITER_STACK().

I'm very pleased with how much simpler it is now!

> 0013 - In RT_SHMEM case, we use SIZEOF_VOID_P for
> RT_VALUE_IS_EMBEDDABLE check, but I think it's not correct. Because
> DSA has its own pointer size, SIZEOF_DSA_POINTER, it could be 4 bytes
> even if SIZEOF_VOID_P is 8 bytes, for example in a case where
> !defined(PG_HAVE_ATOMIC_U64_SUPPORT). Please refer to dsa.h for
> details.

Thanks for the pointer. ;-)

> BTW, now that the inner and leaf nodes use the same structure, do we
> still need RT_NODE_BASE_XXX types? Most places where we use
> RT_NODE_BASE_XXX types can be replaced with RT_NODE_XXX types.

That's been in the back of my mind as well. Maybe the common header should be the new "base" member? At least, something other than "n".

> Exceptions are RT_FANOUT_XX calculations:
>
> #if SIZEOF_VOID_P < 8
> #define RT_FANOUT_16_LO ((96 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC))
> #define RT_FANOUT_48    ((512 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC))
> #else
> #define RT_FANOUT_16_LO ((160 - sizeof(RT_NODE_BASE_16)) / sizeof(RT_PTR_ALLOC))
> #define RT_FANOUT_48    ((768 - sizeof(RT_NODE_BASE_48)) / sizeof(RT_PTR_ALLOC))
> #endif /* SIZEOF_VOID_P < 8 */
>
> But I think we can replace them with offsetof(RT_NODE_16, children) etc.

That makes sense. Do you want to have a go at it, or shall I?

I think after that, the only big cleanup needed is putting things in a more readable order. I can do that at a later date, and other opportunities for beautification are pretty minor and localized.

Rationalizing locking is the only thing left that requires a bit of thought.
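The offsetof() replacement suggested in the exchange above can be sketched as follows. The struct layout and target sizes are simplified stand-ins for the real template types; the point is that the header size falls out of offsetof(..., children), so no separate "base" struct is needed to size the flexible array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t PTR_ALLOC;	/* stand-in for RT_PTR_ALLOC */

/* simplified stand-in for RT_NODE_16 */
typedef struct Node16
{
	uint8_t		kind;
	uint8_t		count;
	uint8_t		chunks[32];

	/* number of children depends on size class */
	PTR_ALLOC	children[];		/* flexible array member */
} Node16;

/*
 * How many children fit in a given allocation size?  The header size
 * comes from offsetof(Node16, children), including any padding before
 * the flexible array, with no separate "base" struct required.
 */
#define NODE16_LO_FANOUT(alloc_size) \
	(((alloc_size) - offsetof(Node16, children)) / sizeof(PTR_ALLOC))
```

Using offsetof here also accounts for padding correctly on any platform, which subtracting sizeof() of a hand-maintained base struct only does if the two layouts are kept in sync by hand.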
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Jan 3, 2024 at 11:10 PM John Naylor wrote:
> On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada wrote:
> > I agree that we expose RT_LOCK_* functions and have tidstore use them,
> > but am not sure about the "if (TidStoreIsShared(ts)) LWLockAcquire(..., ...)"
> > calls part. I think that even if we expose them, we will still need to
> > do something like "if (TidStoreIsShared(ts)) shared_rt_lock_share(ts->tree.shared)", no?
>
> I'll come back to this topic separately.
>
> > I've attached a new patch set. From the v47 patch, I've merged your
> > changes for the radix tree, and split the vacuum integration patch into
> > 3 patches: simply replace VacDeadItems with TidStore (0007 patch), use a
> > simple TID array for the one-pass strategy (0008 patch), and replace
> > has_lpdead_items with "num_offsets > 0" (0009 patch), while
> > incorporating your review comments on the vacuum integration patch
>
> Nice!
>
> > (sorry for making it difficult to see the changes from the v47 patch).
>
> It's actually pretty clear. I just have a couple of comments before
> sharing my latest cleanups (diff'ing between v47 and v48):
>
> -	/*
> -	 * In the shared case, TidStoreControl and radix_tree are backed by the
> -	 * same DSA area and rt_memory_usage() returns the value including both.
> -	 * So we don't need to add the size of TidStoreControl separately.
> -	 */
> 	if (TidStoreIsShared(ts))
> -		return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
> +		rt_mem = shared_rt_memory_usage(ts->tree.shared);
> +	else
> +		rt_mem = local_rt_memory_usage(ts->tree.local);
>
> -	return sizeof(TidStore) + sizeof(TidStore) + local_rt_memory_usage(ts->tree.local);
> +	return sizeof(TidStore) + sizeof(TidStoreControl) + rt_mem;
>
> Upthread, I meant that I don't see the need to include the size of
> these structs *at all*. They're tiny, and the blocks/segments will
> almost certainly have some empty space counted in the total anyway.
> The returned size is already overestimated, so this extra code is just
> a distraction.

Agreed.

> - if (result->num_offsets + bmw_popcount(w) > result->max_offset)
> + if (result->num_offsets + (sizeof(bitmapword) * BITS_PER_BITMAPWORD) >= result->max_offset)
>
> I believe this math is wrong. We care about "result->num_offsets +
> BITS_PER_BITMAPWORD", right?
> Also, it seems if the condition evaluates to equal, we still have
> enough space, in which case ">" the max is the right condition.

Oops, you're right. Fixed.

> - if (off < 1 || off > MAX_TUPLES_PER_PAGE)
> + if (off < 1 || off > MaxOffsetNumber)
>
> This can now use OffsetNumberIsValid().

Fixed.

> > 0013 to 0015 patches are also updates from v47 patch.
> > I'm thinking that we should change the order of the patches so that
> > tidstore patch requires the patch for changing DSA segment sizes. That
> > way, we can remove the complex max memory calculation part that we no
> > longer use from the tidstore patch.
>
> I don't think there is any reason to have those calculations at all at
> this point. Every patch in every version should at least *work
> correctly*, without kludging m_w_m and without constraining max
> segment size. I'm fine with the latter remaining in its own thread,
> and I hope we can consider it an enhancement that respects the admin's
> configured limits more effectively, and not a pre-requisite for not
> breaking. I *think* we're there now, but it's hard to tell since 0015
> was at the very end. As I said recently, if something still fails, I'd
> like to know why. So for v49, I took the liberty of removing the DSA
> max segment patches for now, and squashing v48-0015.

Fair enough.

> In addition for v49, I have quite a few cleanups:
>
> 0001 - This hasn't been touched in a very long time, but I ran
> pgindent and clarified a comment
> 0002 - We no longer need to isolate the rightmost bit anywhere, so
> removed that part and revised the commit message accordingly.

Thanks.
> radix tree:
> 0003 - v48 plus squashed v48-0013
> 0004 - Removed or adjusted WIP, FIXME, TODO items. Some were outdated,
> and I fixed most of the rest.
> 0005 - Remove the RT_PTR_LOCAL macro, since it's not really useful anymore.
> 0006 - RT_FREE_LEAF only needs the allocated pointer, so pass that. A
> bit simpler.
> 0007 - Uses the same idea from a previous cleanup of RT_SET, for RT_DELETE.
> 0008 - Removes a holdover from the multi-value leaves era.
> 0009 - It occurred to me that we need to have unique names for memory
> contexts for different instantiations of the template. This is one way
> to do it, by using the configured RT_PREFIX in the context name. I
> also took an extra step to make the size class fanout show up
> correctly on different platforms, but that's probably overkill and
> undesirable, and I'll probably use only the class name next time.
> 0010/11 - Make the array functions less surprising and with more
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Jan 3, 2024 at 9:10 PM John Naylor wrote:
> On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada wrote:
> > I agree that we expose RT_LOCK_* functions and have tidstore use them,
> > but am not sure about the "if (TidStoreIsShared(ts)) LWLockAcquire(..., ...)"
> > calls part. I think that even if we expose them, we will still need to
> > do something like "if (TidStoreIsShared(ts)) shared_rt_lock_share(ts->tree.shared)", no?
>
> I'll come back to this topic separately.

To answer your question, sure, but that "if (TidStoreIsShared(ts))" part would be pushed down into a function so that only one place has to care about it.

However, I'm starting to question whether we even need that. Meaning, lock the tidstore separately. To "lock the tidstore" means to take a lock, _separate_ from the radix tree's internal lock, to control access to two fields in a separate "control object":

+typedef struct TidStoreControl
+{
+	/* the number of tids in the store */
+	int64	num_tids;
+
+	/* the maximum bytes a TidStore can use */
+	size_t	max_bytes;

I'm pretty sure max_bytes does not need to be in shared memory, and certainly not under a lock: Thinking of a hypothetical parallel-prune-phase scenario, one way would be for a leader process to pass out ranges of blocks to workers, and when the limit is exceeded, stop passing out blocks and wait for all the workers to finish.

As for num_tids, vacuum previously put the similar count in

@@ -176,7 +179,8 @@ struct ParallelVacuumState
 	PVIndStats *indstats;

 	/* Shared dead items space among parallel vacuum workers */
-	VacDeadItems *dead_items;
+	TidStore *dead_items;

VacDeadItems contained "num_items". What was the reason to have new infrastructure for that count? And it doesn't seem like access to it was controlled by a lock -- can you confirm? If we did get parallel pruning, maybe the count would belong inside PVShared?

The number of tids is not that tightly bound to the tidstore's job.
I believe tidbitmap.c (a possible future client) doesn't care about the global number of tids -- not only that, but AND/OR operations can change the number in a non-obvious way, so it would not be convenient to keep an accurate number anyway. But the lock would still be mandatory with this patch.

If we can make vacuum work a bit closer to how it does now, it'd be a big step up in readability, I think. Namely, getting rid of all the locking logic inside tidstore.c and letting the radix tree's locking do the right thing. We'd need to make that work correctly when receiving pointers to values upon lookup, and I already shared ideas for that. But I want to see if there is any obstacle in the way of removing the tidstore control object and its separate lock.
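To make the point about AND/OR concrete: the population count of an intersection cannot be maintained incrementally from the operands' counts alone, so a stored global tid count would have to be recomputed anyway. A toy illustration (not tidbitmap.c code):

```c
#include <assert.h>
#include <stdint.h>

/* portable popcount for a 64-bit word (Kernighan's method) */
static int
popcount64(uint64_t w)
{
	int			count = 0;

	while (w)
	{
		w &= w - 1;				/* clear the lowest set bit */
		count++;
	}
	return count;
}
```

Two pairs of operands with identical per-operand counts can produce intersections with different counts, which is why the result must be rescanned rather than derived.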
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Jan 2, 2024 at 8:01 PM Masahiko Sawada wrote:
> I agree that we expose RT_LOCK_* functions and have tidstore use them,
> but am not sure about the "if (TidStoreIsShared(ts)) LWLockAcquire(..., ...)"
> calls part. I think that even if we expose them, we will still need to
> do something like "if (TidStoreIsShared(ts)) shared_rt_lock_share(ts->tree.shared)", no?

I'll come back to this topic separately.

> I've attached a new patch set. From the v47 patch, I've merged your
> changes for the radix tree, and split the vacuum integration patch into
> 3 patches: simply replace VacDeadItems with TidStore (0007 patch), use a
> simple TID array for the one-pass strategy (0008 patch), and replace
> has_lpdead_items with "num_offsets > 0" (0009 patch), while
> incorporating your review comments on the vacuum integration patch

Nice!

> (sorry for making it difficult to see the changes from the v47 patch).

It's actually pretty clear. I just have a couple of comments before sharing my latest cleanups (diff'ing between v47 and v48):

-	/*
-	 * In the shared case, TidStoreControl and radix_tree are backed by the
-	 * same DSA area and rt_memory_usage() returns the value including both.
-	 * So we don't need to add the size of TidStoreControl separately.
-	 */
 	if (TidStoreIsShared(ts))
-		return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared);
+		rt_mem = shared_rt_memory_usage(ts->tree.shared);
+	else
+		rt_mem = local_rt_memory_usage(ts->tree.local);

-	return sizeof(TidStore) + sizeof(TidStore) + local_rt_memory_usage(ts->tree.local);
+	return sizeof(TidStore) + sizeof(TidStoreControl) + rt_mem;

Upthread, I meant that I don't see the need to include the size of these structs *at all*. They're tiny, and the blocks/segments will almost certainly have some empty space counted in the total anyway.
- if (result->num_offsets + bmw_popcount(w) > result->max_offset)
+ if (result->num_offsets + (sizeof(bitmapword) * BITS_PER_BITMAPWORD) >= result->max_offset)

I believe this math is wrong. We care about "result->num_offsets + BITS_PER_BITMAPWORD", right? Also, it seems if the condition evaluates to equal, we still have enough space, in which case ">" the max is the right condition.

- if (off < 1 || off > MAX_TUPLES_PER_PAGE)
+ if (off < 1 || off > MaxOffsetNumber)

This can now use OffsetNumberIsValid().

> 0013 to 0015 patches are also updates from v47 patch.
> I'm thinking that we should change the order of the patches so that
> tidstore patch requires the patch for changing DSA segment sizes. That
> way, we can remove the complex max memory calculation part that we no
> longer use from the tidstore patch.

I don't think there is any reason to have those calculations at all at this point. Every patch in every version should at least *work correctly*, without kludging m_w_m and without constraining max segment size. I'm fine with the latter remaining in its own thread, and I hope we can consider it an enhancement that respects the admin's configured limits more effectively, and not a pre-requisite for not breaking. I *think* we're there now, but it's hard to tell since 0015 was at the very end. As I said recently, if something still fails, I'd like to know why. So for v49, I took the liberty of removing the DSA max segment patches for now, and squashing v48-0015.

In addition for v49, I have quite a few cleanups:

0001 - This hasn't been touched in a very long time, but I ran pgindent and clarified a comment
0002 - We no longer need to isolate the rightmost bit anywhere, so removed that part and revised the commit message accordingly.

radix tree:
0003 - v48 plus squashed v48-0013
0004 - Removed or adjusted WIP, FIXME, TODO items. Some were outdated, and I fixed most of the rest.
0005 - Remove the RT_PTR_LOCAL macro, since it's not really useful anymore.
0006 - RT_FREE_LEAF only needs the allocated pointer, so pass that. A bit simpler.
0007 - Uses the same idea from a previous cleanup of RT_SET, for RT_DELETE.
0008 - Removes a holdover from the multi-value leaves era.
0009 - It occurred to me that we need to have unique names for memory contexts for different instantiations of the template. This is one way to do it, by using the configured RT_PREFIX in the context name. I also took an extra step to make the size class fanout show up correctly on different platforms, but that's probably overkill and undesirable, and I'll probably use only the class name next time.
0010/11 - Make the array functions less surprising and with more informative names.
0012 - Restore a useful technique from Andres's prototype. This part has been slow for a long time, so much that it showed up in a profile where this path wasn't even taken much.

tid store / vacuum:
0013/14 - Same as v48 TID store, with review squashed
0015 - Rationalize comment and starting
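The bound-check correction discussed above (add BITS_PER_BITMAPWORD to the running count, and treat exactly reaching the maximum as still fitting) can be sketched like this. The constant and function name are illustrative, not the patch's code:

```c
#include <assert.h>
#include <stdbool.h>

#define BITS_PER_BITMAPWORD 64	/* stand-in; PostgreSQL derives this from bitmapword */

/*
 * Would appending one full bitmapword's worth of offsets overflow the
 * array?  Per the review: the increment is BITS_PER_BITMAPWORD (not
 * sizeof(bitmapword) * BITS_PER_BITMAPWORD), and since landing exactly
 * on the maximum still fits, the overflow condition uses ">" not ">=".
 */
static bool
would_overflow(int num_offsets, int max_offset)
{
	return num_offsets + BITS_PER_BITMAPWORD > max_offset;
}
```

With ">=" the check would falsely report overflow when the buffer has exactly enough room left, wasting the last word of capacity.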
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Dec 27, 2023 at 12:08 PM John Naylor wrote: > > On Tue, Dec 26, 2023 at 12:43 PM Masahiko Sawada > wrote: > > > > On Thu, Dec 21, 2023 at 4:41 PM John Naylor wrote: > > > > +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber > > > *offsets, > > > + int num_offsets) > > > +{ > > > + char buf[MaxBlocktableEntrySize]; > > > + BlocktableEntry *page = (BlocktableEntry *) buf; > > > > > > I'm not sure this is safe with alignment. Maybe rather than plain > > > "char", it needs to be a union with BlocktableEntry, or something. > > > > I tried it in the new patch set but could you explain why it could not > > be safe with alignment? > > I was thinking because "buf" is just an array of bytes. But, since the > next declaration is a cast to a pointer to the actual type, maybe we > can rely on the compiler to do the right thing. (It seems to on my > machine in any case) Okay, I kept it. > > > > About separation of responsibilities for locking: The only thing > > > currently where the tid store is not locked is tree iteration. That's > > > a strange exception. Also, we've recently made RT_FIND return a > > > pointer, so the caller must somehow hold a share lock, but I think we > > > haven't exposed callers the ability to do that, and we rely on the tid > > > store lock for that. We have a mix of tree locking and tid store > > > locking. We will need to consider carefully how to make this more > > > clear, maintainable, and understandable. > > > > Yes, tidstore should be locked during the iteration. > > > > One simple direction about locking is that the radix tree has the lock > > but no APIs hold/release it. It's the caller's responsibility. If a > > data structure using a radix tree for its storage has its own lock > > (like tidstore), it can use it instead of the radix tree's one. A > > It looks like the only reason tidstore has its own lock is because it > has no way to delegate locking to the tree's lock. 
Instead of working > around the limitations of the thing we've designed, let's make it work > for the one use case we have. I think we need to expose RT_LOCK_* > functions to the outside, and have tid store use them. That would > allow us to simplify all those "if (TidStoreIsShared(ts) > LWLockAcquire(..., ...)" calls, which are complex and often redundant. I agree that we expose RT_LOCK_* functions and have tidstore use them, but am not sure the if (TidStoreIsShared(ts) LWLockAcquire(..., ...)" calls part. I think that even if we expose them, we will still need to do something like "if (TidStoreIsShared(ts)) shared_rt_lock_share(ts->tree.shared)", no? > > At some point, we'll probably want to keep locking inside, at least to > smooth the way for fine-grained locking you mentioned. > > > > In a green field, it'd be fine to replace these with an expression of > > > "num_offsets", but it adds a bit of noise for reviewers and the git > > > log. Is it really necessary? > > > > I see your point. I think we can live with having both > > has_lpdead_items and num_offsets. But we will have to check if these > > values are consistent, which could be less maintainable. > > It would be clearer if that removal was split out into a separate patch. Agreed. > > > > I'm also not quite sure why "deadoffsets" and "lpdead_items" got > > > moved to the PruneState. The latter was renamed in a way that makes > > > more sense, but I don't see why the churn is necessary. > ... > > > I guess it was added here, 800 lines away? If so, why? > > > > The above changes are related. The idea is not to use tidstore in a > > one-pass strategy. If the table doesn't have any indexes, in > > lazy_scan_prune() we collect offset numbers of dead tuples on the page > > and vacuum the page using them. In this case, we don't need to use > > tidstore so we pass the offsets array to lazy_vacuum_heap_page(). The > > LVPagePruneState is a convenient place to store collected offset > > numbers. 
> > Okay, that makes sense, but if it was ever explained, I don't
> > remember, and there is nothing in the commit message either.
> >
> > I'm not sure this can be split up easily, but if so it might help reviewing.

Agreed.

> > This change also leads to a weird-looking control flow:
> >
> > if (vacrel->nindexes == 0)
> > {
> >     if (prunestate.num_offsets > 0)
> >     {
> >         ...
> >     }
> > }
> > else if (prunestate.num_offsets > 0)
> > {
> >     ...
> > }

Fixed.

I've attached a new patch set. From the v47 patch, I've merged your changes for the radix tree, and split the vacuum integration patch into 3 patches: simply replace VacDeadItems with TidStore (0007 patch), use a simple TID array for the one-pass strategy (0008 patch), and replace has_lpdead_items with "num_offsets > 0" (0009 patch), while incorporating your review comments on the vacuum integration patch (sorry for making it difficult to see the changes from the v47 patch).

The 0013 to 0015 patches are also updates from the v47 patch. I'm thinking that we should change the order of the patches so that tidstore patch
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Dec 26, 2023 at 12:43 PM Masahiko Sawada wrote:
> On Thu, Dec 21, 2023 at 4:41 PM John Naylor wrote:
> > +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
> > +                        int num_offsets)
> > +{
> > +    char buf[MaxBlocktableEntrySize];
> > +    BlocktableEntry *page = (BlocktableEntry *) buf;
> >
> > I'm not sure this is safe with alignment. Maybe rather than plain
> > "char", it needs to be a union with BlocktableEntry, or something.
>
> I tried it in the new patch set but could you explain why it could not
> be safe with alignment?

I was thinking because "buf" is just an array of bytes. But, since the next declaration is a cast to a pointer to the actual type, maybe we can rely on the compiler to do the right thing. (It seems to on my machine in any case.)

> > About separation of responsibilities for locking: The only thing
> > currently where the tid store is not locked is tree iteration. That's
> > a strange exception. Also, we've recently made RT_FIND return a
> > pointer, so the caller must somehow hold a share lock, but I think we
> > haven't exposed to callers the ability to do that, and we rely on the
> > tid store lock for that. We have a mix of tree locking and tid store
> > locking. We will need to consider carefully how to make this more
> > clear, maintainable, and understandable.
>
> Yes, tidstore should be locked during the iteration.
>
> One simple direction about locking is that the radix tree has the lock
> but no APIs hold/release it. It's the caller's responsibility. If a
> data structure using a radix tree for its storage has its own lock
> (like tidstore), it can use it instead of the radix tree's one.

It looks like the only reason tidstore has its own lock is that it has no way to delegate locking to the tree's lock. Instead of working around the limitations of the thing we've designed, let's make it work for the one use case we have.
I think we need to expose RT_LOCK_* functions to the outside, and have tid store use them. That would allow us to simplify all those "if (TidStoreIsShared(ts)) LWLockAcquire(..., ...)" calls, which are complex and often redundant.

At some point, we'll probably want to keep locking inside, at least to smooth the way for the fine-grained locking you mentioned.

> > In a green field, it'd be fine to replace these with an expression of
> > "num_offsets", but it adds a bit of noise for reviewers and the git
> > log. Is it really necessary?
>
> I see your point. I think we can live with having both
> has_lpdead_items and num_offsets. But we will have to check if these
> values are consistent, which could be less maintainable.

It would be clearer if that removal was split out into a separate patch.

> > I'm also not quite sure why "deadoffsets" and "lpdead_items" got
> > moved to the PruneState. The latter was renamed in a way that makes
> > more sense, but I don't see why the churn is necessary.
...
> > I guess it was added here, 800 lines away? If so, why?
>
> The above changes are related. The idea is not to use tidstore in a
> one-pass strategy. If the table doesn't have any indexes, in
> lazy_scan_prune() we collect offset numbers of dead tuples on the page
> and vacuum the page using them. In this case, we don't need to use
> tidstore so we pass the offsets array to lazy_vacuum_heap_page(). The
> LVPagePruneState is a convenient place to store collected offset
> numbers.

Okay, that makes sense, but if it was ever explained, I don't remember, and there is nothing in the commit message either. I'm not sure this can be split up easily, but if so it might help reviewing.

This change also leads to a weird-looking control flow:

if (vacrel->nindexes == 0)
{
    if (prunestate.num_offsets > 0)
    {
        ...
    }
}
else if (prunestate.num_offsets > 0)
{
    ...
}
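The alignment concern raised earlier in this message (a bare char buffer cast to a BlocktableEntry pointer) is conventionally settled with a union, which guarantees the storage is aligned for the struct type. A minimal sketch, with a hypothetical Entry type standing in for BlocktableEntry:

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical stand-in for BlocktableEntry */
typedef struct Entry
{
	uint64_t	header;
	uint64_t	words[8];
} Entry;

/*
 * Unlike a plain char array, every member of a union shares the union's
 * alignment, which is at least that of Entry, so taking and using a
 * pointer to the entry member is well-defined on any platform.
 */
typedef union EntryBuf
{
	Entry		entry;
	char		bytes[sizeof(Entry)];
} EntryBuf;
```

The char member keeps byte-wise access available (e.g. for copying a variable-length prefix), while the struct member carries the alignment requirement.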
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Dec 21, 2023 at 4:41 PM John Naylor wrote: > > On Thu, Dec 21, 2023 at 8:33 AM Masahiko Sawada wrote: > > > > I found the following comment and wanted to discuss: > > > > // this might be better as "iterate over nodes", plus a callback to > > RT_DUMP_NODE, > > // which should really only concern itself with single nodes > > RT_SCOPE void > > RT_DUMP(RT_RADIX_TREE *tree) > > > > If it means we need to somehow use the iteration functions also for > > dumping the whole tree, it would probably need to refactor the > > iteration codes so that the RT_DUMP() can use them while dumping > > visited nodes. But we need to be careful of not adding overheads to > > the iteration performance. > > Yeah, some months ago I thought a callback interface would make some > things easier. I don't think we need that at the moment (possibly > never), so that comment can be just removed. As far as these debug > functions, I only found useful the stats and dumping a single node, > FWIW. > > I've attached v47, which is v46 plus some fixes for radix tree. > > 0004 - moves everything for "delete" to the end -- gradually other > things will be grouped together in a sensible order > > 0005 - trivial LGTM. > > 0006 - shrink nodes -- still needs testing, but nothing crashes yet. Cool. The coverage test results showed the shrink codes are also covered. > This shows some renaming might be good: Previously we had > RT_CHUNK_CHILDREN_ARRAY_COPY for growing nodes, but for shrinking I've > added RT_COPY_ARRAYS_AND_DELETE, since the deletion happens by simply > not copying the slot to be deleted. This means when growing it would > be more clear to call the former RT_COPY_ARRAYS_FOR_INSERT, since that > reserves a new slot for the caller in the new node, but the caller > must do the insert itself. Agreed. > Note that there are some practical > restrictions/best-practices on whether shrinking should happen after > deletion or vice versa. 
Hopefully it's clear, but let me know if the > description can be improved. Also, it doesn't yet shrink from size > class 32 to 16, but it could with a bit of work. Sounds reasonable. > > 0007 - trivial, but could use a better comment. I also need to make > sure stats reporting works (may also need some cleanup work). > > 0008 - fixes RT_FREE_RECURSE -- I believe you wondered some months ago > if DSA could just free all our allocated segments without throwing > away the DSA, and that's still a good question. LGTM. > > 0009 - fixes the assert in RT_ITER_SET_NODE_FROM (btw, I don't think > this name is better than RT_UPDATE_ITER_STACK, so maybe we should go > back to that). Will rename it. > The assert doesn't fire, so I guess it does what it's > supposed to? Yes. > For me, the iteration logic is still the most confusing > piece out of the whole radix tree. Maybe that could be helped with > some better variable names, but I wonder if it needs more invasive > work. True. Maybe more comments would also help. > > 0010 - some fixes for number of children accounting in node256 > > 0011 - Long overdue pgindent of radixtree.h, without trying to fix up > afterwards. Feel free to throw out and redo if this interferes with > ongoing work. > LGTM. I'm working on the below review comments and most of them are already incorporated on the local branch: > The rest are from your v46. The bench doesn't work for tid store > anymore, so I squashed "disable bench for CI" until we get back to > that. Some more review comments (note: patch numbers are for v47, but > I changed nothing from v46 in this area): > > 0013: > > + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, > + * and stored in the radix tree. > > Recently outdated. The variable length values seems to work, so let's > make everything match. > > +#define MAX_TUPLES_PER_PAGE MaxOffsetNumber > > Maybe we don't need this macro anymore? The name no longer fits, in any case. Removed. 
> +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
> +                        int num_offsets)
> +{
> +    char buf[MaxBlocktableEntrySize];
> +    BlocktableEntry *page = (BlocktableEntry *) buf;
>
> I'm not sure this is safe with alignment. Maybe rather than plain
> "char", it needs to be a union with BlocktableEntry, or something.

I tried it in the new patch set but could you explain why it could not be safe with alignment?

> +static inline BlocktableEntry *
> +tidstore_iter_kv(TidStoreIter *iter, uint64 *key)
> +{
> +    if (TidStoreIsShared(iter->ts))
> +        return shared_rt_iterate_next(iter->tree_iter.shared, key);
> +
> +    return local_rt_iterate_next(iter->tree_iter.local, key);
> +}
>
> In the old encoding scheme, this function did something important, but
> now it's a useless wrapper with one caller.

Removed.

> + /*
> + * In the shared case, TidStoreControl and radix_tree are backed by the
> + * same DSA area and rt_memory_usage() returns the value including both.
> + * So we don't need to add the size of TidStoreControl
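Earlier in this message, RT_COPY_ARRAYS_AND_DELETE was described as performing the deletion by simply not copying the slot to be deleted when moving to a new (possibly smaller) node. A simplified standalone sketch of that pattern, with illustrative names rather than the template's actual code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Copy a node's parallel chunk/child arrays into a destination node,
 * skipping the slot at deletepos.  The skip *is* the deletion, so the
 * destination needs room for only count - 1 entries.
 */
static void
copy_arrays_and_delete(const uint8_t *src_chunks, const uintptr_t *src_children,
					   uint8_t *dst_chunks, uintptr_t *dst_children,
					   int count, int deletepos)
{
	int			j = 0;

	for (int i = 0; i < count; i++)
	{
		if (i == deletepos)
			continue;			/* the deletion happens here: skip the copy */
		dst_chunks[j] = src_chunks[i];
		dst_children[j] = src_children[i];
		j++;
	}
}
```

A grow-side counterpart (copy-for-insert) would instead leave a gap at the insert position for the caller to fill, which is the naming distinction discussed above.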
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Dec 21, 2023 at 6:27 PM Andres Freund wrote: > > Could either of you summarize what the design changes you've made in the last > months are and why you've done them? Unfortunately this thread is very long, > and the comments in the file just say "FIXME" in places that apparently are > affected by design changes. This makes it hard to catch up here. I'd be happy to try, since we are about due for a summary. I was also hoping to reach a coherent-enough state sometime in early January to request your feedback, so good timing. Not sure how much detail to go into, but here goes: Back in May [1], the method of value storage shifted towards "combined pointer-value slots", which was described and recommended in the paper. There were some other changes for simplicity and efficiency, but none as far-reaching as this. This is enabled by using the template architecture that we adopted long ago for different reasons. Fixed-length values are either stored in the slot of the last-level node (if the value fits into the platform's pointer), or are a "single-value" leaf (otherwise). For tid store, we want to eventually support bitmap heap scans (in addition to vacuum), and in doing so make it independent of heap AM. That means value types similar to PageTableEntry in tidbitmap.c, but with a variable number of bitmapwords. That required the radix tree to support variable-length values. That has been the main focus in the last several months, and it basically works now. To my mind, the biggest architectural issues in the patch today are: - Variable-length values mean that pointers are passed around in more places. This will require shifting some responsibility for locking to the caller, or longer-term maybe a callback interface. (This is new; the below are pre-existing issues.) - The tid store has its own "control object" (when shared memory is needed) with its own lock, in addition to the same for the associated radix tree. This leads to unnecessary double-locking. 
This area needs some attention. - Memory accounting is still unsettled. The current thinking is to cap max block/segment size, scaled to a fraction of m_w_m, but there are still open questions. There has been some recent effort toward finishing work started earlier, like shrinking nodes. There are a couple of places that could still use either simplification or optimization, but they otherwise work fine. Most of the remaining fixmes/todos/wips are trivial; a few are actually outdated now that I look again, and will be removed shortly. The regression tests could use some tidying up. -John [1] https://www.postgresql.org/message-id/CAFBsxsFyWLxweHVDtKb7otOCR4XdQGYR4b%2B9svxpVFnJs08BmQ%40mail.gmail.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2023-12-21 14:41:37 +0700, John Naylor wrote: > I've attached v47, which is v46 plus some fixes for radix tree. Could either of you summarize what the design changes you've made in the last months are and why you've done them? Unfortunately this thread is very long, and the comments in the file just say "FIXME" in places that apparently are affected by design changes. This makes it hard to catch up here. Greetings, Andres Freund
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Dec 21, 2023 at 8:33 AM Masahiko Sawada wrote: > > I found the following comment and wanted to discuss: > > // this might be better as "iterate over nodes", plus a callback to > RT_DUMP_NODE, > // which should really only concern itself with single nodes > RT_SCOPE void > RT_DUMP(RT_RADIX_TREE *tree) > > If it means we need to somehow use the iteration functions also for > dumping the whole tree, it would probably need to refactor the > iteration codes so that the RT_DUMP() can use them while dumping > visited nodes. But we need to be careful of not adding overheads to > the iteration performance. Yeah, some months ago I thought a callback interface would make some things easier. I don't think we need that at the moment (possibly never), so that comment can be just removed. As far as these debug functions, I only found useful the stats and dumping a single node, FWIW. I've attached v47, which is v46 plus some fixes for radix tree. 0004 - moves everything for "delete" to the end -- gradually other things will be grouped together in a sensible order 0005 - trivial 0006 - shrink nodes -- still needs testing, but nothing crashes yet. This shows some renaming might be good: Previously we had RT_CHUNK_CHILDREN_ARRAY_COPY for growing nodes, but for shrinking I've added RT_COPY_ARRAYS_AND_DELETE, since the deletion happens by simply not copying the slot to be deleted. This means when growing it would be more clear to call the former RT_COPY_ARRAYS_FOR_INSERT, since that reserves a new slot for the caller in the new node, but the caller must do the insert itself. Note that there are some practical restrictions/best-practices on whether shrinking should happen after deletion or vice versa. Hopefully it's clear, but let me know if the description can be improved. Also, it doesn't yet shrink from size class 32 to 16, but it could with a bit of work. 0007 - trivial, but could use a better comment. 
I also need to make sure stats reporting works (may also need some cleanup work). 0008 - fixes RT_FREE_RECURSE -- I believe you wondered some months ago if DSA could just free all our allocated segments without throwing away the DSA, and that's still a good question. 0009 - fixes the assert in RT_ITER_SET_NODE_FROM (btw, I don't think this name is better than RT_UPDATE_ITER_STACK, so maybe we should go back to that). The assert doesn't fire, so I guess it does what it's supposed to? For me, the iteration logic is still the most confusing piece out of the whole radix tree. Maybe that could be helped with some better variable names, but I wonder if it needs more invasive work. I confess I don't have better ideas for how it would work differently. 0010 - some fixes for number of children accounting in node256 0011 - Long overdue pgindent of radixtree.h, without trying to fix up afterwards. Feel free to throw out and redo if this interferes with ongoing work. The rest are from your v46. The bench doesn't work for tid store anymore, so I squashed "disable bench for CI" until we get back to that. Some more review comments (note: patch numbers are for v47, but I changed nothing from v46 in this area): 0013: + * Internally, a tid is encoded as a pair of 64-bit key and 64-bit value, + * and stored in the radix tree. Recently outdated. The variable-length values seem to work, so let's make everything match. +#define MAX_TUPLES_PER_PAGE MaxOffsetNumber Maybe we don't need this macro anymore? The name no longer fits, in any case. +TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, + int num_offsets) +{ + char buf[MaxBlocktableEntrySize]; + BlocktableEntry *page = (BlocktableEntry *) buf; I'm not sure this is safe with alignment. Maybe rather than plain "char", it needs to be a union with BlocktableEntry, or something. 
+static inline BlocktableEntry * +tidstore_iter_kv(TidStoreIter *iter, uint64 *key) +{ + if (TidStoreIsShared(iter->ts)) + return shared_rt_iterate_next(iter->tree_iter.shared, key); + + return local_rt_iterate_next(iter->tree_iter.local, key); +} In the old encoding scheme, this function did something important, but now it's a useless wrapper with one caller. + /* + * In the shared case, TidStoreControl and radix_tree are backed by the + * same DSA area and rt_memory_usage() returns the value including both. + * So we don't need to add the size of TidStoreControl separately. + */ + if (TidStoreIsShared(ts)) + return sizeof(TidStore) + shared_rt_memory_usage(ts->tree.shared); + + return sizeof(TidStore) + sizeof(TidStoreControl) + local_rt_memory_usage(ts->tree.local); I don't see the point in including these tiny structs, since we will always blow past the limit by a number of kilobytes (at least, often megabytes or more) at the time it happens. + iter->output.max_offset = 64; Maybe needs a comment that this is just some starting size and not anything particular. + iter->output.offsets = palloc(sizeof(OffsetNumber) * iter->output.max_offset); + /* Make sure there
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Dec 21, 2023 at 10:19 AM John Naylor wrote: > > On Wed, Dec 20, 2023 at 6:36 PM Masahiko Sawada wrote: > > > > I've updated the new patch set that incorporated comments I got so > > far. 0007, 0008, and 0012 patches are updates from the v45 patch set. > > In addition to the review comments, I made some changes in tidstore to > > make it independent from heap. Specifically, it uses MaxOffsetNumber > > instead of MaxHeapTuplesPerPage. Now we don't need to include > > htup_details.h. It enlarged MaxBlocktableEntrySize but it's still 272 > > bytes. > > That's a good idea. > > > BTW regarding the previous comment I got before: > > > > > - RT_PTR_ALLOC *slot; > > > + RT_PTR_ALLOC *slot = NULL; > > > > > > We have a macro for invalid pointer because of DSA. > > > > I think that since *slot is a pointer to a RT_PTR_ALLOC it's okay to set > > NULL. > > Ah right, it's the address of the slot. > > > I'm going to update RT_DUMP() and RT_DUMP_NODE() codes for the next step. > > That could probably use some discussion. A few months ago, I found the > debugging functions only worked when everything else worked. When > things weren't working, I had to rip one of these functions apart so > it only looked at one node. If something is broken, we can't count on > recursion or iteration working, because we won't get that far. I don't > remember how things are in the current patch. Agreed. I found the following comment and wanted to discuss: // this might be better as "iterate over nodes", plus a callback to RT_DUMP_NODE, // which should really only concern itself with single nodes RT_SCOPE void RT_DUMP(RT_RADIX_TREE *tree) If it means we need to somehow use the iteration functions also for dumping the whole tree, it would probably need to refactor the iteration codes so that the RT_DUMP() can use them while dumping visited nodes. But we need to be careful of not adding overheads to the iteration performance. 
> > I've finished the node shrinking and addressed some fixme/todo areas > -- can I share these and squash your v46 changes first? Cool! Yes, please do so. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Dec 20, 2023 at 6:36 PM Masahiko Sawada wrote: > > I've updated the new patch set that incorporated comments I got so > far. 0007, 0008, and 0012 patches are updates from the v45 patch set. > In addition to the review comments, I made some changes in tidstore to > make it independent from heap. Specifically, it uses MaxOffsetNumber > instead of MaxHeapTuplesPerPage. Now we don't need to include > htup_details.h. It enlarged MaxBlocktableEntrySize but it's still 272 > bytes. That's a good idea. > BTW regarding the previous comment I got before: > > > - RT_PTR_ALLOC *slot; > > + RT_PTR_ALLOC *slot = NULL; > > > > We have a macro for invalid pointer because of DSA. > > I think that since *slot is a pointer to a RT_PTR_ALLOC it's okay to set NULL. Ah right, it's the address of the slot. > I'm going to update RT_DUMP() and RT_DUMP_NODE() codes for the next step. That could probably use some discussion. A few months ago, I found the debugging functions only worked when everything else worked. When things weren't working, I had to rip one of these functions apart so it only looked at one node. If something is broken, we can't count on recursion or iteration working, because we won't get that far. I don't remember how things are in the current patch. I've finished the node shrinking and addressed some fixme/todo areas -- can I share these and squash your v46 changes first?
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Dec 19, 2023 at 4:37 PM John Naylor wrote: > > On Tue, Dec 19, 2023 at 12:37 PM Masahiko Sawada > wrote: > > > > On Mon, Dec 18, 2023 at 3:41 PM John Naylor wrote: > > > Let's do it in just one place. In TidStoreCreate(), do > > > > > > /* clamp max_bytes to at least the size of the empty tree with > > > allocated blocks, so it doesn't immediately appear full */ > > > ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage); > > > > > > Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all > > > that. > > > > But doesn't it mean that even if we create a shared tidstore with > > small memory, say 64kB, it actually uses 1MB? > > This sounds like an argument for controlling the minimum DSA segment > size. (I'm not really in favor of that, but open to others' opinion) > > I wasn't talking about that above -- I was saying we should have only > one place where we clamp max_bytes so that the tree doesn't > immediately appear full. Thank you for your clarification. Understood. I've updated the new patch set that incorporated comments I got so far. 0007, 0008, and 0012 patches are updates from the v45 patch set. In addition to the review comments, I made some changes in tidstore to make it independent from heap. Specifically, it uses MaxOffsetNumber instead of MaxHeapTuplesPerPage. Now we don't need to include htup_details.h. It enlarged MaxBlocktableEntrySize but it's still 272 bytes. BTW regarding the previous comment I got before: > - RT_PTR_ALLOC *slot; > + RT_PTR_ALLOC *slot = NULL; > > We have a macro for invalid pointer because of DSA. I think that since *slot is a pointer to a RT_PTR_ALLOC it's okay to set NULL. As for the initial and maximum DSA segment sizes, I've sent a summary on that thread: https://www.postgresql.org/message-id/CAD21AoCVMw6DSmgZY9h%2BxfzKtzJeqWiwxaUD2T-FztVcV-XibQ%40mail.gmail.com I'm going to update RT_DUMP() and RT_DUMP_NODE() codes for the next step. 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Dec 19, 2023 at 12:37 PM Masahiko Sawada wrote: > > On Mon, Dec 18, 2023 at 3:41 PM John Naylor wrote: > > Let's do it in just one place. In TidStoreCreate(), do > > > > /* clamp max_bytes to at least the size of the empty tree with > > allocated blocks, so it doesn't immediately appear full */ > > ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage); > > > > Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that. > > But doesn't it mean that even if we create a shared tidstore with > small memory, say 64kB, it actually uses 1MB? This sounds like an argument for controlling the minimum DSA segment size. (I'm not really in favor of that, but open to others' opinion) I wasn't talking about that above -- I was saying we should have only one place where we clamp max_bytes so that the tree doesn't immediately appear full.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Dec 18, 2023 at 3:41 PM John Naylor wrote: > > On Fri, Dec 15, 2023 at 3:15 PM Masahiko Sawada wrote: > > > > On Fri, Dec 15, 2023 at 10:30 AM John Naylor > > wrote: > > > > parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes, > > > - int nrequested_workers, int max_items, > > > - int elevel, BufferAccessStrategy bstrategy) > > > + int nrequested_workers, int vac_work_mem, > > > + int max_offset, int elevel, > > > + BufferAccessStrategy bstrategy) > > > > > > It seems very strange to me that this function has to pass the > > > max_offset. In general, it's been simpler to assume we have a constant > > > max_offset, but in this case that fact is not helping. Something to > > > think about for later. > > > > max_offset was previously used in old TID encoding in tidstore. Since > > tidstore has entries for each block, I think we no longer need it. > > It's needed now to properly size the allocation of TidStoreIter which > contains... > > +/* Result struct for TidStoreIterateNext */ > +typedef struct TidStoreIterResult > +{ > + BlockNumber blkno; > + int num_offsets; > + OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER]; > +} TidStoreIterResult; > > Maybe we can palloc the offset array to "almost always" big enough, > with logic to resize if needed? If not too hard, seems worth it to > avoid churn in the parameter list. Yes, I was thinking of that. > > > > v45-0010: > > > > > > Thinking about this some more, I'm not sure we need to do anything > > > different for the *starting* segment size. (Controlling *max* size > > > does seem important, however.) For the corner case of m_w_m = 1MB, > > > it's fine if vacuum quits pruning immediately after (in effect) it > > > finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If > > > the memory accounting starts >1MB because we're adding the trivial > > > size of some struct, let's just stop doing that. The segment > > > allocations are what we care about. 
> > > > IIUC it's for work_mem, whose the minimum value is 64kB. > > > > > > > > v45-0011: > > > > > > + /* > > > + * max_bytes is forced to be at least 64kB, the current minimum valid > > > + * value for the work_mem GUC. > > > + */ > > > + max_bytes = Max(64 * 1024L, max_bytes); > > > > > > Why? > > > > This is to avoid creating a radix tree within very small memory. The > > minimum work_mem value is a reasonable lower bound that PostgreSQL > > uses internally. It's actually copied from tuplesort.c. > > There is no explanation for why it should be done like tuplesort.c. Also... > > - tree->leaf_ctx = SlabContextCreate(ctx, > -"radix tree leaves", > -RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > -sizeof(RT_VALUE_TYPE)); > + tree->leaf_ctx = SlabContextCreate(ctx, > +"radix tree leaves", > +Min(RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > +work_mem), > +sizeof(RT_VALUE_TYPE)); > > At first, my eyes skipped over this apparent re-indent, but hidden > inside here is another (undocumented) attempt to clamp the size of > something. There are too many of these sprinkled in various places, > and they're already a maintenance hazard -- a different one was left > behind in v45-0011: > > @@ -201,6 +183,7 @@ TidStoreCreate(size_t max_bytes, int max_off, > dsa_area *area) > ts->control->max_bytes = max_bytes - (70 * 1024); > } > > Let's do it in just one place. In TidStoreCreate(), do > > /* clamp max_bytes to at least the size of the empty tree with > allocated blocks, so it doesn't immediately appear full */ > ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage); > > Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that. But doesn't it mean that even if we create a shared tidstore with small memory, say 64kB, it actually uses 1MB? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Dec 15, 2023 at 3:15 PM Masahiko Sawada wrote: > > On Fri, Dec 15, 2023 at 10:30 AM John Naylor wrote: > > parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes, > > - int nrequested_workers, int max_items, > > - int elevel, BufferAccessStrategy bstrategy) > > + int nrequested_workers, int vac_work_mem, > > + int max_offset, int elevel, > > + BufferAccessStrategy bstrategy) > > > > It seems very strange to me that this function has to pass the > > max_offset. In general, it's been simpler to assume we have a constant > > max_offset, but in this case that fact is not helping. Something to > > think about for later. > > max_offset was previously used in old TID encoding in tidstore. Since > tidstore has entries for each block, I think we no longer need it. It's needed now to properly size the allocation of TidStoreIter which contains... +/* Result struct for TidStoreIterateNext */ +typedef struct TidStoreIterResult +{ + BlockNumber blkno; + int num_offsets; + OffsetNumber offsets[FLEXIBLE_ARRAY_MEMBER]; +} TidStoreIterResult; Maybe we can palloc the offset array to "almost always" big enough, with logic to resize if needed? If not too hard, seems worth it to avoid churn in the parameter list. > > v45-0010: > > > > Thinking about this some more, I'm not sure we need to do anything > > different for the *starting* segment size. (Controlling *max* size > > does seem important, however.) For the corner case of m_w_m = 1MB, > > it's fine if vacuum quits pruning immediately after (in effect) it > > finds the DSA has gone to 2MB. It's not worth bothering with, IMO. If > > the memory accounting starts >1MB because we're adding the trivial > > size of some struct, let's just stop doing that. The segment > > allocations are what we care about. > > IIUC it's for work_mem, whose the minimum value is 64kB. > > > > > v45-0011: > > > > + /* > > + * max_bytes is forced to be at least 64kB, the current minimum valid > > + * value for the work_mem GUC. 
> > + */ > > + max_bytes = Max(64 * 1024L, max_bytes); > > > > Why? > > This is to avoid creating a radix tree within very small memory. The > minimum work_mem value is a reasonable lower bound that PostgreSQL > uses internally. It's actually copied from tuplesort.c. There is no explanation for why it should be done like tuplesort.c. Also... - tree->leaf_ctx = SlabContextCreate(ctx, -"radix tree leaves", -RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), -sizeof(RT_VALUE_TYPE)); + tree->leaf_ctx = SlabContextCreate(ctx, +"radix tree leaves", +Min(RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), +work_mem), +sizeof(RT_VALUE_TYPE)); At first, my eyes skipped over this apparent re-indent, but hidden inside here is another (undocumented) attempt to clamp the size of something. There are too many of these sprinkled in various places, and they're already a maintenance hazard -- a different one was left behind in v45-0011: @@ -201,6 +183,7 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area) ts->control->max_bytes = max_bytes - (70 * 1024); } Let's do it in just one place. In TidStoreCreate(), do /* clamp max_bytes to at least the size of the empty tree with allocated blocks, so it doesn't immediately appear full */ ts->control->max_bytes = Max(max_bytes, {rt, shared_rt}_memory_usage); Then we can get rid of all the worry about 1MB/2MB, 64kB, 70kB -- all that. I may not recall everything while writing this, but it seems the only other thing we should be clamping is the max aset block size (solved) / max DSM segment size (in progress).
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Dec 15, 2023 at 10:30 AM John Naylor wrote: > > On Thu, Dec 14, 2023 at 7:22 AM Masahiko Sawada wrote: > > In v45, 0001 - 0006 are from earlier versions but I've merged previous > > updates. So the radix tree now has RT_SET() and RT_FIND() but not > > RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous > > versions that incorporated the above comments. 0009 patch integrates > > tidstore with lazy vacuum. > > Excellent! I repeated a quick run of the small "test 1" with very low m_w_m > from > > https://www.postgresql.org/message-id/CAFBsxsHrvTPUK%3DC1%3DxweJjGujja4Xjfgva3C8jnW3Shz6RBnFg%40mail.gmail.com > > ...and got similar results, so we still have good space-efficiency on this > test: > > master: > INFO: finished vacuuming "john.public.test": index scans: 9 > system usage: CPU: user: 56.83 s, system: 9.36 s, elapsed: 119.62 s > > v45: > INFO: finished vacuuming "john.public.test": index scans: 1 > system usage: CPU: user: 6.82 s, system: 2.05 s, elapsed: 10.89 s Thank you for testing it again. That's a very good result. > For my next steps, I will finish the node-shrinking behavior and save > for a later patchset. Not needed for tid store, but needs to happen > because of assumptions in the code. Also, some time ago, I think I > commented out RT_FREE_RECURSE to get something working, so I'll fix > it, and look at other fixmes and todos. Great! > > > Note that DSA segment problem is not > > resolved yet in this patch. > > I remember you started a separate thread about this, but I don't think > it got any attention. Maybe reply with a "TLDR;" and share a patch to > allow controlling max segment size. Yeah, I recalled that thread. Will send a reply. > > Some more comments: > > v45-0003: > > Since RT_ITERATE_NEXT_PTR works for tid store, do we even need > RT_ITERATE_NEXT anymore? The former should handle fixed-length values > just fine? If so, we should rename it to match the latter. Agreed to rename it. 
> > + * The caller is responsible for locking/unlocking the tree in shared mode. > > This is not new to v45, but this will come up again below. This needs > more explanation: Since we're returning a pointer (to support > variable-length values), the caller needs to maintain control until > it's finished with the value. Will fix. > > v45-0005: > > + * Regarding the concurrency support, we use a single LWLock for the > TidStore. > + * The TidStore is exclusively locked when inserting encoded tids to the > + * radix tree or when resetting itself. When searching on the TidStore or > + * doing the iteration, it is not locked but the underlying radix tree is > + * locked in shared mode. > > This is just stating facts without giving any reasons. Readers are > going to wonder why it's inconsistent. The "why" is much more > important than the "what". Even with that, this comment is also far > from the relevant parts, and so will get out of date. Maybe we can > just make sure each relevant function is explained individually. Right, I'll fix it. > > v45-0007: > > -RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx); > +RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, Size work_mem); > > Tid store calls this max_bytes -- can we use that name here, too? > "work_mem" is highly specific. While I agree that "work_mem" is highly specific, I avoided using "max_bytes" in radix tree because "max_bytes" sounds to me like there is a memory limitation, but the radix tree doesn't actually have one. It might be sufficient to mention it in the comment, though. > > - RT_PTR_ALLOC *slot; > + RT_PTR_ALLOC *slot = NULL; > > We have a macro for invalid pointer because of DSA. Will fix. > > v45-0008: > > - if (off < 1 || off > MAX_TUPLES_PER_PAGE) > + if (unlikely(off < 1 || off > MAX_TUPLES_PER_PAGE)) > elog(ERROR, "tuple offset out of range: %u", off); > > This is a superfluous distraction, since the error path is located way > off in the cold segment of the binary. Okay, will remove it. 
> > v45-0009: > > (just a few small things for now) > > - * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the > - * vacrel->dead_items array. > + * lazy_vacuum_heap_page() -- free page's LP_DEAD items. > > I think we can keep as "listed in the TID store". > > - * Allocate dead_items (either using palloc, or in dynamic shared memory). > - * Sets dead_items in vacrel for caller. > + * Allocate a (local or shared) TidStore for storing dead TIDs. Sets > dead_items > + * in vacrel for caller. > > I think we want to keep "in dynamic shared memory". It's still true. > I'm not sure anything needs to change here, actually. Agreed with above comments. Will fix them. > > parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes, > - int nrequested_workers, int max_items, > - int elevel, BufferAccessStrategy bstrategy) > + int nrequested_workers, int vac_work_mem, > + int max_offset, int elevel, > + BufferAccessStrategy bstrategy) > > It seems very strange to me that this function
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Dec 14, 2023 at 7:22 AM Masahiko Sawada wrote: > In v45, 0001 - 0006 are from earlier versions but I've merged previous > updates. So the radix tree now has RT_SET() and RT_FIND() but not > RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous > versions that incorporated the above comments. 0009 patch integrates > tidstore with lazy vacuum. Excellent! I repeated a quick run of the small "test 1" with very low m_w_m from https://www.postgresql.org/message-id/CAFBsxsHrvTPUK%3DC1%3DxweJjGujja4Xjfgva3C8jnW3Shz6RBnFg%40mail.gmail.com ...and got similar results, so we still have good space-efficiency on this test: master: INFO: finished vacuuming "john.public.test": index scans: 9 system usage: CPU: user: 56.83 s, system: 9.36 s, elapsed: 119.62 s v45: INFO: finished vacuuming "john.public.test": index scans: 1 system usage: CPU: user: 6.82 s, system: 2.05 s, elapsed: 10.89 s More sparse TID distributions won't be as favorable, but we have ideas to improve that in the future. For my next steps, I will finish the node-shrinking behavior and save for a later patchset. Not needed for tid store, but needs to happen because of assumptions in the code. Also, some time ago, I think I commented out RT_FREE_RECURSE to get something working, so I'll fix it, and look at other fixmes and todos. > Note that DSA segment problem is not > resolved yet in this patch. I remember you started a separate thread about this, but I don't think it got any attention. Maybe reply with a "TLDR;" and share a patch to allow controlling max segment size. Some more comments: v45-0003: Since RT_ITERATE_NEXT_PTR works for tid store, do we even need RT_ITERATE_NEXT anymore? The former should handle fixed-length values just fine? If so, we should rename it to match the latter. + * The caller is responsible for locking/unlocking the tree in shared mode. This is not new to v45, but this will come up again below. 
This needs more explanation: Since we're returning a pointer (to support variable-length values), the caller needs to maintain control until it's finished with the value. v45-0005: + * Regarding the concurrency support, we use a single LWLock for the TidStore. + * The TidStore is exclusively locked when inserting encoded tids to the + * radix tree or when resetting itself. When searching on the TidStore or + * doing the iteration, it is not locked but the underlying radix tree is + * locked in shared mode. This is just stating facts without giving any reasons. Readers are going to wonder why it's inconsistent. The "why" is much more important than the "what". Even with that, this comment is also far from the relevant parts, and so will get out of date. Maybe we can just make sure each relevant function is explained individually. v45-0007: -RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx); +RT_SCOPE RT_RADIX_TREE * RT_CREATE(MemoryContext ctx, Size work_mem); Tid store calls this max_bytes -- can we use that name here, too? "work_mem" is highly specific. - RT_PTR_ALLOC *slot; + RT_PTR_ALLOC *slot = NULL; We have a macro for invalid pointer because of DSA. v45-0008: - if (off < 1 || off > MAX_TUPLES_PER_PAGE) + if (unlikely(off < 1 || off > MAX_TUPLES_PER_PAGE)) elog(ERROR, "tuple offset out of range: %u", off); This is a superfluous distraction, since the error path is located way off in the cold segment of the binary. v45-0009: (just a few small things for now) - * lazy_vacuum_heap_page() -- free page's LP_DEAD items listed in the - * vacrel->dead_items array. + * lazy_vacuum_heap_page() -- free page's LP_DEAD items. I think we can keep as "listed in the TID store". - * Allocate dead_items (either using palloc, or in dynamic shared memory). - * Sets dead_items in vacrel for caller. + * Allocate a (local or shared) TidStore for storing dead TIDs. Sets dead_items + * in vacrel for caller. I think we want to keep "in dynamic shared memory". It's still true. 
I'm not sure anything needs to change here, actually. parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes, - int nrequested_workers, int max_items, - int elevel, BufferAccessStrategy bstrategy) + int nrequested_workers, int vac_work_mem, + int max_offset, int elevel, + BufferAccessStrategy bstrategy) It seems very strange to me that this function has to pass the max_offset. In general, it's been simpler to assume we have a constant max_offset, but in this case that fact is not helping. Something to think about for later. - (errmsg("scanned index \"%s\" to remove %d row versions", + (errmsg("scanned index \"%s\" to remove " UINT64_FORMAT " row versions", This should be signed int64. v45-0010: Thinking about this some more, I'm not sure we need to do anything different for the *starting* segment size. (Controlling *max* size does seem important, however.) For the corner case of m_w_m = 1MB, it's fine if vacuum quits pruning immediately after (in effect) it finds the DSA has gone to 2MB. It's not worth
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Dec 12, 2023 at 11:53 AM John Naylor wrote: > > On Mon, Dec 11, 2023 at 1:18 PM Masahiko Sawada wrote: > > > I've attached the updated patch set. From the previous patch set, I've > > merged patches 0007 to 0010. The other changes such as adding RT_GET() > > still are unmerged for now, for discussion. Probably we can make them > > as follow-up patches as we discussed. 0011 to 0015 patches are new > > changes for v44 patch set, which removes RT_SEARCH() and RT_SET() and > > support variable-length values. > > This looks like the right direction, and I'm pleased it's not much > additional code on top of my last patch. > > v44-0014: > > +#ifdef RT_VARLEN_VALUE > + /* XXX: need to choose block sizes? */ > + tree->leaf_ctx = AllocSetContextCreate(ctx, > +"radix tree leaves", > +ALLOCSET_DEFAULT_SIZES); > +#else > + tree->leaf_ctx = SlabContextCreate(ctx, > +"radix tree leaves", > +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), > +sizeof(RT_VALUE_TYPE)); > +#endif /* RT_VARLEN_VALUE */ > > Choosing block size: Similar to what we've discussed previously around > DSA segments, we might model this on CreateWorkExprContext() in > src/backend/executor/execUtils.c. Maybe tid store can pass maint_w_m / > autovac_w_m (later work_mem for bitmap scan). RT_CREATE could set the > max block size to 1/16 of that, or less. > > Also, it occurred to me that compile-time embeddable values don't need > a leaf context. I'm not sure how many places assume that there is > always a leaf context. If not many, it may be worth not creating one > here, just to be tidy. > > + size_t copysize; > > - memcpy(leaf.local, value_p, sizeof(RT_VALUE_TYPE)); > + copysize = sizeof(RT_VALUE_TYPE); > +#endif > + > + memcpy(leaf.local, value_p, copysize); > > I'm not sure this indirection adds clarity. I guess the intent was to > keep from saying "memcpy" twice, but now the code has to say "copysize > = foo" twice. > > For varlen case, we need to watch out for slowness because of memcpy. 
> Let's put that off for later testing, though. We may someday want to > avoid a memcpy call for the varlen case, so let's keep it flexible > here. > > v44-0015: > > +#define SizeOfBlocktableEntry (offsetof( > > Unused. > > + char buf[MaxBlocktableEntrySize] = {0}; > > Zeroing this buffer is probably going to be expensive. Also see this > pre-existing comment: > /* WIP: slow, since it writes to memory for every bit */ > page->words[wordnum] |= ((bitmapword) 1 << bitnum); > > For this function (which will be vacuum-only, so we can assume > ordering), in the loop we can: > * declare the local bitmapword variable to be zero > * set the bits on it > * write it out to the right location when done. > > Let's fix both of these at once. > > + if (TidStoreIsShared(ts)) > + shared_rt_set(ts->tree.shared, blkno, (void *) page, page_len); > + else > + local_rt_set(ts->tree.local, blkno, (void *) page, page_len); > > Is there a reason for "void *"? The declared parameter is > "RT_VALUE_TYPE *value_p" in 0014. > Also, since this function is for vacuum (and other uses will need a > new function), let's assert the returned bool is false. > > Does iteration still work? If so, it's not too early to re-wire this > up with vacuum and see how it behaves. > > Lastly, my compiler has a warning that CI doesn't have: > > In file included from ../src/test/modules/test_radixtree/test_radixtree.c:121: > ../src/include/lib/radixtree.h: In function ‘rt_find.isra’: > ../src/include/lib/radixtree.h:2142:24: warning: ‘slot’ may be used > uninitialized [-Wmaybe-uninitialized] > 2142 | return (RT_VALUE_TYPE*) slot; > |^ > ../src/include/lib/radixtree.h:2112:23: note: ‘slot’ was declared here > 2112 | RT_PTR_ALLOC *slot; > | ^~~~ Thank you for the comments! I agreed with all of them and incorporated them into the attached latest patch set, v45. In v45, 0001 - 0006 are from earlier versions but I've merged previous updates. 
So the radix tree now has RT_SET() and RT_FIND() but not RT_GET() and RT_SEARCH(). 0007 and 0008 are the updates from previous versions that incorporated the above comments. The 0009 patch integrates tidstore with lazy vacuum. Note that the DSA segment problem is not resolved yet in this patch. 0010 and 0011 make the DSA initial/max segment sizes configurable and have parallel vacuum specify both in proportion to maintenance_work_mem. 0012 is a development-purpose patch to make it easy to investigate bugs in tidstore. I'd like to keep it in the patch set at least during development. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v45-ART.tar.gz Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Dec 11, 2023 at 1:18 PM Masahiko Sawada wrote: > I've attached the updated patch set. From the previous patch set, I've > merged patches 0007 to 0010. The other changes such as adding RT_GET() > still are unmerged for now, for discussion. Probably we can make them > as follow-up patches as we discussed. 0011 to 0015 patches are new > changes for v44 patch set, which removes RT_SEARCH() and RT_SET() and > support variable-length values. This looks like the right direction, and I'm pleased it's not much additional code on top of my last patch. v44-0014: +#ifdef RT_VARLEN_VALUE + /* XXX: need to choose block sizes? */ + tree->leaf_ctx = AllocSetContextCreate(ctx, +"radix tree leaves", +ALLOCSET_DEFAULT_SIZES); +#else + tree->leaf_ctx = SlabContextCreate(ctx, +"radix tree leaves", +RT_SLAB_BLOCK_SIZE(sizeof(RT_VALUE_TYPE)), +sizeof(RT_VALUE_TYPE)); +#endif /* RT_VARLEN_VALUE */ Choosing block size: Similar to what we've discussed previously around DSA segments, we might model this on CreateWorkExprContext() in src/backend/executor/execUtils.c. Maybe tid store can pass maint_w_m / autovac_w_m (later work_mem for bitmap scan). RT_CREATE could set the max block size to 1/16 of that, or less. Also, it occurred to me that compile-time embeddable values don't need a leaf context. I'm not sure how many places assume that there is always a leaf context. If not many, it may be worth not creating one here, just to be tidy. + size_t copysize; - memcpy(leaf.local, value_p, sizeof(RT_VALUE_TYPE)); + copysize = sizeof(RT_VALUE_TYPE); +#endif + + memcpy(leaf.local, value_p, copysize); I'm not sure this indirection adds clarity. I guess the intent was to keep from saying "memcpy" twice, but now the code has to say "copysize = foo" twice. For varlen case, we need to watch out for slowness because of memcpy. Let's put that off for later testing, though. We may someday want to avoid a memcpy call for the varlen case, so let's keep it flexible here. 
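To make the block-size suggestion above concrete, here is a rough sketch of how RT_CREATE might derive an allocator max block size from a memory budget, loosely modeled on CreateWorkExprContext(); the function name and the clamping constants are illustrative assumptions, not actual PostgreSQL code:

```c
#include <assert.h>
#include <stddef.h>

#define INIT_BLOCK_SIZE  (8 * 1024)			/* illustrative bounds */
#define MAX_BLOCK_SIZE   (8 * 1024 * 1024)

/* largest power of 2 that is <= v (returns 1 for v < 2) */
static size_t
prev_power2(size_t v)
{
	size_t		p = 1;

	while (p * 2 <= v)
		p *= 2;
	return p;
}

/*
 * Cap the allocator's max block size at 1/16 of the memory budget,
 * rounded down to a power of 2 and clamped to sane bounds.
 */
static size_t
choose_max_block_size(size_t max_bytes)
{
	size_t		max_block = prev_power2(max_bytes / 16);

	if (max_block < INIT_BLOCK_SIZE)
		max_block = INIT_BLOCK_SIZE;
	if (max_block > MAX_BLOCK_SIZE)
		max_block = MAX_BLOCK_SIZE;
	return max_block;
}
```

For example, a 64MB maintenance_work_mem budget yields 4MB blocks, while very small budgets fall back to the minimum block size.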
v44-0015: +#define SizeOfBlocktableEntry (offsetof( Unused. + char buf[MaxBlocktableEntrySize] = {0}; Zeroing this buffer is probably going to be expensive. Also see this pre-existing comment: /* WIP: slow, since it writes to memory for every bit */ page->words[wordnum] |= ((bitmapword) 1 << bitnum); For this function (which will be vacuum-only, so we can assume ordering), in the loop we can: * declare the local bitmapword variable to be zero * set the bits on it * write it out to the right location when done. Let's fix both of these at once. + if (TidStoreIsShared(ts)) + shared_rt_set(ts->tree.shared, blkno, (void *) page, page_len); + else + local_rt_set(ts->tree.local, blkno, (void *) page, page_len); Is there a reason for "void *"? The declared parameter is "RT_VALUE_TYPE *value_p" in 0014. Also, since this function is for vacuum (and other uses will need a new function), let's assert the returned bool is false. Does iteration still work? If so, it's not too early to re-wire this up with vacuum and see how it behaves. Lastly, my compiler has a warning that CI doesn't have: In file included from ../src/test/modules/test_radixtree/test_radixtree.c:121: ../src/include/lib/radixtree.h: In function ‘rt_find.isra’: ../src/include/lib/radixtree.h:2142:24: warning: ‘slot’ may be used uninitialized [-Wmaybe-uninitialized] 2142 | return (RT_VALUE_TYPE*) slot; |^ ../src/include/lib/radixtree.h:2112:23: note: ‘slot’ was declared here 2112 | RT_PTR_ALLOC *slot; | ^~~~
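For reference, the accumulate-then-store pattern suggested above for the bit-setting loop could look roughly like this. The typedef and WORDNUM/BITNUM macros are simplified stand-ins for PostgreSQL's bitmapword machinery, and the code assumes offsets arrive sorted, as in vacuum:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t bitmapword;	/* stand-in for PostgreSQL's bitmapword */
#define BITS_PER_BITMAPWORD 64
#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)

/*
 * Set bits for sorted offsets: accumulate each word in a local variable
 * and store it once, instead of a read-modify-write to memory per bit.
 * Returns the number of words written.
 */
static int
set_offsets(bitmapword *words, const int *offsets, int num)
{
	int			nwritten = 0;
	int			i = 0;

	while (i < num)
	{
		int			wordnum = WORDNUM(offsets[i]);
		bitmapword	w = 0;		/* local accumulator, starts at zero */

		while (i < num && WORDNUM(offsets[i]) == wordnum)
			w |= (bitmapword) 1 << BITNUM(offsets[i++]);

		words[wordnum] = w;		/* single store per word */
		nwritten++;
	}
	return nwritten;
}

/* small self-check exercising bits in two different words */
static int
set_offsets_self_check(void)
{
	bitmapword	words[2] = {0, 0};
	int			offs[] = {0, 1, 63, 64, 100};

	if (set_offsets(words, offs, 5) != 2)
		return 0;
	if (words[0] != ((((bitmapword) 1) << 63) | 3))
		return 0;
	if (words[1] != (((bitmapword) 1) | (((bitmapword) 1) << 36)))
		return 0;
	return 1;
}
```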
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Dec 8, 2023 at 9:44 PM Masahiko Sawada wrote: > > On Fri, Dec 8, 2023 at 7:46 PM John Naylor wrote: > > > > On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada > > wrote: > > > > > > BTW Given that the actual value size can be calculated only by the > > > caller, how does the tree know if the value is embedded or not? It's > > > probably related to how to store combined pointer/value slots. > > > > Right, this is future work. At first, variable-length types will have > > to be single-value leaves. In fact, the idea for storing up to 3 > > offsets in the bitmap header could be done this way -- it would just > > be a (small) single-value leaf. > > Agreed. > > > > > (Reminder: Currently, fixed-length values are compile-time embeddable > > if the platform pointer size is big enough.) > > > > > If leaf > > > nodes have a bitmap array that indicates the corresponding slot is an > > > embedded value or a pointer to a value, it would be easy. > > > > That's the most general way to do it. We could do it much more easily > > with a pointer tag, although for the above idea it may require some > > endian-aware coding. Both were mentioned in the paper, I recall. > > True. Probably we can use the combined pointer/value slots approach > only if the tree is able to use the pointer tagging. That is, if the > caller allows the tree to use one bit of the value. > > I'm going to update the patch based on the recent discussion (RT_SET() > and variable-length values) etc., and post the patch set early next > week. I've attached the updated patch set. From the previous patch set, I've merged patches 0007 to 0010. The other changes such as adding RT_GET() still are unmerged for now, for discussion. Probably we can make them as follow-up patches as we discussed. 0011 to 0015 patches are new changes for v44 patch set, which removes RT_SEARCH() and RT_SET() and support variable-length values. 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v44-ART.tar.gz Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Dec 8, 2023 at 7:46 PM John Naylor wrote: > > On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada wrote: > > > > BTW Given that the actual value size can be calculated only by the > > caller, how does the tree know if the value is embedded or not? It's > > probably related to how to store combined pointer/value slots. > > Right, this is future work. At first, variable-length types will have > to be single-value leaves. In fact, the idea for storing up to 3 > offsets in the bitmap header could be done this way -- it would just > be a (small) single-value leaf. Agreed. > > (Reminder: Currently, fixed-length values are compile-time embeddable > if the platform pointer size is big enough.) > > > If leaf > > nodes have a bitmap array that indicates the corresponding slot is an > > embedded value or a pointer to a value, it would be easy. > > That's the most general way to do it. We could do it much more easily > with a pointer tag, although for the above idea it may require some > endian-aware coding. Both were mentioned in the paper, I recall. True. Probably we can use the combined pointer/value slots approach only if the tree is able to use the pointer tagging. That is, if the caller allows the tree to use one bit of the value. I'm going to update the patch based on the recent discussion (RT_SET() and variable-length values) etc., and post the patch set early next week. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Dec 8, 2023 at 3:06 PM Masahiko Sawada wrote: > > BTW Given that the actual value size can be calculated only by the > caller, how does the tree know if the value is embedded or not? It's > probably related to how to store combined pointer/value slots. Right, this is future work. At first, variable-length types will have to be single-value leaves. In fact, the idea for storing up to 3 offsets in the bitmap header could be done this way -- it would just be a (small) single-value leaf. (Reminder: Currently, fixed-length values are compile-time embeddable if the platform pointer size is big enough.) > If leaf > nodes have a bitmap array that indicates the corresponding slot is an > embedded value or a pointer to a value, it would be easy. That's the most general way to do it. We could do it much more easily with a pointer tag, although for the above idea it may require some endian-aware coding. Both were mentioned in the paper, I recall. > But since > the bitmap array is needed only in the leaf nodes, internal nodes and > leaf nodes will no longer be identical structure, which is not a bad > thing to me, though. Absolutely no way we are going back to double everything: double types, double functions, double memory contexts. Plus, that bitmap in inner nodes could indicate a pointer to a leaf that got there by "lazy expansion".
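A minimal sketch of the pointer-tag idea mentioned above, assuming allocations are at least pointer-aligned so the lowest bit of a slot is free to mean "embedded value, not a pointer"; the names and the 32-bit payload are hypothetical, not from the patch:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * A real pointer from the allocator always has its low bit clear,
 * so that bit can flag an embedded value stored directly in the slot.
 */
#define SLOT_EMBEDDED_FLAG ((uintptr_t) 1)

static uintptr_t
slot_embed(uint32_t value)
{
	/* shift left so the tag bit doesn't collide with value bits */
	return ((uintptr_t) value << 1) | SLOT_EMBEDDED_FLAG;
}

static int
slot_is_embedded(uintptr_t slot)
{
	return (slot & SLOT_EMBEDDED_FLAG) != 0;
}

static uint32_t
slot_get_embedded(uintptr_t slot)
{
	return (uint32_t) (slot >> 1);
}
```

This is what "the caller allows the tree to use one bit of the value" amounts to: the embeddable payload loses one bit of width in exchange for needing no separate bitmap in the node.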
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Dec 8, 2023 at 3:45 PM Masahiko Sawada wrote: > > On Fri, Dec 8, 2023 at 1:37 PM John Naylor wrote: > > > > On Fri, Dec 8, 2023 at 8:57 AM Masahiko Sawada > > wrote: > > > > > It's still unclear to me why the value doesn't need to contain the size. > > > > > > If I understand you correctly, in RT_GET(), the tree allocs a new > > > memory and updates the slot where the value is embedded with the > > > pointer to the allocated memory, and returns the pointer to the > > > caller. Since the returned value, newly allocated memory, is still > > > empty, the callner needs to copy the contents of the old value to the > > > new value and do whatever else it needs to. > > > > > > If the value is already a single-leave value and RT_GET() is called > > > with a larger size, the slot is always replaced with the newly > > > allocated area and the caller needs to copy the contents? If the tree > > > does realloc the value with a new size, how does the tree know the new > > > value is larger than the existing value? It seems like the caller > > > needs to provide a function to calculate the size of the value based > > > on the length. > > > > Right. My brief description mentioned one thing without details: The > > caller would need to control whether to re-alloc. RT_GET would pass > > the size. If nothing is found, the tree would allocate. If there is a > > value already, just return it. That means both the address of the > > slot, and the local pointer to the value (with embedded, would be the > > same address). BTW Given that the actual value size can be calculated only by the caller, how does the tree know if the value is embedded or not? It's probably related to how to store combined pointer/value slots. If leaf nodes have a bitmap array that indicates the corresponding slot is an embedded value or a pointer to a value, it would be easy. 
But since the bitmap array is needed only in the leaf nodes, internal nodes and leaf nodes will no longer be identical structure, which is not a bad thing to me, though. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Dec 8, 2023 at 1:37 PM John Naylor wrote: > > On Fri, Dec 8, 2023 at 8:57 AM Masahiko Sawada wrote: > > > It's still unclear to me why the value doesn't need to contain the size. > > > > If I understand you correctly, in RT_GET(), the tree allocs a new > > memory and updates the slot where the value is embedded with the > > pointer to the allocated memory, and returns the pointer to the > > caller. Since the returned value, newly allocated memory, is still > > empty, the callner needs to copy the contents of the old value to the > > new value and do whatever else it needs to. > > > > If the value is already a single-leave value and RT_GET() is called > > with a larger size, the slot is always replaced with the newly > > allocated area and the caller needs to copy the contents? If the tree > > does realloc the value with a new size, how does the tree know the new > > value is larger than the existing value? It seems like the caller > > needs to provide a function to calculate the size of the value based > > on the length. > > Right. My brief description mentioned one thing without details: The > caller would need to control whether to re-alloc. RT_GET would pass > the size. If nothing is found, the tree would allocate. If there is a > value already, just return it. That means both the address of the > slot, and the local pointer to the value (with embedded, would be the > same address). The caller checks if the array is long enough. If not, > call a new function that takes the new size, the address of the slot, > and the pointer to the old value. The tree would re-alloc, put the > alloc pointer in the slot and return the new local pointer. But as we > agreed, that is all follow-up work. Thank you for the detailed explanation. That makes sense to me. We will address it as a follow-up work. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Dec 8, 2023 at 8:57 AM Masahiko Sawada wrote: > It's still unclear to me why the value doesn't need to contain the size. > > If I understand you correctly, in RT_GET(), the tree allocs a new > memory and updates the slot where the value is embedded with the > pointer to the allocated memory, and returns the pointer to the > caller. Since the returned value, newly allocated memory, is still > empty, the callner needs to copy the contents of the old value to the > new value and do whatever else it needs to. > > If the value is already a single-leave value and RT_GET() is called > with a larger size, the slot is always replaced with the newly > allocated area and the caller needs to copy the contents? If the tree > does realloc the value with a new size, how does the tree know the new > value is larger than the existing value? It seems like the caller > needs to provide a function to calculate the size of the value based > on the length. Right. My brief description mentioned one thing without details: The caller would need to control whether to re-alloc. RT_GET would pass the size. If nothing is found, the tree would allocate. If there is a value already, just return it. That means both the address of the slot, and the local pointer to the value (with embedded, would be the same address). The caller checks if the array is long enough. If not, call a new function that takes the new size, the address of the slot, and the pointer to the old value. The tree would re-alloc, put the alloc pointer in the slot and return the new local pointer. But as we agreed, that is all follow-up work.
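The get-then-grow protocol described above can be illustrated with a toy single-slot "tree"; the names are hypothetical, and a real radix tree would do this per leaf slot (with DSA pointers in the shared case):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy single-slot "tree", standing in for one radix tree leaf slot. */
typedef struct ToyTree
{
	void	   *slot;
} ToyTree;

/* RT_GET-like: allocate a zeroed value on first access, else return it */
static void *
toy_get(ToyTree *tree, size_t sz, int *found)
{
	if (tree->slot == NULL)
	{
		tree->slot = calloc(1, sz);
		*found = 0;
	}
	else
		*found = 1;
	return tree->slot;
}

/* the "new function": re-alloc the value and update the slot in place */
static void *
toy_grow(ToyTree *tree, size_t new_sz)
{
	tree->slot = realloc(tree->slot, new_sz);
	return tree->slot;
}

static int
toy_self_check(void)
{
	ToyTree		tree = {NULL};
	int			found;
	char	   *v;

	v = (char *) toy_get(&tree, 4, &found);
	if (found || v == NULL)
		return 0;
	memcpy(v, "abc", 4);

	/* second lookup finds the existing value */
	v = (char *) toy_get(&tree, 4, &found);
	if (!found || strcmp(v, "abc") != 0)
		return 0;

	/* caller decided 4 bytes wasn't enough; contents are preserved */
	v = (char *) toy_grow(&tree, 16);
	if (strcmp(v, "abc") != 0)
		return 0;

	free(tree.slot);
	return 1;
}
```

The point of the split is that the tree never inspects value sizes on lookup; only the explicit grow call needs one.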
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Dec 7, 2023 at 12:27 PM John Naylor wrote: > > On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada wrote: > > > > On Sat, Oct 28, 2023 at 5:56 PM John Naylor wrote: > > > bool > > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p); > > or for variable-length value support, > > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz); > > > > If an entry already exists, update its value to 'value_p' and return > > true. Otherwise set the value and return false. > > > RT_VALUE_TYPE > > RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found); > > > > If the entry already exists, replace the value with a new empty value > > with sz bytes and set "found" to true. Otherwise, insert an empty > > value, return its pointer, and set "found" to false. > > > > We probably will find a better name but I use RT_INSERT() for > > discussion. RT_INSERT() returns an empty slot regardless of existing > > values. It can be used to insert a new value or to replace the value > > with a larger value. > > Looking at TidStoreSetBlockOffsets again (in particular how it works > with RT_GET), and thinking about issues we've discussed, I think > RT_SET is sufficient for vacuum. Here's how it could work: > > TidStoreSetBlockOffsets could have a stack variable that's "almost > always" large enough. When not, it can allocate in its own context. It > sets the necessary bits there. Then, it passes the pointer to RT_SET > with the number of bytes to copy. That seems very simple. Right. > > At some future time, we can add a new function with the complex > business about getting the current value to modify it, with the > re-alloc'ing that it might require. > > In other words, from both an API perspective and a performance > perspective, it makes sense for tid store to have a simple "set" > interface for vacuum that can be optimized for its characteristics > (insert only, ordered offsets). 
And also a more complex one for bitmap > scan (setting/unsetting bits of existing values, in any order). They > can share the same iteration interface, key types, and value types. > > What do you think, Masahiko? Good point. RT_SET() would be faster than RT_GET() followed by updating the value, because RT_SET() would not need to take care of the existing value (its size, embedded or not, realloc, etc.). I think we can separate the radix tree patch into two parts: the main implementation with RT_SET(), and the more complex APIs such as RT_GET(). That way, it would probably be easier to complete the radix tree and tidstore first. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Dec 6, 2023 at 3:39 PM John Naylor wrote: > > On Wed, Dec 6, 2023 at 4:34 AM Masahiko Sawada wrote: > > > > On Mon, Dec 4, 2023 at 5:21 PM John Naylor wrote: > > > > > Given variable-length value support, RT_GET() would have to do > > > > repalloc() if the existing value size is not big enough for the new > > > > value, but it cannot as the radix tree doesn't know the size of each > > > > stored value. > > > > > > I think we have two choices: > > > > > > - the value stores the "length". The caller would need to specify a > > > function to compute size from the "length" member. Note this assumes > > > there is an array. I think both aspects are not great. > > > - the value stores the "size". Callers that store an array (as > > > PageTableEntry's do) would compute length when they need to. This > > > sounds easier. > > > > As for the second idea, do we always need to require the value to have > > the "size" (e.g. int32) in the first field of its struct? If so, the > > caller will be able to use only 4 bytes in embedded value cases (or > > won't be able to use at all if the pointer size is 4 bytes). > > We could have an RT_SIZE_TYPE for varlen value types. That's easy. > There is another way, though: (This is a digression into embedded > values, but it does illuminate some issues even aside from that) > > My thinking a while ago was that an embedded value had no explicit > length/size, but could be "expanded" into a conventional value for the > caller. For bitmaps, the smallest full value would have length 1 and > whatever size (For tid store maybe 16 bytes). This would happen > automatically via a template function. > > Now I think that could be too complicated (especially for page table > entries, which have more bookkeeping than vacuum needs) and slow. 
> Imagine this as an embedded value: > > typedef struct BlocktableEntry > { > uint16 size; > > /* later: uint8 flags; for bitmap scan */ > > /* 64 bit: 3 elements, 32-bit: 1 element */ > OffsetNumber offsets[( sizeof(Pointer) - sizeof(int16) ) / > sizeof(OffsetNumber)]; > > /* end of embeddable value */ > > bitmapword words[FLEXIBLE_ARRAY_MEMBER]; > } BlocktableEntry; > > Here we can use a slot to store up to 3 offsets, no matter how big > they are. That's great because a bitmap could be mostly wasted space. Interesting idea. > But now the caller can't know up front how many bytes it needs until > it retrieves the value and sees what's already there. If there are > already three values, the caller needs to tell the tree "alloc this > much, update this slot you just gave me with the alloc (maybe DSA) > pointer, and return the local pointer". Then copy the 3 offsets into > set bits, and set whatever else it needs to. With normal values, same > thing, but with realloc. > > This is a bit complex, but I see an advantage: The tree doesn't need to > care so much about the size, so the value doesn't need to contain the > size. For our case, we can use length (number of bitmapwords) without > the disadvantages I mentioned above, with length zero (or maybe -1) > meaning "no bitmapword array, the offsets are all in this small > array". It's still unclear to me why the value doesn't need to contain the size. If I understand you correctly, in RT_GET(), the tree allocates new memory and updates the slot where the value is embedded with the pointer to the allocated memory, and returns the pointer to the caller. Since the returned value (the newly allocated memory) is still empty, the caller needs to copy the contents of the old value to the new value and do whatever else it needs to. If the value is already a single-leaf value and RT_GET() is called with a larger size, the slot is always replaced with the newly allocated area and the caller needs to copy the contents? 
If the tree does realloc the value with a new size, how does the tree know the new value is larger than the existing value? It seems like the caller needs to provide a function to calculate the size of the value based on the length. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada wrote: > > On Sat, Oct 28, 2023 at 5:56 PM John Naylor wrote: > bool > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p); > or for variable-length value support, > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz); > > If an entry already exists, update its value to 'value_p' and return > true. Otherwise set the value and return false. > RT_VALUE_TYPE > RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found); > > If the entry already exists, replace the value with a new empty value > with sz bytes and set "found" to true. Otherwise, insert an empty > value, return its pointer, and set "found" to false. > > We probably will find a better name but I use RT_INSERT() for > discussion. RT_INSERT() returns an empty slot regardless of existing > values. It can be used to insert a new value or to replace the value > with a larger value. Looking at TidStoreSetBlockOffsets again (in particular how it works with RT_GET), and thinking about issues we've discussed, I think RT_SET is sufficient for vacuum. Here's how it could work: TidStoreSetBlockOffsets could have a stack variable that's "almost always" large enough. When not, it can allocate in its own context. It sets the necessary bits there. Then, it passes the pointer to RT_SET with the number of bytes to copy. That seems very simple. At some future time, we can add a new function with the complex business about getting the current value to modify it, with the re-alloc'ing that it might require. In other words, from both an API perspective and a performance perspective, it makes sense for tid store to have a simple "set" interface for vacuum that can be optimized for its characteristics (insert only, ordered offsets). And also a more complex one for bitmap scan (setting/unsetting bits of existing values, in any order). They can share the same iteration interface, key types, and value types. What do you think, Masahiko?
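A sketch of the stack-buffer idea for TidStoreSetBlockOffsets described above, with a stand-in for RT_SET; the buffer size, typedefs, and function names are illustrative (in PostgreSQL the fallback would come from a memory context, not malloc):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t bitmapword;	/* stand-in for PostgreSQL's bitmapword */
#define STACK_WORDS 8			/* "almost always" enough on-stack space */

/* stand-in for RT_SET: here it just records the copied byte count */
static size_t last_set_size;
static void
rt_set(uint64_t key, const void *value, size_t sz)
{
	(void) key;
	(void) value;
	last_set_size = sz;
}

/*
 * Build the bitmap in a stack buffer when it fits, falling back to a
 * heap allocation otherwise, then hand RT_SET the exact byte count.
 */
static size_t
set_block_offsets(uint64_t blkno, const int *offsets, int num, int max_off)
{
	bitmapword	stack_buf[STACK_WORDS];
	bitmapword *buf = stack_buf;
	size_t		nwords = (max_off / 64) + 1;
	size_t		sz = nwords * sizeof(bitmapword);

	if (nwords > STACK_WORDS)
		buf = malloc(sz);		/* rare path: a private context in PostgreSQL */

	memset(buf, 0, sz);
	for (int i = 0; i < num; i++)
		buf[offsets[i] / 64] |= (bitmapword) 1 << (offsets[i] % 64);

	rt_set(blkno, buf, sz);

	if (buf != stack_buf)
		free(buf);
	return sz;
}

static int
block_offsets_self_check(void)
{
	int			small_offs[] = {1, 200};
	int			big_offs[] = {999};

	/* max offset 200 -> 4 words -> fits on the stack */
	if (set_block_offsets(1, small_offs, 2, 200) != 4 * sizeof(bitmapword))
		return 0;
	/* max offset 999 -> 16 words -> takes the heap path */
	if (set_block_offsets(2, big_offs, 1, 999) != 16 * sizeof(bitmapword))
		return 0;
	return 1;
}
```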
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Dec 6, 2023 at 4:34 AM Masahiko Sawada wrote: > > On Mon, Dec 4, 2023 at 5:21 PM John Naylor wrote: > > > Given variable-length value support, RT_GET() would have to do > > > repalloc() if the existing value size is not big enough for the new > > > value, but it cannot as the radix tree doesn't know the size of each > > > stored value. > > > > I think we have two choices: > > > > - the value stores the "length". The caller would need to specify a > > function to compute size from the "length" member. Note this assumes > > there is an array. I think both aspects are not great. > > - the value stores the "size". Callers that store an array (as > > PageTableEntry's do) would compute length when they need to. This > > sounds easier. > > As for the second idea, do we always need to require the value to have > the "size" (e.g. int32) in the first field of its struct? If so, the > caller will be able to use only 4 bytes in embedded value cases (or > won't be able to use at all if the pointer size is 4 bytes). We could have an RT_SIZE_TYPE for varlen value types. That's easy. There is another way, though: (This is a digression into embedded values, but it does illuminate some issues even aside from that) My thinking a while ago was that an embedded value had no explicit length/size, but could be "expanded" into a conventional value for the caller. For bitmaps, the smallest full value would have length 1 and whatever size (For tid store maybe 16 bytes). This would happen automatically via a template function. Now I think that could be too complicated (especially for page table entries, which have more bookkeeping than vacuum needs) and slow. 
Imagine this as an embedded value: typedef struct BlocktableEntry { uint16 size; /* later: uint8 flags; for bitmap scan */ /* 64 bit: 3 elements, 32-bit: 1 element */ OffsetNumber offsets[( sizeof(Pointer) - sizeof(int16) ) / sizeof(OffsetNumber)]; /* end of embeddable value */ bitmapword words[FLEXIBLE_ARRAY_MEMBER]; } BlocktableEntry; Here we can use a slot to store up to 3 offsets, no matter how big they are. That's great because a bitmap could be mostly wasted space. But now the caller can't know up front how many bytes it needs until it retrieves the value and sees what's already there. If there are already three values, the caller needs to tell the tree "alloc this much, update this slot you just gave me with the alloc (maybe DSA) pointer, and return the local pointer". Then copy the 3 offsets into set bits, and set whatever else it needs to. With normal values, same thing, but with realloc. This is a bit complex, but I see an advantage: The tree doesn't need to care so much about the size, so the value doesn't need to contain the size. For our case, we can use length (number of bitmapwords) without the disadvantages I mentioned above, with length zero (or maybe -1) meaning "no bitmapword array, the offsets are all in this small array". > > > Another idea is that the radix tree returns the pointer > > > to the slot and the caller updates the value accordingly. > > > > I did exactly this in v43 TidStore if I understood you correctly. If I > > misunderstood you, can you clarify? > > I meant to expose RT_GET_SLOT_RECURSIVE() so that the caller updates > the value as they want. Did my sketch above get closer to that? Side note: I don't think we can expose that directly (e.g. need to check for create or extend upwards), but some functionality can be a thin wrapper around it. > > However, for vacuum, we have all values that we need up front. 
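As a sanity check on the offsets[] arithmetic in the struct above, here is the capacity computation with stand-in typedefs (the macro and helper names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

typedef uint16_t OffsetNumber;	/* stand-ins for the PostgreSQL typedefs */
typedef char *Pointer;

/*
 * Offsets that fit in the pointer-sized embeddable prefix: the slot
 * width minus the uint16 size field, divided by sizeof(OffsetNumber).
 * That's (8 - 2) / 2 = 3 on 64-bit, (4 - 2) / 2 = 1 on 32-bit.
 */
#define NUM_EMBEDDED_OFFSETS \
	((sizeof(Pointer) - sizeof(uint16_t)) / sizeof(OffsetNumber))

/* would this many offsets fit in the embedded (pointer-sized) form? */
static int
can_embed(int noffsets)
{
	return noffsets <= (int) NUM_EMBEDDED_OFFSETS;
}
```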
That > > gives me an idea: Something like this insert API could be optimized > > for "insert-only": If we only free values when we free the whole tree > > at the end, that's a clear use case for David Rowley's proposed "bump > > context", which would save 8 bytes per allocation and be a bit faster. > > [1] (RT_GET for varlen values would use an aset context, to allow > > repalloc, and nodes would continue to use slab). > > Interesting idea and worth trying it. Do we need to protect the whole > tree as insert-only for safety? It's problematic if the user uses > mixed RT_INSERT() and RT_GET(). You're right, but I'm not sure what the policy should be.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Dec 4, 2023 at 5:21 PM John Naylor wrote: > > On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada wrote: > > > Since the variable-length values support is a big deal and would be > > related to API design I'd like to discuss the API design first. > > Thanks for the fine summary of the issues here. > > [Swapping this back in my head] > > > RT_VALUE_TYPE > > RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found); > > or for variable-length value support, > > RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found); > > > > If an entry already exists, return its pointer and set "found" to > > true. Otherwize, insert an empty value with sz bytes, return its > > pointer, and set "found" to false. > > > --- > > bool > > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p); > > or for variable-length value support, > > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz); > > > > If an entry already exists, update its value to 'value_p' and return > > true. Otherwise set the value and return false. > > I'd have to double-check, but I think RT_SET is vestigial and I'm not > sure it has any advantage over RT_GET as I've sketched it out. I'm > pretty sure it's only there now because changing the radix tree > regression tests is much harder than changing TID store. Agreed. > > > Given variable-length value support, RT_GET() would have to do > > repalloc() if the existing value size is not big enough for the new > > value, but it cannot as the radix tree doesn't know the size of each > > stored value. > > I think we have two choices: > > - the value stores the "length". The caller would need to specify a > function to compute size from the "length" member. Note this assumes > there is an array. I think both aspects are not great. > - the value stores the "size". Callers that store an array (as > PageTableEntry's do) would compute length when they need to. This > sounds easier. 
As for the second idea, do we always need to require the value to have the "size" (e.g. int32) in the first field of its struct? If so, the caller will be able to use only 4 bytes in embedded value cases (or won't be able to use it at all if the pointer size is 4 bytes). > > > Another idea is that the radix tree returns the pointer > > to the slot and the caller updates the value accordingly. > > I did exactly this in v43 TidStore if I understood you correctly. If I > misunderstood you, can you clarify? I meant to expose RT_GET_SLOT_RECURSIVE() so that the caller updates the value as they want. > > > But it means > > that the caller has to update the slot properly while considering the > > value size (embedded vs. single-leaf value), which seems not a good > > idea. > > For this optimization, callers will have to know about pointer-sized > values and treat them differently, but they don't need to know the > details about how or where they are stored. > > While we want to keep embedded values in the back of our minds, I > really think the details should be postponed to a follow-up commit. Agreed. > > > To deal with this problem, I think we can somewhat change RT_GET() API > > as follows: > > > > RT_VALUE_TYPE > > RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found); > > > > If the entry already exists, replace the value with a new empty value > > with sz bytes and set "found" to true. Otherwise, insert an empty > > value, return its pointer, and set "found" to false. > > > > We will probably find a better name, but I use RT_INSERT() for > > discussion. RT_INSERT() returns an empty slot regardless of existing > > values. It can be used to insert a new value or to replace the value > > with a larger value. > > For the case we are discussing, bitmaps, updating an existing value is > a bit tricky. We need the existing value to properly update it with > set or unset bits. This can't work in general without a lot of work > for the caller. True.
> > However, for vacuum, we have all values that we need up front. That > gives me an idea: Something like this insert API could be optimized > for "insert-only": If we only free values when we free the whole tree > at the end, that's a clear use case for David Rowley's proposed "bump > context", which would save 8 bytes per allocation and be a bit faster. > [1] (RT_GET for varlen values would use an aset context, to allow > repalloc, and nodes would continue to use slab). Interesting idea and worth trying it. Do we need to protect the whole tree as insert-only for safety? It's problematic if the user uses mixed RT_INSERT() and RT_GET(). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Nov 27, 2023 at 1:45 PM Masahiko Sawada wrote: > Since the variable-length values support is a big deal and would be > related to API design I'd like to discuss the API design first. Thanks for the fine summary of the issues here. [Swapping this back in my head] > RT_VALUE_TYPE > RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found); > or for variable-length value support, > RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found); > > If an entry already exists, return its pointer and set "found" to > true. Otherwize, insert an empty value with sz bytes, return its > pointer, and set "found" to false. > --- > bool > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p); > or for variable-length value support, > RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz); > > If an entry already exists, update its value to 'value_p' and return > true. Otherwise set the value and return false. I'd have to double-check, but I think RT_SET is vestigial and I'm not sure it has any advantage over RT_GET as I've sketched it out. I'm pretty sure it's only there now because changing the radix tree regression tests is much harder than changing TID store. > Given variable-length value support, RT_GET() would have to do > repalloc() if the existing value size is not big enough for the new > value, but it cannot as the radix tree doesn't know the size of each > stored value. I think we have two choices: - the value stores the "length". The caller would need to specify a function to compute size from the "length" member. Note this assumes there is an array. I think both aspects are not great. - the value stores the "size". Callers that store an array (as PageTableEntry's do) would compute length when they need to. This sounds easier. > Another idea is that the radix tree returns the pointer > to the slot and the caller updates the value accordingly. I did exactly this in v43 TidStore if I understood you correctly. 
If I misunderstood you, can you clarify? > But it means > that the caller has to update the slot properly while considering the > value size (embedded vs. single-leaf value), which seems not a good > idea. For this optimization, callers will have to know about pointer-sized values and treat them differently, but they don't need to know the details about how or where they are stored. While we want to keep embedded values in the back of our minds, I really think the details should be postponed to a follow-up commit. > To deal with this problem, I think we can somewhat change RT_GET() API > as follows: > > RT_VALUE_TYPE > RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found); > > If the entry already exists, replace the value with a new empty value > with sz bytes and set "found" to true. Otherwise, insert an empty > value, return its pointer, and set "found" to false. > > We will probably find a better name, but I use RT_INSERT() for > discussion. RT_INSERT() returns an empty slot regardless of existing > values. It can be used to insert a new value or to replace the value > with a larger value. For the case we are discussing, bitmaps, updating an existing value is a bit tricky. We need the existing value to properly update it with set or unset bits. This can't work in general without a lot of work for the caller. However, for vacuum, we have all values that we need up front. That gives me an idea: Something like this insert API could be optimized for "insert-only": If we only free values when we free the whole tree at the end, that's a clear use case for David Rowley's proposed "bump context", which would save 8 bytes per allocation and be a bit faster. [1] (RT_GET for varlen values would use an aset context, to allow repalloc, and nodes would continue to use slab). [1] https://www.postgresql.org/message-id/flat/CAApHDvqGSpCU95TmM=Bp=6xjl_nlys4zdzopfnywbk97xrd...@mail.gmail.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sat, Oct 28, 2023 at 5:56 PM John Naylor wrote: > > I wrote: > > > Seems fine at a glance, thanks. I will build on this to implement > > variable-length values. I have already finished one prerequisite which is: > > public APIs passing pointers to values. > > Since my publishing schedule has not kept up, I'm just going to share > something similar to what I mentioned earlier, just to get things > moving again. Thanks for sharing the updates. I've returned to work today and will resume working on this feature. > > 0001-0009 are from earlier versions, except for 0007 which makes a > bunch of superficial naming updates, similar to those done in a recent > other version. Somewhere along the way I fixed long-standing git > whitespace warnings, but I don't remember if that's new here. In any > case, let's try to preserve that. > > 0010 is some minor refactoring to reduce duplication > > 0011-0014 add public functions that give the caller more control over > the input and responsibility for locking. They are not named well, but > I plan these to be temporary: They are currently used for the tidstore > only, since that has much simpler tests than the standard radix tree > tests. One thing to note: since the tidstore has always done it's own > locking within a larger structure, these patches don't bother to do > locking at the radix tree level. Locking twice seems...not great. > These patches are the main prerequisite for variable-length values. > Once that is working well, we can switch the standard tests to the new > APIs. Since the variable-length values support is a big deal and would be related to API design I'd like to discuss the API design first. Currently, we have the following APIs: --- RT_VALUE_TYPE RT_GET(RT_RADIX_TREE *tree, uint64 key, bool *found); or for variable-length value support, RT_GET(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found); If an entry already exists, return its pointer and set "found" to true. 
Otherwise, insert an empty value with sz bytes, return its pointer, and set "found" to false. --- RT_VALUE_TYPE RT_FIND(RT_RADIX_TREE *tree, uint64 key); If an entry exists, return the pointer to the value, otherwise return NULL. (I omitted RT_SEARCH() as it's essentially the same as RT_FIND() and will probably get removed.) --- bool RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p); or for variable-length value support, RT_SET(RT_RADIX_TREE *tree, uint64 key, RT_VALUE_TYPE *value_p, size_t sz); If an entry already exists, update its value to 'value_p' and return true. Otherwise set the value and return false. Given variable-length value support, RT_GET() would have to do repalloc() if the existing value size is not big enough for the new value, but it cannot as the radix tree doesn't know the size of each stored value. Another idea is that the radix tree returns the pointer to the slot and the caller updates the value accordingly. But it means that the caller has to update the slot properly while considering the value size (embedded vs. single-leaf value), which seems not a good idea. To deal with this problem, I think we can somewhat change RT_GET() API as follows: RT_VALUE_TYPE RT_INSERT(RT_RADIX_TREE *tree, uint64 key, size_t sz, bool *found); If the entry already exists, replace the value with a new empty value with sz bytes and set "found" to true. Otherwise, insert an empty value, return its pointer, and set "found" to false. We will probably find a better name, but I use RT_INSERT() for discussion. RT_INSERT() returns an empty slot regardless of existing values. It can be used to insert a new value or to replace the value with a larger value. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
I wrote: > Seems fine at a glance, thanks. I will build on this to implement > variable-length values. I have already finished one prerequisite which is: > public APIs passing pointers to values. Since my publishing schedule has not kept up, I'm just going to share something similar to what I mentioned earlier, just to get things moving again. 0001-0009 are from earlier versions, except for 0007 which makes a bunch of superficial naming updates, similar to those done in a recent other version. Somewhere along the way I fixed long-standing git whitespace warnings, but I don't remember if that's new here. In any case, let's try to preserve that. 0010 is some minor refactoring to reduce duplication. 0011-0014 add public functions that give the caller more control over the input and responsibility for locking. They are not named well, but I plan these to be temporary: They are currently used for the tidstore only, since that has much simpler tests than the standard radix tree tests. One thing to note: since the tidstore has always done its own locking within a larger structure, these patches don't bother to do locking at the radix tree level. Locking twice seems...not great. These patches are the main prerequisite for variable-length values. Once that is working well, we can switch the standard tests to the new APIs. Next steps include (some of these were briefly discussed off-list with Sawada-san):
- template parameter for varlen values
- some callers to pass length in bytes
- block entries to have num_elems for # of bitmap words
- a way for updates to re-alloc values when needed
- aset allocation for values when appropriate

v43-ART.tar.gz Description: application/gzip
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sat, Sep 16, 2023 at 9:03 AM Andres Freund wrote: > > Hi, > > On 2023-08-28 23:43:22 +0900, Masahiko Sawada wrote: > > I've attached v42 patch set. I improved the tidstore regression test code > > in addition to incorporating the above comments. > > Why did you need to disable the benchmark module for CI? I didn't want to unnecessarily make cfbot unhappy, since the benchmark module is not going to be committed to core and is sometimes not up to date. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2023-08-28 23:43:22 +0900, Masahiko Sawada wrote: > I've attached v42 patch set. I improved the tidstore regression test code > in addition to incorporating the above comments. Why did you need to disable the benchmark module for CI? Greetings, Andres Freund
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Sep 6, 2023 at 3:23 PM John Naylor wrote: > > > On Mon, Aug 28, 2023 at 9:44 PM Masahiko Sawada wrote: > > > > I've attached v42 patch set. I improved the tidstore regression test code > > in addition to incorporating the above comments. > > Seems fine at a glance, thanks. I will build on this to implement > variable-length values. Thanks. > I have already finished one prerequisite which is: public APIs passing > pointers to values. Great! Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Aug 28, 2023 at 9:44 PM Masahiko Sawada wrote: > > I've attached v42 patch set. I improved the tidstore regression test code > in addition to incorporating the above comments. Seems fine at a glance, thanks. I will build on this to implement variable-length values. I have already finished one prerequisite which is: public APIs passing pointers to values. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Aug 28, 2023 at 4:20 PM John Naylor wrote: > > On Sun, Aug 27, 2023 at 7:53 PM Masahiko Sawada wrote: > > > > I've updated the regression tests for tidstore so that it uses SQL > > functions to add blocks/offsets and dump its contents. The new test > > covers the same test coverages but it's executed using SQL functions > > instead of executing all tests in one SQL function. > > This is much nicer and more flexible, thanks! A few questions/comments: > > tidstore_dump_tids() returns a string -- is it difficult to turn this into a > SRF, or is it just a bit more work? It's not difficult. I've changed it in v42 patch. > > The lookup test seems fine for now. The output would look nicer with an > "order by tid". Agreed. > > I think we could have the SQL function tidstore_create() take a boolean for > shared memory. That would allow ad-hoc testing without a recompile, if I'm > not mistaken. Agreed. > > +SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[]) > + FROM blocks, offsets > + GROUP BY blk; > + tidstore_set_block_offsets > + > + > + > + > + > + > +(5 rows) > > Calling a void function multiple times leads to vertical whitespace, which > looks a bit strange and may look better with some output, even if irrelevant: > > -SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[]) > +SELECT row_number() over(order by blk), tidstore_set_block_offsets(blk, > array_agg(offsets.off)::int2[]) > > row_number | tidstore_set_block_offsets > + > 1 | > 2 | > 3 | > 4 | > 5 | > (5 rows) Yes, it looks better. I've attached v42 patch set. I improved the tidstore regression test code in addition to incorporating the above comments. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v42-ART.tar.gz Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sun, Aug 27, 2023 at 7:53 PM Masahiko Sawada wrote: > > I've updated the regression tests for tidstore so that it uses SQL > functions to add blocks/offsets and dump its contents. The new test > covers the same test coverages but it's executed using SQL functions > instead of executing all tests in one SQL function. This is much nicer and more flexible, thanks! A few questions/comments: tidstore_dump_tids() returns a string -- is it difficult to turn this into a SRF, or is it just a bit more work? The lookup test seems fine for now. The output would look nicer with an "order by tid". I think we could have the SQL function tidstore_create() take a boolean for shared memory. That would allow ad-hoc testing without a recompile, if I'm not mistaken. +SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[]) + FROM blocks, offsets + GROUP BY blk; + tidstore_set_block_offsets + + + + + + +(5 rows) Calling a void function multiple times leads to vertical whitespace, which looks a bit strange and may look better with some output, even if irrelevant: -SELECT tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[]) +SELECT row_number() over(order by blk), tidstore_set_block_offsets(blk, array_agg(offsets.off)::int2[]) row_number | tidstore_set_block_offsets + 1 | 2 | 3 | 4 | 5 | (5 rows) -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Aug 16, 2023 at 8:04 PM John Naylor wrote: > > > On Tue, Aug 15, 2023 at 6:53 PM John Naylor > wrote: > > > > On Tue, Aug 15, 2023 at 9:34 AM Masahiko Sawada > > wrote: > > > > > BTW cfbot reported that some regression tests failed due to OOM. I've > > > attached the patch to fix it. > > > > Seems worth doing now rather than later, so added this and squashed most of > > the rest together. > > This segfaults because of a mistake fixing a rebase conflict, so v40 attached. > Thank you for updating the patch set. On Tue, Aug 15, 2023 at 11:33 AM Masahiko Sawada wrote: > On Mon, Aug 14, 2023 at 8:05 PM John Naylor > wrote: > > Looking at the tidstore tests again after some months, I'm not particularly > > pleased with the amount of code required for how little it seems to be > > testing, nor the output when something fails. (I wonder how hard it would > > be to have SQL functions that add blocks/offsets to the tid store, and emit > > tuples of tids found in the store.) > > It would not be hard to have such SQL functions. I'll try it. I've updated the regression tests for tidstore so that it uses SQL functions to add blocks/offsets and dump its contents. The new test covers the same test coverages but it's executed using SQL functions instead of executing all tests in one SQL function. 0008 patch fixes a bug in tidstore which I found during this work. We didn't recreate the radix tree in the same memory context when TidStoreReset(). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v41-ART.tar.gz Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Aug 15, 2023 at 6:53 PM John Naylor wrote: > > On Tue, Aug 15, 2023 at 9:34 AM Masahiko Sawada wrote: > > > BTW cfbot reported that some regression tests failed due to OOM. I've > > attached the patch to fix it. > > Seems worth doing now rather than later, so added this and squashed most of the rest together. This segfaults because of a mistake fixing a rebase conflict, so v40 attached. -- John Naylor EDB: http://www.enterprisedb.com v40-ART.tar.gz Description: application/gzip
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Aug 15, 2023 at 9:34 AM Masahiko Sawada wrote: > BTW cfbot reported that some regression tests failed due to OOM. I've > attached the patch to fix it. Seems worth doing now rather than later, so added this and squashed most of the rest together. I wonder if that test uses too much memory in general. Maybe using the full uint64 is too much. > On Mon, Aug 14, 2023 at 8:05 PM John Naylor > wrote: > > This is possibly faster than v38-0010, but looking like not worth the complexity, assuming the other way avoids the bug going forward. > > I prefer 0010 but is it worth testing with other compilers such as clang? Okay, keeping 0010 with a comment, and leaving out 0011 for now. Clang is aggressive about unrolling loops, so may be worth looking globally at some point. > > v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch > > Within noise level of v38-0011, but it's small and simple, so I like it, at least for small arrays. > > Agreed. Keeping 0012 and not 0013. -- John Naylor EDB: http://www.enterprisedb.com v39-ART.tar.gz Description: application/gzip
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On Mon, Aug 14, 2023 at 8:05 PM John Naylor wrote: > > On Thu, Jul 13, 2023 at 3:09 PM Masahiko Sawada wrote: > > > > 0007, 0008, 0010, and 0011 are straightforward and agree to merge them. Thank you for updating the patch! > > [Part 1 - clear the deck of earlier performance work etc] > > Thanks for taking a look! I've merged 0007 and 0008. The others need a > performance test to justify them -- an eyeball check is not enough. I've now > made the time to do that. > > sparse loads > > v38 0001-0006 (still using node3 for this test only): > > select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from > bench_load_random_int(100 * 1000 * (1+x-x))) a; > avg > - > 27.1000 > > select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from > bench_load_random_int(500 * 1000 * (1+x-x))) a; > avg > -- > 165.6333 > > v38-0007-Optimize-RT_EXTEND_DOWN.patch > > select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from > bench_load_random_int(100 * 1000 * (1+x-x))) a; > avg > - > 25.0900 > > select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from > bench_load_random_int(500 * 1000 * (1+x-x))) a; > avg > -- > 157.3667 > > That seems worth doing. > > v38-0008-Use-4-children-for-node-4-also-attempt-portable-.patch > > This combines two things because I messed up a rebase: Use fanout of 4, and > try some macros for shmem sizes, both 32- and 64-bit. Looking at this much, I > no longer have a goal to have a separate set of size-classes for non-SIMD > platforms, because that would cause global maintenance problems -- it's > probably better to reduce worst-case search time where necessary. That would > be much more localized. > > > I have some questions on 0009 patch: > > > According to the comment, this optimization is for only gcc? > > No, not at all. That tells me the comment is misleading. > > > I think this change reduces > > readability and maintainability. > > Well, that much is obvious. 
What is not obvious is how much it gains us over > the alternatives. I do have a simpler idea, though... > > load mostly node4 > > select * from bench_search_random_nodes(250*1000, '0xFF'); > n4 = 42626, n16 = 21492, n32 = 0, n64 = 0, n256 = 257 > mem_allocated | load_ms | search_ms > ---+-+--- >7352384 | 25 | 0 > > v38-0009-TEMP-take-out-search-time-from-bench.patch > > This is just to allow LATERAL queries for better measurements. > > select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from > bench_search_random_nodes(250*1000 * (1+x-x), '0xFF')) a; > > avg > - > 24.8333 0007, 0008, and 0009 look good to me. > > v38-0010-Try-a-simpler-way-to-avoid-memmove.patch > > This slightly rewrites the standard loop so that gcc doesn't turn it into a > memmove(). Unlike the patch you didn't like, this *is* gcc-specific. (needs a > comment, which I forgot) > > avg > - > 21.9600 > > So, that's not a trivial difference. I wasn't a big fan of Andres' __asm("") > workaround, but that may be just my ignorance about it. We need something > like either of the two. > > v38-0011-Optimize-add_child_4-take-2.patch > avg > - > 21.3500 > > This is possibly faster than v38-0010, but looking like not worth the > complexity, assuming the other way avoids the bug going forward. I prefer 0010 but is it worth testing with other compilers such as clang? > > > According to the bugzilla ticket > > referred to in the comment, it's realized as a bug in the community, > > so once the gcc bug fixes, we might no longer need this trick, no? > > No comment in two years... > > v38-0013-Use-constant-for-initial-copy-of-chunks-and-chil.patch > > This is the same as v37-0011. I wasn't quite satisfied with it since it still > has two memcpy() calls, but it actually seems to regress: > > avg > - > 22.0900 > > v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch > > This patch uses a single loop for the copy. 
> > avg > - > 21.0300 > > Within noise level of v38-0011, but it's small and simple, so I like it, at > least for small arrays. Agreed. > > v38-0014-node48-Remove-need-for-RIGHTMOST_ONE-in-radix-tr.patch > v38-0015-node48-Remove-dead-code-by-using-loop-local-var.patch > > Just small cleanups. > > v38-0016-Use-memcpy-for-children-when-growing-into-node48.patch > > Makes sense, but untested. Agreed. BTW cfbot reported that some regression tests failed due to OOM. I've attached the patch to fix it. > > === > [Part 2] > > Per off-list discussion with Masahiko, it makes sense to take some of the
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Jul 13, 2023 at 3:09 PM Masahiko Sawada wrote: > > 0007, 0008, 0010, and 0011 are straightforward and agree to merge them. [Part 1 - clear the deck of earlier performance work etc] Thanks for taking a look! I've merged 0007 and 0008. The others need a performance test to justify them -- an eyeball check is not enough. I've now made the time to do that. sparse loads v38 0001-0006 (still using node3 for this test only): select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_load_random_int(100 * 1000 * (1+x-x))) a; avg - 27.1000 select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a; avg -- 165.6333 v38-0007-Optimize-RT_EXTEND_DOWN.patch select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_load_random_int(100 * 1000 * (1+x-x))) a; avg - 25.0900 select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from bench_load_random_int(500 * 1000 * (1+x-x))) a; avg -- 157.3667 That seems worth doing. v38-0008-Use-4-children-for-node-4-also-attempt-portable-.patch This combines two things because I messed up a rebase: Use fanout of 4, and try some macros for shmem sizes, both 32- and 64-bit. Looking at this much, I no longer have a goal to have a separate set of size-classes for non-SIMD platforms, because that would cause global maintenance problems -- it's probably better to reduce worst-case search time where necessary. That would be much more localized. > I have some questions on 0009 patch: > According to the comment, this optimization is for only gcc? No, not at all. That tells me the comment is misleading. > I think this change reduces > readability and maintainability. Well, that much is obvious. What is not obvious is how much it gains us over the alternatives. I do have a simpler idea, though... 
load mostly node4 select * from bench_search_random_nodes(250*1000, '0xFF'); n4 = 42626, n16 = 21492, n32 = 0, n64 = 0, n256 = 257 mem_allocated | load_ms | search_ms ---+-+--- 7352384 | 25 | 0 v38-0009-TEMP-take-out-search-time-from-bench.patch This is just to allow LATERAL queries for better measurements. select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from bench_search_random_nodes(250*1000 * (1+x-x), '0xFF')) a; avg - 24.8333 v38-0010-Try-a-simpler-way-to-avoid-memmove.patch This slightly rewrites the standard loop so that gcc doesn't turn it into a memmove(). Unlike the patch you didn't like, this *is* gcc-specific. (needs a comment, which I forgot) avg - 21.9600 So, that's not a trivial difference. I wasn't a big fan of Andres' __asm("") workaround, but that may be just my ignorance about it. We need something like either of the two. v38-0011-Optimize-add_child_4-take-2.patch avg - 21.3500 This is possibly faster than v38-0010, but looking like not worth the complexity, assuming the other way avoids the bug going forward. > According to the bugzilla ticket > referred to in the comment, it's realized as a bug in the community, > so once the gcc bug fixes, we might no longer need this trick, no? No comment in two years... v38-0013-Use-constant-for-initial-copy-of-chunks-and-chil.patch This is the same as v37-0011. I wasn't quite satisfied with it since it still has two memcpy() calls, but it actually seems to regress: avg - 22.0900 v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch This patch uses a single loop for the copy. avg - 21.0300 Within noise level of v38-0011, but it's small and simple, so I like it, at least for small arrays. v38-0014-node48-Remove-need-for-RIGHTMOST_ONE-in-radix-tr.patch v38-0015-node48-Remove-dead-code-by-using-loop-local-var.patch Just small cleanups. v38-0016-Use-memcpy-for-children-when-growing-into-node48.patch Makes sense, but untested. 
=== [Part 2] Per off-list discussion with Masahiko, it makes sense to take some of the ideas I've used locally on tidbitmap, and start incorporating them into earlier vacuum work to get that out the door faster. With that in mind... v38-0017-Make-tidstore-more-similar-to-tidbitmap.patch This uses a simplified PagetableEntry (unimaginatively called BlocktableEntry just to avoid confusion), to be replaced with the real thing at a later date. This is still fixed size, to be replaced with a varlen type. Looking at the tidstore tests again after some months, I'm not particularly pleased with the amount of code required for how little it seems to be testing, nor the output
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On Thu, Jul 13, 2023 at 5:08 PM Masahiko Sawada wrote: > > On Sat, Jul 8, 2023 at 11:54 AM John Naylor > wrote: > > > > > > On Fri, Jul 7, 2023 at 2:19 PM Masahiko Sawada > > wrote: > > > > > > On Wed, Jul 5, 2023 at 8:21 PM John Naylor > > > wrote: > > > > Well, it's going to be a bit of a mess until I can demonstrate it > > > > working (and working well) with bitmap heap scan. Fixing that now is > > > > just going to create conflicts. I do have a couple small older patches > > > > laying around that were quick experiments -- I think at least some of > > > > them should give a performance boost in loading speed, but haven't had > > > > time to test. Would you like to take a look? > > > > > > Yes, I can experiment with these patches in the meantime. > > > > Okay, here it is in v36. 0001-6 are same as v35. > > > > 0007 removes a wasted extra computation newly introduced by refactoring > > growing nodes. 0008 just makes 0011 nicer. Not worth testing by themselves, > > but better to be tidy. > > 0009 is an experiment to get rid of slow memmoves in node4, addressing a > > long-standing inefficiency. It looks a bit tricky, but I think it's > > actually straightforward after drawing out the cases with pen and paper. It > > works if the fanout is either 4 or 5, so we have some wiggle room. This may > > give a noticeable boost if the input is reversed or random. > > 0010 allows RT_EXTEND_DOWN to reduce function calls, so should help with > > sparse trees. > > 0011 reduces function calls when growing the smaller nodes. Not sure about > > this one -- possibly worth it for node4 only? > > > > If these help, it'll show up more easily in smaller inputs. Large inputs > > tend to be more dominated by RAM latency. cfbot reported some failures[1], and the v36 patch cannot be applied cleanly to the current HEAD. I've attached updated patches to make cfbot happy. 
Regards,

[1] http://cfbot.cputube.org/highlights/all.html#3687

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

v37-ART.tar.gz
Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sat, Jul 8, 2023 at 11:54 AM John Naylor wrote:
>
> On Fri, Jul 7, 2023 at 2:19 PM Masahiko Sawada wrote:
> >
> > On Wed, Jul 5, 2023 at 8:21 PM John Naylor wrote:
> > > Well, it's going to be a bit of a mess until I can demonstrate it working
> > > (and working well) with bitmap heap scan. Fixing that now is just going
> > > to create conflicts. I do have a couple small older patches laying around
> > > that were quick experiments -- I think at least some of them should give
> > > a performance boost in loading speed, but haven't had time to test. Would
> > > you like to take a look?
> >
> > Yes, I can experiment with these patches in the meantime.
>
> Okay, here it is in v36. 0001-6 are same as v35.
>
> 0007 removes a wasted extra computation newly introduced by refactoring
> growing nodes. 0008 just makes 0011 nicer. Not worth testing by themselves,
> but better to be tidy.
> 0009 is an experiment to get rid of slow memmoves in node4, addressing a
> long-standing inefficiency. It looks a bit tricky, but I think it's actually
> straightforward after drawing out the cases with pen and paper. It works if
> the fanout is either 4 or 5, so we have some wiggle room. This may give a
> noticeable boost if the input is reversed or random.
> 0010 allows RT_EXTEND_DOWN to reduce function calls, so should help with
> sparse trees.
> 0011 reduces function calls when growing the smaller nodes. Not sure about
> this one -- possibly worth it for node4 only?
>
> If these help, it'll show up more easily in smaller inputs. Large inputs tend
> to be more dominated by RAM latency.

Thanks for sharing the patches! 0007, 0008, 0010, and 0011 are straightforward and I agree with merging them. I have some questions on the 0009 patch:

+   /* shift chunks and children
+
+      Unfortunately, gcc has gotten too aggressive in turning simple loops
+      into slow memmove's, so we have to be a bit more clever.
+      See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101481
+
+      We take advantage of the fact that a good
+      compiler can turn a memmove of a small constant power-of-two
+      number of bytes into a single load/store.
+   */

According to the comment, is this optimization only for gcc? And is there no negative impact from this change when building with other compilers such as clang? I'm not sure it's a good approach to hand-optimize the code this much just to generate better instructions on gcc; I think this change reduces readability and maintainability. Also, according to the bugzilla ticket referred to in the comment, this behavior is recognized as a bug in the gcc community, so once the gcc bug is fixed, we might no longer need this trick, no?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
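To illustrate the trick described in the quoted comment, here is a standalone hedged sketch. The Node4 type, node4_insert name, and layout are invented for illustration, and the bit arithmetic assumes a little-endian target for brevity; the point is only the mechanism: instead of shifting chunk bytes with a per-element loop, which gcc can transform into a call to memmove, the whole 4-byte chunk array is loaded and stored as one 32-bit value via constant-size memcpy, which a good compiler reduces to single load/store instructions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Invented miniature of a fanout-4 node; not the actual patch's layout. */
typedef struct Node4
{
	int			count;
	uint8_t		chunks[4];
} Node4;

/*
 * Insert "chunk" at position "pos", shifting later elements right.
 *
 * A per-element shifting loop here can be turned into a memmove call by
 * gcc (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101481). The two
 * constant-size memcpy calls below instead compile down to a single
 * 32-bit load and store, with the shift done in a register.
 * NOTE: the mask arithmetic assumes little-endian byte order.
 */
static void
node4_insert(Node4 *node, int pos, uint8_t chunk)
{
	uint32_t	v;
	uint32_t	lowmask;

	assert(node->count < 4 && pos >= 0 && pos <= node->count);

	memcpy(&v, node->chunks, sizeof(v));	/* one 32-bit load */
	lowmask = ((uint32_t) 1 << (8 * pos)) - 1;
	/* keep bytes below pos, move bytes at/above pos up one slot, insert */
	v = (v & lowmask) | ((v & ~lowmask) << 8) | ((uint32_t) chunk << (8 * pos));
	memcpy(node->chunks, &v, sizeof(v));	/* one 32-bit store */
	node->count++;
}
```

Real code would of course also have to shift the corresponding children array and handle byte order portably; this only sketches why a constant power-of-two copy size matters for codegen.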
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Jul 7, 2023 at 2:19 PM Masahiko Sawada wrote:
>
> On Wed, Jul 5, 2023 at 8:21 PM John Naylor wrote:
> > Well, it's going to be a bit of a mess until I can demonstrate it working
> > (and working well) with bitmap heap scan. Fixing that now is just going to
> > create conflicts. I do have a couple small older patches laying around that
> > were quick experiments -- I think at least some of them should give a
> > performance boost in loading speed, but haven't had time to test. Would you
> > like to take a look?
>
> Yes, I can experiment with these patches in the meantime.

Okay, here it is in v36. 0001-6 are same as v35.

0007 removes a wasted extra computation newly introduced by refactoring growing nodes. 0008 just makes 0011 nicer. Not worth testing by themselves, but better to be tidy.

0009 is an experiment to get rid of slow memmoves in node4, addressing a long-standing inefficiency. It looks a bit tricky, but I think it's actually straightforward after drawing out the cases with pen and paper. It works if the fanout is either 4 or 5, so we have some wiggle room. This may give a noticeable boost if the input is reversed or random.

0010 allows RT_EXTEND_DOWN to reduce function calls, so should help with sparse trees.

0011 reduces function calls when growing the smaller nodes. Not sure about this one -- possibly worth it for node4 only?

If these help, it'll show up more easily in smaller inputs. Large inputs tend to be more dominated by RAM latency.

--
John Naylor
EDB: http://www.enterprisedb.com

v36-ART.tar.gz
Description: application/gzip
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Jul 5, 2023 at 8:21 PM John Naylor wrote:
>
> On Tue, Jul 4, 2023 at 12:49 PM Masahiko Sawada wrote:
> >
> > As you mentioned, the 1-byte value is embedded into 8 byte so 7 bytes
> > are unused, but we use less memory since we use less slab contexts and
> > save fragmentations.
>
> Thanks for testing. This tree is sparse enough that most of the space is
> taken up by small inner nodes, and not by leaves. So, it's encouraging to
> see a small space savings even here.
>
> > I've also tested some large value cases (e.g. the value is 80-bytes)
> > and got a similar result.
>
> Interesting. With a separate allocation per value the overhead would be 8
> bytes, or 10% here. It's plausible that savings elsewhere can hide that,
> globally.
>
> > Regarding the codes, there are many todo and fixme comments so it
> > seems to me that your recent work is still in-progress. What is the
> > current status? Can I start reviewing the code or should I wait for a
> > while until your recent work completes?
>
> Well, it's going to be a bit of a mess until I can demonstrate it working
> (and working well) with bitmap heap scan. Fixing that now is just going to
> create conflicts. I do have a couple small older patches laying around that
> were quick experiments -- I think at least some of them should give a
> performance boost in loading speed, but haven't had time to test. Would you
> like to take a look?

Yes, I can experiment with these patches in the meantime.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
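The space trade-off discussed above (a 1-byte value embedded into an 8-byte slot, leaving 7 bytes unused but avoiding a separate leaf allocation) can be illustrated with a common pointer-tagging scheme. This is a guess at the general technique, not the patch's actual representation, and every name below is invented: allocated leaf pointers are at least 8-byte aligned, so their lowest bit is always zero and can serve as an "embedded value" tag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical 64-bit child slot: either a pointer to a separately
 * allocated leaf, or a small value embedded directly in the slot.
 * Bit 0 tags the embedded case; pointer alignment guarantees that
 * real pointers never have that bit set.
 */
typedef uint64_t rt_slot;

#define SLOT_EMBEDDED_FLAG	UINT64_C(1)

/* store a 1-byte value directly in the slot: no leaf allocation needed */
static rt_slot
slot_embed_value(uint8_t value)
{
	return ((rt_slot) value << 1) | SLOT_EMBEDDED_FLAG;
}

static bool
slot_is_embedded(rt_slot slot)
{
	return (slot & SLOT_EMBEDDED_FLAG) != 0;
}

static uint8_t
slot_get_value(rt_slot slot)
{
	assert(slot_is_embedded(slot));
	return (uint8_t) (slot >> 1);
}

/* store a pointer to an out-of-line leaf; alignment gives a free tag bit */
static rt_slot
slot_from_pointer(void *p)
{
	assert(((uintptr_t) p & SLOT_EMBEDDED_FLAG) == 0);
	return (rt_slot) (uintptr_t) p;
}
```

Under a scheme like this, most of the slot's bits go unused for a 1-byte value, matching the "7 bytes are unused" observation, but the tree needs no value slab context at all for small values, which is where the reported net memory savings would come from.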