Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, May 1, 2024 at 4:29 PM John Naylor wrote: > > On Thu, Apr 25, 2024 at 8:36 AM Masahiko Sawada wrote: > > > > On Mon, Apr 15, 2024 at 6:12 PM John Naylor wrote: > > > > - RT_KEY_GET_SHIFT is not covered for key=0: > > > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803 > > > > > > That should be fairly simple to add to the tests. > > > > There are two paths to call RT_KEY_GET_SHIFT(): > > > > 1. RT_SET() -> RT_KEY_GET_SHIFT() > > 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT() > > > > In both cases, it's called when key > tree->ctl->max_val. Since the > > minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called > > when key=0. > > Ah, right, so it is dead code. Nothing to worry about, but it does > point the way to some simplifications, which I've put together in the > attached. Thank you for the patch. It looks good to me. + /* compute the smallest shift that will allowing storing the key */ + start_shift = pg_leftmost_one_pos64(key) / RT_SPAN * RT_SPAN; The comment is moved from RT_KEY_GET_SHIFT() but I think s/will allowing storing/will allow storing/. > > > > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered: > > > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644 > > > > > > That should be easy to add. > > > > Agreed. The patch is attached. > > LGTM > > > > - TidStoreCreate* has some memory clamps that are not covered: > > > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179 > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234 > > > > > > Maybe we could experiment with using 1MB for shared, and something > > > smaller for local. > > > > I've confirmed that the local and shared tidstore with small max sizes > > such as 4kB and 1MB worked. 
Currently the max size is hard-coded in > > test_tidstore.c but if we use work_mem as the max size, we can pass > > different max sizes for local and shared in the test script. > > Seems okay, do you want to try that and see how it looks? I've attached a simple patch for this. In test_tidstore.sql, we used to create two local tidstores and one shared tidstore. I thought of specifying small work_mem values for these three cases, but that would remove the normal test cases, so I created a separate tidstore for this test. Also, the new test only checks that a tidstore can be created with such a small size; it might be a good idea to add some TIDs to check that it really works fine. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com use_work_mem_as_max_bytes.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Apr 25, 2024 at 8:36 AM Masahiko Sawada wrote: > > On Mon, Apr 15, 2024 at 6:12 PM John Naylor wrote: > > - RT_KEY_GET_SHIFT is not covered for key=0: > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803 > > > > That should be fairly simple to add to the tests. > > There are two paths to call RT_KEY_GET_SHIFT(): > > 1. RT_SET() -> RT_KEY_GET_SHIFT() > 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT() > > In both cases, it's called when key > tree->ctl->max_val. Since the > minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called > when key=0. Ah, right, so it is dead code. Nothing to worry about, but it does point the way to some simplifications, which I've put together in the attached. > > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered: > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644 > > > > That should be easy to add. > > Agreed. The patch is attached. LGTM > > - TidStoreCreate* has some memory clamps that are not covered: > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179 > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234 > > > > Maybe we could experiment with using 1MB for shared, and something > > smaller for local. > > I've confirmed that the local and shared tidstore with small max sizes > such as 4kB and 1MB worked. Currently the max size is hard-coded in > test_tidstore.c but if we use work_mem as the max size, we can pass > different max sizes for local and shared in the test script. Seems okay, do you want to try that and see how it looks? 
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 2896a6efc5..fdac103763 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -217,7 +217,6 @@
 #define RT_NODE_48_GET_CHILD RT_MAKE_NAME(node_48_get_child)
 #define RT_NODE_256_IS_CHUNK_USED RT_MAKE_NAME(node_256_is_chunk_used)
 #define RT_NODE_256_GET_CHILD RT_MAKE_NAME(node_256_get_child)
-#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
 #define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
 #define RT_NODE_SEARCH RT_MAKE_NAME(node_search)
 #define RT_NODE_DELETE RT_MAKE_NAME(node_delete)
@@ -320,9 +319,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE * tree);
 /* Mask for extracting a chunk from a key */
 #define RT_CHUNK_MASK ((1 << RT_SPAN) - 1)
 
-/* Maximum shift needed to extract a chunk from a key */
-#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
-
 /* Maximum level a tree can reach for a key */
 #define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_SPAN)
 
@@ -797,28 +793,15 @@ RT_NODE_256_GET_CHILD(RT_NODE_256 * node, uint8 chunk)
 	return &node->children[chunk];
 }
 
-/*
- * Return the smallest shift that will allowing storing the given key.
- */
-static inline int
-RT_KEY_GET_SHIFT(uint64 key)
-{
-	if (key == 0)
-		return 0;
-	else
-		return (pg_leftmost_one_pos64(key) / RT_SPAN) * RT_SPAN;
-}
-
 /*
  * Return the max value that can be stored in the tree with the given shift.
  */
 static uint64
 RT_SHIFT_GET_MAX_VAL(int shift)
 {
-	if (shift == RT_MAX_SHIFT)
-		return UINT64_MAX;
-	else
-		return (UINT64CONST(1) << (shift + RT_SPAN)) - 1;
+	int			max_shift = (sizeof(uint64) - 1) * BITS_PER_BYTE;
+
+	return UINT64_MAX >> (max_shift - shift);
 }
 
 /*
@@ -1574,9 +1557,8 @@ RT_NODE_INSERT(RT_RADIX_TREE * tree, RT_PTR_ALLOC * parent_slot, RT_CHILD_PTR no
  * and move the old node below it.
  */
 static pg_noinline void
-RT_EXTEND_UP(RT_RADIX_TREE * tree, uint64 key)
+RT_EXTEND_UP(RT_RADIX_TREE * tree, uint64 key, int target_shift)
 {
-	int			target_shift = RT_KEY_GET_SHIFT(key);
 	int			shift = tree->ctl->start_shift;
 
 	Assert(shift < target_shift);
@@ -1713,11 +1695,15 @@ RT_SET(RT_RADIX_TREE * tree, uint64 key, RT_VALUE_TYPE * value_p)
 	/* Extend the tree if necessary */
 	if (unlikely(key > tree->ctl->max_val))
 	{
+		int			start_shift;
+
+		/* compute the smallest shift that will allowing storing the key */
+		start_shift = pg_leftmost_one_pos64(key) / RT_SPAN * RT_SPAN;
+
 		if (tree->ctl->num_keys == 0)
 		{
 			RT_CHILD_PTR node;
 			RT_NODE_4  *n4;
-			int			start_shift = RT_KEY_GET_SHIFT(key);
 
 			/*
 			 * With an empty root node, we don't extend the tree upwards,
@@ -1738,7 +1724,7 @@ RT_SET(RT_RADIX_TREE * tree, uint64 key, RT_VALUE_TYPE * value_p)
 			goto have_slot;
 		}
 		else
-			RT_EXTEND_UP(tree, key);
+			RT_EXTEND_UP(tree, key, start_shift);
 	}
 
 	slot = RT_GET_SLOT_RECURSIVE(tree, &tree->ctl->root,
@@ -2937,7 +2923,6 @@ RT_DUMP_NODE(RT_NODE * node)
 #undef RT_SPAN
 #undef RT_NODE_MAX_SLOTS
 #undef RT_CHUNK_MASK
-#undef RT_MAX_SHIFT
 #undef RT_MAX_LEVEL
 #undef RT_GET_KEY_CHUNK
 #undef RT_BM_IDX
@@ -3032,7 +3017,6 @@ RT_DUMP_NODE(RT_NODE * node)
 #undef RT_NODE_48_GET_CHILD
 #undef RT_NODE_256_IS_CHUNK_USED
 #undef RT_NODE_256_GET_CHILD
-
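The branch-free RT_SHIFT_GET_MAX_VAL in the diff above is equivalent to the old branching form for every shift the tree can produce. A standalone sketch (uint64_t and UINT64_C stand in for PostgreSQL's uint64 and UINT64CONST):

```c
#include <stdint.h>

#define RT_SPAN 8
#define BITS_PER_BYTE 8

/* branch-free form from the patch: max_shift is 56, so shifting
 * UINT64_MAX right by (56 - shift) leaves shift + RT_SPAN one-bits */
static uint64_t
shift_get_max_val(int shift)
{
	int			max_shift = (sizeof(uint64_t) - 1) * BITS_PER_BYTE;

	return UINT64_MAX >> (max_shift - shift);
}

/* old branching form, valid only for shift < 56 because the
 * left shift would otherwise overflow */
static uint64_t
shift_get_max_val_old(int shift)
{
	return (UINT64_C(1) << (shift + RT_SPAN)) - 1;
}
```

Both agree at every multiple of RT_SPAN below 56, and the new form also covers shift == 56 without the RT_MAX_SHIFT special case.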
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Apr 25, 2024 at 1:38 PM Masahiko Sawada wrote: > > On Thu, Apr 25, 2024 at 12:17 PM John Naylor wrote: > > > > On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada > > wrote: > > > > > > > I saw a SIGSEGV there when using tidstore to write a fix for something > > > > else. > > > > Patch attached. > > > > > > Great find, thank you for the patch! > > > > +1 > > > > (This occurred to me a few days ago, but I was far from my computer.) > > > > With the purge function that Noah proposed, I believe we can also get > > rid of the comment at the top of the .sql test file warning of a > > maintenance hazard: > > ..."To avoid adding duplicates, > > -- each call to do_set_block_offsets() should use different block > > -- numbers." > > Good point. Removed. > > > > > > of do_gset_block_offset() and check_set_block_offsets(). If these are > > > annoying, we can remove the cases of array[1] and array[1,2]. > > > > Let's keep those -- 32-bit platforms should also exercise this path. > > Agreed. > > I've attached a new patch. I'll push it tonight, if there is no further > comment. > Pushed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Apr 25, 2024 at 12:17 PM John Naylor wrote: > > On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada wrote: > > > > > I saw a SIGSEGV there when using tidstore to write a fix for something > > > else. > > > Patch attached. > > > > Great find, thank you for the patch! > > +1 > > (This occurred to me a few days ago, but I was far from my computer.) > > With the purge function that Noah proposed, I believe we can also get > rid of the comment at the top of the .sql test file warning of a > maintenance hazard: > ..."To avoid adding duplicates, > -- each call to do_set_block_offsets() should use different block > -- numbers." Good point. Removed. > > > of do_gset_block_offset() and check_set_block_offsets(). If these are > > annoying, we can remove the cases of array[1] and array[1,2]. > > Let's keep those -- 32-bit platforms should also exercise this path. Agreed. I've attached a new patch. I'll push it tonight, if there is no further comment. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v2-0001-radixtree-Fix-SIGSEGV-at-update-of-embeddable-val.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada wrote: > > > I saw a SIGSEGV there when using tidstore to write a fix for something else. > > Patch attached. > > Great find, thank you for the patch! +1 (This occurred to me a few days ago, but I was far from my computer.) With the purge function that Noah proposed, I believe we can also get rid of the comment at the top of the .sql test file warning of a maintenance hazard: ..."To avoid adding duplicates, -- each call to do_set_block_offsets() should use different block -- numbers." I found that it doesn't add any measurable time to run the test. > The fix looks good to me. I think we can improve regression tests for > better coverage. In TidStore on a 64-bit machine, we can store 3 > offsets in the header and these values are embedded to the leaf page. > With more than 3 offsets, the value size becomes more than 16 bytes > and a single value leaf. Therefore, if we can add the test with the > array[1,2,3,4,100], we can cover the case of replacing a single-value > leaf with a different size new single-value leaf. Now we add 9 pairs Good idea. > of do_gset_block_offset() and check_set_block_offsets(). If these are > annoying, we can remove the cases of array[1] and array[1,2]. Let's keep those -- 32-bit platforms should also exercise this path.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Apr 25, 2024 at 6:03 AM Noah Misch wrote:
>
> On Mon, Apr 15, 2024 at 04:12:38PM +0700, John Naylor wrote:
> > - Some paths for single-value leaves are not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606
> >
> > However, these paths do get regression test coverage on 32-bit
> > machines. 64-bit builds only have leaves in the TID store, which
> > doesn't (currently) delete entries, and doesn't instantiate the tree
> > with the debug option.
> >
> > - In RT_SET "if (found)" is not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
> >
> > That's because we don't yet have code that replaces an existing value
> > with a value of a different length.
>
> I saw a SIGSEGV there when using tidstore to write a fix for something else.
> Patch attached.

Great find, thank you for the patch!

The fix looks good to me. I think we can improve the regression tests for
better coverage. In TidStore on a 64-bit machine, we can store 3 offsets in
the header and these values are embedded in the leaf page. With more than 3
offsets, the value size becomes more than 16 bytes and it becomes a
single-value leaf. Therefore, if we add the test with array[1,2,3,4,100], we
can cover the case of replacing a single-value leaf with a new single-value
leaf of a different size. Now we add 9 pairs of do_set_block_offsets() and
check_set_block_offsets(). If these are annoying, we can remove the cases of
array[1] and array[1,2].

I've attached a new patch. In addition to the new test case I mentioned,
I've added some new comments and removed an unnecessary added line in
test_tidstore.sql.
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com 0001-radixtree-Fix-SIGSEGV-at-update-of-embeddable-value-.patch Description: Binary data
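The size arithmetic behind the suggested array[1,2,3,4,100] case can be sketched with an illustrative model (entry_size is a hypothetical helper, not TidStore's actual code; the constants assume a 64-bit build with an 8-byte bitmapword):

```c
#include <stddef.h>

#define HEADER_SIZE		8	/* child-pointer-sized header: flags, nwords, 3 offsets */
#define BITMAPWORD_SIZE	8
#define BITS_PER_WORD	64

/* illustrative: bytes a BlocktableEntry-like value occupies */
static size_t
entry_size(int max_offset, int noffsets)
{
	if (noffsets <= 3)
		return HEADER_SIZE;		/* offsets embedded, no bitmap needed */

	/* otherwise a bitmap with enough words to cover max_offset */
	return HEADER_SIZE + (max_offset / BITS_PER_WORD + 1) * BITMAPWORD_SIZE;
}
```

Under this model, setting array[1,2,3,4] and then array[1,2,3,4,100] for the same block replaces one single-value leaf with a larger one, which is exactly the RT_SET "if (found)" path the fix touches.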
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Apr 15, 2024 at 6:12 PM John Naylor wrote: > > I took a look at the coverage report from [1] and it seems pretty > good, but there are a couple more tests we could do. Thank you for checking! > > - RT_KEY_GET_SHIFT is not covered for key=0: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803 > > That should be fairly simple to add to the tests. There are two paths to call RT_KEY_GET_SHIFT(): 1. RT_SET() -> RT_KEY_GET_SHIFT() 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT() In both cases, it's called when key > tree->ctl->max_val. Since the minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called when key=0. > > - Some paths for single-value leaves are not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904 > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954 > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606 > > However, these paths do get regression test coverage on 32-bit > machines. 64-bit builds only have leaves in the TID store, which > doesn't (currently) delete entries, and doesn't instantiate the tree > with the debug option. Right. > > - In RT_SET "if (found)" is not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768 > > That's because we don't yet have code that replaces an existing value > with a value of a different length. Noah reported an issue around that. We should incorporate the patch and cover this code path. > > - RT_FREE_RECURSE isn't well covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768 > > The TID store test is pretty simple as far as distribution of block > keys, and focuses more on the offset bitmaps. 
We could try to cover > all branches here, but it would make the test less readable, and it's > kind of the wrong place to do that anyway. test_radixtree.c does have > a commented-out option to use shared memory, but that's for local > testing and won't be reflected in the coverage report. Maybe it's > enough. Agreed. > > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644 > > That should be easy to add. Agreed. The patch is attached. > > - RT_DUMP_NODE is not covered, and never called by default anyway: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2804 > > It seems we could just leave it alone since it's debug-only, but it's > also a lot of lines. One idea is to use elog with DEBUG5 instead of > commenting out the call sites, but that would cause a lot of noise. I think we can leave it alone. > > - TidStoreCreate* has some memory clamps that are not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179 > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234 > > Maybe we could experiment with using 1MB for shared, and something > smaller for local. I've confirmed that the local and shared tidstore with small max sizes such as 4kB and 1MB worked. Currently the max size is hard-coded in test_tidstore.c but if we use work_mem as the max size, we can pass different max sizes for local and shared in the test script. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com improve_code_coverage_radixtree.patch Description: Binary data
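The start-shift analysis above can be exercised in isolation. In this sketch, leftmost_one_pos64() is a portable stand-in for PostgreSQL's pg_leftmost_one_pos64() (illustrative, not the real implementation):

```c
#include <stdint.h>

#define RT_SPAN 8				/* bits consumed per tree level, as in radixtree.h */

/* stand-in for pg_leftmost_one_pos64(): position of the highest set bit.
 * The real function requires word != 0, which holds on this path because
 * it is only reached when key > max_val, and max_val is at least 255. */
static int
leftmost_one_pos64(uint64_t word)
{
	int			pos = 0;

	while (word >>= 1)
		pos++;
	return pos;
}

/* smallest shift whose level can address the key's topmost nonzero chunk */
static int
start_shift_for_key(uint64_t key)
{
	return leftmost_one_pos64(key) / RT_SPAN * RT_SPAN;
}
```

Keys up to 255 need no extension (shift 0), key 256 needs one more level (shift 8), and the largest keys land on shift 56, which is why the key == 0 branch of RT_KEY_GET_SHIFT() was unreachable.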
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Apr 15, 2024 at 04:12:38PM +0700, John Naylor wrote:
> - Some paths for single-value leaves are not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606
>
> However, these paths do get regression test coverage on 32-bit
> machines. 64-bit builds only have leaves in the TID store, which
> doesn't (currently) delete entries, and doesn't instantiate the tree
> with the debug option.
>
> - In RT_SET "if (found)" is not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
>
> That's because we don't yet have code that replaces an existing value
> with a value of a different length.

I saw a SIGSEGV there when using tidstore to write a fix for something else.
Patch attached.

Author: Noah Misch
Commit: Noah Misch

    radixtree: Fix SIGSEGV at update of embeddable value to non-embeddable.

    Also, fix a memory leak when updating from non-embeddable to embeddable.
    Both were unreachable without adding C code.

    Reviewed by FIXME.
Discussion: https://postgr.es/m/FIXME diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h index dc4c00d..aa8f44c 100644 --- a/src/include/lib/radixtree.h +++ b/src/include/lib/radixtree.h @@ -1749,6 +1749,9 @@ have_slot: if (RT_VALUE_IS_EMBEDDABLE(value_p)) { + if (found && !RT_CHILDPTR_IS_VALUE(*slot)) + RT_FREE_LEAF(tree, *slot); + /* store value directly in child pointer slot */ memcpy(slot, value_p, value_sz); @@ -1765,7 +1768,7 @@ have_slot: { RT_CHILD_PTR leaf; - if (found) + if (found && !RT_CHILDPTR_IS_VALUE(*slot)) { Assert(RT_PTR_ALLOC_IS_VALID(*slot)); leaf.alloc = *slot; diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out index 06c610e..d116927 100644 --- a/src/test/modules/test_tidstore/expected/test_tidstore.out +++ b/src/test/modules/test_tidstore/expected/test_tidstore.out @@ -79,6 +79,96 @@ SELECT test_destroy(); (1 row) +-- Test replacements crossing RT_CHILDPTR_IS_VALUE in both directions +SELECT test_create(false); + test_create +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1]::int2[]); SELECT check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1,2]::int2[]); SELECT check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1,2,3]::int2[]); SELECT check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1,2,3,4]::int2[]); SELECT check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1,2,3]::int2[]); SELECT check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1,2]::int2[]); SELECT 
check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1]::int2[]); SELECT check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT test_destroy(); + test_destroy +-- + +(1 row) + -- Use shared memory this time. We can't do that in test_radixtree.sql, -- because unused static functions would raise warnings there. SELECT test_create(true); diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql index bb31877..704d869 100644 --- a/src/test/modules/test_tidstore/sql/test_tidstore.sql +++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql @@ -14,6 +14,7 @@ CREATE TEMP TABLE hideblocks(blockno bigint); -- We use a higher number to test tidstore. \set maxoffset 512 + SELECT test_create(false); -- Test on empty tidstore. @@ -50,6 +51,20 @@ SELECT t
Re: [PoC] Improve dead tuple storage for lazy vacuum
I took a look at the coverage report from [1] and it seems pretty good, but
there are a couple more tests we could do.

- RT_KEY_GET_SHIFT is not covered for key=0:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803

That should be fairly simple to add to the tests.

- Some paths for single-value leaves are not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606

However, these paths do get regression test coverage on 32-bit machines.
64-bit builds only have leaves in the TID store, which doesn't (currently)
delete entries, and doesn't instantiate the tree with the debug option.

- In RT_SET "if (found)" is not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768

That's because we don't yet have code that replaces an existing value with a
value of a different length.

- RT_FREE_RECURSE isn't well covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768

The TID store test is pretty simple as far as distribution of block keys, and
focuses more on the offset bitmaps. We could try to cover all branches here,
but it would make the test less readable, and it's kind of the wrong place to
do that anyway. test_radixtree.c does have a commented-out option to use
shared memory, but that's for local testing and won't be reflected in the
coverage report. Maybe it's enough.

- RT_DELETE: "if (key > tree->ctl->max_val)" is not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644

That should be easy to add.

- RT_DUMP_NODE is not covered, and never called by default anyway:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2804

It seems we could just leave it alone since it's debug-only, but it's also a
lot of lines. One idea is to use elog with DEBUG5 instead of commenting out
the call sites, but that would cause a lot of noise.

- TidStoreCreate* has some memory clamps that are not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179
https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234

Maybe we could experiment with using 1MB for shared, and something smaller
for local.

[1] https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Apr 8, 2024 at 7:26 PM John Naylor wrote:
>
> I pushed both of these and see that mylodon complains that anonymous
> unions are a C11 feature. I'm not actually sure that the union with
> uintptr_t is actually needed, though, since that's not accessed as
> such here. The simplest thing seems to get rid if the union and name
> the inner struct "header", as in the attached.

I pushed this with some comment adjustments.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Apr 8, 2024 at 7:42 PM Pavel Borisov wrote:
>> I pushed both of these and see that mylodon complains that anonymous
>> unions are a C11 feature. I'm not actually sure that the union with
>> uintptr_t is actually needed, though, since that's not accessed as
>> such here. The simplest thing seems to get rid if the union and name
>> the inner struct "header", as in the attached.
>
> Provided uintptr_t is not accessed it might be good to get rid of it.
>
> Maybe this patch also need correction in this:
> +#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber))

For full context the diff was

-#define NUM_FULL_OFFSETS ((sizeof(bitmapword) - sizeof(uint16)) / sizeof(OffsetNumber))
+#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber))

I wanted the former, from f35bd9bf35, to be independently useful (in case the
commit in question had some unresolvable issue), and its intent is to fill
struct padding when the array of bitmapword happens to have length zero.
Changing to uintptr_t for the size calculation reflects the intent to fit in
a (local) pointer, regardless of the size of a bitmapword. (If a DSA pointer
happens to be a different size for some odd platform, it should still work,
BTW.)

My thinking with the union was, for big-endian, to force the 'flags' member
to where it can be set, but thinking again, it should still work if by
happenstance the header was smaller than the child pointer: A different bit
would get tagged, but I believe that's irrelevant. The 'flags' member makes
sure a byte is reserved for the tag, but it may not be where the tag is
actually located, if that makes sense.
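The quoted macro's arithmetic can be checked with a tiny self-contained version (uint16_t stands in for OffsetNumber, which is 2 bytes on all platforms):

```c
#include <stdint.h>

typedef uint16_t OffsetNumber;	/* 2 bytes, as in PostgreSQL's off.h */

/* offsets that fit in a child-pointer-sized header after the one-byte
 * 'flags' and one-byte 'nwords' members */
#define NUM_FULL_OFFSETS \
	((sizeof(uintptr_t) - sizeof(uint8_t) - sizeof(int8_t)) / sizeof(OffsetNumber))
```

This yields (8 - 1 - 1) / 2 = 3 embedded offsets on 64-bit builds and (4 - 1 - 1) / 2 = 1 on 32-bit, matching "up to 3 offsets in the child pointer (1 on 32-bit platforms)" upthread.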
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, 8 Apr 2024 at 16:27, John Naylor wrote: > On Sun, Apr 7, 2024 at 9:08 AM John Naylor > wrote: > > > > I've attached a mostly-polished update on runtime embeddable values, > > storing up to 3 offsets in the child pointer (1 on 32-bit platforms). > > As discussed, this includes a macro to cap max possible offset that > > can be stored in the bitmap, which I believe only reduces the valid > > offset range for 32kB pages on 32-bit platforms. Even there, it allows > > for more line pointers than can possibly be useful. It also splits > > into two parts for readability. It would be committed in two pieces as > > well, since they are independently useful. > > I pushed both of these and see that mylodon complains that anonymous > unions are a C11 feature. I'm not actually sure that the union with > uintptr_t is actually needed, though, since that's not accessed as > such here. The simplest thing seems to get rid if the union and name > the inner struct "header", as in the attached. > Provided uintptr_t is not accessed it might be good to get rid of it. Maybe this patch also need correction in this: +#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber)) Regards, Pavel
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sun, Apr 7, 2024 at 9:08 AM John Naylor wrote:
>
> I've attached a mostly-polished update on runtime embeddable values,
> storing up to 3 offsets in the child pointer (1 on 32-bit platforms).
> As discussed, this includes a macro to cap max possible offset that
> can be stored in the bitmap, which I believe only reduces the valid
> offset range for 32kB pages on 32-bit platforms. Even there, it allows
> for more line pointers than can possibly be useful. It also splits
> into two parts for readability. It would be committed in two pieces as
> well, since they are independently useful.

I pushed both of these and see that mylodon complains that anonymous unions
are a C11 feature. I'm not actually sure that the union with uintptr_t is
actually needed, though, since that's not accessed as such here. The
simplest thing seems to be to get rid of the union and name the inner struct
"header", as in the attached.

diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index cddbaf013b..78730797d6 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -43,35 +43,30 @@
  */
 typedef struct BlocktableEntry
 {
-	union
+	struct
 	{
-		struct
-		{
 #ifndef WORDS_BIGENDIAN
-			/*
-			 * We need to position this member so that the backing radix tree
-			 * can use the lowest bit for a pointer tag. In particular, it
-			 * must be placed within 'header' so that it corresponds to the
-			 * lowest byte in 'ptr'. We position 'nwords' along with it to
-			 * avoid struct padding.
-			 */
-			uint8		flags;
-
-			int8		nwords;
+		/*
+		 * We need to position this member so that the backing radix tree can
+		 * use the lowest bit for a pointer tag. In particular, it must be
+		 * placed within 'header' so that it corresponds to the lowest byte in
+		 * 'ptr'. We position 'nwords' along with it to avoid struct padding.
+		 */
+		uint8		flags;
+
+		int8		nwords;
 #endif
-			/*
-			 * We can store a small number of offsets here to avoid wasting
-			 * space with a sparse bitmap.
-			 */
-			OffsetNumber full_offsets[NUM_FULL_OFFSETS];
+		/*
+		 * We can store a small number of offsets here to avoid wasting space
+		 * with a sparse bitmap.
+		 */
+		OffsetNumber full_offsets[NUM_FULL_OFFSETS];
 #ifdef WORDS_BIGENDIAN
-			int8		nwords;
-			uint8		flags;
+		int8		nwords;
+		uint8		flags;
 #endif
-		};
-		uintptr_t	ptr;
 	} header;
 
 	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];
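The 'flags' positioning discussed here is about where the tag byte lands inside the struct; the tagging itself is ordinary low-bit pointer tagging, sketched below (illustrative helpers; radixtree.h's actual macros such as RT_CHILDPTR_IS_VALUE differ in detail):

```c
#include <stdint.h>

/* Allocated child nodes are more than 1-byte aligned, so bit 0 of a
 * child pointer is always 0 and can be borrowed as a tag meaning
 * "this slot holds an embedded value, not a pointer". */
static inline uintptr_t
tag_as_value(uintptr_t v)
{
	return v | 1;
}

static inline int
is_embedded_value(uintptr_t slot)
{
	return (slot & 1) != 0;
}

static inline uintptr_t
strip_tag(uintptr_t slot)
{
	return slot & ~(uintptr_t) 1;
}
```

As John notes, it doesn't matter which byte of the struct supplies that low bit on a given endianness, as long as one byte is reserved for it.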
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, John!

On Mon, 8 Apr 2024 at 03:13, John Naylor wrote:
> On Mon, Apr 8, 2024 at 2:07 AM Andres Freund wrote:
> >
> > Looking at the code, the failure isn't surprising anymore:
> > char data[MaxBlocktableEntrySize];
> > BlocktableEntry *page = (BlocktableEntry *) data;
> >
> > 'char' doesn't enforce any alignment, but you're storing a BlocktableEntry in
> > a char[]. You can't just do that. Look at how we do that for
> > e.g. PGAlignedBlock.
> >
> > With the attached minimal fix, the tests pass again.
>
> Thanks, will push this shortly!

Buildfarm animal mylodon looks unhappy with this:

FAILED: src/backend/postgres_lib.a.p/access_common_tidstore.c.o
ccache clang-14 -Isrc/backend/postgres_lib.a.p -Isrc/include -I../pgsql/src/include -I/usr/include/libxml2 -I/usr/include/security -fdiagnostics-color=never -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -O2 -g -fno-strict-aliasing -fwrapv -D_GNU_SOURCE -Wmissing-prototypes -Wpointer-arith -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -Wdeclaration-after-statement -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -O1 -ggdb -g3 -fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -Wno-array-bounds -std=c99 -Wc11-extensions -Werror=c11-extensions -fPIC -isystem /usr/include/mit-krb5 -pthread -DBUILDING_DLL -MD -MQ src/backend/postgres_lib.a.p/access_common_tidstore.c.o -MF src/backend/postgres_lib.a.p/access_common_tidstore.c.o.d -o src/backend/postgres_lib.a.p/access_common_tidstore.c.o -c ../pgsql/src/backend/access/common/tidstore.c
../pgsql/src/backend/access/common/tidstore.c:48:3: error: anonymous structs are a C11 extension [-Werror,-Wc11-extensions]
  struct
  ^
1 error generated.

Regards,
Pavel Borisov
Supabase
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Apr 8, 2024 at 2:07 AM Andres Freund wrote: > > Looking at the code, the failure isn't surprising anymore: > char data[MaxBlocktableEntrySize]; > BlocktableEntry *page = (BlocktableEntry *) data; > > 'char' doesn't enforce any alignment, but you're storing a BlocktableEntry in > a char[]. You can't just do that. Look at how we do that for > e.g. PGAlignedBlock. > > > With the attached minimal fix, the tests pass again. Thanks, will push this shortly!
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2024-04-01 11:53:28 +0900, Masahiko Sawada wrote: > On Fri, Mar 29, 2024 at 4:21 PM John Naylor wrote: > > I've marked it Ready for Committer. > > Thank you! I've attached the patch that I'm going to push tomorrow. Locally I ran a 32bit build with ubsan enabled (by accident actually), which complains: performing post-bootstrap initialization ... --- stderr --- ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341:24: runtime error: member access within misaligned address 0xffb6258e for type 'struct BlocktableEntry', which requires 4 byte alignment 0xffb6258e: note: pointer points here 00 00 02 00 01 40 dc e9 83 0b 80 48 70 ee 00 00 00 00 00 00 00 01 17 00 00 00 f8 d4 a6 ee e8 25 ^ #0 0x814097e in TidStoreSetBlockOffsets ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341 #1 0x826560a in dead_items_add ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:2889 #2 0x825f8da in lazy_scan_prune ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:1502 #3 0x825da71 in lazy_scan_heap ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:977 #4 0x825ad8f in heap_vacuum_rel ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:499 #5 0x8697e97 in table_relation_vacuum ../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1725 #6 0x869fca6 in vacuum_rel ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:2206 #7 0x869a0fd in vacuum ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:622 #8 0x869986b in ExecVacuum ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:449 #9 0x8e5f832 in standard_ProcessUtility ../../../../../home/andres/src/postgresql/src/backend/tcop/utility.c:859 #10 0x8e5e5f6 in ProcessUtility ../../../../../home/andres/src/postgresql/src/backend/tcop/utility.c:523 #11 0x8e5b71a in PortalRunUtility 
../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:1158 #12 0x8e5be80 in PortalRunMulti ../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:1315 #13 0x8e59f9b in PortalRun ../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:791 #14 0x8e4d5f3 in exec_simple_query ../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:1274 #15 0x8e55159 in PostgresMain ../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4680 #16 0x8e54445 in PostgresSingleUserMain ../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4136 #17 0x88bb55e in main ../../../../../home/andres/src/postgresql/src/backend/main/main.c:194 #18 0xf76f47c4 (/lib/i386-linux-gnu/libc.so.6+0x237c4) (BuildId: fe79efe6681a919714a4e119da2baac3a4953fbf) #19 0xf76f4887 in __libc_start_main (/lib/i386-linux-gnu/libc.so.6+0x23887) (BuildId: fe79efe6681a919714a4e119da2baac3a4953fbf) #20 0x80d40f7 in _start (/srv/dev/build/postgres/m-dev-assert-32/tmp_install/srv/dev/install/postgres/m-dev-assert-32/bin/postgres+0x80d40f7) SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341:24 in Aborted (core dumped) child process exited with exit code 134 initdb: data directory "/srv/dev/build/postgres/m-dev-assert-32/tmp_install/initdb-template" not removed at user's request At first I was confused why CI didn't find this. Turns out that, for me, this is only triggered without compiler optimizations, and I had used -O0 while CI uses some optimizations. 
Backtrace: #9 0x0814097f in TidStoreSetBlockOffsets (ts=0xb8dfde4, blkno=15, offsets=0xffb6275c, num_offsets=11) at ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341 #10 0x0826560b in dead_items_add (vacrel=0xb8df6d4, blkno=15, offsets=0xffb6275c, num_offsets=11) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:2889 #11 0x0825f8db in lazy_scan_prune (vacrel=0xb8df6d4, buf=24, blkno=15, page=0xeeb6c000 "", vmbuffer=729, all_visible_according_to_vm=false, has_lpdead_items=0xffb62a1f) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:1502 #12 0x0825da72 in lazy_scan_heap (vacrel=0xb8df6d4) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:977 #13 0x0825ad90 in heap_vacuum_rel (rel=0xb872810, params=0xffb62e90, bstrategy=0xb99d5e0) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:499 #14 0x08697e98 in table_relation_vacuum (rel=0xb872810, params=0xffb62e90, bstrategy=0xb99d5e0) at ../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1725 #15 0x0869fca7 in vacuum_rel (relid=1249, relation=0x0, params=0xff
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sun, Apr 7, 2024 at 9:08 AM John Naylor wrote: > I've attached a mostly-polished update on runtime embeddable values, > storing up to 3 offsets in the child pointer (1 on 32-bit platforms). And...since there's a new bump context patch, I wanted to anticipate squeezing an update on top of that, if that gets committed. 0004/5 are the v6 bump context, and 0006 uses it for vacuum. The rest are to show it works -- the expected.out changes make possible problems in CI easier to see. The allocation size is 16 bytes, so this difference is entirely due to lack of chunk header: aset: 6619136 bump: 5047296 (Note: assert builds still have the chunk header for sanity checking, so this was done in a more optimized build) From e2333bad47974a22120a4de3fad804a21757f631 Mon Sep 17 00:00:00 2001 From: David Rowley Date: Tue, 20 Feb 2024 21:49:57 +1300 Subject: [PATCH v84 5/8] Introduce a bump memory allocator --- src/backend/nodes/gen_node_support.pl | 2 +- src/backend/utils/mmgr/Makefile | 1 + src/backend/utils/mmgr/bump.c | 818 ++ src/backend/utils/mmgr/mcxt.c | 15 +- src/backend/utils/mmgr/meson.build| 1 + src/include/nodes/memnodes.h | 3 +- src/include/utils/memutils.h | 7 + src/include/utils/memutils_internal.h | 18 +- src/tools/pgindent/typedefs.list | 2 + 9 files changed, 862 insertions(+), 5 deletions(-) create mode 100644 src/backend/utils/mmgr/bump.c diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl index d4244facbb..81df3bdf95 100644 --- a/src/backend/nodes/gen_node_support.pl +++ b/src/backend/nodes/gen_node_support.pl @@ -149,7 +149,7 @@ my @abstract_types = qw(Node); # they otherwise don't participate in node support. 
my @extra_tags = qw( IntList OidList XidList - AllocSetContext GenerationContext SlabContext + AllocSetContext GenerationContext SlabContext BumpContext TIDBitmap WindowObjectData ); diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile index dae3432c98..01a1fb8527 100644 --- a/src/backend/utils/mmgr/Makefile +++ b/src/backend/utils/mmgr/Makefile @@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global OBJS = \ alignedalloc.o \ aset.o \ + bump.o \ dsa.o \ freepage.o \ generation.o \ diff --git a/src/backend/utils/mmgr/bump.c b/src/backend/utils/mmgr/bump.c new file mode 100644 index 00..f98a203a0c --- /dev/null +++ b/src/backend/utils/mmgr/bump.c @@ -0,0 +1,818 @@ +/*- + * + * bump.c + * Bump allocator definitions. + * + * Bump is a MemoryContext implementation designed for memory usages which + * require allocating a large number of chunks, none of which ever need to be + * pfree'd or realloc'd. Chunks allocated by this context have no chunk header + * and operations which ordinarily require looking at the chunk header cannot + * be performed. For example, pfree, realloc, GetMemoryChunkSpace and + * GetMemoryChunkContext are all not possible with bump allocated chunks. The + * only way to release memory allocated by this context type is to reset or + * delete the context. + * + * Portions Copyright (c) 2023, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/backend/utils/mmgr/bump.c + * + * + * Bump is best suited to cases which require a large number of short-lived + * chunks where performance matters. Because bump allocated chunks don't + * have a chunk header, it can fit more chunks on each block. This means we + * can do more with less memory and fewer cache lines. The reason it's best + * suited for short-lived usages of memory is that ideally, pointers to bump + * allocated chunks won't be visible to a large amount of code. 
The more + * code that operates on memory allocated by this allocator, the more chances + * that some code will try to perform a pfree or one of the other operations + * which are made impossible due to the lack of chunk header. In order to + * to detect accidental usage of the various disallowed operations, we do add + * a MemoryChunk chunk header in MEMORY_CONTEXT_CHECKING builds and have the + * various disallowed functions raise an ERROR. + * + * Allocations are MAXALIGNed. + * + *- + */ + +#include "postgres.h" + +#include "lib/ilist.h" +#include "port/pg_bitutils.h" +#include "utils/memdebug.h" +#include "utils/memutils.h" +#include "utils/memutils_memorychunk.h" +#include "utils/memutils_internal.h" + +#define Bump_BLOCKHDRSZ MAXALIGN(sizeof(BumpBlock)) + +/* No chunk header unless built with MEMORY_CONTEXT_CHECKING */ +#ifdef MEMORY_CONTEXT_CHECKING +#define Bump_CHUNKHDRSZ sizeof(MemoryChunk) +#else +#define Bump_CHUNKHDRSZ 0 +#endif + +#define Bump_CHUNK_FRACTION 8 + +/* The keeper block is allocated in the same allocation as the set */ +#define KeeperBlock(set) ((BumpBlock *) ((char
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Apr 1, 2024 at 9:54 AM Masahiko Sawada wrote: > > Thank you! I've attached the patch that I'm going to push tomorrow. Excellent! I've attached a mostly-polished update on runtime embeddable values, storing up to 3 offsets in the child pointer (1 on 32-bit platforms). As discussed, this includes a macro to cap max possible offset that can be stored in the bitmap, which I believe only reduces the valid offset range for 32kB pages on 32-bit platforms. Even there, it allows for more line pointers than can possibly be useful. It also splits into two parts for readability. It would be committed in two pieces as well, since they are independently useful. From 24bd672deb4a6fa14abfc8583b500d1cbc332032 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Fri, 5 Apr 2024 17:04:12 +0700 Subject: [PATCH v83 1/3] store offsets in the header --- src/backend/access/common/tidstore.c | 52 +++ .../test_tidstore/expected/test_tidstore.out | 14 + .../test_tidstore/sql/test_tidstore.sql | 5 ++ 3 files changed, 71 insertions(+) diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c index e1a7e82469..4eb5d46951 100644 --- a/src/backend/access/common/tidstore.c +++ b/src/backend/access/common/tidstore.c @@ -34,6 +34,9 @@ /* number of active words for a page: */ #define WORDS_PER_PAGE(n) ((n) / BITS_PER_BITMAPWORD + 1) +/* number of offsets we can store in the header of a BlocktableEntry */ +#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint16)) / sizeof(OffsetNumber)) + /* * This is named similarly to PagetableEntry in tidbitmap.c * because the two have a similar function. @@ -41,6 +44,10 @@ typedef struct BlocktableEntry { uint16 nwords; + + /* We can store a small number of offsets here to avoid wasting space with a sparse bitmap. 
*/ + OffsetNumber full_offsets[NUM_FULL_OFFSETS]; + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; } BlocktableEntry; #define MaxBlocktableEntrySize \ @@ -316,6 +323,25 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, for (int i = 1; i < num_offsets; i++) Assert(offsets[i] > offsets[i - 1]); + memset(page, 0, offsetof(BlocktableEntry, words)); + + if (num_offsets <= NUM_FULL_OFFSETS) + { + for (int i = 0; i < num_offsets; i++) + { + OffsetNumber off = offsets[i]; + + /* safety check to ensure we don't overrun bit array bounds */ + if (!OffsetNumberIsValid(off)) +elog(ERROR, "tuple offset out of range: %u", off); + + page->full_offsets[i] = off; + } + + page->nwords = 0; + } + else + { for (wordnum = 0, next_word_threshold = BITS_PER_BITMAPWORD; wordnum <= WORDNUM(offsets[num_offsets - 1]); wordnum++, next_word_threshold += BITS_PER_BITMAPWORD) @@ -343,6 +369,7 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, page->nwords = wordnum; Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1])); +} if (TidStoreIsShared(ts)) shared_ts_set(ts->tree.shared, blkno, page); @@ -369,6 +396,18 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) if (page == NULL) return false; + if (page->nwords == 0) + { + /* we have offsets in the header */ + for (int i = 0; i < NUM_FULL_OFFSETS; i++) + { + if (page->full_offsets[i] == off) +return true; + } + return false; + } + else + { wordnum = WORDNUM(off); bitnum = BITNUM(off); @@ -378,6 +417,7 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) return (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0; } +} /* * Prepare to iterate through a TidStore. 
@@ -496,6 +536,17 @@ tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno, result->num_offsets = 0; result->blkno = blkno; + if (page->nwords == 0) + { + /* we have offsets in the header */ + for (int i = 0; i < NUM_FULL_OFFSETS; i++) + { + if (page->full_offsets[i] != InvalidOffsetNumber) +result->offsets[result->num_offsets++] = page->full_offsets[i]; + } + } + else + { for (wordnum = 0; wordnum < page->nwords; wordnum++) { bitmapword w = page->words[wordnum]; @@ -518,3 +569,4 @@ tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno, } } } +} diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out index 0ae2f970da..c40779859b 100644 --- a/src/test/modules/test_tidstore/expected/test_tidstore.out +++ b/src/test/modules/test_tidstore/expected/test_tidstore.out @@ -36,6 +36,20 @@ SELECT do_set_block_offsets(blk, array_agg(off)::int2[]) (VALUES (0), (1), (:maxblkno / 2), (:maxblkno - 1), (:maxblkno)) AS blocks(blk), (VALUES (1), (2), (:maxoffset / 2), (:maxoffset - 1), (:maxoffset)) AS offsets(off) GROUP BY blk; +-- Test offsets embedded in the bitmap header. +SELECT do_set_block_offsets(501, array[greatest((random() * :maxoffset)::int, 1)]::int2[]); + do_set_block_offsets +-- + 501 +(1 row) + +S
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 29, 2024 at 4:21 PM John Naylor wrote: > > On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada > wrote: > > I think the patch is in good shape. Do you have other comments or > > suggestions, John? > > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -1918,11 +1918,6 @@ include_dir 'conf.d' > too high. It may be useful to control for this by separately > setting . > > - > -Note that for the collection of dead tuple identifiers, > -VACUUM is only able to utilize up to a maximum of > -1GB of memory. > - > > > > This is mentioned twice for two different GUCs -- need to remove the > other one, too. Good catch, removed. > Other than that, I just have minor nits: > > - * The major space usage for vacuuming is storage for the array of dead TIDs > + * The major space usage for vacuuming is TID store, a storage for dead TIDs > > I think I've helped edit this sentence before, but I still don't quite > like it. I'm thinking now "is storage for the dead tuple IDs". > > - * set upper bounds on the number of TIDs we can keep track of at once. > + * set upper bounds on the maximum memory that can be used for keeping track > + * of dead TIDs at once. > > I think "maximum" is redundant with "upper bounds". Fixed. > > I also feel the commit message needs more "meat" -- we need to clearly > narrate the features and benefits. I've attached how I would write it, > but feel free to use what you like to match your taste. Well, that's much better than mine. > > I've marked it Ready for Committer. Thank you! I've attached the patch that I'm going to push tomorrow. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v82-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada wrote: > I think the patch is in good shape. Do you have other comments or > suggestions, John? --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1918,11 +1918,6 @@ include_dir 'conf.d' too high. It may be useful to control for this by separately setting . - -Note that for the collection of dead tuple identifiers, -VACUUM is only able to utilize up to a maximum of -1GB of memory. - This is mentioned twice for two different GUCs -- need to remove the other one, too. Other than that, I just have minor nits: - * The major space usage for vacuuming is storage for the array of dead TIDs + * The major space usage for vacuuming is TID store, a storage for dead TIDs I think I've helped edit this sentence before, but I still don't quite like it. I'm thinking now "is storage for the dead tuple IDs". - * set upper bounds on the number of TIDs we can keep track of at once. + * set upper bounds on the maximum memory that can be used for keeping track + * of dead TIDs at once. I think "maximum" is redundant with "upper bounds". I also feel the commit message needs more "meat" -- we need to clearly narrate the features and benefits. I've attached how I would write it, but feel free to use what you like to match your taste. I've marked it Ready for Committer. Use TID Store for storage of dead tuple IDs during lazy vacuum Previously, we used a simple array for storing dead tuple IDs during lazy vacuum, which had a number of problems: * The array used a single allocation and so was limited to 1GB. * The allocation was pessimistically sized according to table size. * Lookup with binary search was slow because of poor CPU cache and branch prediction behavior. This commit replaces that array with the TID store from commit XX. Since the backing radix tree makes small allocations as needed, the 1GB limit is now gone. 
Further, the total memory used is now often smaller by an order of magnitude or more, depending on the distribution of blocks and offsets. These two features should make multiple rounds of heap scanning and index cleanup an extremely rare event. TID lookup during index cleanup is also several times faster, even more so when index order is correlated with heap tuple order. Since there is no longer a predictable relationship between the number of dead tuples vacuumed and the space taken up by their TIDs, the number of tuples no longer provides any meaningful insights for users, nor is the maximum number predictable. For that reason this commit also changes to byte-based progress reporting, with the relevant columns of pg_stat_progress_vacuum renamed accordingly to max_dead_tuple_bytes and dead_tuple_bytes. For parallel vacuum, both the TID store and supplemental information specific to vacuum are shared among the parallel vacuum workers. As with the previous array, we don't take any locks on TidStore during parallel vacuum since writes are still only done by the leader process. XXX: bump catalog version Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 28, 2024 at 6:15 PM John Naylor wrote: > > On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada > wrote: > > > > Pushed the refactoring patch. > > > > I've attached the rebased vacuum improvement patch for cfbot. I > > mentioned in the commit message that this patch eliminates the 1GB > > limitation. > > > > I think the patch is in good shape. Do you have other comments or > > suggestions, John? > > I'll do another pass tomorrow, but first I wanted to get in another > slightly-challenging in-situ test. Thanks! > > About the "tuples missed" -- I didn't expect contention during this > test. I believe that's completely unrelated behavior, but wanted to > mention it anyway, since I found it confusing. I haven't investigated it enough, but bgwriter might be related to the contention. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada wrote: > > Pushed the refactoring patch. > > I've attached the rebased vacuum improvement patch for cfbot. I > mentioned in the commit message that this patch eliminates the 1GB > limitation. > > I think the patch is in good shape. Do you have other comments or > suggestions, John? I'll do another pass tomorrow, but first I wanted to get in another slightly-challenging in-situ test. On my humble laptop, I can still fit a table large enough to cause PG16 to choke on multiple rounds of index cleanup: drop table if exists test; create unlogged table test (a int, b uuid) with (autovacuum_enabled=false); insert into test (a,b) select i, gen_random_uuid() from generate_series(1,1000*1000*1000) i; create index on test (a); create index on test (b); delete from test; vacuum (verbose, truncate off, parallel 2) test; INFO: vacuuming "john.public.test" INFO: launched 1 parallel vacuum worker for index vacuuming (planned: 1) INFO: finished vacuuming "john.public.test": index scans: 1 pages: 0 removed, 6369427 remain, 6369427 scanned (100.00% of total) tuples: 97174 removed, 2826 remain, 0 are dead but not yet removable tuples missed: 2826 dead from 18 pages not removed due to cleanup lock contention removable cutoff: 771, which was 0 XIDs old when operation ended new relfrozenxid: 767, which is 4 XIDs ahead of previous value frozen: 0 pages from table (0.00% of total) had 0 tuples frozen index scan needed: 6369409 pages from table (100.00% of total) had 97174 dead item identifiers removed index "test_a_idx": pages: 2741898 in total, 2741825 newly deleted, 2741825 currently deleted, 0 reusable index "test_b_idx": pages: 3850387 in total, 3842056 newly deleted, 3842056 currently deleted, 0 reusable avg read rate: 159.740 MB/s, avg write rate: 161.726 MB/s buffer usage: 26367981 hits, 14958634 misses, 15144601 dirtied WAL usage: 3 records, 1 full page images, 2050 bytes system usage: CPU: user: 151.89 s, system: 193.54 s, elapsed: 
731.59 s Watching pg_stat_progress_vacuum, dead_tuple_bytes got up to 398458880. About the "tuples missed" -- I didn't expect contention during this test. I believe that's completely unrelated behavior, but wanted to mention it anyway, since I found it confusing.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 27, 2024 at 5:43 PM Masahiko Sawada wrote: > > On Wed, Mar 27, 2024 at 9:25 AM John Naylor wrote: > > > > On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada > > wrote: > > > > > > On Mon, Mar 25, 2024 at 3:25 PM John Naylor > > > wrote: > > > > > > > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada > > > > wrote: > > > > > > - * remaining LP_DEAD line pointers on the page in the dead_items > > > > - * array. These dead items include those pruned by lazy_scan_prune() > > > > - * as well we line pointers previously marked LP_DEAD. > > > > + * remaining LP_DEAD line pointers on the page in the dead_items. > > > > + * These dead items include those pruned by lazy_scan_prune() as well > > > > + * we line pointers previously marked LP_DEAD. > > > > > > > > Here maybe "into dead_items". > > > > - * remaining LP_DEAD line pointers on the page in the dead_items. > > + * remaining LP_DEAD line pointers on the page into the dead_items. > > > > Let me explain. It used to be "in the dead_items array." It is not an > > array anymore, so it was changed to "in the dead_items". dead_items is > > a variable name, and names don't take "the". "into dead_items" seems > > most natural to me, but there are other possible phrasings. > > Thanks for the explanation. I was distracted. Fixed in the latest patch. > > > > > > > > > Did you try it with 1MB m_w_m? > > > > > > > > > > I've incorporated the above comments and test results look good to me. > > > > > > > > Could you be more specific about what the test was? > > > > Does it work with 1MB m_w_m? > > > > > > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB. > > > > > > FYI other test cases I tested were: > > > > > > * m_w_m = 2199023254528 (maximum value) > > > initial: 1MB > > > max: 128GB > > > > > > * m_w_m = 64MB (default) > > > initial: 1MB > > > max: 8MB > > > > If the test was a vacuum, how big a table was needed to hit 128GB? 
> > I just checked how TIdStoreCreateLocal() calculated the initial and > max segment sizes while changing m_w_m, so didn't check how big > segments are actually allocated in the maximum value test case. > > > > > > > The existing comment slipped past my radar, but max_bytes is not a > > > > limit, it's a hint. Come to think of it, it never was a limit in the > > > > normal sense, but in earlier patches it was the criteria for reporting > > > > "I'm full" when asked. > > > > > > Updated the comment. > > > > + * max_bytes is not a limit; it's used to choose the memory block sizes of > > + * a memory context for TID storage in order for the total memory > > consumption > > + * not to be overshot a lot. The caller can use the max_bytes as the > > criteria > > + * for reporting whether it's full or not. > > > > This is good information. I suggest this edit: > > > > "max_bytes" is not an internally-enforced limit; it is used only as a > > hint to cap the memory block size of the memory context for TID > > storage. This reduces space wastage due to over-allocation. If the > > caller wants to monitor memory usage, it must compare its limit with > > the value reported by TidStoreMemoryUsage(). > > > > Other comments: > > Thanks for the suggestion! > > > > > v79-0002 looks good to me. > > > > v79-0003: > > > > "With this commit, when creating a shared TidStore, a dedicated DSA > > area is created for TID storage instead of using the provided DSA > > area." > > > > This is very subtle, but "the provided..." implies there still is one. > > -> "a provided..." > > > > + * Similar to TidStoreCreateLocal() but create a shared TidStore on a > > + * DSA area. The TID storage will live in the DSA area, and a memory > > + * context rt_context will have only meta data of the radix tree. > > > > -> "the memory context" > > Fixed in the latest patch. > > > > > I think you can go ahead and commit 0002 and 0003/4. 
> > I've pushed the 0002 (dsa init and max segment size) patch, and will > push the attached 0001 patch next. Pushed the refactoring patch. I've attached the rebased vacuum improvement patch for cfbot. I mentioned in the commit message that this patch eliminates the 1GB limitation. I think the patch is in good shape. Do you have other comments or suggestions, John? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v81-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 27, 2024 at 9:25 AM John Naylor wrote: > > On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada wrote: > > > > On Mon, Mar 25, 2024 at 3:25 PM John Naylor wrote: > > > > > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada > > > wrote: > > > > - * remaining LP_DEAD line pointers on the page in the dead_items > > > - * array. These dead items include those pruned by lazy_scan_prune() > > > - * as well we line pointers previously marked LP_DEAD. > > > + * remaining LP_DEAD line pointers on the page in the dead_items. > > > + * These dead items include those pruned by lazy_scan_prune() as well > > > + * we line pointers previously marked LP_DEAD. > > > > > > Here maybe "into dead_items". > > - * remaining LP_DEAD line pointers on the page in the dead_items. > + * remaining LP_DEAD line pointers on the page into the dead_items. > > Let me explain. It used to be "in the dead_items array." It is not an > array anymore, so it was changed to "in the dead_items". dead_items is > a variable name, and names don't take "the". "into dead_items" seems > most natural to me, but there are other possible phrasings. Thanks for the explanation. I was distracted. Fixed in the latest patch. > > > > > > Did you try it with 1MB m_w_m? > > > > > > > > I've incorporated the above comments and test results look good to me. > > > > > > Could you be more specific about what the test was? > > > Does it work with 1MB m_w_m? > > > > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB. > > > > FYI other test cases I tested were: > > > > * m_w_m = 2199023254528 (maximum value) > > initial: 1MB > > max: 128GB > > > > * m_w_m = 64MB (default) > > initial: 1MB > > max: 8MB > > If the test was a vacuum, how big a table was needed to hit 128GB? I just checked how TIdStoreCreateLocal() calculated the initial and max segment sizes while changing m_w_m, so didn't check how big segments are actually allocated in the maximum value test case. 
> > > > The existing comment slipped past my radar, but max_bytes is not a > > > limit, it's a hint. Come to think of it, it never was a limit in the > > > normal sense, but in earlier patches it was the criteria for reporting > > > "I'm full" when asked. > > > > Updated the comment. > > + * max_bytes is not a limit; it's used to choose the memory block sizes of > + * a memory context for TID storage in order for the total memory consumption > + * not to be overshot a lot. The caller can use the max_bytes as the criteria > + * for reporting whether it's full or not. > > This is good information. I suggest this edit: > > "max_bytes" is not an internally-enforced limit; it is used only as a > hint to cap the memory block size of the memory context for TID > storage. This reduces space wastage due to over-allocation. If the > caller wants to monitor memory usage, it must compare its limit with > the value reported by TidStoreMemoryUsage(). > > Other comments: Thanks for the suggestion! > > v79-0002 looks good to me. > > v79-0003: > > "With this commit, when creating a shared TidStore, a dedicated DSA > area is created for TID storage instead of using the provided DSA > area." > > This is very subtle, but "the provided..." implies there still is one. > -> "a provided..." > > + * Similar to TidStoreCreateLocal() but create a shared TidStore on a > + * DSA area. The TID storage will live in the DSA area, and a memory > + * context rt_context will have only meta data of the radix tree. > > -> "the memory context" Fixed in the latest patch. > > I think you can go ahead and commit 0002 and 0003/4. I've pushed the 0002 (dsa init and max segment size) patch, and will push the attached 0001 patch next. 
> > v79-0005: > > - bypass = (vacrel->lpdead_item_pages < threshold && > - vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L)); > + bypass = (vacrel->lpdead_item_pages < threshold) && > + TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L); > > The parentheses look strange, and the first line shouldn't change > without a good reason. Fixed. > > - /* Set dead_items space */ > - dead_items = (VacDeadItems *) shm_toc_lookup(toc, > - PARALLEL_VACUUM_KEY_DEAD_ITEMS, > - false); > + /* Set dead items */ > + dead_items = TidStoreAttach(shared->dead_items_dsa_handle, > + shared->dead_items_handle); > > I feel ambivalent about this comment change. The original is not very > descriptive to begin with. If we need to change at all, maybe "find > dead_items in shared memory"? Agreed. > > v79-0005: As I said earlier, Dilip Kumar reviewed an earlier version. > > v79-0006: > > vac_work_mem should also go back to being an int. Fixed. I've attached the latest patches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v80-0001-Rethink-create-and-attach-APIs-of-shared-TidStor.patch Description: Binary data v80-0002-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada wrote: > > On Mon, Mar 25, 2024 at 3:25 PM John Naylor wrote: > > > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada > > wrote: > > - * remaining LP_DEAD line pointers on the page in the dead_items > > - * array. These dead items include those pruned by lazy_scan_prune() > > - * as well we line pointers previously marked LP_DEAD. > > + * remaining LP_DEAD line pointers on the page in the dead_items. > > + * These dead items include those pruned by lazy_scan_prune() as well > > + * we line pointers previously marked LP_DEAD. > > > > Here maybe "into dead_items". - * remaining LP_DEAD line pointers on the page in the dead_items. + * remaining LP_DEAD line pointers on the page into the dead_items. Let me explain. It used to be "in the dead_items array." It is not an array anymore, so it was changed to "in the dead_items". dead_items is a variable name, and names don't take "the". "into dead_items" seems most natural to me, but there are other possible phrasings. > > > > Did you try it with 1MB m_w_m? > > > > > > I've incorporated the above comments and test results look good to me. > > > > Could you be more specific about what the test was? > > Does it work with 1MB m_w_m? > > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB. > > FYI other test cases I tested were: > > * m_w_m = 2199023254528 (maximum value) > initial: 1MB > max: 128GB > > * m_w_m = 64MB (default) > initial: 1MB > max: 8MB If the test was a vacuum, how big a table was needed to hit 128GB? > > The existing comment slipped past my radar, but max_bytes is not a > > limit, it's a hint. Come to think of it, it never was a limit in the > > normal sense, but in earlier patches it was the criteria for reporting > > "I'm full" when asked. > > Updated the comment. 
+ * max_bytes is not a limit; it's used to choose the memory block sizes of + * a memory context for TID storage in order for the total memory consumption + * not to be overshot a lot. The caller can use the max_bytes as the criteria + * for reporting whether it's full or not. This is good information. I suggest this edit: "max_bytes" is not an internally-enforced limit; it is used only as a hint to cap the memory block size of the memory context for TID storage. This reduces space wastage due to over-allocation. If the caller wants to monitor memory usage, it must compare its limit with the value reported by TidStoreMemoryUsage(). Other comments: v79-0002 looks good to me. v79-0003: "With this commit, when creating a shared TidStore, a dedicated DSA area is created for TID storage instead of using the provided DSA area." This is very subtle, but "the provided..." implies there still is one. -> "a provided..." + * Similar to TidStoreCreateLocal() but create a shared TidStore on a + * DSA area. The TID storage will live in the DSA area, and a memory + * context rt_context will have only meta data of the radix tree. -> "the memory context" I think you can go ahead and commit 0002 and 0003/4. v79-0005: - bypass = (vacrel->lpdead_item_pages < threshold && - vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L)); + bypass = (vacrel->lpdead_item_pages < threshold) && + TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L); The parentheses look strange, and the first line shouldn't change without a good reason. - /* Set dead_items space */ - dead_items = (VacDeadItems *) shm_toc_lookup(toc, - PARALLEL_VACUUM_KEY_DEAD_ITEMS, - false); + /* Set dead items */ + dead_items = TidStoreAttach(shared->dead_items_dsa_handle, + shared->dead_items_handle); I feel ambivalent about this comment change. The original is not very descriptive to begin with. If we need to change at all, maybe "find dead_items in shared memory"? 
v79-0005: As I said earlier, Dilip Kumar reviewed an earlier version. v79-0006: vac_work_mem should also go back to being an int.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 25, 2024 at 3:25 PM John Naylor wrote: > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada > wrote: > > > > On Thu, Mar 21, 2024 at 7:48 PM John Naylor wrote: > > > > v77-0001 > > > > > > - dead_items = (VacDeadItems *) > > > palloc(vac_max_items_to_alloc_size(max_items)); > > > - dead_items->max_items = max_items; > > > - dead_items->num_items = 0; > > > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); > > > + > > > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); > > > + dead_items_info->max_bytes = vac_work_mem * 1024L; > > > > > > This is confusing enough that it looks like a bug: > > > > > > [inside TidStoreCreate()] > > > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ > > > while (16 * maxBlockSize > max_bytes * 1024L) > > > maxBlockSize >>= 1; > > > > > > This was copied from CreateWorkExprContext, which operates directly on > > > work_mem -- if the parameter is actually bytes, we can't "* 1024" > > > here. If we're passing something measured in kilobytes, the parameter > > > is badly named. Let's use convert once and use bytes everywhere. > > > > True. The attached 0001 patch fixes it. > > v78-0001 and 02 are fine, but for 0003 there is a consequence that I > didn't see mentioned: I think that the fix done in 0001 patch can be merged into 0003 patch. > vac_work_mem now refers to bytes, where before > it referred to kilobytes. It seems pretty confusing to use a different > convention from elsewhere, especially if it has the same name but > different meaning across versions. Worse, this change is buried inside > a moving-stuff-around diff, making it hard to see. Maybe "convert only > once" is still possible, but I was actually thinking of > > + dead_items_info->max_bytes = vac_work_mem * 1024L; > + vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0); > > That way it's pretty obvious that it's correct. 
That may require a bit > of duplication and moving around for shmem, but there is some of that > already. Agreed. > > More on 0003: > > - * The major space usage for vacuuming is storage for the array of dead TIDs > + * The major space usage for vacuuming is TidStore, a storage for dead TIDs > > + * autovacuum_work_mem) memory space to keep track of dead TIDs. If the > + * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to > vacuum > > I wonder if the comments here should refer to it using a more natural > spelling, like "TID store". > > - * items in the dead_items array for later vacuuming, count live and > + * items in the dead_items for later vacuuming, count live and > > Maybe "the dead_items area", or "the dead_items store" or "in dead_items"? > > - * remaining LP_DEAD line pointers on the page in the dead_items > - * array. These dead items include those pruned by lazy_scan_prune() > - * as well we line pointers previously marked LP_DEAD. > + * remaining LP_DEAD line pointers on the page in the dead_items. > + * These dead items include those pruned by lazy_scan_prune() as well > + * we line pointers previously marked LP_DEAD. > > Here maybe "into dead_items". > > Also, "we line pointers" seems to be a pre-existing typo. > > - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", > - vacrel->relname, (long long) index, vacuumed_pages))); > + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers > in %u pages", > + vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages))); > > This is a translated message, so let's keep the message the same. > > /* > * Allocate dead_items (either using palloc, or in dynamic shared memory). > * Sets dead_items in vacrel for caller. > * > * Also handles parallel initialization as part of allocating dead_items in > * DSM when required. > */ > static void > dead_items_alloc(LVRelState *vacrel, int nworkers) > > This comment didn't change at all. 
It's not wrong, but let's consider > updating the specifics. Fixed above comments. > v78-0005: > > "Although commit XXX > allowed specifying the initial and maximum DSA segment sizes, callers > still needed to clamp their own limits, which was not consistent and > user-friendly." > > Perhaps s/still needed/would have needed/ ..., since we're preventing > that necessity. > > > > Did you try it with 1MB m_w_m? > > > > I've incorporated the above comments and test results look good to me. > > Could you be more specific about what the test was? > Does it work with 1MB m_w_m? If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB. FYI other test cases I tested were: * m_w_m = 2199023254528 (maximum value) initial: 1MB max: 128GB * m_w_m = 64MB (default) initial: 1MB max: 8MB > > + /* > + * Choose the initial and maximum DSA segment sizes to be no longer > + * than 1/16 and 1/8 of max_bytes, respectively. If the initial > + * segment size is low, we end up having many segments, which risks > + * exceeding the total number of segments the
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada wrote: > > On Thu, Mar 21, 2024 at 7:48 PM John Naylor wrote: > > v77-0001 > > > > - dead_items = (VacDeadItems *) > > palloc(vac_max_items_to_alloc_size(max_items)); > > - dead_items->max_items = max_items; > > - dead_items->num_items = 0; > > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); > > + > > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); > > + dead_items_info->max_bytes = vac_work_mem * 1024L; > > > > This is confusing enough that it looks like a bug: > > > > [inside TidStoreCreate()] > > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ > > while (16 * maxBlockSize > max_bytes * 1024L) > > maxBlockSize >>= 1; > > > > This was copied from CreateWorkExprContext, which operates directly on > > work_mem -- if the parameter is actually bytes, we can't "* 1024" > > here. If we're passing something measured in kilobytes, the parameter > > is badly named. Let's use convert once and use bytes everywhere. > > True. The attached 0001 patch fixes it. v78-0001 and 02 are fine, but for 0003 there is a consequence that I didn't see mentioned: vac_work_mem now refers to bytes, where before it referred to kilobytes. It seems pretty confusing to use a different convention from elsewhere, especially if it has the same name but different meaning across versions. Worse, this change is buried inside a moving-stuff-around diff, making it hard to see. Maybe "convert only once" is still possible, but I was actually thinking of + dead_items_info->max_bytes = vac_work_mem * 1024L; + vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0); That way it's pretty obvious that it's correct. That may require a bit of duplication and moving around for shmem, but there is some of that already. 
More on 0003: - * The major space usage for vacuuming is storage for the array of dead TIDs + * The major space usage for vacuuming is TidStore, a storage for dead TIDs + * autovacuum_work_mem) memory space to keep track of dead TIDs. If the + * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum I wonder if the comments here should refer to it using a more natural spelling, like "TID store". - * items in the dead_items array for later vacuuming, count live and + * items in the dead_items for later vacuuming, count live and Maybe "the dead_items area", or "the dead_items store" or "in dead_items"? - * remaining LP_DEAD line pointers on the page in the dead_items - * array. These dead items include those pruned by lazy_scan_prune() - * as well we line pointers previously marked LP_DEAD. + * remaining LP_DEAD line pointers on the page in the dead_items. + * These dead items include those pruned by lazy_scan_prune() as well + * we line pointers previously marked LP_DEAD. Here maybe "into dead_items". Also, "we line pointers" seems to be a pre-existing typo. - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", - vacrel->relname, (long long) index, vacuumed_pages))); + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers in %u pages", + vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages))); This is a translated message, so let's keep the message the same. /* * Allocate dead_items (either using palloc, or in dynamic shared memory). * Sets dead_items in vacrel for caller. * * Also handles parallel initialization as part of allocating dead_items in * DSM when required. */ static void dead_items_alloc(LVRelState *vacrel, int nworkers) This comment didn't change at all. It's not wrong, but let's consider updating the specifics. 
v78-0004: > > +#define dsa_create(tranch_id) \ > > + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE) > > > > Since these macros are now referring to defaults, maybe their name > > should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE > > (*_MAX_*) > > It makes sense to rename DSA_INITIAL_SEGMENT_SIZE , but I think that > the DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current > name also makes sense to me. Right, that makes sense. v78-0005: "Although commit XXX allowed specifying the initial and maximum DSA segment sizes, callers still needed to clamp their own limits, which was not consistent and user-friendly." Perhaps s/still needed/would have needed/ ..., since we're preventing that necessity. > > Did you try it with 1MB m_w_m? > > I've incorporated the above comments and test results look good to me. Could you be more specific about what the test was? Does it work with 1MB m_w_m? + /* + * Choose the initial and maximum DSA segment sizes to be no longer + * than 1/16 and 1/8 of max_bytes, respectively. If the initial + * segment size is low, we end up having many segments, which risks + * exceeding the total number of segments the platform can have. The second sentence is technically correct, but I'm not sure how it relates to the cod
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 25, 2024 at 10:13 AM Tom Lane wrote: > > Masahiko Sawada writes: > > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane wrote: > >> I think the point here is that if you start with an arbitrary > >> non-negative shift value, the preceding loop may in fact decrement it > >> down to something less than zero before exiting, in which case we > >> would indeed have trouble. I suspect that the code is making > >> undocumented assumptions about the possible initial values of shift. > >> Maybe some Asserts would be good? Also, if we're effectively assuming > >> that shift must be exactly zero here, why not let the compiler > >> hard-code that? > > > Sounds like a good solution. I've attached the patch for that. > > Personally I'd put the Assert immediately after the loop, because > it's not related to the "Reserve slot for the value" comment. > Seems reasonable otherwise. > Thanks. Pushed the fix after moving the Assert. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
John Naylor writes: > Hmm, before 30e144287 this code only ran in a test module, is it > possible Coverity would not find it there? That could indeed explain why Coverity didn't see it. I'm not sure how our community run is set up, but it may not build the test modules. regards, tom lane
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 25, 2024 at 8:02 AM Masahiko Sawada wrote: > > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane wrote: > > > > I'm not sure why it took a couple weeks for Coverity to notice > > ee1b30f12, but it saw it today, and it's not happy: > > Hmm, I've also done Coverity Scan in development but I wasn't able to > see this one for some reason... Hmm, before 30e144287 this code only ran in a test module, is it possible Coverity would not find it there?
Re: [PoC] Improve dead tuple storage for lazy vacuum
Masahiko Sawada writes: > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane wrote: >> I think the point here is that if you start with an arbitrary >> non-negative shift value, the preceding loop may in fact decrement it >> down to something less than zero before exiting, in which case we >> would indeed have trouble. I suspect that the code is making >> undocumented assumptions about the possible initial values of shift. >> Maybe some Asserts would be good? Also, if we're effectively assuming >> that shift must be exactly zero here, why not let the compiler >> hard-code that? > Sounds like a good solution. I've attached the patch for that. Personally I'd put the Assert immediately after the loop, because it's not related to the "Reserve slot for the value" comment. Seems reasonable otherwise. regards, tom lane
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 25, 2024 at 1:53 AM Tom Lane wrote: > > John Naylor writes: > > Done. I pushed this with a few last-minute cosmetic adjustments. This > > has been a very long time coming, but we're finally in the home > > stretch! Thank you for the report. > > I'm not sure why it took a couple weeks for Coverity to notice > ee1b30f12, but it saw it today, and it's not happy: Hmm, I've also done Coverity Scan in development but I wasn't able to see this one for some reason... > > /srv/coverity/git/pgsql-git/postgresql/src/include/lib/radixtree.h: 1621 in > local_ts_extend_down() > 1615node = child; > 1616shift -= RT_SPAN; > 1617} > 1618 > 1619/* Reserve slot for the value. */ > 1620n4 = (RT_NODE_4 *) node.local; > >>> CID 1594658: Integer handling issues (BAD_SHIFT) > >>> In expression "key >> shift", shifting by a negative amount has > >>> undefined behavior. The shift amount, "shift", is as little as -7. > 1621n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); > 1622n4->base.count = 1; > 1623 > 1624return &n4->children[0]; > 1625 } > 1626 > > I think the point here is that if you start with an arbitrary > non-negative shift value, the preceding loop may in fact decrement it > down to something less than zero before exiting, in which case we > would indeed have trouble. I suspect that the code is making > undocumented assumptions about the possible initial values of shift. > Maybe some Asserts would be good? Also, if we're effectively assuming > that shift must be exactly zero here, why not let the compiler > hard-code that? > > - n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); > + n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0); Sounds like a good solution. I've attached the patch for that. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com fix_radixtree.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
John Naylor writes: > Done. I pushed this with a few last-minute cosmetic adjustments. This > has been a very long time coming, but we're finally in the home > stretch! I'm not sure why it took a couple weeks for Coverity to notice ee1b30f12, but it saw it today, and it's not happy: /srv/coverity/git/pgsql-git/postgresql/src/include/lib/radixtree.h: 1621 in local_ts_extend_down() 1615node = child; 1616shift -= RT_SPAN; 1617} 1618 1619/* Reserve slot for the value. */ 1620n4 = (RT_NODE_4 *) node.local; >>> CID 1594658: Integer handling issues (BAD_SHIFT) >>> In expression "key >> shift", shifting by a negative amount has >>> undefined behavior. The shift amount, "shift", is as little as -7. 1621n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); 1622n4->base.count = 1; 1623 1624return &n4->children[0]; 1625 } 1626 I think the point here is that if you start with an arbitrary non-negative shift value, the preceding loop may in fact decrement it down to something less than zero before exiting, in which case we would indeed have trouble. I suspect that the code is making undocumented assumptions about the possible initial values of shift. Maybe some Asserts would be good? Also, if we're effectively assuming that shift must be exactly zero here, why not let the compiler hard-code that? - n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); + n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0); regards, tom lane
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 7:48 PM John Naylor wrote: > > On Thu, Mar 21, 2024 at 4:03 PM Masahiko Sawada wrote: > > > > I've looked into this idea further. Overall, it looks clean and I > > don't see any problem so far in terms of integration with lazy vacuum. > > I've attached three patches for discussion and tests. > > Seems okay in the big picture, it's the details we need to be careful of. > > v77-0001 > > - dead_items = (VacDeadItems *) > palloc(vac_max_items_to_alloc_size(max_items)); > - dead_items->max_items = max_items; > - dead_items->num_items = 0; > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); > + > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); > + dead_items_info->max_bytes = vac_work_mem * 1024L; > > This is confusing enough that it looks like a bug: > > [inside TidStoreCreate()] > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ > while (16 * maxBlockSize > max_bytes * 1024L) > maxBlockSize >>= 1; > > This was copied from CreateWorkExprContext, which operates directly on > work_mem -- if the parameter is actually bytes, we can't "* 1024" > here. If we're passing something measured in kilobytes, the parameter > is badly named. Let's use convert once and use bytes everywhere. True. The attached 0001 patch fixes it. > > v77-0002: > > +#define dsa_create(tranch_id) \ > + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE) > > Since these macros are now referring to defaults, maybe their name > should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE > (*_MAX_*) It makes sense to rename DSA_INITIAL_SEGMENT_SIZE , but I think that the DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current name also makes sense to me. > > +/* The minimum size of a DSM segment. */ > +#define DSA_MIN_SEGMENT_SIZE ((size_t) 1024) > > That's a *lot* smaller than it is now. Maybe 256kB? We just want 1MB > m_w_m to work correctly. Fixed. 
> > v77-0003: > > +/* Public APIs to create local or shared TidStore */ > + > +TidStore * > +TidStoreCreateLocal(size_t max_bytes) > +{ > + return tidstore_create_internal(max_bytes, false, 0); > +} > + > +TidStore * > +TidStoreCreateShared(size_t max_bytes, int tranche_id) > +{ > + return tidstore_create_internal(max_bytes, true, tranche_id); > +} > > I don't think these operations have enough in common to justify > sharing even an internal implementation. Choosing aset block size is > done for both memory types, but it's pointless to do it for shared > memory, because the local context is then only used for small > metadata. > > + /* > + * Choose the DSA initial and max segment sizes to be no longer than > + * 1/16 and 1/8 of max_bytes, respectively. > + */ > > I'm guessing the 1/8 here because the number of segments is limited? I > know these numbers are somewhat arbitrary, but readers will wonder why > one has 1/8 and the other has 1/16. > > + if (dsa_init_size < DSA_MIN_SEGMENT_SIZE) > + dsa_init_size = DSA_MIN_SEGMENT_SIZE; > + if (dsa_max_size < DSA_MAX_SEGMENT_SIZE) > + dsa_max_size = DSA_MAX_SEGMENT_SIZE; > > The second clamp seems against the whole point of this patch -- it > seems they should all be clamped bigger than the DSA_MIN_SEGMENT_SIZE? > Did you try it with 1MB m_w_m? I've incorporated the above comments and test results look good to me. I've attached several patches: - 0002 is a minor fix for tidstore I found. - 0005 changes the create APIs of tidstore. - 0006 updates the vacuum improvement patch to use the new TidStoreCreateLocal/Shared() APIs. 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v78-0005-Rethink-create-and-attach-APIs-of-shared-TidStor.patch Description: Binary data v78-0004-Allow-specifying-initial-and-maximum-segment-siz.patch Description: Binary data v78-0003-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data v78-0006-Adjust-the-vacuum-improvement-patch-to-new-TidSt.patch Description: Binary data v78-0002-Fix-an-inconsistent-function-prototype-with-the-.patch Description: Binary data v78-0001-Fix-a-calculation-in-TidStoreCreate.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 4:03 PM Masahiko Sawada wrote: > > I've looked into this idea further. Overall, it looks clean and I > don't see any problem so far in terms of integration with lazy vacuum. > I've attached three patches for discussion and tests. Seems okay in the big picture, it's the details we need to be careful of. v77-0001 - dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items)); - dead_items->max_items = max_items; - dead_items->num_items = 0; + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); + + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); + dead_items_info->max_bytes = vac_work_mem * 1024L; This is confusing enough that it looks like a bug: [inside TidStoreCreate()] /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ while (16 * maxBlockSize > max_bytes * 1024L) maxBlockSize >>= 1; This was copied from CreateWorkExprContext, which operates directly on work_mem -- if the parameter is actually bytes, we can't "* 1024" here. If we're passing something measured in kilobytes, the parameter is badly named. Let's convert once and use bytes everywhere. Note: This was not another pass over the whole vacuum patch, just looking at the issue at hand. Also for later: Dilip Kumar reviewed an earlier version. v77-0002: +#define dsa_create(tranch_id) \ + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE) Since these macros are now referring to defaults, maybe their name should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE (*_MAX_*) +/* The minimum size of a DSM segment. */ +#define DSA_MIN_SEGMENT_SIZE ((size_t) 1024) That's a *lot* smaller than it is now. Maybe 256kB? We just want 1MB m_w_m to work correctly. 
v77-0003: +/* Public APIs to create local or shared TidStore */ + +TidStore * +TidStoreCreateLocal(size_t max_bytes) +{ + return tidstore_create_internal(max_bytes, false, 0); +} + +TidStore * +TidStoreCreateShared(size_t max_bytes, int tranche_id) +{ + return tidstore_create_internal(max_bytes, true, tranche_id); +} I don't think these operations have enough in common to justify sharing even an internal implementation. Choosing aset block size is done for both memory types, but it's pointless to do it for shared memory, because the local context is then only used for small metadata. + /* + * Choose the DSA initial and max segment sizes to be no longer than + * 1/16 and 1/8 of max_bytes, respectively. + */ I'm guessing the 1/8 here because the number of segments is limited? I know these numbers are somewhat arbitrary, but readers will wonder why one has 1/8 and the other has 1/16. + if (dsa_init_size < DSA_MIN_SEGMENT_SIZE) + dsa_init_size = DSA_MIN_SEGMENT_SIZE; + if (dsa_max_size < DSA_MAX_SEGMENT_SIZE) + dsa_max_size = DSA_MAX_SEGMENT_SIZE; The second clamp seems against the whole point of this patch -- it seems they should all be clamped bigger than the DSA_MIN_SEGMENT_SIZE? Did you try it with 1MB m_w_m?
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 4:35 PM John Naylor wrote: > > On Thu, Mar 21, 2024 at 1:11 PM Masahiko Sawada wrote: > > > Or we can have a new function for dsa.c to set the initial and max > > segment size (or either one) to the existing DSA area so that > > TidStoreCreate() can specify them at creation. > > I didn't like this very much, because it's splitting an operation > across an API boundary. The caller already has all the information it > needs when it creates the DSA. Straw man proposal: it could do the > same for local memory, then they'd be more similar. But if we made > local contexts the responsibility of the caller, that would cause > duplication between creating and resetting. Fair point. > > > In shared TidStore > > cases, since all memory required by shared radix tree is allocated in > > the passed-in DSA area and the memory usage is the total segment size > > allocated in the DSA area > > ...plus apparently some overhead, I just found out today, but that's > beside the point. > > On Thu, Mar 21, 2024 at 2:02 PM Masahiko Sawada wrote: > > > > Yet another idea is that TidStore creates its own DSA area in > > TidStoreCreate(). That is, In TidStoreCreate() we create a DSA area > > (using dsa_create()) and pass it to RT_CREATE(). Also, we need a new > > API to get the DSA area. The caller (e.g. parallel vacuum) gets the > > dsa_handle of the DSA and stores it in the shared memory (e.g. in > > PVShared). TidStoreAttach() will take two arguments: dsa_handle for > > the DSA area and dsa_pointer for the shared radix tree. This idea > > still requires controlling min/max segment sizes since dsa_create() > > uses the 1MB as the initial segment size. But the TidStoreCreate() > > would be more user friendly. > > This seems like an overall simplification, aside from future size > configuration, so +1 to continue looking into this. If we go this > route, I'd like to avoid a boolean parameter and cleanly separate > TidStoreCreateLocal() and TidStoreCreateShared(). 
Every operation > after that can introspect, but it's a bit awkward to force these cases > into the same function. It always was a little bit, but this change > makes it more so. I've looked into this idea further. Overall, it looks clean and I don't see any problem so far in terms of integration with lazy vacuum. I've attached three patches for discussion and tests. - 0001 patch makes lazy vacuum use of tidstore. - 0002 patch makes DSA init/max segment size configurable (borrowed from another thread). - 0003 patch makes TidStore create its own DSA area with init/max DSA segment adjustment (PoC patch). One thing unclear to me is that this idea will be usable even when we want to use the tidstore for parallel bitmap scan. Currently, we create a shared tidbitmap on a DSA area in ParallelExecutorInfo. This DSA area is used not only for tidbitmap but also for parallel hash etc. If the tidstore created its own DSA area, parallel bitmap scan would have to use the tidstore's DSA in addition to the DSA area in ParallelExecutorInfo. I'm not sure if there are some differences between these usages in terms of resource manager etc. It seems no problem but I might be missing something. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v77-0003-PoC-Make-shared-TidStore-create-its-own-DSA-area.patch Description: Binary data v77-0002-Make-DSA-initial-and-maximum-segment-size-config.patch Description: Binary data v77-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 1:11 PM Masahiko Sawada wrote: > Or we can have a new function for dsa.c to set the initial and max > segment size (or either one) to the existing DSA area so that > TidStoreCreate() can specify them at creation. I didn't like this very much, because it's splitting an operation across an API boundary. The caller already has all the information it needs when it creates the DSA. Straw man proposal: it could do the same for local memory, then they'd be more similar. But if we made local contexts the responsibility of the caller, that would cause duplication between creating and resetting. > In shared TidStore > cases, since all memory required by shared radix tree is allocated in > the passed-in DSA area and the memory usage is the total segment size > allocated in the DSA area ...plus apparently some overhead, I just found out today, but that's beside the point. On Thu, Mar 21, 2024 at 2:02 PM Masahiko Sawada wrote: > > Yet another idea is that TidStore creates its own DSA area in > TidStoreCreate(). That is, In TidStoreCreate() we create a DSA area > (using dsa_create()) and pass it to RT_CREATE(). Also, we need a new > API to get the DSA area. The caller (e.g. parallel vacuum) gets the > dsa_handle of the DSA and stores it in the shared memory (e.g. in > PVShared). TidStoreAttach() will take two arguments: dsa_handle for > the DSA area and dsa_pointer for the shared radix tree. This idea > still requires controlling min/max segment sizes since dsa_create() > uses the 1MB as the initial segment size. But the TidStoreCreate() > would be more user friendly. This seems like an overall simplification, aside from future size configuration, so +1 to continue looking into this. If we go this route, I'd like to avoid a boolean parameter and cleanly separate TidStoreCreateLocal() and TidStoreCreateShared(). Every operation after that can introspect, but it's a bit awkward to force these cases into the same function. 
It always was a little bit, but this change makes it more so.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 3:10 PM Masahiko Sawada wrote: > > On Thu, Mar 21, 2024 at 12:40 PM John Naylor wrote: > > > > On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada > > wrote: > > > > > > On Wed, Mar 20, 2024 at 11:19 PM John Naylor > > > wrote: > > > > Are they (the blocks to be precise) really out of order? The VALUES > > > > statement is ordered, but after inserting it does not output that way. > > > > I wondered if this is platform independent, but CI and our dev > > > > machines haven't failed this test, and I haven't looked into what > > > > determines the order. It's easy enough to hide the blocks if we ever > > > > need to, as we do elsewhere... > > > > > > It seems not necessary as such a test is already covered by > > > test_radixtree. I've changed the query to hide the output blocks. > > > > Okay. > > > > > The buildfarm has been all-green so far. > > > > Great! > > > > > I've attached the latest vacuum improvement patch. > > > > > > I just remembered that the tidstore cannot still be used for parallel > > > vacuum with minimum maintenance_work_mem. Even when the shared > > > tidstore is empty, its memory usage reports 1056768 bytes, a bit above > > > 1MB (1048576 bytes). We need something discussed on another thread[1] > > > in order to make it work. > > > > For exactly this reason, we used to have a clamp on max_bytes when it > > was internal to tidstore, so that it never reported full when first > > created, so I guess that got thrown away when we got rid of the > > control object in shared memory. Forcing callers to clamp their own > > limits seems pretty unfriendly, though. > > Or we can have a new function for dsa.c to set the initial and max > segment size (or either one) to the existing DSA area so that > TidStoreCreate() can specify them at creation. 
In shared TidStore > cases, since all memory required by shared radix tree is allocated in > the passed-in DSA area and the memory usage is the total segment size > allocated in the DSA area, the user will have to prepare a DSA area > only for the shared tidstore. So we might be able to expect that the > DSA passed-in to TidStoreCreate() is empty and its segment sizes can > be adjustable. Yet another idea is that TidStore creates its own DSA area in TidStoreCreate(). That is, in TidStoreCreate() we create a DSA area (using dsa_create()) and pass it to RT_CREATE(). Also, we need a new API to get the DSA area. The caller (e.g. parallel vacuum) gets the dsa_handle of the DSA and stores it in the shared memory (e.g. in PVShared). TidStoreAttach() will take two arguments: dsa_handle for the DSA area and dsa_pointer for the shared radix tree. This idea still requires controlling min/max segment sizes since dsa_create() uses 1MB as the initial segment size. But TidStoreCreate() would be more user-friendly. I've attached a PoC patch for discussion. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com tidstore_creates_dsa.patch.nocfbot Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 12:40 PM John Naylor wrote: > > On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada wrote: > > > > On Wed, Mar 20, 2024 at 11:19 PM John Naylor > > wrote: > > > Are they (the blocks to be precise) really out of order? The VALUES > > > statement is ordered, but after inserting it does not output that way. > > > I wondered if this is platform independent, but CI and our dev > > > machines haven't failed this test, and I haven't looked into what > > > determines the order. It's easy enough to hide the blocks if we ever > > > need to, as we do elsewhere... > > > > It seems not necessary as such a test is already covered by > > test_radixtree. I've changed the query to hide the output blocks. > > Okay. > > > The buildfarm has been all-green so far. > > Great! > > > I've attached the latest vacuum improvement patch. > > > > I just remembered that the tidstore cannot still be used for parallel > > vacuum with minimum maintenance_work_mem. Even when the shared > > tidstore is empty, its memory usage reports 1056768 bytes, a bit above > > 1MB (1048576 bytes). We need something discussed on another thread[1] > > in order to make it work. > > For exactly this reason, we used to have a clamp on max_bytes when it > was internal to tidstore, so that it never reported full when first > created, so I guess that got thrown away when we got rid of the > control object in shared memory. Forcing callers to clamp their own > limits seems pretty unfriendly, though. Or we can have a new function for dsa.c to set the initial and max segment size (or either one) to the existing DSA area so that TidStoreCreate() can specify them at creation. In shared TidStore cases, since all memory required by shared radix tree is allocated in the passed-in DSA area and the memory usage is the total segment size allocated in the DSA area, the user will have to prepare a DSA area only for the shared tidstore. 
So we might be able to expect that the DSA passed in to TidStoreCreate() is empty and that its segment sizes are adjustable. > > The proposals in that thread are pretty simple. If those don't move > forward soon, a hackish workaround would be to round down the number > we get from dsa_get_total_size to the nearest megabyte. Then > controlling min/max segment size would be a nice-to-have for PG17, not > a prerequisite. Interesting idea. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada wrote: > > On Wed, Mar 20, 2024 at 11:19 PM John Naylor wrote: > > Are they (the blocks to be precise) really out of order? The VALUES > > statement is ordered, but after inserting it does not output that way. > > I wondered if this is platform independent, but CI and our dev > > machines haven't failed this test, and I haven't looked into what > > determines the order. It's easy enough to hide the blocks if we ever > > need to, as we do elsewhere... > > It seems not necessary as such a test is already covered by > test_radixtree. I've changed the query to hide the output blocks. Okay. > The buildfarm has been all-green so far. Great! > I've attached the latest vacuum improvement patch. > > I just remembered that the tidstore cannot still be used for parallel > vacuum with minimum maintenance_work_mem. Even when the shared > tidstore is empty, its memory usage reports 1056768 bytes, a bit above > 1MB (1048576 bytes). We need something discussed on another thread[1] > in order to make it work. For exactly this reason, we used to have a clamp on max_bytes when it was internal to tidstore, so that it never reported full when first created, so I guess that got thrown away when we got rid of the control object in shared memory. Forcing callers to clamp their own limits seems pretty unfriendly, though. The proposals in that thread are pretty simple. If those don't move forward soon, a hackish workaround would be to round down the number we get from dsa_get_total_size to the nearest megabyte. Then controlling min/max segment size would be a nice-to-have for PG17, not a prerequisite.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 20, 2024 at 11:19 PM John Naylor wrote: > > On Wed, Mar 20, 2024 at 8:30 PM Masahiko Sawada wrote: > > I forgot to report the results. Yes, I did some tests where I inserted > > many TIDs to make the tidstore use several GB memory. I did two cases: > > > > 1. insert 100M blocks of TIDs with an offset of 100. > > 2. insert 10M blocks of TIDs with an offset of 2048. > > > > The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup > > and iteration results were expected. > > Thanks for confirming! > > > While reviewing the codes again, the following two things caught my eyes: > > > > in check_set_block_offset() function, we don't take a lock on the > > tidstore while checking all possible TIDs. I'll add > > TidStoreLockShare() and TidStoreUnlock() as follows: > > > > + TidStoreLockShare(tidstore); > > if (TidStoreIsMember(tidstore, &tid)) > > ItemPointerSet(&items.lookup_tids[num_lookup_tids++], > > blkno, offset); > > + TidStoreUnlock(tidstore); > > In one sense, all locking in the test module is useless since there is > only a single process. On the other hand, it seems good to at least > run what we have written to run it trivially, and serve as an example > of usage. We should probably be consistent, and document at the top > that the locks are pro-forma only. Agreed. > > > Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take > > a lock on the shared tidstore since dsa_get_total_size() (called by > > RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it > > in the comment as follows: > > > > -/* Return the memory usage of TidStore */ > > +/* > > + * Return the memory usage of TidStore. > > + * > > + * In shared TidStore cases, since shared_ts_memory_usage() does > > appropriate > > + * locking, the caller doesn't need to take a lock. > > + */ > > > > What do you think? > > That duplicates the underlying comment on the radix tree function that > this calls, so I'm inclined to leave it out. 
At this level it's > probably best to document when a caller _does_ need to take an action. Okay, I didn't change it. > > One thing I forgot to ask about earlier: > > +-- Add tids in out of order. > > Are they (the blocks to be precise) really out of order? The VALUES > statement is ordered, but after inserting it does not output that way. > I wondered if this is platform independent, but CI and our dev > machines haven't failed this test, and I haven't looked into what > determines the order. It's easy enough to hide the blocks if we ever > need to, as we do elsewhere... It seems unnecessary, as such a test is already covered by test_radixtree. I've changed the query to hide the output blocks. I've pushed the tidstore patch after incorporating the above changes. In addition to that, I've added the following changes before the push: - Added src/test/modules/test_tidstore/.gitignore file. - Removed unnecessary #include from tidstore.c. The buildfarm has been all-green so far. I've attached the latest vacuum improvement patch. I just remembered that the tidstore still cannot be used for parallel vacuum with the minimum maintenance_work_mem. Even when the shared tidstore is empty, its memory usage reports 1056768 bytes, a bit above 1MB (1048576 bytes). We need something discussed on another thread[1] in order to make it work. Regards, [1] https://www.postgresql.org/message-id/CAD21AoCVMw6DSmgZY9h%2BxfzKtzJeqWiwxaUD2T-FztVcV-XibQ%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v76-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 20, 2024 at 8:30 PM Masahiko Sawada wrote: > I forgot to report the results. Yes, I did some tests where I inserted > many TIDs to make the tidstore use several GB memory. I did two cases: > > 1. insert 100M blocks of TIDs with an offset of 100. > 2. insert 10M blocks of TIDs with an offset of 2048. > > The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup > and iteration results were expected. Thanks for confirming! > While reviewing the codes again, the following two things caught my eyes: > > in check_set_block_offset() function, we don't take a lock on the > tidstore while checking all possible TIDs. I'll add > TidStoreLockShare() and TidStoreUnlock() as follows: > > + TidStoreLockShare(tidstore); > if (TidStoreIsMember(tidstore, &tid)) > ItemPointerSet(&items.lookup_tids[num_lookup_tids++], > blkno, offset); > + TidStoreUnlock(tidstore); In one sense, all locking in the test module is useless since there is only a single process. On the other hand, it seems good to at least run what we have written to run it trivially, and serve as an example of usage. We should probably be consistent, and document at the top that the locks are pro-forma only. It's both a blessing and a curse that vacuum only has a single writer. It makes development less of a hassle, but also means that tidstore locking is done for API-completeness reasons, not (yet) as a practical necessity. Even tidbitmap.c's hash table currently has a single writer, and while using tidstore for that is still an engineering challenge for other reasons, it wouldn't exercise locking meaningfully, either, at least at first. > Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take > a lock on the shared tidstore since dsa_get_total_size() (called by > RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it > in the comment as follows: > > -/* Return the memory usage of TidStore */ > +/* > + * Return the memory usage of TidStore. 
> + * > + * In shared TidStore cases, since shared_ts_memory_usage() does appropriate > + * locking, the caller doesn't need to take a lock. > + */ > > What do you think? That duplicates the underlying comment on the radix tree function that this calls, so I'm inclined to leave it out. At this level it's probably best to document when a caller _does_ need to take an action. One thing I forgot to ask about earlier: +-- Add tids in out of order. Are they (the blocks to be precise) really out of order? The VALUES statement is ordered, but after inserting it does not output that way. I wondered if this is platform independent, but CI and our dev machines haven't failed this test, and I haven't looked into what determines the order. It's easy enough to hide the blocks if we ever need to, as we do elsewhere...
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 20, 2024 at 3:48 PM John Naylor wrote: > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada > wrote: > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor wrote: > > > Locally (not CI), we should try big inputs to make sure we can > > > actually go up to many GB -- it's easier and faster this way than > > > having vacuum give us a large data set. > > > > I'll do these tests. > > I just remembered this -- did any of this kind of testing happen? I > can do it as well. I forgot to report the results. Yes, I did some tests where I inserted many TIDs to make the tidstore use several GB memory. I did two cases: 1. insert 100M blocks of TIDs with an offset of 100. 2. insert 10M blocks of TIDs with an offset of 2048. The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup and iteration results were expected. > > > Thank you. I've incorporated all the comments above. I've attached the > > latest patches, and am going to push them (one by one) after > > self-review again. > > One more cosmetic thing in 0001 that caught my eye: > > diff --git a/src/backend/access/common/Makefile > b/src/backend/access/common/Makefile > index b9aff0ccfd..67b8cc6108 100644 > --- a/src/backend/access/common/Makefile > +++ b/src/backend/access/common/Makefile > @@ -27,6 +27,7 @@ OBJS = \ > syncscan.o \ > toast_compression.o \ > toast_internals.o \ > + tidstore.o \ > tupconvert.o \ > tupdesc.o > > diff --git a/src/backend/access/common/meson.build > b/src/backend/access/common/meson.build > index 725041a4ce..a02397855e 100644 > --- a/src/backend/access/common/meson.build > +++ b/src/backend/access/common/meson.build > @@ -15,6 +15,7 @@ backend_sources += files( >'syncscan.c', >'toast_compression.c', >'toast_internals.c', > + 'tidstore.c', >'tupconvert.c', >'tupdesc.c', > ) > > These aren't in alphabetical order. Good catch. I'll fix them before the push. 
While reviewing the code again, the following two things caught my eye: in the check_set_block_offset() function, we don't take a lock on the tidstore while checking all possible TIDs. I'll add TidStoreLockShare() and TidStoreUnlock() as follows: + TidStoreLockShare(tidstore); if (TidStoreIsMember(tidstore, &tid)) ItemPointerSet(&items.lookup_tids[num_lookup_tids++], blkno, offset); + TidStoreUnlock(tidstore); --- Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take a lock on the shared tidstore since dsa_get_total_size() (called by RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it in the comment as follows: -/* Return the memory usage of TidStore */ +/* + * Return the memory usage of TidStore. + * + * In shared TidStore cases, since shared_ts_memory_usage() does appropriate + * locking, the caller doesn't need to take a lock. + */ What do you think? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada wrote: > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor wrote: > > Locally (not CI), we should try big inputs to make sure we can > > actually go up to many GB -- it's easier and faster this way than > > having vacuum give us a large data set. > > I'll do these tests. I just remembered this -- did any of this kind of testing happen? I can do it as well. > Thank you. I've incorporated all the comments above. I've attached the > latest patches, and am going to push them (one by one) after > self-review again. One more cosmetic thing in 0001 that caught my eye: diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile index b9aff0ccfd..67b8cc6108 100644 --- a/src/backend/access/common/Makefile +++ b/src/backend/access/common/Makefile @@ -27,6 +27,7 @@ OBJS = \ syncscan.o \ toast_compression.o \ toast_internals.o \ + tidstore.o \ tupconvert.o \ tupdesc.o diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build index 725041a4ce..a02397855e 100644 --- a/src/backend/access/common/meson.build +++ b/src/backend/access/common/meson.build @@ -15,6 +15,7 @@ backend_sources += files( 'syncscan.c', 'toast_compression.c', 'toast_internals.c', + 'tidstore.c', 'tupconvert.c', 'tupdesc.c', ) These aren't in alphabetical order.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 19, 2024 at 6:40 PM John Naylor wrote: > > On Tue, Mar 19, 2024 at 10:24 AM Masahiko Sawada > wrote: > > > > On Tue, Mar 19, 2024 at 8:35 AM John Naylor wrote: > > > > > > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada > > > wrote: > > > > > > > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor > > > > wrote: > > > > It might also be worth reducing the number of blocks in the random > > > test -- multiple runs will have different offsets anyway. > > > > Yes. If we reduce the number of blocks from 1000 to 100, the > > regression test took on my environment: > > > > 1000 blocks : 516 ms > > 100 blocks : 228 ms > > Sounds good. > > > Removed some unnecessary variables in 0002 patch. > > Looks good. > > > So the MaxBlocktableEntrySize calculation would be as follows? > > > > #define MaxBlocktableEntrySize \ > > offsetof(BlocktableEntry, words) + \ > > (sizeof(bitmapword) * \ > > WORDS_PER_PAGE(Min(MaxOffsetNumber, \ > >BITS_PER_BITMAPWORD * PG_INT8_MAX - 1 > > > > I've made this change in the 0003 patch. > > This is okay, but one side effect is that we have both an assert and > an elog, for different limits. I think we'll need a separate #define > to help. But for now, I don't want to hold up tidstore further with > this because I believe almost everything else in v74 is in pretty good > shape. I'll save this for later as a part of the optimization I > proposed. > > Remaining things I noticed: > > +#define RT_PREFIX local_rt > +#define RT_PREFIX shared_rt > > Prefixes for simplehash, for example, don't have "sh" -- maybe > "local/shared_ts" > > + /* MemoryContext where the radix tree uses */ > > s/where/that/ > > +/* > + * Lock support functions. > + * > + * We can use the radix tree's lock for shared TidStore as the data we > + * need to protect is only the shared radix tree. > + */ > +void > +TidStoreLockExclusive(TidStore *ts) > > Talking about multiple things, so maybe a blank line after the comment. 
> > With those, I think you can go ahead and squash all the tidstore > patches except for 0003 and commit it. > > > While reviewing the vacuum patch, I realized that we always pass > > LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related > > to the tidstore is therefore always the same. I think it would be > > better to make the caller of TidStoreCreate() specify the tranch_id > > and pass it to RT_CREATE(). That way, the caller can specify their own > > wait event for tidstore. The 0008 patch tried this idea. dshash.c does > > the same idea. > > Sounds reasonable. I'll just note that src/include/storage/lwlock.h > still has an entry for LWTRANCHE_SHARED_TIDSTORE. Thank you. I've incorporated all the comments above. I've attached the latest patches, and am going to push them (one by one) after self-review again. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v75-0001-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch Description: Binary data v75-0002-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 19, 2024 at 10:24 AM Masahiko Sawada wrote: > > On Tue, Mar 19, 2024 at 8:35 AM John Naylor wrote: > > > > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada > > wrote: > > > > > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor > > > wrote: > > It might also be worth reducing the number of blocks in the random > > test -- multiple runs will have different offsets anyway. > > Yes. If we reduce the number of blocks from 1000 to 100, the > regression test took on my environment: > > 1000 blocks : 516 ms > 100 blocks : 228 ms Sounds good. > Removed some unnecessary variables in 0002 patch. Looks good. > So the MaxBlocktableEntrySize calculation would be as follows? > > #define MaxBlocktableEntrySize \ > offsetof(BlocktableEntry, words) + \ > (sizeof(bitmapword) * \ > WORDS_PER_PAGE(Min(MaxOffsetNumber, \ >BITS_PER_BITMAPWORD * PG_INT8_MAX - 1 > > I've made this change in the 0003 patch. This is okay, but one side effect is that we have both an assert and an elog, for different limits. I think we'll need a separate #define to help. But for now, I don't want to hold up tidstore further with this because I believe almost everything else in v74 is in pretty good shape. I'll save this for later as a part of the optimization I proposed. Remaining things I noticed: +#define RT_PREFIX local_rt +#define RT_PREFIX shared_rt Prefixes for simplehash, for example, don't have "sh" -- maybe "local/shared_ts" + /* MemoryContext where the radix tree uses */ s/where/that/ +/* + * Lock support functions. + * + * We can use the radix tree's lock for shared TidStore as the data we + * need to protect is only the shared radix tree. + */ +void +TidStoreLockExclusive(TidStore *ts) Talking about multiple things, so maybe a blank line after the comment. With those, I think you can go ahead and squash all the tidstore patches except for 0003 and commit it. 
> While reviewing the vacuum patch, I realized that we always pass > LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related > to the tidstore is therefore always the same. I think it would be > better to make the caller of TidStoreCreate() specify the tranch_id > and pass it to RT_CREATE(). That way, the caller can specify their own > wait event for tidstore. The 0008 patch tried this idea. dshash.c does > the same idea. Sounds reasonable. I'll just note that src/include/storage/lwlock.h still has an entry for LWTRANCHE_SHARED_TIDSTORE.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 19, 2024 at 8:35 AM John Naylor wrote: > > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada > wrote: > > > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor > > wrote: > > > > Random offsets is what I was thinking of (if made distinct and > > > ordered), but even there the code is fairy trivial, so I don't have a > > > strong feeling about it. > > > > Agreed. > > Looks good. > > A related thing I should mention is that the tests which look up all > possible offsets are really expensive with the number of blocks we're > using now (assert build): > > v70 0.33s > v72 1.15s > v73 1.32 > > To trim that back, I think we should give up on using shared memory > for the is-full test: We can cause aset to malloc a new block with a > lot fewer entries. In the attached, this brings it back down to 0.43s. Looks good. Agreed with this change. > It might also be worth reducing the number of blocks in the random > test -- multiple runs will have different offsets anyway. Yes. If we reduce the number of blocks from 1000 to 100, the regression test took on my environment: 1000 blocks : 516 ms 100 blocks : 228 ms > > > > I think we can stop including the debug-tid-store patch for CI now. > > > That would allow getting rid of some unnecessary variables. > > > > Agreed. > > Okay, all that remains here is to get rid of those variables (might be > just one). Removed some unnecessary variables in 0002 patch. > > > > + * Scan the TidStore and return a pointer to TidStoreIterResult that has > > > TIDs > > > + * in one block. We return the block numbers in ascending order and the > > > offset > > > + * numbers in each result is also sorted in ascending order. > > > + */ > > > +TidStoreIterResult * > > > +TidStoreIterateNext(TidStoreIter *iter) > > > > > > The wording is a bit awkward. > > > > Fixed. > > - * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs > - * in one block. 
We return the block numbers in ascending order and the > offset > - * numbers in each result is also sorted in ascending order. > + * Scan the TidStore and return the TIDs of the next block. The returned > block > + * numbers is sorted in ascending order, and the offset numbers in each > result > + * is also sorted in ascending order. > > Better, but it's still not very clear. Maybe "The offsets in each > iteration result are ordered, as are the block numbers over all > iterations." Thanks, fixed. > > > > +/* Extract TIDs from the given key-value pair */ > > > +static void > > > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, > > > BlocktableEntry *page) > > > > > > This is a leftover from the old encoding scheme. This should really > > > take a "BlockNumber blockno" not a "key", and the only call site > > > should probably cast the uint64 to BlockNumber. > > > > Fixed. > > This part looks good. I didn't notice earlier, but this comment has a > similar issue > > @@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter) > return NULL; > > /* Collect TIDs extracted from the key-value pair */ > - tidstore_iter_extract_tids(iter, key, page); > + tidstore_iter_extract_tids(iter, (BlockNumber) key, page); > > ..."extracted" was once a separate operation. I think just removing > that one word is enough to update it. Fixed. > > Some other review on code comments: > > v73-0001: > > + /* Enlarge the TID array if necessary */ > > It's "arrays" now. > > v73-0005: > > +-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has > +-- different randon 100 offset numbers each other. > > The numbers are obvious from the query. Maybe just mention that the > offsets are randomized and must be unique and ordered. > > + * The caller is responsible for release any locks. > > "releasing" Fixed. 
> > > > +typedef struct BlocktableEntry > > > +{ > > > + uint16 nwords; > > > + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; > > > +} BlocktableEntry; > > > > > > In my WIP for runtime-embeddable offsets, nwords needs to be one byte. > > I should be more clear here: nwords fitting into one byte allows 3 > embedded offsets (1 on 32-bit platforms, which is good for testing at > least). With uint16 nwords that reduces to 2 (none on 32-bit > platforms). Further, after the current patch series is fully > committed, I plan to split the embedded-offset patch into two parts: > The first would store the offsets in the header, but would still need > a (smaller) allocation. The second would embed them in the child > pointer. Only the second patch will care about the size of nwords > because it needs to reserve a byte for the pointer tag. Thank you for the clarification. > > > > That doesn't have any real-world affect on the largest offset > > > encountered, and only in 32-bit builds with 32kB block size would the > > > theoretical max change at all. To be precise, we could use in the > > > MaxBlocktableEntrySize calculation: > > > > > > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1); > > > > I don't get thi
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada wrote: > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor wrote: > > Random offsets is what I was thinking of (if made distinct and > > ordered), but even there the code is fairy trivial, so I don't have a > > strong feeling about it. > > Agreed. Looks good. A related thing I should mention is that the tests which look up all possible offsets are really expensive with the number of blocks we're using now (assert build): v70 0.33s v72 1.15s v73 1.32 To trim that back, I think we should give up on using shared memory for the is-full test: We can cause aset to malloc a new block with a lot fewer entries. In the attached, this brings it back down to 0.43s. It might also be worth reducing the number of blocks in the random test -- multiple runs will have different offsets anyway. > > I think we can stop including the debug-tid-store patch for CI now. > > That would allow getting rid of some unnecessary variables. > > Agreed. Okay, all that remains here is to get rid of those variables (might be just one). > > + * Scan the TidStore and return a pointer to TidStoreIterResult that has > > TIDs > > + * in one block. We return the block numbers in ascending order and the > > offset > > + * numbers in each result is also sorted in ascending order. > > + */ > > +TidStoreIterResult * > > +TidStoreIterateNext(TidStoreIter *iter) > > > > The wording is a bit awkward. > > Fixed. - * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs - * in one block. We return the block numbers in ascending order and the offset - * numbers in each result is also sorted in ascending order. + * Scan the TidStore and return the TIDs of the next block. The returned block + * numbers is sorted in ascending order, and the offset numbers in each result + * is also sorted in ascending order. Better, but it's still not very clear. Maybe "The offsets in each iteration result are ordered, as are the block numbers over all iterations." 
> > +/* Extract TIDs from the given key-value pair */ > > +static void > > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, > > BlocktableEntry *page) > > > > This is a leftover from the old encoding scheme. This should really > > take a "BlockNumber blockno" not a "key", and the only call site > > should probably cast the uint64 to BlockNumber. > > Fixed. This part looks good. I didn't notice earlier, but this comment has a similar issue @@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter) return NULL; /* Collect TIDs extracted from the key-value pair */ - tidstore_iter_extract_tids(iter, key, page); + tidstore_iter_extract_tids(iter, (BlockNumber) key, page); ..."extracted" was once a separate operation. I think just removing that one word is enough to update it. Some other review on code comments: v73-0001: + /* Enlarge the TID array if necessary */ It's "arrays" now. v73-0005: +-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has +-- different randon 100 offset numbers each other. The numbers are obvious from the query. Maybe just mention that the offsets are randomized and must be unique and ordered. + * The caller is responsible for release any locks. "releasing" > > +typedef struct BlocktableEntry > > +{ > > + uint16 nwords; > > + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; > > +} BlocktableEntry; > > > > In my WIP for runtime-embeddable offsets, nwords needs to be one byte. I should be more clear here: nwords fitting into one byte allows 3 embedded offsets (1 on 32-bit platforms, which is good for testing at least). With uint16 nwords that reduces to 2 (none on 32-bit platforms). Further, after the current patch series is fully committed, I plan to split the embedded-offset patch into two parts: The first would store the offsets in the header, but would still need a (smaller) allocation. The second would embed them in the child pointer. 
Only the second patch will care about the size of nwords because it needs to reserve a byte for the pointer tag. > > That doesn't have any real-world affect on the largest offset > > encountered, and only in 32-bit builds with 32kB block size would the > > theoretical max change at all. To be precise, we could use in the > > MaxBlocktableEntrySize calculation: > > > > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1); > > I don't get this expression. Making the nwords one byte works well? > With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256 > bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit > OS, respectively. One byte nwrods variable seems not to be sufficient I believe there is confusion between bitmap words and bytes: 2048 / 64 = 32 words = 256 bytes It used to be max tuples per (heap) page, but we wanted a simple way to make this independent of heap. I believe we won't need to ever store the actual MaxOffsetNumber, although we technically still could with a one-byte type and 32kB pages, at least on
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sun, Mar 17, 2024 at 11:46 AM John Naylor wrote: > > On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada wrote: > > > > On Fri, Mar 15, 2024 at 4:36 PM John Naylor wrote: > > > > > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada > > > wrote: > > > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e. > > > > overwrite) the value, I think we should not expect that found is > > > > always false. > > > > > > I find that a puzzling statement, since 1) it was designed for > > > insert-only workloads, not actual overwrite IIRC and 2) the tests will > > > now fail if the same block is set twice, since we just switched the > > > tests to use a remnant of vacuum's old array. Having said that, I > > > don't object to removing artificial barriers to using it for purposes > > > not yet imagined, as long as test_tidstore.sql warns against that. > > > > I think that if it supports only insert-only workload and expects the > > same block is set only once, it should raise an error rather than an > > assertion. It's odd to me that the function fails only with an > > assertion build assertions even though it actually works fine even in > > that case. > > After thinking some more, I think you're right -- it's too > heavy-handed to throw an error/assert and a public function shouldn't > make assumptions about the caller. It's probably just a matter of > documenting the function (and it's lack of generality), and the tests > (which are based on the thing we're replacing). Removed 'found' in 0003 patch. > > > As for test_tidstore you're right that the test code doesn't handle > > the case where setting the same block twice. I think that there is no > > problem in the fixed-TIDs tests, but we would need something for > > random-TIDs tests so that we don't set the same block twice. I guess > > it could be trivial since we can use SQL queries to generate TIDs. 
I'm > > not sure how the random-TIDs tests would be like, but I think we can > > use SELECT DISTINCT to eliminate the duplicates of block numbers to > > use. > > Also, I don't think we need random blocks, since the radix tree tests > excercise that heavily already. > > Random offsets is what I was thinking of (if made distinct and > ordered), but even there the code is fairy trivial, so I don't have a > strong feeling about it. Agreed. > > > > Given the above two things, I think this function's comment needs > > > stronger language about its limitations. Perhaps even mention that > > > it's intended for, and optimized for, vacuum. You and I have long > > > known that tidstore would need a separate, more complex, function to > > > add or remove individual tids from existing entries, but it might be > > > good to have that documented. > > > > Agreed. > > How about this: > > /* > - * Set the given TIDs on the blkno to TidStore. > + * Create or replace an entry for the given block and array of offsets > * > - * NB: the offset numbers in offsets must be sorted in ascending order. > + * NB: This function is designed and optimized for vacuum's heap scanning > + * phase, so has some limitations: > + * - The offset numbers in "offsets" must be sorted in ascending order. > + * - If the block number already exists, the entry will be replaced -- > + * there is no way to add or remove offsets from an entry. > */ > void > TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber > *offsets, Looks good. > > I think we can stop including the debug-tid-store patch for CI now. > That would allow getting rid of some unnecessary variables. Agreed. > > + * Prepare to iterate through a TidStore. Since the radix tree is locked > during > + * the iteration, TidStoreEndIterate() needs to be called when finished. > > + * Concurrent updates during the iteration will be blocked when inserting a > + * key-value to the radix tree. > > This is outdated. Locking is optional. 
The remaining real reason now > is that TidStoreEndIterate needs to free memory. We probably need to > say something about locking, too, but not this. Fixed. > > + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs > + * in one block. We return the block numbers in ascending order and the > offset > + * numbers in each result is also sorted in ascending order. > + */ > +TidStoreIterResult * > +TidStoreIterateNext(TidStoreIter *iter) > > The wording is a bit awkward. Fixed. > > +/* > + * Finish an iteration over TidStore. This needs to be called after finishing > + * or when existing an iteration. > + */ > > s/existing/exiting/ ? > > It seems to say we need to finish after finishing. Maybe more precise wording. Fixed. > > +/* Extract TIDs from the given key-value pair */ > +static void > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, > BlocktableEntry *page) > > This is a leftover from the old encoding scheme. This should really > take a "BlockNumber blockno" not a "key", and the only call site > should probably cast the uint64 to BlockNumbe
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada wrote: > > On Fri, Mar 15, 2024 at 4:36 PM John Naylor wrote: > > > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada > > wrote: > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e. > > > overwrite) the value, I think we should not expect that found is > > > always false. > > > > I find that a puzzling statement, since 1) it was designed for > > insert-only workloads, not actual overwrite IIRC and 2) the tests will > > now fail if the same block is set twice, since we just switched the > > tests to use a remnant of vacuum's old array. Having said that, I > > don't object to removing artificial barriers to using it for purposes > > not yet imagined, as long as test_tidstore.sql warns against that. > > I think that if it supports only insert-only workload and expects the > same block is set only once, it should raise an error rather than an > assertion. It's odd to me that the function fails only with an > assertion build assertions even though it actually works fine even in > that case. After thinking some more, I think you're right -- it's too heavy-handed to throw an error/assert and a public function shouldn't make assumptions about the caller. It's probably just a matter of documenting the function (and its lack of generality), and the tests (which are based on the thing we're replacing). > As for test_tidstore you're right that the test code doesn't handle > the case where setting the same block twice. I think that there is no > problem in the fixed-TIDs tests, but we would need something for > random-TIDs tests so that we don't set the same block twice. I guess > it could be trivial since we can use SQL queries to generate TIDs. I'm > not sure how the random-TIDs tests would be like, but I think we can > use SELECT DISTINCT to eliminate the duplicates of block numbers to > use. Also, I don't think we need random blocks, since the radix tree tests exercise that heavily already. 
Random offsets are what I was thinking of (if made distinct and ordered), but even there the code is fairly trivial, so I don't have a strong feeling about it. > > Given the above two things, I think this function's comment needs > stronger language about its limitations. Perhaps even mention that > it's intended for, and optimized for, vacuum. You and I have long > known that tidstore would need a separate, more complex, function to > add or remove individual tids from existing entries, but it might be > good to have that documented. > > Agreed. How about this: /* - * Set the given TIDs on the blkno to TidStore. + * Create or replace an entry for the given block and array of offsets * - * NB: the offset numbers in offsets must be sorted in ascending order. + * NB: This function is designed and optimized for vacuum's heap scanning + * phase, so has some limitations: + * - The offset numbers in "offsets" must be sorted in ascending order. + * - If the block number already exists, the entry will be replaced -- + * there is no way to add or remove offsets from an entry. */ void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, I think we can stop including the debug-tid-store patch for CI now. That would allow getting rid of some unnecessary variables. More comments: + * Prepare to iterate through a TidStore. Since the radix tree is locked during + * the iteration, TidStoreEndIterate() needs to be called when finished. + * Concurrent updates during the iteration will be blocked when inserting a + * key-value to the radix tree. This is outdated. Locking is optional. The remaining real reason now is that TidStoreEndIterate needs to free memory. We probably need to say something about locking, too, but not this. + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs + * in one block. We return the block numbers in ascending order and the offset + * numbers in each result is also sorted in ascending order. 
+ */ +TidStoreIterResult * +TidStoreIterateNext(TidStoreIter *iter) The wording is a bit awkward. +/* + * Finish an iteration over TidStore. This needs to be called after finishing + * or when existing an iteration. + */ s/existing/exiting/ ? It seems to say we need to finish after finishing. Maybe more precise wording. +/* Extract TIDs from the given key-value pair */ +static void +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, BlocktableEntry *page) This is a leftover from the old encoding scheme. This should really take a "BlockNumber blockno" not a "key", and the only call site should probably cast the uint64 to BlockNumber. + * tidstore.h + * Tid storage. + * + * + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group Update year. +typedef struct BlocktableEntry +{ + uint16 nwords; + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; +} BlocktableEntry; In my WIP for runtime-embeddable offsets, nwords needs to be one byte. That doesn't have any
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 15, 2024 at 4:36 PM John Naylor wrote: > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada wrote: > > > > On Thu, Mar 14, 2024 at 6:55 PM John Naylor wrote: > > > > > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada > > > wrote: > > > > > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor > > > > wrote: > > > > > Okay, here's an another idea: Change test_lookup_tids() to be more > > > > > general and put the validation down into C as well. First we save the > > > > > blocks from do_set_block_offsets() into a table, then with all those > > > > > blocks lookup a sufficiently-large range of possible offsets and save > > > > > found values in another array. So the static items structure would > > > > > have 3 arrays: inserts, successful lookups, and iteration (currently > > > > > the iteration output is private to check_set_block_offsets(). Then > > > > > sort as needed and check they are all the same. > > > > > > > > That's a promising idea. We can use the same mechanism for randomized > > > > tests too. If you're going to work on this, I'll do other tests on my > > > > environment in the meantime. > > > > > > Some progress on this in v72 -- I tried first without using SQL to > > > save the blocks, just using the unique blocks from the verification > > > array. It seems to work fine. > > > > Thanks! > > Seems I forgot the attachment last time...there's more stuff now > anyway, based on discussion. Thank you for updating the patches! The idea of using three TID arrays for the lookup test and iteration test looks good to me. I think we can add random-TIDs tests on top of it. > > > > - Since there are now three arrays we should reduce max bytes to > > > something smaller. > > > > Agreed. > > I went further than this, see below. > > > > - Further on that, I'm not sure if the "is full" test is telling us > > > much. It seems we could make max bytes a static variable and set it to > > > the size of the empty store. 
I'm guessing it wouldn't take much to add > > > enough tids so that the contexts need to allocate some blocks, and > > > then it would appear full and we can test that. I've made it so all > > > arrays repalloc when needed, just in case. > > > > How about using work_mem as max_bytes instead of having it as a static > > variable? In test_tidstore.sql we set work_mem before creating the > > tidstore. It would make the tidstore more controllable by SQL queries. > > My complaint is that the "is full" test is trivial, and also strange > in that max_bytes is used for two unrelated things: > > - the initial size of the verification arrays, which was always larger > than necessary, and now there are three of them > - the hint to TidStoreCreate to calculate its max block size / the > threshold for being "full" > > To make the "is_full" test slightly less trivial, my idea is to save > the empty store size and later add enough tids so that it has to > allocate new blocks/DSA segments, which is not that many, and then it > will appear full. I've done this and also separated the purpose of > various sizes in v72-0009/10. I see your point and the changes look good to me. > Using actual work_mem seems a bit more difficult to make this work. Agreed. > > > > --- > > + if (TidStoreIsShared(ts)) > > + found = shared_rt_set(ts->tree.shared, blkno, page); > > + else > > + found = local_rt_set(ts->tree.local, blkno, page); > > + > > + Assert(!found); > > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e. > > overwrite) the value, I think we should not expect that found is > > always false. > > I find that a puzzling statement, since 1) it was designed for > insert-only workloads, not actual overwrite IIRC and 2) the tests will > now fail if the same block is set twice, since we just switched the > tests to use a remnant of vacuum's old array. 
Having said that, I > don't object to removing artificial barriers to using it for purposes > not yet imagined, as long as test_tidstore.sql warns against that. I think that if it supports only an insert-only workload and expects the same block to be set only once, it should raise an error rather than an assertion. It's odd to me that the function fails only in an assertion-enabled build even though it actually works fine in that case. As for test_tidstore, you're right that the test code doesn't handle the case where the same block is set twice. I think that there is no problem in the fixed-TIDs tests, but we would need something for random-TIDs tests so that we don't set the same block twice. I guess it could be trivial since we can use SQL queries to generate TIDs. I'm not sure what the random-TIDs tests would look like, but I think we can use SELECT DISTINCT to eliminate duplicate block numbers. > > Given the above two things, I think this function's comment needs > stronger language about its limitations. Perhaps even mention that > it's intended for, and optimized for, vacuum. You and I have long > kno
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada wrote: > > On Thu, Mar 14, 2024 at 6:55 PM John Naylor wrote: > > > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada > > wrote: > > > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor > > > wrote: > > > > Okay, here's an another idea: Change test_lookup_tids() to be more > > > > general and put the validation down into C as well. First we save the > > > > blocks from do_set_block_offsets() into a table, then with all those > > > > blocks lookup a sufficiently-large range of possible offsets and save > > > > found values in another array. So the static items structure would > > > > have 3 arrays: inserts, successful lookups, and iteration (currently > > > > the iteration output is private to check_set_block_offsets(). Then > > > > sort as needed and check they are all the same. > > > > > > That's a promising idea. We can use the same mechanism for randomized > > > tests too. If you're going to work on this, I'll do other tests on my > > > environment in the meantime. > > > > Some progress on this in v72 -- I tried first without using SQL to > > save the blocks, just using the unique blocks from the verification > > array. It seems to work fine. > > Thanks! Seems I forgot the attachment last time...there's more stuff now anyway, based on discussion. > > - Since there are now three arrays we should reduce max bytes to > > something smaller. > > Agreed. I went further than this, see below. > > - Further on that, I'm not sure if the "is full" test is telling us > > much. It seems we could make max bytes a static variable and set it to > > the size of the empty store. I'm guessing it wouldn't take much to add > > enough tids so that the contexts need to allocate some blocks, and > > then it would appear full and we can test that. I've made it so all > > arrays repalloc when needed, just in case. > > How about using work_mem as max_bytes instead of having it as a static > variable? 
In test_tidstore.sql we set work_mem before creating the > tidstore. It would make the tidstore more controllable by SQL queries. My complaint is that the "is full" test is trivial, and also strange in that max_bytes is used for two unrelated things: - the initial size of the verification arrays, which was always larger than necessary, and now there are three of them - the hint to TidStoreCreate to calculate its max block size / the threshold for being "full" To make the "is_full" test slightly less trivial, my idea is to save the empty store size and later add enough tids so that it has to allocate new blocks/DSA segments, which is not that many, and then it will appear full. I've done this and also separated the purpose of various sizes in v72-0009/10. Using actual work_mem seems a bit more difficult to make this work. > > - I'm not sure it's useful to keep test_lookup_tids() around. Since we > > now have a separate lookup test, the only thing it can tell us is that > > lookups fail on an empty store. I arranged it so that > > check_set_block_offsets() works on an empty store. Although that's > > even more trivial, it's just reusing what we already need. > > Agreed. Removed in v72-0007 On Fri, Mar 15, 2024 at 9:49 AM Masahiko Sawada wrote: > > I have two questions on tidstore.c: > > +/* > + * Set the given TIDs on the blkno to TidStore. > + * > + * NB: the offset numbers in offsets must be sorted in ascending order. > + */ > > Do we need some assertions to check if the given offset numbers are > sorted expectedly? Done in v72-0008 > --- > + if (TidStoreIsShared(ts)) > + found = shared_rt_set(ts->tree.shared, blkno, page); > + else > + found = local_rt_set(ts->tree.local, blkno, page); > + > + Assert(!found); > > Given TidStoreSetBlockOffsets() is designed to always set (i.e. > overwrite) the value, I think we should not expect that found is > always false. 
I find that a puzzling statement, since 1) it was designed for insert-only workloads, not actual overwrite IIRC and 2) the tests will now fail if the same block is set twice, since we just switched the tests to use a remnant of vacuum's old array. Having said that, I don't object to removing artificial barriers to using it for purposes not yet imagined, as long as test_tidstore.sql warns against that. Given the above two things, I think this function's comment needs stronger language about its limitations. Perhaps even mention that it's intended for, and optimized for, vacuum. You and I have long known that tidstore would need a separate, more complex, function to add or remove individual tids from existing entries, but it might be good to have that documented. Other things: v72-0011: Test that zero offset raises an error. v72-0013: I had wanted to microbenchmark this, but since we are running short of time I decided to skip that, so I want to revert some code to make it again more similar to the equivalent in tidbitmap.c. In the absence of evidence, it seems better to do it this way. v72
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 9:03 PM Masahiko Sawada wrote: > > On Thu, Mar 14, 2024 at 6:55 PM John Naylor wrote: > > > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada > > wrote: > > > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor > > > wrote: > > > > Okay, here's an another idea: Change test_lookup_tids() to be more > > > > general and put the validation down into C as well. First we save the > > > > blocks from do_set_block_offsets() into a table, then with all those > > > > blocks lookup a sufficiently-large range of possible offsets and save > > > > found values in another array. So the static items structure would > > > > have 3 arrays: inserts, successful lookups, and iteration (currently > > > > the iteration output is private to check_set_block_offsets(). Then > > > > sort as needed and check they are all the same. > > > > > > That's a promising idea. We can use the same mechanism for randomized > > > tests too. If you're going to work on this, I'll do other tests on my > > > environment in the meantime. > > > > Some progress on this in v72 -- I tried first without using SQL to > > save the blocks, just using the unique blocks from the verification > > array. It seems to work fine. > > Thanks! > > > > > - Since there are now three arrays we should reduce max bytes to > > something smaller. > > Agreed. > > > - Further on that, I'm not sure if the "is full" test is telling us > > much. It seems we could make max bytes a static variable and set it to > > the size of the empty store. I'm guessing it wouldn't take much to add > > enough tids so that the contexts need to allocate some blocks, and > > then it would appear full and we can test that. I've made it so all > > arrays repalloc when needed, just in case. > > How about using work_mem as max_bytes instead of having it as a static > variable? In test_tidstore.sql we set work_mem before creating the > tidstore. It would make the tidstore more controllable by SQL queries. 
> > > - Why are we switching to TopMemoryContext? It's not explained -- the > > comment only tells what the code is doing (which is obvious), but not > > why. > > This is because the tidstore needs to live across the transaction > boundary. We can use TopMemoryContext or CacheMemoryContext. > > > - I'm not sure it's useful to keep test_lookup_tids() around. Since we > > now have a separate lookup test, the only thing it can tell us is that > > lookups fail on an empty store. I arranged it so that > > check_set_block_offsets() works on an empty store. Although that's > > even more trivial, it's just reusing what we already need. > > Agreed. > I have two questions on tidstore.c: +/* + * Set the given TIDs on the blkno to TidStore. + * + * NB: the offset numbers in offsets must be sorted in ascending order. + */ Do we need some assertions to check if the given offset numbers are sorted as expected? --- + if (TidStoreIsShared(ts)) + found = shared_rt_set(ts->tree.shared, blkno, page); + else + found = local_rt_set(ts->tree.local, blkno, page); + + Assert(!found); Given TidStoreSetBlockOffsets() is designed to always set (i.e. overwrite) the value, I think we should not expect that found is always false. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 6:55 PM John Naylor wrote: > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada > wrote: > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor wrote: > > > Okay, here's an another idea: Change test_lookup_tids() to be more > > > general and put the validation down into C as well. First we save the > > > blocks from do_set_block_offsets() into a table, then with all those > > > blocks lookup a sufficiently-large range of possible offsets and save > > > found values in another array. So the static items structure would > > > have 3 arrays: inserts, successful lookups, and iteration (currently > > > the iteration output is private to check_set_block_offsets(). Then > > > sort as needed and check they are all the same. > > > > That's a promising idea. We can use the same mechanism for randomized > > tests too. If you're going to work on this, I'll do other tests on my > > environment in the meantime. > > Some progress on this in v72 -- I tried first without using SQL to > save the blocks, just using the unique blocks from the verification > array. It seems to work fine. Thanks! > > - Since there are now three arrays we should reduce max bytes to > something smaller. Agreed. > - Further on that, I'm not sure if the "is full" test is telling us > much. It seems we could make max bytes a static variable and set it to > the size of the empty store. I'm guessing it wouldn't take much to add > enough tids so that the contexts need to allocate some blocks, and > then it would appear full and we can test that. I've made it so all > arrays repalloc when needed, just in case. How about using work_mem as max_bytes instead of having it as a static variable? In test_tidstore.sql we set work_mem before creating the tidstore. It would make the tidstore more controllable by SQL queries. > - Why are we switching to TopMemoryContext? It's not explained -- the > comment only tells what the code is doing (which is obvious), but not > why. 
This is because the tidstore needs to live across the transaction boundary. We can use TopMemoryContext or CacheMemoryContext. > - I'm not sure it's useful to keep test_lookup_tids() around. Since we > now have a separate lookup test, the only thing it can tell us is that > lookups fail on an empty store. I arranged it so that > check_set_block_offsets() works on an empty store. Although that's > even more trivial, it's just reusing what we already need. Agreed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada wrote: > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor wrote: > > Okay, here's an another idea: Change test_lookup_tids() to be more > > general and put the validation down into C as well. First we save the > > blocks from do_set_block_offsets() into a table, then with all those > > blocks lookup a sufficiently-large range of possible offsets and save > > found values in another array. So the static items structure would > > have 3 arrays: inserts, successful lookups, and iteration (currently > > the iteration output is private to check_set_block_offsets(). Then > > sort as needed and check they are all the same. > > That's a promising idea. We can use the same mechanism for randomized > tests too. If you're going to work on this, I'll do other tests on my > environment in the meantime. Some progress on this in v72 -- I tried first without using SQL to save the blocks, just using the unique blocks from the verification array. It seems to work fine. Some open questions on the test module: - Since there are now three arrays we should reduce max bytes to something smaller. - Further on that, I'm not sure if the "is full" test is telling us much. It seems we could make max bytes a static variable and set it to the size of the empty store. I'm guessing it wouldn't take much to add enough tids so that the contexts need to allocate some blocks, and then it would appear full and we can test that. I've made it so all arrays repalloc when needed, just in case. - Why are we switching to TopMemoryContext? It's not explained -- the comment only tells what the code is doing (which is obvious), but not why. - I'm not sure it's useful to keep test_lookup_tids() around. Since we now have a separate lookup test, the only thing it can tell us is that lookups fail on an empty store. I arranged it so that check_set_block_offsets() works on an empty store. Although that's even more trivial, it's just reusing what we already need.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 1:29 PM John Naylor wrote: > > On Thu, Mar 14, 2024 at 8:53 AM Masahiko Sawada wrote: > > > > On Thu, Mar 14, 2024 at 9:59 AM John Naylor wrote: > > > > BTW do we still want to test the tidstore by using a combination of > > > > SQL functions? We might no longer need to input TIDs via a SQL > > > > function. > > > > > > I'm not sure. I stopped short of doing that to get feedback on this > > > much. One advantage with SQL functions is we can use generate_series > > > to easily input lists of blocks with different numbers and strides, > > > and array literals for offsets are a bit easier. What do you think? > > > > While I'm not a fan of the following part, I agree that it makes sense > > to use SQL functions for test data generation: > > > > -- Constant values used in the tests. > > \set maxblkno 4294967295 > > -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is > > 291. > > -- We use a higher number to test tidstore. > > \set maxoffset 512 > > I'm not really a fan of these either, and could be removed a some > point if we've done everything else nicely. > > > It would also be easier for developers to test the tidstore with their > > own data set. So I agreed with the current approach; use SQL functions > > for data generation and do the actual tests inside C functions. > > Okay, here's an another idea: Change test_lookup_tids() to be more > general and put the validation down into C as well. First we save the > blocks from do_set_block_offsets() into a table, then with all those > blocks lookup a sufficiently-large range of possible offsets and save > found values in another array. So the static items structure would > have 3 arrays: inserts, successful lookups, and iteration (currently > the iteration output is private to check_set_block_offsets(). Then > sort as needed and check they are all the same. That's a promising idea. We can use the same mechanism for randomized tests too. 
If you're going to work on this, I'll do other tests on my environment in the meantime. > > Further thought: We may not really need to test block numbers that > vigorously, since the radix tree tests should cover keys/values pretty > well. Agreed. Probably boundary block numbers: 0, 1, MaxBlockNumber - 1, and MaxBlockNumber, would be sufficient. > The difference here is using bitmaps of tids and that should be > well covered. Right. We would need to test offset numbers vigorously instead. > > Locally (not CI), we should try big inputs to make sure we can > actually go up to many GB -- it's easier and faster this way than > having vacuum give us a large data set. I'll do these tests. > > > Is it > > convenient for developers if we have functions like generate_tids() > > and generate_random_tids() to generate TIDs so that they can pass them > > to do_set_block_offsets()? > > I guess I don't see the advantage of adding a layer of indirection at > this point, but it could be useful at a later time. Agreed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 8:53 AM Masahiko Sawada wrote: > > On Thu, Mar 14, 2024 at 9:59 AM John Naylor wrote: > > > BTW do we still want to test the tidstore by using a combination of > > > SQL functions? We might no longer need to input TIDs via a SQL > > > function. > > > > I'm not sure. I stopped short of doing that to get feedback on this > > much. One advantage with SQL functions is we can use generate_series > > to easily input lists of blocks with different numbers and strides, > > and array literals for offsets are a bit easier. What do you think? > > While I'm not a fan of the following part, I agree that it makes sense > to use SQL functions for test data generation: > > -- Constant values used in the tests. > \set maxblkno 4294967295 > -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is > 291. > -- We use a higher number to test tidstore. > \set maxoffset 512 I'm not really a fan of these either, and they could be removed at some point if we've done everything else nicely. > It would also be easier for developers to test the tidstore with their > own data set. So I agreed with the current approach; use SQL functions > for data generation and do the actual tests inside C functions. Okay, here's another idea: Change test_lookup_tids() to be more general and put the validation down into C as well. First we save the blocks from do_set_block_offsets() into a table, then with all those blocks lookup a sufficiently-large range of possible offsets and save found values in another array. So the static items structure would have 3 arrays: inserts, successful lookups, and iteration (currently the iteration output is private to check_set_block_offsets()). Then sort as needed and check they are all the same. Further thought: We may not really need to test block numbers that vigorously, since the radix tree tests should cover keys/values pretty well. The difference here is using bitmaps of tids and that should be well covered.
Locally (not CI), we should try big inputs to make sure we can actually go up to many GB -- it's easier and faster this way than having vacuum give us a large data set. > Is it > convenient for developers if we have functions like generate_tids() > and generate_random_tids() to generate TIDs so that they can pass them > to do_set_block_offsets()? I guess I don't see the advantage of adding a layer of indirection at this point, but it could be useful at a later time.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 9:59 AM John Naylor wrote: > > On Wed, Mar 13, 2024 at 9:29 PM Masahiko Sawada wrote: > > > > On Wed, Mar 13, 2024 at 8:05 PM John Naylor wrote: > > > > > > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada > > > wrote: > > > > > > > As I mentioned above, if we implement the test cases in C, we can use > > > > the debug-build array in the test code. And we won't use it in AND/OR > > > > operations tests in the future. > > > > > > That's a really interesting idea, so I went ahead and tried that for > > > v71. This seems like a good basis for testing larger, randomized > > > inputs, once we decide how best to hide that from the expected output. > > > The tests use SQL functions do_set_block_offsets() and > > > check_set_block_offsets(). The latter does two checks against a tid > > > array, and replaces test_dump_tids(). > > > > Great! I think that's a very good starter. > > > > The lookup_test() (and test_lookup_tids()) do also test that the > > IsMember() function returns false as expected if the TID doesn't exist > > in it, and probably we can do these tests in a C function too. > > > > BTW do we still want to test the tidstore by using a combination of > > SQL functions? We might no longer need to input TIDs via a SQL > > function. > > I'm not sure. I stopped short of doing that to get feedback on this > much. One advantage with SQL functions is we can use generate_series > to easily input lists of blocks with different numbers and strides, > and array literals for offsets are a bit easier. What do you think? While I'm not a fan of the following part, I agree that it makes sense to use SQL functions for test data generation: -- Constant values used in the tests. \set maxblkno 4294967295 -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 291. -- We use a higher number to test tidstore. \set maxoffset 512 It would also be easier for developers to test the tidstore with their own data set. 
So I agreed with the current approach; use SQL functions for data generation and do the actual tests inside C functions. Is it convenient for developers if we have functions like generate_tids() and generate_random_tids() to generate TIDs so that they can pass them to do_set_block_offsets()? Then they call check_set_block_offsets() and others for actual data lookup and iteration tests. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
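For reference, the "291" in the quoted test comment falls out of the heap page-layout arithmetic. The sketch below restates PostgreSQL's default constants (8192-byte blocks, 24-byte page header, 23-byte heap tuple header, 4-byte line pointer, 8-byte MAXALIGN) as assumptions rather than pulling them from the real headers, so treat it as an illustration of where the number comes from, not as the actual macro:

```c
/* Assumed defaults; the real values come from pg_config.h and htup_details.h. */
#define BLCKSZ 8192                   /* default heap block size */
#define SIZE_OF_PAGE_HEADER 24        /* SizeOfPageHeaderData */
#define SIZE_OF_HEAP_TUPLE_HEADER 23  /* SizeofHeapTupleHeader */
#define SIZE_OF_ITEM_ID 4             /* sizeof(ItemIdData), one line pointer */
#define MAXALIGN(x) (((x) + 7) & ~7)  /* round up to 8-byte alignment */

/* Mirrors the shape of the MaxHeapTuplesPerPage macro: usable page space
 * divided by the smallest possible (aligned tuple header + line pointer). */
static int
max_heap_tuples_per_page(void)
{
    return (BLCKSZ - SIZE_OF_PAGE_HEADER) /
           (MAXALIGN(SIZE_OF_HEAP_TUPLE_HEADER) + SIZE_OF_ITEM_ID);
}
```

With these defaults, (8192 - 24) / (24 + 4) truncates to 291, which is why `\set maxoffset 512` deliberately exceeds anything a real 8kB heap page could hold.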
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 13, 2024 at 9:29 PM Masahiko Sawada wrote: > > On Wed, Mar 13, 2024 at 8:05 PM John Naylor wrote: > > > > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada > > wrote: > > > > > As I mentioned above, if we implement the test cases in C, we can use > > > the debug-build array in the test code. And we won't use it in AND/OR > > > operations tests in the future. > > > > That's a really interesting idea, so I went ahead and tried that for > > v71. This seems like a good basis for testing larger, randomized > > inputs, once we decide how best to hide that from the expected output. > > The tests use SQL functions do_set_block_offsets() and > > check_set_block_offsets(). The latter does two checks against a tid > > array, and replaces test_dump_tids(). > > Great! I think that's a very good starter. > > The lookup_test() (and test_lookup_tids()) do also test that the > IsMember() function returns false as expected if the TID doesn't exist > in it, and probably we can do these tests in a C function too. > > BTW do we still want to test the tidstore by using a combination of > SQL functions? We might no longer need to input TIDs via a SQL > function. I'm not sure. I stopped short of doing that to get feedback on this much. One advantage with SQL functions is we can use generate_series to easily input lists of blocks with different numbers and strides, and array literals for offsets are a bit easier. What do you think?
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 13, 2024 at 8:05 PM John Naylor wrote: > > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada wrote: > > > As I mentioned above, if we implement the test cases in C, we can use > > the debug-build array in the test code. And we won't use it in AND/OR > > operations tests in the future. > > That's a really interesting idea, so I went ahead and tried that for > v71. This seems like a good basis for testing larger, randomized > inputs, once we decide how best to hide that from the expected output. > The tests use SQL functions do_set_block_offsets() and > check_set_block_offsets(). The latter does two checks against a tid > array, and replaces test_dump_tids(). Great! I think that's a very good starter. The lookup_test() (and test_lookup_tids()) do also test that the IsMember() function returns false as expected if the TID doesn't exist in it, and probably we can do these tests in a C function too. BTW do we still want to test the tidstore by using a combination of SQL functions? We might no longer need to input TIDs via a SQL function. > Funnily enough, the debug array > itself gave false failures when using a similar array in the test > harness, because it didn't know all the places where the array should > have been sorted -- it only worked by chance before because of what > order things were done. Good catch, thanks. > I squashed everything from v70 and also took the liberty of switching > on shared memory for tid store tests. The only reason we didn't do > this with the radix tree tests is that the static attach/detach > functions would raise warnings since they are not used. Agreed to test the tidstore on shared memory. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada wrote: > As I mentioned above, if we implement the test cases in C, we can use > the debug-build array in the test code. And we won't use it in AND/OR > operations tests in the future. That's a really interesting idea, so I went ahead and tried that for v71. This seems like a good basis for testing larger, randomized inputs, once we decide how best to hide that from the expected output. The tests use SQL functions do_set_block_offsets() and check_set_block_offsets(). The latter does two checks against a tid array, and replaces test_dump_tids(). Funnily enough, the debug array itself gave false failures when using a similar array in the test harness, because it didn't know all the places where the array should have been sorted -- it only worked by chance before because of what order things were done. I squashed everything from v70 and also took the liberty of switching on shared memory for tid store tests. The only reason we didn't do this with the radix tree tests is that the static attach/detach functions would raise warnings since they are not used. From 9b6b517b9f9977999034d16b4dc33e91094ae7ba Mon Sep 17 00:00:00 2001 From: Masahiko Sawada Date: Tue, 12 Dec 2023 22:36:24 +0900 Subject: [PATCH v71 2/6] DEV: Debug TidStore. 
--- src/backend/access/common/tidstore.c | 203 ++- 1 file changed, 201 insertions(+), 2 deletions(-) diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c index b725b62d4c..33753d8ed2 100644 --- a/src/backend/access/common/tidstore.c +++ b/src/backend/access/common/tidstore.c @@ -29,6 +29,11 @@ #include "utils/dsa.h" #include "utils/memutils.h" +/* Enable TidStore debugging if USE_ASSERT_CHECKING */ +#ifdef USE_ASSERT_CHECKING +#define TIDSTORE_DEBUG +#include "catalog/index.h" +#endif #define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) #define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) @@ -88,6 +93,13 @@ struct TidStore /* DSA area for TidStore if using shared memory */ dsa_area *area; + +#ifdef TIDSTORE_DEBUG + ItemPointerData *tids; + int64 max_tids; + int64 num_tids; + bool tids_unordered; +#endif }; #define TidStoreIsShared(ts) ((ts)->area != NULL) @@ -105,11 +117,25 @@ struct TidStoreIter /* output for the caller */ TidStoreIterResult output; + +#ifdef TIDSTORE_DEBUG + /* iterator index for the ts->tids array */ + int64 tids_idx; +#endif }; static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, BlocktableEntry *page); +/* debug functions available only when TIDSTORE_DEBUG */ +#ifdef TIDSTORE_DEBUG +static void ts_debug_set_block_offsets(TidStore *ts, BlockNumber blkno, + OffsetNumber *offsets, int num_offsets); +static void ts_debug_iter_check_tids(TidStoreIter *iter); +static bool ts_debug_is_member(TidStore *ts, ItemPointer tid); +static int itemptr_cmp(const void *left, const void *right); +#endif + /* * Create a TidStore. The TidStore will live in the memory context that is * CurrentMemoryContext at the time of this call. 
The TID storage, backed @@ -154,6 +180,17 @@ TidStoreCreate(size_t max_bytes, dsa_area *area) else ts->tree.local = local_rt_create(ts->rt_context); +#ifdef TIDSTORE_DEBUG + { + int64 max_tids = max_bytes / sizeof(ItemPointerData); + + ts->tids = palloc(sizeof(ItemPointerData) * max_tids); + ts->max_tids = max_tids; + ts->num_tids = 0; + ts->tids_unordered = false; + } +#endif + return ts; } @@ -191,6 +228,7 @@ TidStoreDetach(TidStore *ts) Assert(TidStoreIsShared(ts)); shared_rt_detach(ts->tree.shared); + pfree(ts); } @@ -241,6 +279,11 @@ TidStoreDestroy(TidStore *ts) MemoryContextDelete(ts->rt_context); +#ifdef TIDSTORE_DEBUG + if (!TidStoreIsShared(ts)) + pfree(ts->tids); +#endif + pfree(ts); } @@ -297,6 +340,14 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, found = local_rt_set(ts->tree.local, blkno, page); Assert(!found); + +#ifdef TIDSTORE_DEBUG + if (!TidStoreIsShared(ts)) + { + /* Insert TIDs into the TID array too */ + ts_debug_set_block_offsets(ts, blkno, offsets, num_offsets); + } +#endif } /* Return true if the given TID is present in the TidStore */ @@ -310,6 +361,13 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) OffsetNumber off = ItemPointerGetOffsetNumber(tid); bool ret; +#ifdef TIDSTORE_DEBUG + bool ret_debug = false; + + if (!TidStoreIsShared(ts)) + ret_debug = ts_debug_is_member(ts, tid); +#endif + if (TidStoreIsShared(ts)) page = shared_rt_find(ts->tree.shared, blk); else @@ -317,17 +375,29 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) /* no entry for the blk */ if (page == NULL) - return false; + { + ret = false; + goto done; + } wordnum = WORDNUM(off); bitnum = BITNUM(off); /* no bitmap for the off */ if (wordnum >= page->nwords) - return false; + { + ret = false; + goto done; + } ret = (page->words[wordnum] & ((bi
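The membership test quoted (and truncated) in the diff above reduces to a word/bit split over a per-block offset bitmap, using the WORDNUM/BITNUM macros from earlier in the patch. A minimal standalone sketch, assuming a 64-bit bitmapword and names modeled on the patch rather than the actual tidstore code:

```c
#include <stdint.h>

typedef uint64_t bitmapword;
#define BITS_PER_BITMAPWORD 64
#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
#define BITNUM(x)  ((x) % BITS_PER_BITMAPWORD)
#define MAX_WORDS  8            /* enough for offsets up to 511 */

/* Simplified per-block entry: only as many words as the highest offset needs. */
typedef struct BlocktableEntry
{
    int        nwords;
    bitmapword words[MAX_WORDS];
} BlocktableEntry;

static void
entry_set_offset(BlocktableEntry *page, unsigned off)
{
    unsigned wordnum = WORDNUM(off);

    page->words[wordnum] |= (bitmapword) 1 << BITNUM(off);
    if (wordnum >= (unsigned) page->nwords)
        page->nwords = wordnum + 1;
}

static int
entry_is_member(const BlocktableEntry *page, unsigned off)
{
    unsigned wordnum = WORDNUM(off);

    /* no bitmap word stored for this offset */
    if (wordnum >= (unsigned) page->nwords)
        return 0;
    return (page->words[wordnum] & ((bitmapword) 1 << BITNUM(off))) != 0;
}
```

The early return on `wordnum >= nwords` is the same short-circuit the patch reworked into `goto done` so the debug check runs on every path.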
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 12, 2024 at 7:34 PM John Naylor wrote: > > On Mon, Mar 11, 2024 at 3:13 PM Masahiko Sawada wrote: > > > > On Mon, Mar 11, 2024 at 12:20 PM John Naylor > > wrote: > > > > > > On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada > > > wrote: > > > > + ts->context = CurrentMemoryContext; > > > > > > As far as I can tell, this member is never accessed again -- am I > > > missing something? > > > > You're right. It was used to re-create the tidstore in the same > > context again while resetting it, but we no longer support the reset > > API. Considering it again, would it be better to allocate the iterator > > struct in the same context as we store the tidstore struct? > > That makes sense. > > > > + /* DSA for tidstore will be detached at the end of session */ > > > > > > No other test module pins the mapping, but that doesn't necessarily > > > mean it's wrong. Is there some advantage over explicitly detaching? > > > > One small benefit of not explicitly detaching dsa_area in > > tidstore_destroy() would be simplicity; IIUC if we want to do that, we > > need to remember the dsa_area using (for example) a static variable, > > and free it if it's non-NULL. I've implemented this idea in the > > attached patch. > > Okay, I don't have a strong preference at this point. I'd keep the update on that. > > > > +-- Add tids in random order. > > > > > > I don't see any randomization here. I do remember adding row_number to > > > remove whitespace in the output, but I don't remember a random order. > > > On that subject, the row_number was an easy trick to avoid extra > > > whitespace, but maybe we should just teach the setting function to > > > return blocknumber rather than null? > > > > Good idea, fixed. > > + test_set_block_offsets > + > + 2147483647 > + 0 > + 4294967294 > + 1 > + 4294967295 > > Hmm, was the earlier comment about randomness referring to this? I'm > not sure what other regression tests do in these cases, or how > reliable this is.
If this is a problem we could simply insert this > result into a temp table so it's not output. I didn't address the comment about randomness. I think that we will have both random TIDs tests and fixed TIDs tests in test_tidstore as we discussed, and probably we can do both tests with similar steps; insert TIDs into both a temp table and tidstore and check if the tidstore returned the results as expected by comparing the results to the temp table. Probably we can have a common pl/pgsql function that checks that and raises a WARNING or an ERROR. Given that this is very similar to what we did in test_radixtree, why do we really want to implement it using a pl/pgsql function? When we discussed it before, I found the current way makes sense. But given that we're adding more tests and will add more tests in the future, doing the tests in C will be more maintainable and faster. Also, I think we can do the debug-build array stuff in the test_tidstore code instead. > > > > +Datum > > > +tidstore_create(PG_FUNCTION_ARGS) > > > +{ > > > ... > > > + tidstore = TidStoreCreate(max_bytes, dsa); > > > > > > +Datum > > > +tidstore_set_block_offsets(PG_FUNCTION_ARGS) > > > +{ > > > > > > + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs); > > > > > > These names are too similar. Maybe the test module should do > > > s/tidstore_/test_/ or similar. > > > > Agreed. > > Mostly okay, although a couple look a bit generic now. I'll leave it > up to you if you want to tweak things. > > > > In general, the .sql file is still very hard-coded. Functions are > > > created that contain a VALUES statement. Maybe it's okay for now, but > > > wanted to mention it. Ideally, we'd have some randomized tests, > > > without having to display it. That could be in addition to (not > > > replacing) the small tests we have that display input. (see below) > > > > > > > Agreed to add randomized tests in addition to the existing tests. > > I'll try something tomorrow. > > > Sounds a good idea. 
In fact, if there are some bugs in tidstore, it's > > likely that even initdb would fail in practice. However, it's a very > > good idea that we can test the tidstore anyway with such a check > > without a debug-build array. > > > > Or as another idea, I wonder if we could keep the debug-build array in > > some form. For example, we use the array with the particular build > > flag and set a BF animal for that. That way, we can test the tidstore > > in more real cases. > > I think the purpose of a debug flag is to help developers catch > mistakes. I don't think it's quite useful enough for that. For one, it > has the same 1GB limitation as vacuum's current array. For another, > it'd be a terrible way to debug moving tidbitmap.c from its hash table > to use TID store -- AND/OR operations and lossy pages are pretty much > undoable with a copy of vacuum's array. Valid points. As I mentioned above, if we im
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 11, 2024 at 3:13 PM Masahiko Sawada wrote: > > On Mon, Mar 11, 2024 at 12:20 PM John Naylor wrote: > > > > On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada > > wrote: > > + ts->context = CurrentMemoryContext; > > > > As far as I can tell, this member is never accessed again -- am I > > missing something? > > You're right. It was used to re-create the tidstore in the same > context again while resetting it, but we no longer support the reset > API. Considering it again, would it be better to allocate the iterator > struct in the same context as we store the tidstore struct? That makes sense. > > + /* DSA for tidstore will be detached at the end of session */ > > > > No other test module pins the mapping, but that doesn't necessarily > > mean it's wrong. Is there some advantage over explicitly detaching? > > One small benefit of not explicitly detaching dsa_area in > tidstore_destroy() would be simplicity; IIUC if we want to do that, we > need to remember the dsa_area using (for example) a static variable, > and free it if it's non-NULL. I've implemented this idea in the > attached patch. Okay, I don't have a strong preference at this point. > > +-- Add tids in random order. > > > > I don't see any randomization here. I do remember adding row_number to > > remove whitespace in the output, but I don't remember a random order. > > On that subject, the row_number was an easy trick to avoid extra > > whitespace, but maybe we should just teach the setting function to > > return blocknumber rather than null? > > Good idea, fixed. + test_set_block_offsets + + 2147483647 + 0 + 4294967294 + 1 + 4294967295 Hmm, was the earlier comment about randomness referring to this? I'm not sure what other regression tests do in these cases, or how reliable this is. If this is a problem we could simply insert this result into a temp table so it's not output. > > +Datum > > +tidstore_create(PG_FUNCTION_ARGS) > > +{ > > ...
> > + tidstore = TidStoreCreate(max_bytes, dsa); > > > > +Datum > > +tidstore_set_block_offsets(PG_FUNCTION_ARGS) > > +{ > > > > + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs); > > > > These names are too similar. Maybe the test module should do > > s/tidstore_/test_/ or similar. > > Agreed. Mostly okay, although a couple look a bit generic now. I'll leave it up to you if you want to tweak things. > > In general, the .sql file is still very hard-coded. Functions are > > created that contain a VALUES statement. Maybe it's okay for now, but > > wanted to mention it. Ideally, we'd have some randomized tests, > > without having to display it. That could be in addition to (not > > replacing) the small tests we have that display input. (see below) > > > > Agreed to add randomized tests in addition to the existing tests. I'll try something tomorrow. > Sounds a good idea. In fact, if there are some bugs in tidstore, it's > likely that even initdb would fail in practice. However, it's a very > good idea that we can test the tidstore anyway with such a check > without a debug-build array. > > Or as another idea, I wonder if we could keep the debug-build array in > some form. For example, we use the array with the particular build > flag and set a BF animal for that. That way, we can test the tidstore > in more real cases. I think the purpose of a debug flag is to help developers catch mistakes. I don't think it's quite useful enough for that. For one, it has the same 1GB limitation as vacuum's current array. For another, it'd be a terrible way to debug moving tidbitmap.c from its hash table to use TID store -- AND/OR operations and lossy pages are pretty much undoable with a copy of vacuum's array. Last year, when I insisted on trying a long term realistic load that compares the result with the array, the encoding scheme was much harder to understand in code. I think it's now easier, and there are better tests. 
> In the latest (v69) patch: > > - squashed v68-0005 and v68-0006 patches. > - removed most of the changes in v68-0007 patch. > - addressed above review comments in v69-0002 patch. > - v69-0003, 0004, and 0005 are miscellaneous updates. > > As for renaming TidStore to TIDStore, I dropped the patch for now > since it seems we're using "Tid" in some function names and variable > names. If we want to update it, we can do that later. I think we're not consistent across the codebase, and it's fine to drop that patch. v70-0008: @@ -489,7 +489,7 @@ parallel_vacuum_reset_dead_items(ParallelVacuumState *pvs) /* * Free the current tidstore and return allocated DSA segments to the * operating system. Then we recreate the tidstore with the same max_bytes - * limitation. + * limitation we just used. Nowadays, max_bytes is more like a hint for tidstore, and not a limitation, right? Vacuum has the limitation. Maybe instead of "with", we should say "passing the same limitatio
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 11, 2024 at 5:13 PM Masahiko Sawada wrote: > > In the latest (v69) patch: > > - squashed v68-0005 and v68-0006 patches. > - removed most of the changes in v68-0007 patch. > - addressed above review comments in v69-0002 patch. > - v69-0003, 0004, and 0005 are miscellaneous updates. Since the v69 conflicts with the current HEAD, I've rebased them. In addition, v70-0008 is the new patch, which cleans up the vacuum integration patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v70-ART.tar.gz Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 11, 2024 at 12:20 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada wrote: > > > > I've attached the remaining patches for CI. I've made some minor > > changes in separate patches and drafted the commit message for > > tidstore patch. > > > > While reviewing the tidstore code, I thought that it would be more > > appropriate to place tidstore.c under src/backend/lib instead of > > src/backend/common/access since the tidstore is no longer implemented > > only for heap or other access methods, and it might also be used by > > executor nodes in the future. What do you think? > > That's a heck of a good question. I don't think src/backend/lib is > right -- it seems that's for general-purpose data structures. > Something like backend/utils is also too general. > src/backend/access/common has things for tuple descriptors, toast, > sessions, and I don't think tidstore is out of place here. I'm not > sure there's a better place, but I could be convinced otherwise. Yeah, I agreed that src/backend/lib seems not to be the place for tidstore. Let's keep it in src/backend/access/common. If others think differently, we can move it later. > > v68-0001: > > I'm not sure if commit messages are much a subject of review, and it's > up to the committer, but I'll share a couple comments just as > something to think about, not something I would ask you to change: I > think it's a bit distracting that the commit message talks about the > justification to use it for vacuum. Let's save that for the commit > with actual vacuum changes. Also, I suspect saying there are a "wide > range" of uses is over-selling it a bit, and that paragraph is a bit > awkward aside from that. Thank you for the comment, and I agreed. I've updated the commit message. > > + /* Collect TIDs extracted from the key-value pair */ > + result->num_offsets = 0; > + > > This comment has nothing at all to do with this line. 
If the comment > is for several lines following, some of which are separated by blank > lines, there should be a blank line after the comment. Also, why isn't > tidstore_iter_extract_tids() responsible for setting that to zero? Agreed, fixed. I also updated this part so we set result->blkno in tidstore_iter_extract_tids() too, which seems more readable. > > + ts->context = CurrentMemoryContext; > > As far as I can tell, this member is never accessed again -- am I > missing something? You're right. It was used to re-create the tidstore in the same context again while resetting it, but we no longer support the reset API. Considering it again, would it be better to allocate the iterator struct in the same context as we store the tidstore struct? > > + /* DSA for tidstore will be detached at the end of session */ > > No other test module pins the mapping, but that doesn't necessarily > mean it's wrong. Is there some advantage over explicitly detaching? One small benefit of not explicitly detaching dsa_area in tidstore_destroy() would be simplicity; IIUC if we want to do that, we need to remember the dsa_area using (for example) a static variable, and free it if it's non-NULL. I've implemented this idea in the attached patch. > > +-- Add tids in random order. > > I don't see any randomization here. I do remember adding row_number to > remove whitespace in the output, but I don't remember a random order. > On that subject, the row_number was an easy trick to avoid extra > whitespace, but maybe we should just teach the setting function to > return blocknumber rather than null? Good idea, fixed. > > +Datum > +tidstore_create(PG_FUNCTION_ARGS) > +{ > ... > + tidstore = TidStoreCreate(max_bytes, dsa); > > +Datum > +tidstore_set_block_offsets(PG_FUNCTION_ARGS) > +{ > > + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs); > > These names are too similar. Maybe the test module should do > s/tidstore_/test_/ or similar. Agreed. 
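The iteration change mentioned above, having tidstore_iter_extract_tids() fill in both the block number and the offsets, amounts to decoding one key-value pair: the radix tree key is the block number and the value is the offset bitmap. The standalone sketch below shows the bitmap-to-offsets half; a real implementation would scan set bits with something like pg_rightmost_one_pos64() rather than testing every bit, so this is illustration only:

```c
#include <stdint.h>

typedef uint64_t bitmapword;
#define BITS_PER_BITMAPWORD 64

/* Collect the offsets encoded in one block's bitmap into *offsets,
 * returning how many set bits were found (even past max_offsets). */
static int
extract_offsets(const bitmapword *words, int nwords,
                uint16_t *offsets, int max_offsets)
{
    int n = 0;

    for (int w = 0; w < nwords; w++)
        for (int b = 0; b < BITS_PER_BITMAPWORD; b++)
            if (words[w] & ((bitmapword) 1 << b))
            {
                if (n < max_offsets)
                    offsets[n] = (uint16_t) (w * BITS_PER_BITMAPWORD + b);
                n++;
            }
    return n;
}
```

Because each bit position maps back to exactly one offset, the iteration output comes out pre-sorted within a block, which is what lets the caller compare it against the sorted insert array.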
> > +/* Sanity check if we've called tidstore_create() */ > +static void > +check_tidstore_available(void) > +{ > + if (tidstore == NULL) > + elog(ERROR, "tidstore is not initialized"); > +} > > I don't find this very helpful. If a developer wiped out the create > call, wouldn't the test crash and burn pretty obviously? Removed. > > In general, the .sql file is still very hard-coded. Functions are > created that contain a VALUES statement. Maybe it's okay for now, but > wanted to mention it. Ideally, we'd have some randomized tests, > without having to display it. That could be in addition to (not > replacing) the small tests we have that display input. (see below) > Agreed to add randomized tests in addition to the existing tests. > > v68-0002: > > @@ -329,6 +381,13 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) > > ret = (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0; > > +#ifdef TIDSTORE_DEBUG > + if (!TidStoreIsShared(ts)) > + { > + bool ret_debug = ts_debug_is_member(ts, tid);; > + Assert(ret == ret_debug); > + } > +#endif > > This only checking the
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada wrote: > > On Thu, Feb 15, 2024 at 8:26 PM John Naylor wrote: > > v61-0007: Runtime-embeddable tids -- Optional for v17, but should > > reduce memory regressions, so should be considered. Up to 3 tids can > > be stored in the last level child pointer. It's not polished, but I'll > > only proceed with that if we think we need this. "flags" is called > > that because it could hold tidbitmap.c booleans (recheck, lossy) in > > the future, in addition to reserving space for the pointer tag. Note: > > I hacked the tests to only have 2 offsets per block to demo, but of > > course both paths should be tested. > > Interesting. I've run the same benchmark tests we did[1][2] (the > median of 3 runs): [found a big speed-up where we don't expect one] I tried to reproduce this (similar patch, but rebased on top of a bug you recently fixed (possibly related?) -- attached, and also shows one way to address some lack of coverage in the debug build, for as long as we test that with CI).
Fortunately I cannot see a difference, so I believe it's not affecting the case in this test at all, as expected: v68: INFO: finished vacuuming "john.public.test": index scans: 1 pages: 0 removed, 442478 remain, 88478 scanned (20.00% of total) tuples: 19995999 removed, 80003979 remain, 0 are dead but not yet removable removable cutoff: 770, which was 0 XIDs old when operation ended frozen: 0 pages from table (0.00% of total) had 0 tuples frozen index scan needed: 88478 pages from table (20.00% of total) had 19995999 dead item identifiers removed index "test_x_idx": pages: 274194 in total, 54822 newly deleted, 54822 currently deleted, 0 reusable avg read rate: 620.356 MB/s, avg write rate: 124.105 MB/s buffer usage: 758236 hits, 274196 misses, 54854 dirtied WAL usage: 2 records, 0 full page images, 425 bytes system usage: CPU: user: 3.74 s, system: 0.68 s, elapsed: 4.45 s system usage: CPU: user: 3.02 s, system: 0.42 s, elapsed: 3.47 s system usage: CPU: user: 3.09 s, system: 0.38 s, elapsed: 3.49 s system usage: CPU: user: 3.00 s, system: 0.43 s, elapsed: 3.45 s v68 + emb values (that cannot be used because > 3 tids per block): INFO: finished vacuuming "john.public.test": index scans: 1 pages: 0 removed, 442478 remain, 88478 scanned (20.00% of total) tuples: 19995999 removed, 80003979 remain, 0 are dead but not yet removable removable cutoff: 775, which was 0 XIDs old when operation ended frozen: 0 pages from table (0.00% of total) had 0 tuples frozen index scan needed: 88478 pages from table (20.00% of total) had 19995999 dead item identifiers removed index "test_x_idx": pages: 274194 in total, 54822 newly deleted, 54822 currently deleted, 0 reusable avg read rate: 570.808 MB/s, avg write rate: 114.192 MB/s buffer usage: 758236 hits, 274196 misses, 54854 dirtied WAL usage: 2 records, 0 full page images, 425 bytes system usage: CPU: user: 3.11 s, system: 0.62 s, elapsed: 3.75 s system usage: CPU: user: 3.04 s, system: 0.41 s, elapsed: 3.46 s system usage: CPU: user: 
3.05 s, system: 0.41 s, elapsed: 3.47 s system usage: CPU: user: 3.04 s, system: 0.43 s, elapsed: 3.49 s I'll continue polishing the runtime-embeddable values patch as time permits, for later consideration. WIP-3-embeddable-TIDs.patch.nocfbot Description: Binary data
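The runtime-embeddable TIDs idea relies on pointer tagging: an 8-byte-aligned child pointer has its low 3 bits free, so one bit can mark a word whose remaining bits carry a few small values inline instead of a pointer. The 3 x 16-bit layout below is a made-up encoding to show the mechanics, not the patch's actual format, and it assumes a 64-bit uintptr_t:

```c
#include <stdint.h>

/* Low bit set = this word embeds values; clear = it is a real pointer
 * (always aligned, so its low bits are naturally zero). */
#define EMBED_TAG 0x1u

static inline int
is_embedded(uintptr_t word)
{
    return (word & EMBED_TAG) != 0;
}

/* Pack three 16-bit offsets into the upper bits, leaving room for the tag. */
static inline uintptr_t
embed_offsets(uint16_t a, uint16_t b, uint16_t c)
{
    return ((uintptr_t) a << 48) | ((uintptr_t) b << 32) |
           ((uintptr_t) c << 16) | EMBED_TAG;
}

/* i = 0..2, in the order passed to embed_offsets(). */
static inline uint16_t
embedded_offset(uintptr_t word, int i)
{
    return (uint16_t) (word >> (48 - 16 * i));
}
```

This is why the quoted mail says the optimization "cannot be used" once a block has more than 3 TIDs: the tagged word overflows and a real child pointer must be allocated instead, so a test with many offsets per block exercises only the pointer path.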
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada wrote:
>
> I've attached the remaining patches for CI. I've made some minor
> changes in separate patches and drafted the commit message for the
> tidstore patch.
>
> While reviewing the tidstore code, I thought that it would be more
> appropriate to place tidstore.c under src/backend/lib instead of
> src/backend/access/common, since the tidstore is no longer implemented
> only for heap or other access methods, and it might also be used by
> executor nodes in the future. What do you think?

That's a heck of a good question. I don't think src/backend/lib is right -- it seems that's for general-purpose data structures. Something like backend/utils is also too general. src/backend/access/common has things for tuple descriptors, toast, sessions, and I don't think tidstore is out of place here. I'm not sure there's a better place, but I could be convinced otherwise.

v68-0001:

I'm not sure commit messages are much of a subject for review, and it's up to the committer, but I'll share a couple of comments just as something to think about, not something I would ask you to change: I think it's a bit distracting that the commit message talks about the justification for using it for vacuum. Let's save that for the commit with the actual vacuum changes. Also, I suspect saying there is a "wide range" of uses is over-selling it a bit, and that paragraph is a bit awkward aside from that.

+ /* Collect TIDs extracted from the key-value pair */
+ result->num_offsets = 0;
+

This comment has nothing at all to do with this line. If the comment is for several lines following, some of which are separated by blank lines, there should be a blank line after the comment. Also, why isn't tidstore_iter_extract_tids() responsible for setting that to zero?

+ ts->context = CurrentMemoryContext;

As far as I can tell, this member is never accessed again -- am I missing something?
+ /* DSA for tidstore will be detached at the end of session */

No other test module pins the mapping, but that doesn't necessarily mean it's wrong. Is there some advantage over explicitly detaching?

+-- Add tids in random order.

I don't see any randomization here. I do remember adding row_number to remove whitespace in the output, but I don't remember a random order. On that subject, row_number was an easy trick to avoid extra whitespace, but maybe we should just teach the setting function to return the block number rather than null?

+Datum
+tidstore_create(PG_FUNCTION_ARGS)
+{
...
+ tidstore = TidStoreCreate(max_bytes, dsa);

+Datum
+tidstore_set_block_offsets(PG_FUNCTION_ARGS)
+{
+ TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs);

These names are too similar. Maybe the test module should do s/tidstore_/test_/ or similar.

+/* Sanity check if we've called tidstore_create() */
+static void
+check_tidstore_available(void)
+{
+ if (tidstore == NULL)
+ elog(ERROR, "tidstore is not initialized");
+}

I don't find this very helpful. If a developer wiped out the create call, wouldn't the test crash and burn pretty obviously?

In general, the .sql file is still very hard-coded. Functions are created that contain a VALUES statement. Maybe it's okay for now, but I wanted to mention it. Ideally, we'd have some randomized tests, without having to display the input. That could be in addition to (not replacing) the small tests we have that display input. (see below)

v68-0002:

@@ -329,6 +381,13 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
  ret = (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;

+#ifdef TIDSTORE_DEBUG
+ if (!TidStoreIsShared(ts))
+ {
+ bool ret_debug = ts_debug_is_member(ts, tid);;
+ Assert(ret == ret_debug);
+ }
+#endif

This is only checking the case where we haven't returned already. In particular...
+ /* no entry for the blk */
+ if (page == NULL)
+ return false;
+
+ wordnum = WORDNUM(off);
+ bitnum = BITNUM(off);
+
+ /* no bitmap for the off */
+ if (wordnum >= page->nwords)
+ return false;

...these results are not checked. More broadly, it seems like the test module should be able to test everything that the debug-build array would complain about, including ordered iteration. This may require first saving our test input to a table. We could create a cursor on a query that fetches the ordered input from the table and verifies that the tidstore iteration produces the same ordered set, maybe with pl/pgSQL. Or something like that. Seems like not a whole lot of work. I can try later in the week, if you like.

v68-0005/6 look ready to squash

v68-0008 - I'm not a fan of capitalizing short comment fragments. I use one of two styles: short lower-case phrases, or full sentences including capitalization, correct grammar, and a period. I see these two styles all over the code base, as appropriate.

+ /* Remain attached until end of backend */

We'll probably want this comment, if in fact we want this behavior.

+ /*
+ * Note that funcctx->call_cntr is incremented in SRF_RETURN_NEXT
+ * before return.
+ */
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 9:53 AM John Naylor wrote:
>
> On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada wrote:
> > Yesterday I confirmed that something like the below fixes the
> > problem that happened in Windows CI:
> >
> > --- a/src/test/modules/test_radixtree/meson.build
> > +++ b/src/test/modules/test_radixtree/meson.build
> > @@ -12,6 +12,7 @@ endif
> >
> > test_radixtree = shared_module('test_radixtree',
> >    test_radixtree_sources,
> > +   link_with: host_system == 'windows' ? pgport_srv : [],
> >    kwargs: pg_test_mod_args,
> > )
> > test_install_libs += test_radixtree
>
> pgport_srv is for the backend; shared libraries should be using pgport_shlib
>
> In any case, I'm trying it in a CI branch with pgport_shlib now.

That seems to work, so I'll push that just to get things green again.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada wrote:
> Yesterday I confirmed that something like the below fixes the
> problem that happened in Windows CI:
>
> --- a/src/test/modules/test_radixtree/meson.build
> +++ b/src/test/modules/test_radixtree/meson.build
> @@ -12,6 +12,7 @@ endif
>
> test_radixtree = shared_module('test_radixtree',
>    test_radixtree_sources,
> +   link_with: host_system == 'windows' ? pgport_srv : [],
>    kwargs: pg_test_mod_args,
> )
> test_install_libs += test_radixtree

pgport_srv is for the backend; shared libraries should be using pgport_shlib

Further, the top-level meson.build has:

# all shared libraries not part of the backend should depend on this
frontend_shlib_code = declare_dependency(
  include_directories: [postgres_inc],
  link_with: [common_shlib, pgport_shlib],
  sources: generated_headers,
  dependencies: [shlib_code, os_deps, libintl],
)

...but the only things that declare needing frontend_shlib_code are in src/interfaces/. In any case, I'm trying it in a CI branch with pgport_shlib now.
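For concreteness, here is what the proposed hunk might look like with the shlib variant of libpgport substituted, per the observation above. This is only a sketch of the direction being discussed, not the committed form:

```meson
test_radixtree = shared_module('test_radixtree',
  test_radixtree_sources,
  # shared libraries should link the shlib build of libpgport
  # (pgport_shlib), not the backend's pgport_srv
  link_with: host_system == 'windows' ? pgport_shlib : [],
  kwargs: pg_test_mod_args,
)
test_install_libs += test_radixtree
```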
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada wrote:
>
> Yesterday I confirmed that something like the below fixes the
> problem that happened in Windows CI:

Glad you shared before I went and did it.

> --- a/src/test/modules/test_radixtree/meson.build
> +++ b/src/test/modules/test_radixtree/meson.build
> @@ -12,6 +12,7 @@ endif
>
> test_radixtree = shared_module('test_radixtree',
>    test_radixtree_sources,
> +   link_with: host_system == 'windows' ? pgport_srv : [],

I don't see any similar coding elsewhere, so that leaves me wondering if we're missing something. On the other hand, maybe no test modules use files in src/port ...

>    kwargs: pg_test_mod_args,
> )
> test_install_libs += test_radixtree
>
> But I'm not sure it's the right fix, especially because I guess it
> could raise an "AddressSanitizer: odr-violation" error on Windows.

Well, it's now at zero definitions that it can see, so I imagine it's possible that adding the above would not cause more than one. In any case, we might not know, since as far as I can tell the MSVC animals don't have address sanitizer. I'll look around some more, and if I don't get any revelations, I guess we should go with the above.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 10:04 AM John Naylor wrote:
>
> On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada wrote:
> >
> > It looks like it requires a link with pgport_srv but I'm not sure. It
> > seems that the recent commit 1f1d73a8b breaks CI, Windows - Server
> > 2019, VS 2019 - Meson & ninja, too.
>
> Unfortunately, none of the Windows animals happened to run both after
> the initial commit and before removing the (seemingly useless on our
> daily platforms) link. I'll confirm on my own CI branch in a few
> minutes.

Yesterday I confirmed that something like the below fixes the problem that happened in Windows CI:

--- a/src/test/modules/test_radixtree/meson.build
+++ b/src/test/modules/test_radixtree/meson.build
@@ -12,6 +12,7 @@ endif

test_radixtree = shared_module('test_radixtree',
   test_radixtree_sources,
+  link_with: host_system == 'windows' ? pgport_srv : [],
   kwargs: pg_test_mod_args,
)
test_install_libs += test_radixtree

But I'm not sure it's the right fix, especially because I guess it could raise an "AddressSanitizer: odr-violation" error on Windows.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada wrote:
>
> It looks like it requires a link with pgport_srv but I'm not sure. It
> seems that the recent commit 1f1d73a8b breaks CI, Windows - Server
> 2019, VS 2019 - Meson & ninja, too.

Unfortunately, none of the Windows animals happened to run both after the initial commit and before removing the (seemingly useless on our daily platforms) link. I'll confirm on my own CI branch in a few minutes.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 8:06 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada wrote: > > > > On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote: > > > > $ git grep 'link_with: pgport_srv' > > > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > > > > > No other test module uses this directive, and indeed, removing this > > > still builds fine for me. Thoughts? > > > > Yeah, it could be the culprit. The test_radixtree/meson.build is the > > sole extension that explicitly specifies a link with pgport_srv. I > > think we can get rid of it as I've also confirmed the build still fine > > even without it. > > olingo and grassquit have turned green, so that must have been it. fairywren is complaining another build failure: [1931/2156] "gcc" -o src/test/modules/test_radixtree/test_radixtree.dll src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj "-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group" "-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a" "-Wl,--stack,4194304" "-Wl,--allow-multiple-definition" "-Wl,--disable-auto-import" "-fvisibility=hidden" "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a" "-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm" "-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32" "-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group" FAILED: src/test/modules/test_radixtree/test_radixtree.dll "gcc" -o src/test/modules/test_radixtree/test_radixtree.dll src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj "-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group" 
"-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a" "-Wl,--stack,4194304" "-Wl,--allow-multiple-definition" "-Wl,--disable-auto-import" "-fvisibility=hidden" "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a" "-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm" "-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32" "-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group" C:/tools/nmsys64/ucrt64/bin/../lib/gcc/x86_64-w64-mingw32/12.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj:test_radixtree:(.rdata$.refptr.pg_popcount64[.refptr.pg_popcount64]+0x0): undefined reference to `pg_popcount64' It looks like it requires a link with pgport_srv but I'm not sure. It seems that the recent commit 1f1d73a8b breaks CI, Windows - Server 2019, VS 2019 - Meson & ninja, too. Regards, [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-03-07%2012%3A53%3A20 -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 8:06 PM John Naylor wrote:
>
> On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada wrote:
> >
> > On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote:
> > > > $ git grep 'link_with: pgport_srv'
> > > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv,
> > >
> > > No other test module uses this directive, and indeed, removing this
> > > still builds fine for me. Thoughts?
> >
> > Yeah, it could be the culprit. The test_radixtree/meson.build is the
> > sole extension that explicitly specifies a link with pgport_srv. I
> > think we can get rid of it, as I've also confirmed that the build is
> > still fine even without it.
>
> olingo and grassquit have turned green, so that must have been it.

Cool!

I've attached the remaining patches for CI. I've made some minor changes in separate patches and drafted the commit message for the tidstore patch.

While reviewing the tidstore code, I thought that it would be more appropriate to place tidstore.c under src/backend/lib instead of src/backend/access/common, since the tidstore is no longer implemented only for heap or other access methods, and it might also be used by executor nodes in the future. What do you think?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

v68-ART.tar.gz
Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada wrote:
>
> On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote:
> > $ git grep 'link_with: pgport_srv'
> > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv,
> >
> > No other test module uses this directive, and indeed, removing this
> > still builds fine for me. Thoughts?
>
> Yeah, it could be the culprit. The test_radixtree/meson.build is the
> sole extension that explicitly specifies a link with pgport_srv. I
> think we can get rid of it, as I've also confirmed that the build is
> still fine even without it.

olingo and grassquit have turned green, so that must have been it.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote:
>
> On Thu, Mar 7, 2024 at 1:19 PM John Naylor wrote:
> >
> > In addition, olingo and grassquit are showing different kinds of
> > "AddressSanitizer: odr-violation" errors, which I'm not sure what to
> > make of -- example:
>
> This might be relevant:
>
> $ git grep 'link_with: pgport_srv'
> src/test/modules/test_radixtree/meson.build: link_with: pgport_srv,
>
> No other test module uses this directive, and indeed, removing this
> still builds fine for me. Thoughts?

Yeah, it could be the culprit. The test_radixtree/meson.build is the sole extension that explicitly specifies a link with pgport_srv. I think we can get rid of it, as I've also confirmed that the build is still fine even without it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 1:19 PM John Naylor wrote: > > In addition, olingo and grassquit are showing different kinds of > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > make of -- example: This might be relevant: $ git grep 'link_with: pgport_srv' src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, No other test module uses this directive, and indeed, removing this still builds fine for me. Thoughts?
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 1:49 PM Masahiko Sawada wrote: > odr-violation seems to refer to One Definition Rule (ODR). According > to Wikipedia[1]: > > The One Definition Rule (ODR) is an important rule of the C++ > programming language that prescribes that classes/structs and > non-inline functions cannot have more than one definition in the > entire program and template and types cannot have more than one > definition by translation unit. It is defined in the ISO C++ Standard > (ISO/IEC 14882) 2003, at section 3.2. Some other programming languages > have similar but differently defined rules towards the same objective. > > I don't fully understand this concept yet but are these two different > build failures related? I thought it may have something to do with the prerequisite commit that moved some symbols from bitmapset.c to .h: /* Select appropriate bit-twiddling functions for bitmap word size */ #if BITS_PER_BITMAPWORD == 32 #define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w) #define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w) #define bmw_popcount(w) pg_popcount32(w) #elif BITS_PER_BITMAPWORD == 64 #define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w) #define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w) #define bmw_popcount(w) pg_popcount64(w) #else #error "invalid BITS_PER_BITMAPWORD" #endif ...but olingo's error seems strange to me, because it is complaining of pg_leftmost_one_pos, which refers to the lookup table in pg_bitutils.c -- I thought all buildfarm members used the bitscan instructions. grassquit is complaining of pg_popcount64, which is a global function, also in pg_bitutils.c. Not sure what to make of this, since we're just pointing symbols at things which should have a single definition...
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 4:21 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 2:14 PM Masahiko Sawada wrote: > > > > On Thu, Mar 7, 2024 at 4:01 PM John Naylor wrote: > > > > > > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada > > > wrote: > > > > > > > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor > > > > wrote: > > > > > > > > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor > > > > > wrote: > > > > > > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > > > > > > feature [-Werror,-Wtypedef-redefinition]" > > > > > > > > > > > > I'll look in the other templates to see if what they do. > > > > > > > > > > Their "declare" sections have full typedefs. I found it works to leave > > > > > out the typedef for the "define" section, but I first want to > > > > > reproduce the build failure. > > > > > > > > Right. I've reproduced this build failure on my machine by specifying > > > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the > > > > below change seems to fix the problem: > > > > > > Confirmed, will push shortly. 
> > > > mamba complained different build errors[1]: > > > > 2740 | fprintf(stderr, "num_keys = %ld\\n", tree->ctl->num_keys); > > | ~~^ ~~~ > > || | > > |long int int64 {aka long long > > int} > > | %lld > > ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld' > > expects argument of type 'long int', but argument 4 has type 'int64' > > {aka 'long long int'} [-Werror=format=] > > 2752 | fprintf(stderr, ", n%d = %ld", size_class.fanout, > > tree->ctl->num_nodes[i]); > > |~~^ > > ~~~ > > | | > > | > > | long int > > int64 {aka long long int} > > |%lld > > ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld' > > expects argument of type 'long int', but argument 3 has type 'int64' > > {aka 'long long int'} [-Werror=format=] > > 2755 | fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves); > > | ~~^ ~ > > ||| > > |long int int64 {aka long long > > int} > > | %lld > > > > Regards, > > > > [1] > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18 > > Yeah, the attached fixes it for me. Thanks, LGTM. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 2:14 PM Masahiko Sawada wrote: > > On Thu, Mar 7, 2024 at 4:01 PM John Naylor wrote: > > > > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada > > wrote: > > > > > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor > > > wrote: > > > > > > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor > > > > wrote: > > > > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > > > > > feature [-Werror,-Wtypedef-redefinition]" > > > > > > > > > > I'll look in the other templates to see if what they do. > > > > > > > > Their "declare" sections have full typedefs. I found it works to leave > > > > out the typedef for the "define" section, but I first want to > > > > reproduce the build failure. > > > > > > Right. I've reproduced this build failure on my machine by specifying > > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the > > > below change seems to fix the problem: > > > > Confirmed, will push shortly. > > mamba complained different build errors[1]: > > 2740 | fprintf(stderr, "num_keys = %ld\\n", tree->ctl->num_keys); > | ~~^ ~~~ > || | > |long int int64 {aka long long > int} > | %lld > ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld' > expects argument of type 'long int', but argument 4 has type 'int64' > {aka 'long long int'} [-Werror=format=] > 2752 | fprintf(stderr, ", n%d = %ld", size_class.fanout, > tree->ctl->num_nodes[i]); > |~~^ > ~~~ > | | > | > | long int > int64 {aka long long int} > |%lld > ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld' > expects argument of type 'long int', but argument 3 has type 'int64' > {aka 'long long int'} [-Werror=format=] > 2755 | fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves); > | ~~^ ~ > ||| > |long int int64 {aka long long int} > | %lld > > Regards, > > [1] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18 Yeah, the attached fixes it for me. 
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h index 93e6a7d809..b8ad51c14d 100644 --- a/src/include/lib/radixtree.h +++ b/src/include/lib/radixtree.h @@ -2737,7 +2737,7 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE * tree) { fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val); - fprintf(stderr, "num_keys = %ld\n", tree->ctl->num_keys); + fprintf(stderr, "num_keys = %lld\n", (long long) tree->ctl->num_keys); #ifdef RT_SHMEM fprintf(stderr, "handle = " DSA_POINTER_FORMAT "\n", tree->ctl->handle); @@ -2749,10 +2749,10 @@ RT_STATS(RT_RADIX_TREE * tree) { RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i]; - fprintf(stderr, ", n%d = %ld", size_class.fanout, tree->ctl->num_nodes[i]); + fprintf(stderr, ", n%d = %lld", size_class.fanout, (long long) tree->ctl->num_nodes[i]); } - fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves); + fprintf(stderr, ", leaves = %lld", (long long) tree->ctl->num_leaves); fprintf(stderr, "\n"); }
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 4:01 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada wrote: > > > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor wrote: > > > > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor > > > wrote: > > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > > > > feature [-Werror,-Wtypedef-redefinition]" > > > > > > > > I'll look in the other templates to see if what they do. > > > > > > Their "declare" sections have full typedefs. I found it works to leave > > > out the typedef for the "define" section, but I first want to > > > reproduce the build failure. > > > > Right. I've reproduced this build failure on my machine by specifying > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the > > below change seems to fix the problem: > > Confirmed, will push shortly. mamba complained different build errors[1]: 2740 | fprintf(stderr, "num_keys = %ld\\n", tree->ctl->num_keys); | ~~^ ~~~ || | |long int int64 {aka long long int} | %lld ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld' expects argument of type 'long int', but argument 4 has type 'int64' {aka 'long long int'} [-Werror=format=] 2752 | fprintf(stderr, ", n%d = %ld", size_class.fanout, tree->ctl->num_nodes[i]); |~~^ ~~~ | | | | long int int64 {aka long long int} |%lld ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'int64' {aka 'long long int'} [-Werror=format=] 2755 | fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves); | ~~^ ~ ||| |long int int64 {aka long long int} | %lld Regards, [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18 -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada wrote:
>
> On Thu, Mar 7, 2024 at 3:20 PM John Naylor wrote:
> >
> > On Thu, Mar 7, 2024 at 12:59 PM John Naylor wrote:
> > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > > feature [-Werror,-Wtypedef-redefinition]"
> > >
> > > I'll look in the other templates to see what they do.
> >
> > Their "declare" sections have full typedefs. I found it works to leave
> > out the typedef for the "define" section, but I first want to
> > reproduce the build failure.
>
> Right. I've reproduced this build failure on my machine by specifying
> the flags "-Wtypedef-redefinition -std=gnu99" to clang. Something like
> the below change seems to fix the problem:

Confirmed, will push shortly.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 3:27 PM Masahiko Sawada wrote: > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor wrote: > > > > > > In addition, olingo and grassquit are showing different kinds of > > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > > make of -- example: odr-violation seems to refer to One Definition Rule (ODR). According to Wikipedia[1]: The One Definition Rule (ODR) is an important rule of the C++ programming language that prescribes that classes/structs and non-inline functions cannot have more than one definition in the entire program and template and types cannot have more than one definition by translation unit. It is defined in the ISO C++ Standard (ISO/IEC 14882) 2003, at section 3.2. Some other programming languages have similar but differently defined rules towards the same objective. I don't fully understand this concept yet but are these two different build failures related? Regards, [1] https://en.wikipedia.org/wiki/One_Definition_Rule -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 3:20 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor wrote: > > > > On Thu, Mar 7, 2024 at 12:55 PM John Naylor wrote: > > > > > > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada > > > wrote: > > > > > > > > > + /* Find the control object in shared memory */ > > > > > + control = handle; > > > > > > > > I think it's mostly because of readability; it makes clear that the > > > > handle should be castable to dsa_pointer and it's a control object. I > > > > borrowed it from dshash_attach(). > > > > > > I find that a bit strange, but I went ahead and kept it. > > > > > > > > > > > > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada > > > wrote: > > > > > > > The 0001 patch looks good to me. I have some minor comments: > > > > > > > +PGFILEDESC = "test_radixtree - test code for > > > > src/backend/lib/radixtree.c" > > > > + > > > > > > > > "src/backend/lib/radixtree.c" should be updated to > > > > "src/include/lib/radixtree.h". > > > > > > Done. > > > > > > > --- /dev/null > > > > +++ b/src/test/modules/test_radixtree/README > > > > @@ -0,0 +1,7 @@ > > > > +test_integerset contains unit tests for testing the integer set > > > > implementation > > > > +in src/backend/lib/integerset.c. > > > > + > > > > +The tests verify the correctness of the implementation, but they can > > > > also be > > > > +used as a micro-benchmark. If you set the 'intset_test_stats' flag in > > > > +test_integerset.c, the tests will print extra information about > > > > execution time > > > > +and memory usage. > > > > > > > > This file is not updated for test_radixtree. I think we can remove it > > > > as the test cases in test_radixtree are clear. > > > > > > Done. I pushed this with a few last-minute cosmetic adjustments. This > > > has been a very long time coming, but we're finally in the home > > > stretch! > > > > > > Already, I see sifaka doesn't like this, and I'm looking now... > > > > It's complaining that these forward declarations... 
> > > > /* generate forward declarations necessary to use the radix tree */ > > #ifdef RT_DECLARE > > > > typedef struct RT_RADIX_TREE RT_RADIX_TREE; > > typedef struct RT_ITER RT_ITER; > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > > feature [-Werror,-Wtypedef-redefinition]" > > > > I'll look in the other templates to see if what they do. > > Their "declare" sections have full typedefs. I found it works to leave > out the typedef for the "define" section, but I first want to > reproduce the build failure. Right. I've reproduced this build failure on my machine by specifying flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the below change seems to fix the problem: --- a/src/include/lib/radixtree.h +++ b/src/include/lib/radixtree.h @@ -676,7 +676,7 @@ typedef struct RT_RADIX_TREE_CONTROL } RT_RADIX_TREE_CONTROL; /* Entry point for allocating and accessing the tree */ -typedef struct RT_RADIX_TREE +struct RT_RADIX_TREE { MemoryContext context; @@ -691,7 +691,7 @@ typedef struct RT_RADIX_TREE /* leaf_context is used only for single-value leaves */ MemoryContextData *leaf_context; #endif -} RT_RADIX_TREE; +}; /* * Iteration support. 
@@ -714,7 +714,7 @@ typedef struct RT_NODE_ITER } RT_NODE_ITER; /* state for iterating over the whole radix tree */ -typedef struct RT_ITER +struct RT_ITER { RT_RADIX_TREE *tree; @@ -728,7 +728,7 @@ typedef struct RT_ITER /* The key constructed during iteration */ uint64 key; -} RT_ITER; +}; /* verification (available only in assert-enabled builds) */ > > In addition, olingo and grassquit are showing different kinds of > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > make of -- example: > > ==1862767==ERROR: AddressSanitizer: odr-violation (0x7fc257476b60): > [1] size=256 'pg_leftmost_one_pos' > /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34 > [2] size=256 'pg_leftmost_one_pos' > /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34 > These globals were registered at these points: > [1]: > #0 0x563564b97bf6 in __asan_register_globals > (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6) > (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) > #1 0x563564b98d1d in __asan_register_elf_globals > (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d) > (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) > #2 0x7fc265c3fe3d in call_init elf/dl-init.c:74:3 > #3 0x7fc265c3fe3d in call_init elf/dl-init.c:26:1 > > [2]: > #0 0x563564b97bf6 in __asan_register_globals > (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6) > (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) > #1 0x563564b98d1d in __asan_register_elf_globals
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 12:59 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 12:55 PM John Naylor wrote: > > > > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada > > wrote: > > > > > > > + /* Find the control object in shared memory */ > > > > + control = handle; > > > > > > I think it's mostly because of readability; it makes clear that the > > > handle should be castable to dsa_pointer and it's a control object. I > > > borrowed it from dshash_attach(). > > > > I find that a bit strange, but I went ahead and kept it. > > > > > > > > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada > > wrote: > > > > > The 0001 patch looks good to me. I have some minor comments: > > > > > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" > > > + > > > > > > "src/backend/lib/radixtree.c" should be updated to > > > "src/include/lib/radixtree.h". > > > > Done. > > > > > --- /dev/null > > > +++ b/src/test/modules/test_radixtree/README > > > @@ -0,0 +1,7 @@ > > > +test_integerset contains unit tests for testing the integer set > > > implementation > > > +in src/backend/lib/integerset.c. > > > + > > > +The tests verify the correctness of the implementation, but they can > > > also be > > > +used as a micro-benchmark. If you set the 'intset_test_stats' flag in > > > +test_integerset.c, the tests will print extra information about > > > execution time > > > +and memory usage. > > > > > > This file is not updated for test_radixtree. I think we can remove it > > > as the test cases in test_radixtree are clear. > > > > Done. I pushed this with a few last-minute cosmetic adjustments. This > > has been a very long time coming, but we're finally in the home > > stretch! > > > > Already, I see sifaka doesn't like this, and I'm looking now... > > It's complaining that these forward declarations... 
> > /* generate forward declarations necessary to use the radix tree */ > #ifdef RT_DECLARE > > typedef struct RT_RADIX_TREE RT_RADIX_TREE; > typedef struct RT_ITER RT_ITER; > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > feature [-Werror,-Wtypedef-redefinition]" > > I'll look in the other templates to see what they do. Their "declare" sections have full typedefs. I found it works to leave out the typedef for the "define" section, but I first want to reproduce the build failure. In addition, olingo and grassquit are showing different kinds of "AddressSanitizer: odr-violation" errors, which I'm not sure what to make of -- example: ==1862767==ERROR: AddressSanitizer: odr-violation (0x7fc257476b60): [1] size=256 'pg_leftmost_one_pos' /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34 [2] size=256 'pg_leftmost_one_pos' /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34 These globals were registered at these points: [1]: #0 0x563564b97bf6 in __asan_register_globals (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6) (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) #1 0x563564b98d1d in __asan_register_elf_globals (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d) (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) #2 0x7fc265c3fe3d in call_init elf/dl-init.c:74:3 #3 0x7fc265c3fe3d in call_init elf/dl-init.c:26:1 [2]: #0 0x563564b97bf6 in __asan_register_globals (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6) (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) #1 0x563564b98d1d in __asan_register_elf_globals (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d) (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) #2 0x7fc2649847f5 in call_init 
csu/../csu/libc-start.c:145:3 #3 0x7fc2649847f5 in __libc_start_main csu/../csu/libc-start.c:347:5
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 12:55 PM John Naylor wrote: > > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada wrote: > > > > > + /* Find the control object in shared memory */ > > > + control = handle; > > > > I think it's mostly because of readability; it makes clear that the > > handle should be castable to dsa_pointer and it's a control object. I > > borrowed it from dshash_attach(). > > I find that a bit strange, but I went ahead and kept it. > > > > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada wrote: > > > The 0001 patch looks good to me. I have some minor comments: > > > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" > > + > > > > "src/backend/lib/radixtree.c" should be updated to > > "src/include/lib/radixtree.h". > > Done. > > > --- /dev/null > > +++ b/src/test/modules/test_radixtree/README > > @@ -0,0 +1,7 @@ > > +test_integerset contains unit tests for testing the integer set > > implementation > > +in src/backend/lib/integerset.c. > > + > > +The tests verify the correctness of the implementation, but they can also > > be > > +used as a micro-benchmark. If you set the 'intset_test_stats' flag in > > +test_integerset.c, the tests will print extra information about execution > > time > > +and memory usage. > > > > This file is not updated for test_radixtree. I think we can remove it > > as the test cases in test_radixtree are clear. > > Done. I pushed this with a few last-minute cosmetic adjustments. This > has been a very long time coming, but we're finally in the home > stretch! > > Already, I see sifaka doesn't like this, and I'm looking now... It's complaining that these forward declarations... /* generate forward declarations necessary to use the radix tree */ #ifdef RT_DECLARE typedef struct RT_RADIX_TREE RT_RADIX_TREE; typedef struct RT_ITER RT_ITER; ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 feature [-Werror,-Wtypedef-redefinition]" I'll look in the other templates to see what they do.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada wrote: > > > + /* Find the control object in shared memory */ > > + control = handle; > > I think it's mostly because of readability; it makes clear that the > handle should be castable to dsa_pointer and it's a control object. I > borrowed it from dshash_attach(). I find that a bit strange, but I went ahead and kept it. On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada wrote: > The 0001 patch looks good to me. I have some minor comments: > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" > + > > "src/backend/lib/radixtree.c" should be updated to > "src/include/lib/radixtree.h". Done. > --- /dev/null > +++ b/src/test/modules/test_radixtree/README > @@ -0,0 +1,7 @@ > +test_integerset contains unit tests for testing the integer set > implementation > +in src/backend/lib/integerset.c. > + > +The tests verify the correctness of the implementation, but they can also be > +used as a micro-benchmark. If you set the 'intset_test_stats' flag in > +test_integerset.c, the tests will print extra information about execution > time > +and memory usage. > > This file is not updated for test_radixtree. I think we can remove it > as the test cases in test_radixtree are clear. Done. I pushed this with a few last-minute cosmetic adjustments. This has been a very long time coming, but we're finally in the home stretch! Already, I see sifaka doesn't like this, and I'm looking now...
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 8:20 PM John Naylor wrote: > > On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada wrote: > > > > > I'd like to push 0001 and 0002 shortly, and then do another sweep over > > > 0003, with remaining feedback, and get that in so we get some > > > buildfarm testing before the remaining polishing work on > > > tidstore/vacuum. > > > > Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to > > polish tidstore and vacuum patches and update commit messages. > > I don't think v66 got a CI run because of vacuumlazy.c bitrot, so I'm > attaching v67 which fixes that and has some small cosmetic adjustments > to the template. Thank you for updating the patch. > One functional change for debug builds is that > RT_STATS now prints out the number of leaves. I'll squash and push > 0001 tomorrow morning unless there are further comments. The 0001 patch looks good to me. I have some minor comments: --- /dev/null +++ b/src/test/modules/test_radixtree/Makefile @@ -0,0 +1,23 @@ +# src/test/modules/test_radixtree/Makefile + +MODULE_big = test_radixtree +OBJS = \ + $(WIN32RES) \ + test_radixtree.o +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" + "src/backend/lib/radixtree.c" should be updated to "src/include/lib/radixtree.h". --- --- /dev/null +++ b/src/test/modules/test_radixtree/README @@ -0,0 +1,7 @@ +test_integerset contains unit tests for testing the integer set implementation +in src/backend/lib/integerset.c. + +The tests verify the correctness of the implementation, but they can also be +used as a micro-benchmark. If you set the 'intset_test_stats' flag in +test_integerset.c, the tests will print extra information about execution time +and memory usage. This file is not updated for test_radixtree. I think we can remove it as the test cases in test_radixtree are clear. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 8:25 PM John Naylor wrote: > > Actually, I forgot -- I had one more question: Masahiko, is there a > reason for this extra local variable, which uses the base type, rather > than the typedef'd parameter? > > +RT_SCOPE RT_RADIX_TREE * > +RT_ATTACH(dsa_area *dsa, RT_HANDLE handle) > +{ > + RT_RADIX_TREE *tree; > + dsa_pointer control; > + > + tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE)); > + > + /* Find the control object in shared memory */ > + control = handle; I think it's mostly because of readability; it makes clear that the handle should be castable to dsa_pointer and it's a control object. I borrowed it from dshash_attach(). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
Actually, I forgot -- I had one more question: Masahiko, is there a reason for this extra local variable, which uses the base type, rather than the typedef'd parameter? +RT_SCOPE RT_RADIX_TREE * +RT_ATTACH(dsa_area *dsa, RT_HANDLE handle) +{ + RT_RADIX_TREE *tree; + dsa_pointer control; + + tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE)); + + /* Find the control object in shared memory */ + control = handle;
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada wrote: > > > I'd like to push 0001 and 0002 shortly, and then do another sweep over > > 0003, with remaining feedback, and get that in so we get some > > buildfarm testing before the remaining polishing work on > > tidstore/vacuum. > > Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to > polish tidstore and vacuum patches and update commit messages. I don't think v66 got a CI run because of vacuumlazy.c bitrot, so I'm attaching v67 which fixes that and has some small cosmetic adjustments to the template. One functional change for debug builds is that RT_STATS now prints out the number of leaves. I'll squash and push 0001 tomorrow morning unless there are further comments. v67-ART.tar.gz Description: application/gzip
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 3:40 PM Masahiko Sawada wrote: > > On Wed, Mar 6, 2024 at 5:33 PM John Naylor wrote: > > I've looked around and it seems clang is more lax on conversions. > > Since it works fine for clang, I think we just need a cast here for > > gcc. I've attached a blind attempt at a fix -- I'll apply shortly > > unless someone happens to test and find it doesn't work. > > I've reproduced the same error on my raspberry pi, and confirmed the > patch fixes the error. > > My previous idea was wrong. With my proposal, the regression test for > radix tree failed on my raspberry pi. On the other hand, with your > patch the tests passed. Pushed, and at least parula's green now, thanks for testing! And thanks, Andres, for the ping!
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 5:33 PM John Naylor wrote: > > On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada wrote: > > > > ../../src/include/port/simd.h:326:71: error: incompatible type for > > argument 1 of ‘vshrq_n_s8’ > > uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, > > 7)); > >^ > > > > Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. > > I've looked around and it seems clang is more lax on conversions. > Since it works fine for clang, I think we just need a cast here for > gcc. I've attached a blind attempt at a fix -- I'll apply shortly > unless someone happens to test and find it doesn't work. I've reproduced the same error on my raspberry pi, and confirmed the patch fixes the error. My previous idea was wrong. With my proposal, the regression test for radix tree failed on my raspberry pi. On the other hand, with your patch the tests passed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada wrote: > > ../../src/include/port/simd.h:326:71: error: incompatible type for > argument 1 of ‘vshrq_n_s8’ > uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); >^ > > Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. I've looked around and it seems clang is more lax on conversions. Since it works fine for clang, I think we just need a cast here for gcc. I've attached a blind attempt at a fix -- I'll apply shortly unless someone happens to test and find it doesn't work. diff --git a/src/include/port/simd.h b/src/include/port/simd.h index 326b4faff5..597496f2fb 100644 --- a/src/include/port/simd.h +++ b/src/include/port/simd.h @@ -323,7 +323,7 @@ vector8_highbit_mask(const Vector8 v) 1 << 4, 1 << 5, 1 << 6, 1 << 7, }; - uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); + uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8((int8x16_t) v, 7)); uint8x16_t maskedhi = vextq_u8(masked, masked, 8); return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 3:06 PM John Naylor wrote: > > (Hmm, I thought we had run this code on Arm already...) CI MacOS uses Clang on aarch64, which has been working fine. The failing animals are on gcc 7.3...
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On March 6, 2024 9:06:50 AM GMT+01:00, John Naylor wrote: >On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada wrote: >> >> On Wed, Mar 6, 2024 at 4:41 PM Andres Freund wrote: > >> > A few ARM buildfarm animals are complaining: >> > >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02 >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03 >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18 >> > >> >> The error message we got is: >> >> ../../src/include/port/simd.h:326:71: error: incompatible type for >> argument 1 of ‘vshrq_n_s8’ >> uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, >> 7)); >>^ >> >> Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. > >That sounds plausible, and I'll look further. > >(Hmm, I thought we had run this code on Arm already...) Perhaps we should switch one of the CI jobs to ARM... Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada wrote: > > On Wed, Mar 6, 2024 at 4:41 PM Andres Freund wrote: > > A few ARM buildfarm animals are complaining: > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18 > > > > The error message we got is: > > ../../src/include/port/simd.h:326:71: error: incompatible type for > argument 1 of ‘vshrq_n_s8’ > uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); >^ > > Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. That sounds plausible, and I'll look further. (Hmm, I thought we had run this code on Arm already...)
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 4:41 PM Andres Freund wrote: > > Hi, > > On 2024-03-05 16:41:30 +0700, John Naylor wrote: > > I'd like to push 0001 and 0002 shortly, and then do another sweep over > > 0003, with remaining feedback, and get that in so we get some > > buildfarm testing before the remaining polishing work on > > tidstore/vacuum. > > A few ARM buildfarm animals are complaining: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18 > The error message we got is: ../../src/include/port/simd.h:326:71: error: incompatible type for argument 1 of ‘vshrq_n_s8’ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); ^ Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2024-03-05 16:41:30 +0700, John Naylor wrote: > I'd like to push 0001 and 0002 shortly, and then do another sweep over > 0003, with remaining feedback, and get that in so we get some > buildfarm testing before the remaining polishing work on > tidstore/vacuum. A few ARM buildfarm animals are complaining: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18 Greetings, Andres Freund
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 12:59 PM John Naylor wrote: > > On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada wrote: > > > > On Tue, Mar 5, 2024 at 6:41 PM John Naylor wrote: > > > > Done in v66-0009. I'd be curious to hear any feedback. I like the > > > aspect that the random numbers come from a different seed every time > > > the test runs. > > > > The new tests look good. Here are some comments: > > > > --- > > + expected = keys[i]; > > + iterval = rt_iterate_next(iter, &iterkey); > > > > - ndeleted++; > > + EXPECT_TRUE(iterval != NULL); > > + EXPECT_EQ_U64(iterkey, expected); > > + EXPECT_EQ_U64(*iterval, expected); > > > > Can we verify that the iteration returns keys in ascending order? > > We get the "expected" value from the keys we saved in the now-sorted > array, so we do already. Unless I misunderstand you. Ah, you're right. Please ignore this comment. > > > --- > > radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext, > > "test_radix_tree", > > ALLOCSET_DEFAULT_SIZES); > > > > We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I > > think it's better to use either one for consistency. > > Will change to "small", since 32-bit platforms will use slab for leaves. Agreed. > > I'll look at the memory usage and estimate what 32-bit platforms will > use, and maybe adjust the number of keys. A few megabytes is fine, but > not many megabytes. Thanks, sounds good. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com