Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, May 1, 2024 at 4:29 PM John Naylor wrote: > > On Thu, Apr 25, 2024 at 8:36 AM Masahiko Sawada wrote: > > > > On Mon, Apr 15, 2024 at 6:12 PM John Naylor wrote: > > > > - RT_KEY_GET_SHIFT is not covered for key=0: > > > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803 > > > > > > That should be fairly simple to add to the tests. > > > > There are two paths to call RT_KEY_GET_SHIFT(): > > > > 1. RT_SET() -> RT_KEY_GET_SHIFT() > > 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT() > > > > In both cases, it's called when key > tree->ctl->max_val. Since the > > minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called > > when key=0. > > Ah, right, so it is dead code. Nothing to worry about, but it does > point the way to some simplifications, which I've put together in the > attached. Thank you for the patch. It looks good to me. + /* compute the smallest shift that will allowing storing the key */ + start_shift = pg_leftmost_one_pos64(key) / RT_SPAN * RT_SPAN; The comment is moved from RT_KEY_GET_SHIFT() but I think s/will allowing storing/will allow storing/. > > > > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered: > > > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644 > > > > > > That should be easy to add. > > > > Agreed. The patch is attached. > > LGTM > > > > - TidStoreCreate* has some memory clamps that are not covered: > > > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179 > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234 > > > > > > Maybe we could experiment with using 1MB for shared, and something > > > smaller for local. > > > > I've confirmed that the local and shared tidstore with small max sizes > > such as 4kB and 1MB worked. 
Currently the max size is hard-coded in > > test_tidstore.c but if we use work_mem as the max size, we can pass > > different max sizes for local and shared in the test script. > > Seems okay, do you want to try that and see how it looks? I've attached a simple patch for this. In test_tidstore.sql, we used to create two local tidstores and one shared tidstore. I thought of specifying small work_mem values for these three cases, but that would remove the normal test cases, so I created a separate tidstore for this test. Also, the new test only checks that a tidstore can be created with such a small size; it might be a good idea to add some TIDs to check that it really works fine. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com use_work_mem_as_max_bytes.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Apr 25, 2024 at 8:36 AM Masahiko Sawada wrote: > > On Mon, Apr 15, 2024 at 6:12 PM John Naylor wrote: > > - RT_KEY_GET_SHIFT is not covered for key=0: > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803 > > > > That should be fairly simple to add to the tests. > > There are two paths to call RT_KEY_GET_SHIFT(): > > 1. RT_SET() -> RT_KEY_GET_SHIFT() > 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT() > > In both cases, it's called when key > tree->ctl->max_val. Since the > minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called > when key=0. Ah, right, so it is dead code. Nothing to worry about, but it does point the way to some simplifications, which I've put together in the attached. > > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered: > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644 > > > > That should be easy to add. > > Agreed. The patch is attached. LGTM > > - TidStoreCreate* has some memory clamps that are not covered: > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179 > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234 > > > > Maybe we could experiment with using 1MB for shared, and something > > smaller for local. > > I've confirmed that the local and shared tidstore with small max sizes > such as 4kB and 1MB worked. Currently the max size is hard-coded in > test_tidstore.c but if we use work_mem as the max size, we can pass > different max sizes for local and shared in the test script. Seems okay, do you want to try that and see how it looks? 
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h
index 2896a6efc5..fdac103763 100644
--- a/src/include/lib/radixtree.h
+++ b/src/include/lib/radixtree.h
@@ -217,7 +217,6 @@
 #define RT_NODE_48_GET_CHILD RT_MAKE_NAME(node_48_get_child)
 #define RT_NODE_256_IS_CHUNK_USED RT_MAKE_NAME(node_256_is_chunk_used)
 #define RT_NODE_256_GET_CHILD RT_MAKE_NAME(node_256_get_child)
-#define RT_KEY_GET_SHIFT RT_MAKE_NAME(key_get_shift)
 #define RT_SHIFT_GET_MAX_VAL RT_MAKE_NAME(shift_get_max_val)
 #define RT_NODE_SEARCH RT_MAKE_NAME(node_search)
 #define RT_NODE_DELETE RT_MAKE_NAME(node_delete)
@@ -320,9 +319,6 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE * tree);
 /* Mask for extracting a chunk from a key */
 #define RT_CHUNK_MASK ((1 << RT_SPAN) - 1)
 
-/* Maximum shift needed to extract a chunk from a key */
-#define RT_MAX_SHIFT RT_KEY_GET_SHIFT(UINT64_MAX)
-
 /* Maximum level a tree can reach for a key */
 #define RT_MAX_LEVEL ((sizeof(uint64) * BITS_PER_BYTE) / RT_SPAN)
 
@@ -797,28 +793,15 @@ RT_NODE_256_GET_CHILD(RT_NODE_256 * node, uint8 chunk)
 	return &node->children[chunk];
 }
 
-/*
- * Return the smallest shift that will allowing storing the given key.
- */
-static inline int
-RT_KEY_GET_SHIFT(uint64 key)
-{
-	if (key == 0)
-		return 0;
-	else
-		return (pg_leftmost_one_pos64(key) / RT_SPAN) * RT_SPAN;
-}
-
 /*
  * Return the max value that can be stored in the tree with the given shift.
  */
 static uint64
 RT_SHIFT_GET_MAX_VAL(int shift)
 {
-	if (shift == RT_MAX_SHIFT)
-		return UINT64_MAX;
-	else
-		return (UINT64CONST(1) << (shift + RT_SPAN)) - 1;
+	int			max_shift = (sizeof(uint64) - 1) * BITS_PER_BYTE;
+
+	return UINT64_MAX >> (max_shift - shift);
 }
 
 /*
@@ -1574,9 +1557,8 @@ RT_NODE_INSERT(RT_RADIX_TREE * tree, RT_PTR_ALLOC * parent_slot, RT_CHILD_PTR no
  * and move the old node below it.
  */
 static pg_noinline void
-RT_EXTEND_UP(RT_RADIX_TREE * tree, uint64 key)
+RT_EXTEND_UP(RT_RADIX_TREE * tree, uint64 key, int target_shift)
 {
-	int			target_shift = RT_KEY_GET_SHIFT(key);
 	int			shift = tree->ctl->start_shift;
 
 	Assert(shift < target_shift);
@@ -1713,11 +1695,15 @@ RT_SET(RT_RADIX_TREE * tree, uint64 key, RT_VALUE_TYPE * value_p)
 	/* Extend the tree if necessary */
 	if (unlikely(key > tree->ctl->max_val))
 	{
+		int			start_shift;
+
+		/* compute the smallest shift that will allowing storing the key */
+		start_shift = pg_leftmost_one_pos64(key) / RT_SPAN * RT_SPAN;
+
 		if (tree->ctl->num_keys == 0)
 		{
 			RT_CHILD_PTR node;
 			RT_NODE_4  *n4;
-			int			start_shift = RT_KEY_GET_SHIFT(key);
 
 			/*
 			 * With an empty root node, we don't extend the tree upwards,
@@ -1738,7 +1724,7 @@ RT_SET(RT_RADIX_TREE * tree, uint64 key, RT_VALUE_TYPE * value_p)
 			goto have_slot;
 		}
 		else
-			RT_EXTEND_UP(tree, key);
+			RT_EXTEND_UP(tree, key, start_shift);
 	}
 
 	slot = RT_GET_SLOT_RECURSIVE(tree, &tree->ctl->root,
@@ -2937,7 +2923,6 @@ RT_DUMP_NODE(RT_NODE * node)
 #undef RT_SPAN
 #undef RT_NODE_MAX_SLOTS
 #undef RT_CHUNK_MASK
-#undef RT_MAX_SHIFT
 #undef RT_MAX_LEVEL
 #undef RT_GET_KEY_CHUNK
 #undef RT_BM_IDX
@@ -3032,7 +3017,6 @@ RT_DUMP_NODE(RT_NODE * node)
 #undef RT_NODE_48_GET_CHILD
 #undef RT_NODE_256_IS_CHUNK_USED
 #undef RT_NODE_256_GET_CHILD
-
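The branch-free RT_SHIFT_GET_MAX_VAL in the diff above is equivalent to the old branching form for every shift the tree can produce. A standalone sketch (uint64_t and UINT64_C stand in for PostgreSQL's uint64 and UINT64CONST):

```c
#include <stdint.h>

#define RT_SPAN 8
#define BITS_PER_BYTE 8

/* branch-free form from the patch: max_shift is 56, so shifting
 * UINT64_MAX right by (56 - shift) leaves shift + RT_SPAN one-bits */
static uint64_t
shift_get_max_val(int shift)
{
	int			max_shift = (sizeof(uint64_t) - 1) * BITS_PER_BYTE;

	return UINT64_MAX >> (max_shift - shift);
}

/* old branching form, valid only for shift < 56 because the
 * left shift would otherwise overflow */
static uint64_t
shift_get_max_val_old(int shift)
{
	return (UINT64_C(1) << (shift + RT_SPAN)) - 1;
}
```

Both agree at every multiple of RT_SPAN below 56, and the new form also covers shift == 56 without the RT_MAX_SHIFT special case.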
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Apr 25, 2024 at 1:38 PM Masahiko Sawada wrote: > > On Thu, Apr 25, 2024 at 12:17 PM John Naylor wrote: > > > > On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada > > wrote: > > > > > > > I saw a SIGSEGV there when using tidstore to write a fix for something > > > > else. > > > > Patch attached. > > > > > > Great find, thank you for the patch! > > > > +1 > > > > (This occurred to me a few days ago, but I was far from my computer.) > > > > With the purge function that Noah proposed, I believe we can also get > > rid of the comment at the top of the .sql test file warning of a > > maintenance hazard: > > ..."To avoid adding duplicates, > > -- each call to do_set_block_offsets() should use different block > > -- numbers." > > Good point. Removed. > > > > > > of do_gset_block_offset() and check_set_block_offsets(). If these are > > > annoying, we can remove the cases of array[1] and array[1,2]. > > > > Let's keep those -- 32-bit platforms should also exercise this path. > > Agreed. > > I've attached a new patch. I'll push it tonight, if there is no further > comment. > Pushed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Apr 25, 2024 at 12:17 PM John Naylor wrote: > > On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada wrote: > > > > > I saw a SIGSEGV there when using tidstore to write a fix for something > > > else. > > > Patch attached. > > > > Great find, thank you for the patch! > > +1 > > (This occurred to me a few days ago, but I was far from my computer.) > > With the purge function that Noah proposed, I believe we can also get > rid of the comment at the top of the .sql test file warning of a > maintenance hazard: > ..."To avoid adding duplicates, > -- each call to do_set_block_offsets() should use different block > -- numbers." Good point. Removed. > > > of do_gset_block_offset() and check_set_block_offsets(). If these are > > annoying, we can remove the cases of array[1] and array[1,2]. > > Let's keep those -- 32-bit platforms should also exercise this path. Agreed. I've attached a new patch. I'll push it tonight, if there is no further comment. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v2-0001-radixtree-Fix-SIGSEGV-at-update-of-embeddable-val.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Apr 25, 2024 at 9:50 AM Masahiko Sawada wrote: > > > I saw a SIGSEGV there when using tidstore to write a fix for something else. > > Patch attached. > > Great find, thank you for the patch! +1 (This occurred to me a few days ago, but I was far from my computer.) With the purge function that Noah proposed, I believe we can also get rid of the comment at the top of the .sql test file warning of a maintenance hazard: ..."To avoid adding duplicates, -- each call to do_set_block_offsets() should use different block -- numbers." I found that it doesn't add any measurable time to run the test. > The fix looks good to me. I think we can improve regression tests for > better coverage. In TidStore on a 64-bit machine, we can store 3 > offsets in the header and these values are embedded to the leaf page. > With more than 3 offsets, the value size becomes more than 16 bytes > and a single value leaf. Therefore, if we can add the test with the > array[1,2,3,4,100], we can cover the case of replacing a single-value > leaf with a different size new single-value leaf. Now we add 9 pairs Good idea. > of do_gset_block_offset() and check_set_block_offsets(). If these are > annoying, we can remove the cases of array[1] and array[1,2]. Let's keep those -- 32-bit platforms should also exercise this path.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Apr 25, 2024 at 6:03 AM Noah Misch wrote:
>
> On Mon, Apr 15, 2024 at 04:12:38PM +0700, John Naylor wrote:
> > - Some paths for single-value leaves are not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606
> >
> > However, these paths do get regression test coverage on 32-bit
> > machines. 64-bit builds only have leaves in the TID store, which
> > doesn't (currently) delete entries, and doesn't instantiate the tree
> > with the debug option.
> >
> > - In RT_SET "if (found)" is not covered:
> >
> > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
> >
> > That's because we don't yet have code that replaces an existing value
> > with a value of a different length.
>
> I saw a SIGSEGV there when using tidstore to write a fix for something else.
> Patch attached.

Great find, thank you for the patch!

The fix looks good to me. I think we can improve the regression tests for
better coverage. In TidStore on a 64-bit machine, we can store 3 offsets in
the header and these values are embedded in the leaf page. With more than 3
offsets, the value size becomes more than 16 bytes and it becomes a
single-value leaf. Therefore, if we add the test with array[1,2,3,4,100], we
can cover the case of replacing a single-value leaf with a new single-value
leaf of a different size. Now we add 9 pairs of do_set_block_offsets() and
check_set_block_offsets(). If these are annoying, we can remove the cases of
array[1] and array[1,2].

I've attached a new patch. In addition to the new test case I mentioned,
I've added some new comments and removed an unnecessary added line in
test_tidstore.sql.
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com 0001-radixtree-Fix-SIGSEGV-at-update-of-embeddable-value-.patch Description: Binary data
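The size arithmetic behind the suggested array[1,2,3,4,100] case can be sketched with an illustrative model (entry_size is a hypothetical helper, not TidStore's actual code; the constants assume a 64-bit build with an 8-byte bitmapword):

```c
#include <stddef.h>

#define HEADER_SIZE		8	/* child-pointer-sized header: flags, nwords, 3 offsets */
#define BITMAPWORD_SIZE	8
#define BITS_PER_WORD	64

/* illustrative: bytes a BlocktableEntry-like value occupies */
static size_t
entry_size(int max_offset, int noffsets)
{
	if (noffsets <= 3)
		return HEADER_SIZE;		/* offsets embedded, no bitmap needed */

	/* otherwise a bitmap with enough words to cover max_offset */
	return HEADER_SIZE + (max_offset / BITS_PER_WORD + 1) * BITMAPWORD_SIZE;
}
```

Under this model, setting array[1,2,3,4] and then array[1,2,3,4,100] for the same block replaces one single-value leaf with a larger one, which is exactly the RT_SET "if (found)" path the fix touches.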
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Apr 15, 2024 at 6:12 PM John Naylor wrote: > > I took a look at the coverage report from [1] and it seems pretty > good, but there are a couple more tests we could do. Thank you for checking! > > - RT_KEY_GET_SHIFT is not covered for key=0: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803 > > That should be fairly simple to add to the tests. There are two paths to call RT_KEY_GET_SHIFT(): 1. RT_SET() -> RT_KEY_GET_SHIFT() 2. RT_SET() -> RT_EXTEND_UP() -> RT_KEY_GET_SHIFT() In both cases, it's called when key > tree->ctl->max_val. Since the minimum value of max_val is 255, RT_KEY_GET_SHIFT() is never called when key=0. > > - Some paths for single-value leaves are not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904 > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954 > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606 > > However, these paths do get regression test coverage on 32-bit > machines. 64-bit builds only have leaves in the TID store, which > doesn't (currently) delete entries, and doesn't instantiate the tree > with the debug option. Right. > > - In RT_SET "if (found)" is not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768 > > That's because we don't yet have code that replaces an existing value > with a value of a different length. Noah reported an issue around that. We should incorporate the patch and cover this code path. > > - RT_FREE_RECURSE isn't well covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768 > > The TID store test is pretty simple as far as distribution of block > keys, and focuses more on the offset bitmaps. 
We could try to cover > all branches here, but it would make the test less readable, and it's > kind of the wrong place to do that anyway. test_radixtree.c does have > a commented-out option to use shared memory, but that's for local > testing and won't be reflected in the coverage report. Maybe it's > enough. Agreed. > > - RT_DELETE: "if (key > tree->ctl->max_val)" is not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644 > > That should be easy to add. Agreed. The patch is attached. > > - RT_DUMP_NODE is not covered, and never called by default anyway: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2804 > > It seems we could just leave it alone since it's debug-only, but it's > also a lot of lines. One idea is to use elog with DEBUG5 instead of > commenting out the call sites, but that would cause a lot of noise. I think we can leave it alone. > > - TidStoreCreate* has some memory clamps that are not covered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179 > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234 > > Maybe we could experiment with using 1MB for shared, and something > smaller for local. I've confirmed that the local and shared tidstore with small max sizes such as 4kB and 1MB worked. Currently the max size is hard-coded in test_tidstore.c but if we use work_mem as the max size, we can pass different max sizes for local and shared in the test script. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com improve_code_coverage_radixtree.patch Description: Binary data
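The start-shift analysis above can be exercised in isolation. In this sketch, leftmost_one_pos64() is a portable stand-in for PostgreSQL's pg_leftmost_one_pos64() (illustrative, not the real implementation):

```c
#include <stdint.h>

#define RT_SPAN 8				/* bits consumed per tree level, as in radixtree.h */

/* stand-in for pg_leftmost_one_pos64(): position of the highest set bit.
 * The real function requires word != 0, which holds on this path because
 * it is only reached when key > max_val, and max_val is at least 255. */
static int
leftmost_one_pos64(uint64_t word)
{
	int			pos = 0;

	while (word >>= 1)
		pos++;
	return pos;
}

/* smallest shift whose level can address the key's topmost nonzero chunk */
static int
start_shift_for_key(uint64_t key)
{
	return leftmost_one_pos64(key) / RT_SPAN * RT_SPAN;
}
```

Keys up to 255 need no extension (shift 0), key 256 needs one more level (shift 8), and the largest keys land on shift 56, which is why the key == 0 branch of RT_KEY_GET_SHIFT() was unreachable.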
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Apr 15, 2024 at 04:12:38PM +0700, John Naylor wrote:
> - Some paths for single-value leaves are not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606
>
> However, these paths do get regression test coverage on 32-bit
> machines. 64-bit builds only have leaves in the TID store, which
> doesn't (currently) delete entries, and doesn't instantiate the tree
> with the debug option.
>
> - In RT_SET "if (found)" is not covered:
>
> https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768
>
> That's because we don't yet have code that replaces an existing value
> with a value of a different length.

I saw a SIGSEGV there when using tidstore to write a fix for something else.
Patch attached.

Author: Noah Misch
Commit: Noah Misch

    radixtree: Fix SIGSEGV at update of embeddable value to non-embeddable.

    Also, fix a memory leak when updating from non-embeddable to embeddable.
    Both were unreachable without adding C code.

    Reviewed by FIXME.
Discussion: https://postgr.es/m/FIXME diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h index dc4c00d..aa8f44c 100644 --- a/src/include/lib/radixtree.h +++ b/src/include/lib/radixtree.h @@ -1749,6 +1749,9 @@ have_slot: if (RT_VALUE_IS_EMBEDDABLE(value_p)) { + if (found && !RT_CHILDPTR_IS_VALUE(*slot)) + RT_FREE_LEAF(tree, *slot); + /* store value directly in child pointer slot */ memcpy(slot, value_p, value_sz); @@ -1765,7 +1768,7 @@ have_slot: { RT_CHILD_PTR leaf; - if (found) + if (found && !RT_CHILDPTR_IS_VALUE(*slot)) { Assert(RT_PTR_ALLOC_IS_VALID(*slot)); leaf.alloc = *slot; diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out index 06c610e..d116927 100644 --- a/src/test/modules/test_tidstore/expected/test_tidstore.out +++ b/src/test/modules/test_tidstore/expected/test_tidstore.out @@ -79,6 +79,96 @@ SELECT test_destroy(); (1 row) +-- Test replacements crossing RT_CHILDPTR_IS_VALUE in both directions +SELECT test_create(false); + test_create +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1]::int2[]); SELECT check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1,2]::int2[]); SELECT check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1,2,3]::int2[]); SELECT check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1,2,3,4]::int2[]); SELECT check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1,2,3]::int2[]); SELECT check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1,2]::int2[]); SELECT 
check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT do_set_block_offsets(1, array[1]::int2[]); SELECT check_set_block_offsets(); + do_set_block_offsets +-- +1 +(1 row) + + check_set_block_offsets +- + +(1 row) + +SELECT test_destroy(); + test_destroy +-- + +(1 row) + -- Use shared memory this time. We can't do that in test_radixtree.sql, -- because unused static functions would raise warnings there. SELECT test_create(true); diff --git a/src/test/modules/test_tidstore/sql/test_tidstore.sql b/src/test/modules/test_tidstore/sql/test_tidstore.sql index bb31877..704d869 100644 --- a/src/test/modules/test_tidstore/sql/test_tidstore.sql +++ b/src/test/modules/test_tidstore/sql/test_tidstore.sql @@ -14,6 +14,7 @@ CREATE TEMP TABLE hideblocks(blockno bigint); -- We use a higher number to test tidstore. \set maxoffset 512 + SELECT test_create(false); -- Test on empty tidstore. @@ -50,6 +51,20 @@ SELECT t
Re: [PoC] Improve dead tuple storage for lazy vacuum
I took a look at the coverage report from [1] and it seems pretty good, but
there are a couple more tests we could do.

- RT_KEY_GET_SHIFT is not covered for key=0:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L803

That should be fairly simple to add to the tests.

- Some paths for single-value leaves are not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L904
https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L954
https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2606

However, these paths do get regression test coverage on 32-bit machines.
64-bit builds only have leaves in the TID store, which doesn't (currently)
delete entries, and doesn't instantiate the tree with the debug option.

- In RT_SET "if (found)" is not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768

That's because we don't yet have code that replaces an existing value with a
value of a different length.

- RT_FREE_RECURSE isn't well covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L1768

The TID store test is pretty simple as far as distribution of block keys, and
focuses more on the offset bitmaps. We could try to cover all branches here,
but it would make the test less readable, and it's kind of the wrong place to
do that anyway. test_radixtree.c does have a commented-out option to use
shared memory, but that's for local testing and won't be reflected in the
coverage report. Maybe it's enough.

- RT_DELETE: "if (key > tree->ctl->max_val)" is not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2644

That should be easy to add.

- RT_DUMP_NODE is not covered, and never called by default anyway:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/include/lib/radixtree.h.gcov.html#L2804

It seems we could just leave it alone since it's debug-only, but it's also a
lot of lines. One idea is to use elog with DEBUG5 instead of commenting out
the call sites, but that would cause a lot of noise.

- TidStoreCreate* has some memory clamps that are not covered:

https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L179
https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/common/tidstore.c.gcov.html#L234

Maybe we could experiment with using 1MB for shared, and something smaller
for local.

[1] https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Apr 8, 2024 at 7:26 PM John Naylor wrote:
>
> I pushed both of these and see that mylodon complains that anonymous
> unions are a C11 feature. I'm not actually sure that the union with
> uintptr_t is actually needed, though, since that's not accessed as
> such here. The simplest thing seems to get rid if the union and name
> the inner struct "header", as in the attached.

I pushed this with some comment adjustments.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Apr 8, 2024 at 7:42 PM Pavel Borisov wrote:
>> I pushed both of these and see that mylodon complains that anonymous
>> unions are a C11 feature. I'm not actually sure that the union with
>> uintptr_t is actually needed, though, since that's not accessed as
>> such here. The simplest thing seems to get rid if the union and name
>> the inner struct "header", as in the attached.
>
> Provided uintptr_t is not accessed it might be good to get rid of it.
>
> Maybe this patch also need correction in this:
> +#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber))

For full context the diff was

-#define NUM_FULL_OFFSETS ((sizeof(bitmapword) - sizeof(uint16)) / sizeof(OffsetNumber))
+#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber))

I wanted the former, from f35bd9bf35, to be independently useful (in case the
commit in question had some unresolvable issue), and its intent is to fill
struct padding when the array of bitmapword happens to have length zero.
Changing to uintptr_t for the size calculation reflects the intent to fit in
a (local) pointer, regardless of the size of a bitmapword. (If a DSA pointer
happens to be a different size for some odd platform, it should still work,
BTW.)

My thinking with the union was, for big-endian, to force the 'flags' member
to where it can be set, but thinking again, it should still work if by
happenstance the header was smaller than the child pointer: A different bit
would get tagged, but I believe that's irrelevant. The 'flags' member makes
sure a byte is reserved for the tag, but it may not be where the tag is
actually located, if that makes sense.
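The quoted macro's arithmetic can be checked with a tiny self-contained version (uint16_t stands in for OffsetNumber, which is 2 bytes on all platforms):

```c
#include <stdint.h>

typedef uint16_t OffsetNumber;	/* 2 bytes, as in PostgreSQL's off.h */

/* offsets that fit in a child-pointer-sized header after the one-byte
 * 'flags' and one-byte 'nwords' members */
#define NUM_FULL_OFFSETS \
	((sizeof(uintptr_t) - sizeof(uint8_t) - sizeof(int8_t)) / sizeof(OffsetNumber))
```

This yields (8 - 1 - 1) / 2 = 3 embedded offsets on 64-bit builds and (4 - 1 - 1) / 2 = 1 on 32-bit, matching "up to 3 offsets in the child pointer (1 on 32-bit platforms)" upthread.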
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, 8 Apr 2024 at 16:27, John Naylor wrote: > On Sun, Apr 7, 2024 at 9:08 AM John Naylor > wrote: > > > > I've attached a mostly-polished update on runtime embeddable values, > > storing up to 3 offsets in the child pointer (1 on 32-bit platforms). > > As discussed, this includes a macro to cap max possible offset that > > can be stored in the bitmap, which I believe only reduces the valid > > offset range for 32kB pages on 32-bit platforms. Even there, it allows > > for more line pointers than can possibly be useful. It also splits > > into two parts for readability. It would be committed in two pieces as > > well, since they are independently useful. > > I pushed both of these and see that mylodon complains that anonymous > unions are a C11 feature. I'm not actually sure that the union with > uintptr_t is actually needed, though, since that's not accessed as > such here. The simplest thing seems to get rid if the union and name > the inner struct "header", as in the attached. > Provided uintptr_t is not accessed it might be good to get rid of it. Maybe this patch also need correction in this: +#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint8) - sizeof(int8)) / sizeof(OffsetNumber)) Regards, Pavel
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sun, Apr 7, 2024 at 9:08 AM John Naylor wrote:
>
> I've attached a mostly-polished update on runtime embeddable values,
> storing up to 3 offsets in the child pointer (1 on 32-bit platforms).
> As discussed, this includes a macro to cap max possible offset that
> can be stored in the bitmap, which I believe only reduces the valid
> offset range for 32kB pages on 32-bit platforms. Even there, it allows
> for more line pointers than can possibly be useful. It also splits
> into two parts for readability. It would be committed in two pieces as
> well, since they are independently useful.

I pushed both of these and see that mylodon complains that anonymous unions
are a C11 feature. I'm not actually sure that the union with uintptr_t is
actually needed, though, since that's not accessed as such here. The
simplest thing seems to be to get rid of the union and name the inner struct
"header", as in the attached.

diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index cddbaf013b..78730797d6 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -43,35 +43,30 @@
  */
 typedef struct BlocktableEntry
 {
-	union
+	struct
 	{
-		struct
-		{
 #ifndef WORDS_BIGENDIAN
-			/*
-			 * We need to position this member so that the backing radix tree
-			 * can use the lowest bit for a pointer tag. In particular, it
-			 * must be placed within 'header' so that it corresponds to the
-			 * lowest byte in 'ptr'. We position 'nwords' along with it to
-			 * avoid struct padding.
-			 */
-			uint8		flags;
-
-			int8		nwords;
+		/*
+		 * We need to position this member so that the backing radix tree can
+		 * use the lowest bit for a pointer tag. In particular, it must be
+		 * placed within 'header' so that it corresponds to the lowest byte in
+		 * 'ptr'. We position 'nwords' along with it to avoid struct padding.
+		 */
+		uint8		flags;
+
+		int8		nwords;
 #endif
-			/*
-			 * We can store a small number of offsets here to avoid wasting
-			 * space with a sparse bitmap.
-			 */
-			OffsetNumber full_offsets[NUM_FULL_OFFSETS];
+		/*
+		 * We can store a small number of offsets here to avoid wasting space
+		 * with a sparse bitmap.
+		 */
+		OffsetNumber full_offsets[NUM_FULL_OFFSETS];
 #ifdef WORDS_BIGENDIAN
-			int8		nwords;
-			uint8		flags;
+		int8		nwords;
+		uint8		flags;
 #endif
-		};
-		uintptr_t	ptr;
 	} header;
 
 	bitmapword	words[FLEXIBLE_ARRAY_MEMBER];
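The 'flags' positioning discussed here is about where the tag byte lands inside the struct; the tagging itself is ordinary low-bit pointer tagging, sketched below (illustrative helpers; radixtree.h's actual macros such as RT_CHILDPTR_IS_VALUE differ in detail):

```c
#include <stdint.h>

/* Allocated child nodes are more than 1-byte aligned, so bit 0 of a
 * child pointer is always 0 and can be borrowed as a tag meaning
 * "this slot holds an embedded value, not a pointer". */
static inline uintptr_t
tag_as_value(uintptr_t v)
{
	return v | 1;
}

static inline int
is_embedded_value(uintptr_t slot)
{
	return (slot & 1) != 0;
}

static inline uintptr_t
strip_tag(uintptr_t slot)
{
	return slot & ~(uintptr_t) 1;
}
```

As John notes, it doesn't matter which byte of the struct supplies that low bit on a given endianness, as long as one byte is reserved for it.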
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, John!

On Mon, 8 Apr 2024 at 03:13, John Naylor wrote:
> On Mon, Apr 8, 2024 at 2:07 AM Andres Freund wrote:
> >
> > Looking at the code, the failure isn't surprising anymore:
> > char data[MaxBlocktableEntrySize];
> > BlocktableEntry *page = (BlocktableEntry *) data;
> >
> > 'char' doesn't enforce any alignment, but you're storing a BlocktableEntry in
> > a char[]. You can't just do that. Look at how we do that for
> > e.g. PGAlignedBlock.
> >
> > With the attached minimal fix, the tests pass again.
>
> Thanks, will push this shortly!

Buildfarm animal mylodon looks unhappy with this:

FAILED: src/backend/postgres_lib.a.p/access_common_tidstore.c.o
ccache clang-14 -Isrc/backend/postgres_lib.a.p -Isrc/include -I../pgsql/src/include -I/usr/include/libxml2 -I/usr/include/security -fdiagnostics-color=never -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -O2 -g -fno-strict-aliasing -fwrapv -D_GNU_SOURCE -Wmissing-prototypes -Wpointer-arith -Werror=vla -Werror=unguarded-availability-new -Wendif-labels -Wmissing-format-attribute -Wcast-function-type -Wformat-security -Wdeclaration-after-statement -Wno-unused-command-line-argument -Wno-compound-token-split-by-macro -O1 -ggdb -g3 -fno-omit-frame-pointer -Wall -Wextra -Wno-unused-parameter -Wno-sign-compare -Wno-missing-field-initializers -Wno-array-bounds -std=c99 -Wc11-extensions -Werror=c11-extensions -fPIC -isystem /usr/include/mit-krb5 -pthread -DBUILDING_DLL -MD -MQ src/backend/postgres_lib.a.p/access_common_tidstore.c.o -MF src/backend/postgres_lib.a.p/access_common_tidstore.c.o.d -o src/backend/postgres_lib.a.p/access_common_tidstore.c.o -c ../pgsql/src/backend/access/common/tidstore.c
../pgsql/src/backend/access/common/tidstore.c:48:3: error: anonymous structs are a C11 extension [-Werror,-Wc11-extensions]
  struct
  ^
1 error generated.

Regards,
Pavel Borisov
Supabase
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Apr 8, 2024 at 2:07 AM Andres Freund wrote: > > Looking at the code, the failure isn't surprising anymore: > char data[MaxBlocktableEntrySize]; > BlocktableEntry *page = (BlocktableEntry *) data; > > 'char' doesn't enforce any alignment, but you're storing a BlocktableEntry in > a char[]. You can't just do that. Look at how we do that for > e.g. PGAlignedBlock. > > > With the attached minimal fix, the tests pass again. Thanks, will push this shortly!
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2024-04-01 11:53:28 +0900, Masahiko Sawada wrote: > On Fri, Mar 29, 2024 at 4:21 PM John Naylor wrote: > > I've marked it Ready for Committer. > > Thank you! I've attached the patch that I'm going to push tomorrow. Locally I ran a 32bit build with ubsan enabled (by accident actually), which complains: performing post-bootstrap initialization ... --- stderr --- ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341:24: runtime error: member access within misaligned address 0xffb6258e for type 'struct BlocktableEntry', which requires 4 byte alignment 0xffb6258e: note: pointer points here 00 00 02 00 01 40 dc e9 83 0b 80 48 70 ee 00 00 00 00 00 00 00 01 17 00 00 00 f8 d4 a6 ee e8 25 ^ #0 0x814097e in TidStoreSetBlockOffsets ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341 #1 0x826560a in dead_items_add ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:2889 #2 0x825f8da in lazy_scan_prune ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:1502 #3 0x825da71 in lazy_scan_heap ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:977 #4 0x825ad8f in heap_vacuum_rel ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:499 #5 0x8697e97 in table_relation_vacuum ../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1725 #6 0x869fca6 in vacuum_rel ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:2206 #7 0x869a0fd in vacuum ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:622 #8 0x869986b in ExecVacuum ../../../../../home/andres/src/postgresql/src/backend/commands/vacuum.c:449 #9 0x8e5f832 in standard_ProcessUtility ../../../../../home/andres/src/postgresql/src/backend/tcop/utility.c:859 #10 0x8e5e5f6 in ProcessUtility ../../../../../home/andres/src/postgresql/src/backend/tcop/utility.c:523 #11 0x8e5b71a in PortalRunUtility 
../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:1158 #12 0x8e5be80 in PortalRunMulti ../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:1315 #13 0x8e59f9b in PortalRun ../../../../../home/andres/src/postgresql/src/backend/tcop/pquery.c:791 #14 0x8e4d5f3 in exec_simple_query ../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:1274 #15 0x8e55159 in PostgresMain ../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4680 #16 0x8e54445 in PostgresSingleUserMain ../../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4136 #17 0x88bb55e in main ../../../../../home/andres/src/postgresql/src/backend/main/main.c:194 #18 0xf76f47c4 (/lib/i386-linux-gnu/libc.so.6+0x237c4) (BuildId: fe79efe6681a919714a4e119da2baac3a4953fbf) #19 0xf76f4887 in __libc_start_main (/lib/i386-linux-gnu/libc.so.6+0x23887) (BuildId: fe79efe6681a919714a4e119da2baac3a4953fbf) #20 0x80d40f7 in _start (/srv/dev/build/postgres/m-dev-assert-32/tmp_install/srv/dev/install/postgres/m-dev-assert-32/bin/postgres+0x80d40f7) SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341:24 in Aborted (core dumped) child process exited with exit code 134 initdb: data directory "/srv/dev/build/postgres/m-dev-assert-32/tmp_install/initdb-template" not removed at user's request At first I was confused why CI didn't find this. Turns out that, for me, this is only triggered without compiler optimizations, and I had used -O0 while CI uses some optimizations. 
Backtrace: #9 0x0814097f in TidStoreSetBlockOffsets (ts=0xb8dfde4, blkno=15, offsets=0xffb6275c, num_offsets=11) at ../../../../../home/andres/src/postgresql/src/backend/access/common/tidstore.c:341 #10 0x0826560b in dead_items_add (vacrel=0xb8df6d4, blkno=15, offsets=0xffb6275c, num_offsets=11) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:2889 #11 0x0825f8db in lazy_scan_prune (vacrel=0xb8df6d4, buf=24, blkno=15, page=0xeeb6c000 "", vmbuffer=729, all_visible_according_to_vm=false, has_lpdead_items=0xffb62a1f) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:1502 #12 0x0825da72 in lazy_scan_heap (vacrel=0xb8df6d4) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:977 #13 0x0825ad90 in heap_vacuum_rel (rel=0xb872810, params=0xffb62e90, bstrategy=0xb99d5e0) at ../../../../../home/andres/src/postgresql/src/backend/access/heap/vacuumlazy.c:499 #14 0x08697e98 in table_relation_vacuum (rel=0xb872810, params=0xffb62e90, bstrategy=0xb99d5e0) at ../../../../../home/andres/src/postgresql/src/include/access/tableam.h:1725 #15 0x0869fca7 in vacuum_rel (relid=1249, relation=0x0, params=0xff
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sun, Apr 7, 2024 at 9:08 AM John Naylor wrote: > I've attached a mostly-polished update on runtime embeddable values, > storing up to 3 offsets in the child pointer (1 on 32-bit platforms). And...since there's a new bump context patch, I wanted to anticipate squeezing an update on top of that, if that gets committed. 0004/5 are the v6 bump context, and 0006 uses it for vacuum. The rest are to show it works -- the expected.out changes make possible problems in CI easier to see. The allocation size is 16 bytes, so this difference is entirely due to lack of chunk header: aset: 6619136 bump: 5047296 (Note: assert builds still have the chunk header for sanity checking, so this was done in a more optimized build) From e2333bad47974a22120a4de3fad804a21757f631 Mon Sep 17 00:00:00 2001 From: David Rowley Date: Tue, 20 Feb 2024 21:49:57 +1300 Subject: [PATCH v84 5/8] Introduce a bump memory allocator --- src/backend/nodes/gen_node_support.pl | 2 +- src/backend/utils/mmgr/Makefile | 1 + src/backend/utils/mmgr/bump.c | 818 ++ src/backend/utils/mmgr/mcxt.c | 15 +- src/backend/utils/mmgr/meson.build| 1 + src/include/nodes/memnodes.h | 3 +- src/include/utils/memutils.h | 7 + src/include/utils/memutils_internal.h | 18 +- src/tools/pgindent/typedefs.list | 2 + 9 files changed, 862 insertions(+), 5 deletions(-) create mode 100644 src/backend/utils/mmgr/bump.c diff --git a/src/backend/nodes/gen_node_support.pl b/src/backend/nodes/gen_node_support.pl index d4244facbb..81df3bdf95 100644 --- a/src/backend/nodes/gen_node_support.pl +++ b/src/backend/nodes/gen_node_support.pl @@ -149,7 +149,7 @@ my @abstract_types = qw(Node); # they otherwise don't participate in node support. 
my @extra_tags = qw( IntList OidList XidList - AllocSetContext GenerationContext SlabContext + AllocSetContext GenerationContext SlabContext BumpContext TIDBitmap WindowObjectData ); diff --git a/src/backend/utils/mmgr/Makefile b/src/backend/utils/mmgr/Makefile index dae3432c98..01a1fb8527 100644 --- a/src/backend/utils/mmgr/Makefile +++ b/src/backend/utils/mmgr/Makefile @@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global OBJS = \ alignedalloc.o \ aset.o \ + bump.o \ dsa.o \ freepage.o \ generation.o \ diff --git a/src/backend/utils/mmgr/bump.c b/src/backend/utils/mmgr/bump.c new file mode 100644 index 00..f98a203a0c --- /dev/null +++ b/src/backend/utils/mmgr/bump.c @@ -0,0 +1,818 @@ +/*- + * + * bump.c + * Bump allocator definitions. + * + * Bump is a MemoryContext implementation designed for memory usages which + * require allocating a large number of chunks, none of which ever need to be + * pfree'd or realloc'd. Chunks allocated by this context have no chunk header + * and operations which ordinarily require looking at the chunk header cannot + * be performed. For example, pfree, realloc, GetMemoryChunkSpace and + * GetMemoryChunkContext are all not possible with bump allocated chunks. The + * only way to release memory allocated by this context type is to reset or + * delete the context. + * + * Portions Copyright (c) 2023, PostgreSQL Global Development Group + * + * IDENTIFICATION + * src/backend/utils/mmgr/bump.c + * + * + * Bump is best suited to cases which require a large number of short-lived + * chunks where performance matters. Because bump allocated chunks don't + * have a chunk header, it can fit more chunks on each block. This means we + * can do more with less memory and fewer cache lines. The reason it's best + * suited for short-lived usages of memory is that ideally, pointers to bump + * allocated chunks won't be visible to a large amount of code. 
The more + * code that operates on memory allocated by this allocator, the more chances + * that some code will try to perform a pfree or one of the other operations + * which are made impossible due to the lack of chunk header. In order to + * to detect accidental usage of the various disallowed operations, we do add + * a MemoryChunk chunk header in MEMORY_CONTEXT_CHECKING builds and have the + * various disallowed functions raise an ERROR. + * + * Allocations are MAXALIGNed. + * + *- + */ + +#include "postgres.h" + +#include "lib/ilist.h" +#include "port/pg_bitutils.h" +#include "utils/memdebug.h" +#include "utils/memutils.h" +#include "utils/memutils_memorychunk.h" +#include "utils/memutils_internal.h" + +#define Bump_BLOCKHDRSZ MAXALIGN(sizeof(BumpBlock)) + +/* No chunk header unless built with MEMORY_CONTEXT_CHECKING */ +#ifdef MEMORY_CONTEXT_CHECKING +#define Bump_CHUNKHDRSZ sizeof(MemoryChunk) +#else +#define Bump_CHUNKHDRSZ 0 +#endif + +#define Bump_CHUNK_FRACTION 8 + +/* The keeper block is allocated in the same allocation as the set */ +#define KeeperBlock(set) ((BumpBlock *) ((char
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Apr 1, 2024 at 9:54 AM Masahiko Sawada wrote: > > Thank you! I've attached the patch that I'm going to push tomorrow. Excellent! I've attached a mostly-polished update on runtime embeddable values, storing up to 3 offsets in the child pointer (1 on 32-bit platforms). As discussed, this includes a macro to cap max possible offset that can be stored in the bitmap, which I believe only reduces the valid offset range for 32kB pages on 32-bit platforms. Even there, it allows for more line pointers than can possibly be useful. It also splits into two parts for readability. It would be committed in two pieces as well, since they are independently useful. From 24bd672deb4a6fa14abfc8583b500d1cbc332032 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Fri, 5 Apr 2024 17:04:12 +0700 Subject: [PATCH v83 1/3] store offsets in the header --- src/backend/access/common/tidstore.c | 52 +++ .../test_tidstore/expected/test_tidstore.out | 14 + .../test_tidstore/sql/test_tidstore.sql | 5 ++ 3 files changed, 71 insertions(+) diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c index e1a7e82469..4eb5d46951 100644 --- a/src/backend/access/common/tidstore.c +++ b/src/backend/access/common/tidstore.c @@ -34,6 +34,9 @@ /* number of active words for a page: */ #define WORDS_PER_PAGE(n) ((n) / BITS_PER_BITMAPWORD + 1) +/* number of offsets we can store in the header of a BlocktableEntry */ +#define NUM_FULL_OFFSETS ((sizeof(uintptr_t) - sizeof(uint16)) / sizeof(OffsetNumber)) + /* * This is named similarly to PagetableEntry in tidbitmap.c * because the two have a similar function. @@ -41,6 +44,10 @@ typedef struct BlocktableEntry { uint16 nwords; + + /* We can store a small number of offsets here to avoid wasting space with a sparse bitmap. 
*/ + OffsetNumber full_offsets[NUM_FULL_OFFSETS]; + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; } BlocktableEntry; #define MaxBlocktableEntrySize \ @@ -316,6 +323,25 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, for (int i = 1; i < num_offsets; i++) Assert(offsets[i] > offsets[i - 1]); + memset(page, 0, offsetof(BlocktableEntry, words)); + + if (num_offsets <= NUM_FULL_OFFSETS) + { + for (int i = 0; i < num_offsets; i++) + { + OffsetNumber off = offsets[i]; + + /* safety check to ensure we don't overrun bit array bounds */ + if (!OffsetNumberIsValid(off)) +elog(ERROR, "tuple offset out of range: %u", off); + + page->full_offsets[i] = off; + } + + page->nwords = 0; + } + else + { for (wordnum = 0, next_word_threshold = BITS_PER_BITMAPWORD; wordnum <= WORDNUM(offsets[num_offsets - 1]); wordnum++, next_word_threshold += BITS_PER_BITMAPWORD) @@ -343,6 +369,7 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, page->nwords = wordnum; Assert(page->nwords == WORDS_PER_PAGE(offsets[num_offsets - 1])); +} if (TidStoreIsShared(ts)) shared_ts_set(ts->tree.shared, blkno, page); @@ -369,6 +396,18 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) if (page == NULL) return false; + if (page->nwords == 0) + { + /* we have offsets in the header */ + for (int i = 0; i < NUM_FULL_OFFSETS; i++) + { + if (page->full_offsets[i] == off) +return true; + } + return false; + } + else + { wordnum = WORDNUM(off); bitnum = BITNUM(off); @@ -378,6 +417,7 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) return (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0; } +} /* * Prepare to iterate through a TidStore. 
@@ -496,6 +536,17 @@ tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno, result->num_offsets = 0; result->blkno = blkno; + if (page->nwords == 0) + { + /* we have offsets in the header */ + for (int i = 0; i < NUM_FULL_OFFSETS; i++) + { + if (page->full_offsets[i] != InvalidOffsetNumber) +result->offsets[result->num_offsets++] = page->full_offsets[i]; + } + } + else + { for (wordnum = 0; wordnum < page->nwords; wordnum++) { bitmapword w = page->words[wordnum]; @@ -518,3 +569,4 @@ tidstore_iter_extract_tids(TidStoreIter *iter, BlockNumber blkno, } } } +} diff --git a/src/test/modules/test_tidstore/expected/test_tidstore.out b/src/test/modules/test_tidstore/expected/test_tidstore.out index 0ae2f970da..c40779859b 100644 --- a/src/test/modules/test_tidstore/expected/test_tidstore.out +++ b/src/test/modules/test_tidstore/expected/test_tidstore.out @@ -36,6 +36,20 @@ SELECT do_set_block_offsets(blk, array_agg(off)::int2[]) (VALUES (0), (1), (:maxblkno / 2), (:maxblkno - 1), (:maxblkno)) AS blocks(blk), (VALUES (1), (2), (:maxoffset / 2), (:maxoffset - 1), (:maxoffset)) AS offsets(off) GROUP BY blk; +-- Test offsets embedded in the bitmap header. +SELECT do_set_block_offsets(501, array[greatest((random() * :maxoffset)::int, 1)]::int2[]); + do_set_block_offsets +-- + 501 +(1 row) + +S
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 29, 2024 at 4:21 PM John Naylor wrote: > > On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada > wrote: > > I think the patch is in good shape. Do you have other comments or > > suggestions, John? > > --- a/doc/src/sgml/config.sgml > +++ b/doc/src/sgml/config.sgml > @@ -1918,11 +1918,6 @@ include_dir 'conf.d' > too high. It may be useful to control for this by separately > setting . > > - > -Note that for the collection of dead tuple identifiers, > -VACUUM is only able to utilize up to a maximum of > -1GB of memory. > - > > > > This is mentioned twice for two different GUCs -- need to remove the > other one, too. Good catch, removed. > Other than that, I just have minor nits: > > - * The major space usage for vacuuming is storage for the array of dead TIDs > + * The major space usage for vacuuming is TID store, a storage for dead TIDs > > I think I've helped edit this sentence before, but I still don't quite > like it. I'm thinking now "is storage for the dead tuple IDs". > > - * set upper bounds on the number of TIDs we can keep track of at once. > + * set upper bounds on the maximum memory that can be used for keeping track > + * of dead TIDs at once. > > I think "maximum" is redundant with "upper bounds". Fixed. > > I also feel the commit message needs more "meat" -- we need to clearly > narrate the features and benefits. I've attached how I would write it, > but feel free to use what you like to match your taste. Well, that's much better than mine. > > I've marked it Ready for Committer. Thank you! I've attached the patch that I'm going to push tomorrow. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v82-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada wrote: > I think the patch is in good shape. Do you have other comments or > suggestions, John? --- a/doc/src/sgml/config.sgml +++ b/doc/src/sgml/config.sgml @@ -1918,11 +1918,6 @@ include_dir 'conf.d' too high. It may be useful to control for this by separately setting . - -Note that for the collection of dead tuple identifiers, -VACUUM is only able to utilize up to a maximum of -1GB of memory. - This is mentioned twice for two different GUCs -- need to remove the other one, too. Other than that, I just have minor nits: - * The major space usage for vacuuming is storage for the array of dead TIDs + * The major space usage for vacuuming is TID store, a storage for dead TIDs I think I've helped edit this sentence before, but I still don't quite like it. I'm thinking now "is storage for the dead tuple IDs". - * set upper bounds on the number of TIDs we can keep track of at once. + * set upper bounds on the maximum memory that can be used for keeping track + * of dead TIDs at once. I think "maximum" is redundant with "upper bounds". I also feel the commit message needs more "meat" -- we need to clearly narrate the features and benefits. I've attached how I would write it, but feel free to use what you like to match your taste. I've marked it Ready for Committer. Use TID Store for storage of dead tuple IDs during lazy vacuum Previously, we used a simple array for storing dead tuple IDs during lazy vacuum, which had a number of problems: * The array used a single allocation and so was limited to 1GB. * The allocation was pessimistically sized according to table size. * Lookup with binary search was slow because of poor CPU cache and branch prediction behavior. This commit replaces that array with the TID store from commit XX. Since the backing radix tree makes small allocations as needed, the 1GB limit is now gone. 
Further, the total memory used is now often smaller by an order of magnitude or more, depending on the distribution of blocks and offsets. These two features should make multiple rounds of heap scanning and index cleanup an extremely rare event. TID lookup during index cleanup is also several times faster, even more so when index order is correlated with heap tuple order. Since there is no longer a predictable relationship between the number of dead tuples vacuumed and the space taken up by their TIDs, the number of tuples no longer provides any meaningful insights for users, nor is the maximum number predictable. For that reason this commit also changes to byte-based progress reporting, with the relevant columns of pg_stat_progress_vacuum renamed accordingly to max_dead_tuple_bytes and dead_tuple_bytes. For parallel vacuum, both the TID store and supplemental information specific to vacuum are shared among the parallel vacuum workers. As with the previous array, we don't take any locks on TidStore during parallel vacuum since writes are still only done by the leader process. XXX: bump catalog version Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 28, 2024 at 6:15 PM John Naylor wrote: > > On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada > wrote: > > > > Pushed the refactoring patch. > > > > I've attached the rebased vacuum improvement patch for cfbot. I > > mentioned in the commit message that this patch eliminates the 1GB > > limitation. > > > > I think the patch is in good shape. Do you have other comments or > > suggestions, John? > > I'll do another pass tomorrow, but first I wanted to get in another > slightly-challenging in-situ test. Thanks! > > About the "tuples missed" -- I didn't expect contention during this > test. I believe that's completely unrelated behavior, but wanted to > mention it anyway, since I found it confusing. I haven't investigated it enough, but bgwriter might be related to the contention. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 28, 2024 at 12:55 PM Masahiko Sawada wrote: > > Pushed the refactoring patch. > > I've attached the rebased vacuum improvement patch for cfbot. I > mentioned in the commit message that this patch eliminates the 1GB > limitation. > > I think the patch is in good shape. Do you have other comments or > suggestions, John? I'll do another pass tomorrow, but first I wanted to get in another slightly-challenging in-situ test. On my humble laptop, I can still fit a table large enough to cause PG16 to choke on multiple rounds of index cleanup: drop table if exists test; create unlogged table test (a int, b uuid) with (autovacuum_enabled=false); insert into test (a,b) select i, gen_random_uuid() from generate_series(1,1000*1000*1000) i; create index on test (a); create index on test (b); delete from test; vacuum (verbose, truncate off, parallel 2) test; INFO: vacuuming "john.public.test" INFO: launched 1 parallel vacuum worker for index vacuuming (planned: 1) INFO: finished vacuuming "john.public.test": index scans: 1 pages: 0 removed, 6369427 remain, 6369427 scanned (100.00% of total) tuples: 97174 removed, 2826 remain, 0 are dead but not yet removable tuples missed: 2826 dead from 18 pages not removed due to cleanup lock contention removable cutoff: 771, which was 0 XIDs old when operation ended new relfrozenxid: 767, which is 4 XIDs ahead of previous value frozen: 0 pages from table (0.00% of total) had 0 tuples frozen index scan needed: 6369409 pages from table (100.00% of total) had 97174 dead item identifiers removed index "test_a_idx": pages: 2741898 in total, 2741825 newly deleted, 2741825 currently deleted, 0 reusable index "test_b_idx": pages: 3850387 in total, 3842056 newly deleted, 3842056 currently deleted, 0 reusable avg read rate: 159.740 MB/s, avg write rate: 161.726 MB/s buffer usage: 26367981 hits, 14958634 misses, 15144601 dirtied WAL usage: 3 records, 1 full page images, 2050 bytes system usage: CPU: user: 151.89 s, system: 193.54 s, elapsed: 
731.59 s Watching pg_stat_progress_vacuum, dead_tuple_bytes got up to 398458880. About the "tuples missed" -- I didn't expect contention during this test. I believe that's completely unrelated behavior, but wanted to mention it anyway, since I found it confusing.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 27, 2024 at 5:43 PM Masahiko Sawada wrote: > > On Wed, Mar 27, 2024 at 9:25 AM John Naylor wrote: > > > > On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada > > wrote: > > > > > > On Mon, Mar 25, 2024 at 3:25 PM John Naylor > > > wrote: > > > > > > > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada > > > > wrote: > > > > > > - * remaining LP_DEAD line pointers on the page in the dead_items > > > > - * array. These dead items include those pruned by lazy_scan_prune() > > > > - * as well we line pointers previously marked LP_DEAD. > > > > + * remaining LP_DEAD line pointers on the page in the dead_items. > > > > + * These dead items include those pruned by lazy_scan_prune() as well > > > > + * we line pointers previously marked LP_DEAD. > > > > > > > > Here maybe "into dead_items". > > > > - * remaining LP_DEAD line pointers on the page in the dead_items. > > + * remaining LP_DEAD line pointers on the page into the dead_items. > > > > Let me explain. It used to be "in the dead_items array." It is not an > > array anymore, so it was changed to "in the dead_items". dead_items is > > a variable name, and names don't take "the". "into dead_items" seems > > most natural to me, but there are other possible phrasings. > > Thanks for the explanation. I was distracted. Fixed in the latest patch. > > > > > > > > > Did you try it with 1MB m_w_m? > > > > > > > > > > I've incorporated the above comments and test results look good to me. > > > > > > > > Could you be more specific about what the test was? > > > > Does it work with 1MB m_w_m? > > > > > > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB. > > > > > > FYI other test cases I tested were: > > > > > > * m_w_m = 2199023254528 (maximum value) > > > initial: 1MB > > > max: 128GB > > > > > > * m_w_m = 64MB (default) > > > initial: 1MB > > > max: 8MB > > > > If the test was a vacuum, how big a table was needed to hit 128GB? 
> > I just checked how TIdStoreCreateLocal() calculated the initial and > max segment sizes while changing m_w_m, so didn't check how big > segments are actually allocated in the maximum value test case. > > > > > > > The existing comment slipped past my radar, but max_bytes is not a > > > > limit, it's a hint. Come to think of it, it never was a limit in the > > > > normal sense, but in earlier patches it was the criteria for reporting > > > > "I'm full" when asked. > > > > > > Updated the comment. > > > > + * max_bytes is not a limit; it's used to choose the memory block sizes of > > + * a memory context for TID storage in order for the total memory > > consumption > > + * not to be overshot a lot. The caller can use the max_bytes as the > > criteria > > + * for reporting whether it's full or not. > > > > This is good information. I suggest this edit: > > > > "max_bytes" is not an internally-enforced limit; it is used only as a > > hint to cap the memory block size of the memory context for TID > > storage. This reduces space wastage due to over-allocation. If the > > caller wants to monitor memory usage, it must compare its limit with > > the value reported by TidStoreMemoryUsage(). > > > > Other comments: > > Thanks for the suggestion! > > > > > v79-0002 looks good to me. > > > > v79-0003: > > > > "With this commit, when creating a shared TidStore, a dedicated DSA > > area is created for TID storage instead of using the provided DSA > > area." > > > > This is very subtle, but "the provided..." implies there still is one. > > -> "a provided..." > > > > + * Similar to TidStoreCreateLocal() but create a shared TidStore on a > > + * DSA area. The TID storage will live in the DSA area, and a memory > > + * context rt_context will have only meta data of the radix tree. > > > > -> "the memory context" > > Fixed in the latest patch. > > > > > I think you can go ahead and commit 0002 and 0003/4. 
> > I've pushed the 0002 (dsa init and max segment size) patch, and will > push the attached 0001 patch next. Pushed the refactoring patch. I've attached the rebased vacuum improvement patch for cfbot. I mentioned in the commit message that this patch eliminates the 1GB limitation. I think the patch is in good shape. Do you have other comments or suggestions, John? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v81-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 27, 2024 at 9:25 AM John Naylor wrote: > > On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada wrote: > > > > On Mon, Mar 25, 2024 at 3:25 PM John Naylor wrote: > > > > > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada > > > wrote: > > > > - * remaining LP_DEAD line pointers on the page in the dead_items > > > - * array. These dead items include those pruned by lazy_scan_prune() > > > - * as well we line pointers previously marked LP_DEAD. > > > + * remaining LP_DEAD line pointers on the page in the dead_items. > > > + * These dead items include those pruned by lazy_scan_prune() as well > > > + * we line pointers previously marked LP_DEAD. > > > > > > Here maybe "into dead_items". > > - * remaining LP_DEAD line pointers on the page in the dead_items. > + * remaining LP_DEAD line pointers on the page into the dead_items. > > Let me explain. It used to be "in the dead_items array." It is not an > array anymore, so it was changed to "in the dead_items". dead_items is > a variable name, and names don't take "the". "into dead_items" seems > most natural to me, but there are other possible phrasings. Thanks for the explanation. I was distracted. Fixed in the latest patch. > > > > > > Did you try it with 1MB m_w_m? > > > > > > > > I've incorporated the above comments and test results look good to me. > > > > > > Could you be more specific about what the test was? > > > Does it work with 1MB m_w_m? > > > > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB. > > > > FYI other test cases I tested were: > > > > * m_w_m = 2199023254528 (maximum value) > > initial: 1MB > > max: 128GB > > > > * m_w_m = 64MB (default) > > initial: 1MB > > max: 8MB > > If the test was a vacuum, how big a table was needed to hit 128GB? I just checked how TIdStoreCreateLocal() calculated the initial and max segment sizes while changing m_w_m, so didn't check how big segments are actually allocated in the maximum value test case. 
> > > > The existing comment slipped past my radar, but max_bytes is not a > > > limit, it's a hint. Come to think of it, it never was a limit in the > > > normal sense, but in earlier patches it was the criteria for reporting > > > "I'm full" when asked. > > > > Updated the comment. > > + * max_bytes is not a limit; it's used to choose the memory block sizes of > + * a memory context for TID storage in order for the total memory consumption > + * not to be overshot a lot. The caller can use the max_bytes as the criteria > + * for reporting whether it's full or not. > > This is good information. I suggest this edit: > > "max_bytes" is not an internally-enforced limit; it is used only as a > hint to cap the memory block size of the memory context for TID > storage. This reduces space wastage due to over-allocation. If the > caller wants to monitor memory usage, it must compare its limit with > the value reported by TidStoreMemoryUsage(). > > Other comments: Thanks for the suggestion! > > v79-0002 looks good to me. > > v79-0003: > > "With this commit, when creating a shared TidStore, a dedicated DSA > area is created for TID storage instead of using the provided DSA > area." > > This is very subtle, but "the provided..." implies there still is one. > -> "a provided..." > > + * Similar to TidStoreCreateLocal() but create a shared TidStore on a > + * DSA area. The TID storage will live in the DSA area, and a memory > + * context rt_context will have only meta data of the radix tree. > > -> "the memory context" Fixed in the latest patch. > > I think you can go ahead and commit 0002 and 0003/4. I've pushed the 0002 (dsa init and max segment size) patch, and will push the attached 0001 patch next. 
> > v79-0005: > > - bypass = (vacrel->lpdead_item_pages < threshold && > - vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L)); > + bypass = (vacrel->lpdead_item_pages < threshold) && > + TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L); > > The parentheses look strange, and the first line shouldn't change > without a good reason. Fixed. > > - /* Set dead_items space */ > - dead_items = (VacDeadItems *) shm_toc_lookup(toc, > - PARALLEL_VACUUM_KEY_DEAD_ITEMS, > - false); > + /* Set dead items */ > + dead_items = TidStoreAttach(shared->dead_items_dsa_handle, > + shared->dead_items_handle); > > I feel ambivalent about this comment change. The original is not very > descriptive to begin with. If we need to change at all, maybe "find > dead_items in shared memory"? Agreed. > > v79-0005: As I said earlier, Dilip Kumar reviewed an earlier version. > > v79-0006: > > vac_work_mem should also go back to being an int. Fixed. I've attached the latest patches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v80-0001-Rethink-create-and-attach-APIs-of-shared-TidStor.patch Description: Binary data v80-0002-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 25, 2024 at 8:07 PM Masahiko Sawada wrote: > > On Mon, Mar 25, 2024 at 3:25 PM John Naylor wrote: > > > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada > > wrote: > > - * remaining LP_DEAD line pointers on the page in the dead_items > > - * array. These dead items include those pruned by lazy_scan_prune() > > - * as well we line pointers previously marked LP_DEAD. > > + * remaining LP_DEAD line pointers on the page in the dead_items. > > + * These dead items include those pruned by lazy_scan_prune() as well > > + * we line pointers previously marked LP_DEAD. > > > > Here maybe "into dead_items". - * remaining LP_DEAD line pointers on the page in the dead_items. + * remaining LP_DEAD line pointers on the page into the dead_items. Let me explain. It used to be "in the dead_items array." It is not an array anymore, so it was changed to "in the dead_items". dead_items is a variable name, and names don't take "the". "into dead_items" seems most natural to me, but there are other possible phrasings. > > > > Did you try it with 1MB m_w_m? > > > > > > I've incorporated the above comments and test results look good to me. > > > > Could you be more specific about what the test was? > > Does it work with 1MB m_w_m? > > If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB. > > FYI other test cases I tested were: > > * m_w_m = 2199023254528 (maximum value) > initial: 1MB > max: 128GB > > * m_w_m = 64MB (default) > initial: 1MB > max: 8MB If the test was a vacuum, how big a table was needed to hit 128GB? > > The existing comment slipped past my radar, but max_bytes is not a > > limit, it's a hint. Come to think of it, it never was a limit in the > > normal sense, but in earlier patches it was the criteria for reporting > > "I'm full" when asked. > > Updated the comment. 
+ * max_bytes is not a limit; it's used to choose the memory block sizes of + * a memory context for TID storage in order for the total memory consumption + * not to be overshot a lot. The caller can use the max_bytes as the criteria + * for reporting whether it's full or not. This is good information. I suggest this edit: "max_bytes" is not an internally-enforced limit; it is used only as a hint to cap the memory block size of the memory context for TID storage. This reduces space wastage due to over-allocation. If the caller wants to monitor memory usage, it must compare its limit with the value reported by TidStoreMemoryUsage(). Other comments: v79-0002 looks good to me. v79-0003: "With this commit, when creating a shared TidStore, a dedicated DSA area is created for TID storage instead of using the provided DSA area." This is very subtle, but "the provided..." implies there still is one. -> "a provided..." + * Similar to TidStoreCreateLocal() but create a shared TidStore on a + * DSA area. The TID storage will live in the DSA area, and a memory + * context rt_context will have only meta data of the radix tree. -> "the memory context" I think you can go ahead and commit 0002 and 0003/4. v79-0005: - bypass = (vacrel->lpdead_item_pages < threshold && - vacrel->lpdead_items < MAXDEADITEMS(32L * 1024L * 1024L)); + bypass = (vacrel->lpdead_item_pages < threshold) && + TidStoreMemoryUsage(vacrel->dead_items) < (32L * 1024L * 1024L); The parentheses look strange, and the first line shouldn't change without a good reason. - /* Set dead_items space */ - dead_items = (VacDeadItems *) shm_toc_lookup(toc, - PARALLEL_VACUUM_KEY_DEAD_ITEMS, - false); + /* Set dead items */ + dead_items = TidStoreAttach(shared->dead_items_dsa_handle, + shared->dead_items_handle); I feel ambivalent about this comment change. The original is not very descriptive to begin with. If we need to change at all, maybe "find dead_items in shared memory"? 
v79-0005: As I said earlier, Dilip Kumar reviewed an earlier version. v79-0006: vac_work_mem should also go back to being an int.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 25, 2024 at 3:25 PM John Naylor wrote: > > On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada > wrote: > > > > On Thu, Mar 21, 2024 at 7:48 PM John Naylor wrote: > > > > v77-0001 > > > > > > - dead_items = (VacDeadItems *) > > > palloc(vac_max_items_to_alloc_size(max_items)); > > > - dead_items->max_items = max_items; > > > - dead_items->num_items = 0; > > > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); > > > + > > > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); > > > + dead_items_info->max_bytes = vac_work_mem * 1024L; > > > > > > This is confusing enough that it looks like a bug: > > > > > > [inside TidStoreCreate()] > > > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ > > > while (16 * maxBlockSize > max_bytes * 1024L) > > > maxBlockSize >>= 1; > > > > > > This was copied from CreateWorkExprContext, which operates directly on > > > work_mem -- if the parameter is actually bytes, we can't "* 1024" > > > here. If we're passing something measured in kilobytes, the parameter > > > is badly named. Let's use convert once and use bytes everywhere. > > > > True. The attached 0001 patch fixes it. > > v78-0001 and 02 are fine, but for 0003 there is a consequence that I > didn't see mentioned: I think that the fix done in 0001 patch can be merged into 0003 patch. > vac_work_mem now refers to bytes, where before > it referred to kilobytes. It seems pretty confusing to use a different > convention from elsewhere, especially if it has the same name but > different meaning across versions. Worse, this change is buried inside > a moving-stuff-around diff, making it hard to see. Maybe "convert only > once" is still possible, but I was actually thinking of > > + dead_items_info->max_bytes = vac_work_mem * 1024L; > + vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0); > > That way it's pretty obvious that it's correct. 
That may require a bit > of duplication and moving around for shmem, but there is some of that > already. Agreed. > > More on 0003: > > - * The major space usage for vacuuming is storage for the array of dead TIDs > + * The major space usage for vacuuming is TidStore, a storage for dead TIDs > > + * autovacuum_work_mem) memory space to keep track of dead TIDs. If the > + * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to > vacuum > > I wonder if the comments here should refer to it using a more natural > spelling, like "TID store". > > - * items in the dead_items array for later vacuuming, count live and > + * items in the dead_items for later vacuuming, count live and > > Maybe "the dead_items area", or "the dead_items store" or "in dead_items"? > > - * remaining LP_DEAD line pointers on the page in the dead_items > - * array. These dead items include those pruned by lazy_scan_prune() > - * as well we line pointers previously marked LP_DEAD. > + * remaining LP_DEAD line pointers on the page in the dead_items. > + * These dead items include those pruned by lazy_scan_prune() as well > + * we line pointers previously marked LP_DEAD. > > Here maybe "into dead_items". > > Also, "we line pointers" seems to be a pre-existing typo. > > - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", > - vacrel->relname, (long long) index, vacuumed_pages))); > + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers > in %u pages", > + vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages))); > > This is a translated message, so let's keep the message the same. > > /* > * Allocate dead_items (either using palloc, or in dynamic shared memory). > * Sets dead_items in vacrel for caller. > * > * Also handles parallel initialization as part of allocating dead_items in > * DSM when required. > */ > static void > dead_items_alloc(LVRelState *vacrel, int nworkers) > > This comment didn't change at all. 
It's not wrong, but let's consider > updating the specifics. Fixed above comments. > v78-0005: > > "Although commit XXX > allowed specifying the initial and maximum DSA segment sizes, callers > still needed to clamp their own limits, which was not consistent and > user-friendly." > > Perhaps s/still needed/would have needed/ ..., since we're preventing > that necessity. > > > > Did you try it with 1MB m_w_m? > > > > I've incorporated the above comments and test results look good to me. > > Could you be more specific about what the test was? > Does it work with 1MB m_w_m? If m_w_m is 1MB, both the initial and maximum segment sizes are 256kB. FYI other test cases I tested were: * m_w_m = 2199023254528 (maximum value) initial: 1MB max: 128GB * m_w_m = 64MB (default) initial: 1MB max: 8MB > > + /* > + * Choose the initial and maximum DSA segment sizes to be no longer > + * than 1/16 and 1/8 of max_bytes, respectively. If the initial > + * segment size is low, we end up having many segments, which risks > + * exceeding the total number of segments the
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 22, 2024 at 12:20 PM Masahiko Sawada wrote: > > On Thu, Mar 21, 2024 at 7:48 PM John Naylor wrote: > > v77-0001 > > > > - dead_items = (VacDeadItems *) > > palloc(vac_max_items_to_alloc_size(max_items)); > > - dead_items->max_items = max_items; > > - dead_items->num_items = 0; > > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); > > + > > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); > > + dead_items_info->max_bytes = vac_work_mem * 1024L; > > > > This is confusing enough that it looks like a bug: > > > > [inside TidStoreCreate()] > > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ > > while (16 * maxBlockSize > max_bytes * 1024L) > > maxBlockSize >>= 1; > > > > This was copied from CreateWorkExprContext, which operates directly on > > work_mem -- if the parameter is actually bytes, we can't "* 1024" > > here. If we're passing something measured in kilobytes, the parameter > > is badly named. Let's use convert once and use bytes everywhere. > > True. The attached 0001 patch fixes it. v78-0001 and 02 are fine, but for 0003 there is a consequence that I didn't see mentioned: vac_work_mem now refers to bytes, where before it referred to kilobytes. It seems pretty confusing to use a different convention from elsewhere, especially if it has the same name but different meaning across versions. Worse, this change is buried inside a moving-stuff-around diff, making it hard to see. Maybe "convert only once" is still possible, but I was actually thinking of + dead_items_info->max_bytes = vac_work_mem * 1024L; + vacrel->dead_items = TidStoreCreate(dead_items_info->max_bytes, NULL, 0); That way it's pretty obvious that it's correct. That may require a bit of duplication and moving around for shmem, but there is some of that already. 
More on 0003: - * The major space usage for vacuuming is storage for the array of dead TIDs + * The major space usage for vacuuming is TidStore, a storage for dead TIDs + * autovacuum_work_mem) memory space to keep track of dead TIDs. If the + * TidStore is full, we must call lazy_vacuum to vacuum indexes (and to vacuum I wonder if the comments here should refer to it using a more natural spelling, like "TID store". - * items in the dead_items array for later vacuuming, count live and + * items in the dead_items for later vacuuming, count live and Maybe "the dead_items area", or "the dead_items store" or "in dead_items"? - * remaining LP_DEAD line pointers on the page in the dead_items - * array. These dead items include those pruned by lazy_scan_prune() - * as well we line pointers previously marked LP_DEAD. + * remaining LP_DEAD line pointers on the page in the dead_items. + * These dead items include those pruned by lazy_scan_prune() as well + * we line pointers previously marked LP_DEAD. Here maybe "into dead_items". Also, "we line pointers" seems to be a pre-existing typo. - (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", - vacrel->relname, (long long) index, vacuumed_pages))); + (errmsg("table \"%s\": removed " INT64_FORMAT "dead item identifiers in %u pages", + vacrel->relname, vacrel->dead_items_info->num_items, vacuumed_pages))); This is a translated message, so let's keep the message the same. /* * Allocate dead_items (either using palloc, or in dynamic shared memory). * Sets dead_items in vacrel for caller. * * Also handles parallel initialization as part of allocating dead_items in * DSM when required. */ static void dead_items_alloc(LVRelState *vacrel, int nworkers) This comment didn't change at all. It's not wrong, but let's consider updating the specifics. 
v78-0004: > > +#define dsa_create(tranch_id) \ > > + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE) > > > > Since these macros are now referring to defaults, maybe their name > > should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE > > (*_MAX_*) > > It makes sense to rename DSA_INITIAL_SEGMENT_SIZE , but I think that > the DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current > name also makes sense to me. Right, that makes sense. v78-0005: "Although commit XXX allowed specifying the initial and maximum DSA segment sizes, callers still needed to clamp their own limits, which was not consistent and user-friendly." Perhaps s/still needed/would have needed/ ..., since we're preventing that necessity. > > Did you try it with 1MB m_w_m? > > I've incorporated the above comments and test results look good to me. Could you be more specific about what the test was? Does it work with 1MB m_w_m? + /* + * Choose the initial and maximum DSA segment sizes to be no longer + * than 1/16 and 1/8 of max_bytes, respectively. If the initial + * segment size is low, we end up having many segments, which risks + * exceeding the total number of segments the platform can have. The second sentence is technically correct, but I'm not sure how it relates to the cod
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 25, 2024 at 10:13 AM Tom Lane wrote: > > Masahiko Sawada writes: > > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane wrote: > >> I think the point here is that if you start with an arbitrary > >> non-negative shift value, the preceding loop may in fact decrement it > >> down to something less than zero before exiting, in which case we > >> would indeed have trouble. I suspect that the code is making > >> undocumented assumptions about the possible initial values of shift. > >> Maybe some Asserts would be good? Also, if we're effectively assuming > >> that shift must be exactly zero here, why not let the compiler > >> hard-code that? > > > Sounds like a good solution. I've attached the patch for that. > > Personally I'd put the Assert immediately after the loop, because > it's not related to the "Reserve slot for the value" comment. > Seems reasonable otherwise. > Thanks. Pushed the fix after moving the Assert. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
John Naylor writes: > Hmm, before 30e144287 this code only ran in a test module, is it > possible Coverity would not find it there? That could indeed explain why Coverity didn't see it. I'm not sure how our community run is set up, but it may not build the test modules. regards, tom lane
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 25, 2024 at 8:02 AM Masahiko Sawada wrote: > > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane wrote: > > > > I'm not sure why it took a couple weeks for Coverity to notice > > ee1b30f12, but it saw it today, and it's not happy: > > Hmm, I've also done Coverity Scan in development but I wasn't able to > see this one for some reason... Hmm, before 30e144287 this code only ran in a test module, is it possible Coverity would not find it there?
Re: [PoC] Improve dead tuple storage for lazy vacuum
Masahiko Sawada writes: > On Mon, Mar 25, 2024 at 1:53 AM Tom Lane wrote: >> I think the point here is that if you start with an arbitrary >> non-negative shift value, the preceding loop may in fact decrement it >> down to something less than zero before exiting, in which case we >> would indeed have trouble. I suspect that the code is making >> undocumented assumptions about the possible initial values of shift. >> Maybe some Asserts would be good? Also, if we're effectively assuming >> that shift must be exactly zero here, why not let the compiler >> hard-code that? > Sounds like a good solution. I've attached the patch for that. Personally I'd put the Assert immediately after the loop, because it's not related to the "Reserve slot for the value" comment. Seems reasonable otherwise. regards, tom lane
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 25, 2024 at 1:53 AM Tom Lane wrote: > > John Naylor writes: > > Done. I pushed this with a few last-minute cosmetic adjustments. This > > has been a very long time coming, but we're finally in the home > > stretch! Thank you for the report. > > I'm not sure why it took a couple weeks for Coverity to notice > ee1b30f12, but it saw it today, and it's not happy: Hmm, I've also done Coverity Scan in development but I wasn't able to see this one for some reason... > > /srv/coverity/git/pgsql-git/postgresql/src/include/lib/radixtree.h: 1621 in > local_ts_extend_down() > 1615node = child; > 1616shift -= RT_SPAN; > 1617} > 1618 > 1619/* Reserve slot for the value. */ > 1620n4 = (RT_NODE_4 *) node.local; > >>> CID 1594658: Integer handling issues (BAD_SHIFT) > >>> In expression "key >> shift", shifting by a negative amount has > >>> undefined behavior. The shift amount, "shift", is as little as -7. > 1621n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); > 1622n4->base.count = 1; > 1623 > 1624return &n4->children[0]; > 1625 } > 1626 > > I think the point here is that if you start with an arbitrary > non-negative shift value, the preceding loop may in fact decrement it > down to something less than zero before exiting, in which case we > would indeed have trouble. I suspect that the code is making > undocumented assumptions about the possible initial values of shift. > Maybe some Asserts would be good? Also, if we're effectively assuming > that shift must be exactly zero here, why not let the compiler > hard-code that? > > - n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); > + n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0); Sounds like a good solution. I've attached the patch for that. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com fix_radixtree.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
John Naylor writes: > Done. I pushed this with a few last-minute cosmetic adjustments. This > has been a very long time coming, but we're finally in the home > stretch! I'm not sure why it took a couple weeks for Coverity to notice ee1b30f12, but it saw it today, and it's not happy: /srv/coverity/git/pgsql-git/postgresql/src/include/lib/radixtree.h: 1621 in local_ts_extend_down() 1615node = child; 1616shift -= RT_SPAN; 1617} 1618 1619/* Reserve slot for the value. */ 1620n4 = (RT_NODE_4 *) node.local; >>> CID 1594658: Integer handling issues (BAD_SHIFT) >>> In expression "key >> shift", shifting by a negative amount has >>> undefined behavior. The shift amount, "shift", is as little as -7. 1621n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); 1622n4->base.count = 1; 1623 1624return &n4->children[0]; 1625 } 1626 I think the point here is that if you start with an arbitrary non-negative shift value, the preceding loop may in fact decrement it down to something less than zero before exiting, in which case we would indeed have trouble. I suspect that the code is making undocumented assumptions about the possible initial values of shift. Maybe some Asserts would be good? Also, if we're effectively assuming that shift must be exactly zero here, why not let the compiler hard-code that? - n4->chunks[0] = RT_GET_KEY_CHUNK(key, shift); + n4->chunks[0] = RT_GET_KEY_CHUNK(key, 0); regards, tom lane
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 7:48 PM John Naylor wrote: > > On Thu, Mar 21, 2024 at 4:03 PM Masahiko Sawada wrote: > > > > I've looked into this idea further. Overall, it looks clean and I > > don't see any problem so far in terms of integration with lazy vacuum. > > I've attached three patches for discussion and tests. > > Seems okay in the big picture, it's the details we need to be careful of. > > v77-0001 > > - dead_items = (VacDeadItems *) > palloc(vac_max_items_to_alloc_size(max_items)); > - dead_items->max_items = max_items; > - dead_items->num_items = 0; > + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); > + > + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); > + dead_items_info->max_bytes = vac_work_mem * 1024L; > > This is confusing enough that it looks like a bug: > > [inside TidStoreCreate()] > /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ > while (16 * maxBlockSize > max_bytes * 1024L) > maxBlockSize >>= 1; > > This was copied from CreateWorkExprContext, which operates directly on > work_mem -- if the parameter is actually bytes, we can't "* 1024" > here. If we're passing something measured in kilobytes, the parameter > is badly named. Let's use convert once and use bytes everywhere. True. The attached 0001 patch fixes it. > > v77-0002: > > +#define dsa_create(tranch_id) \ > + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE) > > Since these macros are now referring to defaults, maybe their name > should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE > (*_MAX_*) It makes sense to rename DSA_INITIAL_SEGMENT_SIZE , but I think that the DSA_MAX_SEGMENT_SIZE is the theoretical maximum size, the current name also makes sense to me. > > +/* The minimum size of a DSM segment. */ > +#define DSA_MIN_SEGMENT_SIZE ((size_t) 1024) > > That's a *lot* smaller than it is now. Maybe 256kB? We just want 1MB > m_w_m to work correctly. Fixed. 
> > v77-0003: > > +/* Public APIs to create local or shared TidStore */ > + > +TidStore * > +TidStoreCreateLocal(size_t max_bytes) > +{ > + return tidstore_create_internal(max_bytes, false, 0); > +} > + > +TidStore * > +TidStoreCreateShared(size_t max_bytes, int tranche_id) > +{ > + return tidstore_create_internal(max_bytes, true, tranche_id); > +} > > I don't think these operations have enough in common to justify > sharing even an internal implementation. Choosing aset block size is > done for both memory types, but it's pointless to do it for shared > memory, because the local context is then only used for small > metadata. > > + /* > + * Choose the DSA initial and max segment sizes to be no longer than > + * 1/16 and 1/8 of max_bytes, respectively. > + */ > > I'm guessing the 1/8 here because the number of segments is limited? I > know these numbers are somewhat arbitrary, but readers will wonder why > one has 1/8 and the other has 1/16. > > + if (dsa_init_size < DSA_MIN_SEGMENT_SIZE) > + dsa_init_size = DSA_MIN_SEGMENT_SIZE; > + if (dsa_max_size < DSA_MAX_SEGMENT_SIZE) > + dsa_max_size = DSA_MAX_SEGMENT_SIZE; > > The second clamp seems against the whole point of this patch -- it > seems they should all be clamped bigger than the DSA_MIN_SEGMENT_SIZE? > Did you try it with 1MB m_w_m? I've incorporated the above comments and test results look good to me. I've attached several patches: - 0002 is a minor fix for tidstore I found. - 0005 changes the create APIs of tidstore. - 0006 updates the vacuum improvement patch to use the new TidStoreCreateLocal/Shared() APIs. 
Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v78-0005-Rethink-create-and-attach-APIs-of-shared-TidStor.patch Description: Binary data v78-0004-Allow-specifying-initial-and-maximum-segment-siz.patch Description: Binary data v78-0003-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data v78-0006-Adjust-the-vacuum-improvement-patch-to-new-TidSt.patch Description: Binary data v78-0002-Fix-an-inconsistent-function-prototype-with-the-.patch Description: Binary data v78-0001-Fix-a-calculation-in-TidStoreCreate.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 4:03 PM Masahiko Sawada wrote: > > I've looked into this idea further. Overall, it looks clean and I > don't see any problem so far in terms of integration with lazy vacuum. > I've attached three patches for discussion and tests. Seems okay in the big picture, it's the details we need to be careful of. v77-0001 - dead_items = (VacDeadItems *) palloc(vac_max_items_to_alloc_size(max_items)); - dead_items->max_items = max_items; - dead_items->num_items = 0; + vacrel->dead_items = TidStoreCreate(vac_work_mem, NULL, 0); + + dead_items_info = (VacDeadItemsInfo *) palloc(sizeof(VacDeadItemsInfo)); + dead_items_info->max_bytes = vac_work_mem * 1024L; This is confusing enough that it looks like a bug: [inside TidStoreCreate()] /* choose the maxBlockSize to be no larger than 1/16 of max_bytes */ while (16 * maxBlockSize > max_bytes * 1024L) maxBlockSize >>= 1; This was copied from CreateWorkExprContext, which operates directly on work_mem -- if the parameter is actually bytes, we can't "* 1024" here. If we're passing something measured in kilobytes, the parameter is badly named. Let's convert once and use bytes everywhere. Note: This was not another pass over the whole vacuum patch, just looking at the issue at hand. Also for later: Dilip Kumar reviewed an earlier version. v77-0002: +#define dsa_create(tranch_id) \ + dsa_create_ext(tranch_id, DSA_INITIAL_SEGMENT_SIZE, DSA_MAX_SEGMENT_SIZE) Since these macros are now referring to defaults, maybe their name should reflect that. Something like DSA_DEFAULT_INIT_SEGMENT_SIZE (*_MAX_*) +/* The minimum size of a DSM segment. */ +#define DSA_MIN_SEGMENT_SIZE ((size_t) 1024) That's a *lot* smaller than it is now. Maybe 256kB? We just want 1MB m_w_m to work correctly. 
v77-0003: +/* Public APIs to create local or shared TidStore */ + +TidStore * +TidStoreCreateLocal(size_t max_bytes) +{ + return tidstore_create_internal(max_bytes, false, 0); +} + +TidStore * +TidStoreCreateShared(size_t max_bytes, int tranche_id) +{ + return tidstore_create_internal(max_bytes, true, tranche_id); +} I don't think these operations have enough in common to justify sharing even an internal implementation. Choosing aset block size is done for both memory types, but it's pointless to do it for shared memory, because the local context is then only used for small metadata. + /* + * Choose the DSA initial and max segment sizes to be no longer than + * 1/16 and 1/8 of max_bytes, respectively. + */ I'm guessing the 1/8 here because the number of segments is limited? I know these numbers are somewhat arbitrary, but readers will wonder why one has 1/8 and the other has 1/16. + if (dsa_init_size < DSA_MIN_SEGMENT_SIZE) + dsa_init_size = DSA_MIN_SEGMENT_SIZE; + if (dsa_max_size < DSA_MAX_SEGMENT_SIZE) + dsa_max_size = DSA_MAX_SEGMENT_SIZE; The second clamp seems against the whole point of this patch -- it seems they should all be clamped bigger than the DSA_MIN_SEGMENT_SIZE? Did you try it with 1MB m_w_m?
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 4:35 PM John Naylor wrote: > > On Thu, Mar 21, 2024 at 1:11 PM Masahiko Sawada wrote: > > > Or we can have a new function for dsa.c to set the initial and max > > segment size (or either one) to the existing DSA area so that > > TidStoreCreate() can specify them at creation. > > I didn't like this very much, because it's splitting an operation > across an API boundary. The caller already has all the information it > needs when it creates the DSA. Straw man proposal: it could do the > same for local memory, then they'd be more similar. But if we made > local contexts the responsibility of the caller, that would cause > duplication between creating and resetting. Fair point. > > > In shared TidStore > > cases, since all memory required by shared radix tree is allocated in > > the passed-in DSA area and the memory usage is the total segment size > > allocated in the DSA area > > ...plus apparently some overhead, I just found out today, but that's > beside the point. > > On Thu, Mar 21, 2024 at 2:02 PM Masahiko Sawada wrote: > > > > Yet another idea is that TidStore creates its own DSA area in > > TidStoreCreate(). That is, In TidStoreCreate() we create a DSA area > > (using dsa_create()) and pass it to RT_CREATE(). Also, we need a new > > API to get the DSA area. The caller (e.g. parallel vacuum) gets the > > dsa_handle of the DSA and stores it in the shared memory (e.g. in > > PVShared). TidStoreAttach() will take two arguments: dsa_handle for > > the DSA area and dsa_pointer for the shared radix tree. This idea > > still requires controlling min/max segment sizes since dsa_create() > > uses the 1MB as the initial segment size. But the TidStoreCreate() > > would be more user friendly. > > This seems like an overall simplification, aside from future size > configuration, so +1 to continue looking into this. If we go this > route, I'd like to avoid a boolean parameter and cleanly separate > TidStoreCreateLocal() and TidStoreCreateShared(). 
Every operation > after that can introspect, but it's a bit awkward to force these cases > into the same function. It always was a little bit, but this change > makes it more so. I've looked into this idea further. Overall, it looks clean and I don't see any problem so far in terms of integration with lazy vacuum. I've attached three patches for discussion and tests. - 0001 patch makes lazy vacuum use of tidstore. - 0002 patch makes DSA init/max segment size configurable (borrowed from another thread). - 0003 patch makes TidStore create its own DSA area with init/max DSA segment adjustment (PoC patch). One thing unclear to me is that this idea will be usable even when we want to use the tidstore for parallel bitmap scan. Currently, we create a shared tidbitmap on a DSA area in ParallelExecutorInfo. This DSA area is used not only for tidbitmap but also for parallel hash etc. If the tidstore created its own DSA area, parallel bitmap scan would have to use the tidstore's DSA in addition to the DSA area in ParallelExecutorInfo. I'm not sure if there are some differences between these usages in terms of resource manager etc. It seems no problem but I might be missing something. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v77-0003-PoC-Make-shared-TidStore-create-its-own-DSA-area.patch Description: Binary data v77-0002-Make-DSA-initial-and-maximum-segment-size-config.patch Description: Binary data v77-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 1:11 PM Masahiko Sawada wrote: > Or we can have a new function for dsa.c to set the initial and max > segment size (or either one) to the existing DSA area so that > TidStoreCreate() can specify them at creation. I didn't like this very much, because it's splitting an operation across an API boundary. The caller already has all the information it needs when it creates the DSA. Straw man proposal: it could do the same for local memory, then they'd be more similar. But if we made local contexts the responsibility of the caller, that would cause duplication between creating and resetting. > In shared TidStore > cases, since all memory required by shared radix tree is allocated in > the passed-in DSA area and the memory usage is the total segment size > allocated in the DSA area ...plus apparently some overhead, I just found out today, but that's beside the point. On Thu, Mar 21, 2024 at 2:02 PM Masahiko Sawada wrote: > > Yet another idea is that TidStore creates its own DSA area in > TidStoreCreate(). That is, In TidStoreCreate() we create a DSA area > (using dsa_create()) and pass it to RT_CREATE(). Also, we need a new > API to get the DSA area. The caller (e.g. parallel vacuum) gets the > dsa_handle of the DSA and stores it in the shared memory (e.g. in > PVShared). TidStoreAttach() will take two arguments: dsa_handle for > the DSA area and dsa_pointer for the shared radix tree. This idea > still requires controlling min/max segment sizes since dsa_create() > uses the 1MB as the initial segment size. But the TidStoreCreate() > would be more user friendly. This seems like an overall simplification, aside from future size configuration, so +1 to continue looking into this. If we go this route, I'd like to avoid a boolean parameter and cleanly separate TidStoreCreateLocal() and TidStoreCreateShared(). Every operation after that can introspect, but it's a bit awkward to force these cases into the same function. 
It always was a little bit, but this change makes it more so.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 3:10 PM Masahiko Sawada wrote: > > On Thu, Mar 21, 2024 at 12:40 PM John Naylor wrote: > > > > On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada > > wrote: > > > > > > On Wed, Mar 20, 2024 at 11:19 PM John Naylor > > > wrote: > > > > Are they (the blocks to be precise) really out of order? The VALUES > > > > statement is ordered, but after inserting it does not output that way. > > > > I wondered if this is platform independent, but CI and our dev > > > > machines haven't failed this test, and I haven't looked into what > > > > determines the order. It's easy enough to hide the blocks if we ever > > > > need to, as we do elsewhere... > > > > > > It seems not necessary as such a test is already covered by > > > test_radixtree. I've changed the query to hide the output blocks. > > > > Okay. > > > > > The buildfarm has been all-green so far. > > > > Great! > > > > > I've attached the latest vacuum improvement patch. > > > > > > I just remembered that the tidstore cannot still be used for parallel > > > vacuum with minimum maintenance_work_mem. Even when the shared > > > tidstore is empty, its memory usage reports 1056768 bytes, a bit above > > > 1MB (1048576 bytes). We need something discussed on another thread[1] > > > in order to make it work. > > > > For exactly this reason, we used to have a clamp on max_bytes when it > > was internal to tidstore, so that it never reported full when first > > created, so I guess that got thrown away when we got rid of the > > control object in shared memory. Forcing callers to clamp their own > > limits seems pretty unfriendly, though. > > Or we can have a new function for dsa.c to set the initial and max > segment size (or either one) to the existing DSA area so that > TidStoreCreate() can specify them at creation. 
In shared TidStore > cases, since all memory required by shared radix tree is allocated in > the passed-in DSA area and the memory usage is the total segment size > allocated in the DSA area, the user will have to prepare a DSA area > only for the shared tidstore. So we might be able to expect that the > DSA passed-in to TidStoreCreate() is empty and its segment sizes can > be adjustable. Yet another idea is that TidStore creates its own DSA area in TidStoreCreate(). That is, in TidStoreCreate() we create a DSA area (using dsa_create()) and pass it to RT_CREATE(). Also, we need a new API to get the DSA area. The caller (e.g. parallel vacuum) gets the dsa_handle of the DSA and stores it in the shared memory (e.g. in PVShared). TidStoreAttach() will take two arguments: dsa_handle for the DSA area and dsa_pointer for the shared radix tree. This idea still requires controlling min/max segment sizes since dsa_create() uses 1MB as the initial segment size. But TidStoreCreate() would be more user-friendly. I've attached a PoC patch for discussion. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com tidstore_creates_dsa.patch.nocfbot Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 12:40 PM John Naylor wrote: > > On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada wrote: > > > > On Wed, Mar 20, 2024 at 11:19 PM John Naylor > > wrote: > > > Are they (the blocks to be precise) really out of order? The VALUES > > > statement is ordered, but after inserting it does not output that way. > > > I wondered if this is platform independent, but CI and our dev > > > machines haven't failed this test, and I haven't looked into what > > > determines the order. It's easy enough to hide the blocks if we ever > > > need to, as we do elsewhere... > > > > It seems not necessary as such a test is already covered by > > test_radixtree. I've changed the query to hide the output blocks. > > Okay. > > > The buildfarm has been all-green so far. > > Great! > > > I've attached the latest vacuum improvement patch. > > > > I just remembered that the tidstore cannot still be used for parallel > > vacuum with minimum maintenance_work_mem. Even when the shared > > tidstore is empty, its memory usage reports 1056768 bytes, a bit above > > 1MB (1048576 bytes). We need something discussed on another thread[1] > > in order to make it work. > > For exactly this reason, we used to have a clamp on max_bytes when it > was internal to tidstore, so that it never reported full when first > created, so I guess that got thrown away when we got rid of the > control object in shared memory. Forcing callers to clamp their own > limits seems pretty unfriendly, though. Or we can have a new function for dsa.c to set the initial and max segment size (or either one) to the existing DSA area so that TidStoreCreate() can specify them at creation. In shared TidStore cases, since all memory required by shared radix tree is allocated in the passed-in DSA area and the memory usage is the total segment size allocated in the DSA area, the user will have to prepare a DSA area only for the shared tidstore. 
So we might be able to expect that the DSA passed in to TidStoreCreate() is empty and that its segment sizes are adjustable. > > The proposals in that thread are pretty simple. If those don't move > forward soon, a hackish workaround would be to round down the number > we get from dsa_get_total_size to the nearest megabyte. Then > controlling min/max segment size would be a nice-to-have for PG17, not > a prerequisite. Interesting idea. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 21, 2024 at 9:37 AM Masahiko Sawada wrote: > > On Wed, Mar 20, 2024 at 11:19 PM John Naylor wrote: > > Are they (the blocks to be precise) really out of order? The VALUES > > statement is ordered, but after inserting it does not output that way. > > I wondered if this is platform independent, but CI and our dev > > machines haven't failed this test, and I haven't looked into what > > determines the order. It's easy enough to hide the blocks if we ever > > need to, as we do elsewhere... > > It seems not necessary as such a test is already covered by > test_radixtree. I've changed the query to hide the output blocks. Okay. > The buildfarm has been all-green so far. Great! > I've attached the latest vacuum improvement patch. > > I just remembered that the tidstore cannot still be used for parallel > vacuum with minimum maintenance_work_mem. Even when the shared > tidstore is empty, its memory usage reports 1056768 bytes, a bit above > 1MB (1048576 bytes). We need something discussed on another thread[1] > in order to make it work. For exactly this reason, we used to have a clamp on max_bytes when it was internal to tidstore, so that it never reported full when first created, so I guess that got thrown away when we got rid of the control object in shared memory. Forcing callers to clamp their own limits seems pretty unfriendly, though. The proposals in that thread are pretty simple. If those don't move forward soon, a hackish workaround would be to round down the number we get from dsa_get_total_size to the nearest megabyte. Then controlling min/max segment size would be a nice-to-have for PG17, not a prerequisite.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 20, 2024 at 11:19 PM John Naylor wrote: > > On Wed, Mar 20, 2024 at 8:30 PM Masahiko Sawada wrote: > > I forgot to report the results. Yes, I did some tests where I inserted > > many TIDs to make the tidstore use several GB memory. I did two cases: > > > > 1. insert 100M blocks of TIDs with an offset of 100. > > 2. insert 10M blocks of TIDs with an offset of 2048. > > > > The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup > > and iteration results were expected. > > Thanks for confirming! > > > While reviewing the codes again, the following two things caught my eyes: > > > > in check_set_block_offset() function, we don't take a lock on the > > tidstore while checking all possible TIDs. I'll add > > TidStoreLockShare() and TidStoreUnlock() as follows: > > > > + TidStoreLockShare(tidstore); > > if (TidStoreIsMember(tidstore, &tid)) > > ItemPointerSet(&items.lookup_tids[num_lookup_tids++], > > blkno, offset); > > + TidStoreUnlock(tidstore); > > In one sense, all locking in the test module is useless since there is > only a single process. On the other hand, it seems good to at least > run what we have written to run it trivially, and serve as an example > of usage. We should probably be consistent, and document at the top > that the locks are pro-forma only. Agreed. > > > Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take > > a lock on the shared tidstore since dsa_get_total_size() (called by > > RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it > > in the comment as follows: > > > > -/* Return the memory usage of TidStore */ > > +/* > > + * Return the memory usage of TidStore. > > + * > > + * In shared TidStore cases, since shared_ts_memory_usage() does > > appropriate > > + * locking, the caller doesn't need to take a lock. > > + */ > > > > What do you think? > > That duplicates the underlying comment on the radix tree function that > this calls, so I'm inclined to leave it out. 
At this level it's > probably best to document when a caller _does_ need to take an action. Okay, I didn't change it. > > One thing I forgot to ask about earlier: > > +-- Add tids in out of order. > > Are they (the blocks to be precise) really out of order? The VALUES > statement is ordered, but after inserting it does not output that way. > I wondered if this is platform independent, but CI and our dev > machines haven't failed this test, and I haven't looked into what > determines the order. It's easy enough to hide the blocks if we ever > need to, as we do elsewhere... It seems unnecessary, as such a test is already covered by test_radixtree. I've changed the query to hide the output blocks. I've pushed the tidstore patch after incorporating the above changes. In addition to that, I've added the following changes before the push: - Added src/test/modules/test_tidstore/.gitignore file. - Removed unnecessary #include from tidstore.c. The buildfarm has been all-green so far. I've attached the latest vacuum improvement patch. I just remembered that the tidstore still cannot be used for parallel vacuum with the minimum maintenance_work_mem. Even when the shared tidstore is empty, its memory usage reports 1056768 bytes, a bit above 1MB (1048576 bytes). We need something discussed on another thread[1] in order to make it work. Regards, [1] https://www.postgresql.org/message-id/CAD21AoCVMw6DSmgZY9h%2BxfzKtzJeqWiwxaUD2T-FztVcV-XibQ%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v76-0001-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 20, 2024 at 8:30 PM Masahiko Sawada wrote: > I forgot to report the results. Yes, I did some tests where I inserted > many TIDs to make the tidstore use several GB memory. I did two cases: > > 1. insert 100M blocks of TIDs with an offset of 100. > 2. insert 10M blocks of TIDs with an offset of 2048. > > The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup > and iteration results were expected. Thanks for confirming! > While reviewing the codes again, the following two things caught my eyes: > > in check_set_block_offset() function, we don't take a lock on the > tidstore while checking all possible TIDs. I'll add > TidStoreLockShare() and TidStoreUnlock() as follows: > > + TidStoreLockShare(tidstore); > if (TidStoreIsMember(tidstore, &tid)) > ItemPointerSet(&items.lookup_tids[num_lookup_tids++], > blkno, offset); > + TidStoreUnlock(tidstore); In one sense, all locking in the test module is useless since there is only a single process. On the other hand, it seems good to at least run what we have written to run it trivially, and serve as an example of usage. We should probably be consistent, and document at the top that the locks are pro-forma only. It's both a blessing and a curse that vacuum only has a single writer. It makes development less of a hassle, but also means that tidstore locking is done for API-completeness reasons, not (yet) as a practical necessity. Even tidbitmap.c's hash table currently has a single writer, and while using tidstore for that is still an engineering challenge for other reasons, it wouldn't exercise locking meaningfully, either, at least at first. > Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take > a lock on the shared tidstore since dsa_get_total_size() (called by > RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it > in the comment as follows: > > -/* Return the memory usage of TidStore */ > +/* > + * Return the memory usage of TidStore. 
> + * > + * In shared TidStore cases, since shared_ts_memory_usage() does appropriate > + * locking, the caller doesn't need to take a lock. > + */ > > What do you think? That duplicates the underlying comment on the radix tree function that this calls, so I'm inclined to leave it out. At this level it's probably best to document when a caller _does_ need to take an action. One thing I forgot to ask about earlier: +-- Add tids in out of order. Are they (the blocks to be precise) really out of order? The VALUES statement is ordered, but after inserting it does not output that way. I wondered if this is platform independent, but CI and our dev machines haven't failed this test, and I haven't looked into what determines the order. It's easy enough to hide the blocks if we ever need to, as we do elsewhere...
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 20, 2024 at 3:48 PM John Naylor wrote: > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada > wrote: > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor wrote: > > > Locally (not CI), we should try big inputs to make sure we can > > > actually go up to many GB -- it's easier and faster this way than > > > having vacuum give us a large data set. > > > > I'll do these tests. > > I just remembered this -- did any of this kind of testing happen? I > can do it as well. I forgot to report the results. Yes, I did some tests where I inserted many TIDs to make the tidstore use several GB memory. I did two cases: 1. insert 100M blocks of TIDs with an offset of 100. 2. insert 10M blocks of TIDs with an offset of 2048. The tidstore used about 4.8GB and 5.2GB, respectively, and all lookup and iteration results were expected. > > > Thank you. I've incorporated all the comments above. I've attached the > > latest patches, and am going to push them (one by one) after > > self-review again. > > One more cosmetic thing in 0001 that caught my eye: > > diff --git a/src/backend/access/common/Makefile > b/src/backend/access/common/Makefile > index b9aff0ccfd..67b8cc6108 100644 > --- a/src/backend/access/common/Makefile > +++ b/src/backend/access/common/Makefile > @@ -27,6 +27,7 @@ OBJS = \ > syncscan.o \ > toast_compression.o \ > toast_internals.o \ > + tidstore.o \ > tupconvert.o \ > tupdesc.o > > diff --git a/src/backend/access/common/meson.build > b/src/backend/access/common/meson.build > index 725041a4ce..a02397855e 100644 > --- a/src/backend/access/common/meson.build > +++ b/src/backend/access/common/meson.build > @@ -15,6 +15,7 @@ backend_sources += files( >'syncscan.c', >'toast_compression.c', >'toast_internals.c', > + 'tidstore.c', >'tupconvert.c', >'tupdesc.c', > ) > > These aren't in alphabetical order. Good catch. I'll fix them before the push. 
While reviewing the code again, the following two things caught my eye: in the check_set_block_offset() function, we don't take a lock on the tidstore while checking all possible TIDs. I'll add TidStoreLockShare() and TidStoreUnlock() as follows: + TidStoreLockShare(tidstore); if (TidStoreIsMember(tidstore, &tid)) ItemPointerSet(&items.lookup_tids[num_lookup_tids++], blkno, offset); + TidStoreUnlock(tidstore); --- Regarding TidStoreMemoryUsage(), IIUC the caller doesn't need to take a lock on the shared tidstore since dsa_get_total_size() (called by RT_MEMORY_USAGE()) does appropriate locking. I think we can mention it in the comment as follows: -/* Return the memory usage of TidStore */ +/* + * Return the memory usage of TidStore. + * + * In shared TidStore cases, since shared_ts_memory_usage() does appropriate + * locking, the caller doesn't need to take a lock. + */ What do you think? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada wrote: > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor wrote: > > Locally (not CI), we should try big inputs to make sure we can > > actually go up to many GB -- it's easier and faster this way than > > having vacuum give us a large data set. > > I'll do these tests. I just remembered this -- did any of this kind of testing happen? I can do it as well. > Thank you. I've incorporated all the comments above. I've attached the > latest patches, and am going to push them (one by one) after > self-review again. One more cosmetic thing in 0001 that caught my eye: diff --git a/src/backend/access/common/Makefile b/src/backend/access/common/Makefile index b9aff0ccfd..67b8cc6108 100644 --- a/src/backend/access/common/Makefile +++ b/src/backend/access/common/Makefile @@ -27,6 +27,7 @@ OBJS = \ syncscan.o \ toast_compression.o \ toast_internals.o \ + tidstore.o \ tupconvert.o \ tupdesc.o diff --git a/src/backend/access/common/meson.build b/src/backend/access/common/meson.build index 725041a4ce..a02397855e 100644 --- a/src/backend/access/common/meson.build +++ b/src/backend/access/common/meson.build @@ -15,6 +15,7 @@ backend_sources += files( 'syncscan.c', 'toast_compression.c', 'toast_internals.c', + 'tidstore.c', 'tupconvert.c', 'tupdesc.c', ) These aren't in alphabetical order.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 19, 2024 at 6:40 PM John Naylor wrote: > > On Tue, Mar 19, 2024 at 10:24 AM Masahiko Sawada > wrote: > > > > On Tue, Mar 19, 2024 at 8:35 AM John Naylor wrote: > > > > > > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada > > > wrote: > > > > > > > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor > > > > wrote: > > > > It might also be worth reducing the number of blocks in the random > > > test -- multiple runs will have different offsets anyway. > > > > Yes. If we reduce the number of blocks from 1000 to 100, the > > regression test took on my environment: > > > > 1000 blocks : 516 ms > > 100 blocks : 228 ms > > Sounds good. > > > Removed some unnecessary variables in 0002 patch. > > Looks good. > > > So the MaxBlocktableEntrySize calculation would be as follows? > > > > #define MaxBlocktableEntrySize \ > > offsetof(BlocktableEntry, words) + \ > > (sizeof(bitmapword) * \ > > WORDS_PER_PAGE(Min(MaxOffsetNumber, \ > >BITS_PER_BITMAPWORD * PG_INT8_MAX - 1 > > > > I've made this change in the 0003 patch. > > This is okay, but one side effect is that we have both an assert and > an elog, for different limits. I think we'll need a separate #define > to help. But for now, I don't want to hold up tidstore further with > this because I believe almost everything else in v74 is in pretty good > shape. I'll save this for later as a part of the optimization I > proposed. > > Remaining things I noticed: > > +#define RT_PREFIX local_rt > +#define RT_PREFIX shared_rt > > Prefixes for simplehash, for example, don't have "sh" -- maybe > "local/shared_ts" > > + /* MemoryContext where the radix tree uses */ > > s/where/that/ > > +/* > + * Lock support functions. > + * > + * We can use the radix tree's lock for shared TidStore as the data we > + * need to protect is only the shared radix tree. > + */ > +void > +TidStoreLockExclusive(TidStore *ts) > > Talking about multiple things, so maybe a blank line after the comment. 
> > With those, I think you can go ahead and squash all the tidstore > patches except for 0003 and commit it. > > > While reviewing the vacuum patch, I realized that we always pass > > LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related > > to the tidstore is therefore always the same. I think it would be > > better to make the caller of TidStoreCreate() specify the tranch_id > > and pass it to RT_CREATE(). That way, the caller can specify their own > > wait event for tidstore. The 0008 patch tried this idea. dshash.c does > > the same idea. > > Sounds reasonable. I'll just note that src/include/storage/lwlock.h > still has an entry for LWTRANCHE_SHARED_TIDSTORE. Thank you. I've incorporated all the comments above. I've attached the latest patches, and am going to push them (one by one) after self-review again. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v75-0001-Add-TIDStore-to-store-sets-of-TIDs-ItemPointerDa.patch Description: Binary data v75-0002-Use-TidStore-for-dead-tuple-TIDs-storage-during-.patch Description: Binary data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 19, 2024 at 10:24 AM Masahiko Sawada wrote: > > On Tue, Mar 19, 2024 at 8:35 AM John Naylor wrote: > > > > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada > > wrote: > > > > > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor > > > wrote: > > It might also be worth reducing the number of blocks in the random > > test -- multiple runs will have different offsets anyway. > > Yes. If we reduce the number of blocks from 1000 to 100, the > regression test took on my environment: > > 1000 blocks : 516 ms > 100 blocks : 228 ms Sounds good. > Removed some unnecessary variables in 0002 patch. Looks good. > So the MaxBlocktableEntrySize calculation would be as follows? > > #define MaxBlocktableEntrySize \ > offsetof(BlocktableEntry, words) + \ > (sizeof(bitmapword) * \ > WORDS_PER_PAGE(Min(MaxOffsetNumber, \ >BITS_PER_BITMAPWORD * PG_INT8_MAX - 1 > > I've made this change in the 0003 patch. This is okay, but one side effect is that we have both an assert and an elog, for different limits. I think we'll need a separate #define to help. But for now, I don't want to hold up tidstore further with this because I believe almost everything else in v74 is in pretty good shape. I'll save this for later as a part of the optimization I proposed. Remaining things I noticed: +#define RT_PREFIX local_rt +#define RT_PREFIX shared_rt Prefixes for simplehash, for example, don't have "sh" -- maybe "local/shared_ts" + /* MemoryContext where the radix tree uses */ s/where/that/ +/* + * Lock support functions. + * + * We can use the radix tree's lock for shared TidStore as the data we + * need to protect is only the shared radix tree. + */ +void +TidStoreLockExclusive(TidStore *ts) Talking about multiple things, so maybe a blank line after the comment. With those, I think you can go ahead and squash all the tidstore patches except for 0003 and commit it. 
> While reviewing the vacuum patch, I realized that we always pass > LWTRANCHE_SHARED_TIDSTORE to RT_CREATE(), and the wait event related > to the tidstore is therefore always the same. I think it would be > better to make the caller of TidStoreCreate() specify the tranch_id > and pass it to RT_CREATE(). That way, the caller can specify their own > wait event for tidstore. The 0008 patch tried this idea. dshash.c does > the same idea. Sounds reasonable. I'll just note that src/include/storage/lwlock.h still has an entry for LWTRANCHE_SHARED_TIDSTORE.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 19, 2024 at 8:35 AM John Naylor wrote: > > On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada > wrote: > > > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor > > wrote: > > > > Random offsets is what I was thinking of (if made distinct and > > > ordered), but even there the code is fairy trivial, so I don't have a > > > strong feeling about it. > > > > Agreed. > > Looks good. > > A related thing I should mention is that the tests which look up all > possible offsets are really expensive with the number of blocks we're > using now (assert build): > > v70 0.33s > v72 1.15s > v73 1.32 > > To trim that back, I think we should give up on using shared memory > for the is-full test: We can cause aset to malloc a new block with a > lot fewer entries. In the attached, this brings it back down to 0.43s. Looks good. Agreed with this change. > It might also be worth reducing the number of blocks in the random > test -- multiple runs will have different offsets anyway. Yes. If we reduce the number of blocks from 1000 to 100, the regression test took on my environment: 1000 blocks : 516 ms 100 blocks : 228 ms > > > > I think we can stop including the debug-tid-store patch for CI now. > > > That would allow getting rid of some unnecessary variables. > > > > Agreed. > > Okay, all that remains here is to get rid of those variables (might be > just one). Removed some unnecessary variables in 0002 patch. > > > > + * Scan the TidStore and return a pointer to TidStoreIterResult that has > > > TIDs > > > + * in one block. We return the block numbers in ascending order and the > > > offset > > > + * numbers in each result is also sorted in ascending order. > > > + */ > > > +TidStoreIterResult * > > > +TidStoreIterateNext(TidStoreIter *iter) > > > > > > The wording is a bit awkward. > > > > Fixed. > > - * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs > - * in one block. 
We return the block numbers in ascending order and the > offset > - * numbers in each result is also sorted in ascending order. > + * Scan the TidStore and return the TIDs of the next block. The returned > block > + * numbers is sorted in ascending order, and the offset numbers in each > result > + * is also sorted in ascending order. > > Better, but it's still not very clear. Maybe "The offsets in each > iteration result are ordered, as are the block numbers over all > iterations." Thanks, fixed. > > > > +/* Extract TIDs from the given key-value pair */ > > > +static void > > > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, > > > BlocktableEntry *page) > > > > > > This is a leftover from the old encoding scheme. This should really > > > take a "BlockNumber blockno" not a "key", and the only call site > > > should probably cast the uint64 to BlockNumber. > > > > Fixed. > > This part looks good. I didn't notice earlier, but this comment has a > similar issue > > @@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter) > return NULL; > > /* Collect TIDs extracted from the key-value pair */ > - tidstore_iter_extract_tids(iter, key, page); > + tidstore_iter_extract_tids(iter, (BlockNumber) key, page); > > ..."extracted" was once a separate operation. I think just removing > that one word is enough to update it. Fixed. > > Some other review on code comments: > > v73-0001: > > + /* Enlarge the TID array if necessary */ > > It's "arrays" now. > > v73-0005: > > +-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has > +-- different randon 100 offset numbers each other. > > The numbers are obvious from the query. Maybe just mention that the > offsets are randomized and must be unique and ordered. > > + * The caller is responsible for release any locks. > > "releasing" Fixed. 
> > > > +typedef struct BlocktableEntry > > > +{ > > > + uint16 nwords; > > > + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; > > > +} BlocktableEntry; > > > > > > In my WIP for runtime-embeddable offsets, nwords needs to be one byte. > > I should be more clear here: nwords fitting into one byte allows 3 > embedded offsets (1 on 32-bit platforms, which is good for testing at > least). With uint16 nwords that reduces to 2 (none on 32-bit > platforms). Further, after the current patch series is fully > committed, I plan to split the embedded-offset patch into two parts: > The first would store the offsets in the header, but would still need > a (smaller) allocation. The second would embed them in the child > pointer. Only the second patch will care about the size of nwords > because it needs to reserve a byte for the pointer tag. Thank you for the clarification. > > > > That doesn't have any real-world affect on the largest offset > > > encountered, and only in 32-bit builds with 32kB block size would the > > > theoretical max change at all. To be precise, we could use in the > > > MaxBlocktableEntrySize calculation: > > > > > > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1); > > > > I don't get thi
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 18, 2024 at 11:12 AM Masahiko Sawada wrote: > > On Sun, Mar 17, 2024 at 11:46 AM John Naylor wrote: > > Random offsets is what I was thinking of (if made distinct and > > ordered), but even there the code is fairy trivial, so I don't have a > > strong feeling about it. > > Agreed. Looks good. A related thing I should mention is that the tests which look up all possible offsets are really expensive with the number of blocks we're using now (assert build): v70 0.33s v72 1.15s v73 1.32 To trim that back, I think we should give up on using shared memory for the is-full test: We can cause aset to malloc a new block with a lot fewer entries. In the attached, this brings it back down to 0.43s. It might also be worth reducing the number of blocks in the random test -- multiple runs will have different offsets anyway. > > I think we can stop including the debug-tid-store patch for CI now. > > That would allow getting rid of some unnecessary variables. > > Agreed. Okay, all that remains here is to get rid of those variables (might be just one). > > + * Scan the TidStore and return a pointer to TidStoreIterResult that has > > TIDs > > + * in one block. We return the block numbers in ascending order and the > > offset > > + * numbers in each result is also sorted in ascending order. > > + */ > > +TidStoreIterResult * > > +TidStoreIterateNext(TidStoreIter *iter) > > > > The wording is a bit awkward. > > Fixed. - * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs - * in one block. We return the block numbers in ascending order and the offset - * numbers in each result is also sorted in ascending order. + * Scan the TidStore and return the TIDs of the next block. The returned block + * numbers is sorted in ascending order, and the offset numbers in each result + * is also sorted in ascending order. Better, but it's still not very clear. Maybe "The offsets in each iteration result are ordered, as are the block numbers over all iterations." 
> > +/* Extract TIDs from the given key-value pair */ > > +static void > > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, > > BlocktableEntry *page) > > > > This is a leftover from the old encoding scheme. This should really > > take a "BlockNumber blockno" not a "key", and the only call site > > should probably cast the uint64 to BlockNumber. > > Fixed. This part looks good. I didn't notice earlier, but this comment has a similar issue @@ -384,14 +391,15 @@ TidStoreIterateNext(TidStoreIter *iter) return NULL; /* Collect TIDs extracted from the key-value pair */ - tidstore_iter_extract_tids(iter, key, page); + tidstore_iter_extract_tids(iter, (BlockNumber) key, page); ..."extracted" was once a separate operation. I think just removing that one word is enough to update it. Some other review on code comments: v73-0001: + /* Enlarge the TID array if necessary */ It's "arrays" now. v73-0005: +-- Random TIDs test. We insert TIDs for 1000 blocks. Each block has +-- different randon 100 offset numbers each other. The numbers are obvious from the query. Maybe just mention that the offsets are randomized and must be unique and ordered. + * The caller is responsible for release any locks. "releasing" > > +typedef struct BlocktableEntry > > +{ > > + uint16 nwords; > > + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; > > +} BlocktableEntry; > > > > In my WIP for runtime-embeddable offsets, nwords needs to be one byte. I should be more clear here: nwords fitting into one byte allows 3 embedded offsets (1 on 32-bit platforms, which is good for testing at least). With uint16 nwords that reduces to 2 (none on 32-bit platforms). Further, after the current patch series is fully committed, I plan to split the embedded-offset patch into two parts: The first would store the offsets in the header, but would still need a (smaller) allocation. The second would embed them in the child pointer. 
Only the second patch will care about the size of nwords because it needs to reserve a byte for the pointer tag. > > That doesn't have any real-world affect on the largest offset > > encountered, and only in 32-bit builds with 32kB block size would the > > theoretical max change at all. To be precise, we could use in the > > MaxBlocktableEntrySize calculation: > > > > Min(MaxOffsetNumber, BITS_PER_BITMAPWORD * PG_INT8_MAX - 1); > > I don't get this expression. Making the nwords one byte works well? > With 8kB blocks, MaxOffsetNumber is 2048 and it requires 256 > bitmapword entries on 64-bit OS or 512 bitmapword entries on 32-bit > OS, respectively. One byte nwrods variable seems not to be sufficient I believe there is confusion between bitmap words and bytes: 2048 / 64 = 32 words = 256 bytes It used to be max tuples per (heap) page, but we wanted a simple way to make this independent of heap. I believe we won't need to ever store the actual MaxOffsetNumber, although we technically still could with a one-byte type and 32kB pages, at least on
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Sun, Mar 17, 2024 at 11:46 AM John Naylor wrote: > > On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada wrote: > > > > On Fri, Mar 15, 2024 at 4:36 PM John Naylor wrote: > > > > > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada > > > wrote: > > > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e. > > > > overwrite) the value, I think we should not expect that found is > > > > always false. > > > > > > I find that a puzzling statement, since 1) it was designed for > > > insert-only workloads, not actual overwrite IIRC and 2) the tests will > > > now fail if the same block is set twice, since we just switched the > > > tests to use a remnant of vacuum's old array. Having said that, I > > > don't object to removing artificial barriers to using it for purposes > > > not yet imagined, as long as test_tidstore.sql warns against that. > > > > I think that if it supports only insert-only workload and expects the > > same block is set only once, it should raise an error rather than an > > assertion. It's odd to me that the function fails only with an > > assertion build assertions even though it actually works fine even in > > that case. > > After thinking some more, I think you're right -- it's too > heavy-handed to throw an error/assert and a public function shouldn't > make assumptions about the caller. It's probably just a matter of > documenting the function (and it's lack of generality), and the tests > (which are based on the thing we're replacing). Removed 'found' in 0003 patch. > > > As for test_tidstore you're right that the test code doesn't handle > > the case where setting the same block twice. I think that there is no > > problem in the fixed-TIDs tests, but we would need something for > > random-TIDs tests so that we don't set the same block twice. I guess > > it could be trivial since we can use SQL queries to generate TIDs. 
I'm > > not sure how the random-TIDs tests would be like, but I think we can > > use SELECT DISTINCT to eliminate the duplicates of block numbers to > > use. > > Also, I don't think we need random blocks, since the radix tree tests > excercise that heavily already. > > Random offsets is what I was thinking of (if made distinct and > ordered), but even there the code is fairy trivial, so I don't have a > strong feeling about it. Agreed. > > > > Given the above two things, I think this function's comment needs > > > stronger language about its limitations. Perhaps even mention that > > > it's intended for, and optimized for, vacuum. You and I have long > > > known that tidstore would need a separate, more complex, function to > > > add or remove individual tids from existing entries, but it might be > > > good to have that documented. > > > > Agreed. > > How about this: > > /* > - * Set the given TIDs on the blkno to TidStore. > + * Create or replace an entry for the given block and array of offsets > * > - * NB: the offset numbers in offsets must be sorted in ascending order. > + * NB: This function is designed and optimized for vacuum's heap scanning > + * phase, so has some limitations: > + * - The offset numbers in "offsets" must be sorted in ascending order. > + * - If the block number already exists, the entry will be replaced -- > + * there is no way to add or remove offsets from an entry. > */ > void > TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber > *offsets, Looks good. > > I think we can stop including the debug-tid-store patch for CI now. > That would allow getting rid of some unnecessary variables. Agreed. > > + * Prepare to iterate through a TidStore. Since the radix tree is locked > during > + * the iteration, TidStoreEndIterate() needs to be called when finished. > > + * Concurrent updates during the iteration will be blocked when inserting a > + * key-value to the radix tree. > > This is outdated. Locking is optional. 
The remaining real reason now > is that TidStoreEndIterate needs to free memory. We probably need to > say something about locking, too, but not this. Fixed. > > + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs > + * in one block. We return the block numbers in ascending order and the > offset > + * numbers in each result is also sorted in ascending order. > + */ > +TidStoreIterResult * > +TidStoreIterateNext(TidStoreIter *iter) > > The wording is a bit awkward. Fixed. > > +/* > + * Finish an iteration over TidStore. This needs to be called after finishing > + * or when existing an iteration. > + */ > > s/existing/exiting/ ? > > It seems to say we need to finish after finishing. Maybe more precise wording. Fixed. > > +/* Extract TIDs from the given key-value pair */ > +static void > +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, > BlocktableEntry *page) > > This is a leftover from the old encoding scheme. This should really > take a "BlockNumber blockno" not a "key", and the only call site > should probably cast the uint64 to BlockNumbe
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 15, 2024 at 9:17 PM Masahiko Sawada wrote: > > On Fri, Mar 15, 2024 at 4:36 PM John Naylor wrote: > > > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada > > wrote: > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e. > > > overwrite) the value, I think we should not expect that found is > > > always false. > > > > I find that a puzzling statement, since 1) it was designed for > > insert-only workloads, not actual overwrite IIRC and 2) the tests will > > now fail if the same block is set twice, since we just switched the > > tests to use a remnant of vacuum's old array. Having said that, I > > don't object to removing artificial barriers to using it for purposes > > not yet imagined, as long as test_tidstore.sql warns against that. > > I think that if it supports only insert-only workload and expects the > same block is set only once, it should raise an error rather than an > assertion. It's odd to me that the function fails only with an > assertion build assertions even though it actually works fine even in > that case. After thinking some more, I think you're right -- it's too heavy-handed to throw an error/assert and a public function shouldn't make assumptions about the caller. It's probably just a matter of documenting the function (and its lack of generality), and the tests (which are based on the thing we're replacing). > As for test_tidstore you're right that the test code doesn't handle > the case where setting the same block twice. I think that there is no > problem in the fixed-TIDs tests, but we would need something for > random-TIDs tests so that we don't set the same block twice. I guess > it could be trivial since we can use SQL queries to generate TIDs. I'm > not sure how the random-TIDs tests would be like, but I think we can > use SELECT DISTINCT to eliminate the duplicates of block numbers to > use. Also, I don't think we need random blocks, since the radix tree tests exercise that heavily already. 
Random offsets are what I was thinking of (if made distinct and ordered), but even there the code is fairly trivial, so I don't have a strong feeling about it. > > Given the above two things, I think this function's comment needs > stronger language about its limitations. Perhaps even mention that > it's intended for, and optimized for, vacuum. You and I have long > known that tidstore would need a separate, more complex, function to > add or remove individual tids from existing entries, but it might be > good to have that documented. > > Agreed. How about this: /* - * Set the given TIDs on the blkno to TidStore. + * Create or replace an entry for the given block and array of offsets * - * NB: the offset numbers in offsets must be sorted in ascending order. + * NB: This function is designed and optimized for vacuum's heap scanning + * phase, so has some limitations: + * - The offset numbers in "offsets" must be sorted in ascending order. + * - If the block number already exists, the entry will be replaced -- + * there is no way to add or remove offsets from an entry. */ void TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, I think we can stop including the debug-tid-store patch for CI now. That would allow getting rid of some unnecessary variables. More comments: + * Prepare to iterate through a TidStore. Since the radix tree is locked during + * the iteration, TidStoreEndIterate() needs to be called when finished. + * Concurrent updates during the iteration will be blocked when inserting a + * key-value to the radix tree. This is outdated. Locking is optional. The remaining real reason now is that TidStoreEndIterate needs to free memory. We probably need to say something about locking, too, but not this. + * Scan the TidStore and return a pointer to TidStoreIterResult that has TIDs + * in one block. We return the block numbers in ascending order and the offset + * numbers in each result is also sorted in ascending order. 
+ */ +TidStoreIterResult * +TidStoreIterateNext(TidStoreIter *iter) The wording is a bit awkward. +/* + * Finish an iteration over TidStore. This needs to be called after finishing + * or when existing an iteration. + */ s/existing/exiting/ ? It seems to say we need to finish after finishing. Maybe more precise wording. +/* Extract TIDs from the given key-value pair */ +static void +tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, BlocktableEntry *page) This is a leftover from the old encoding scheme. This should really take a "BlockNumber blockno" not a "key", and the only call site should probably cast the uint64 to BlockNumber. + * tidstore.h + * Tid storage. + * + * + * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group Update year. +typedef struct BlocktableEntry +{ + uint16 nwords; + bitmapword words[FLEXIBLE_ARRAY_MEMBER]; +} BlocktableEntry; In my WIP for runtime-embeddable offsets, nwords needs to be one byte. That doesn't have any
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 15, 2024 at 4:36 PM John Naylor wrote: > > On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada wrote: > > > > On Thu, Mar 14, 2024 at 6:55 PM John Naylor wrote: > > > > > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada > > > wrote: > > > > > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor > > > > wrote: > > > > > Okay, here's an another idea: Change test_lookup_tids() to be more > > > > > general and put the validation down into C as well. First we save the > > > > > blocks from do_set_block_offsets() into a table, then with all those > > > > > blocks lookup a sufficiently-large range of possible offsets and save > > > > > found values in another array. So the static items structure would > > > > > have 3 arrays: inserts, successful lookups, and iteration (currently > > > > > the iteration output is private to check_set_block_offsets(). Then > > > > > sort as needed and check they are all the same. > > > > > > > > That's a promising idea. We can use the same mechanism for randomized > > > > tests too. If you're going to work on this, I'll do other tests on my > > > > environment in the meantime. > > > > > > Some progress on this in v72 -- I tried first without using SQL to > > > save the blocks, just using the unique blocks from the verification > > > array. It seems to work fine. > > > > Thanks! > > Seems I forgot the attachment last time...there's more stuff now > anyway, based on discussion. Thank you for updating the patches! The idea of using three TID arrays for the lookup test and iteration test looks good to me. I think we can add random-TIDs tests on top of it. > > > > - Since there are now three arrays we should reduce max bytes to > > > something smaller. > > > > Agreed. > > I went further than this, see below. > > > > - Further on that, I'm not sure if the "is full" test is telling us > > > much. It seems we could make max bytes a static variable and set it to > > > the size of the empty store. 
I'm guessing it wouldn't take much to add > > > enough tids so that the contexts need to allocate some blocks, and > > > then it would appear full and we can test that. I've made it so all > > > arrays repalloc when needed, just in case. > > > > How about using work_mem as max_bytes instead of having it as a static > > variable? In test_tidstore.sql we set work_mem before creating the > > tidstore. It would make the tidstore more controllable by SQL queries. > > My complaint is that the "is full" test is trivial, and also strange > in that max_bytes is used for two unrelated things: > > - the initial size of the verification arrays, which was always larger > than necessary, and now there are three of them > - the hint to TidStoreCreate to calculate its max block size / the > threshold for being "full" > > To make the "is_full" test slightly less trivial, my idea is to save > the empty store size and later add enough tids so that it has to > allocate new blocks/DSA segments, which is not that many, and then it > will appear full. I've done this and also separated the purpose of > various sizes in v72-0009/10. I see your point and the changes look good to me. > Using actual work_mem seems a bit more difficult to make this work. Agreed. > > > > --- > > + if (TidStoreIsShared(ts)) > > + found = shared_rt_set(ts->tree.shared, blkno, page); > > + else > > + found = local_rt_set(ts->tree.local, blkno, page); > > + > > + Assert(!found); > > > > Given TidStoreSetBlockOffsets() is designed to always set (i.e. > > overwrite) the value, I think we should not expect that found is > > always false. > > I find that a puzzling statement, since 1) it was designed for > insert-only workloads, not actual overwrite IIRC and 2) the tests will > now fail if the same block is set twice, since we just switched the > tests to use a remnant of vacuum's old array. 
Having said that, I > don't object to removing artificial barriers to using it for purposes > not yet imagined, as long as test_tidstore.sql warns against that. I think that if it supports only an insert-only workload and expects the same block to be set only once, it should raise an error rather than an assertion. It's odd to me that the function fails only in an assertion-enabled build even though it actually works fine in that case. As for test_tidstore, you're right that the test code doesn't handle the case where the same block is set twice. I think that there is no problem in the fixed-TIDs tests, but we would need something for random-TIDs tests so that we don't set the same block twice. I guess it could be trivial since we can use SQL queries to generate TIDs. I'm not sure what the random-TIDs tests would look like, but I think we can use SELECT DISTINCT to eliminate duplicate block numbers. > > Given the above two things, I think this function's comment needs > stronger language about its limitations. Perhaps even mention that > it's intended for, and optimized for, vacuum. You and I have long > kno
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 7:04 PM Masahiko Sawada wrote: > > On Thu, Mar 14, 2024 at 6:55 PM John Naylor wrote: > > > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada > > wrote: > > > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor > > > wrote: > > > > Okay, here's an another idea: Change test_lookup_tids() to be more > > > > general and put the validation down into C as well. First we save the > > > > blocks from do_set_block_offsets() into a table, then with all those > > > > blocks lookup a sufficiently-large range of possible offsets and save > > > > found values in another array. So the static items structure would > > > > have 3 arrays: inserts, successful lookups, and iteration (currently > > > > the iteration output is private to check_set_block_offsets(). Then > > > > sort as needed and check they are all the same. > > > > > > That's a promising idea. We can use the same mechanism for randomized > > > tests too. If you're going to work on this, I'll do other tests on my > > > environment in the meantime. > > > > Some progress on this in v72 -- I tried first without using SQL to > > save the blocks, just using the unique blocks from the verification > > array. It seems to work fine. > > Thanks! Seems I forgot the attachment last time...there's more stuff now anyway, based on discussion. > > - Since there are now three arrays we should reduce max bytes to > > something smaller. > > Agreed. I went further than this, see below. > > - Further on that, I'm not sure if the "is full" test is telling us > > much. It seems we could make max bytes a static variable and set it to > > the size of the empty store. I'm guessing it wouldn't take much to add > > enough tids so that the contexts need to allocate some blocks, and > > then it would appear full and we can test that. I've made it so all > > arrays repalloc when needed, just in case. > > How about using work_mem as max_bytes instead of having it as a static > variable? 
In test_tidstore.sql we set work_mem before creating the > tidstore. It would make the tidstore more controllable by SQL queries. My complaint is that the "is full" test is trivial, and also strange in that max_bytes is used for two unrelated things: - the initial size of the verification arrays, which was always larger than necessary, and now there are three of them - the hint to TidStoreCreate to calculate its max block size / the threshold for being "full" To make the "is_full" test slightly less trivial, my idea is to save the empty store size and later add enough tids so that it has to allocate new blocks/DSA segments, which is not that many, and then it will appear full. I've done this and also separated the purpose of various sizes in v72-0009/10. Using actual work_mem seems a bit more difficult to make this work. > > - I'm not sure it's useful to keep test_lookup_tids() around. Since we > > now have a separate lookup test, the only thing it can tell us is that > > lookups fail on an empty store. I arranged it so that > > check_set_block_offsets() works on an empty store. Although that's > > even more trivial, it's just reusing what we already need. > > Agreed. Removed in v72-0007 On Fri, Mar 15, 2024 at 9:49 AM Masahiko Sawada wrote: > > I have two questions on tidstore.c: > > +/* > + * Set the given TIDs on the blkno to TidStore. > + * > + * NB: the offset numbers in offsets must be sorted in ascending order. > + */ > > Do we need some assertions to check if the given offset numbers are > sorted expectedly? Done in v72-0008 > --- > + if (TidStoreIsShared(ts)) > + found = shared_rt_set(ts->tree.shared, blkno, page); > + else > + found = local_rt_set(ts->tree.local, blkno, page); > + > + Assert(!found); > > Given TidStoreSetBlockOffsets() is designed to always set (i.e. > overwrite) the value, I think we should not expect that found is > always false. 
I find that a puzzling statement, since 1) it was designed for insert-only workloads, not actual overwrite IIRC and 2) the tests will now fail if the same block is set twice, since we just switched the tests to use a remnant of vacuum's old array. Having said that, I don't object to removing artificial barriers to using it for purposes not yet imagined, as long as test_tidstore.sql warns against that. Given the above two things, I think this function's comment needs stronger language about its limitations. Perhaps even mention that it's intended for, and optimized for, vacuum. You and I have long known that tidstore would need a separate, more complex, function to add or remove individual tids from existing entries, but it might be good to have that documented. Other things: v72-0011: Test that zero offset raises an error. v72-0013: I had wanted to microbenchmark this, but since we are running short of time I decided to skip that, so I want to revert some code to make it again more similar to the equivalent in tidbitmap.c. In the absence of evidence, it seems better to do it this way. v72
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 9:03 PM Masahiko Sawada wrote: > > On Thu, Mar 14, 2024 at 6:55 PM John Naylor wrote: > > > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada > > wrote: > > > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor > > > wrote: > > > > Okay, here's an another idea: Change test_lookup_tids() to be more > > > > general and put the validation down into C as well. First we save the > > > > blocks from do_set_block_offsets() into a table, then with all those > > > > blocks lookup a sufficiently-large range of possible offsets and save > > > > found values in another array. So the static items structure would > > > > have 3 arrays: inserts, successful lookups, and iteration (currently > > > > the iteration output is private to check_set_block_offsets(). Then > > > > sort as needed and check they are all the same. > > > > > > That's a promising idea. We can use the same mechanism for randomized > > > tests too. If you're going to work on this, I'll do other tests on my > > > environment in the meantime. > > > > Some progress on this in v72 -- I tried first without using SQL to > > save the blocks, just using the unique blocks from the verification > > array. It seems to work fine. > > Thanks! > > > > > - Since there are now three arrays we should reduce max bytes to > > something smaller. > > Agreed. > > > - Further on that, I'm not sure if the "is full" test is telling us > > much. It seems we could make max bytes a static variable and set it to > > the size of the empty store. I'm guessing it wouldn't take much to add > > enough tids so that the contexts need to allocate some blocks, and > > then it would appear full and we can test that. I've made it so all > > arrays repalloc when needed, just in case. > > How about using work_mem as max_bytes instead of having it as a static > variable? In test_tidstore.sql we set work_mem before creating the > tidstore. It would make the tidstore more controllable by SQL queries. 
> > > - Why are we switching to TopMemoryContext? It's not explained -- the > > comment only tells what the code is doing (which is obvious), but not > > why. > > This is because the tidstore needs to live across the transaction > boundary. We can use TopMemoryContext or CacheMemoryContext. > > > - I'm not sure it's useful to keep test_lookup_tids() around. Since we > > now have a separate lookup test, the only thing it can tell us is that > > lookups fail on an empty store. I arranged it so that > > check_set_block_offsets() works on an empty store. Although that's > > even more trivial, it's just reusing what we already need. > > Agreed. > I have two questions on tidstore.c: +/* + * Set the given TIDs on the blkno to TidStore. + * + * NB: the offset numbers in offsets must be sorted in ascending order. + */ Do we need some assertions to check if the given offset numbers are sorted as expected? --- + if (TidStoreIsShared(ts)) + found = shared_rt_set(ts->tree.shared, blkno, page); + else + found = local_rt_set(ts->tree.local, blkno, page); + + Assert(!found); Given TidStoreSetBlockOffsets() is designed to always set (i.e. overwrite) the value, I think we should not expect that found is always false. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 6:55 PM John Naylor wrote: > > On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada > wrote: > > > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor wrote: > > > Okay, here's an another idea: Change test_lookup_tids() to be more > > > general and put the validation down into C as well. First we save the > > > blocks from do_set_block_offsets() into a table, then with all those > > > blocks lookup a sufficiently-large range of possible offsets and save > > > found values in another array. So the static items structure would > > > have 3 arrays: inserts, successful lookups, and iteration (currently > > > the iteration output is private to check_set_block_offsets(). Then > > > sort as needed and check they are all the same. > > > > That's a promising idea. We can use the same mechanism for randomized > > tests too. If you're going to work on this, I'll do other tests on my > > environment in the meantime. > > Some progress on this in v72 -- I tried first without using SQL to > save the blocks, just using the unique blocks from the verification > array. It seems to work fine. Thanks! > > - Since there are now three arrays we should reduce max bytes to > something smaller. Agreed. > - Further on that, I'm not sure if the "is full" test is telling us > much. It seems we could make max bytes a static variable and set it to > the size of the empty store. I'm guessing it wouldn't take much to add > enough tids so that the contexts need to allocate some blocks, and > then it would appear full and we can test that. I've made it so all > arrays repalloc when needed, just in case. How about using work_mem as max_bytes instead of having it as a static variable? In test_tidstore.sql we set work_mem before creating the tidstore. It would make the tidstore more controllable by SQL queries. > - Why are we switching to TopMemoryContext? It's not explained -- the > comment only tells what the code is doing (which is obvious), but not > why. 
This is because the tidstore needs to live across the transaction boundary. We can use TopMemoryContext or CacheMemoryContext. > - I'm not sure it's useful to keep test_lookup_tids() around. Since we > now have a separate lookup test, the only thing it can tell us is that > lookups fail on an empty store. I arranged it so that > check_set_block_offsets() works on an empty store. Although that's > even more trivial, it's just reusing what we already need. Agreed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 12:06 PM Masahiko Sawada wrote: > > On Thu, Mar 14, 2024 at 1:29 PM John Naylor wrote: > > Okay, here's an another idea: Change test_lookup_tids() to be more > > general and put the validation down into C as well. First we save the > > blocks from do_set_block_offsets() into a table, then with all those > > blocks lookup a sufficiently-large range of possible offsets and save > > found values in another array. So the static items structure would > > have 3 arrays: inserts, successful lookups, and iteration (currently > > the iteration output is private to check_set_block_offsets(). Then > > sort as needed and check they are all the same. > > That's a promising idea. We can use the same mechanism for randomized > tests too. If you're going to work on this, I'll do other tests on my > environment in the meantime. Some progress on this in v72 -- I tried first without using SQL to save the blocks, just using the unique blocks from the verification array. It seems to work fine. Some open questions on the test module: - Since there are now three arrays we should reduce max bytes to something smaller. - Further on that, I'm not sure if the "is full" test is telling us much. It seems we could make max bytes a static variable and set it to the size of the empty store. I'm guessing it wouldn't take much to add enough tids so that the contexts need to allocate some blocks, and then it would appear full and we can test that. I've made it so all arrays repalloc when needed, just in case. - Why are we switching to TopMemoryContext? It's not explained -- the comment only tells what the code is doing (which is obvious), but not why. - I'm not sure it's useful to keep test_lookup_tids() around. Since we now have a separate lookup test, the only thing it can tell us is that lookups fail on an empty store. I arranged it so that check_set_block_offsets() works on an empty store. Although that's even more trivial, it's just reusing what we already need.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 1:29 PM John Naylor wrote: > > On Thu, Mar 14, 2024 at 8:53 AM Masahiko Sawada wrote: > > > > On Thu, Mar 14, 2024 at 9:59 AM John Naylor wrote: > > > > BTW do we still want to test the tidstore by using a combination of > > > > SQL functions? We might no longer need to input TIDs via a SQL > > > > function. > > > > > > I'm not sure. I stopped short of doing that to get feedback on this > > > much. One advantage with SQL functions is we can use generate_series > > > to easily input lists of blocks with different numbers and strides, > > > and array literals for offsets are a bit easier. What do you think? > > > > While I'm not a fan of the following part, I agree that it makes sense > > to use SQL functions for test data generation: > > > > -- Constant values used in the tests. > > \set maxblkno 4294967295 > > -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is > > 291. > > -- We use a higher number to test tidstore. > > \set maxoffset 512 > > I'm not really a fan of these either, and could be removed a some > point if we've done everything else nicely. > > > It would also be easier for developers to test the tidstore with their > > own data set. So I agreed with the current approach; use SQL functions > > for data generation and do the actual tests inside C functions. > > Okay, here's an another idea: Change test_lookup_tids() to be more > general and put the validation down into C as well. First we save the > blocks from do_set_block_offsets() into a table, then with all those > blocks lookup a sufficiently-large range of possible offsets and save > found values in another array. So the static items structure would > have 3 arrays: inserts, successful lookups, and iteration (currently > the iteration output is private to check_set_block_offsets(). Then > sort as needed and check they are all the same. That's a promising idea. We can use the same mechanism for randomized tests too. 
If you're going to work on this, I'll do other tests on my environment in the meantime. > > Further thought: We may not really need to test block numbers that > vigorously, since the radix tree tests should cover keys/values pretty > well. Agreed. Probably boundary block numbers: 0, 1, MaxBlockNumber - 1, and MaxBlockNumber, would be sufficient. > The difference here is using bitmaps of tids and that should be > well covered. Right. We would need to test offset numbers vigorously instead. > > Locally (not CI), we should try big inputs to make sure we can > actually go up to many GB -- it's easier and faster this way than > having vacuum give us a large data set. I'll do these tests. > > > Is it > > convenient for developers if we have functions like generate_tids() > > and generate_random_tids() to generate TIDs so that they can pass them > > to do_set_block_offsets()? > > I guess I don't see the advantage of adding a layer of indirection at > this point, but it could be useful at a later time. Agreed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 8:53 AM Masahiko Sawada wrote: > > On Thu, Mar 14, 2024 at 9:59 AM John Naylor wrote: > > > BTW do we still want to test the tidstore by using a combination of > > > SQL functions? We might no longer need to input TIDs via a SQL > > > function. > > > > I'm not sure. I stopped short of doing that to get feedback on this > > much. One advantage with SQL functions is we can use generate_series > > to easily input lists of blocks with different numbers and strides, > > and array literals for offsets are a bit easier. What do you think? > > While I'm not a fan of the following part, I agree that it makes sense > to use SQL functions for test data generation: > > -- Constant values used in the tests. > \set maxblkno 4294967295 > -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is > 291. > -- We use a higher number to test tidstore. > \set maxoffset 512 I'm not really a fan of these either, and they could be removed at some point if we've done everything else nicely. > It would also be easier for developers to test the tidstore with their > own data set. So I agreed with the current approach; use SQL functions > for data generation and do the actual tests inside C functions. Okay, here's another idea: Change test_lookup_tids() to be more general and put the validation down into C as well. First we save the blocks from do_set_block_offsets() into a table, then with all those blocks lookup a sufficiently-large range of possible offsets and save found values in another array. So the static items structure would have 3 arrays: inserts, successful lookups, and iteration (currently the iteration output is private to check_set_block_offsets()). Then sort as needed and check they are all the same. Further thought: We may not really need to test block numbers that vigorously, since the radix tree tests should cover keys/values pretty well. The difference here is using bitmaps of tids and that should be well covered.
Locally (not CI), we should try big inputs to make sure we can actually go up to many GB -- it's easier and faster this way than having vacuum give us a large data set. > Is it > convenient for developers if we have functions like generate_tids() > and generate_random_tids() to generate TIDs so that they can pass them > to do_set_block_offsets()? I guess I don't see the advantage of adding a layer of indirection at this point, but it could be useful at a later time.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 14, 2024 at 9:59 AM John Naylor wrote: > > On Wed, Mar 13, 2024 at 9:29 PM Masahiko Sawada wrote: > > > > On Wed, Mar 13, 2024 at 8:05 PM John Naylor wrote: > > > > > > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada > > > wrote: > > > > > > > As I mentioned above, if we implement the test cases in C, we can use > > > > the debug-build array in the test code. And we won't use it in AND/OR > > > > operations tests in the future. > > > > > > That's a really interesting idea, so I went ahead and tried that for > > > v71. This seems like a good basis for testing larger, randomized > > > inputs, once we decide how best to hide that from the expected output. > > > The tests use SQL functions do_set_block_offsets() and > > > check_set_block_offsets(). The latter does two checks against a tid > > > array, and replaces test_dump_tids(). > > > > Great! I think that's a very good starter. > > > > The lookup_test() (and test_lookup_tids()) do also test that the > > IsMember() function returns false as expected if the TID doesn't exist > > in it, and probably we can do these tests in a C function too. > > > > BTW do we still want to test the tidstore by using a combination of > > SQL functions? We might no longer need to input TIDs via a SQL > > function. > > I'm not sure. I stopped short of doing that to get feedback on this > much. One advantage with SQL functions is we can use generate_series > to easily input lists of blocks with different numbers and strides, > and array literals for offsets are a bit easier. What do you think? While I'm not a fan of the following part, I agree that it makes sense to use SQL functions for test data generation: -- Constant values used in the tests. \set maxblkno 4294967295 -- The maximum number of heap tuples (MaxHeapTuplesPerPage) in 8kB block is 291. -- We use a higher number to test tidstore. \set maxoffset 512 It would also be easier for developers to test the tidstore with their own data set. 
So I agreed with the current approach; use SQL functions for data generation and do the actual tests inside C functions. Is it convenient for developers if we have functions like generate_tids() and generate_random_tids() to generate TIDs so that they can pass them to do_set_block_offsets()? Then they call check_set_block_offsets() and others for actual data lookup and iteration tests. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
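For reference, the "291" in the quoted test comment falls out of the heap page-layout arithmetic. The sketch below restates PostgreSQL's default constants (8192-byte blocks, 24-byte page header, 23-byte heap tuple header, 4-byte line pointer, 8-byte MAXALIGN) as assumptions rather than pulling them from the real headers, so treat it as an illustration of where the number comes from, not as the actual macro:

```c
/* Assumed defaults; the real values come from pg_config.h and htup_details.h. */
#define BLCKSZ 8192                   /* default heap block size */
#define SIZE_OF_PAGE_HEADER 24        /* SizeOfPageHeaderData */
#define SIZE_OF_HEAP_TUPLE_HEADER 23  /* SizeofHeapTupleHeader */
#define SIZE_OF_ITEM_ID 4             /* sizeof(ItemIdData), one line pointer */
#define MAXALIGN(x) (((x) + 7) & ~7)  /* round up to 8-byte alignment */

/* Mirrors the shape of the MaxHeapTuplesPerPage macro: usable page space
 * divided by the smallest possible (aligned tuple header + line pointer). */
static int
max_heap_tuples_per_page(void)
{
    return (BLCKSZ - SIZE_OF_PAGE_HEADER) /
           (MAXALIGN(SIZE_OF_HEAP_TUPLE_HEADER) + SIZE_OF_ITEM_ID);
}
```

With these defaults, (8192 - 24) / (24 + 4) truncates to 291, which is why `\set maxoffset 512` deliberately exceeds anything a real 8kB heap page could hold.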
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 13, 2024 at 9:29 PM Masahiko Sawada wrote: > > On Wed, Mar 13, 2024 at 8:05 PM John Naylor wrote: > > > > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada > > wrote: > > > > > As I mentioned above, if we implement the test cases in C, we can use > > > the debug-build array in the test code. And we won't use it in AND/OR > > > operations tests in the future. > > > > That's a really interesting idea, so I went ahead and tried that for > > v71. This seems like a good basis for testing larger, randomized > > inputs, once we decide how best to hide that from the expected output. > > The tests use SQL functions do_set_block_offsets() and > > check_set_block_offsets(). The latter does two checks against a tid > > array, and replaces test_dump_tids(). > > Great! I think that's a very good starter. > > The lookup_test() (and test_lookup_tids()) do also test that the > IsMember() function returns false as expected if the TID doesn't exist > in it, and probably we can do these tests in a C function too. > > BTW do we still want to test the tidstore by using a combination of > SQL functions? We might no longer need to input TIDs via a SQL > function. I'm not sure. I stopped short of doing that to get feedback on this much. One advantage with SQL functions is we can use generate_series to easily input lists of blocks with different numbers and strides, and array literals for offsets are a bit easier. What do you think?
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 13, 2024 at 8:05 PM John Naylor wrote: > > On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada wrote: > > > As I mentioned above, if we implement the test cases in C, we can use > > the debug-build array in the test code. And we won't use it in AND/OR > > operations tests in the future. > > That's a really interesting idea, so I went ahead and tried that for > v71. This seems like a good basis for testing larger, randomized > inputs, once we decide how best to hide that from the expected output. > The tests use SQL functions do_set_block_offsets() and > check_set_block_offsets(). The latter does two checks against a tid > array, and replaces test_dump_tids(). Great! I think that's a very good starter. The lookup_test() (and test_lookup_tids()) do also test that the IsMember() function returns false as expected if the TID doesn't exist in it, and probably we can do these tests in a C function too. BTW do we still want to test the tidstore by using a combination of SQL functions? We might no longer need to input TIDs via a SQL function. > Funnily enough, the debug array > itself gave false failures when using a similar array in the test > harness, because it didn't know all the places where the array should > have been sorted -- it only worked by chance before because of what > order things were done. Good catch, thanks. > I squashed everything from v70 and also took the liberty of switching > on shared memory for tid store tests. The only reason we didn't do > this with the radix tree tests is that the static attach/detach > functions would raise warnings since they are not used. Agreed to test the tidstore on shared memory. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 13, 2024 at 8:39 AM Masahiko Sawada wrote: > As I mentioned above, if we implement the test cases in C, we can use > the debug-build array in the test code. And we won't use it in AND/OR > operations tests in the future. That's a really interesting idea, so I went ahead and tried that for v71. This seems like a good basis for testing larger, randomized inputs, once we decide how best to hide that from the expected output. The tests use SQL functions do_set_block_offsets() and check_set_block_offsets(). The latter does two checks against a tid array, and replaces test_dump_tids(). Funnily enough, the debug array itself gave false failures when using a similar array in the test harness, because it didn't know all the places where the array should have been sorted -- it only worked by chance before because of what order things were done. I squashed everything from v70 and also took the liberty of switching on shared memory for tid store tests. The only reason we didn't do this with the radix tree tests is that the static attach/detach functions would raise warnings since they are not used. From 9b6b517b9f9977999034d16b4dc33e91094ae7ba Mon Sep 17 00:00:00 2001 From: Masahiko Sawada Date: Tue, 12 Dec 2023 22:36:24 +0900 Subject: [PATCH v71 2/6] DEV: Debug TidStore. 
--- src/backend/access/common/tidstore.c | 203 ++- 1 file changed, 201 insertions(+), 2 deletions(-) diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c index b725b62d4c..33753d8ed2 100644 --- a/src/backend/access/common/tidstore.c +++ b/src/backend/access/common/tidstore.c @@ -29,6 +29,11 @@ #include "utils/dsa.h" #include "utils/memutils.h" +/* Enable TidStore debugging if USE_ASSERT_CHECKING */ +#ifdef USE_ASSERT_CHECKING +#define TIDSTORE_DEBUG +#include "catalog/index.h" +#endif #define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) #define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) @@ -88,6 +93,13 @@ struct TidStore /* DSA area for TidStore if using shared memory */ dsa_area *area; + +#ifdef TIDSTORE_DEBUG + ItemPointerData *tids; + int64 max_tids; + int64 num_tids; + bool tids_unordered; +#endif }; #define TidStoreIsShared(ts) ((ts)->area != NULL) @@ -105,11 +117,25 @@ struct TidStoreIter /* output for the caller */ TidStoreIterResult output; + +#ifdef TIDSTORE_DEBUG + /* iterator index for the ts->tids array */ + int64 tids_idx; +#endif }; static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, BlocktableEntry *page); +/* debug functions available only when TIDSTORE_DEBUG */ +#ifdef TIDSTORE_DEBUG +static void ts_debug_set_block_offsets(TidStore *ts, BlockNumber blkno, + OffsetNumber *offsets, int num_offsets); +static void ts_debug_iter_check_tids(TidStoreIter *iter); +static bool ts_debug_is_member(TidStore *ts, ItemPointer tid); +static int itemptr_cmp(const void *left, const void *right); +#endif + /* * Create a TidStore. The TidStore will live in the memory context that is * CurrentMemoryContext at the time of this call. 
The TID storage, backed @@ -154,6 +180,17 @@ TidStoreCreate(size_t max_bytes, dsa_area *area) else ts->tree.local = local_rt_create(ts->rt_context); +#ifdef TIDSTORE_DEBUG + { + int64 max_tids = max_bytes / sizeof(ItemPointerData); + + ts->tids = palloc(sizeof(ItemPointerData) * max_tids); + ts->max_tids = max_tids; + ts->num_tids = 0; + ts->tids_unordered = false; + } +#endif + return ts; } @@ -191,6 +228,7 @@ TidStoreDetach(TidStore *ts) Assert(TidStoreIsShared(ts)); shared_rt_detach(ts->tree.shared); + pfree(ts); } @@ -241,6 +279,11 @@ TidStoreDestroy(TidStore *ts) MemoryContextDelete(ts->rt_context); +#ifdef TIDSTORE_DEBUG + if (!TidStoreIsShared(ts)) + pfree(ts->tids); +#endif + pfree(ts); } @@ -297,6 +340,14 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets, found = local_rt_set(ts->tree.local, blkno, page); Assert(!found); + +#ifdef TIDSTORE_DEBUG + if (!TidStoreIsShared(ts)) + { + /* Insert TIDs into the TID array too */ + ts_debug_set_block_offsets(ts, blkno, offsets, num_offsets); + } +#endif } /* Return true if the given TID is present in the TidStore */ @@ -310,6 +361,13 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) OffsetNumber off = ItemPointerGetOffsetNumber(tid); bool ret; +#ifdef TIDSTORE_DEBUG + bool ret_debug = false; + + if (!TidStoreIsShared(ts)) + ret_debug = ts_debug_is_member(ts, tid); +#endif + if (TidStoreIsShared(ts)) page = shared_rt_find(ts->tree.shared, blk); else @@ -317,17 +375,29 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) /* no entry for the blk */ if (page == NULL) - return false; + { + ret = false; + goto done; + } wordnum = WORDNUM(off); bitnum = BITNUM(off); /* no bitmap for the off */ if (wordnum >= page->nwords) - return false; + { + ret = false; + goto done; + } ret = (page->words[wordnum] & ((bi
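The membership test quoted (and truncated) in the diff above reduces to a word/bit split over a per-block offset bitmap, using the WORDNUM/BITNUM macros from earlier in the patch. A minimal standalone sketch, assuming a 64-bit bitmapword and names modeled on the patch rather than the actual tidstore code:

```c
#include <stdint.h>

typedef uint64_t bitmapword;
#define BITS_PER_BITMAPWORD 64
#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
#define BITNUM(x)  ((x) % BITS_PER_BITMAPWORD)
#define MAX_WORDS  8            /* enough for offsets up to 511 */

/* Simplified per-block entry: only as many words as the highest offset needs. */
typedef struct BlocktableEntry
{
    int        nwords;
    bitmapword words[MAX_WORDS];
} BlocktableEntry;

static void
entry_set_offset(BlocktableEntry *page, unsigned off)
{
    unsigned wordnum = WORDNUM(off);

    page->words[wordnum] |= (bitmapword) 1 << BITNUM(off);
    if (wordnum >= (unsigned) page->nwords)
        page->nwords = wordnum + 1;
}

static int
entry_is_member(const BlocktableEntry *page, unsigned off)
{
    unsigned wordnum = WORDNUM(off);

    /* no bitmap word stored for this offset */
    if (wordnum >= (unsigned) page->nwords)
        return 0;
    return (page->words[wordnum] & ((bitmapword) 1 << BITNUM(off))) != 0;
}
```

The early return on `wordnum >= nwords` is the same short-circuit the patch reworked into `goto done` so the debug check runs on every path.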
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 12, 2024 at 7:34 PM John Naylor wrote: > > On Mon, Mar 11, 2024 at 3:13 PM Masahiko Sawada wrote: > > > > On Mon, Mar 11, 2024 at 12:20 PM John Naylor > > wrote: > > > > > > On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada > > > wrote: > > > > + ts->context = CurrentMemoryContext; > > > > > > As far as I can tell, this member is never accessed again -- am I > > > missing something? > > > > You're right. It was used to re-create the tidstore in the same > > context again while resetting it, but we no longer support the reset > > API. Considering it again, would it be better to allocate the iterator > > struct in the same context as we store the tidstore struct? > > That makes sense. > > > > + /* DSA for tidstore will be detached at the end of session */ > > > > > > No other test module pins the mapping, but that doesn't necessarily > > > mean it's wrong. Is there some advantage over explicitly detaching? > > > > One small benefit of not explicitly detaching dsa_area in > > tidstore_destroy() would be simplicity; IIUC if we want to do that, we > > need to remember the dsa_area using (for example) a static variable, > > and free it if it's non-NULL. I've implemented this idea in the > > attached patch. > > Okay, I don't have a strong preference at this point. I'd keep the update on that. > > > > +-- Add tids in random order. > > > > > > I don't see any randomization here. I do remember adding row_number to > > > remove whitespace in the output, but I don't remember a random order. > > > On that subject, the row_number was an easy trick to avoid extra > > > whitespace, but maybe we should just teach the setting function to > > > return blocknumber rather than null? > > > > Good idea, fixed. > > + test_set_block_offsets > + > + 2147483647 > + 0 > + 4294967294 > + 1 > + 4294967295 > > Hmm, was the earlier comment about randomness referring to this? I'm > not sure what other regression tests do in these cases, or how > reliable this is.
If this is a problem we could simply insert this > result into a temp table so it's not output. I didn't address the comment about randomness. I think that we will have both random TIDs tests and fixed TIDs tests in test_tidstore as we discussed, and probably we can do both tests with similar steps; insert TIDs into both a temp table and tidstore and check if the tidstore returned the results as expected by comparing the results to the temp table. Probably we can have a common pl/pgsql function that checks that and raises a WARNING or an ERROR. Given that this is very similar to what we did in test_radixtree, why do we really want to implement it using a pl/pgsql function? When we discussed it before, I found the current way makes sense. But given that we're adding more tests and will add more tests in the future, doing the tests in C will be more maintainable and faster. Also, I think we can do the debug-build array stuff in the test_tidstore code instead. > > > > +Datum > > > +tidstore_create(PG_FUNCTION_ARGS) > > > +{ > > > ... > > > + tidstore = TidStoreCreate(max_bytes, dsa); > > > > > > +Datum > > > +tidstore_set_block_offsets(PG_FUNCTION_ARGS) > > > +{ > > > > > > + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs); > > > > > > These names are too similar. Maybe the test module should do > > > s/tidstore_/test_/ or similar. > > > > Agreed. > > Mostly okay, although a couple look a bit generic now. I'll leave it > up to you if you want to tweak things. > > > > In general, the .sql file is still very hard-coded. Functions are > > > created that contain a VALUES statement. Maybe it's okay for now, but > > > wanted to mention it. Ideally, we'd have some randomized tests, > > > without having to display it. That could be in addition to (not > > > replacing) the small tests we have that display input. (see below) > > > > > > > Agreed to add randomized tests in addition to the existing tests. > > I'll try something tomorrow. > > > Sounds a good idea. 
In fact, if there are some bugs in tidstore, it's > > likely that even initdb would fail in practice. However, it's a very > > good idea that we can test the tidstore anyway with such a check > > without a debug-build array. > > > > Or as another idea, I wonder if we could keep the debug-build array in > > some form. For example, we use the array with the particular build > > flag and set a BF animal for that. That way, we can test the tidstore > > in more real cases. > > I think the purpose of a debug flag is to help developers catch > mistakes. I don't think it's quite useful enough for that. For one, it > has the same 1GB limitation as vacuum's current array. For another, > it'd be a terrible way to debug moving tidbitmap.c from its hash table > to use TID store -- AND/OR operations and lossy pages are pretty much > undoable with a copy of vacuum's array. Valid points. As I mentioned above, if we im
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 11, 2024 at 3:13 PM Masahiko Sawada wrote: > > On Mon, Mar 11, 2024 at 12:20 PM John Naylor wrote: > > > > On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada > > wrote: > > + ts->context = CurrentMemoryContext; > > > > As far as I can tell, this member is never accessed again -- am I > > missing something? > > You're right. It was used to re-create the tidstore in the same > context again while resetting it, but we no longer support the reset > API. Considering it again, would it be better to allocate the iterator > struct in the same context as we store the tidstore struct? That makes sense. > > + /* DSA for tidstore will be detached at the end of session */ > > > > No other test module pins the mapping, but that doesn't necessarily > > mean it's wrong. Is there some advantage over explicitly detaching? > > One small benefit of not explicitly detaching dsa_area in > tidstore_destroy() would be simplicity; IIUC if we want to do that, we > need to remember the dsa_area using (for example) a static variable, > and free it if it's non-NULL. I've implemented this idea in the > attached patch. Okay, I don't have a strong preference at this point. > > +-- Add tids in random order. > > > > I don't see any randomization here. I do remember adding row_number to > > remove whitespace in the output, but I don't remember a random order. > > On that subject, the row_number was an easy trick to avoid extra > > whitespace, but maybe we should just teach the setting function to > > return blocknumber rather than null? > > Good idea, fixed. + test_set_block_offsets + + 2147483647 + 0 + 4294967294 + 1 + 4294967295 Hmm, was the earlier comment about randomness referring to this? I'm not sure what other regression tests do in these cases, or how reliable this is. If this is a problem we could simply insert this result into a temp table so it's not output. > > +Datum > > +tidstore_create(PG_FUNCTION_ARGS) > > +{ > > ...
> > + tidstore = TidStoreCreate(max_bytes, dsa); > > > > +Datum > > +tidstore_set_block_offsets(PG_FUNCTION_ARGS) > > +{ > > > > + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs); > > > > These names are too similar. Maybe the test module should do > > s/tidstore_/test_/ or similar. > > Agreed. Mostly okay, although a couple look a bit generic now. I'll leave it up to you if you want to tweak things. > > In general, the .sql file is still very hard-coded. Functions are > > created that contain a VALUES statement. Maybe it's okay for now, but > > wanted to mention it. Ideally, we'd have some randomized tests, > > without having to display it. That could be in addition to (not > > replacing) the small tests we have that display input. (see below) > > > > Agreed to add randomized tests in addition to the existing tests. I'll try something tomorrow. > Sounds a good idea. In fact, if there are some bugs in tidstore, it's > likely that even initdb would fail in practice. However, it's a very > good idea that we can test the tidstore anyway with such a check > without a debug-build array. > > Or as another idea, I wonder if we could keep the debug-build array in > some form. For example, we use the array with the particular build > flag and set a BF animal for that. That way, we can test the tidstore > in more real cases. I think the purpose of a debug flag is to help developers catch mistakes. I don't think it's quite useful enough for that. For one, it has the same 1GB limitation as vacuum's current array. For another, it'd be a terrible way to debug moving tidbitmap.c from its hash table to use TID store -- AND/OR operations and lossy pages are pretty much undoable with a copy of vacuum's array. Last year, when I insisted on trying a long term realistic load that compares the result with the array, the encoding scheme was much harder to understand in code. I think it's now easier, and there are better tests. 
> In the latest (v69) patch: > > - squashed v68-0005 and v68-0006 patches. > - removed most of the changes in v68-0007 patch. > - addressed above review comments in v69-0002 patch. > - v69-0003, 0004, and 0005 are miscellaneous updates. > > As for renaming TidStore to TIDStore, I dropped the patch for now > since it seems we're using "Tid" in some function names and variable > names. If we want to update it, we can do that later. I think we're not consistent across the codebase, and it's fine to drop that patch. v70-0008: @@ -489,7 +489,7 @@ parallel_vacuum_reset_dead_items(ParallelVacuumState *pvs) /* * Free the current tidstore and return allocated DSA segments to the * operating system. Then we recreate the tidstore with the same max_bytes - * limitation. + * limitation we just used. Nowadays, max_bytes is more like a hint for tidstore, and not a limitation, right? Vacuum has the limitation. Maybe instead of "with", we should say "passing the same limitatio
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 11, 2024 at 5:13 PM Masahiko Sawada wrote: > > In the latest (v69) patch: > > - squashed v68-0005 and v68-0006 patches. > - removed most of the changes in v68-0007 patch. > - addressed above review comments in v69-0002 patch. > - v69-0003, 0004, and 0005 are miscellaneous updates. Since the v69 conflicts with the current HEAD, I've rebased them. In addition, v70-0008 is the new patch, which cleans up the vacuum integration patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v70-ART.tar.gz Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Mar 11, 2024 at 12:20 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada wrote: > > > > I've attached the remaining patches for CI. I've made some minor > > changes in separate patches and drafted the commit message for > > tidstore patch. > > > > While reviewing the tidstore code, I thought that it would be more > > appropriate to place tidstore.c under src/backend/lib instead of > > src/backend/common/access since the tidstore is no longer implemented > > only for heap or other access methods, and it might also be used by > > executor nodes in the future. What do you think? > > That's a heck of a good question. I don't think src/backend/lib is > right -- it seems that's for general-purpose data structures. > Something like backend/utils is also too general. > src/backend/access/common has things for tuple descriptors, toast, > sessions, and I don't think tidstore is out of place here. I'm not > sure there's a better place, but I could be convinced otherwise. Yeah, I agreed that src/backend/lib seems not to be the place for tidstore. Let's keep it in src/backend/access/common. If others think differently, we can move it later. > > v68-0001: > > I'm not sure if commit messages are much a subject of review, and it's > up to the committer, but I'll share a couple comments just as > something to think about, not something I would ask you to change: I > think it's a bit distracting that the commit message talks about the > justification to use it for vacuum. Let's save that for the commit > with actual vacuum changes. Also, I suspect saying there are a "wide > range" of uses is over-selling it a bit, and that paragraph is a bit > awkward aside from that. Thank you for the comment, and I agreed. I've updated the commit message. > > + /* Collect TIDs extracted from the key-value pair */ > + result->num_offsets = 0; > + > > This comment has nothing at all to do with this line. 
If the comment > is for several lines following, some of which are separated by blank > lines, there should be a blank line after the comment. Also, why isn't > tidstore_iter_extract_tids() responsible for setting that to zero? Agreed, fixed. I also updated this part so we set result->blkno in tidstore_iter_extract_tids() too, which seems more readable. > > + ts->context = CurrentMemoryContext; > > As far as I can tell, this member is never accessed again -- am I > missing something? You're right. It was used to re-create the tidstore in the same context again while resetting it, but we no longer support the reset API. Considering it again, would it be better to allocate the iterator struct in the same context as we store the tidstore struct? > > + /* DSA for tidstore will be detached at the end of session */ > > No other test module pins the mapping, but that doesn't necessarily > mean it's wrong. Is there some advantage over explicitly detaching? One small benefit of not explicitly detaching dsa_area in tidstore_destroy() would be simplicity; IIUC if we want to do that, we need to remember the dsa_area using (for example) a static variable, and free it if it's non-NULL. I've implemented this idea in the attached patch. > > +-- Add tids in random order. > > I don't see any randomization here. I do remember adding row_number to > remove whitespace in the output, but I don't remember a random order. > On that subject, the row_number was an easy trick to avoid extra > whitespace, but maybe we should just teach the setting function to > return blocknumber rather than null? Good idea, fixed. > > +Datum > +tidstore_create(PG_FUNCTION_ARGS) > +{ > ... > + tidstore = TidStoreCreate(max_bytes, dsa); > > +Datum > +tidstore_set_block_offsets(PG_FUNCTION_ARGS) > +{ > > + TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs); > > These names are too similar. Maybe the test module should do > s/tidstore_/test_/ or similar. Agreed. 
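The iteration change mentioned above, having tidstore_iter_extract_tids() fill in both the block number and the offsets, amounts to decoding one key-value pair: the radix tree key is the block number and the value is the offset bitmap. The standalone sketch below shows the bitmap-to-offsets half; a real implementation would scan set bits with something like pg_rightmost_one_pos64() rather than testing every bit, so this is illustration only:

```c
#include <stdint.h>

typedef uint64_t bitmapword;
#define BITS_PER_BITMAPWORD 64

/* Collect the offsets encoded in one block's bitmap into *offsets,
 * returning how many set bits were found (even past max_offsets). */
static int
extract_offsets(const bitmapword *words, int nwords,
                uint16_t *offsets, int max_offsets)
{
    int n = 0;

    for (int w = 0; w < nwords; w++)
        for (int b = 0; b < BITS_PER_BITMAPWORD; b++)
            if (words[w] & ((bitmapword) 1 << b))
            {
                if (n < max_offsets)
                    offsets[n] = (uint16_t) (w * BITS_PER_BITMAPWORD + b);
                n++;
            }
    return n;
}
```

Because each bit position maps back to exactly one offset, the iteration output comes out pre-sorted within a block, which is what lets the caller compare it against the sorted insert array.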
> > +/* Sanity check if we've called tidstore_create() */ > +static void > +check_tidstore_available(void) > +{ > + if (tidstore == NULL) > + elog(ERROR, "tidstore is not initialized"); > +} > > I don't find this very helpful. If a developer wiped out the create > call, wouldn't the test crash and burn pretty obviously? Removed. > > In general, the .sql file is still very hard-coded. Functions are > created that contain a VALUES statement. Maybe it's okay for now, but > wanted to mention it. Ideally, we'd have some randomized tests, > without having to display it. That could be in addition to (not > replacing) the small tests we have that display input. (see below) > Agreed to add randomized tests in addition to the existing tests. > > v68-0002: > > @@ -329,6 +381,13 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid) > > ret = (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0; > > +#ifdef TIDSTORE_DEBUG > + if (!TidStoreIsShared(ts)) > + { > + bool ret_debug = ts_debug_is_member(ts, tid);; > + Assert(ret == ret_debug); > + } > +#endif > > This only checking the
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Feb 16, 2024 at 10:05 AM Masahiko Sawada wrote: > > On Thu, Feb 15, 2024 at 8:26 PM John Naylor wrote: > > v61-0007: Runtime-embeddable tids -- Optional for v17, but should > > reduce memory regressions, so should be considered. Up to 3 tids can > > be stored in the last level child pointer. It's not polished, but I'll > > only proceed with that if we think we need this. "flags" is called > > that because it could hold tidbitmap.c booleans (recheck, lossy) in > > the future, in addition to reserving space for the pointer tag. Note: > > I hacked the tests to only have 2 offsets per block to demo, but of > > course both paths should be tested. > > Interesting. I've run the same benchmark tests we did[1][2] (the > median of 3 runs): [found a big speed-up where we don't expect one] I tried to reproduce this (similar patch, but rebased on top of a bug you recently fixed (possibly related?) -- attached, and also shows one way to address some lack of coverage in the debug build, for as long as we test that with CI).
Fortunately I cannot see a difference, so I believe it's not affecting the case in this test at all, as expected: v68: INFO: finished vacuuming "john.public.test": index scans: 1 pages: 0 removed, 442478 remain, 88478 scanned (20.00% of total) tuples: 19995999 removed, 80003979 remain, 0 are dead but not yet removable removable cutoff: 770, which was 0 XIDs old when operation ended frozen: 0 pages from table (0.00% of total) had 0 tuples frozen index scan needed: 88478 pages from table (20.00% of total) had 19995999 dead item identifiers removed index "test_x_idx": pages: 274194 in total, 54822 newly deleted, 54822 currently deleted, 0 reusable avg read rate: 620.356 MB/s, avg write rate: 124.105 MB/s buffer usage: 758236 hits, 274196 misses, 54854 dirtied WAL usage: 2 records, 0 full page images, 425 bytes system usage: CPU: user: 3.74 s, system: 0.68 s, elapsed: 4.45 s system usage: CPU: user: 3.02 s, system: 0.42 s, elapsed: 3.47 s system usage: CPU: user: 3.09 s, system: 0.38 s, elapsed: 3.49 s system usage: CPU: user: 3.00 s, system: 0.43 s, elapsed: 3.45 s v68 + emb values (that cannot be used because > 3 tids per block): INFO: finished vacuuming "john.public.test": index scans: 1 pages: 0 removed, 442478 remain, 88478 scanned (20.00% of total) tuples: 19995999 removed, 80003979 remain, 0 are dead but not yet removable removable cutoff: 775, which was 0 XIDs old when operation ended frozen: 0 pages from table (0.00% of total) had 0 tuples frozen index scan needed: 88478 pages from table (20.00% of total) had 19995999 dead item identifiers removed index "test_x_idx": pages: 274194 in total, 54822 newly deleted, 54822 currently deleted, 0 reusable avg read rate: 570.808 MB/s, avg write rate: 114.192 MB/s buffer usage: 758236 hits, 274196 misses, 54854 dirtied WAL usage: 2 records, 0 full page images, 425 bytes system usage: CPU: user: 3.11 s, system: 0.62 s, elapsed: 3.75 s system usage: CPU: user: 3.04 s, system: 0.41 s, elapsed: 3.46 s system usage: CPU: user: 
3.05 s, system: 0.41 s, elapsed: 3.47 s system usage: CPU: user: 3.04 s, system: 0.43 s, elapsed: 3.49 s I'll continue polishing the runtime-embeddable values patch as time permits, for later consideration. WIP-3-embeddable-TIDs.patch.nocfbot Description: Binary data
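The runtime-embeddable TIDs idea relies on pointer tagging: an 8-byte-aligned child pointer has its low 3 bits free, so one bit can mark a word whose remaining bits carry a few small values inline instead of a pointer. The 3 x 16-bit layout below is a made-up encoding to show the mechanics, not the patch's actual format, and it assumes a 64-bit uintptr_t:

```c
#include <stdint.h>

/* Low bit set = this word embeds values; clear = it is a real pointer
 * (always aligned, so its low bits are naturally zero). */
#define EMBED_TAG 0x1u

static inline int
is_embedded(uintptr_t word)
{
    return (word & EMBED_TAG) != 0;
}

/* Pack three 16-bit offsets into the upper bits, leaving room for the tag. */
static inline uintptr_t
embed_offsets(uint16_t a, uint16_t b, uint16_t c)
{
    return ((uintptr_t) a << 48) | ((uintptr_t) b << 32) |
           ((uintptr_t) c << 16) | EMBED_TAG;
}

/* i = 0..2, in the order passed to embed_offsets(). */
static inline uint16_t
embedded_offset(uintptr_t word, int i)
{
    return (uint16_t) (word >> (48 - 16 * i));
}
```

This is why the quoted mail says the optimization "cannot be used" once a block has more than 3 TIDs: the tagged word overflows and a real child pointer must be allocated instead, so a test with many offsets per block exercises only the pointer path.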
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 10:35 PM Masahiko Sawada wrote:
>
> I've attached the remaining patches for CI. I've made some minor
> changes in separate patches and drafted the commit message for the
> tidstore patch.
>
> While reviewing the tidstore code, I thought that it would be more
> appropriate to place tidstore.c under src/backend/lib instead of
> src/backend/access/common, since the tidstore is no longer implemented
> only for heap or other access methods, and it might also be used by
> executor nodes in the future. What do you think?

That's a heck of a good question. I don't think src/backend/lib is right -- it seems that's for general-purpose data structures. Something like backend/utils is also too general. src/backend/access/common has things for tuple descriptors, toast, sessions, and I don't think tidstore is out of place here. I'm not sure there's a better place, but I could be convinced otherwise.

v68-0001:

I'm not sure commit messages are much of a subject for review, and it's up to the committer, but I'll share a couple of comments just as something to think about, not something I would ask you to change: I think it's a bit distracting that the commit message talks about the justification for using it for vacuum. Let's save that for the commit with the actual vacuum changes. Also, I suspect saying there is a "wide range" of uses is over-selling it a bit, and that paragraph is a bit awkward aside from that.

+ /* Collect TIDs extracted from the key-value pair */
+ result->num_offsets = 0;
+

This comment has nothing at all to do with this line. If the comment is for several lines following, some of which are separated by blank lines, there should be a blank line after the comment. Also, why isn't tidstore_iter_extract_tids() responsible for setting that to zero?

+ ts->context = CurrentMemoryContext;

As far as I can tell, this member is never accessed again -- am I missing something?
+ /* DSA for tidstore will be detached at the end of session */

No other test module pins the mapping, but that doesn't necessarily mean it's wrong. Is there some advantage over explicitly detaching?

+-- Add tids in random order.

I don't see any randomization here. I do remember adding row_number to remove whitespace in the output, but I don't remember a random order. On that subject, row_number was an easy trick to avoid extra whitespace, but maybe we should just teach the setting function to return the block number rather than null?

+Datum
+tidstore_create(PG_FUNCTION_ARGS)
+{
...
+ tidstore = TidStoreCreate(max_bytes, dsa);

+Datum
+tidstore_set_block_offsets(PG_FUNCTION_ARGS)
+{
+ TidStoreSetBlockOffsets(tidstore, blkno, offs, noffs);

These names are too similar. Maybe the test module should do s/tidstore_/test_/ or similar.

+/* Sanity check if we've called tidstore_create() */
+static void
+check_tidstore_available(void)
+{
+ if (tidstore == NULL)
+ elog(ERROR, "tidstore is not initialized");
+}

I don't find this very helpful. If a developer wiped out the create call, wouldn't the test crash and burn pretty obviously?

In general, the .sql file is still very hard-coded. Functions are created that contain a VALUES statement. Maybe it's okay for now, but I wanted to mention it. Ideally, we'd have some randomized tests, without having to display the input. That could be in addition to (not replacing) the small tests we have that display input. (see below)

v68-0002:

@@ -329,6 +381,13 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
  ret = (page->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0;

+#ifdef TIDSTORE_DEBUG
+ if (!TidStoreIsShared(ts))
+ {
+ bool ret_debug = ts_debug_is_member(ts, tid);;
+ Assert(ret == ret_debug);
+ }
+#endif

This is only checking the case where we haven't returned already. In particular...
+ /* no entry for the blk */
+ if (page == NULL)
+ return false;
+
+ wordnum = WORDNUM(off);
+ bitnum = BITNUM(off);
+
+ /* no bitmap for the off */
+ if (wordnum >= page->nwords)
+ return false;

...these results are not checked. More broadly, it seems like the test module should be able to test everything that the debug-build array would complain about, including ordered iteration. This may require first saving our test input to a table. We could create a cursor on a query that fetches the ordered input from the table and verifies that the tidstore iteration produces the same ordered set, maybe with pl/pgSQL. Or something like that. Seems like not a whole lot of work. I can try later in the week, if you like.

v68-0005/6 look ready to squash

v68-0008 - I'm not a fan of capitalizing short comment fragments. I use one of two styles: short lower-case phrases, or full sentences including capitalization, correct grammar, and a period. I see these two styles all over the code base, as appropriate.

+ /* Remain attached until end of backend */

We'll probably want this comment, if in fact we want this behavior.

+ /*
+ * Note that funcctx->call_cntr is incremented in SRF_RETURN_NEXT
+ * before return.
+ */
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 9:53 AM John Naylor wrote:
>
> On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada wrote:
> > Yesterday I confirmed that something like the below fixes the
> > problem that happened in Windows CI:
> >
> > --- a/src/test/modules/test_radixtree/meson.build
> > +++ b/src/test/modules/test_radixtree/meson.build
> > @@ -12,6 +12,7 @@ endif
> >
> > test_radixtree = shared_module('test_radixtree',
> >    test_radixtree_sources,
> > +   link_with: host_system == 'windows' ? pgport_srv : [],
> >    kwargs: pg_test_mod_args,
> > )
> > test_install_libs += test_radixtree
>
> pgport_srv is for the backend; shared libraries should be using pgport_shlib
>
> In any case, I'm trying it in a CI branch with pgport_shlib now.

That seems to work, so I'll push that just to get things green again.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada wrote:
> Yesterday I confirmed that something like the below fixes the
> problem that happened in Windows CI:
>
> --- a/src/test/modules/test_radixtree/meson.build
> +++ b/src/test/modules/test_radixtree/meson.build
> @@ -12,6 +12,7 @@ endif
>
> test_radixtree = shared_module('test_radixtree',
>    test_radixtree_sources,
> +   link_with: host_system == 'windows' ? pgport_srv : [],
>    kwargs: pg_test_mod_args,
> )
> test_install_libs += test_radixtree

pgport_srv is for the backend; shared libraries should be using pgport_shlib

Further, the top-level meson.build has:

# all shared libraries not part of the backend should depend on this
frontend_shlib_code = declare_dependency(
  include_directories: [postgres_inc],
  link_with: [common_shlib, pgport_shlib],
  sources: generated_headers,
  dependencies: [shlib_code, os_deps, libintl],
)

...but the only things that declare needing frontend_shlib_code are in src/interfaces/. In any case, I'm trying it in a CI branch with pgport_shlib now.
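For concreteness, here is what the proposed hunk might look like with the shlib variant of libpgport substituted, per the observation above. This is only a sketch of the direction being discussed, not the committed form:

```meson
test_radixtree = shared_module('test_radixtree',
  test_radixtree_sources,
  # shared libraries should link the shlib build of libpgport
  # (pgport_shlib), not the backend's pgport_srv
  link_with: host_system == 'windows' ? pgport_shlib : [],
  kwargs: pg_test_mod_args,
)
test_install_libs += test_radixtree
```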
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 8:09 AM Masahiko Sawada wrote:
>
> Yesterday I confirmed that something like the below fixes the
> problem that happened in Windows CI:

Glad you shared before I went and did it.

> --- a/src/test/modules/test_radixtree/meson.build
> +++ b/src/test/modules/test_radixtree/meson.build
> @@ -12,6 +12,7 @@ endif
>
> test_radixtree = shared_module('test_radixtree',
>    test_radixtree_sources,
> +   link_with: host_system == 'windows' ? pgport_srv : [],

I don't see any similar coding elsewhere, so that leaves me wondering if we're missing something. On the other hand, maybe no test modules use files in src/port ...

>    kwargs: pg_test_mod_args,
> )
> test_install_libs += test_radixtree
>
> But I'm not sure it's the right fix, especially because I guess it
> could raise an "AddressSanitizer: odr-violation" error on Windows.

Well, it's now at zero definitions that it can see, so I imagine it's possible that adding the above would not cause more than one. In any case, we might not know, since as far as I can tell the MSVC animals don't have address sanitizer. I'll look around some more, and if I don't get any revelations, I guess we should go with the above.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Mar 8, 2024 at 10:04 AM John Naylor wrote:
>
> On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada wrote:
> >
> > It looks like it requires a link with pgport_srv but I'm not sure. It
> > seems that the recent commit 1f1d73a8b breaks CI, Windows - Server
> > 2019, VS 2019 - Meson & ninja, too.
>
> Unfortunately, none of the Windows animals happened to run both after
> the initial commit and before removing the (seemingly useless on our
> daily platforms) link. I'll confirm on my own CI branch in a few
> minutes.

Yesterday I confirmed that something like the below fixes the problem that happened in Windows CI:

--- a/src/test/modules/test_radixtree/meson.build
+++ b/src/test/modules/test_radixtree/meson.build
@@ -12,6 +12,7 @@ endif

test_radixtree = shared_module('test_radixtree',
   test_radixtree_sources,
+  link_with: host_system == 'windows' ? pgport_srv : [],
   kwargs: pg_test_mod_args,
)
test_install_libs += test_radixtree

But I'm not sure it's the right fix, especially because I guess it could raise an "AddressSanitizer: odr-violation" error on Windows.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 11:15 PM Masahiko Sawada wrote:
>
> It looks like it requires a link with pgport_srv but I'm not sure. It
> seems that the recent commit 1f1d73a8b breaks CI, Windows - Server
> 2019, VS 2019 - Meson & ninja, too.

Unfortunately, none of the Windows animals happened to run both after the initial commit and before removing the (seemingly useless on our daily platforms) link. I'll confirm on my own CI branch in a few minutes.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 8:06 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada wrote: > > > > On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote: > > > > $ git grep 'link_with: pgport_srv' > > > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, > > > > > > No other test module uses this directive, and indeed, removing this > > > still builds fine for me. Thoughts? > > > > Yeah, it could be the culprit. The test_radixtree/meson.build is the > > sole extension that explicitly specifies a link with pgport_srv. I > > think we can get rid of it as I've also confirmed the build still fine > > even without it. > > olingo and grassquit have turned green, so that must have been it. fairywren is complaining another build failure: [1931/2156] "gcc" -o src/test/modules/test_radixtree/test_radixtree.dll src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj "-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group" "-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a" "-Wl,--stack,4194304" "-Wl,--allow-multiple-definition" "-Wl,--disable-auto-import" "-fvisibility=hidden" "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a" "-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm" "-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32" "-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group" FAILED: src/test/modules/test_radixtree/test_radixtree.dll "gcc" -o src/test/modules/test_radixtree/test_radixtree.dll src/test/modules/test_radixtree/test_radixtree.dll.p/win32ver.obj src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj "-Wl,--allow-shlib-undefined" "-shared" "-Wl,--start-group" 
"-Wl,--out-implib=src/test\\modules\\test_radixtree\\test_radixtree.dll.a" "-Wl,--stack,4194304" "-Wl,--allow-multiple-definition" "-Wl,--disable-auto-import" "-fvisibility=hidden" "C:/tools/nmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/src/backend/libpostgres.exe.a" "-pthread" "C:/tools/nmsys64/ucrt64/bin/../lib/libssl.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libcrypto.dll.a" "C:/tools/nmsys64/ucrt64/bin/../lib/libz.dll.a" "-lws2_32" "-lm" "-lkernel32" "-luser32" "-lgdi32" "-lwinspool" "-lshell32" "-lole32" "-loleaut32" "-luuid" "-lcomdlg32" "-ladvapi32" "-Wl,--end-group" C:/tools/nmsys64/ucrt64/bin/../lib/gcc/x86_64-w64-mingw32/12.2.0/../../../../x86_64-w64-mingw32/bin/ld.exe: src/test/modules/test_radixtree/test_radixtree.dll.p/test_radixtree.c.obj:test_radixtree:(.rdata$.refptr.pg_popcount64[.refptr.pg_popcount64]+0x0): undefined reference to `pg_popcount64' It looks like it requires a link with pgport_srv but I'm not sure. It seems that the recent commit 1f1d73a8b breaks CI, Windows - Server 2019, VS 2019 - Meson & ninja, too. Regards, [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-03-07%2012%3A53%3A20 -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 8:06 PM John Naylor wrote:
>
> On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada wrote:
> >
> > On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote:
> > > > $ git grep 'link_with: pgport_srv'
> > > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv,
> > >
> > > No other test module uses this directive, and indeed, removing this
> > > still builds fine for me. Thoughts?
> >
> > Yeah, it could be the culprit. The test_radixtree/meson.build is the
> > sole extension that explicitly specifies a link with pgport_srv. I
> > think we can get rid of it, as I've also confirmed that the build is
> > still fine even without it.
>
> olingo and grassquit have turned green, so that must have been it.

Cool!

I've attached the remaining patches for CI. I've made some minor changes in separate patches and drafted the commit message for the tidstore patch.

While reviewing the tidstore code, I thought that it would be more appropriate to place tidstore.c under src/backend/lib instead of src/backend/access/common, since the tidstore is no longer implemented only for heap or other access methods, and it might also be used by executor nodes in the future. What do you think?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

v68-ART.tar.gz
Description: GNU Zip compressed data
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 4:47 PM Masahiko Sawada wrote:
>
> On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote:
> > $ git grep 'link_with: pgport_srv'
> > src/test/modules/test_radixtree/meson.build: link_with: pgport_srv,
> >
> > No other test module uses this directive, and indeed, removing this
> > still builds fine for me. Thoughts?
>
> Yeah, it could be the culprit. The test_radixtree/meson.build is the
> sole extension that explicitly specifies a link with pgport_srv. I
> think we can get rid of it, as I've also confirmed that the build is
> still fine even without it.

olingo and grassquit have turned green, so that must have been it.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 6:37 PM John Naylor wrote:
>
> On Thu, Mar 7, 2024 at 1:19 PM John Naylor wrote:
> >
> > In addition, olingo and grassquit are showing different kinds of
> > "AddressSanitizer: odr-violation" errors, which I'm not sure what to
> > make of -- example:
>
> This might be relevant:
>
> $ git grep 'link_with: pgport_srv'
> src/test/modules/test_radixtree/meson.build: link_with: pgport_srv,
>
> No other test module uses this directive, and indeed, removing this
> still builds fine for me. Thoughts?

Yeah, it could be the culprit. The test_radixtree/meson.build is the sole extension that explicitly specifies a link with pgport_srv. I think we can get rid of it, as I've also confirmed that the build is still fine even without it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 1:19 PM John Naylor wrote: > > In addition, olingo and grassquit are showing different kinds of > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > make of -- example: This might be relevant: $ git grep 'link_with: pgport_srv' src/test/modules/test_radixtree/meson.build: link_with: pgport_srv, No other test module uses this directive, and indeed, removing this still builds fine for me. Thoughts?
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 1:49 PM Masahiko Sawada wrote: > odr-violation seems to refer to One Definition Rule (ODR). According > to Wikipedia[1]: > > The One Definition Rule (ODR) is an important rule of the C++ > programming language that prescribes that classes/structs and > non-inline functions cannot have more than one definition in the > entire program and template and types cannot have more than one > definition by translation unit. It is defined in the ISO C++ Standard > (ISO/IEC 14882) 2003, at section 3.2. Some other programming languages > have similar but differently defined rules towards the same objective. > > I don't fully understand this concept yet but are these two different > build failures related? I thought it may have something to do with the prerequisite commit that moved some symbols from bitmapset.c to .h: /* Select appropriate bit-twiddling functions for bitmap word size */ #if BITS_PER_BITMAPWORD == 32 #define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w) #define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w) #define bmw_popcount(w) pg_popcount32(w) #elif BITS_PER_BITMAPWORD == 64 #define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w) #define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w) #define bmw_popcount(w) pg_popcount64(w) #else #error "invalid BITS_PER_BITMAPWORD" #endif ...but olingo's error seems strange to me, because it is complaining of pg_leftmost_one_pos, which refers to the lookup table in pg_bitutils.c -- I thought all buildfarm members used the bitscan instructions. grassquit is complaining of pg_popcount64, which is a global function, also in pg_bitutils.c. Not sure what to make of this, since we're just pointing symbols at things which should have a single definition...
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 4:21 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 2:14 PM Masahiko Sawada wrote: > > > > On Thu, Mar 7, 2024 at 4:01 PM John Naylor wrote: > > > > > > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada > > > wrote: > > > > > > > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor > > > > wrote: > > > > > > > > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor > > > > > wrote: > > > > > > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > > > > > > feature [-Werror,-Wtypedef-redefinition]" > > > > > > > > > > > > I'll look in the other templates to see if what they do. > > > > > > > > > > Their "declare" sections have full typedefs. I found it works to leave > > > > > out the typedef for the "define" section, but I first want to > > > > > reproduce the build failure. > > > > > > > > Right. I've reproduced this build failure on my machine by specifying > > > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the > > > > below change seems to fix the problem: > > > > > > Confirmed, will push shortly. 
> > > > mamba complained different build errors[1]: > > > > 2740 | fprintf(stderr, "num_keys = %ld\\n", tree->ctl->num_keys); > > | ~~^ ~~~ > > || | > > |long int int64 {aka long long > > int} > > | %lld > > ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld' > > expects argument of type 'long int', but argument 4 has type 'int64' > > {aka 'long long int'} [-Werror=format=] > > 2752 | fprintf(stderr, ", n%d = %ld", size_class.fanout, > > tree->ctl->num_nodes[i]); > > |~~^ > > ~~~ > > | | > > | > > | long int > > int64 {aka long long int} > > |%lld > > ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld' > > expects argument of type 'long int', but argument 3 has type 'int64' > > {aka 'long long int'} [-Werror=format=] > > 2755 | fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves); > > | ~~^ ~ > > ||| > > |long int int64 {aka long long > > int} > > | %lld > > > > Regards, > > > > [1] > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18 > > Yeah, the attached fixes it for me. Thanks, LGTM. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 2:14 PM Masahiko Sawada wrote: > > On Thu, Mar 7, 2024 at 4:01 PM John Naylor wrote: > > > > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada > > wrote: > > > > > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor > > > wrote: > > > > > > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor > > > > wrote: > > > > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > > > > > feature [-Werror,-Wtypedef-redefinition]" > > > > > > > > > > I'll look in the other templates to see if what they do. > > > > > > > > Their "declare" sections have full typedefs. I found it works to leave > > > > out the typedef for the "define" section, but I first want to > > > > reproduce the build failure. > > > > > > Right. I've reproduced this build failure on my machine by specifying > > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the > > > below change seems to fix the problem: > > > > Confirmed, will push shortly. > > mamba complained different build errors[1]: > > 2740 | fprintf(stderr, "num_keys = %ld\\n", tree->ctl->num_keys); > | ~~^ ~~~ > || | > |long int int64 {aka long long > int} > | %lld > ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld' > expects argument of type 'long int', but argument 4 has type 'int64' > {aka 'long long int'} [-Werror=format=] > 2752 | fprintf(stderr, ", n%d = %ld", size_class.fanout, > tree->ctl->num_nodes[i]); > |~~^ > ~~~ > | | > | > | long int > int64 {aka long long int} > |%lld > ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld' > expects argument of type 'long int', but argument 3 has type 'int64' > {aka 'long long int'} [-Werror=format=] > 2755 | fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves); > | ~~^ ~ > ||| > |long int int64 {aka long long int} > | %lld > > Regards, > > [1] > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18 Yeah, the attached fixes it for me. 
diff --git a/src/include/lib/radixtree.h b/src/include/lib/radixtree.h index 93e6a7d809..b8ad51c14d 100644 --- a/src/include/lib/radixtree.h +++ b/src/include/lib/radixtree.h @@ -2737,7 +2737,7 @@ RT_SCOPE void RT_STATS(RT_RADIX_TREE * tree) { fprintf(stderr, "max_val = " UINT64_FORMAT "\n", tree->ctl->max_val); - fprintf(stderr, "num_keys = %ld\n", tree->ctl->num_keys); + fprintf(stderr, "num_keys = %lld\n", (long long) tree->ctl->num_keys); #ifdef RT_SHMEM fprintf(stderr, "handle = " DSA_POINTER_FORMAT "\n", tree->ctl->handle); @@ -2749,10 +2749,10 @@ RT_STATS(RT_RADIX_TREE * tree) { RT_SIZE_CLASS_ELEM size_class = RT_SIZE_CLASS_INFO[i]; - fprintf(stderr, ", n%d = %ld", size_class.fanout, tree->ctl->num_nodes[i]); + fprintf(stderr, ", n%d = %lld", size_class.fanout, (long long) tree->ctl->num_nodes[i]); } - fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves); + fprintf(stderr, ", leaves = %lld", (long long) tree->ctl->num_leaves); fprintf(stderr, "\n"); }
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 4:01 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada wrote: > > > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor wrote: > > > > > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor > > > wrote: > > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > > > > feature [-Werror,-Wtypedef-redefinition]" > > > > > > > > I'll look in the other templates to see if what they do. > > > > > > Their "declare" sections have full typedefs. I found it works to leave > > > out the typedef for the "define" section, but I first want to > > > reproduce the build failure. > > > > Right. I've reproduced this build failure on my machine by specifying > > flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the > > below change seems to fix the problem: > > Confirmed, will push shortly. mamba complained different build errors[1]: 2740 | fprintf(stderr, "num_keys = %ld\\n", tree->ctl->num_keys); | ~~^ ~~~ || | |long int int64 {aka long long int} | %lld ../../../../src/include/lib/radixtree.h:2752:30: error: format '%ld' expects argument of type 'long int', but argument 4 has type 'int64' {aka 'long long int'} [-Werror=format=] 2752 | fprintf(stderr, ", n%d = %ld", size_class.fanout, tree->ctl->num_nodes[i]); |~~^ ~~~ | | | | long int int64 {aka long long int} |%lld ../../../../src/include/lib/radixtree.h:2755:32: error: format '%ld' expects argument of type 'long int', but argument 3 has type 'int64' {aka 'long long int'} [-Werror=format=] 2755 | fprintf(stderr, ", leaves = %ld", tree->ctl->num_leaves); | ~~^ ~ ||| |long int int64 {aka long long int} | %lld Regards, [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-07%2006%3A05%3A18 -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 1:27 PM Masahiko Sawada wrote:
>
> On Thu, Mar 7, 2024 at 3:20 PM John Naylor wrote:
> >
> > On Thu, Mar 7, 2024 at 12:59 PM John Naylor wrote:
> > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11
> > > feature [-Werror,-Wtypedef-redefinition]"
> > >
> > > I'll look in the other templates to see what they do.
> >
> > Their "declare" sections have full typedefs. I found it works to leave
> > out the typedef for the "define" section, but I first want to
> > reproduce the build failure.
>
> Right. I've reproduced this build failure on my machine by specifying
> the flags "-Wtypedef-redefinition -std=gnu99" to clang. Something like
> the below change seems to fix the problem:

Confirmed, will push shortly.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 3:27 PM Masahiko Sawada wrote: > > On Thu, Mar 7, 2024 at 3:20 PM John Naylor wrote: > > > > > > In addition, olingo and grassquit are showing different kinds of > > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > > make of -- example: odr-violation seems to refer to One Definition Rule (ODR). According to Wikipedia[1]: The One Definition Rule (ODR) is an important rule of the C++ programming language that prescribes that classes/structs and non-inline functions cannot have more than one definition in the entire program and template and types cannot have more than one definition by translation unit. It is defined in the ISO C++ Standard (ISO/IEC 14882) 2003, at section 3.2. Some other programming languages have similar but differently defined rules towards the same objective. I don't fully understand this concept yet but are these two different build failures related? Regards, [1] https://en.wikipedia.org/wiki/One_Definition_Rule -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 3:20 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 12:59 PM John Naylor wrote: > > > > On Thu, Mar 7, 2024 at 12:55 PM John Naylor wrote: > > > > > > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada > > > wrote: > > > > > > > > > + /* Find the control object in shared memory */ > > > > > + control = handle; > > > > > > > > I think it's mostly because of readability; it makes clear that the > > > > handle should be castable to dsa_pointer and it's a control object. I > > > > borrowed it from dshash_attach(). > > > > > > I find that a bit strange, but I went ahead and kept it. > > > > > > > > > > > > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada > > > wrote: > > > > > > > The 0001 patch looks good to me. I have some minor comments: > > > > > > > +PGFILEDESC = "test_radixtree - test code for > > > > src/backend/lib/radixtree.c" > > > > + > > > > > > > > "src/backend/lib/radixtree.c" should be updated to > > > > "src/include/lib/radixtree.h". > > > > > > Done. > > > > > > > --- /dev/null > > > > +++ b/src/test/modules/test_radixtree/README > > > > @@ -0,0 +1,7 @@ > > > > +test_integerset contains unit tests for testing the integer set > > > > implementation > > > > +in src/backend/lib/integerset.c. > > > > + > > > > +The tests verify the correctness of the implementation, but they can > > > > also be > > > > +used as a micro-benchmark. If you set the 'intset_test_stats' flag in > > > > +test_integerset.c, the tests will print extra information about > > > > execution time > > > > +and memory usage. > > > > > > > > This file is not updated for test_radixtree. I think we can remove it > > > > as the test cases in test_radixtree are clear. > > > > > > Done. I pushed this with a few last-minute cosmetic adjustments. This > > > has been a very long time coming, but we're finally in the home > > > stretch! > > > > > > Already, I see sifaka doesn't like this, and I'm looking now... > > > > It's complaining that these forward declarations... 
> > > > /* generate forward declarations necessary to use the radix tree */ > > #ifdef RT_DECLARE > > > > typedef struct RT_RADIX_TREE RT_RADIX_TREE; > > typedef struct RT_ITER RT_ITER; > > > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > > feature [-Werror,-Wtypedef-redefinition]" > > > > I'll look in the other templates to see if what they do. > > Their "declare" sections have full typedefs. I found it works to leave > out the typedef for the "define" section, but I first want to > reproduce the build failure. Right. I've reproduced this build failure on my machine by specifying flags "-Wtypedef-redefinition -std=gnu99" to clang. Something the below change seems to fix the problem: --- a/src/include/lib/radixtree.h +++ b/src/include/lib/radixtree.h @@ -676,7 +676,7 @@ typedef struct RT_RADIX_TREE_CONTROL } RT_RADIX_TREE_CONTROL; /* Entry point for allocating and accessing the tree */ -typedef struct RT_RADIX_TREE +struct RT_RADIX_TREE { MemoryContext context; @@ -691,7 +691,7 @@ typedef struct RT_RADIX_TREE /* leaf_context is used only for single-value leaves */ MemoryContextData *leaf_context; #endif -} RT_RADIX_TREE; +}; /* * Iteration support. 
@@ -714,7 +714,7 @@ typedef struct RT_NODE_ITER } RT_NODE_ITER; /* state for iterating over the whole radix tree */ -typedef struct RT_ITER +struct RT_ITER { RT_RADIX_TREE *tree; @@ -728,7 +728,7 @@ typedef struct RT_ITER /* The key constructed during iteration */ uint64 key; -} RT_ITER; +}; /* verification (available only in assert-enabled builds) */ > > In addition, olingo and grassquit are showing different kinds of > "AddressSanitizer: odr-violation" errors, which I'm not sure what to > make of -- example: > > ==1862767==ERROR: AddressSanitizer: odr-violation (0x7fc257476b60): > [1] size=256 'pg_leftmost_one_pos' > /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34 > [2] size=256 'pg_leftmost_one_pos' > /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34 > These globals were registered at these points: > [1]: > #0 0x563564b97bf6 in __asan_register_globals > (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6) > (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) > #1 0x563564b98d1d in __asan_register_elf_globals > (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d) > (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) > #2 0x7fc265c3fe3d in call_init elf/dl-init.c:74:3 > #3 0x7fc265c3fe3d in call_init elf/dl-init.c:26:1 > > [2]: > #0 0x563564b97bf6 in __asan_register_globals > (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6) > (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) > #1 0x563564b98d1d in __asan_register_elf_globals
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 12:59 PM John Naylor wrote: > > On Thu, Mar 7, 2024 at 12:55 PM John Naylor wrote: > > > > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada > > wrote: > > > > > > > + /* Find the control object in shared memory */ > > > > + control = handle; > > > > > > I think it's mostly because of readability; it makes clear that the > > > handle should be castable to dsa_pointer and it's a control object. I > > > borrowed it from dshash_attach(). > > > > I find that a bit strange, but I went ahead and kept it. > > > > > > > > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada > > wrote: > > > > > The 0001 patch looks good to me. I have some minor comments: > > > > > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" > > > + > > > > > > "src/backend/lib/radixtree.c" should be updated to > > > "src/include/lib/radixtree.h". > > > > Done. > > > > > --- /dev/null > > > +++ b/src/test/modules/test_radixtree/README > > > @@ -0,0 +1,7 @@ > > > +test_integerset contains unit tests for testing the integer set > > > implementation > > > +in src/backend/lib/integerset.c. > > > + > > > +The tests verify the correctness of the implementation, but they can > > > also be > > > +used as a micro-benchmark. If you set the 'intset_test_stats' flag in > > > +test_integerset.c, the tests will print extra information about > > > execution time > > > +and memory usage. > > > > > > This file is not updated for test_radixtree. I think we can remove it > > > as the test cases in test_radixtree are clear. > > > > Done. I pushed this with a few last-minute cosmetic adjustments. This > > has been a very long time coming, but we're finally in the home > > stretch! > > > > Already, I see sifaka doesn't like this, and I'm looking now... > > It's complaining that these forward declarations... 
> > /* generate forward declarations necessary to use the radix tree */ > #ifdef RT_DECLARE > > typedef struct RT_RADIX_TREE RT_RADIX_TREE; > typedef struct RT_ITER RT_ITER; > > ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 > feature [-Werror,-Wtypedef-redefinition]" > > I'll look in the other templates to see what they do. Their "declare" sections have full typedefs. I found it works to leave out the typedef for the "define" section, but I first want to reproduce the build failure. In addition, olingo and grassquit are showing different kinds of "AddressSanitizer: odr-violation" errors, which I'm not sure what to make of -- example: ==1862767==ERROR: AddressSanitizer: odr-violation (0x7fc257476b60): [1] size=256 'pg_leftmost_one_pos' /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34 [2] size=256 'pg_leftmost_one_pos' /home/bf/bf-build/olingo/HEAD/pgsql.build/../pgsql/src/port/pg_bitutils.c:34 These globals were registered at these points: [1]: #0 0x563564b97bf6 in __asan_register_globals (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6) (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) #1 0x563564b98d1d in __asan_register_elf_globals (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d) (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) #2 0x7fc265c3fe3d in call_init elf/dl-init.c:74:3 #3 0x7fc265c3fe3d in call_init elf/dl-init.c:26:1 [2]: #0 0x563564b97bf6 in __asan_register_globals (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e2bf6) (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) #1 0x563564b98d1d in __asan_register_elf_globals (/home/bf/bf-build/olingo/HEAD/pgsql.build/tmp_install/home/bf/bf-build/olingo/HEAD/inst/bin/postgres+0x3e3d1d) (BuildId: e2ff70bf14f342e03f451bba119134a49a50b8b8) #2 0x7fc2649847f5 in call_init 
csu/../csu/libc-start.c:145:3 #3 0x7fc2649847f5 in __libc_start_main csu/../csu/libc-start.c:347:5
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Mar 7, 2024 at 12:55 PM John Naylor wrote: > > On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada wrote: > > > > > + /* Find the control object in shared memory */ > > > + control = handle; > > > > I think it's mostly because of readability; it makes clear that the > > handle should be castable to dsa_pointer and it's a control object. I > > borrowed it from dshash_attach(). > > I find that a bit strange, but I went ahead and kept it. > > > > On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada wrote: > > > The 0001 patch looks good to me. I have some minor comments: > > > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" > > + > > > > "src/backend/lib/radixtree.c" should be updated to > > "src/include/lib/radixtree.h". > > Done. > > > --- /dev/null > > +++ b/src/test/modules/test_radixtree/README > > @@ -0,0 +1,7 @@ > > +test_integerset contains unit tests for testing the integer set > > implementation > > +in src/backend/lib/integerset.c. > > + > > +The tests verify the correctness of the implementation, but they can also > > be > > +used as a micro-benchmark. If you set the 'intset_test_stats' flag in > > +test_integerset.c, the tests will print extra information about execution > > time > > +and memory usage. > > > > This file is not updated for test_radixtree. I think we can remove it > > as the test cases in test_radixtree are clear. > > Done. I pushed this with a few last-minute cosmetic adjustments. This > has been a very long time coming, but we're finally in the home > stretch! > > Already, I see sifaka doesn't like this, and I'm looking now... It's complaining that these forward declarations... /* generate forward declarations necessary to use the radix tree */ #ifdef RT_DECLARE typedef struct RT_RADIX_TREE RT_RADIX_TREE; typedef struct RT_ITER RT_ITER; ... cause "error: redefinition of typedef 'rt_radix_tree' is a C11 feature [-Werror,-Wtypedef-redefinition]" I'll look in the other templates to see what they do.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 6:59 PM Masahiko Sawada wrote: > > > + /* Find the control object in shared memory */ > > + control = handle; > > I think it's mostly because of readability; it makes clear that the > handle should be castable to dsa_pointer and it's a control object. I > borrowed it from dshash_attach(). I find that a bit strange, but I went ahead and kept it. On Wed, Mar 6, 2024 at 9:13 PM Masahiko Sawada wrote: > The 0001 patch looks good to me. I have some minor comments: > +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" > + > > "src/backend/lib/radixtree.c" should be updated to > "src/include/lib/radixtree.h". Done. > --- /dev/null > +++ b/src/test/modules/test_radixtree/README > @@ -0,0 +1,7 @@ > +test_integerset contains unit tests for testing the integer set > implementation > +in src/backend/lib/integerset.c. > + > +The tests verify the correctness of the implementation, but they can also be > +used as a micro-benchmark. If you set the 'intset_test_stats' flag in > +test_integerset.c, the tests will print extra information about execution > time > +and memory usage. > > This file is not updated for test_radixtree. I think we can remove it > as the test cases in test_radixtree are clear. Done. I pushed this with a few last-minute cosmetic adjustments. This has been a very long time coming, but we're finally in the home stretch! Already, I see sifaka doesn't like this, and I'm looking now...
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 8:20 PM John Naylor wrote: > > On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada wrote: > > > > > I'd like to push 0001 and 0002 shortly, and then do another sweep over > > > 0003, with remaining feedback, and get that in so we get some > > > buildfarm testing before the remaining polishing work on > > > tidstore/vacuum. > > > > Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to > > polish tidstore and vacuum patches and update commit messages. > > I don't think v66 got a CI run because of vacuumlazy.c bitrot, so I'm > attaching v67 which fixes that and has some small cosmetic adjustments > to the template. Thank you for updating the patch. > One functional change for debug builds is that > RT_STATS now prints out the number of leaves. I'll squash and push > 0001 tomorrow morning unless there are further comments. The 0001 patch looks good to me. I have some minor comments: --- /dev/null +++ b/src/test/modules/test_radixtree/Makefile @@ -0,0 +1,23 @@ +# src/test/modules/test_radixtree/Makefile + +MODULE_big = test_radixtree +OBJS = \ + $(WIN32RES) \ + test_radixtree.o +PGFILEDESC = "test_radixtree - test code for src/backend/lib/radixtree.c" + "src/backend/lib/radixtree.c" should be updated to "src/include/lib/radixtree.h". --- --- /dev/null +++ b/src/test/modules/test_radixtree/README @@ -0,0 +1,7 @@ +test_integerset contains unit tests for testing the integer set implementation +in src/backend/lib/integerset.c. + +The tests verify the correctness of the implementation, but they can also be +used as a micro-benchmark. If you set the 'intset_test_stats' flag in +test_integerset.c, the tests will print extra information about execution time +and memory usage. This file is not updated for test_radixtree. I think we can remove it as the test cases in test_radixtree are clear. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 8:25 PM John Naylor wrote: > > Actually, I forgot -- I had one more question: Masahiko, is there a > reason for this extra local variable, which uses the base type, rather > than the typedef'd parameter? > > +RT_SCOPE RT_RADIX_TREE * > +RT_ATTACH(dsa_area *dsa, RT_HANDLE handle) > +{ > + RT_RADIX_TREE *tree; > + dsa_pointer control; > + > + tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE)); > + > + /* Find the control object in shared memory */ > + control = handle; I think it's mostly because of readability; it makes clear that the handle should be castable to dsa_pointer and it's a control object. I borrowed it from dshash_attach(). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
Actually, I forgot -- I had one more question: Masahiko, is there a reason for this extra local variable, which uses the base type, rather than the typedef'd parameter? +RT_SCOPE RT_RADIX_TREE * +RT_ATTACH(dsa_area *dsa, RT_HANDLE handle) +{ + RT_RADIX_TREE *tree; + dsa_pointer control; + + tree = (RT_RADIX_TREE *) palloc0(sizeof(RT_RADIX_TREE)); + + /* Find the control object in shared memory */ + control = handle;
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada wrote: > > > I'd like to push 0001 and 0002 shortly, and then do another sweep over > > 0003, with remaining feedback, and get that in so we get some > > buildfarm testing before the remaining polishing work on > > tidstore/vacuum. > > Sounds like a reasonable plan. 0001 and 0002 look good to me. I'm going to > polish tidstore and vacuum patches and update commit messages. I don't think v66 got a CI run because of vacuumlazy.c bitrot, so I'm attaching v67 which fixes that and has some small cosmetic adjustments to the template. One functional change for debug builds is that RT_STATS now prints out the number of leaves. I'll squash and push 0001 tomorrow morning unless there are further comments. v67-ART.tar.gz Description: application/gzip
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 3:40 PM Masahiko Sawada wrote: > > On Wed, Mar 6, 2024 at 5:33 PM John Naylor wrote: > > I've looked around and it seems clang is more lax on conversions. > > Since it works fine for clang, I think we just need a cast here for > > gcc. I've attached a blind attempt at a fix -- I'll apply shortly > > unless someone happens to test and find it doesn't work. > > I've reproduced the same error on my raspberry pi, and confirmed the > patch fixes the error. > > My previous idea was wrong. With my proposal, the regression test for > radix tree failed on my raspberry pi. On the other hand, with your > patch the tests passed. Pushed, and at least parula's green now, thanks for testing! And thanks, Andres, for the ping!
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 5:33 PM John Naylor wrote: > > On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada wrote: > > > > ../../src/include/port/simd.h:326:71: error: incompatible type for > > argument 1 of ‘vshrq_n_s8’ > > uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, > > 7)); > >^ > > > > Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. > > I've looked around and it seems clang is more lax on conversions. > Since it works fine for clang, I think we just need a cast here for > gcc. I've attached a blind attempt at a fix -- I'll apply shortly > unless someone happens to test and find it doesn't work. I've reproduced the same error on my raspberry pi, and confirmed the patch fixes the error. My previous idea was wrong. With my proposal, the regression test for radix tree failed on my raspberry pi. On the other hand, with your patch the tests passed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada wrote: > > ../../src/include/port/simd.h:326:71: error: incompatible type for > argument 1 of ‘vshrq_n_s8’ > uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); >^ > > Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. I've looked around and it seems clang is more lax on conversions. Since it works fine for clang, I think we just need a cast here for gcc. I've attached a blind attempt at a fix -- I'll apply shortly unless someone happens to test and find it doesn't work. diff --git a/src/include/port/simd.h b/src/include/port/simd.h index 326b4faff5..597496f2fb 100644 --- a/src/include/port/simd.h +++ b/src/include/port/simd.h @@ -323,7 +323,7 @@ vector8_highbit_mask(const Vector8 v) 1 << 4, 1 << 5, 1 << 6, 1 << 7, }; - uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); + uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8((int8x16_t) v, 7)); uint8x16_t maskedhi = vextq_u8(masked, masked, 8); return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 3:06 PM John Naylor wrote: > > (Hmm, I thought we had run this code on Arm already...) CI MacOS uses Clang on aarch64, which has been working fine. The failing animals are on gcc 7.3...
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On March 6, 2024 9:06:50 AM GMT+01:00, John Naylor wrote: >On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada wrote: >> >> On Wed, Mar 6, 2024 at 4:41 PM Andres Freund wrote: > >> > A few ARM buildfarm animals are complaining: >> > >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02 >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03 >> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18 >> > >> >> The error message we got is: >> >> ../../src/include/port/simd.h:326:71: error: incompatible type for >> argument 1 of ‘vshrq_n_s8’ >> uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, >> 7)); >>^ >> >> Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. > >That sounds plausible, and I'll look further. > >(Hmm, I thought we had run this code on Arm already...) Perhaps we should switch one of the CI jobs to ARM... Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 3:02 PM Masahiko Sawada wrote: > > On Wed, Mar 6, 2024 at 4:41 PM Andres Freund wrote: > > A few ARM buildfarm animals are complaining: > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03 > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18 > > > > The error message we got is: > > ../../src/include/port/simd.h:326:71: error: incompatible type for > argument 1 of ‘vshrq_n_s8’ > uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); >^ > > Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. That sounds plausible, and I'll look further. (Hmm, I thought we had run this code on Arm already...)
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 4:41 PM Andres Freund wrote: > > Hi, > > On 2024-03-05 16:41:30 +0700, John Naylor wrote: > > I'd like to push 0001 and 0002 shortly, and then do another sweep over > > 0003, with remaining feedback, and get that in so we get some > > buildfarm testing before the remaining polishing work on > > tidstore/vacuum. > > A few ARM buildfarm animals are complaining: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03 > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18 > The error message we got is: ../../src/include/port/simd.h:326:71: error: incompatible type for argument 1 of ‘vshrq_n_s8’ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7)); ^ Since 'v' is uint8x16_t I think we should have used vshrq_n_u8() instead. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
Hi, On 2024-03-05 16:41:30 +0700, John Naylor wrote: > I'd like to push 0001 and 0002 shortly, and then do another sweep over > 0003, with remaining feedback, and get that in so we get some > buildfarm testing before the remaining polishing work on > tidstore/vacuum. A few ARM buildfarm animals are complaining: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-03-06%2007%3A34%3A02 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-03-06%2007%3A34%3A03 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=massasauga&dt=2024-03-06%2007%3A33%3A18 Greetings, Andres Freund
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Mar 6, 2024 at 12:59 PM John Naylor wrote: > > On Tue, Mar 5, 2024 at 11:12 PM Masahiko Sawada wrote: > > > > On Tue, Mar 5, 2024 at 6:41 PM John Naylor wrote: > > > > Done in v66-0009. I'd be curious to hear any feedback. I like the > > > aspect that the random numbers come from a different seed every time > > > the test runs. > > > > The new tests look good. Here are some comments: > > > > --- > > + expected = keys[i]; > > + iterval = rt_iterate_next(iter, &iterkey); > > > > - ndeleted++; > > + EXPECT_TRUE(iterval != NULL); > > + EXPECT_EQ_U64(iterkey, expected); > > + EXPECT_EQ_U64(*iterval, expected); > > > > Can we verify that the iteration returns keys in ascending order? > > We get the "expected" value from the keys we saved in the now-sorted > array, so we do already. Unless I misunderstand you. Ah, you're right. Please ignore this comment. > > > --- > > radixtree_ctx = AllocSetContextCreate(CurrentMemoryContext, > > "test_radix_tree", > > ALLOCSET_DEFAULT_SIZES); > > > > We use a mix of ALLOCSET_DEFAULT_SIZES and ALLOCSET_SMALL_SIZES. I > > think it's better to use either one for consistency. > > Will change to "small", since 32-bit platforms will use slab for leaves. Agreed. > > I'll look at the memory usage and estimate what 32-bit platforms will > use, and maybe adjust the number of keys. A few megabytes is fine, but > not many megabytes. Thanks, sounds good. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com