Re: slab allocator performance issues

2022-12-11 Thread John Naylor
On Sat, Dec 10, 2022 at 11:02 AM David Rowley  wrote:
> [v4]

Thanks for working on this!

I ran an in-situ benchmark using the v13 radix tree patchset ([1] WIP but
should be useful enough for testing allocation speed), applying only the
first five patches, which are local-memory only. The benchmark is not meant to
represent a realistic workload, and primarily stresses traversal and
allocation of the smallest node type. Results are the minimum of five runs,
with turbo-boost off, on recent Intel laptop hardware:

v13-0001 to 0005:

# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---------------+---------
     151123432 |     222

 47.06%  postgres  postgres   [.] rt_set
 22.89%  postgres  postgres   [.] SlabAlloc
  9.65%  postgres  postgres   [.] rt_node_insert_inner.isra.0
  5.94%  postgres  [unknown]  [k] 0xb5e011b7
  3.62%  postgres  postgres   [.] MemoryContextAlloc
  2.70%  postgres  libc.so.6  [.] __memmove_avx_unaligned_erms
  2.60%  postgres  postgres   [.] SlabFree

+ v4 slab:

# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---------------+---------
     152463112 |     213

  52.42%  postgres  postgres  [.] rt_set
  12.80%  postgres  postgres  [.] SlabAlloc
   9.38%  postgres  postgres  [.] rt_node_insert_inner.isra.0
   7.87%  postgres  [unknown] [k] 0xb5e011b7
   4.98%  postgres  postgres  [.] SlabFree

While allocation is markedly improved, freeing looks worse here. The
proportion is surprising: only about 2% of nodes are freed during the
load, yet freeing takes 10-40% as much time as allocating.

num_keys = 500000, height = 7
n4 = 2501016, n15 = 56932, n32 = 270, n128 = 0, n256 = 257

Sidenote: I don't recall ever seeing vsyscall (I think that's what the
0xb5e011b7 address is referring to) in a profile, so not sure what
is happening there.

[1]
https://www.postgresql.org/message-id/CAFBsxsHNE621mGuPhd7kxaGc22vMkoSu7R4JW9Zan1jjorGy3g%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com


Re: static assert cleanup

2022-12-09 Thread John Naylor
On Fri, Dec 9, 2022 at 2:47 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:
>
> 0003-Move-some-static-assertions-to-better-places.patch
>
> This moves some that I thought were suboptimally placed but it could be
> debated or refined.

+ * We really want ItemPointerData to be exactly 6 bytes.  This is rather a
+ * random place to check, but there is no better place.

Since the assert is no longer in a random function body, it seems we can
remove the second sentence.
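
After the move, it could read simply as, for example:

StaticAssertDecl(sizeof(ItemPointerData) == 6,
                 "ItemPointerData is expected to be exactly 6 bytes");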

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-09 Thread John Naylor
On Fri, Dec 9, 2022 at 8:20 AM Masahiko Sawada 
wrote:

> In the meanwhile, I've been working on vacuum integration. There are
> two things I'd like to discuss some time:
>
> The first is the minimum of maintenance_work_mem, 1 MB. Since the
> initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
> vacuum with radix tree cannot work with the minimum
> maintenance_work_mem. It will need to increase it to 4MB or so. Maybe
> we can start a new thread for that.

I don't think that'd be very controversial, but I'm also not sure why we'd
need 4MB -- can you explain in more detail what exactly we'd need so that
the feature would work? (The minimum doesn't have to work *well* IIUC, just
do some useful work and not fail).

> The second is how to limit the size of the radix tree to
> maintenance_work_mem. I think that it's tricky to estimate the maximum
> number of keys in the radix tree that fit in maintenance_work_mem. The
> radix tree size varies depending on the key distribution. The next
> idea I considered was how to limit the size when inserting a key. In
> order to strictly limit the radix tree size, probably we have to
> change the rt_set so that it breaks off and returns false if the radix
> tree size is about to exceed the memory limit when we allocate a new
> node or grow a node kind/class.

That seems complex, fragile, and the wrong scope.

> Ideally, I'd like to control the size
> outside of radix tree (e.g. TIDStore) since it could introduce
> overhead to rt_set() but probably we need to add such logic in radix
> tree.

Does the TIDStore have the ability to ask the DSA (or slab context) how big
it is? If a new segment has been allocated that brings us to the
limit, we can stop when we discover that fact. In the local case with slab
blocks, it won't be on nice neat boundaries, but we could check if we're
within the largest block size (~64kB) of overflow.
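
As a rough sketch (the TidStore member names here are hypothetical, but
dsa_get_total_size() and MemoryContextMemAllocated() already exist):

static bool
tidstore_near_limit(TidStore *ts)
{
    Size    total;

    if (ts->area != NULL)
        total = dsa_get_total_size(ts->area);   /* shared case */
    else
        total = MemoryContextMemAllocated(ts->context, true);   /* local */

    /* for the local case, treat being within one largest slab block as full */
    return total > ts->max_bytes - 64 * 1024;
}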

Remember when we discussed how we might approach parallel pruning? I
envisioned a local array of a few dozen kilobytes to reduce contention on
the tidstore. We could use such an array even for a single worker (always
doing the same thing is simpler anyway). When the array fills up enough so
that the next heap page *could* overflow it: Stop, insert into the store,
and check the store's memory usage before continuing.
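
A sketch of that flow, continuing with hypothetical names:

#define TID_BUFFER_SIZE 4096    /* "a few dozen kilobytes" */

ItemPointerData buffer[TID_BUFFER_SIZE];
int         nbuffered = 0;

/* ...dead TIDs are appended here after pruning each heap page... */

/* flush if the next heap page could overflow the buffer */
if (nbuffered + MaxHeapTuplesPerPage > TID_BUFFER_SIZE)
{
    tidstore_insert_tids(store, buffer, nbuffered);     /* hypothetical */
    nbuffered = 0;

    if (tidstore_near_limit(store))     /* see sketch above */
        break;      /* stop scanning and run an index vacuum cycle */
}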

--
John Naylor
EDB: http://www.enterprisedb.com


Re: move some bitmapset.c macros to bitmapset.h

2022-12-05 Thread John Naylor
On Tue, Dec 6, 2022 at 12:57 PM Tom Lane  wrote:

> > Well, they've already escaped to tidbitmap.c as a copy. How do you feel
> > about going that route?
>
> Not terribly pleased with that either, I must admit.

Okay, I won't pursue that further.

> If we do put RIGHTMOST_ONE functionality into pg_bitutils.h,
> I'd envision it as pg_bitutils.h exporting both 32-bit and
> 64-bit versions of that, and then bitmapset.c choosing the
> appropriate one just like it chooses bmw_rightmost_one_pos.

Here's a quick go at that. I've not attempted to use it for what I need,
but it looks like it fits the bill.

--
John Naylor
EDB: http://www.enterprisedb.com
 src/backend/nodes/bitmapset.c    | 36 ++--
 src/include/nodes/bitmapset.h    | 16 ++--
 src/include/port/pg_bitutils.h   | 31 +++
 src/tools/pgindent/typedefs.list |  1 -
 4 files changed, 47 insertions(+), 37 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..4384ff591d 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -32,39 +32,7 @@
 #define BITMAPSET_SIZE(nwords)	\
 	(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
 
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word.  It assumes two's complement arithmetic.  Consider any
- * nonzero value, and focus attention on the rightmost one.  The value is
- * then something like
- *				xxxxxx10000
- * where x's are unspecified bits.  The two's complement negative is formed
- * by inverting all the bits and adding one.  Inversion gives
- *				yyyyyy01111
- * where each y is the inverse of the corresponding x.  Incrementing gives
- *				yyyyyy10000
- * and then ANDing with the original value gives
- *				00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x)	((bitmapword) RIGHTMOST_ONE(x) != (x))
-
-/* Select appropriate bit-twiddling functions for bitmap word size */
-#if BITS_PER_BITMAPWORD == 32
-#define bmw_leftmost_one_pos(w)		pg_leftmost_one_pos32(w)
-#define bmw_rightmost_one_pos(w)	pg_rightmost_one_pos32(w)
-#define bmw_popcount(w)				pg_popcount32(w)
-#elif BITS_PER_BITMAPWORD == 64
-#define bmw_leftmost_one_pos(w)		pg_leftmost_one_pos64(w)
-#define bmw_rightmost_one_pos(w)	pg_rightmost_one_pos64(w)
-#define bmw_popcount(w)				pg_popcount64(w)
-#else
-#error "invalid BITS_PER_BITMAPWORD"
-#endif
+#define HAS_MULTIPLE_ONES(x)	(bmw_rightmost_one(x) != (x))
 
 
 /*
@@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a)
 		{
 			int			result;
 
-			w = RIGHTMOST_ONE(w);
+			w = bmw_rightmost_one(w);
 			a->words[wordnum] &= ~w;
 
 			result = wordnum * BITS_PER_BITMAPWORD;
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..fdc504596b 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -38,13 +38,11 @@ struct List;
 
 #define BITS_PER_BITMAPWORD 64
 typedef uint64 bitmapword;		/* must be an unsigned type */
-typedef int64 signedbitmapword; /* must be the matching signed type */
 
 #else
 
 #define BITS_PER_BITMAPWORD 32
 typedef uint32 bitmapword;		/* must be an unsigned type */
-typedef int32 signedbitmapword; /* must be the matching signed type */
 
 #endif
 
@@ -75,6 +73,20 @@ typedef enum
 	BMS_MULTIPLE				/* >1 member */
 } BMS_Membership;
 
+/* Select appropriate bit-twiddling functions for bitmap word size */
+#if BITS_PER_BITMAPWORD == 32
+#define bmw_rightmost_one(w)		pg_rightmost_one32(w)
+#define bmw_leftmost_one_pos(w)		pg_leftmost_one_pos32(w)
+#define bmw_rightmost_one_pos(w)	pg_rightmost_one_pos32(w)
+#define bmw_popcount(w)				pg_popcount32(w)
+#elif BITS_PER_BITMAPWORD == 64
+#define bmw_rightmost_one(w)		pg_rightmost_one64(w)
+#define bmw_leftmost_one_pos(w)		pg_leftmost_one_pos64(w)
+#define bmw_rightmost_one_pos(w)	pg_rightmost_one_pos64(w)
+#define bmw_popcount(w)				pg_popcount64(w)
+#else
+#error "invalid BITS_PER_BITMAPWORD"
+#endif
 
 /*
  * function prototypes in nodes/bitmapset.c
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 814e0b2dba..f95b6afd86 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
 
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word.  It assumes two's complement arithmetic.  Consider any
+ * nonzero value, and focus attention on the rightmost one.  The value is
+ * then something like
+ *				xxxxxx10000
+ * where x's are unspecified bits.  The two's complement negative is formed
+ * by inverting all the bits and adding one.  Inversion gives
+ *				yyyyyy01111
+ * where each y is the inverse of the corresponding x.  Incrementing gives
+ *				yyyyyy10000
+ * and then ANDing with the original value gives
+ *				00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */

Re: move some bitmapset.c macros to bitmapset.h

2022-12-05 Thread John Naylor
On Mon, Dec 5, 2022 at 9:33 PM Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > On 2022-Dec-05, John Naylor wrote:
> >> -#define WORDNUM(x)  ((x) / BITS_PER_BITMAPWORD)
> >> -#define BITNUM(x)   ((x) % BITS_PER_BITMAPWORD)
>
> > In this location, nobody can complain about the naming of these macros,
> > since they're just used to implement other bitmapset.c code.  However,
> > if you move them to the .h file, ISTM you should give them more
> > meaningful names.
>
> IMV these are absolutely private to bitmapset.c.  I reject the idea
> that they should be exposed publicly, under these names or any others.

Well, they've already escaped to tidbitmap.c as a copy. How do you feel
about going that route?

> Maybe we need some more bitmapset primitive functions?  What is it
> you actually want to accomplish in the end?

 An inserter into one type of node in a tree structure must quickly find a
free position in an array. We have a bitmap of 128 bits to indicate whether
the corresponding array position is free. The proposed coding is:

 /* get the first word with at least one bit not set */
for (idx = 0; idx < WORDNUM(128); idx++)
{
  if (isset[idx] < ~((bitmapword) 0))
break;
}

/* To get the first unset bit in X, get the first set bit in ~X */
inverse = ~(isset[idx]);
slotpos = idx * BITS_PER_BITMAPWORD;
slotpos += bmw_rightmost_one_pos(inverse);

/* mark the slot used */
isset[idx] |= RIGHTMOST_ONE(inverse);

return slotpos;

...which, if it were reversed so that a set bit meant "available", would
essentially be bms_first_member(), so a more primitive version of that
might work.

--
John Naylor
EDB: http://www.enterprisedb.com


move some bitmapset.c macros to bitmapset.h

2022-12-05 Thread John Naylor
Over in [1], Masahiko and I found that using some bitmapset logic yields a
useful speedup in one part of the proposed radix tree patch. In addition to
what's in bitmapset.h now, we'd need WORDNUM, BITNUM, RIGHTMOST_ONE and
bmw_rightmost_one_pos() from bitmapset.c. The file tidbitmap.c has its own
copies of the first two, and we could adapt that strategy again. But
instead of three files defining these, it seems like it's time to consider
moving them somewhere more central.

Attached is a simple form of this concept, including moving
HAS_MULTIPLE_ONES just for consistency. One possible objection is the
possibility of namespace clash. Thoughts?

I also considered putting the macros and typedefs in pg_bitutils.h. One
organizational advantage is that pg_bitutils.h already offers convenience
function symbols where the parameter depends on SIZEOF_SIZE_T, so putting
the BITS_PER_BITMAPWORD versions there makes sense. But that way is not a
clear win, so I didn't go that far.

[1]
https://www.postgresql.org/message-id/CAFBsxsHgP5LP9q%3DRq_3Q2S6Oyut67z%2BVpemux%2BKseSPXhJF7sg%40mail.gmail.com

-- 
John Naylor
EDB: http://www.enterprisedb.com
 src/backend/nodes/bitmapset.c | 24 
 src/backend/nodes/tidbitmap.c |  5 -
 src/include/nodes/bitmapset.h | 24 
 3 files changed, 24 insertions(+), 29 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index b7b274aeff..3204b49738 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -26,33 +26,9 @@
 #include "port/pg_bitutils.h"
 
 
-#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
-
 #define BITMAPSET_SIZE(nwords)	\
 	(offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword))
 
-/*----------
- * This is a well-known cute trick for isolating the rightmost one-bit
- * in a word.  It assumes two's complement arithmetic.  Consider any
- * nonzero value, and focus attention on the rightmost one.  The value is
- * then something like
- *				xxxxxx10000
- * where x's are unspecified bits.  The two's complement negative is formed
- * by inverting all the bits and adding one.  Inversion gives
- *				yyyyyy01111
- * where each y is the inverse of the corresponding x.  Incrementing gives
- *				yyyyyy10000
- * and then ANDing with the original value gives
- *				00000010000
- * This works for all cases except original value = zero, where of course
- * we get zero.
- *----------
- */
-#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
-
-#define HAS_MULTIPLE_ONES(x)	((bitmapword) RIGHTMOST_ONE(x) != (x))
-
 /* Select appropriate bit-twiddling functions for bitmap word size */
 #if BITS_PER_BITMAPWORD == 32
 #define bmw_leftmost_one_pos(w)		pg_leftmost_one_pos32(w)
diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index a7a6b26668..8a1fd1d205 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -72,11 +72,6 @@
  */
 #define PAGES_PER_CHUNK  (BLCKSZ / 32)
 
-/* We use BITS_PER_BITMAPWORD and typedef bitmapword from nodes/bitmapset.h */
-
-#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
-#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)
-
 /* number of active words for an exact page: */
 #define WORDS_PER_PAGE	((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1)
 /* number of active words for a lossy chunk: */
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 2792281658..8462282b9e 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -48,6 +48,30 @@ typedef int32 signedbitmapword; /* must be the matching signed type */
 
 #endif
 
+#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
+#define BITNUM(x)  ((x) % BITS_PER_BITMAPWORD)
+
+/*----------
+ * This is a well-known cute trick for isolating the rightmost one-bit
+ * in a word.  It assumes two's complement arithmetic.  Consider any
+ * nonzero value, and focus attention on the rightmost one.  The value is
+ * then something like
+ *				xxxxxx10000
+ * where x's are unspecified bits.  The two's complement negative is formed
+ * by inverting all the bits and adding one.  Inversion gives
+ *				yyyyyy01111
+ * where each y is the inverse of the corresponding x.  Incrementing gives
+ *				yyyyyy10000
+ * and then ANDing with the original value gives
+ *				00000010000
+ * This works for all cases except original value = zero, where of course
+ * we get zero.
+ *----------
+ */
+#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x)))
+
+#define HAS_MULTIPLE_ONES(x)	((bitmapword) RIGHTMOST_ONE(x) != (x))
+
 typedef struct Bitmapset
 {
 	pg_node_attr(custom_copy_equal, special_read_write)


Re: slab allocator performance issues

2022-12-05 Thread John Naylor
On Mon, Dec 5, 2022 at 3:02 PM David Rowley  wrote:
>
> On Fri, 11 Nov 2022 at 22:20, John Naylor 
wrote:

> > #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> > index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
>
> Doesn't this create a sort of round-robin use of the free list?  What
> we want is a sort of "histogram" bucket set of free lists so we can
> group together blocks that have a close-enough free number of chunks.
> Unless I'm mistaken, I think what you have doesn't do that.

The intent must have slipped my mind along the way.

> I wondered if simply:
>
> index = -(-freecount >> slab->freelist_shift);
>
> would be faster than Andres' version.  I tried it out and on my AMD
> machine, it's about the same speed. Same on a Raspberry Pi 4.
>
> Going by [2], the instructions are very different with each method, so
> other machines with different latencies on those instructions might
> show something different. I attached what I used to test if anyone
> else wants a go.

I get about 0.1% difference on my machine.  Both ways boil down to (on gcc)
3 instructions with low latency. The later ones need the prior results to
execute, which I think is what the XXX comment "isn't great" was referring
to. The new coding is more mysterious (does it do the right thing on all
platforms?), so I guess the original is still the way to go unless we get a
better idea.
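
For reference, the two candidates side by side (illustration only):

/* the earlier idea, quoted above: effectively round-robin, since it maps
 * freecount modulo (SLAB_FREELIST_COUNT - 1) rather than by magnitude */
index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);

/* David's version: ceil(freecount / 2^freelist_shift), which groups blocks
 * with similar free counts; note it relies on arithmetic right shift of a
 * negative value, which C leaves implementation-defined */
index = -(-freecount >> slab->freelist_shift);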

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Optimizing Node Files Support

2022-12-02 Thread John Naylor
On Thu, Dec 1, 2022 at 8:02 PM Ranier Vilela  wrote:
>
> Hi,
>
> I believe that has room for improving generation node files.
>
> The patch attached reduced the size of generated files by 27 kbytes.
> From 891 kbytes to 864 kbytes.
>
> About the patch:
> 1. Avoid useless attribution when from->field is NULL, once that
> the new node is palloc0.
>
> 2. Avoid useless declaration variable Size, when it is unnecessary.

Not useless -- it prevents a multiple evaluation hazard, which this patch
introduces.
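
(A generic illustration of the hazard, not the actual generated code:

#define MY_MAX(a, b)    ((a) > (b) ? (a) : (b))

len = MY_MAX(strlen(s), 16);    /* strlen(s) may be evaluated twice */

Stashing the length in a local Size variable first guarantees a single
evaluation.)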

> 3. Optimize comparison functions like memcmp and strcmp, using
>  a short-cut comparison of the first element.

Not sure if the juice is worth the squeeze. Profiling would tell.

> 4. Switch several copy attributions like COPY_SCALAR_FIELD or
COPY_LOCATION_FIELD
> by one memcpy call.

My first thought is, it would cause code churn.

> 5. Avoid useless attribution, ignoring the result of pg_strtok when it is
unnecessary.

Looks worse.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: initdb: Refactor PG_CMD_PUTS loops

2022-12-01 Thread John Naylor
On Thu, Dec 1, 2022 at 5:02 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:
>
> Keeping the SQL commands that initdb runs in string arrays before
> feeding them to PG_CMD_PUTS() seems unnecessarily verbose and
> inflexible.  In some cases, the array only has one member.  In other
> cases, one might want to use PG_CMD_PRINTF() instead, to parametrize a
> command, but that would require breaking up the loop or using
> workarounds like replace_token().  This patch unwinds all that; it's
> much simpler that way.

+1, I can't think of a reason to keep the current coding.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PATCH] Add native windows on arm64 support

2022-12-01 Thread John Naylor
On Fri, Dec 2, 2022 at 12:20 AM Niyas Sait  wrote:
>
>
> On 07/11/2022 06:58, Michael Paquier wrote:
> > Seems so.  Hmm, where does _ARM64_BARRIER_SY come from?  Perhaps it
> > would be better to have a comment referring to it from a different
> > place than the forums of arm, like some actual docs?
>
>
> _ARM64_BARRIER_SY is defined in Microsoft Arm64 intrinsic documentation
> -
>
https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-170#BarrierRestrictions

In particular, at the bottom of that section:

"For the __isb intrinsic, the only restriction that is currently valid is
_ARM64_BARRIER_SY; all other values are reserved by the architecture."

This corresponds to

https://developer.arm.com/documentation/ddi0596/2021-06/Base-Instructions/ISB--Instruction-Synchronization-Barrier-

which says

"SY Full system barrier operation, encoded as CRm = 0b. Can be omitted."

> I couldn't find something more official for the sse2neon library part.

Not quite sure what this is referring to, but it seems we can just point to
the __aarch64__ section in the same file, which uses the same instruction:

spin_delay(void)
{
  __asm__ __volatile__(
  " isb; \n");
}

...and which already explains the choice with a comment.
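
Presumably the MSVC arm64 side would look something like this (a sketch
only -- I haven't checked the exact spelling in v4):

static __forceinline void
spin_delay(void)
{
    /* same ISB instruction as the __aarch64__ version above */
    __isb(_ARM64_BARRIER_SY);
}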

About v4:

+ * Use _mm_pause (x64) or __isb(arm64) intrinsic instead of rep nop.

Need a space here after __isb.

+  if cc.get_id() == 'msvc'
+cdata.set('USE_ARMV8_CRC32C', false)
+cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
+have_optimized_crc = true
+  else

That seems like a heavy-handed way to force it. Could we just use the same
gating in the test program that the patch puts in the code of interest?
Namely:

+#ifndef _MSC_VER
 #include <arm_acle.h>
+#endif

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-12-01 Thread John Naylor
On Thu, Dec 1, 2022 at 3:03 PM Masahiko Sawada 
wrote:
>
> On Thu, Dec 1, 2022 at 4:00 PM John Naylor 
wrote:
> >
> > The bigger question in my mind is: Why is there an atomic variable in
backend-local memory?
>
> Because I use the same radix_tree and radix_tree_control structs for
> non-parallel and parallel vacuum. Therefore, radix_tree_control is
> allocated in DSM for parallel-vacuum cases or in backend-local memory
> for non-parallel vacuum cases.

Ok, that could be yet another reason to compile local- and shared-memory
functionality separately, but now I'm wondering why there are atomic
variables at all, since there isn't yet any locking support.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-30 Thread John Naylor
On Wed, Nov 30, 2022 at 2:28 PM Masahiko Sawada 
wrote:
>
> On Fri, Nov 25, 2022 at 5:00 PM John Naylor
>  wrote:

> > These hacks were my work, but I think we can improve that by having two
versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized
nodes. For that to work, in "init-node" we'd need a branch to set fanout to
zero for node256. That should be fine -- it already has to branch for
memset'ing node128's indexes to 0xFF.
>
> Since the node has fanout regardless of fixed-sized and
> variable-sized

As currently coded, yes. But that's not strictly necessary, I think.

>, only node256 is the special case where the fanout in
> the node doesn't match the actual fanout of the node. I think if we
> want to have two versions of NODE_HAS_FREE_SLOT, we can have one for
> node256 and one for other classes. Thoughts? In your idea, for
> NODE_HAS_FREE_SLOT for fixed-sized nodes, you meant like the
> following?
>
> #define FIXED_NODE_HAS_FREE_SLOT(node, class)
>   (node->base.n.count < rt_size_class_info[class].fanout)

Right, and the other one could be VAR_NODE_...

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-30 Thread John Naylor
On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada 
wrote:
>
> I've investigated this issue and have a question about using atomic
> variables on palloc'ed memory. In non-parallel vacuum cases,
> radix_tree_control is allocated via aset.c. IIUC in 32-bit machines,
> the memory allocated by aset.c is 4-bytes aligned so these atomic
> variables are not always 8-bytes aligned. Is there any way to enforce
> 8-bytes aligned memory allocations in 32-bit machines?

The bigger question in my mind is: Why is there an atomic variable in
backend-local memory?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Improve performance of pg_strtointNN functions

2022-11-30 Thread John Naylor
On Thu, Dec 1, 2022 at 6:42 AM David Rowley  wrote:
>
> I was thinking that we should likely apply this before doing the hex
> literals, which is the main focus of [1].  The reason being is so that
> that patch can immediately have faster conversions by allowing the
> compiler to use bit shifting instead of other means of multiplying by
> a power-of-2 number. I'm hoping this removes a barrier for Peter from
> the small gripe I raised on that thread about the patch having slower
> than required hex, octal and binary string parsing.

I don't see why the non-decimal literal patch needs to be "immediately"
faster? If doing this first leads to less code churn, that's another
consideration, but you haven't made that argument.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Prefetch the next tuple's memory during seqscans

2022-11-30 Thread John Naylor
On Wed, Nov 23, 2022 at 4:58 AM David Rowley  wrote:

> My current thoughts are that it might be best to go with 0005 to start
> with.

+1

> I know Melanie is working on making some changes in this area,
> so perhaps it's best to leave 0002 until that work is complete.

There seem to be some open questions about that one as well.

I reran the same test in [1] (except I don't have the ability to lock clock
speed or affect huge pages) on an older CPU from 2014 (Intel(R) Xeon(R) CPU
E5-2695 v3 @ 2.30GHz, kernel 3.10, gcc 4.8) with good results:

HEAD:

Testing a1
latency average = 965.462 ms
Testing a2
latency average = 1054.608 ms
Testing a3
latency average = 1078.263 ms
Testing a4
latency average = 1120.933 ms
Testing a5
latency average = 1162.753 ms
Testing a6
latency average = 1298.876 ms
Testing a7
latency average = 1228.775 ms
Testing a8
latency average = 1293.535 ms

0001+0005:

Testing a1
latency average = 791.224 ms
Testing a2
latency average = 876.421 ms
Testing a3
latency average = 911.039 ms
Testing a4
latency average = 981.693 ms
Testing a5
latency average = 998.176 ms
Testing a6
latency average = 979.954 ms
Testing a7
latency average = 1066.523 ms
Testing a8
latency average = 1030.235 ms

I then tested a Power8 machine (also kernel 3.10, gcc 4.8). Configure
reports "checking for __builtin_prefetch... yes", but I don't think it does
anything here, as the results are within noise level. A quick search didn't
turn up anything informative on this platform, and I'm not motivated to dig
deeper. In any case, it doesn't make things worse.

HEAD:

Testing a1
latency average = 1402.163 ms
Testing a2
latency average = 1442.971 ms
Testing a3
latency average = 1599.188 ms
Testing a4
latency average = 1664.397 ms
Testing a5
latency average = 1782.091 ms
Testing a6
latency average = 1860.655 ms
Testing a7
latency average = 1929.120 ms
Testing a8
latency average = 2021.100 ms

0001+0005:

Testing a1
latency average = 1433.080 ms
Testing a2
latency average = 1428.369 ms
Testing a3
latency average = 1542.406 ms
Testing a4
latency average = 1642.452 ms
Testing a5
latency average = 1737.173 ms
Testing a6
latency average = 1828.239 ms
Testing a7
latency average = 1920.909 ms
Testing a8
latency average = 2036.922 ms

[1]
https://www.postgresql.org/message-id/CAFBsxsHqmH_S%3D4apc5agKsJsF6xZ9f6NaH0Z83jUYv3EgySHfw%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-29 Thread John Naylor
There are a few things up in the air, so I'm coming back to this list to
summarize and add a recent update:

On Mon, Nov 14, 2022 at 7:59 PM John Naylor 
wrote:
>
> - See how much performance we actually gain from tagging the node kind.

Needs a benchmark that has enough branch mispredicts and L2/3 misses to
show a benefit. Otherwise either neutral or worse in its current form,
depending on compiler(?). Put off for later.

> - Try additional size classes while keeping the node kinds to only four.

This is relatively simple and effective. If only one additional size class
(total 5) is coded as a placeholder, I imagine it will be easier to rebase
shared memory logic than using this technique everywhere possible.

> - Optimize node128 insert.

I've attached a rough start at this. The basic idea is borrowed from our
bitmapset nodes, so we can iterate over and operate on word-sized (32- or
64-bit) types at a time, rather than bytes. To make this easier, I've moved
some of the lower-level macros and types from bitmapset.h/.c to
pg_bitutils.h. That's probably going to need a separate email thread to
resolve the coding style clash this causes, so that can be put off for
later. This is not meant to be included in the next patchset.  For
demonstration purposes, I get these results with a function that repeatedly
deletes the last value from a mostly-full node128 leaf and re-inserts it:

select * from bench_node128_load(120);

v11

NOTICE:  num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0,
n61 = 0, n128 = 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
    120 | 14400 |           208304 |               56

v11 + 0006 addendum

NOTICE:  num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0,
n61 = 0, n128 = 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
    120 | 14400 |           208816 |               34

I didn't test inner nodes, but I imagine the difference is bigger. This
bitmap style should also be used for the node256-leaf isset array simply to
be consistent and avoid needing single-use macros, but that has not been
done yet. It won't make a difference for performance because there is no
iteration there.
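
For reference, the word-at-a-time slot search in the addendum looks roughly
like this (simplified; assumes the caller knows a free slot exists):

static int
node_128_find_unused_slot(bitmapword *isset)
{
    int         idx;
    bitmapword  inverse;

    /* get the first word with at least one bit not set */
    for (idx = 0; idx < WORDNUM(128); idx++)
    {
        if (isset[idx] < ~((bitmapword) 0))
            break;
    }

    /* to get the first unset bit in X, get the first set bit in ~X */
    inverse = ~(isset[idx]);

    /* mark the slot used and return its position */
    isset[idx] |= bmw_rightmost_one(inverse);
    return idx * BITS_PER_BITMAPWORD + bmw_rightmost_one_pos(inverse);
}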

> - Try templating out the differences between local and shared memory.

I hope to start this sometime after the crashes on 32-bit are resolved.

--
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql 
b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 67ba568531..2fd689aa91 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -63,3 +63,14 @@ OUT rt_search_ms int8
 returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_node128_load(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_sparseload_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c 
b/contrib/bench_radix_tree/bench_radix_tree.c
index e69be48448..b035b3a747 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -31,6 +31,7 @@ PG_FUNCTION_INFO_V1(bench_shuffle_search);
 PG_FUNCTION_INFO_V1(bench_load_random_int);
 PG_FUNCTION_INFO_V1(bench_fixed_height_search);
 PG_FUNCTION_INFO_V1(bench_search_random_nodes);
+PG_FUNCTION_INFO_V1(bench_node128_load);
 
 static uint64
 tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -552,3 +553,85 @@ finish_search:
rt_free(rt);
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, 
nulls)));
 }
+
+Datum
+bench_node128_load(PG_FUNCTION_ARGS)
+{
+   int fanout = PG_GETARG_INT32(0);
+   radix_tree *rt;
+   TupleDesc   tupdesc;
+   TimestampTz start_time,
+   end_time;
+   longsecs;
+   int usecs;
+   int64   rt_sparseload_ms;
+   Datum   values[5];
+   boolnulls[5];
+
+   uint64  r,
+   h;
+   int key_id;
+
+   /* Build a tuple descriptor for our result type */
+   if (get_call_result_type(fcinfo, NULL, ) != TYPEFUNC_COMPOSITE)
+   elog(ERROR, "return type must be a row type");
+
+   rt = rt_create(CurrentMemoryContext);
+
+   key_id = 0;
+
+   for (r = 1; r <= fanout; r++)
+   {
+   for (h = 1; h <= fanout; h++)
+   {
+   uint64  key;
+
+   key = (r << 8) | (h);
+
+   key_id++;
+   rt_set(rt, key, key_id);
+   }
+   }
+
+   rt_stats(rt);

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-28 Thread John Naylor
> The fix is easy enough -- set the child pointer to null upon deletion,
> but I'm somewhat astonished that the regression tests didn't hit this. I do
> still intend to replace this code with something faster, but before I do so
> the tests should probably exercise the deletion paths more. Since VACUUM

Oops. I meant to finish with "Since VACUUM doesn't perform deletion we
didn't have an opportunity to detect this during that operation."

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-28 Thread John Naylor
While creating a benchmark for inserting into node128-inner, I found a bug.
If a caller deletes from a node128, the slot index is set to invalid, but
the child pointer is still valid. Do that a few times, and every child
pointer is valid, even if no slot index points to it. When the next
inserter comes along, something surprising happens. This function:

/* Return an unused slot in node-128 */
static int
node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
  int slotpos = 0;

  Assert(!NODE_IS_LEAF(node));
  while (node_inner_128_is_slot_used(node, slotpos))
  slotpos++;

  return slotpos;
}

...passes an integer to this function, whose parameter is a uint8:

/* Is the slot in the node used? */
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
  Assert(!NODE_IS_LEAF(node));
  return (node->children[slot] != NULL);
}

...so instead of growing the node unnecessarily or segfaulting, it enters
an infinite loop doing this:

add eax, 1
movzx   ecx, al
cmp QWORD PTR [rbx+264+rcx*8], 0
jne .L147

The fix is easy enough -- set the child pointer to null upon deletion, but
I'm somewhat astonished that the regression tests didn't hit this. I do
still intend to replace this code with something faster, but before I do so
the tests should probably exercise the deletion paths more. Since VACUUM

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-25 Thread John Naylor
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada 
wrote:
>
> [v11]

There is one more thing that just now occurred to me: In expanding the use
of size classes, that makes rebasing and reworking the shared memory piece
more work than it should be. That's important because there are still some
open questions about the design around shared memory. To keep unnecessary
churn to a minimum, perhaps we should limit size class expansion to just
one (or 5 total size classes) for the near future?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-25 Thread John Naylor
ASS(n32, RT_CLASS_32_PARTIAL))

This macro doesn't really improve readability -- it obscures what is being
tested, and the name implies the "else" branch means "node doesn't need to
grow class", which is false. If we want to simplify expressions in this
block, I think it'd be more effective to improve the lines that follow:

+ memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+ new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;

Maybe we can have const variables old_size and new_fanout to break out the
array lookup? While I'm thinking of it, these arrays should be const so the
compiler can avoid runtime lookups. Speaking of...

+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+  uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ /* For better code generation */
+ if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+ pg_unreachable();
+
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}

When I looked at this earlier, I somehow didn't go far enough -- why are we
passing the runtime count in the first place? This function can only be
called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last
parameter to memcpy should evaluate to a compile-time constant, right? Even
when we add node shrinking in the future, the constant should be correct,
IIUC?
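
In other words, something like this (a sketch, assuming rt_size_class_info
is also declared const):

static inline void
chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
                          uint8 *dst_chunks, rt_node **dst_children)
{
    const int   count = rt_size_class_info[RT_CLASS_4_FULL].fanout;

    /* both sizes now evaluate to compile-time constants */
    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
}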

- .fanout = 256,
+ /* technically it's 256, but we can't store that in a uint8,
+  and this is the max size class so it will never grow */
+ .fanout = 0,

- Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+ Assert(((rt_node *) n256)->fanout == 0);
+ Assert(chunk_exists || ((rt_node *) n256)->count < 256);

These hacks were my work, but I think we can improve that by having two
versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized
nodes. For that to work, in "init-node" we'd need a branch to set fanout to
zero for node256. That should be fine -- it already has to branch for
memset'ing node128's indexes to 0xFF.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Non-decimal integer literals

2022-11-23 Thread John Naylor
On Wed, Nov 23, 2022 at 3:54 PM David Rowley  wrote:
>
> Going by [1], clang will actually use multiplication by 16 to
> implement the former. gcc is better and shifts left by 4, so likely
> won't improve things for gcc.  It seems worth doing it this way for
> anything that does not have HAVE__BUILTIN_OP_OVERFLOW anyway.

FWIW, gcc 12.2 generates an imul on my system when compiling in situ. I've
found it useful to run godbolt locally* and load the entire PG file (nicer
to read than plain objdump) -- compilers can make different decisions when
going from isolated snippets to within full functions.

* clone from https://github.com/compiler-explorer/compiler-explorer
install npm 16
run "make" and when finished will show the localhost url
add the right flags, which in this case was

-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement
-Werror=vla -Wendif-labels -Wmissing-format-attribute
-Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security
-fno-strict-aliasing -fwrapv -fexcess-precision=standard
-Wno-format-truncation -Wno-stringop-truncation -O2
-I/path/to/srcdir/src/include -I/path/to/builddir/src/include  -D_GNU_SOURCE

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Non-decimal integer literals

2022-11-22 Thread John Naylor
On Tue, Nov 22, 2022 at 8:36 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:
>
> On 15.11.22 11:31, Peter Eisentraut wrote:
> > On 14.11.22 08:25, John Naylor wrote:
> >> Regarding the patch, it looks good overall. My only suggestion would
> >> be to add a regression test for just below and just above overflow, at
> >> least for int2.
> >
> > ok
>
> This was a valuable suggestion, because this found some breakage.  In
> particular, the handling of grammar-level literals that overflow to
> "Float" was not correct.  (The radix prefix was simply stripped and
> forgotten.)  So I added a bunch more tests for this.  Here is a new patch.

Looks good to me.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Prefetch the next tuple's memory during seqscans

2022-11-22 Thread John Naylor
On Wed, Nov 23, 2022 at 5:00 AM David Rowley  wrote:
>
> On Thu, 3 Nov 2022 at 22:09, John Naylor 
wrote:
> > I tried a similar test, but with text fields of random length, and
there is improvement here:
>
> Thank you for testing that. Can you share which CPU this was on?

That was an Intel Core i7-10750H

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-21 Thread John Naylor
On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada 
wrote:
>
> On Mon, Nov 21, 2022 at 4:20 PM John Naylor
>  wrote:

> > Assuming the smallest node is fixed size (i.e. fanout/capacity member
not part of the common set, so only part of variable-sized nodes), 3 has a
nice property: no wasted padding space:
> >
> > node4: 5 + 4+(7) + 4*8 = 48 bytes
> > node3: 5 + 3 + 3*8 = 32
>
> IIUC if we store the fanout member only in variable-sized nodes,
> rt_node has only count, shift, and chunk, so 4 bytes in total. If so,
> the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The
> size doesn't change but there is 1 byte padding space.

I forgot to mention I'm assuming no pointer-tagging for this exercise.
You've demonstrated it can be done in a small amount of code, and I hope we
can demonstrate a speedup in search. Just in case there is some issue with
portability, valgrind, or some other obstacle, I'm being pessimistic in my
calculations.

> Also, even if we have the node3 a variable-sized node, size class 1
> for node3 could be a good choice since it also doesn't need padding
> space and could be a good alternative to path compression.
>
> node3 :  5 + 3 + 3*8 = 32 bytes
> size class 1 : 5 + 3 + 1*8 = 16 bytes

Precisely! I have that scenario in my notes as well -- it's quite
compelling.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-20 Thread John Naylor
On Fri, Nov 18, 2022 at 2:48 PM I wrote:
> One issue with this patch: The "fanout" member is a uint8, so it can't
hold 256 for the largest node kind. That's not an issue in practice, since
we never need to grow it, and we only compare that value with the count in
an Assert(), so I just set it to zero. That does break an invariant, so
it's not great. We could use 2 bytes to be strictly correct in all cases,
but that limits what we can do with the smallest node kind.

Thinking about this part, there's an easy resolution -- use a different
macro for fixed- and variable-sized node kinds to determine if there is a
free slot.

Also, I wanted to share some results of adjusting the boundary between the
two smallest node kinds. In the hackish attached patch, I modified the
fixed height search benchmark to search a small (within L1 cache) tree
thousands of times. For the first set I modified node4's maximum fanout and
filled it up. For the second, I set node4's fanout to 1, which causes 2+ to
spill to node32 (actually the partially-filled node15 size class
as demoed earlier).

node4:

NOTICE:  num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16520 |          0 |            3

NOTICE:  num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16456 |          0 |           17

NOTICE:  num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16456 |          0 |           89

NOTICE:  num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            16488 |          0 |          327


node32:

NOTICE:  num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16488 |          0 |            5
(1 row)

NOTICE:  num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16520 |          0 |           28

NOTICE:  num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16408 |          0 |           79

NOTICE:  num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0,
n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            24616 |          0 |          199

In this test, node32 seems slightly faster than node4 with 4 elements, at
the cost of more memory.

Assuming the smallest node is fixed size (i.e. fanout/capacity member not
part of the common set, so only part of variable-sized nodes), 3 has a nice
property: no wasted padding space:

node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32
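
In struct form, node3 would lay out something like this (field order
assumed for illustration):

typedef struct rt_node3
{
    rt_node     base;           /* 5 bytes: count, shift, chunk, kind */
    uint8       chunks[3];      /* bytes 5-7, so no padding before... */
    rt_node    *children[3];    /* ...the pointer array at offset 8 */
} rt_node3;                     /* 32 bytes total */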

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-20 Thread John Naylor
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada 
wrote:
>
> On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada 
wrote:
> >
> > On Wed, Nov 16, 2022 at 4:39 PM John Naylor
> >  wrote:

> > > That means my idea for the pointer struct might have some problems,
at least as currently implemented. Maybe in the course of separating out
and polishing that piece, an inefficiency will fall out. Or, it might be
another reason to template local and shared separately. Not sure yet. I
also haven't tried to adjust this test for the shared memory case.

Digging a bit deeper, I see a flaw in my benchmark: Even though the total
distribution of node kinds is decently even, the pattern that the benchmark
sees is not terribly random:

     3,343,352  branch-misses:u           #    0.85% of all branches
   393,204,959  branches:u

Recall a previous benchmark [1] where the leaf node was about half node16
and half node32. Randomizing the leaf node between the two caused branch
misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this
new benchmark, each level has a skewed distribution of nodes, giving a
smart branch predictor something to work with. We will need a way to
efficiently generate keys that lead to a relatively unpredictable
distribution of node kinds, as seen by a searcher. Especially in the leaves
(or just above the leaves), since those are less likely to be cached.

> > I'll also run the test on my environment and do the investigation
tomorrow.
> >
>
> FYI I've not tested the patch you shared today but here are the
> benchmark results I did with the v9 patch in my environment (I used
> the second filter). I splitted 0004 patch into two patches: a patch
> for pure refactoring patch to introduce rt_node_ptr and a patch to do
> pointer tagging.

Would you be able to share the refactoring patch? And a fix for the failing
tests? I'm thinking I want to try the templating approach fairly soon.

[1]
https://www.postgresql.org/message-id/CAFBsxsFEVckVzsBsfgGzGR4Yz%3DJp%3DUxOtjYvTjOz6fOoLXtOig%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-18 Thread John Naylor
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada 
wrote:
>
> FYI I've not tested the patch you shared today but here are the
> benchmark results I did with the v9 patch in my environment (I used
> the second filter). I splitted 0004 patch into two patches: a patch
> for pure refactoring patch to introduce rt_node_ptr and a patch to do
> pointer tagging.
>
> v9 0003 patch: 1113 1114 1114
> introduce rt_node_ptr: 1127 1128 1128
> pointer tagging  : 1085 1087 1086 (equivalent to 0004 patch)
>
> In my environment, rt_node_ptr seemed to lead some overhead but
> pointer tagging had performance benefits. I'm not sure the reason why
> the results are different from yours. The radix tree stats shows the
> same as your tests.

There is less than 2% difference from the median set of results, so it's
hard to distinguish from noise. I did a fresh rebuild and retested with the
same results: about 15% slowdown in v9 0004. That's strange.

On Wed, Nov 16, 2022 at 10:24 PM Masahiko Sawada 
wrote:

> > filter = (((uint64) 1<<32) | (0xFF<<24));
> > LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 =
62632, n256 = 3161
> >
> > 1) Any idea why the tree height would be reported as 7 here? I didn't
expect that.
>
> In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
> It seems the filter should be (((uint64) 1<<32) | ((uint64)
> 0xFF<<24)).

Ugh, sign extension, brain fade on my part. Thanks, I'm glad there was a
straightforward explanation.
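
For the archives, the trap in miniature:

uint64 bad  = (((uint64) 1 << 32) | (0xFF << 24));
/* 0xFF << 24 is a negative int, so it sign-extends when widened:
 * bad == 0xFFFFFFFFFF000000 */

uint64 good = (((uint64) 1 << 32) | ((uint64) 0xFF << 24));
/* good == 0x00000001FF000000, as intended */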

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-17 Thread John Naylor
On Wed, Sep 28, 2022 at 1:18 PM I wrote:

> Along those lines, one thing I've been thinking about is the number of
size classes. There is a tradeoff between memory efficiency and number of
branches when searching/inserting. My current thinking is there is too much
coupling between size class and data type. Each size class currently uses a
different data type and a different algorithm to search and set it, which
in turn requires another branch. We've found that a larger number of size
classes leads to poor branch prediction [1] and (I imagine) code density.
>
> I'm thinking we can use "flexible array members" for the values/pointers,
and keep the rest of the control data in the struct the same. That way, we
never have more than 4 actual "kinds" to code and branch on. As a bonus,
when migrating a node to a larger size class of the same kind, we can
simply repalloc() to the next size.

While the most important challenge right now is how to best represent and
organize the shared memory case, I wanted to get the above idea working and
out of the way, to be saved for a future time. I've attached a rough
implementation (applies on top of v9 0003) that splits node32 into 2 size
classes. They both share the exact same base data type and hence the same
search/set code, so the number of "kind"s is still four, but here there are
five "size classes", so a new case in the "unlikely" node-growing path. The
smaller instance of node32 is a "node15", because that's currently 160
bytes, corresponding to one of the DSA size classes. This idea can be
applied to any other node except the max size, as we see fit. (Adding a
singleton size class would bring it back in line with the prototype, at
least as far as memory consumption.)

One issue with this patch: The "fanout" member is a uint8, so it can't hold
256 for the largest node kind. That's not an issue in practice, since we
never need to grow it, and we only compare that value with the count in an
Assert(), so I just set it to zero. That does break an invariant, so it's
not great. We could use 2 bytes to be strictly correct in all cases, but
that limits what we can do with the smallest node kind.

In the course of working on this, I encountered a pain point. Since it's
impossible to repalloc in slab, we have to do alloc/copy/free ourselves.
That's fine, but the current coding makes too many assumptions about the
use cases: rt_alloc_node and rt_copy_node are too entangled with each other
and do too much work unrelated to what the names imply. I seem to remember
an earlier version had something like rt_node_copy_common that did
only...copying. That was much easier to reason about. In 0002 I resorted to
doing my own allocation to show what I really want to do, because the new
use case doesn't need zeroing and setting values. It only needs
to...allocate (and increase the stats counter if built that way).

Future optimization work while I'm thinking of it: rt_alloc_node should be
always-inlined and the memset done separately (i.e. not *AllocZero). That
way the compiler should be able to generate more efficient zeroing code for
smaller nodes. I'll test the numbers on this sometime in the future.
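
Roughly like this (a sketch -- the context member names are assumptions):

static inline rt_node *
rt_alloc_node(radix_tree *tree, rt_size_class size_class)
{
    /* note: no zeroing here */
    return (rt_node *) MemoryContextAlloc(tree->node_slabs[size_class],
                                          rt_size_class_info[size_class].inner_size);
}

/* caller zeroes with a constant size the compiler can optimize: */
node = rt_alloc_node(tree, RT_CLASS_4_FULL);
memset(node, 0, sizeof(rt_node_inner_4));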

--
John Naylor
EDB: http://www.enterprisedb.com
From 6fcc970ae7e31f44fa6b6aface983cadb023cc50 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Thu, 17 Nov 2022 16:10:44 +0700
Subject: [PATCH v901 2/2] Make node32 variable sized

Add a size class for 15 elements, which corresponds to 160 bytes,
an allocation size used by DSA. When a 16th element is to be
inserted, allocte a larger area and memcpy the entire old node
to it.

NB: Zeroing the new area is only necessary if it's for an
inner node128, since insert logic must check for null child
pointers.

This technique allows us to limit the node kinds to 4, which
1. limits the number of cases in switch statements
2. allows a possible future optimization to encode the node kind
in a pointer tag
---
 src/backend/lib/radixtree.c | 141 +++-
 1 file changed, 108 insertions(+), 33 deletions(-)

diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bef1a438ab..f368e750d5 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -130,6 +130,7 @@ typedef enum
 typedef enum rt_size_class
 {
RT_CLASS_4_FULL = 0,
+   RT_CLASS_32_PARTIAL,
RT_CLASS_32_FULL,
RT_CLASS_128_FULL,
RT_CLASS_256
@@ -147,6 +148,8 @@ typedef struct rt_node
uint16  count;
 
/* Max number of children. We can use uint8 because we never need to 
store 256 */
+   /* WIP: if we don't have a variable sized node4, this should instead be 
in the base
+   types as needed, since saving every byte is crucial for the smallest 
node kind */
uint8   fanout;
 
/*
@@ -166,6 +169,8 @@ typedef struct rt_node
((node)->base.n.count < (node)->ba

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread John Naylor
On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada 
wrote:
>
> On Wed, Nov 16, 2022 at 1:46 PM John Naylor
>  wrote:
> >
> >
> > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
wrote:
> > > Thanks! Please let me know if there is something I can help with.
> >
> > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> >
> > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
>
> Which tests do you use to get this assertion failure? I've confirmed
> there is a bug in 0005 patch but without it, "make check-world"
> passed.

Hmm, I started over and rebuilt and it didn't reproduce. Not sure what
happened, sorry for the noise.

I'm attaching a test I wrote to stress test branch prediction in search,
and while trying it out I found two possible issues.

It's based on the random int load test, but tests search speed. Run like
this:

select * from bench_search_random_nodes(10 * 1000 * 1000)

It also takes some care to include all the different node kinds,
restricting the possible keys by AND-ing with a filter. Here's a simple
demo:

filter = ((uint64)1<<40)-1;
LOG:  num_keys = 967, height = 4, n4 = 17513814, n32 = 6320, n128 =
62663, n256 = 3130

Just using random integers leads to >99% using the smallest node. I wanted
to get close to having the same number of each, but that's difficult while
still using random inputs. I ended up using

filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)

which gives

LOG:  num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 =
182670, n256 = 1024

Which seems okay for the task. One puzzling thing I found while trying
various filters is that sometimes the reported tree height would change.
For example:

filter = (((uint64) 1<<32) | (0xFF<<24));
LOG:  num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 =
62632, n256 = 3161

1) Any idea why the tree height would be reported as 7 here? I didn't
expect that.

2) It seems that 0004 actually causes a significant slowdown in this test
(as in the attached, using the second filter above and with turboboost
disabled):

v9 0003: 2062 2051 2050
v9 0004: 2346 2316 2321

That means my idea for the pointer struct might have some problems, at
least as currently implemented. Maybe in the course of separating out and
polishing that piece, an inefficiency will fall out. Or, it might be
another reason to template local and shared separately. Not sure yet. I
also haven't tried to adjust this test for the shared memory case.

--
John Naylor
EDB: http://www.enterprisedb.com
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql 
b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 0874201d7e..e0205b364e 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -43,6 +43,14 @@ returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
 
+create function bench_search_random_nodes(
+cnt int8,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
 create function bench_fixed_height_search(
 fanout int4,
 OUT fanout int4,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c 
b/contrib/bench_radix_tree/bench_radix_tree.c
index 7abb237e96..a43fc61c2d 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -29,6 +29,7 @@ PG_FUNCTION_INFO_V1(bench_seq_search);
 PG_FUNCTION_INFO_V1(bench_shuffle_search);
 PG_FUNCTION_INFO_V1(bench_load_random_int);
 PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);
 
 static uint64
 tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -347,6 +348,77 @@ bench_load_random_int(PG_FUNCTION_ARGS)
PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, 
nulls)));
 }
 
+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+   x ^= x >> 30;
+   x *= UINT64CONST(0xbf58476d1ce4e5b9);
+   x ^= x >> 27;
+   x *= UINT64CONST(0x94d049bb133111eb);
+   x ^= x >> 31;
+   return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+   uint64  cnt = (uint64) PG_GETARG_INT64(0);
+   radix_tree *rt;
+   TupleDesc   tupdesc;
+   TimestampTz start_time,
+   end_time;
+   longsecs;
+   int usecs;
+   int64   search_time_ms;
+   Datum   values[2] = {0};
+   boolnulls[2] = {0};
+   /* from trial and error */
+   const uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
+   

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread John Naylor
On Wed, Nov 16, 2022 at 11:46 AM John Naylor 
wrote:
>
>
> On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
wrote:
> > Thanks! Please let me know if there is something I can help with.
>
> I didn't get very far because the tests fail on 0004 in rt_verify_node:
>
> TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

Actually I do want to offer some general advice. Upthread I recommended a
purely refactoring patch that added the node-pointer struct but did nothing
else, so that the DSA changes would be smaller. 0004 attempted pointer
tagging in the same commit, which makes it no longer a purely refactoring
patch, so that 1) makes it harder to tell what part caused the bug and 2)
obscures what is necessary for DSA pointers and what was additionally
necessary for pointer tagging. Shared memory support is a prerequisite for
a shippable feature, but pointer tagging is (hopefully) a performance
optimization. Let's keep them separate.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-15 Thread John Naylor
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada 
wrote:
> Thanks! Please let me know if there is something I can help with.

I didn't get very far because the tests fail on 0004 in rt_verify_node:

TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File:
"../src/backend/lib/radixtree.c", Line: 2186, PID: 18242

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-14 Thread John Naylor
On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada 
wrote:
>
> 0004 patch is a new patch supporting a pointer tagging of the node
> kind. Also, it introduces rt_node_ptr we discussed so that internal
> functions use it rather than having two arguments for encoded and
> decoded pointers. With this intermediate patch, the DSA support patch
> became more readable and understandable. Probably we can make it
> smaller further if we move the change of separating the control object
> from radix_tree to the main patch (0002). The patch still needs to be
> polished but I'd like to check if this idea is worthwhile. If we agree
> on this direction, this patch will be merged into the main radix tree
> implementation patch.

Thanks for the new patch set. I've taken a very brief look at 0004 and I
think the broad outlines are okay. As you say it needs polish, but before
going further, I'd like to do some experiments of my own as I mentioned
earlier:

- See how much performance we actually gain from tagging the node kind.
- Try additional size classes while keeping the node kinds to only four.
- Optimize node128 insert.
- Try templating out the differences between local and shared memory. With
local memory, the node-pointer struct would be a union, for example.
Templating would also reduce branches and re-simplify some internal APIs,
but it's likely that would also make the TID store and/or vacuum more
complex, because at least some external functions would be duplicated.

I'll set the patch to "waiting on author", but in this case the author is
me.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Non-decimal integer literals

2022-11-13 Thread John Naylor
On Mon, Oct 10, 2022 at 9:17 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:

> Taking another look around ecpg to see how this interacts with C-syntax
> integer literals.  I'm not aware of any particular issues, but it's
> understandably tricky.

I can find no discussion in the archives about the commit that added "xch":

commit 6fb3c3f78fbb2296894424f6e3183d339915eac7
Author: Michael Meskes 
Date:   Fri Oct 15 19:02:08 1999 +

*** empty log message ***

...and I can't think of why bounds checking a C literal was done like this.

Regarding the patch, it looks good overall. My only suggestion would be to
add a regression test for just below and just above overflow, at least for
int2.

Minor nits:

- * Process {integer}.  Note this will also do the right thing with
{decimal},
+ * Process {*integer}.  Note this will also do the right thing with
{numeric},

I scratched my head for a while, thinking this was Flex syntax, until I
realized my brain was supposed to do shell-globbing first, at which point I
could see it was referring to multiple Flex rules. I'd try to rephrase.

+T661 Non-decimal integer literals YES SQL:202x draft

Is there an ETA yet?

Also, it's not this patch's job to do it, but there are at least a half
dozen places that open-code turning a hex char into a number. It might be a
good easy "todo item" to unify that.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment

2022-11-13 Thread John Naylor
On Tue, Nov 8, 2022 at 8:57 AM David Rowley  wrote:

> On Tue, 8 Nov 2022 at 05:24, Andres Freund  wrote:
> > I doubt there's ever a need to realloc such a pointer? Perhaps we could just
> > elog(ERROR)?
>
> Are you maybe locked into just thinking about your current planned use
> case that we want to allocate BLCKSZ bytes in each case? It does not
> seem impossible to me that someone will want something more than an
> 8-byte alignment and also might want to enlarge the allocation at some
> point. I thought it might be more dangerous not to implement repalloc.
> It might not be clear to someone using palloc_aligned() that there's
> no code path that can call repalloc on the returned pointer.

I can imagine a use case for arrays of cacheline-sized objects.

> TYPEALIGN() will not work correctly unless the alignment is a power of
> 2. We could modify it to, but that would require doing some modular
> maths instead of bitmasking. That seems pretty horrible when the macro
> is given a value that's not constant at compile time as we'd end up
> with a (slow) divide in the code path.  I think the restriction is a
> good idea. I imagine there will never be any need to align to anything
> that's not a power of 2.

+1
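
For reference, the power-of-2 restriction is what makes the computation an
add and a mask. A minimal sketch along the lines of TYPEALIGN (not the exact
macro):

/* Round LEN up to the next multiple of ALIGNVAL, a power of 2. Any other
 * value would need modular arithmetic, i.e. a slow divide. */
#define ALIGN_UP(ALIGNVAL, LEN) \
	(((uintptr_t) (LEN) + ((ALIGNVAL) - 1)) & ~((uintptr_t) ((ALIGNVAL) - 1)))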

> > Should we handle the case where we get a suitably aligned pointer from
> > MemoryContextAllocExtended() differently?
>
> Maybe it would be worth the extra check. I'm trying to imagine future
> use cases.  Maybe if someone wanted to ensure that we're aligned to
> CPU cache line boundaries then the chances of the pointer already
> being aligned to 64 bytes is decent enough.  The problem is it that
> it's too late to save any memory, it just saves a bit of boxing and
> unboxing of the redirect headers.

To my mind the main point of detecting this case is to save memory, so if
that's not possible/convenient, special-casing doesn't seem worth it.

- Assert((char *) chunk > (char *) block);
+ Assert((char *) chunk >= (char *) block);

Is this related or independent?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-11-11 Thread John Naylor
On Fri, Nov 11, 2022 at 2:12 PM Michael Paquier  wrote:
>
> On Fri, Nov 11, 2022 at 11:53:08AM +0530, Bharath Rupireddy wrote:
> > On Fri, Nov 11, 2022 at 11:14 AM John Naylor <
john.nay...@enterprisedb.com> wrote:
> >> Was there supposed to be an attachment here?
> >
> > Nope. The patches have already been committed -
> > 3bdbdf5d06f2179d4c17926d77ff734ea9e7d525 and
> > 28cc2976a9cf0ed661dbc55f49f669192cce1c89.
>
> The committed patches are pretty much the same as the last version
> sent on this thread, except that the changes have been split across
> the files they locally impact, with a few simplification tweaks to
> the comments.  Hence, I did not see any need to send a new version for
> this case.

Ah, I missed that -- sorry for the noise.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: slab allocator performance issues

2022-11-11 Thread John Naylor
On Wed, Oct 12, 2022 at 4:37 PM David Rowley  wrote:
> [v3]

+ /*
+ * Compute a shift that guarantees that shifting chunksPerBlock with it
+ * yields is smaller than SLAB_FREELIST_COUNT - 1 (one freelist is used
for full blocks).
+ */
+ slab->freelist_shift = 0;
+ while ((slab->chunksPerBlock >> slab->freelist_shift) >=
(SLAB_FREELIST_COUNT - 1))
+ slab->freelist_shift++;

+ /*
+ * Ensure, without a branch, that index 0 is only used for blocks entirely
+ * without free chunks.
+ * XXX: There probably is a cheaper way to do this. Needing to shift twice
+ * by slab->freelist_shift isn't great.
+ */
+ index = (freecount + (1 << slab->freelist_shift) - 1) >>
slab->freelist_shift;

How about something like

#define SLAB_FREELIST_COUNT ((1<<3) + 1)
index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);

and dispense with both freelist_shift and the loop that computes it?
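
As a sanity check, here's a tiny standalone program (plain C, not PostgreSQL
code) confirming that the expression keeps index 0 reserved for blocks with
no free chunks and maps everything else into 1..SLAB_FREELIST_COUNT-1:

#include <assert.h>
#include <stdio.h>

#define SLAB_FREELIST_COUNT ((1<<3) + 1)	/* 9; list 0 is for full blocks */

int
main(void)
{
	for (int freecount = 0; freecount <= 64; freecount++)
	{
		int		index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);

		assert(freecount != 0 || index == 0);
		assert(freecount == 0 || (index >= 1 && index < SLAB_FREELIST_COUNT));
		printf("freecount %2d -> freelist %d\n", freecount, index);
	}
	return 0;
}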

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

2022-11-10 Thread John Naylor
On Tue, Nov 8, 2022 at 11:08 AM Michael Paquier  wrote:
>
> So, I have looked at that, and at the end concluded that Andres'
> suggestion to use PGAlignedBlock in pg_write_zeros() will serve better
> in the long run.  Thomas has mentioned upthread that some of the
> comments don't need to be that long, so I have tweaked these to be
> minimal, and updated a few more areas.  Note that this has been split
> into two commits: one to introduce the new routine in file_utils.c and
> a second for the switch in walmethods.c.

Was there supposed to be an attachment here?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Direct I/O

2022-11-09 Thread John Naylor
On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro  wrote:

> Memory alignment patches:
>
> Direct I/O generally needs to be done to/from VM page-aligned
> addresses, but only "standard" 4KB pages, even when larger VM pages
> are in use (if there is an exotic system where that isn't true, it
> won't work).  We need to deal with buffers on the stack, the heap and
> in shmem.  For the stack, see patch 0001.  For the heap and shared
> memory, see patch 0002, but David Rowley is going to propose that part
> separately, as MemoryContext API adjustments are a specialised enough
> topic to deserve another thread; here I include a copy as a
> dependency.  The main direct I/O patch is 0003.

One thing to note: Currently, a request to aset above 8kB must go into a
dedicated block. Not sure if it's a coincidence that that matches the
default PG page size, but if allocating pages on the heap is hot enough,
maybe we should consider raising that limit. Although then, aligned-to-4kB
requests would result in 16kB chunks requested unless a different allocator
was used.
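
The arithmetic behind that, roughly (the header size here is a guess, but
any nonzero header is enough to push it over):

#include <stdio.h>

int
main(void)
{
	size_t		request = 8192;	/* BLCKSZ */
	size_t		align = 4096;
	size_t		header = 16;	/* redirection bookkeeping; size is a guess */

	/* worst-case padding needed to hand back an aligned pointer */
	printf("%zu\n", request + (align - 1) + header);	/* 12303 -> 16kB chunk */
	return 0;
}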

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment

2022-11-07 Thread John Naylor
On Tue, Nov 8, 2022 at 8:57 AM David Rowley  wrote:
> Is there anything we could align to CPU cacheline size that would
> speed something up?

InitCatCache() already has this, which could benefit from simpler notation.

/*
 * Allocate a new cache structure, aligning to a cacheline boundary
 *
 * Note: we rely on zeroing to initialize all the dlist headers correctly
 */
sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE;
cp = (CatCache *) CACHELINEALIGN(palloc0(sz));
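
With the proposed API, this could presumably shrink to something like the
following (a sketch, assuming the signature and zeroing flag from the patch):

cp = (CatCache *) palloc_aligned(sizeof(CatCache), PG_CACHE_LINE_SIZE,
								 MCXT_ALLOC_ZERO);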

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Improve logging when using Huge Pages

2022-11-06 Thread John Naylor
On Thu, Nov 3, 2022 at 8:11 AM Thomas Munro  wrote:
>
> I wonder if the problem here is that people are reluctant to add noise
> to every starting system.   There are people who have not configured
> their system and don't want to see that noise, and then some people
> have configured their system and would like to know about it if it
> doesn't work so they can be aware of that, but don't want to use "off"
> because they don't want a hard failure.  Would it be better if there
> were a new level "try_log" (or something), which only logs a message
> if it tries and fails?

I think the best thing to do is change huge_pages='on' to log a WARNING and
fall back to regular pages if the mapping fails. That way, both dev and prod
can keep the same settings, since 'on' will have both visibility and
robustness. I don't see a good reason to refuse to start -- seems like an
anti-pattern.
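
Roughly what I have in mind, sketched against the anonymous-mmap logic
(hand-waving the surrounding sysv_shmem.c details -- 'try' already falls
back silently, this just adds the WARNING for 'on'):

/* hypothetical sketch, not the actual sysv_shmem.c code */
if (huge_pages == HUGE_PAGES_ON || huge_pages == HUGE_PAGES_TRY)
{
	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (ptr == MAP_FAILED)
	{
		if (huge_pages == HUGE_PAGES_ON)
			ereport(WARNING,
					(errmsg("could not map anonymous shared memory with huge pages, continuing with regular pages")));
		ptr = NULL;		/* fall through to the plain mmap() below */
	}
}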

--
John Naylor
EDB: http://www.enterprisedb.com


Re: remap the .text segment into huge pages at run time

2022-11-06 Thread John Naylor
On Sat, Nov 5, 2022 at 3:27 PM Andres Freund  wrote:

> > simplified it. Interestingly enough, looking through the commit history,
> > they used to align the segments via linker flags, but took it out here:
> >
> > https://github.com/intel/iodlr/pull/25#discussion_r397787559
> >
> > ...saying "I'm not sure why we added this". :/
>
> That was about using a linker script, not really linker flags though.

Oops, the commit I was referring to pointed to that discussion, but I
should have shown it instead:

--- a/large_page-c/example/Makefile
+++ b/large_page-c/example/Makefile
@@ -28,7 +28,6 @@ OBJFILES=  \
   filler16.o   \

 OBJS=$(addprefix $(OBJDIR)/,$(OBJFILES))
-LDFLAGS=-Wl,-z,max-page-size=2097152

But from what you're saying, this flag wouldn't have been enough anyway...

> I don't think the dummy functions are a good approach, there were plenty
> things after it when I played with them.

To be technical, the point wasn't to have no code after it, but to have no
*hot* code *before* it, since with the iodlr approach the first 1.99MB of
.text is below the first aligned boundary within that section. But yeah,
I'm happy to ditch that hack entirely.

> > > With these flags the "R E" segments all start on a 0x200000/2MiB boundary
> > > and are padded to the next 2MiB boundary. However the OS / dynamic loader
> > > only maps the necessary part, not all the zero padding.
> > >
> > > This means that if we were to issue a MADV_COLLAPSE, we can before it do
> > > an mremap() to increase the length of the mapping.
> >
> > I see, interesting. What location are you passing for madvise() and
> > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > aligned boundary within .text?

>/*
> * Make huge pages out of it. Requires at least linux 6.1.  We could
> * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
> * much in older kernels.
> */

About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for
THP? The man page seems to indicate that.

In the support work I've done, the standard recommendation is to turn THP
off, especially if they report sudden performance problems. If explicit
HP's are used for shared mem, maybe THP is less of a risk? I need to look
back at the tests that led to that advice...

> A real version would have to open /proc/self/maps and do this for at least

I can try and generalize your above sketch into a v2 patch.

> postgres' r-xp mapping. We could do it for libraries too, if they're suitably
> aligned (both in memory and on-disk).

It looks like plpgsql is only 27 standard pages in size...

Regarding glibc, we could try moving a couple of the hotter functions into
PG, using smaller and simpler coding, if that has better frontend cache
behavior. The paper "Understanding and Mitigating Front-End Stalls in
Warehouse-Scale Computers" talks about this, particularly section 4.4
regarding memcmp().
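
As a sketch of what "smaller and simpler" could mean here -- trading glibc's
peak throughput for a smaller instruction-cache footprint on short inputs:

/* sketch: a tiny equality check for short, known-small lengths */
static inline bool
small_memeq(const unsigned char *a, const unsigned char *b, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (a[i] != b[i])
			return false;
	}
	return true;
}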

> > I quickly tried to align the segments with the linker and then in my patch
> > have the address for mmap() rounded *down* from the .text start to the
> > beginning of that segment. It refused to start without logging an error.
>
> Hm, what linker was that? I did note that you need some additional flags for
> some of the linkers.

BFD, but I wouldn't worry about that failure too much, since the
mremap()/madvise() strategy has a lot fewer moving parts.
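
For that v2 patch, the core would presumably look something like this sketch
(Linux-only; MADV_COLLAPSE needs kernel 6.1+, and it assumes the linker left
2MB-aligned zero padding to grow into):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_2MB (2 * 1024 * 1024UL)
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25		/* from the linux 6.1 headers */
#endif

/* sketch: seg_start must be the 2MB-aligned start of the mapping */
static void
collapse_segment(void *seg_start, size_t seg_len)
{
	size_t		huge_len = (seg_len + HUGE_2MB - 1) & ~(HUGE_2MB - 1);

	/* grow the mapping in place into the linker's zero padding */
	if (mremap(seg_start, seg_len, huge_len, 0) == MAP_FAILED)
		perror("mremap");
	/* then ask the kernel to back the region with huge pages */
	else if (madvise(seg_start, huge_len, MADV_COLLAPSE) != 0)
		perror("madvise");
}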

On the subject of linkers, though, one thing that tripped me up was trying
to change the linker with Meson. First I tried

-Dc_args='-fuse-ld=lld'

but that led to warnings like this when linking:
/usr/bin/ld: warning: -z separate-loadable-segments ignored

When using this in the top level meson.build

elif host_system == 'linux'
  sema_kind = 'unnamed_posix'
  cppflags += '-D_GNU_SOURCE'
  # Align the loadable segments to 2MB boundaries to support remapping to
  # huge pages.
  ldflags += cc.get_supported_link_arguments([
'-Wl,-zmax-page-size=0x200000',
'-Wl,-zcommon-page-size=0x200000',
'-Wl,-zseparate-loadable-segments'
  ])


According to

https://mesonbuild.com/howtox.html#set-linker

I need to add CC_LD=lld to the env vars before invoking, which got rid of
the warning. Then I wanted to verify that lld was actually used, and in

https://releases.llvm.org/14.0.0/tools/lld/docs/index.html

it says I can run this and it should show “Linker: LLD”, but that doesn't
appear for me:

$ readelf --string-dump .comment inst-perf/bin/postgres

String dump of section '.comment':
  [ 0]  GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2)


--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-05 Thread John Naylor
On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada 
wrote:
>
> For parallel heap pruning, multiple workers will insert key-value
> pairs to the radix tree concurrently. The simplest solution would be a
> single lock to protect writes but the performance will not be good.
> Another solution would be that we can divide the tables into multiple
> ranges so that keys derived from TIDs are not conflicted with each
> other and have parallel workers process one or more ranges. That way,
> parallel vacuum workers can build *sub-trees* and the leader process
> can merge them. In use cases of lazy vacuum, since the write phase and
> read phase are separated the readers don't need to worry about
> concurrent updates.

It's a good idea to use ranges for a different reason -- readahead. See
commit 56788d2156fc3, which aimed to improve readahead for sequential
scans. It might work to use that as a model: Each worker prunes a range of
64 pages, keeping the dead tids in a local array. At the end of the range:
lock the tid store, enter the tids into the store, unlock, free the local
array, and get the next range from the leader. It's possible contention
won't be too bad, and I suspect using small local arrays as-we-go would be
faster and use less memory than merging multiple sub-trees at the end.
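
In pseudo-C, the worker loop I'm imagining (all of these helper names are
hypothetical, just to make the shape concrete):

/* hypothetical sketch -- none of these helpers exist yet */
BlockNumber start_blk,
			end_blk;

while (leader_get_next_range(&start_blk, &end_blk))	/* e.g. 64 blocks */
{
	int			ndead = 0;

	for (BlockNumber blk = start_blk; blk < end_blk; blk++)
		prune_page(blk, local_dead_tids, &ndead);	/* readahead-friendly */

	/* enter this range's dead TIDs into the shared store in one batch */
	LWLockAcquire(tidstore_lock, LW_EXCLUSIVE);
	for (int i = 0; i < ndead; i++)
		tidstore_add(shared_tidstore, &local_dead_tids[i]);
	LWLockRelease(tidstore_lock);
}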

> I've attached a draft patch for lazy vacuum integration that can be
> applied on top of v8 patches. The patch adds a new module called
> TIDStore, an efficient storage for TID backed by radix tree. Lazy
> vacuum and parallel vacuum use it instead of a TID array. The patch
> also introduces rt_detach() that was missed in 0002 patch. It's a very
> rough patch but I hope it helps in considering lazy vacuum
> integration, radix tree APIs, and shared radix tree functionality.

It does help, good to see this.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: remap the .text segment into huge pages at run time

2022-11-04 Thread John Naylor
On Sat, Nov 5, 2022 at 1:33 AM Andres Freund  wrote:

> > I wonder how far we can get with just using the linker hints to align
> > sections. I know that the linux folks are working on promoting sufficiently
> > aligned executable pages to huge pages too, and might have succeeded already.
> >
> > IOW, adding the linker flags might be a good first step.
>
> Indeed, I did see that that works to some degree on the 5.19 kernel I was
> running. However, it never seems to get around to using huge pages
> sufficiently to compete with explicit use of huge pages.

Oh nice, I didn't know that! There might be some threshold of pages mapped
before it does so. At least, that issue is mentioned in that paper linked
upthread for FreeBSD.

> More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was
> added into linux 6.1. That explicitly remaps a region and uses huge pages for
> it. Of course that's going to take a while to be widely available, but it
> seems like a safer approach than the remapping approach from this thread.

I didn't know that either, funny timing.

> I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
> the address / length), and it seems to work nicely.
>
> With the weird caveat that on fs one needs to make sure that the executable
> doesn't reflinks to reuse parts of other files, and that the mold linker and
> cp do... Not a concern on ext4, but on xfs. I took to copying the postgres
> binary with cp --reflink=never

What happens otherwise? That sounds like a difficult thing to guard against.

> The difference in itlb.itlb_flush between pipelined / non-pipelined cases
> unsurprisingly is stark.
>
> While the pipelined case still sees a good bit reduced itlb traffic, the total
> amount of cycles in which a walk is active is just not large enough to matter,
> by the looks of it.

Good to know, thanks for testing. Maybe the pipelined case is something
devs should consider when microbenchmarking, to reduce noise from context
switches.

On Sat, Nov 5, 2022 at 4:21 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > - Add a "cold" __asm__ filler function that just takes up space, enough to
> > > push the end of the .text segment over the next aligned boundary, or to
> > > ~8MB in size.
> >
> > I don't understand why this is needed - as long as the pages are aligned to
> > 2MB, why do we need to fill things up on disk? The in-memory contents are the
> > relevant bit, no?
>
> I now assume it's because you either observed the mappings set up by the
> loader to not include the space between the segments?

My knowledge is not quite that deep. The iodlr repo has an example "hello
world" program, which links with 8 filler objects, each with 32768
__attribute((used)) dummy functions. I just cargo-culted that idea and
simplified it. Interestingly enough, looking through the commit history,
they used to align the segments via linker flags, but took it out here:

https://github.com/intel/iodlr/pull/25#discussion_r397787559

...saying "I'm not sure why we added this". :/

I quickly tried to align the segments with the linker and then in my patch
have the address for mmap() rounded *down* from the .text start to the
beginning of that segment. It refused to start without logging an error.

BTW, that what I meant before, although I wasn't clear:

> > Since the front is all-cold, and there is very little at the end,
> > practically all hot pages are now remapped. The biggest problem with the
> > hackish filler function (in addition to maintainability) is, if explicit
> > huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB
> > causes complete startup failure if the .text segment is larger than 8MB.
>
> I would expect MAP_HUGETLB to always fail if not enabled in the kernel,
> independent of the .text segment size?

With the file-level hack, it would just fail without a trace with .text >
8MB (I have yet to enable core dumps on this new OS I have...), whereas
without it I did see the failures in the log, and successful fallback.

> With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
> are padded to the next 2MiB boundary. However the OS / dynamic loader only
> maps the necessary part, not all the zero padding.
>
> This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> mremap() to increase the length of the mapping.

I see, interesting. What location are you passing for madvise() and
mremap()? The beginning of the segment (for me has .init/.plt) or an
aligned boundary within .text?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Incorrect include file order in guc-file.l

2022-11-03 Thread John Naylor
On Fri, Nov 4, 2022 at 5:42 AM Michael Paquier  wrote:
>
> The CI is able to complete without it.  Would you mind if it is
> removed?  If you don't want us to poke more at the bear, that's a nit
> so leaving things as they are is also fine by me.

I've removed it.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Incorrect include file order in guc-file.l

2022-11-03 Thread John Naylor
On Thu, Nov 3, 2022 at 6:40 PM Michael Paquier  wrote:
>
> On Thu, Nov 03, 2022 at 12:40:19PM +0700, John Naylor wrote:
> > On Wed, Nov 2, 2022 at 1:01 PM Julien Rouhaud 
wrote:
> >> Agreed, it's apparently an oversight in dac048f71eb.  +1 for the patch.
> >
> > I've pushed this, thanks!
>
> Thanks for the commit.  I've wanted to get it done yesterday but life
> took over faster than that.  Before committing the change, there is
> something I have noticed though: this header does not seem to be
> necessary at all and it looks that there is nothing in guc-file.l that
> needs it.  Why did you add it in dac048f to begin with?

Because it wouldn't compile otherwise, obviously. :-)

I must have been working on it before bfb9dfd93720

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Prefetch the next tuple's memory during seqscans

2022-11-03 Thread John Naylor
On Tue, Nov 1, 2022 at 5:17 AM David Rowley  wrote:
>
> My test is to run 16 queries changing the WHERE clause each time to
> have WHERE a = 0,  then WHERE a2 = 0 ... WHERE a16 = 0.  I wanted to
> know if prefetching only the first cache line of the tuple would be
> less useful when we require evaluation of say, the "a16" column vs the
> "a" column.

I tried a similar test, but with text fields of random length, and there is
improvement here:

Intel laptop, turbo boost off
shared_buffers = '4GB'
huge_pages = 'on'
max_parallel_workers_per_gather = '0'

create table text8 as
select
repeat('X', int4(random() * 20)) a1,
repeat('X', int4(random() * 20)) a2,
repeat('X', int4(random() * 20)) a3,
repeat('X', int4(random() * 20)) a4,
repeat('X', int4(random() * 20)) a5,
repeat('X', int4(random() * 20)) a6,
repeat('X', int4(random() * 20)) a7,
repeat('X', int4(random() * 20)) a8
from generate_series(1,1000) a;
vacuum freeze text8;

psql -c "select pg_prewarm('text8')" && \
for i in a1 a2 a3 a4 a5 a6 a7 a8;
do
echo Testing $i
echo "select * from text8 where $i = 'ZZZ';" > bench.sql
pgbench -f bench.sql -M prepared -n -T 10 postgres | grep latency
done


Master:

Testing a1
latency average = 980.595 ms
Testing a2
latency average = 1045.081 ms
Testing a3
latency average = 1107.736 ms
Testing a4
latency average = 1162.188 ms
Testing a5
latency average = 1213.985 ms
Testing a6
latency average = 1272.156 ms
Testing a7
latency average = 1318.281 ms
Testing a8
latency average = 1363.359 ms

Patch 0001+0003:

Testing a1
latency average = 812.548 ms
Testing a2
latency average = 897.303 ms
Testing a3
latency average = 955.997 ms
Testing a4
latency average = 1023.497 ms
Testing a5
latency average = 1088.494 ms
Testing a6
latency average = 1149.418 ms
Testing a7
latency average = 1213.134 ms
Testing a8
latency average = 1282.760 ms

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Incorrect include file order in guc-file.l

2022-11-02 Thread John Naylor
On Wed, Nov 2, 2022 at 1:01 PM Julien Rouhaud  wrote:
>
> Hi,
>
> On Wed, Nov 02, 2022 at 02:29:50PM +0900, Michael Paquier wrote:
> >
> > While reviewing a different patch, I have noticed that guc-file.l
> > includes sys/stat.h in the middle of the PG internal headers.  The
> > usual practice is to have first postgres[_fe].h, followed by the
> > system headers and finally the internal headers.  That's a nit, but
> > all the other files do that.
> >
> > {be,fe}-secure-openssl.c include some exceptions though, as documented
> > there.
>
> Agreed, it's apparently an oversight in dac048f71eb.  +1 for the patch.

I've pushed this, thanks!

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-11-02 Thread John Naylor
On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada 
wrote:
>
> I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
> the comments I got so far. 0004 patch is a DSA support patch for PoC.

Thanks for the new patchset. This is not a full review, but I have some
comments:

0001 and 0002 look okay on a quick scan -- I will use this as a base for
further work that we discussed. However, before I do so I'd like to request
another revision regarding the following:

> In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
> to point its children, and we use rt_node_ptr as either rt_node* or
> dsa_pointer depending on whether the radix tree is shared or not (ie,
> by checking radix_tree->dsa == NULL).

0004: Looks like a good start, but this patch has a large number of changes
like these, making it hard to read:

- if (found && child_p)
- *child_p = child;
+ if (found && childp_p)
+ *childp_p = childp;
...
  rt_node_inner_32 *new32;
+ rt_node_ptr new32p;

  /* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
-  RT_NODE_KIND_32);
+ new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);

It's difficult to keep in my head what all the variables refer to. I
thought a bit about how to split this patch up to make this easier to read.
Here's what I came up with:

typedef struct rt_node_ptr
{
  uintptr_t encoded;
  rt_node * decoded;
}

Note that there is nothing about "dsa or local". That's deliberate. That
way, we can use the "encoded" field for a tagged pointer as well, as I hope
we can do (at least for local pointers) in the future. So an intermediate
patch would have "static inline void" functions  node_ptr_encode() and
 node_ptr_decode(), which would only copy from one member to another. I
suspect that: 1. The actual DSA changes will be *much* smaller and easier
to reason about. 2. Experimenting with tagged pointers will be easier.
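
Spelled out, those helpers would be trivial (a sketch -- in the local-memory
case both members hold the same bits):

static inline void
node_ptr_encode(rt_node_ptr *ptr)
{
	ptr->encoded = (uintptr_t) ptr->decoded;
}

static inline void
node_ptr_decode(rt_node_ptr *ptr)
{
	ptr->decoded = (rt_node *) ptr->encoded;
}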

Also, quick question: 0004 has a new function rt_node_update_inner() -- is
that necessary because of DSA, or does this ideally belong in 0002? What's
the reason for it?

> Regarding the performance, I've
> added another boolean argument to bench_seq/shuffle_search(),
> specifying whether to use the shared radix tree or not. Here are
> benchmark results in my environment,

> [...]

> In non-shared radix tree cases (the forth argument is false), I don't
> see a visible performance degradation. On the other hand, in shared
> radix tree cases (the forth argument is true), I see visible overheads
> because of dsa_get_address().

Thanks, this is useful.

> Please note that the current shared radix tree implementation doesn't
> support any locking, so it cannot be read while written by someone.

I think at the very least we need a global lock to enforce this.

> Also, only one process can iterate over the shared radix tree. When it
> comes to parallel vacuum, these don't become restriction as the leader
> process writes the radix tree while scanning heap and the radix tree
> is read by multiple processes while vacuuming indexes. And only the
> leader process can do heap vacuum by iterating the key-value pairs in
> the radix tree. If we want to use it for other cases too, we would
> need to support locking, RCU or something.

A useful exercise here is to think about what we'd need to do parallel heap
pruning. We don't need to go that far for v16 of course, but what's the
simplest thing we can do to make that possible? Other use cases can change
to more sophisticated schemes if need be.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Incorrect include file order in guc-file.l

2022-11-02 Thread John Naylor
On Wed, Nov 2, 2022 at 1:01 PM Julien Rouhaud  wrote:
>
> Hi,
>
> On Wed, Nov 02, 2022 at 02:29:50PM +0900, Michael Paquier wrote:
> >
> > While reviewing a different patch, I have noticed that guc-file.l
> > includes sys/stat.h in the middle of the PG internal headers.  The
> > usual practice is to have first postgres[_fe].h, followed by the
> > system headers and finally the internal headers.  That's a nit, but
> > all the other files do that.
> >
> > {be,fe}-secure-openssl.c include some exceptions though, as documented
> > there.
>
> Agreed, it's apparently an oversight in dac048f71eb.  +1 for the patch.

Yeah. +1, thanks.

--
John Naylor
EDB: http://www.enterprisedb.com


remap the .text segment into huge pages at run time

2022-11-02 Thread John Naylor
feedback on whether the entire approach
in this thread is sound enough to justify working further on.

[1] https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf
(paper: "On the Impact of Instruction Address Translation Overhead")
[2] https://twitter.com/AndresFreundTec/status/1214305610172289024
[3] https://github.com/intel/iodlr

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 9cde401f87937c1982f2355c8f81449514166376 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Mon, 31 Oct 2022 13:59:30 +0700
Subject: [PATCH v1 2/2] Put all non-cold .text in huge pages

Tell linker to align addresses on 2MB boundaries. The .init
section will be so aligned, with the .text section soon after that.
Therefore, the start address of .text should always be align up to
nearly 2MB ahead of the actual start. The first nearly 2MB of .text
will not map to huge pages.

We count on cold sections linking to the front of the .text segment:
Since the cold sections total about 600kB in size, we need ~1.4MB of
additional padding to keep non-cold pages mappable to huge pages. Since
PG has about 5.0MB of .text, we also need an additional 1MB to push
the .text end just past an aligned boundary, so when we align the end
down, only a small number of pages will remain un-remapped at their
original 4kB size.
---
 meson.build  |  3 +++
 src/backend/port/filler.c| 29 +
 src/backend/port/meson.build |  3 +++
 3 files changed, 35 insertions(+)
 create mode 100644 src/backend/port/filler.c

diff --git a/meson.build b/meson.build
index bfacbdc0af..450946370c 100644
--- a/meson.build
+++ b/meson.build
@@ -239,6 +239,9 @@ elif host_system == 'freebsd'
 elif host_system == 'linux'
   sema_kind = 'unnamed_posix'
   cppflags += '-D_GNU_SOURCE'
+  # WIP: debug builds are huge
+  # TODO: add portability check
+  ldflags += ['-Wl,-zcommon-page-size=2097152', '-Wl,-zmax-page-size=2097152']
 
 elif host_system == 'netbsd'
   # We must resolve all dynamic linking in the core server at program start.
diff --git a/src/backend/port/filler.c b/src/backend/port/filler.c
new file mode 100644
index 00..de4e33bb05
--- /dev/null
+++ b/src/backend/port/filler.c
@@ -0,0 +1,29 @@
+/*
+ * Add enough padding to .text segment to bring the end just
+ * past a 2MB alignment boundary. In practice, this means .text needs
+ * to be at least 8MB. It shouldn't be much larger than this,
+ * because then more hot pages will remain in 4kB pages.
+ *
+ * FIXME: With this filler added, if explicit huge pages are turned off
+ * in the kernel, attempting mmap() with MAP_HUGETLB causes a crash
+ * instead of reporting failure if the .text segment is larger than 8MB.
+ *
+ * See MapStaticCodeToLargePages() in large_page.c
+ *
+ * XXX: The exact amount of filler must be determined experimentally
+ * on platforms of interest, in non-assert builds.
+ *
+ */
+static void
+__attribute__((used))
+__attribute__((cold))
+fill_function(int x)
+{
+	/* TODO: More architectures */
+#ifdef __x86_64__
+__asm__(
+	".fill 3251000"
+);
+#endif
+	(void) x;
+}
\ No newline at end of file
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index 5ab65115e9..d876712e0c 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -16,6 +16,9 @@ if cdata.has('USE_WIN32_SEMAPHORES')
 endif
 
 if cdata.has('USE_SYSV_SHARED_MEMORY')
+  if host_system == 'linux'
+backend_sources += files('filler.c')
+  endif
   backend_sources += files('large_page.c')
   backend_sources += files('sysv_shmem.c')
 endif
-- 
2.37.3

From 0012baab70779f5fc06c8717392dc76e8f156270 Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Mon, 31 Oct 2022 15:24:29 +0700
Subject: [PATCH v1 1/2] Partly remap the .text segment into huge pages at
 postmaster start

Based on MIT licensed libary at https://github.com/intel/iodlr

The basic steps are:

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap temporary region and memcpy aligned portion of .text segment
- mmap start address to new region with huge pages and MAP_FIXED
- memcpy over and revoke the PROT_WRITE bit

The Postgres .text segment is ~5.0MB in a non-assert build, so this
method can put 2-4MB into huge pages.
---
 src/backend/port/large_page.c   | 348 
 src/backend/port/meson.build|   1 +
 src/backend/postmaster/postmaster.c |   7 +
 src/include/port/large_page.h   |  18 ++
 4 files changed, 374 insertions(+)
 create mode 100644 src/backend/port/large_page.c
 create mode 100644 src/include/port/large_page.h

diff --git a/src/backend/port/large_page.c b/src/backend/port/large_page.c
new file mode 100644
index 00..66a584f785
--- /dev/null
+++ b/src/backend/port/large_page.c
@@ -0,0 +1,348 @@
+/*-
+ *
+ * large_page.c
+ *	  Map .text segment of binary to huge pages

Re: Documentation for building with meson

2022-10-27 Thread John Naylor
+# Run the main pg_regress and isolation tests
+meson test --suite main

This does not work for me in a fresh install until running

meson test --suite setup

In fact, we see in

https://wiki.postgresql.org/wiki/Meson

meson test --suite setup --suite main

That was just an eyeball check from a naive user -- it would be good to try
running everything documented here.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-26 Thread John Naylor
On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada 
wrote:
>
> True. I'm going to start with 6 bytes and will consider reducing it to
> 5 bytes.

Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As
discussed, I will attempt the size class decoupling after v8 and see how it
goes.

> Encoding the kind in a pointer tag could be tricky given DSA

If it turns out to be unworkable, that's life. If it's just tricky, that
can certainly be put off for future work. I hope to at least test it out
with local memory.

> support so currently I'm thinking to pack the node kind and node
> capacity classes to uint8.

That won't work, if we need 128 for capacity, leaving no bits left. I want
the capacity to be a number we can directly compare with the count (we
won't ever need to store 256 because that node will never grow). Also,
further to my last message, we need to access the kind quickly, without
more cycles.

> I've made some progress on investigating DSA support. I've written
> draft patch for that and regression tests passed. I'll share it as a
> separate patch for discussion with v8 radix tree patch.

Great!

> While implementing DSA support, I realized that we may not need to use
> pointer tagging to distinguish between backend-local address or
> dsa_pointer. In order to get a backend-local address from dsa_pointer,
> we need to pass dsa_area like:

I was not clear -- when I see how much code changes to accommodate DSA
pointers, I imagine I will pretty much know the places that would be
affected by tagging the pointer with the node kind.

Speaking of tests, there is currently no Meson support, but tests pass
because this library is not used anywhere in the backend yet, and
apparently the CI Meson builds don't know to run the regression test? That
will need to be done too. However, it's okay to keep the benchmarking
module in autoconf, since it won't be committed.

> > +static inline void
> > +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> > + uint8 *dst_chunks, rt_node **dst_children, int count)
> > +{
> > + memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> > + memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> > +}
> >
> > gcc generates better code with something like this (but not hard-coded)
at the top:
> >
> > if (count > 4)
> > pg_unreachable();

Actually it just now occurred to me there's a bigger issue here: *We* know
this code can only get here iff count==4, so why doesn't the compiler know
that? I believe it boils down to

static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {

In the assembly, I see it checks if there is room in the node by doing a
runtime lookup in this array, which is not constant. This might not be
important just yet, because I want to base the check on the proposed node
capacity instead, but I mention it as a reminder to us to make sure we take
all opportunities for the compiler to propagate constants.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-26 Thread John Naylor
remind
us.

+ /* there is no key to delete */
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+   return false;
+
+ /* Update the statistics */
+ tree->num_keys--;
+
+ /*
+  * Delete the key from the leaf node and recursively delete the key in
+  * inner nodes if necessary.
+  */
+ Assert(NODE_IS_LEAF(stack[level]));
+ while (level >= 0)
+ {
+   rt_node*node = stack[level--];
+
+   if (NODE_IS_LEAF(node))
+ rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+   else
+ rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+   /* If the node didn't become empty, we stop deleting the key */
+   if (!NODE_IS_EMPTY(node))
+ break;
+
+   /* The node became empty */
+   rt_free_node(tree, node);
+ }

Here we call rt_node_search_leaf() twice -- once to check for existence,
and once to delete. All three search calls are inlined, so this wastes
space. Let's try to delete the leaf, return if not found, otherwise handle
the leaf bookkeepping and loop over the inner nodes. This might require
some duplication of code.
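
Roughly like this (untested sketch, reusing the patch's helpers):

/* delete from the leaf directly; bail out if the key wasn't there */
if (!rt_node_search_leaf(stack[level], key, RT_ACTION_DELETE, NULL))
	return false;

tree->num_keys--;

/* if the leaf became empty, free it and unwind through the inner nodes */
while (level >= 0 && NODE_IS_EMPTY(stack[level]))
{
	rt_free_node(tree, stack[level]);
	if (--level >= 0)
		rt_node_search_inner(stack[level], key, RT_ACTION_DELETE, NULL);
}
return true;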

+ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)

Spelling

+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+ uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+ memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+ memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}

gcc generates better code with something like this (but not hard-coded) at
the top:

if (count > 4)
pg_unreachable();

This would have to change when we implement shrinking of nodes, but might
still be useful.

+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+   return false;
+
+ return true;

Maybe just "return rt_node_search_leaf(...)" ?

-- 
John Naylor
EDB: http://www.enterprisedb.com


Re: fixing typo in comment for restriction_is_or_clause

2022-10-25 Thread John Naylor
On Tue, Oct 25, 2022 at 9:48 AM Richard Guo  wrote:
>
>
> On Tue, Oct 25, 2022 at 10:05 AM John Naylor 
wrote:
>>
>> It's perfectly clear and simple now, even if it doesn't win at "code
golf".
>
>
> Agree with your point.  Do you think we can further make the one-line
> function a macro or an inline function in the .h file?  I think this
> function is called quite frequently during planning, so maybe doing that
> would bring a little bit of efficiency.

My point was misunderstood, which is: I don't think we need to do anything
at all here if the goal was purely about aesthetics.

If the goal has now changed to efficiency, I have no opinion about that yet
since no evidence has been presented.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: fixing typo in comment for restriction_is_or_clause

2022-10-24 Thread John Naylor
On Tue, Oct 25, 2022 at 12:19 AM Zhihong Yu  wrote:
>
> Hi,
> When I was looking at src/backend/optimizer/util/restrictinfo.c, I found
a typo in one of the comments.

Using "t" as an abbreviation for "true" was probably intentional, so not a
typo. There is no doubt what the behavior is.

> I also took the chance to simplify the code a little bit.

It's perfectly clear and simple now, even if it doesn't win at "code golf".

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Reducing the chunk header sizes on all memory context types

2022-10-24 Thread John Naylor
On Thu, Oct 20, 2022 at 1:55 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-10-11 10:21:17 +0700, John Naylor wrote:
> > On Tue, Oct 11, 2022 at 5:31 AM David Rowley 
wrote:
> > >
> > > The proposed patches in [1] do aim to make additional usages of the
> > > slab allocator, and I have a feeling that we'll want to fix the
> > > performance of slab.c before those. Perhaps the Asserts are a better
> > > option if we're to get the proposed radix tree implementation.
> >
> > Going by [1], that use case is not actually a natural fit for slab
because
> > of memory fragmentation.
> >
> > [1]
> >
https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
>
> Not so sure about that - IIRC I made one slab for each different size
class,
> which seemed to work well and suit slab well?

If that's the case, then great! The linked message didn't give me that
impression, but I won't worry about it.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: meson PGXS compatibility

2022-10-12 Thread John Naylor
On Wed, Oct 12, 2022 at 12:50 PM Peter Eisentraut <
peter.eisentr...@enterprisedb.com> wrote:
>
> On 05.10.22 22:07, Andres Freund wrote:
> > 0001: autoconf: Unify CFLAGS_SSE42 and CFLAGS_ARMV8_CRC32C
>
> I wonder, there has been some work lately to use SIMD and such in other
> places.  Would that affect what kinds of extra CPU-specific compiler
> flags and combinations we might need?

In short, no. The functionality added during this cycle only uses
instructions available by default on the platform. CRC on Arm depends on
armv8-a+crc, and CRC on x86-64 depends on SSE4.2. We can't assume those
currently.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: introduce optimized linear search functions that return index of matching element

2022-10-11 Thread John Naylor
On Sat, Sep 17, 2022 at 12:29 PM Nathan Bossart 
wrote:
>
> On Fri, Sep 16, 2022 at 02:54:14PM +0700, John Naylor wrote:
> > v6 demonstrates why this should have been put off towards the end.
(more below)
>
> Since the SIMD code is fresh in my mind, I wanted to offer my review for
> 0001 in the "Improve dead tuple storage for lazy vacuum" thread [0].
> However, I agree with John that the SIMD part of that work should be left
> for the end

As I mentioned in the radix tree thread, I don't believe this level of
abstraction is appropriate for the intended use case. We'll want to
incorporate some of the low-level simd.h improvements later, so you should
get authorship credit for those.  I've marked the entry "returned with
feedback".

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Reducing the chunk header sizes on all memory context types

2022-10-10 Thread John Naylor
On Tue, Oct 11, 2022 at 5:31 AM David Rowley  wrote:
>
> The proposed patches in [1] do aim to make additional usages of the
> slab allocator, and I have a feeling that we'll want to fix the
> performance of slab.c before those. Perhaps the Asserts are a better
> option if we're to get the proposed radix tree implementation.

Going by [1], that use case is not actually a natural fit for slab because
of memory fragmentation. The motivation to use slab there was that the
allocation sizes are just over a power of two, leading to a lot of wasted
space for aset. FWIW, I have proposed in that thread a scheme to squeeze
things into power-of-two sizes without wasting quite as much space. That's
not a done deal, of course, but it could work today without adding memory
management code.

[1]
https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Remove unnecessary commas for goto labels

2022-10-10 Thread John Naylor
On Sun, Oct 9, 2022 at 10:08 AM Julien Rouhaud  wrote:
>
> Hi,
>
> On Sun, Oct 09, 2022 at 07:42:58AM +0800, Japin Li wrote:
> >
> > Hi hackers,
> >
> > I find there are some unnecessary commas for goto lables,
> > attached a patch to remove them.
>
> You mean semi-colon?  +1, and a quick regex later I don't see any other
> occurrence.

Interestingly, I did find that in C a label must be followed by a statement,
even an empty one, otherwise it won't compile. That's not the case here,
though, so pushed.
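
For the curious, the rule in a nutshell:

static void
release(int code)
{
	if (code == 0)
		goto done;
	/* cleanup work here */
done:
	;					/* a label needs a following statement; empty is fine */
}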

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-09 Thread John Naylor
On Mon, Oct 10, 2022 at 12:16 PM John Naylor 
wrote:
> Thanks for that! Now I can show clear results on some aspects in a simple
way. The attached patches (apply on top of v6)

Forgot the patchset...

--
John Naylor
EDB: http://www.enterprisedb.com


radix-v6-addendum-jcn1.tar.gz
Description: application/gzip


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-09 Thread John Naylor
of, we can discuss further. I
don't pretend to know the practical consequences of every change I mention.

- If you have started coding the shared memory case, I'd advise to continue
so we can see what that looks like. If that has not gotten beyond the
design stage, I'd like to first see an attempt at tearing down some of the
clumsier abstractions in the current patch.
- As a "smoke test", there should ideally be nothing as general as
rt_node_get_children/values(). We should ideally already know what kind of
node we have at that point, since we found that out earlier.
- For distinguishing between linear nodes, perhaps some always-inline
functions can help hide details. But at the same time, trying to treat them
the same is not always worthwhile.
- Start to separate treatment of inner/leaves and see how it goes.
- I firmly believe we only need 4 node *kinds*, and later we can decouple
the size classes as a separate concept. I'm willing to put serious time
into that once the broad details are right. I will also investigate pointer
tagging if we can confirm that can work similarly for dsa pointers.

Regarding size class decoupling, I'll respond to a point made earlier:

On Fri, Sep 30, 2022 at 10:47 PM Masahiko Sawada 
wrote:
> With this idea, we can just repalloc() to grow to the larger size in a
> pair but I'm slightly concerned that the more size class we use, the
> more frequent the node needs to grow.

Well, yes, but that's orthogonal. For example, v6 has 5 node kinds. Imagine
that we have 4 node kinds, but the SIMD node kind used 2 size classes. Then
the nodes would grow at *exactly* the same frequency as they do today. I
listed many ways a size class could fit into a power-of-two (and there are
more), but we have a choice in how many to actually use. It's a trade off
between memory usage and complexity.

> If we want to support node
> shrink, the deletion is also affected.

Not necessarily. We don't have to shrink at the same granularity as
growing. My evidence is simple: we don't shrink at all now. :-)

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-06 Thread John Naylor
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada 
wrote:
> In addition to two patches, I've attached the third patch. It's not
> part of radix tree implementation but introduces a contrib module
> bench_radix_tree, a tool for radix tree performance benchmarking. It
> measures loading and lookup performance of both the radix tree and a
> flat array.

Hi Masahiko, I've been using these benchmarks, along with my own
variations, to try various things that I've mentioned. I'm long overdue for
an update, but the picture is not yet complete.

For now, I have two questions that I can't figure out on my own:

1. There seems to be some non-obvious limit on the number of keys that are
loaded (or at least what the numbers report). This is independent of the
number of tids per block. Example below:

john=# select * from bench_shuffle_search(0, 8*1000*1000);
NOTICE:  num_keys = 800, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 25, n256 = 981
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
     800 |        268435456 |                4800 |        661 |            29 |          276 |             389

john=# select * from bench_shuffle_search(0, 9*1000*1000);
NOTICE:  num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 262144, n256 = 1028
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
 8388608 |        276824064 |                5400 |        718 |            33 |          311 |             446

The array is the right size, but nkeys hasn't kept pace. Can you reproduce
this? Attached is the patch I'm using to show the stats when running the
test. (Side note: The numbers look unfavorable for radix tree because I'm
using 1 tid per block here.)

2. I found that bench_shuffle_search() is much *faster* for traditional
binary search on an array than bench_seq_search(). I've found this to be
true in every case. This seems counterintuitive to me -- any idea why this
is? Example:

john=# select * from bench_seq_search(0, 100);
NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
     100 |         10199040 |               18000 |        168 |           106 |          827 |            3348

john=# select * from bench_shuffle_search(0, 100);
NOTICE:  num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122
  nkeys  | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms
---------+------------------+---------------------+------------+---------------+--------------+-----------------
     100 |         10199040 |               18000 |        171 |           107 |          827 |            1400

--
John Naylor
EDB: http://www.enterprisedb.com
From 43a50a385930ee340d0a3b003910c704a0ff342c Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Thu, 6 Oct 2022 09:07:41 +0700
Subject: [PATCH v65 1/5] Turn on per-node counts in benchmark

Also add gitigore, fix whitespace, and change to NOTICE
---
 contrib/bench_radix_tree/.gitignore | 3 +++
 contrib/bench_radix_tree/bench_radix_tree.c | 5 +
 src/backend/lib/radixtree.c | 2 +-
 src/include/lib/radixtree.h | 2 +-
 4 files changed, 10 insertions(+), 2 deletions(-)
 create mode 100644 contrib/bench_radix_tree/.gitignore

diff --git a/contrib/bench_radix_tree/.gitignore b/contrib/bench_radix_tree/.gitignore
new file mode 100644
index 00..8830f5460d
--- /dev/null
+++ b/contrib/bench_radix_tree/.gitignore
@@ -0,0 +1,3 @@
+*data
+log/*
+results/*
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 5806ef7519..36c5218ae7 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -13,6 +13,7 @@
 #include "fmgr.h"
 #include "funcapi.h"
 #include "lib/radixtree.h"
+#include 
 #include "miscadmin.h"
 #include "utils/timestamp.h"
 
@@ -183,6 +184,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle)
 	TimestampDifference(start_time, end_time, , );
 	rt_load_ms = secs * 1000 + usecs / 1000;
 
+	rt_stats(rt);
+
 	/* measure the load time of the array */
 	itemptrs = MemoryContextAllocHuge(CurrentMemoryContext,
 	  sizeof(ItemPointerData) * ntids);
@@ -292,6 +295,8 @@ bench_load_random_int(PG_FUNCTION_ARGS)
 	TimestampDifference(start_time, end_time, , );
 	load_time_ms = secs * 1000 + usecs / 1000;
 
+	

Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-06 Thread John Naylor
On Thu, Oct 6, 2022 at 2:53 PM Masahiko Sawada 
wrote:
>
> On Wed, Oct 5, 2022 at 6:40 PM John Naylor 
wrote:
> >
> > This wasn't the focus of your current email, but while experimenting
with v6 I had another thought about local allocation: If we use the default
slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit,
right? If so, since aset and DSA also waste at least a few hundred bytes,
we could store a useless 256-byte slot array within node256. That way,
node128 and node256 share the same start of pointers/values array, so there
would be one less branch for getting that address. In v6,
rt_node_get_values and rt_node_get_children are not inlined (aside: gcc uses
a jump table for 5 kinds but not for 4), but possibly should be, and the
smaller the better.
>
> It would be good for performance but I'm a bit concerned that it's
> highly optimized to the design of aset and DSA. Since size 2088 will
> be currently classed as 2616 in DSA, DSA wastes 528 bytes. However, if
> we introduce a new class of 2304 (=2048 + 256) bytes we cannot store a
> useless 256-byte and the assumption will be broken.

A new DSA class is hypothetical. A better argument against my idea is that
SLAB_DEFAULT_BLOCK_SIZE is arbitrary. FWIW, I looked at the prototype just
now and the slab block sizes are:

Max(pg_nextpower2_32((MAXALIGN(inner_class_info[i].size) + 16) * 32), 1024)

...which would be 128kB for nodemax: (2088 + 16) * 32 = 67328 bytes, which
pg_nextpower2_32 rounds up to 131072. I'm curious about the difference.

> > One concern is, handling both local and dsa cases in the same code
requires more (predictable) branches and reduces code density. That might
be a reason in favor of templating to handle each case in its own
translation unit.
>
> Right. We also need to support locking for shared radix tree, which
> would require more branches.

Hmm, now it seems we'll likely want to template local vs. shared as a later
step...

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-10-05 Thread John Naylor
On Wed, Oct 5, 2022 at 1:46 PM Masahiko Sawada 
wrote:
>
> On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada 
wrote:
> >
> > On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> >  wrote:
> > Yeah, node31 and node256 are bloated.  We probably could use slab for
> > node256 independently. It's worth trying a benchmark to see how it
> > affects the performance and the tree size.

This wasn't the focus of your current email, but while experimenting with
v6 I had another thought about local allocation: If we use the default slab
block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right?
If so, since aset and DSA also waste at least a few hundred bytes, we could
store a useless 256-byte slot array within node256. That way, node128 and
node256 share the same start of pointers/values array, so there would be
one less branch for getting that address. In v6, rt_node_get_values and
rt_node_get_children are not inlined (aside: gcc uses a jump table for 5
kinds but not for 4), but possibly should be, and the smaller the better.
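
To sketch the layout I'm imagining (schematic only -- field names and
control-data sizes here are invented, not taken from the patch):

typedef struct rt_node_128_256
{
	rt_node		base;			/* common node header */
	uint8		slot_index[256];	/* used by node128, dead weight in node256 */
	uint8		isset[256 / 8];		/* node128 would use only half the bits */
	rt_node	   *children[FLEXIBLE_ARRAY_MEMBER];	/* 128 or 256 entries */
} rt_node_128_256;

/* the children array starts at the same offset for both kinds */
#define RT_NODE_CHILDREN(n)	(((rt_node_128_256 *) (n))->children)

Note this also pads node128's isset array up to node256's size, which is
more of the wasted space I'm hand-waving over here.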

> Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
> to point to its child nodes, instead of C pointers (ig, backend-local
> address). I'm thinking of a straightforward approach as the first
> step; inner nodes have a union of rt_node* and dsa_pointer and we
> choose either one based on whether the radix tree is shared or not. We
> allocate and free the shared memory for individual nodes by
> dsa_allocate() and dsa_free(), respectively. Therefore we need to get
> a C pointer from dsa_pointer by using dsa_get_address() while
> descending the tree. I'm a bit concerned that calling
> dsa_get_address() for every descent could be performance overhead but
> I'm going to measure it anyway.

Are dsa pointers aligned the same as pointers to locally allocated memory?
Meaning, is the offset portion always a multiple of 4 (or 8)? It seems that
way from a glance, but I can't say for sure. If the lower 2 bits of a DSA
pointer are never set, we can tag them the same way as a regular pointer.
That same technique could help hide the latency of converting the pointer,
by the same way it would hide the latency of loading parts of a node into
CPU registers.
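
For concreteness, the tagging I have in mind is something like this (a
minimal sketch with invented names; it assumes allocations are at least
4-byte aligned, so the low two bits of a pointer are always zero):

#define RT_PTR_TAG_MASK		((uintptr_t) 0x3)

static inline uintptr_t
rt_tag_ptr(void *p, int tag)
{
	Assert(((uintptr_t) p & RT_PTR_TAG_MASK) == 0);	/* alignment assumption */
	return (uintptr_t) p | (uintptr_t) tag;
}

static inline int
rt_ptr_tag(uintptr_t tagged)
{
	return (int) (tagged & RT_PTR_TAG_MASK);
}

static inline void *
rt_ptr_addr(uintptr_t tagged)
{
	return (void *) (tagged & ~RT_PTR_TAG_MASK);
}

The same masking would work on the offset part of a dsa_pointer, if the
alignment question above has a favorable answer.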

One concern is, handling both local and dsa cases in the same code requires
more (predictable) branches and reduces code density. That might be a
reason in favor of templating to handle each case in its own translation
unit. But that might be overkill.
--
John Naylor
EDB: http://www.enterprisedb.com


Re: [RFC] building postgres with meson - v13

2022-09-29 Thread John Naylor
On Tue, Sep 27, 2022 at 2:41 PM John Naylor 
wrote:
>
> On Tue, Sep 27, 2022 at 2:06 AM Andres Freund  wrote:
> >
> > On 2022-09-26 15:18:29 +0700, John Naylor wrote:

> > Yea, it's /usr/local on x86-64, based on what was required to make
macos CI
> > work. I updated the wiki page, half-blindly - it'd be nice if you could
> > confirm that that works?
>
> Not sure if you intended for me to try the full script in your last
response or just what's in the wiki page, but for the latter (on commit
bed0927aeb0c6), it fails at
>
> [1656/2199] Linking target src/bin/psql/psql
> FAILED: src/bin/psql/psql

Per off-list discussion with Andres, the linking failure was caused by some
env variables set in my bash profile for the sake of Homebrew. After
removing those, the recipe in the wiki worked fine.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-28 Thread John Naylor
On Wed, Sep 28, 2022 at 1:18 PM John Naylor 
wrote:
> [stuff about size classes]

I kind of buried the lede here on one thing: If we only have 4 kinds
regardless of the number of size classes, we can use 2 bits of the pointer
for dispatch, which would only require 4-byte alignment. That should make
that technique more portable.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-28 Thread John Naylor
On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada 
wrote:

> BTW We need to consider not only aset/slab but also DSA since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
> [...]

Thanks for that info -- I wasn't familiar with the details of DSA. For the
non-parallel case, I plan to at least benchmark using aset because I gather
it's the most heavily optimized. I'm thinking that will allow other problem
areas to be more prominent. I'll also want to compare total context size
compared to slab to see if possibly less fragmentation makes up for other
wastage.

Along those lines, one thing I've been thinking about is the number of size
classes. There is a tradeoff between memory efficiency and number of
branches when searching/inserting. My current thinking is there is too much
coupling between size class and data type. Each size class currently uses a
different data type and a different algorithm to search and set it, which
in turn requires another branch. We've found that a larger number of size
classes leads to poor branch prediction [1] and (I imagine) code density.

I'm thinking we can use "flexible array members" for the values/pointers,
and keep the rest of the control data in the struct the same. That way, we
never have more than 4 actual "kinds" to code and branch on. As a bonus,
when migrating a node to a larger size class of the same kind, we can
simply repalloc() to the next size. To show what I mean, consider this new
table:

node2:   5 +  6   +(5)+  2*8 =   32 bytes
node6:   5 +  6   +(5)+  6*8 =   64

node12:  5 + 27   + 12*8 =  128
node27:  5 + 27   + 27*8 =  248(->256)

node91:  5 + 256 + 28 +(7)+ 91*8 = 1024
node219: 5 + 256 + 28 +(7)+219*8 = 2048

node256: 5 + 32   +(3)+256*8 = 2088(->4096)

Seven size classes are grouped into the four kinds.

The common base at the front is here 5 bytes because there is a new uint8
field for "capacity", which we can ignore for node256 since we assume we
can always insert/update that node. The control data is the same in each
pair, and so the offset to the pointer/value array is the same. Thus,
migration would look something like:

case FOO_KIND:
if (unlikely(count == capacity))
{
  if (capacity == XYZ) /* for smaller size class of the pair */
  {
    <repalloc to the larger size class>;
    capacity = next-higher-capacity;
    goto do_insert;
  }
  else
    <migrate to the next node kind>;
}
else
{
do_insert:
  <...>;
  break;
}
/* FALLTHROUGH */
...
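
To make the layout concrete, the node12/node27 pair might look like this
(illustrative only -- the 5-byte header and 27-byte chunk array follow the
table above, and padding is glossed over):

typedef struct rt_node_base
{
	uint16		kind_count;		/* kind and count bitfield */
	uint8		capacity;
	uint8		shift;
	uint8		chunk;
} rt_node_base;

typedef struct rt_node_sorted
{
	rt_node_base base;
	uint8		chunks[27];		/* sized for the larger class of the pair */
	void	   *slots[FLEXIBLE_ARRAY_MEMBER];	/* 12 or 27 allocated */
} rt_node_sorted;

/* allocate the smaller class; growing is the same computation with 27 */
rt_node_sorted *node = palloc0(offsetof(rt_node_sorted, slots) +
							   12 * sizeof(void *));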

One disadvantage is that this wastes some space by reserving the full set
of control data in the smaller size class of the pair, but it's usually
small compared to array size. Somewhat unrelated, we could still implement
Andres' idea [1] to dispense with the isset array in inner nodes of the
indirect array type (now node128), since we can just test if the pointer is
null.

[1]
https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Insertion Sort Improvements

2022-09-27 Thread John Naylor
On Tue, Sep 27, 2022 at 4:39 PM Benjamin Coutu  wrote:
>
> Getting back to improvements for small sort runs, it might make sense to
consider using in-register based sorting via sorting networks for very
small runs.

> It looks like some of the commercial DBMSs do this very successfully.

"Looks like"?

> They use 4 512bit registers (AVX-512) in this example, which could
technically store 4 * 4 sort-elements (given that the sorting key is 64 bit
and the tuple pointer is 64 bit). I wonder whether this could also be done
with just SSE (instead of AVX), which the project now readily supports
thanks to your recent efforts?

SortTuples are currently 24 bytes, and supported vector registers are 16
bytes, so not sure how you think that would work.

More broadly, the more invasive a change is, the more risk is involved, and
the more effort to test and evaluate. If you're serious about trying to
improve insertion sort performance, the simple idea we discussed at the
start of the thread is a much more modest step that has a good chance of
justifying the time put into it. That is not to say it's easy, however,
because testing is a non-trivial amount of work.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [RFC] building postgres with meson - v13

2022-09-27 Thread John Naylor
On Tue, Sep 27, 2022 at 2:06 AM Andres Freund  wrote:
>
> On 2022-09-26 15:18:29 +0700, John Naylor wrote:
> > Either way it doesn't exist on this machine. I was able to get a working
> > build with
> >
> > /usr/local/Homebrew/Library/Homebrew/os/mac/pkgconfig
>
> Hm - what did you need this path for - I don't think that should be
needed.

I just cargo-culted the pattern from Arm (before I figured out it was Arm)
and used the "find" command to look for the directories by name. I tried
again without specifying any of the three directory flags, and I can run
the tests getting:

Ok: 233
Expected Fail:  0
Fail:   0
Unexpected Pass:0
Skipped:2
Timeout:0

...which is fine for me since I don't do much development on MacOS nowadays.

> > 1) /opt/homebrew/ seems to be an "Apple silicon" path?
>
> Yea, it's /usr/local on x86-64, based on what was required to make macos
CI
> work. I updated the wiki page, half-blindly - it'd be nice if you could
> confirm that that works?

Not sure if you intended for me to try the full script in your last
response or just what's in the wiki page, but for the latter (on commit
bed0927aeb0c6), it fails at

[1656/2199] Linking target src/bin/psql/psql
FAILED: src/bin/psql/psql
clang  -o src/bin/psql/psql
src/bin/psql/psql.p/meson-generated_.._psqlscanslash.c.o
src/bin/psql/psql.p/meson-generated_.._sql_help.c.o
src/bin/psql/psql.p/command.c.o src/bin/psql/psql.p/common.c.o
src/bin/psql/psql.p/copy.c.o src/bin/psql/psql.p/crosstabview.c.o
src/bin/psql/psql.p/describe.c.o src/bin/psql/psql.p/help.c.o
src/bin/psql/psql.p/input.c.o src/bin/psql/psql.p/large_obj.c.o
src/bin/psql/psql.p/mainloop.c.o src/bin/psql/psql.p/prompt.c.o
src/bin/psql/psql.p/startup.c.o src/bin/psql/psql.p/stringutils.c.o
src/bin/psql/psql.p/tab-complete.c.o src/bin/psql/psql.p/variables.c.o
-L/usr/local/opt/readline/lib -L/usr/local/opt/gettext/lib
-L/usr/local/opt/zlib/lib -L/usr/local/opt/openssl/lib
-I/usr/local/opt/readline/include -I/usr/local/opt/gettext/include
-I/usr/local/opt/zlib/include -I/usr/local/opt/openssl/include
-Wl,-dead_strip_dylibs -Wl,-headerpad_max_install_names
-Wl,-undefined,error -isysroot
/Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk
-Wl,-rpath,@loader_path/../../interfaces/libpq -Wl,-rpath,/usr/local/lib
-Wl,-rpath,/usr/local/Cellar/zstd/1.5.2/lib src/fe_utils/libpgfeutils.a
src/common/libpgcommon.a src/common/libpgcommon_ryu.a
src/common/libpgcommon_config_info.a src/port/libpgport.a
src/port/libpgport_crc.a src/interfaces/libpq/libpq.5.dylib -lm
/usr/local/lib/libintl.dylib -ledit -lz
/usr/local/Cellar/zstd/1.5.2/lib/libzstd.dylib -lz -lz -lz
Undefined symbols for architecture x86_64:
  "_rl_completion_suppress_quote", referenced from:
  _psql_completion in tab-complete.c.o
  _quote_file_name in tab-complete.c.o
  _complete_from_files in tab-complete.c.o
  "_rl_filename_dequoting_function", referenced from:
  _initialize_readline in tab-complete.c.o
  "_rl_filename_quote_characters", referenced from:
  _initialize_readline in tab-complete.c.o
  "_rl_filename_quoting_function", referenced from:
  _initialize_readline in tab-complete.c.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see
invocation)

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [RFC] building postgres with meson - v13

2022-09-26 Thread John Naylor
On Wed, Sep 21, 2022 at 7:11 AM Andres Freund  wrote:
>
> I added installation instructions for meson for a bunch of platforms, but

A couple more things for the wiki:

1) /opt/homebrew/ seems to be an "Apple silicon" path? Either way it
doesn't exist on this machine. I was able to get a working build with

/usr/local/Homebrew/Library/Homebrew/os/mac/pkgconfig

(My homebrew install doesn't seem to have anything relevant for
extra_include_dirs or extra_lib_dirs.)

2) Also, "ninja -v install" has the same line count as "ninja install" --
are there versions that do something different?

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-22 Thread John Naylor
On Thu, Sep 22, 2022 at 11:46 AM John Naylor 
wrote:
> One thing I want to try soon is storing fewer than 16/32 etc entries, so
that the whole node fits comfortably inside a power-of-two allocation. That
would allow us to use aset without wasting space for the smaller nodes,
which would be faster and possibly would solve the fragmentation problem
Andres referred to in

>
https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de

While calculating node sizes that fit within a power-of-two size, I noticed
the current base node is a bit wasteful, taking up 8 bytes. The node kind
only has a small number of values, so it doesn't really make sense to use
an enum here in the struct (in fact, Andres' prototype used a uint8 for
node_kind). We could use a bitfield for the count and kind:

uint16 -- kind and count bitfield
uint8 shift;
uint8 chunk;

That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag,
the bitfield can just go back to being count only.
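
In code, that could be as simple as (field widths hypothetical -- 3 bits
covers the five current kinds, and 13 bits is plenty for a max count of 256):

typedef struct rt_node_base
{
	uint16		kind:3,
				count:13;
	uint8		shift;
	uint8		chunk;
} rt_node_base;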

Here are the v6 node kinds:

node4:   8 +   4 +(4)+   4*8 =   48 bytes
node16:  8 +  16 +  16*8 =  152
node32:  8 +  32 +  32*8 =  296
node128: 8 + 256 + 128/8 + 128*8 = 1304
node256: 8   + 256/8 + 256*8 = 2088

And here are the possible ways we could optimize nodes for space using aset
allocation. Parentheses are padding bytes. Even if my math has mistakes,
the numbers shouldn't be too far off:

node3:   4 +   3 +(1)+   3*8 =   32 bytes
node6:   4 +   6 +(6)+   6*8 =   64
node13:  4 +  13 +(7)+  13*8 =  128
node28:  4 +  28 +  28*8 =  256
node31:  4 + 256 +  32/8 +  31*8 =  512 (XXX not good)
node94:  4 + 256 +  96/8 +  94*8 = 1024
node220: 4 + 256 + 224/8 + 220*8 = 2048
node256: = 4096

The main disadvantage is that node256 would balloon in size.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-22 Thread John Naylor
On Thu, Sep 22, 2022 at 7:52 PM John Naylor 
wrote:
>
>
> On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada 
wrote:
> > Good point. While keeping the chunks in the small nodes in sorted
> > order is useful for visiting all keys in sorted order, additional
> > branches and memmove calls could be slow.
>
> Right, the ordering is a property that some users will need, so best to
keep it. Although the node128 doesn't have that property -- too slow to do
so, I think.

Nevermind, I must have been mixing up keys and values there...

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-22 Thread John Naylor
On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada 
wrote:
>
> On Thu, Sep 22, 2022 at 1:46 PM John Naylor
>  wrote:
> > While on the subject, I wonder how important it is to keep the chunks
in the small nodes in sorted order. That adds branches and memmove calls,
and is the whole reason for the recent "pg_lfind_ge" function.
>
> Good point. While keeping the chunks in the small nodes in sorted
> order is useful for visiting all keys in sorted order, additional
> branches and memmove calls could be slow.

Right, the ordering is a property that some users will need, so best to
keep it. Although the node128 doesn't have that property -- too slow to do
so, I think.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-21 Thread John Naylor
On Thu, Sep 22, 2022 at 1:01 AM Nathan Bossart 
wrote:
>
> On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote:
>
> > In short, this code needs to be lower level so that we still have full
> > control while being portable. I will work on this, and also the related
> > code for node dispatch.
>
> Is it possible to use approach #2 here, too?  AFAICT space is allocated
for
> all of the chunks, so there wouldn't be any danger in searching all them
> and discarding any results >= node->count.

Sure, the caller could pass the maximum node capacity, and then check if
the returned index is within the range of the node count.

> Granted, we're depending on the
> number of chunks always being a multiple of elements-per-vector in order
to
> avoid the tail path, but that seems like a reasonably safe assumption that
> can be covered with comments.

Actually, we don't need to depend on that at all. When I said "junk" above,
that can be any bytes, as long as we're not reading off the end of
allocated memory. We'll never do that here, since the child pointers/values
follow. In that case, the caller can hard-code the  size (it would even
happen to work now to multiply rt_node_kind by 16, to be sneaky). One thing
I want to try soon is storing fewer than 16/32 etc entries, so that the
whole node fits comfortably inside a power-of-two allocation. That would
allow us to use aset without wasting space for the smaller nodes, which
would be faster and possibly would solve the fragmentation problem Andres
referred to in

https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
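
To sketch what that looks like for node16 (SSE2 path only, function name
invented; pg_rightmost_one_pos32() is from pg_bitutils.h, and the
intrinsics are the ones simd.h already pulls in):

static inline int
node16_search_eq(const uint8 *chunks, int count, uint8 key)
{
	__m128i		spread = _mm_set1_epi8((char) key);
	__m128i		haystack = _mm_loadu_si128((const __m128i *) chunks);
	uint32		bitfield = (uint32) _mm_movemask_epi8(_mm_cmpeq_epi8(spread, haystack));
	int			index;

	if (bitfield == 0)
		return -1;

	/* lanes past 'count' hold junk, so a hit there is a false positive */
	index = pg_rightmost_one_pos32(bitfield);
	return index < count ? index : -1;
}

The load is always 16 bytes, which is safe because the chunk array is
always allocated at full size, per the above.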

While on the subject, I wonder how important it is to keep the chunks in
the small nodes in sorted order. That adds branches and memmove calls, and
is the whole reason for the recent "pg_lfind_ge" function.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-21 Thread John Naylor
On Tue, Sep 20, 2022 at 3:19 PM Masahiko Sawada 
wrote:
>
> On Fri, Sep 16, 2022 at 4:54 PM John Naylor
>  wrote:

> > Here again, I'd rather put this off and focus on getting the "large
> > details" in good enough shape so we can got towards integrating with
> > vacuum.
>
> Thank you for the comments! These above comments are addressed by
> Nathan in a newly derived thread. I'll work on the patch.

I still seem to be out-voted on when to tackle this particular
optimization, so I've extended the v6 benchmark code with a hackish
function that populates a fixed number of keys, but with different fanouts.
(diff attached as a text file)

I didn't take particular care to make this scientific, but the following
seems pretty reproducible. Note what happens to load and search performance
when node16 has 15 entries versus 16:

 fanout | nkeys  | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
     15 | 327680 |          3776512 |         39 |           20
(1 row)
num_keys = 327680, height = 4, n4 = 1, n16 = 23408, n32 = 0, n128 = 0, n256 = 0

 fanout | nkeys  | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
     16 | 327680 |          3514368 |         25 |           11
(1 row)
num_keys = 327680, height = 4, n4 = 0, n16 = 21846, n32 = 0, n128 = 0, n256 = 0

In trying to wrap the SIMD code behind layers of abstraction, the latest
patch (and Nathan's cleanup) threw it away in almost all cases. To explain,
we need to talk about how vectorized code deals with the "tail" that is too
small for the register:

1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
2. Read some junk into the register and mask off false positives from the
result.

There are advantages to both depending on the situation.
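
For reference, #1 is just the obvious loop, as in the pg_lfind* fallbacks
(the function name here is made up):

static inline int
node16_search_eq_scalar(const uint8 *chunks, int count, uint8 key)
{
	/* inspects only entries that exist, so no junk to mask off */
	for (int i = 0; i < count; i++)
	{
		if (chunks[i] == key)
			return i;
	}
	return -1;
}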

Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15
elements or less, it will iterate over them one-by-one exactly like a
node4. Only when full with 16 will the vector path be taken. When another
entry is added, the elements are copied to the next bigger node, so there's
a *small* window where it's fast.

In short, this code needs to be lower level so that we still have full
control while being portable. I will work on this, and also the related
code for node dispatch.

Since v6 has some good infrastructure to do low-level benchmarking, I also
want to do some experiments with memory management.

(I have further comments about the code, but I will put that off until
later)

> I'll consider how to integrate with vacuum as the next step. One
> concern for me is how to limit the memory usage to
> maintenance_work_mem. Unlike using a flat array, memory space for
> adding one TID varies depending on the situation. If we want strictly
> not to allow using memory more than maintenance_work_mem, probably we
> need to estimate the memory consumption in a conservative way.

+1

--
John Naylor
EDB: http://www.enterprisedb.com
commit 18407962e96ccec6c9aeeba97412edd762a5a4fe
Author: John Naylor 
Date:   Wed Sep 21 11:44:43 2022 +0700

Add special benchmark function to test effect of fanout

diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
index b8f70e12d1..952bb0ceae 100644
--- a/contrib/bench_radix_tree/Makefile
+++ b/contrib/bench_radix_tree/Makefile
@@ -7,7 +7,7 @@ OBJS = \
 EXTENSION = bench_radix_tree
 DATA = bench_radix_tree--1.0.sql
 
-REGRESS = bench
+REGRESS = bench_fixed_height
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 6663abe6a4..f2fee15b17 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -40,3 +40,15 @@ OUT load_ms int8)
 returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 5806ef7519..0778da2d7b 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -13,6 +13,7 @@
 #include "fmgr.h"
 #include "funcapi.h"
 #include "lib/radixtree.h"
+#include 
 #include "miscadmin.h"
 #include "utils/timestamp.h"
 
@@ -24,6 +25,7 @@ PG_MODULE_MAGIC;
 PG_FUNCTION_INFO_V1(bench_seq_search);
 PG_FUNCTION_INFO_V1(bench_shuffle_search);
 PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);
 
 static radix_tree *rt = NULL;
 static ItemPointer itemptrs = NULL;

Re: [RFC] building postgres with meson - v13

2022-09-20 Thread John Naylor
On Wed, Sep 21, 2022 at 7:11 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-09-19 19:16:30 -0700, Andres Freund wrote:
> > To have some initial "translation" for other developers I've started a
wiki
> > page with a translation table. Still very WIP:
> > https://wiki.postgresql.org/wiki/Meson
> >
> > For now, a bit of polishing aside, I'm just planning to add a minimal
> > explanation of what's happening, and a reference to this thread.
>
> I added installation instructions for meson for a bunch of platforms, but

Small typo: The homebrew section is still labeled with "find MacPorts
libraries".

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Support for Rust

2022-09-19 Thread John Naylor
On Mon, Sep 12, 2022 at 10:29 PM Tom Lane  wrote:
>
> Lev Kokotov  writes:
> > I took a small part of Postgres to get started, so just as a PoC; it
> > compiles and runs though. Larger parts will take more work (deleting
code,
> > not just swapping object files), and more fancy things like PG_TRY() and
> > friends will have to be rewritten, so not a short and easy migration.
>
> Yeah, that's what I thought.  "Allow some parts to be written in
> language X" soon turns into "Rewrite the entire system in language X,
> including fundamental rethinking of memory management, error handling,
> and some other things".  That's pretty much a non-starter.

Added "Rewrite the code in a different language" to "Features We Do Not
Want" section of Wiki, referencing the two threads that came up:

https://wiki.postgresql.org/wiki/Todo#Features_We_Do_Not_Want

-- 
John Naylor
EDB: http://www.enterprisedb.com


Re: Typo in xact.c

2022-09-18 Thread John Naylor
On Fri, Sep 16, 2022 at 10:51 AM John Naylor
 wrote:
>
> On Fri, Sep 16, 2022 at 10:11 AM Kyotaro Horiguchi
>  wrote:
> >
> > The patch seems to me covering all occurrences of PG_PROC as PGPROC.
>
> +1 since this hinders grep-ability.

Pushed this.

> > I found several uses of PG_PROC as (pg_catalog.)pg_proc, which is
> > quite confusing, too..
>
> It's pretty obvious to me what that refers to in primnodes.h, although
> the capitalization of (some, but not all) catalog names in comments in
> that file is a bit strange. Maybe not worth changing there.

I left this alone. It's not wrong, and I don't think it's confusing in context.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PoC] Improve dead tuple storage for lazy vacuum

2022-09-16 Thread John Naylor
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada  wrote:
>
> On Mon, Aug 15, 2022 at 10:39 PM John Naylor
>  wrote:
> >
> > bool, both = and <=. Should be pretty close. Also, I believe if you
> > left this for last as a possible refactoring, it might save some work.

v6 demonstrates why this should have been put off towards the end. (more below)

> > In any case, I'll take a look at the latest patch next month.

Since the CF entry said "Needs Review", I began looking at v5 again
this week. Hopefully not too much has changed, but in the future I
strongly recommend setting to "Waiting on Author" if a new version is
forthcoming. I realize many here share updated patches at any time,
but I'd like to discourage the practice especially for large patches.

> I've updated the radix tree patch. It's now separated into two patches.
>
> 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
> better names) that are similar to the pg_lfind8() family but they
> return the index of the key in the vector instead of true/false. The
> patch includes regression tests.

I don't want to do a full review of this just yet, but I'll just point
out some problems from a quick glance.

+/*
+ * Return the index of the first element in the vector that is greater than
+ * or equal to the given scalar. Return sizeof(Vector8) if there is no such
+ * element.

That's a bizarre API to indicate non-existence.

+ *
+ * Note that this function assumes the elements in the vector are sorted.
+ */

That is *completely* unacceptable for a general-purpose function.

+#else /* USE_NO_SIMD */
+ Vector8 r = 0;
+ uint8 *rp = (uint8 *) &r;
+
+ for (Size i = 0; i < sizeof(Vector8); i++)
+ rp[i] = (((const uint8 *) &v1)[i] == ((const uint8 *) &v2)[i]) ? 0xFF : 0;

I don't think we should try to force the non-simd case to adopt the
special semantics of vector comparisons. It's much easier to just use
the same logic as the assert builds.

+#ifdef USE_SSE2
+ return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+ static const uint8 mask[16] = {
+1 << 0, 1 << 1, 1 << 2, 1 << 3,
+1 << 4, 1 << 5, 1 << 6, 1 << 7,
+1 << 0, 1 << 1, 1 << 2, 1 << 3,
+1 << 4, 1 << 5, 1 << 6, 1 << 7,
+  };
+
+ uint8x16_t masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+ uint8x16_t maskedhi = vextq_u8(masked, masked, 8);
+
+ return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));

For Arm, we need to be careful here. This article goes into a lot of
detail for this situation:

https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon

Here again, I'd rather put this off and focus on getting the "large
details" in good enough shape so we can got towards integrating with
vacuum.

> In addition to two patches, I've attached the third patch. It's not
> part of radix tree implementation but introduces a contrib module
> bench_radix_tree, a tool for radix tree performance benchmarking. It
> measures loading and lookup performance of both the radix tree and a
> flat array.

Excellent! This was high on my wish list.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Typo in xact.c

2022-09-15 Thread John Naylor
On Fri, Sep 16, 2022 at 10:11 AM Kyotaro Horiguchi
 wrote:
>
> At Thu, 15 Sep 2022 22:38:01 +0800, Japin Li  wrote in
> >
> > Hi hacker,
> >
> > Recently, I find there might be a typo in xact.c comments.  The comments
> > say "PG_PROC", however, it actually means "PGPROC" structure.  Since we
> > have pg_proc catalog, and use PG_PROC to reference the catalog [1], so,
> > we should use PGPROC to reference the structure.  Any thoughts?
> >
> > [1] src/include/nodes/primnodes.h
>
> The patch seems to me covering all occurrences of PG_PROC as PGPROC.

+1 since this hinders grep-ability.

> I found several uses of PG_PROC as (pg_catalog.)pg_proc, which is
> quite confusing, too..

It's pretty obvious to me what that refers to in primnodes.h, although
the capitalization of (some, but not all) catalog names in comments in
that file is a bit strange. Maybe not worth changing there.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Inconsistencies in error messages

2022-09-15 Thread John Naylor
On Wed, Sep 14, 2022 at 5:25 PM John Naylor
 wrote:
> Will commit this way unless there are objections.

I forgot to mention yesterday, but this is done.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Cleaning up historical portability baggage

2022-09-15 Thread John Naylor
On Thu, Sep 15, 2022 at 3:11 PM Ibrar Ahmed  wrote:
> The patch does not apply successfully; please rebase the patch.

There's a good reason for that -- the latest one was committed two
weeks ago. The status should still be waiting on author, though,
namely for:

On Fri, Aug 26, 2022 at 5:28 AM Thomas Munro  wrote:
> Remaining things from this thread:
>  * removing --disable-thread-safety
>  * removing those vestigial HAVE_XXX macros (one by one analysis and patches)
>  * making Unix sockets secure for Windows in tests

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2022-09-15 Thread John Naylor
On Wed, Sep 14, 2022 at 11:33 PM Peter Geoghegan  wrote:
>
> On Wed, Sep 14, 2022 at 3:18 AM John Naylor

> > Furthermore, it doesn't have to anticipate the maximum size, so there
> > is no up front calculation assuming max-tuples-per-page, so it
> > automatically uses less memory for less demanding tables.
>
> The final number of TIDs doesn't seem like the most interesting
> information that VM snapshots could provide us when it comes to
> building the dead_items TID data structure -- the *distribution* of
> TIDs across heap pages seems much more interesting. The "shape" can be
> known ahead of time, at least to some degree. It can help with
> compression, which will reduce cache misses.

My point here was simply that spilling to disk is an admission of
failure to utilize memory efficiently and thus shouldn't be a selling
point of VM snapshots. Other selling points could still be valid.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Inconsistencies in error messages

2022-09-14 Thread John Naylor
On Wed, Sep 14, 2022 at 5:01 PM Ekaterina Kiryanova
 wrote:
>
> Hi,
>
> When translating error messages, Alexander Lakhin
> noticed some inconsistencies so I prepared a
> small patch to fix those.

+1

This one

- errmsg("background worker \"%s\": background worker without shared
memory access are not supported",
+ errmsg("background worker \"%s\": background workers without shared
memory access are not supported",

is a grammar error so worth backpatching, but the rest are cosmetic.

Will commit this way unless there are objections.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: New strategies for freezing, advancing relfrozenxid early

2022-09-14 Thread John Naylor
On Wed, Sep 14, 2022 at 12:53 AM Peter Geoghegan  wrote:

> This is still only scratching the surface of what is possible with
> dead_items. The visibility map snapshot concept can enable a far more
> sophisticated approach to resource management in vacuumlazy.c. It
> could help us to replace a simple array of item pointers (the current
> dead_items array) with a faster and more space-efficient data
> structure. Masahiko Sawada has done a lot of work on this recently, so
> this may interest him.

I don't quite see how it helps "enable" that. It'd be more logical to
me to say the VM snapshot *requires* you to think harder about
resource management, since a palloc'd snapshot should surely be
counted as part of the configured memory cap that admins control.
(Commonly, it'll be less than a few dozen MB, so I'll leave that
aside.) Since Masahiko hasn't (to my knowledge) gone as far as
integrating his ideas into vacuum, I'm not sure if the current state
of affairs has some snag that a snapshot will ease, but if there is,
you haven't described what it is.

I do remember your foreshadowing in the radix tree thread a while
back, and I do think it's an intriguing idea to combine pages-to-scan
and dead TIDs in the same data structure. The devil is in the details,
of course. It's worth looking into.

> VM snapshots could also make it practical for the new data structure
> to spill to disk to avoid multiple index scans/passed by VACUUM.

I'm not sure spilling to disk is solving the right problem (as opposed
to the hash join case, or to the proposed conveyor belt system which
has a broader aim). I've found several times that a customer will ask
if raising maintenance work mem from 1GB to 10GB will make vacuum
faster. Looking at the count of index scans, it's pretty much always
"1", so even if the current approach could scale above 1GB, "no" it
wouldn't help to raise that limit.

Your mileage may vary, of course.

Continuing my customer example, searching the dead TID list faster
*will* make vacuum faster. The proposed tree structure is more memory
efficient, and IIUC could scale beyond 1GB automatically since each
node is a separate allocation, so the answer will be "yes" in the rare
case the current setting is in fact causing multiple index scans.
Furthermore, it doesn't have to anticipate the maximum size, so there
is no up front calculation assuming max-tuples-per-page, so it
automatically uses less memory for less demanding tables.

(But +1 for changing that calculation for as long as we do have the
single array.)

--
John Naylor
EDB: http://www.enterprisedb.com




Re: minimum perl version

2022-09-13 Thread John Naylor
On Wed, Sep 14, 2022 at 10:46 AM Tom Lane  wrote:
>
> John Naylor  writes:
> > On Wed, Sep 14, 2022 at 6:47 AM Tom Lane  wrote:
> >> I've just switched longfin to use built-from-source perl 5.14.0.
>
> > In that case, here is a quick update with commit message. Not yet any
> > change for MSVC, but I can put together something later.
>
> Looks reasonable just by eyeball, did not test.
>
> > Since we're much less willing to support older Windows and Visual
> > Studio versions, maybe it's low-enough risk defer the check to the
> > Meson conversion? I understand our MSVC process will then go away much
> > more quickly than autoconf...
>
> Agreed --- the MSVC scripts are on a pretty short leash now.
> Not clear it's worth fixing them for this point.  If we've
> failed to get rid of them by the time v16 release approaches,
> maybe it'd be worth doing something then.

Okay, pushed with no further MSVC changes, after doing a round on CI.

--
John Naylor
EDB: http://www.enterprisedb.com




Re: failing to build preproc.c on solaris with sun studio

2022-09-13 Thread John Naylor
On Wed, Sep 14, 2022 at 5:24 AM Thomas Munro  wrote:
>
> On Tue, Sep 6, 2022 at 9:34 AM Tom Lane  wrote:
> > Peter Eisentraut  writes:
> > > Why is this being proposed?
> >
> > Andres is annoyed by the long build time of ecpg, which he has to
> > wait for whether he wants to test it or not.  I could imagine that
> > I might disable ecpg testing on my slowest buildfarm animals, too.
>
> This message triggered me to try to teach ccache how to cache
> preproc.y -> preproc.{c,h}, and I got that basically working[1], but
> upstream doesn't want it (yet).  I'll try again if the proposed
> refactoring to allow more kinds of compiler-like-things goes
> somewhere.  I think that started with people's struggles with GCC vs
> MSVC.  Given the simplicity of this case, though, I suppose we could
> have a little not-very-general shell/python/whatever wrapper script --
> just compute a checksum of the input and keep the output files around.

If we're going to go to this length, it seems more straightforward to
just check the .c/.h files into version control, like every other
project that I have such knowledge of.

To be fair, our grammar changes much more often. One other possible
deal-breaker of that is that it makes it more painful for forks to
maintain additional syntax.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: build remaining Flex files standalone

2022-09-13 Thread John Naylor
On Wed, Sep 14, 2022 at 6:10 AM Andres Freund  wrote:
>
> On 2022-09-12 14:49:50 +0700, John Naylor wrote:
> > CI builds fine. For v2 I only adjusted the commit message. I'll push
> > in a couple days unless there are objections.
>
> Makes sense to me. Thanks for working on it!

This is done.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: minimum perl version

2022-09-13 Thread John Naylor
On Wed, Sep 14, 2022 at 6:47 AM Tom Lane  wrote:
>
> I've just switched longfin to use built-from-source perl 5.14.0.

In that case, here is a quick update with commit message. Not yet any
change for MSVC, but I can put together something later.

Since we're much less willing to support older Windows and Visual
Studio versions, maybe it's low-enough risk to defer the check to the
Meson conversion? I understand our MSVC process will then go away much
more quickly than autoconf...

-- 
John Naylor
EDB: http://www.enterprisedb.com
From d8e241968395c552d667c1eb393cbaea12f0df6c Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 14 Sep 2022 09:58:13 +0700
Subject: [PATCH v2] Bump minimum Perl version to 5.14

The oldest vendor-shipped Perl in the buildfarm is 5.14.2, which is the
last version that Debian Wheezy shipped. That OS is EOL, but we keep
it running because there is no other convenient way to test certain
non-mainstream 32-bit platforms. There is no bugfix in the .2 release
that is required, and yet it's also not the latest minor release --
that would be 5.14.4. To clarify the situation, we also have arranged the
buildfarm to test 5.14.0. That allows configure scripts and documentation
to state 5.14 without fine print.

Discussion: https://www.postgresql.org/message-id/20220902181553.ev4pgzhubhdkg...@awork3.anarazel.de
---
 config/perl.m4   |  4 ++--
 configure|  6 +++---
 doc/src/sgml/install-windows.sgml|  2 +-
 doc/src/sgml/installation.sgml   |  4 ++--
 src/pl/plperl/plc_perlboot.pl|  1 -
 src/test/perl/PostgreSQL/Test/Cluster.pm |  2 +-
 src/test/perl/README | 10 +++---
 src/tools/msvc/gendef.pl |  1 -
 src/tools/pgindent/pgindent  |  1 -
 9 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/config/perl.m4 b/config/perl.m4
index c9fd91397c..8126e79f67 100644
--- a/config/perl.m4
+++ b/config/perl.m4
@@ -11,11 +11,11 @@ if test "$PERL"; then
   pgac_perl_version=`$PERL -v 2>/dev/null | sed -n ['s/This is perl.*v[a-z ]*\([0-9]\.[0-9][0-9.]*\).*$/\1/p']`
   AC_MSG_NOTICE([using perl $pgac_perl_version])
   if echo "$pgac_perl_version" | sed ['s/[.a-z_]/ /g'] | \
-$AWK '{ if ([$]1 == 5 && ([$]2 > 8 || ($[2] == 8 && [$]3 >= 3))) exit 1; else exit 0;}'
+$AWK '{ if ([$]1 == 5 && ([$]2 >= 14)) exit 1; else exit 0;}'
   then
 AC_MSG_WARN([
 *** The installed version of Perl, $PERL, is too old to use with PostgreSQL.
-*** Perl version 5.8.3 or later is required, but this is $pgac_perl_version.])
+*** Perl version 5.14 or later is required, but this is $pgac_perl_version.])
 PERL=""
   fi
 fi
diff --git a/configure b/configure
index fd2a782454..f325bd85b8 100755
--- a/configure
+++ b/configure
@@ -10491,14 +10491,14 @@ if test "$PERL"; then
   { $as_echo "$as_me:${as_lineno-$LINENO}: using perl $pgac_perl_version" >&5
 $as_echo "$as_me: using perl $pgac_perl_version" >&6;}
   if echo "$pgac_perl_version" | sed 's/[.a-z_]/ /g' | \
-$AWK '{ if ($1 == 5 && ($2 > 8 || ($2 == 8 && $3 >= 3))) exit 1; else exit 0;}'
+$AWK '{ if ($1 == 5 && ($2 >= 14)) exit 1; else exit 0;}'
   then
 { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING:
 *** The installed version of Perl, $PERL, is too old to use with PostgreSQL.
-*** Perl version 5.8.3 or later is required, but this is $pgac_perl_version." >&5
+*** Perl version 5.14 or later is required, but this is $pgac_perl_version." >&5
 $as_echo "$as_me: WARNING:
 *** The installed version of Perl, $PERL, is too old to use with PostgreSQL.
-*** Perl version 5.8.3 or later is required, but this is $pgac_perl_version." >&2;}
+*** Perl version 5.14 or later is required, but this is $pgac_perl_version." >&2;}
 PERL=""
   fi
 fi
diff --git a/doc/src/sgml/install-windows.sgml b/doc/src/sgml/install-windows.sgml
index c00ab2b4b2..29d3294dc8 100644
--- a/doc/src/sgml/install-windows.sgml
+++ b/doc/src/sgml/install-windows.sgml
@@ -190,7 +190,7 @@ $ENV{MSBFLAGS}="/m";
   or Cygwin Perl will not work. It must also be present in the PATH.
   Binaries can be downloaded from
  <ulink url="https://www.activestate.com"></ulink>
-  (Note: version 5.8.3 or later is required,
+  (Note: version 5.14 or later is required,
   the free Standard Distribution is sufficient).
  
 
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 7c79608e55..319c7e6966 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -165,7 +165,7 @@ su - postgres
   PL/Perl you need a full
   Perl installation, including the
   libperl library and the header files.
-  The minimum required version is Perl 5.8.3.
+  The minimum required version is Perl 5.14.

Re: minimum perl version

2022-09-13 Thread John Naylor
On Tue, Sep 13, 2022 at 5:53 PM John Naylor
 wrote:
>
> Until such time as that happens, here is a draft to require 5.14.2.

As soon as I hit send, it occurred to me that we don't check the perl
version on Windows, since (I seem to recall) 5.8.3 was too old to be
an option on that platform. We'll have to add a new check somewhere.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: minimum perl version

2022-09-13 Thread John Naylor
On Sat, Sep 3, 2022 at 2:11 AM Tom Lane  wrote:
>
> Andres Freund  writes:
> > 5.14 would be a trivial lift as far as the buildfarm is concerned.
>
> Yeah, that seems like a reasonable new minimum for Perl.  I might
> see about setting up an animal running 5.14.0, just so we can say
> "5.14" in the docs without fine print.

Until such time as that happens, here is a draft to require 5.14.2.

-- 
John Naylor
EDB: http://www.enterprisedb.com
 config/perl.m4   |  4 ++--
 configure|  6 +++---
 doc/src/sgml/install-windows.sgml|  2 +-
 doc/src/sgml/installation.sgml   |  4 ++--
 src/pl/plperl/plc_perlboot.pl|  1 -
 src/test/perl/PostgreSQL/Test/Cluster.pm |  2 +-
 src/test/perl/README | 10 +++---
 src/tools/msvc/gendef.pl |  1 -
 src/tools/pgindent/pgindent  |  1 -
 9 files changed, 12 insertions(+), 19 deletions(-)

diff --git a/config/perl.m4 b/config/perl.m4
index c9fd91397c..29f54bbb79 100644
--- a/config/perl.m4
+++ b/config/perl.m4
@@ -11,11 +11,11 @@ if test "$PERL"; then
   pgac_perl_version=`$PERL -v 2>/dev/null | sed -n ['s/This is perl.*v[a-z ]*\([0-9]\.[0-9][0-9.]*\).*$/\1/p']`
   AC_MSG_NOTICE([using perl $pgac_perl_version])
   if echo "$pgac_perl_version" | sed ['s/[.a-z_]/ /g'] | \
-$AWK '{ if ([$]1 == 5 && ([$]2 > 8 || ($[2] == 8 && [$]3 >= 3))) exit 1; else exit 0;}'
+$AWK '{ if ([$]1 == 5 && ([$]2 > 14 || ($[2] == 14 && [$]3 >= 2))) exit 1; else exit 0;}'
   then
 AC_MSG_WARN([
 *** The installed version of Perl, $PERL, is too old to use with PostgreSQL.
-*** Perl version 5.8.3 or later is required, but this is $pgac_perl_version.])
+*** Perl version 5.14.2 or later is required, but this is $pgac_perl_version.])
 PERL=""
   fi
 fi
diff --git a/configure b/configure
index fd2a782454..160c181441 100755
--- a/configure
+++ b/configure
@@ -10491,14 +10491,14 @@ if test "$PERL"; then
   { $as_echo "$as_me:${as_lineno-$LINENO}: using perl $pgac_perl_version" >&5
 $as_echo "$as_me: using perl $pgac_perl_version" >&6;}
   if echo "$pgac_perl_version" | sed 's/[.a-z_]/ /g' | \
-$AWK '{ if ($1 == 5 && ($2 > 8 || ($2 == 8 && $3 >= 3))) exit 1; else exit 0;}'
+$AWK '{ if ($1 == 5 && ($2 > 14 || ($2 == 14 && $3 >= 2))) exit 1; else exit 0;}'
   then
 { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING:
 *** The installed version of Perl, $PERL, is too old to use with PostgreSQL.
-*** Perl version 5.8.3 or later is required, but this is $pgac_perl_version." >&5
+*** Perl version 5.14.2 or later is required, but this is $pgac_perl_version." >&5
 $as_echo "$as_me: WARNING:
 *** The installed version of Perl, $PERL, is too old to use with PostgreSQL.
-*** Perl version 5.8.3 or later is required, but this is $pgac_perl_version." >&2;}
+*** Perl version 5.14.2 or later is required, but this is $pgac_perl_version." >&2;}
 PERL=""
   fi
 fi
diff --git a/doc/src/sgml/install-windows.sgml b/doc/src/sgml/install-windows.sgml
index c00ab2b4b2..2e41d75db0 100644
--- a/doc/src/sgml/install-windows.sgml
+++ b/doc/src/sgml/install-windows.sgml
@@ -190,7 +190,7 @@ $ENV{MSBFLAGS}="/m";
   or Cygwin Perl will not work. It must also be present in the PATH.
   Binaries can be downloaded from
  <ulink url="https://www.activestate.com"></ulink>
-  (Note: version 5.8.3 or later is required,
+  (Note: version 5.14.2 or later is required,
   the free Standard Distribution is sufficient).
  
 
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 7c79608e55..5d7c573729 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -165,7 +165,7 @@ su - postgres
   PL/Perl you need a full
   Perl installation, including the
   libperl library and the header files.
-  The minimum required version is Perl 5.8.3.
+  The minimum required version is Perl 5.14.2.
   Since PL/Perl will be a shared
   library, the libperl
   library must be a shared library
@@ -325,7 +325,7 @@ su - postgres
perl
   
 
-  Perl 5.8.3 or later is needed to build from a Git checkout,
+  Perl 5.14.2 or later is needed to build from a Git checkout,
   or if you changed the input files for any of the build steps that
   use Perl scripts.  If building on Windows you will need
   Perl in any case.  Perl is
diff --git a/src/pl/plperl/plc_perlboot.pl b/src/pl/plperl/plc_perlboot.pl
index 8fd7f998bc..72cb53f6e3 100644
--- a/src/pl/plperl/plc_perlboot.pl
+++ b/src/pl/plperl/plc_perlboot.pl
@@ -6,7 +6,6 @@
 use strict;
 use warnings;
 
-use 5.008001;
 use vars qw(%_SHARED $_TD);
 
 Pos

Re: broken table formatting in psql

2022-09-13 Thread John Naylor
On Thu, Sep 8, 2022 at 12:39 PM John Naylor
 wrote:
>
> On Fri, Sep 2, 2022 at 3:19 PM Kyotaro Horiguchi
>  wrote:
> >
> > At Fri, 2 Sep 2022 13:43:50 +0700, John Naylor 
> >  wrote in
> > > If there is any doubt about including all of Cf, we could also just
> > > add a branch in wchar.c to hard-code the 200B-200F range.
> >
> > If every way has defect to the similar extent, I think we will choose
> > to use authoritative data at least for the first step. We might want
> > to have additional filtering on it but it would be another issue,
> > maybe.
> >
> > Attached is the first cut of that. (The commit messages is not great,
> > though.)
>
> Okay, the patch looks good to me overall.

As discussed, I pushed to master only, with only one additional
comment in the perl script to describe Me/Mn/Cf.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: proposal: possibility to read dumped table's name from file

2022-09-13 Thread John Naylor
On Mon, Sep 12, 2022 at 8:10 PM Pavel Stehule  wrote:
>
> po 12. 9. 2022 v 9:59 odesílatel Daniel Gustafsson  napsal:
>> I don't think the capabilities of the tool are all that interesting compared to the
>> long term maintainability and readability of the source code.

With make distprep and maintainer-clean, plus separate makefile and MSVC
build logic, all a short time before converting to Meson, I'm not sure
that even the short-term maintainability here is a good trade-off for
what we're getting.

> The parser in bison/flex does the same work, and it is true the code is more
> readable. Although for this case, a hand-written parser was trivial too.

If the hand-written version is trivial, then we should prefer it.
--
John Naylor
EDB: http://www.enterprisedb.com




Re: broken table formatting in psql

2022-09-12 Thread John Naylor
On Mon, Sep 12, 2022 at 12:44 PM Pavel Stehule  wrote:
> This is not a critical issue, really.  On second thought, I don't see the 
> point in releasing fresh Postgres with this bug, where there is a known bugfix -
> and this bugfix should be compatible (at this moment) with 16.

I agree the actual logic/data change is low-risk. The patch renames
two files, which seems a bit much this late in the cycle. Maybe that's
okay, but I'd like someone else to opine before doing so.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: build remaining Flex files standalone

2022-09-12 Thread John Naylor
On Fri, Sep 9, 2022 at 12:18 PM John Naylor
 wrote:
>
> It seems gramparse.h isn't installed now? In any case, here's a patch
> to move gramparse to the backend dir and stop symlinking/ installing
> gram.h.

Looking more closely at src/include/Makefile, this is incorrect -- all
files in SUBDIRS are copied over. So moving gramparse.h to the backend
will automatically not install it. The explicit install rule for
gram.h was for vpath builds.

CI builds fine. For v2 I only adjusted the commit message. I'll push
in a couple days unless there are objections.
--
John Naylor
EDB: http://www.enterprisedb.com
From dcfe0976e6118a2c385b4473ceab76a742a6f1ff Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Fri, 9 Sep 2022 11:57:39 +0700
Subject: [PATCH v2] Move gramparse.h to src/backend/parser

This header is semi-private, being used only in files related to
raw parsing, so move to the backend directory where those files
live. This allows removal of makefile rules that symlink gram.h to
src/include/parser, since gramparse.h can now include gram.h from within
the same directory. While at it, stop installing gram.h.

Per suggestion from Andres Freund and Peter Eisentraut
Discussion: https://www.postgresql.org/message-id/20220904181759.px6uosll6zbxcum5%40awork3.anarazel.de
---
 src/backend/Makefile| 7 +--
 src/backend/parser/gram.y   | 2 +-
 src/{include => backend}/parser/gramparse.h | 4 ++--
 src/backend/parser/parser.c | 2 +-
 src/backend/parser/scan.l   | 2 +-
 src/include/Makefile| 4 ++--
 src/include/parser/.gitignore   | 1 -
 src/tools/msvc/Install.pm   | 4 
 src/tools/pginclude/cpluspluscheck  | 1 -
 src/tools/pginclude/headerscheck| 1 -
 10 files changed, 8 insertions(+), 20 deletions(-)
 rename src/{include => backend}/parser/gramparse.h (97%)
 delete mode 100644 src/include/parser/.gitignore

diff --git a/src/backend/Makefile b/src/backend/Makefile
index d0d34821d5..181c217fae 100644
--- a/src/backend/Makefile
+++ b/src/backend/Makefile
@@ -153,12 +153,7 @@ submake-utils-headers:
 
 .PHONY: generated-headers
 
-generated-headers: $(top_builddir)/src/include/parser/gram.h $(top_builddir)/src/include/storage/lwlocknames.h submake-catalog-headers submake-nodes-headers submake-utils-headers
-
-$(top_builddir)/src/include/parser/gram.h: parser/gram.h
-	prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \
-	  cd '$(dir $@)' && rm -f $(notdir $@) && \
-	  $(LN_S) "$$prereqdir/$(notdir $<)" .
+generated-headers: $(top_builddir)/src/include/storage/lwlocknames.h submake-catalog-headers submake-nodes-headers submake-utils-headers
 
 $(top_builddir)/src/include/storage/lwlocknames.h: storage/lmgr/lwlocknames.h
 	prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \
diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index ea33784316..82f03fc9c9 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -55,9 +55,9 @@
 #include "catalog/pg_trigger.h"
 #include "commands/defrem.h"
 #include "commands/trigger.h"
+#include "gramparse.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
-#include "parser/gramparse.h"
 #include "parser/parser.h"
 #include "storage/lmgr.h"
 #include "utils/date.h"
diff --git a/src/include/parser/gramparse.h b/src/backend/parser/gramparse.h
similarity index 97%
rename from src/include/parser/gramparse.h
rename to src/backend/parser/gramparse.h
index 41b753a96c..c4726c618d 100644
--- a/src/include/parser/gramparse.h
+++ b/src/backend/parser/gramparse.h
@@ -11,7 +11,7 @@
  * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
- * src/include/parser/gramparse.h
+ * src/backend/parser/gramparse.h
  *
  *-
  */
@@ -26,7 +26,7 @@
  * NB: include gram.h only AFTER including scanner.h, because scanner.h
  * is what #defines YYLTYPE.
  */
-#include "parser/gram.h"
+#include "gram.h"
 
 /*
  * The YY_EXTRA data that a flex scanner allows us to pass around.  Private
diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index 50227cc098..ef85d3bb68 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -22,7 +22,7 @@
 #include "postgres.h"
 
 #include "mb/pg_wchar.h"
-#include "parser/gramparse.h"
+#include "gramparse.h"
 #include "parser/parser.h"
 #include "parser/scansup.h"
 
diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index 882e081aae..db8b0fe8eb 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -36,

Re: broken table formatting in psql

2022-09-11 Thread John Naylor
On Thu, Sep 8, 2022 at 12:51 PM Pavel Stehule  wrote:
>
>
>> Does anyone want to advocate for backpatching these missing ranges to
>> v12 and up? v12 still has a table in-line so trivial to remedy, but
>> v13 and up use a script, so these exceptions would likely have to use
>> hard-coded branches to keep from bringing in new changes.
>>
>> If so, does anyone want to advocate for including this patch in v15?
>> It claims Unicode 14.0.0, and this would make that claim more
>> technically correct as well as avoiding additional branches.
>
>
> I think it can be fixed just in v15 and master.  This issue has no impact on 
> SQL.

Well, if the regressions from v11 are not important enough to
backpatch, there is not as much of a case to backpatch the full fix to
v15 either.
-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: [PATCH] Clarify the comments about varlena header encoding

2022-09-11 Thread John Naylor
On Sun, Sep 11, 2022 at 5:06 PM Aleksander Alekseev
 wrote:
>
> Hi John,
>
> Many thanks for the feedback!
>
> > Or put the now-standard 0b in front.
>
> Good idea.

Now that I look at the results, though, it's distracting and not good
for readability. I'm not actually sure we need to do anything here,
but I am somewhat in favor of putting [un]aligned in parentheses, as
already discussed. Even there, in the first email you said:

> Also one can read this as "aligned, uncompressed
> data" (which again is wrong).

I'm not sure it rises to the level of "wrong", because a blob of bytes
immediately after an aligned uint32 is in fact aligned. The important
thing is: a zero byte is always either a padding byte or part of a
4-byte header, so it's the alignment of the header we really care
about.

> > I think the problem is ambiguity about what a "toast pointer" is. This 
> > comment:
> >
> > * struct varatt_external is a traditional "TOAST pointer", that is, the
>
> Right. The comment for varatt_external says that it IS a TOAST
> pointer.

Well, the word "traditional" is not very informative, but it is there.
And afterwards there is also varatt_indirect, varatt_expanded, and
varattrib_1b_e, which all mention "TOAST pointer".

> However the comments for varlena headers bit layout
> implicitly include it into a TOAST pointer, which contradicts the
> previous comments. I suggest we fix this ambiguity by explicitly
> enumerating the type tag in the comments for varlena headers.

- * 10000000 1-byte length word, unaligned, TOAST pointer
+ * 0b10000000 1-byte length word (unaligned), type tag, TOAST pointer

This is distracting from the point of this whole comment, which, I
will say again is: How to look at the first byte to determine what
kind of varlena we're looking at. There is no reason to mention the
type tag here, at all.
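
For reference, the whole dispatch the comment supports boils down to a
couple of bit tests on that first byte. Paraphrasing the little-endian
branch of postgres.h (the big-endian branch tests the high bits instead;
the function and wording here are mine):

static const char *
varlena_kind(const void *datum)
{
	unsigned char hdr = *(const unsigned char *) datum;

	if (hdr == 0x01)
		return "1-byte header: TOAST pointer, va_tag byte follows";
	if ((hdr & 0x01) == 0x01)
		return "1-byte header: short inline datum";
	if ((hdr & 0x03) == 0x00)
		return "4-byte header: uncompressed data";
	return "4-byte header: compressed data";	/* (hdr & 0x03) == 0x02 */
}

The 0x01 test must come first, since a TOAST pointer's header byte would
also pass the short-datum test.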

- * In TOAST pointers the va_tag field (see varattrib_1b_e) is used to discern
- * the specific type and length of the pointer datum.
+ * For the TOAST pointers the type tag (see varattrib_1b_e.va_tag field) is
+ * used to discern the specific type and length of the pointer datum.

I don't think this clarifies anything, it's just a rephrasing.

More broadly, I think the description of varlenas in this header is at
a kind of "local maximum" -- minor adjustments are more likely to make
it worse. To significantly improve clarity might require a larger
rewriting, but I'm not personally interested in taking part in that.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: proposal: possibility to read dumped table's name from file

2022-09-09 Thread John Naylor
On Thu, Sep 8, 2022 at 7:32 PM Daniel Gustafsson  wrote:
> [v3]

Note that the grammar has shift-reduce conflicts. If you run a fairly
recent Bison, you can show them like this:

bison -Wno-deprecated -Wcounterexamples -d -o filterparse.c filterparse.y

filterparse.y: warning: 2 shift/reduce conflicts [-Wconflicts-sr]
filterparse.y: warning: shift/reduce conflict on token C_INCLUDE [-Wcounterexamples]
  Example: • C_INCLUDE include_object pattern
  Shift derivation
    Filters
    ↳ 3: Filter
         ↳ 4: • C_INCLUDE include_object pattern
  Reduce derivation
    Filters
    ↳ 2: Filters   Filter
         ↳ 1: ε •  ↳ 4: C_INCLUDE include_object pattern
filterparse.y: warning: shift/reduce conflict on token C_EXCLUDE [-Wcounterexamples]
  Example: • C_EXCLUDE exclude_object pattern
  Shift derivation
    Filters
    ↳ 3: Filter
         ↳ 5: • C_EXCLUDE exclude_object pattern
  Reduce derivation
    Filters
    ↳ 2: Filters   Filter
         ↳ 1: ε •  ↳ 5: C_EXCLUDE exclude_object pattern

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: Minimum bison and flex versions

2022-09-08 Thread John Naylor
On Fri, Sep 9, 2022 at 12:07 AM Andres Freund  wrote:
>
> Hi,
>
> On 2022-09-06 13:32:32 +0700, John Naylor wrote:
> > Here are autoconf-only patches to that effect.
>
> Looks like you did actually include src/tools/msvc as well :)

Ah, in my head I meant "no patches for the Meson branch". :-)

CI fails on MSVC pg_upgrade, but that shows up in some existing CF bot
failures and in any case doesn't have a grammar, so I have pushed,
thanks for looking!

> > Since the retirement of some older buildfarm members, the oldest Bison
> > that gets regular testing is 2.3. MacOS ships that version, and will
> > continue doing so for the forseeable future because of Apple's policy
> > regarding GPLv3. While Mac users could use a package manager to install
> > a newer version, there is no compelling reason to do so at this time.
>
> s/to do so/to force them to do so/?

There are good reasons for a dev to install a newer Bison, like better
diagnostics, so I used this language.

--
John Naylor
EDB: http://www.enterprisedb.com




<    1   2   3   4   5   6   7   8   9   10   >