Re: slab allocator performance issues
On Sat, Dec 10, 2022 at 11:02 AM David Rowley wrote:
> [v4]

Thanks for working on this! I ran an in-situ benchmark using the v13 radix tree patchset ([1] WIP but should be useful enough for testing allocation speed), only applying the first five, which are local-memory only. The benchmark is not meant to represent a realistic workload, and primarily stresses traversal and allocation of the smallest node type.

Minimum of five runs, with turbo-boost off, on recent Intel laptop hardware:

v13-0001 to 0005:

# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---------------+---------
     151123432 |     222

  47.06%  postgres  postgres   [.] rt_set
  22.89%  postgres  postgres   [.] SlabAlloc
   9.65%  postgres  postgres   [.] rt_node_insert_inner.isra.0
   5.94%  postgres  [unknown]  [k] 0xb5e011b7
   3.62%  postgres  postgres   [.] MemoryContextAlloc
   2.70%  postgres  libc.so.6  [.] __memmove_avx_unaligned_erms
   2.60%  postgres  postgres   [.] SlabFree

+ v4 slab:

# select * from bench_load_random_int(500 * 1000);
 mem_allocated | load_ms
---------------+---------
     152463112 |     213

  52.42%  postgres  postgres   [.] rt_set
  12.80%  postgres  postgres   [.] SlabAlloc
   9.38%  postgres  postgres   [.] rt_node_insert_inner.isra.0
   7.87%  postgres  [unknown]  [k] 0xb5e011b7
   4.98%  postgres  postgres   [.] SlabFree

While allocation is markedly improved, freeing looks worse here. The proportion is surprising because only about 2% of nodes are freed during the load, but doing that takes up 10-40% of the time compared to allocating.

num_keys = 50, height = 7
n4 = 2501016, n15 = 56932, n32 = 270, n125 = 0, n256 = 257

Sidenote: I don't recall ever seeing vsyscall (I think that's what the 0xb5e011b7 address is referring to) in a profile, so I'm not sure what is happening there.

[1] https://www.postgresql.org/message-id/CAFBsxsHNE621mGuPhd7kxaGc22vMkoSu7R4JW9Zan1jjorGy3g%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com
Re: static assert cleanup
On Fri, Dec 9, 2022 at 2:47 PM Peter Eisentraut <peter.eisentr...@enterprisedb.com> wrote:
>
> 0003-Move-some-static-assertions-to-better-places.patch
>
> This moves some that I thought were suboptimally placed but it could be
> debated or refined.

+ * We really want ItemPointerData to be exactly 6 bytes.  This is rather a
+ * random place to check, but there is no better place.

Since the assert is no longer in a random function body, it seems we can remove the second sentence.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Dec 9, 2022 at 8:20 AM Masahiko Sawada wrote:

> In the meanwhile, I've been working on vacuum integration. There are
> two things I'd like to discuss some time:
>
> The first is the minimum of maintenance_work_mem, 1 MB. Since the
> initial DSA segment size is 1MB (DSA_INITIAL_SEGMENT_SIZE), parallel
> vacuum with radix tree cannot work with the minimum
> maintenance_work_mem. It will need to increase it to 4MB or so. Maybe
> we can start a new thread for that.

I don't think that'd be very controversial, but I'm also not sure why we'd need 4MB -- can you explain in more detail what exactly we'd need so that the feature would work? (The minimum doesn't have to work *well* IIUC, just do some useful work and not fail.)

> The second is how to limit the size of the radix tree to
> maintenance_work_mem. I think that it's tricky to estimate the maximum
> number of keys in the radix tree that fit in maintenance_work_mem. The
> radix tree size varies depending on the key distribution. The next
> idea I considered was how to limit the size when inserting a key. In
> order to strictly limit the radix tree size, probably we have to
> change the rt_set so that it breaks off and returns false if the radix
> tree size is about to exceed the memory limit when we allocate a new
> node or grow a node kind/class.

That seems complex, fragile, and the wrong scope.

> Ideally, I'd like to control the size
> outside of radix tree (e.g. TIDStore) since it could introduce
> overhead to rt_set() but probably we need to add such logic in radix
> tree.

Does the TIDStore have the ability to ask the DSA (or slab context) how big it is? If a new segment has been allocated that brings us to the limit, we can stop when we discover that fact. In the local case with slab blocks, it won't be on nice neat boundaries, but we could check whether we're within the largest block size (~64kB) of overflow.

Remember when we discussed how we might approach parallel pruning?
I envisioned a local array of a few dozen kilobytes to reduce contention on the tidstore. We could use such an array even for a single worker (always doing the same thing is simpler anyway). When the array fills up enough so that the next heap page *could* overflow it: Stop, insert into the store, and check the store's memory usage before continuing. -- John Naylor EDB: http://www.enterprisedb.com
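[Editor's sketch] That buffered flush-then-check flow can be sketched in isolation. Everything here -- the names, the sizes, and the counter standing in for a real memory-usage query such as MemoryContextMemAllocated() or a DSA size lookup -- is a hypothetical illustration, not code from the patch:

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_SLOTS 4096            /* "a few dozen kilobytes" of TIDs, hypothetically */
#define MAX_TIDS_PER_PAGE 291     /* one heap page cannot add more than this */

typedef struct TidBuffer
{
    uint64_t tids[BUF_SLOTS];
    int      ntids;
} TidBuffer;

/* Stand-in for the real store; a simple counter models memory growth. */
static size_t store_mem_used = 0;

static void
store_insert(uint64_t tid)
{
    store_mem_used += sizeof(uint64_t);   /* placeholder for real allocation */
    (void) tid;
}

/*
 * Flush the local array into the store, then report whether the store is
 * still under its memory budget and scanning may continue.
 */
static int
buffer_flush_and_check(TidBuffer *buf, size_t mem_limit)
{
    for (int i = 0; i < buf->ntids; i++)
        store_insert(buf->tids[i]);
    buf->ntids = 0;
    return store_mem_used < mem_limit;
}

/* Called once per heap page: flush first if the next page *could* overflow. */
static int
buffer_add_page(TidBuffer *buf, const uint64_t *page_tids, int n, size_t mem_limit)
{
    if (buf->ntids + MAX_TIDS_PER_PAGE > BUF_SLOTS &&
        !buffer_flush_and_check(buf, mem_limit))
        return 0;                 /* caller suspends the scan to vacuum indexes */
    for (int i = 0; i < n; i++)
        buf->tids[buf->ntids++] = page_tids[i];
    return 1;
}
```

The key property is that the (possibly contended) store is touched only at flush time, and the memory check happens at the same coarse granularity.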
Re: move some bitmapset.c macros to bitmapset.h
On Tue, Dec 6, 2022 at 12:57 PM Tom Lane wrote: > > Well, they've already escaped to tidbitmap.c as a copy. How do you feel > > about going that route? > > Not terribly pleased with that either, I must admit. Okay, I won't pursue that further. > If we do put RIGHTMOST_ONE functionality into pg_bitutils.h, > I'd envision it as pg_bitutils.h exporting both 32-bit and > 64-bit versions of that, and then bitmapset.c choosing the > appropriate one just like it chooses bmw_rightmost_one_pos. Here's a quick go at that. I've not attempted to use it for what I need, but it looks like it fits the bill. -- John Naylor EDB: http://www.enterprisedb.com src/backend/nodes/bitmapset.c| 36 ++-- src/include/nodes/bitmapset.h| 16 ++-- src/include/port/pg_bitutils.h | 31 +++ src/tools/pgindent/typedefs.list | 1 - 4 files changed, 47 insertions(+), 37 deletions(-) diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index b7b274aeff..4384ff591d 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -32,39 +32,7 @@ #define BITMAPSET_SIZE(nwords) \ (offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword)) -/*-- - * This is a well-known cute trick for isolating the rightmost one-bit - * in a word. It assumes two's complement arithmetic. Consider any - * nonzero value, and focus attention on the rightmost one. The value is - * then something like - *xx1 - * where x's are unspecified bits. The two's complement negative is formed - * by inverting all the bits and adding one. Inversion gives - *yy0 - * where each y is the inverse of the corresponding x. Incrementing gives - *yy1 - * and then ANDing with the original value gives - *001 - * This works for all cases except original value = zero, where of course - * we get zero. 
- *-- - */ -#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x))) - -#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x)) - -/* Select appropriate bit-twiddling functions for bitmap word size */ -#if BITS_PER_BITMAPWORD == 32 -#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w) -#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w) -#define bmw_popcount(w)pg_popcount32(w) -#elif BITS_PER_BITMAPWORD == 64 -#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w) -#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w) -#define bmw_popcount(w)pg_popcount64(w) -#else -#error "invalid BITS_PER_BITMAPWORD" -#endif +#define HAS_MULTIPLE_ONES(x) (bmw_rightmost_one(x) != (x)) /* @@ -1013,7 +981,7 @@ bms_first_member(Bitmapset *a) { int result; - w = RIGHTMOST_ONE(w); + w = bmw_rightmost_one(w); a->words[wordnum] &= ~w; result = wordnum * BITS_PER_BITMAPWORD; diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h index 2792281658..fdc504596b 100644 --- a/src/include/nodes/bitmapset.h +++ b/src/include/nodes/bitmapset.h @@ -38,13 +38,11 @@ struct List; #define BITS_PER_BITMAPWORD 64 typedef uint64 bitmapword; /* must be an unsigned type */ -typedef int64 signedbitmapword; /* must be the matching signed type */ #else #define BITS_PER_BITMAPWORD 32 typedef uint32 bitmapword; /* must be an unsigned type */ -typedef int32 signedbitmapword; /* must be the matching signed type */ #endif @@ -75,6 +73,20 @@ typedef enum BMS_MULTIPLE/* >1 member */ } BMS_Membership; +/* Select appropriate bit-twiddling functions for bitmap word size */ +#if BITS_PER_BITMAPWORD == 32 +#define bmw_rightmost_one(w) pg_rightmost_one32(w) +#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w) +#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos32(w) +#define bmw_popcount(w)pg_popcount32(w) +#elif BITS_PER_BITMAPWORD == 64 +#define bmw_rightmost_one(w) pg_rightmost_one64(w) +#define bmw_leftmost_one_pos(w) pg_leftmost_one_pos64(w) 
+#define bmw_rightmost_one_pos(w) pg_rightmost_one_pos64(w) +#define bmw_popcount(w)pg_popcount64(w) +#else +#error "invalid BITS_PER_BITMAPWORD" +#endif /* * function prototypes in nodes/bitmapset.c diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h index 814e0b2dba..f95b6afd86 100644 --- a/src/include/port/pg_bitutils.h +++ b/src/include/port/pg_bitutils.h @@ -17,6 +17,37 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256]; extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256]; extern PGDLLIMPORT const uint8 pg_number_of_ones[256]; +/*-- + * This is a well-known cute trick for isolating the rightmost one-bit + * in a word. It assumes two's complement arithmetic. Consider any + * nonzero value, and focus attention on the rightmost one. The value is + * then something like + *xx1 + * where x's are unspecified bits. The t
Re: move some bitmapset.c macros to bitmapset.h
On Mon, Dec 5, 2022 at 9:33 PM Tom Lane wrote:
> Alvaro Herrera writes:
> > On 2022-Dec-05, John Naylor wrote:
> >> -#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD)
> >> -#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD)
>
> > In this location, nobody can complain about the naming of these macros,
> > since they're just used to implement other bitmapset.c code.  However,
> > if you move them to the .h file, ISTM you should give them more
> > meaningful names.
>
> IMV these are absolutely private to bitmapset.c.  I reject the idea
> that they should be exposed publicly, under these names or any others.

Well, they've already escaped to tidbitmap.c as a copy. How do you feel about going that route?

> Maybe we need some more bitmapset primitive functions?  What is it
> you actually want to accomplish in the end?

An inserter into one type of node in a tree structure must quickly find a free position in an array. We have a bitmap of 128 bits to indicate whether the corresponding array position is free. The proposed coding is:

	/* get the first word with at least one bit not set */
	for (idx = 0; idx < WORDNUM(128); idx++)
	{
		if (isset[idx] < ~((bitmapword) 0))
			break;
	}

	/* To get the first unset bit in X, get the first set bit in ~X */
	inverse = ~(isset[idx]);
	slotpos = idx * BITS_PER_BITMAPWORD;
	slotpos += bmw_rightmost_one_pos(inverse);

	/* mark the slot used */
	isset[idx] |= RIGHTMOST_ONE(inverse);

	return slotpos;

...which, if it were reversed so that a set bit meant "available", would essentially be bms_first_member(), so a more primitive version of that might work.

--
John Naylor
EDB: http://www.enterprisedb.com
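[Editor's sketch] A standalone version of that idiom, with uint64_t standing in for bitmapword and GCC/Clang's __builtin_ctzll standing in for bmw_rightmost_one_pos() -- both substitutions are mine, for illustration only:

```c
#include <stdint.h>

#define BITS_PER_WORD 64
#define NWORDS (128 / BITS_PER_WORD)

/*
 * Find and claim the first free (zero) bit in a 128-bit occupancy bitmap.
 * As in the proposed coding, the caller must guarantee a free slot exists.
 */
static int
claim_first_free_slot(uint64_t isset[NWORDS])
{
    int      idx;
    uint64_t inverse;
    int      slotpos;

    /* get the first word with at least one bit not set */
    for (idx = 0; idx < NWORDS; idx++)
    {
        if (isset[idx] < ~(uint64_t) 0)
            break;
    }

    /* to get the first unset bit in X, get the first set bit in ~X */
    inverse = ~isset[idx];
    slotpos = idx * BITS_PER_WORD + __builtin_ctzll(inverse);

    /* mark the slot used: isolate the rightmost one-bit of the inverse */
    isset[idx] |= inverse & (-inverse);

    return slotpos;
}
```

The `inverse & (-inverse)` step is the same two's-complement trick that RIGHTMOST_ONE encodes, written with unsigned arithmetic so it is well-defined.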
move some bitmapset.c macros to bitmapset.h
Over in [1], Masahiko and I found that using some bitmapset logic yields a useful speedup in one part of the proposed radix tree patch. In addition to what's in bitmapset.h now, we'd need WORDNUM, BITNUM, RIGHTMOST_ONE and bmw_rightmost_one_pos() from bitmapset.c. The file tidbitmap.c has its own copies of the first two, and we could adapt that strategy again. But instead of three files defining these, it seems like it's time to consider moving them somewhere more central. Attached is a simple form of this concept, including moving HAS_MULTIPLE_ONES just for consistency. One possible objection is the possibility of namespace clash. Thoughts? I also considered putting the macros and typedefs in pg_bitutils.h. One organizational advantage is that pg_bitutils.h already offers convenience function symbols where the parameter depends on SIZEOF_SIZE_T, so putting the BITS_PER_BITMAPWORD versions there makes sense. But that way is not a clear win, so I didn't go that far. [1] https://www.postgresql.org/message-id/CAFBsxsHgP5LP9q%3DRq_3Q2S6Oyut67z%2BVpemux%2BKseSPXhJF7sg%40mail.gmail.com -- John Naylor EDB: http://www.enterprisedb.com src/backend/nodes/bitmapset.c | 24 src/backend/nodes/tidbitmap.c | 5 - src/include/nodes/bitmapset.h | 24 3 files changed, 24 insertions(+), 29 deletions(-) diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index b7b274aeff..3204b49738 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -26,33 +26,9 @@ #include "port/pg_bitutils.h" -#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) -#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) - #define BITMAPSET_SIZE(nwords) \ (offsetof(Bitmapset, words) + (nwords) * sizeof(bitmapword)) -/*-- - * This is a well-known cute trick for isolating the rightmost one-bit - * in a word. It assumes two's complement arithmetic. Consider any - * nonzero value, and focus attention on the rightmost one. 
The value is - * then something like - *xx1 - * where x's are unspecified bits. The two's complement negative is formed - * by inverting all the bits and adding one. Inversion gives - *yy0 - * where each y is the inverse of the corresponding x. Incrementing gives - *yy1 - * and then ANDing with the original value gives - *001 - * This works for all cases except original value = zero, where of course - * we get zero. - *-- - */ -#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x))) - -#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x)) - /* Select appropriate bit-twiddling functions for bitmap word size */ #if BITS_PER_BITMAPWORD == 32 #define bmw_leftmost_one_pos(w) pg_leftmost_one_pos32(w) diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c index a7a6b26668..8a1fd1d205 100644 --- a/src/backend/nodes/tidbitmap.c +++ b/src/backend/nodes/tidbitmap.c @@ -72,11 +72,6 @@ */ #define PAGES_PER_CHUNK (BLCKSZ / 32) -/* We use BITS_PER_BITMAPWORD and typedef bitmapword from nodes/bitmapset.h */ - -#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) -#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) - /* number of active words for an exact page: */ #define WORDS_PER_PAGE ((MAX_TUPLES_PER_PAGE - 1) / BITS_PER_BITMAPWORD + 1) /* number of active words for a lossy chunk: */ diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h index 2792281658..8462282b9e 100644 --- a/src/include/nodes/bitmapset.h +++ b/src/include/nodes/bitmapset.h @@ -48,6 +48,30 @@ typedef int32 signedbitmapword; /* must be the matching signed type */ #endif +#define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) +#define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) + +/*-- + * This is a well-known cute trick for isolating the rightmost one-bit + * in a word. It assumes two's complement arithmetic. Consider any + * nonzero value, and focus attention on the rightmost one. The value is + * then something like + *xx1 + * where x's are unspecified bits. 
The two's complement negative is formed + * by inverting all the bits and adding one. Inversion gives + *yy0 + * where each y is the inverse of the corresponding x. Incrementing gives + *yy1 + * and then ANDing with the original value gives + *001 + * This works for all cases except original value = zero, where of course + * we get zero. + *-- + */ +#define RIGHTMOST_ONE(x) ((signedbitmapword) (x) & -((signedbitmapword) (x))) + +#define HAS_MULTIPLE_ONES(x) ((bitmapword) RIGHTMOST_ONE(x) != (x)) + typedef struct Bitmapset { pg_node_attr(custom_copy_equal, special_read_write)
Re: slab allocator performance issues
On Mon, Dec 5, 2022 at 3:02 PM David Rowley wrote:
>
> On Fri, 11 Nov 2022 at 22:20, John Naylor wrote:
> > #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> > index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
>
> Doesn't this create a sort of round-robin use of the free list? What
> we want is a sort of "histogram" bucket set of free lists so we can
> group together blocks that have a close-enough free number of chunks.
> Unless I'm mistaken, I think what you have doesn't do that.

The intent must have slipped my mind along the way.

> I wondered if simply:
>
> index = -(-freecount >> slab->freelist_shift);
>
> would be faster than Andres' version. I tried it out and on my AMD
> machine, it's about the same speed. Same on a Raspberry Pi 4.
>
> Going by [2], the instructions are very different with each method, so
> other machines with different latencies on those instructions might
> show something different. I attached what I used to test if anyone
> else wants a go.

I get about 0.1% difference on my machine. Both ways boil down to (on gcc) 3 instructions with low latency. The later ones need the prior results to execute, which I think is what the XXX comment "isn't great" was referring to.

The new coding is more mysterious (does it do the right thing on all platforms?), so I guess the original is still the way to go unless we get a better idea.

--
John Naylor
EDB: http://www.enterprisedb.com
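[Editor's sketch] For reference, the two index computations side by side, with illustrative constants (the real shift value would depend on the block's chunk capacity). Note that right-shifting a negative value is implementation-defined in C -- which is part of the "does it do the right thing on all platforms?" doubt above -- though all mainstream compilers use an arithmetic shift:

```c
#define SLAB_FREELIST_COUNT ((1 << 3) + 1)  /* 9 lists: "empty" plus 8 buckets */
#define SLAB_FREELIST_SHIFT 3               /* illustrative: buckets of 8 chunks */

/*
 * Original coding: masks the low bits of freecount, so successive counts
 * cycle round-robin through the lists rather than grouping nearby counts
 * into the same bucket.
 */
static int
freelist_index_masked(int freecount)
{
    return (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);
}

/*
 * David's alternative: a branch-free ceiling divide, so counts 1..8 share
 * list 1, 9..16 share list 2, and so on -- histogram-style buckets.
 * Relies on arithmetic right shift of negative values.
 */
static int
freelist_index_ceil(int freecount)
{
    return -(-freecount >> SLAB_FREELIST_SHIFT);
}
```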
Re: Optimizing Node Files Support
On Thu, Dec 1, 2022 at 8:02 PM Ranier Vilela wrote: > > Hi, > > I believe that has room for improving generation node files. > > The patch attached reduced the size of generated files by 27 kbytes. > From 891 kbytes to 864 kbytes. > > About the patch: > 1. Avoid useless attribution when from->field is NULL, once that > the new node is palloc0. > > 2. Avoid useless declaration variable Size, when it is unnecessary. Not useless -- it prevents a multiple evaluation hazard, which this patch introduces. > 3. Optimize comparison functions like memcmp and strcmp, using > a short-cut comparison of the first element. Not sure if the juice is worth the squeeze. Profiling would tell. > 4. Switch several copy attributions like COPY_SCALAR_FIELD or COPY_LOCATION_FIELD > by one memcpy call. My first thought is, it would cause code churn. > 5. Avoid useless attribution, ignoring the result of pg_strtok when it is unnecessary. Looks worse. -- John Naylor EDB: http://www.enterprisedb.com
Re: initdb: Refactor PG_CMD_PUTS loops
On Thu, Dec 1, 2022 at 5:02 PM Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > > Keeping the SQL commands that initdb runs in string arrays before > feeding them to PG_CMD_PUTS() seems unnecessarily verbose and > inflexible. In some cases, the array only has one member. In other > cases, one might want to use PG_CMD_PRINTF() instead, to parametrize a > command, but that would require breaking up the loop or using > workarounds like replace_token(). This patch unwinds all that; it's > much simpler that way. +1, I can't think of a reason to keep the current coding -- John Naylor EDB: http://www.enterprisedb.com
Re: [PATCH] Add native windows on arm64 support
On Fri, Dec 2, 2022 at 12:20 AM Niyas Sait wrote:
>
> On 07/11/2022 06:58, Michael Paquier wrote:
> > Seems so. Hmm, where does _ARM64_BARRIER_SY come from? Perhaps it
> > would be better to have a comment referring to it from a different
> > place than the forums of arm, like some actual docs?
>
> _ARM64_BARRIER_SY is defined in Microsoft Arm64 intrinsic documentation -
> https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-170#BarrierRestrictions

In particular, at the bottom of that section: "For the __isb intrinsic, the only restriction that is currently valid is _ARM64_BARRIER_SY; all other values are reserved by the architecture."

This corresponds to https://developer.arm.com/documentation/ddi0596/2021-06/Base-Instructions/ISB--Instruction-Synchronization-Barrier- which says "SY Full system barrier operation, encoded as CRm = 0b. Can be omitted."

> I couldn't find something more official for the sse2neon library part.

Not quite sure what this is referring to, but it seems we can just point to the __aarch64__ section in the same file, which uses the same instruction:

spin_delay(void)
{
	__asm__ __volatile__(
		" isb;				\n");
}

...and which already explains the choice with a comment.

About v4:

+ * Use _mm_pause (x64) or __isb(arm64) intrinsic instead of rep nop.

Need a space here after __isb.

+  if cc.get_id() == 'msvc'
+    cdata.set('USE_ARMV8_CRC32C', false)
+    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
+    have_optimized_crc = true
+  else

That seems like a heavy-handed way to force it. Could we just use the same gating in the test program that the patch puts in the code of interest? Namely:

+#ifndef _MSC_VER
 #include
+#endif

--
John Naylor
EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Dec 1, 2022 at 3:03 PM Masahiko Sawada wrote: > > On Thu, Dec 1, 2022 at 4:00 PM John Naylor wrote: > > > > The bigger question in my mind is: Why is there an atomic variable in backend-local memory? > > Because I use the same radix_tree and radix_tree_control structs for > non-parallel and parallel vacuum. Therefore, radix_tree_control is > allocated in DSM for parallel-vacuum cases or in backend-local memory > for non-parallel vacuum cases. Ok, that could be yet another reason to compile local- and shared-memory functionality separately, but now I'm wondering why there are atomic variables at all, since there isn't yet any locking support. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Nov 30, 2022 at 2:28 PM Masahiko Sawada wrote:
>
> On Fri, Nov 25, 2022 at 5:00 PM John Naylor wrote:
> > These hacks were my work, but I think we can improve that by having two versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized nodes. For that to work, in "init-node" we'd need a branch to set fanout to zero for node256. That should be fine -- it already has to branch for memset'ing node128's indexes to 0xFF.
>
> Since the node has fanout regardless of fixed-sized and
> variable-sized

As currently coded, yes. But that's not strictly necessary, I think.

> , only node256 is the special case where the fanout in
> the node doesn't match the actual fanout of the node. I think if we
> want to have two versions of NODE_HAS_FREE_SLOT, we can have one for
> node256 and one for other classes. Thoughts? In your idea, for
> NODE_HAS_FREE_SLOT for fixed-sized nodes, you meant like the
> following?
>
> #define FIXED_NODE_HAS_FREE_SLOT(node, class) \
>     (node->base.n.count < rt_size_class_info[class].fanout)

Right, and the other one could be VAR_NODE_...

--
John Naylor
EDB: http://www.enterprisedb.com
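[Editor's sketch] A hypothetical shape for the pair of macros being discussed. The struct layout and the size-class table here are simplified placeholders, not the patch's actual definitions:

```c
#include <stdint.h>

typedef struct rt_node
{
    uint16_t count;   /* current number of children */
    uint8_t  fanout;  /* capacity; 0 for node256, since 256 doesn't fit in uint8 */
} rt_node;

typedef struct rt_size_class_elem
{
    int fanout;
} rt_size_class_elem;

/* illustrative table of size classes, ending with the 256-way class */
static const rt_size_class_elem rt_size_class_info[] = {
    {4}, {15}, {32}, {125}, {256},
};

/* fixed-sized classes: capacity is a compile-time property of the class */
#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
    ((node)->count < rt_size_class_info[class].fanout)

/* variable-sized classes: capacity is stored in the node itself */
#define VAR_NODE_HAS_FREE_SLOT(node) \
    ((node)->count < (node)->fanout)
```

The point of the split is that the fixed version never reads the node's fanout field, so node256's zero placeholder is never consulted there; the table lookup against a const array can be folded at compile time.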
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Nov 30, 2022 at 11:09 PM Masahiko Sawada wrote: > > I've investigated this issue and have a question about using atomic > variables on palloc'ed memory. In non-parallel vacuum cases, > radix_tree_control is allocated via aset.c. IIUC in 32-bit machines, > the memory allocated by aset.c is 4-bytes aligned so these atomic > variables are not always 8-bytes aligned. Is there any way to enforce > 8-bytes aligned memory allocations in 32-bit machines? The bigger question in my mind is: Why is there an atomic variable in backend-local memory? -- John Naylor EDB: http://www.enterprisedb.com
Re: Improve performance of pg_strtointNN functions
On Thu, Dec 1, 2022 at 6:42 AM David Rowley wrote: > > I was thinking that we should likely apply this before doing the hex > literals, which is the main focus of [1]. The reason being is so that > that patch can immediately have faster conversions by allowing the > compiler to use bit shifting instead of other means of multiplying by > a power-of-2 number. I'm hoping this removes a barrier for Peter from > the small gripe I raised on that thread about the patch having slower > than required hex, octal and binary string parsing. I don't see why the non-decimal literal patch needs to be "immediately" faster? If doing this first leads to less code churn, that's another consideration, but you haven't made that argument. -- John Naylor EDB: http://www.enterprisedb.com
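[Editor's sketch] To illustrate the optimization being discussed -- a power-of-2 base letting the compiler turn the accumulation multiply into a shift -- here is a minimal hex parser; this is an illustration, not the actual pg_strtoint code:

```c
/*
 * Minimal sketch: with base 16, "acc * 16" in the accumulation loop can be
 * compiled to "acc << 4", which is what makes the non-decimal (hex, octal,
 * binary) parsing paths cheap.
 */
static int
hex_digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;                     /* not a hex digit */
}

/* Returns 1 on success, 0 on empty input or invalid digit (no overflow check). */
static int
parse_hex(const char *s, unsigned long long *result)
{
    unsigned long long acc = 0;
    int d;

    if (*s == '\0')
        return 0;
    while (*s != '\0')
    {
        if ((d = hex_digit_value(*s++)) < 0)
            return 0;
        acc = acc * 16 + d;        /* compiles to (acc << 4) + d */
    }
    *result = acc;
    return 1;
}
```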
Re: Prefetch the next tuple's memory during seqscans
On Wed, Nov 23, 2022 at 4:58 AM David Rowley wrote:

> My current thoughts are that it might be best to go with 0005 to start
> with.

+1

> I know Melanie is working on making some changes in this area,
> so perhaps it's best to leave 0002 until that work is complete.

There seem to be some open questions about that one as well.

I reran the same test in [1] (except I don't have the ability to lock clock speed or affect huge pages) on an older CPU from 2014 (Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz, kernel 3.10, gcc 4.8) with good results:

HEAD:

Testing a1
latency average = 965.462 ms
Testing a2
latency average = 1054.608 ms
Testing a3
latency average = 1078.263 ms
Testing a4
latency average = 1120.933 ms
Testing a5
latency average = 1162.753 ms
Testing a6
latency average = 1298.876 ms
Testing a7
latency average = 1228.775 ms
Testing a8
latency average = 1293.535 ms

0001+0005:

Testing a1
latency average = 791.224 ms
Testing a2
latency average = 876.421 ms
Testing a3
latency average = 911.039 ms
Testing a4
latency average = 981.693 ms
Testing a5
latency average = 998.176 ms
Testing a6
latency average = 979.954 ms
Testing a7
latency average = 1066.523 ms
Testing a8
latency average = 1030.235 ms

I then tested a Power8 machine (also kernel 3.10, gcc 4.8). Configure reports "checking for __builtin_prefetch... yes", but I don't think it does anything here, as the results are within noise level. A quick search didn't turn up anything informative on this platform, and I'm not motivated to dig deeper. In any case, it doesn't make things worse.

HEAD:

Testing a1
latency average = 1402.163 ms
Testing a2
latency average = 1442.971 ms
Testing a3
latency average = 1599.188 ms
Testing a4
latency average = 1664.397 ms
Testing a5
latency average = 1782.091 ms
Testing a6
latency average = 1860.655 ms
Testing a7
latency average = 1929.120 ms
Testing a8
latency average = 2021.100 ms

0001+0005:

Testing a1
latency average = 1433.080 ms
Testing a2
latency average = 1428.369 ms
Testing a3
latency average = 1542.406 ms
Testing a4
latency average = 1642.452 ms
Testing a5
latency average = 1737.173 ms
Testing a6
latency average = 1828.239 ms
Testing a7
latency average = 1920.909 ms
Testing a8
latency average = 2036.922 ms

[1] https://www.postgresql.org/message-id/CAFBsxsHqmH_S%3D4apc5agKsJsF6xZ9f6NaH0Z83jUYv3EgySHfw%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
There are a few things up in the air, so I'm coming back to this list to summarize and add a recent update:

On Mon, Nov 14, 2022 at 7:59 PM John Naylor wrote:
>
> - See how much performance we actually gain from tagging the node kind.

Needs a benchmark that has enough branch mispredicts and L2/3 misses to show a benefit. Otherwise either neutral or worse in its current form, depending on compiler(?). Put off for later.

> - Try additional size classes while keeping the node kinds to only four.

This is relatively simple and effective. If only one additional size class (total 5) is coded as a placeholder, I imagine it will be easier to rebase shared memory logic than using this technique everywhere possible.

> - Optimize node128 insert.

I've attached a rough start at this. The basic idea is borrowed from our bitmapset nodes, so we can iterate over and operate on word-sized (32- or 64-bit) types at a time, rather than bytes. To make this easier, I've moved some of the lower-level macros and types from bitmapset.h/.c to pg_bitutils.h. That's probably going to need a separate email thread to resolve the coding style clash this causes, so that can be put off for later. This is not meant to be included in the next patchset.

For demonstration purposes, I get these results with a function that repeatedly deletes the last value from a mostly-full node128 leaf and re-inserts it:

select * from bench_node128_load(120);

v11:

NOTICE:  num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
    120 | 14400 |           208304 |               56

v11 + 0006 addendum:

NOTICE:  num_keys = 14400, height = 1, n1 = 0, n4 = 0, n15 = 0, n32 = 0, n61 = 0, n128 = 121, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_sparseload_ms
--------+-------+------------------+------------------
    120 | 14400 |           208816 |               34

I didn't test inner nodes, but I imagine the difference is bigger.
This bitmap style should also be used for the node256-leaf isset array simply to be consistent and avoid needing single-use macros, but that has not been done yet. It won't make a difference for performance because there is no iteration there. > - Try templating out the differences between local and shared memory. I hope to start this sometime after the crashes on 32-bit are resolved. -- John Naylor EDB: http://www.enterprisedb.com diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql index 67ba568531..2fd689aa91 100644 --- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql +++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql @@ -63,3 +63,14 @@ OUT rt_search_ms int8 returns record as 'MODULE_PATHNAME' LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE; + +create function bench_node128_load( +fanout int4, +OUT fanout int4, +OUT nkeys int8, +OUT rt_mem_allocated int8, +OUT rt_sparseload_ms int8 +) +returns record +as 'MODULE_PATHNAME' +LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE; diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c index e69be48448..b035b3a747 100644 --- a/contrib/bench_radix_tree/bench_radix_tree.c +++ b/contrib/bench_radix_tree/bench_radix_tree.c @@ -31,6 +31,7 @@ PG_FUNCTION_INFO_V1(bench_shuffle_search); PG_FUNCTION_INFO_V1(bench_load_random_int); PG_FUNCTION_INFO_V1(bench_fixed_height_search); PG_FUNCTION_INFO_V1(bench_search_random_nodes); +PG_FUNCTION_INFO_V1(bench_node128_load); static uint64 tid_to_key_off(ItemPointer tid, uint32 *off) @@ -552,3 +553,85 @@ finish_search: rt_free(rt); PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls))); } + +Datum +bench_node128_load(PG_FUNCTION_ARGS) +{ + int fanout = PG_GETARG_INT32(0); + radix_tree *rt; + TupleDesc tupdesc; + TimestampTz start_time, + end_time; + longsecs; + int usecs; + int64 rt_sparseload_ms; + Datum values[5]; + boolnulls[5]; + + uint64 r, + h; + int 
key_id; + + /* Build a tuple descriptor for our result type */ + if (get_call_result_type(fcinfo, NULL, ) != TYPEFUNC_COMPOSITE) + elog(ERROR, "return type must be a row type"); + + rt = rt_create(CurrentMemoryContext); + + key_id = 0; + + for (r = 1; r <= fanout; r++) + { + for (h = 1; h <= fanout; h++) + { + uint64 key; + + key = (r << 8) | (h); + + key_id++; + rt_set(rt, key, key_id); + } + } + + rt_stats(rt);
Re: [PoC] Improve dead tuple storage for lazy vacuum
> The fix is easy enough -- set the child pointer to null upon deletion, but I'm somewhat astonished that the regression tests didn't hit this. I do still intend to replace this code with something faster, but before I do so the tests should probably exercise the deletion paths more. Since VACUUM Oops. I meant to finish with "Since VACUUM doesn't perform deletion we didn't have an opportunity to detect this during that operation." -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
While creating a benchmark for inserting into node128-inner, I found a bug. If a caller deletes from a node128, the slot index is set to invalid, but the child pointer is still valid. Do that a few times, and every child pointer is valid, even if no slot index points to it. When the next inserter comes along, something surprising happens. This function:

/* Return an unused slot in node-128 */
static int
node_inner_128_find_unused_slot(rt_node_inner_128 *node, uint8 chunk)
{
	int		slotpos = 0;

	Assert(!NODE_IS_LEAF(node));
	while (node_inner_128_is_slot_used(node, slotpos))
		slotpos++;

	return slotpos;
}

...passes an integer to this function, whose parameter is a uint8:

/* Is the slot in the node used? */
static inline bool
node_inner_128_is_slot_used(rt_node_inner_128 *node, uint8 slot)
{
	Assert(!NODE_IS_LEAF(node));
	return (node->children[slot] != NULL);
}

...so instead of growing the node unnecessarily or segfaulting, it enters an infinite loop doing this:

	add	eax, 1
	movzx	ecx, al
	cmp	QWORD PTR [rbx+264+rcx*8], 0
	jne	.L147

The fix is easy enough -- set the child pointer to null upon deletion, but I'm somewhat astonished that the regression tests didn't hit this. I do still intend to replace this code with something faster, but before I do so the tests should probably exercise the deletion paths more. Since VACUUM

--
John Naylor
EDB: http://www.enterprisedb.com
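[Editor's sketch] The narrowing conversion at the heart of this can be reproduced in isolation (illustrative names, not the patch's code): an int slot position of 256 silently wraps to 0 when passed through the uint8 parameter, so when all 256 addressable slots appear occupied the scan revisits slot 0 instead of terminating:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors node_inner_128_is_slot_used: the uint8_t parameter is the trap. */
static bool
is_slot_used(void *const children[], uint8_t slot)
{
    return children[slot] != NULL;
}

/*
 * Mirrors the search loop, but with an explicit bound so the wraparound is
 * observable as a wrong answer instead of an infinite loop: once slotpos
 * reaches 256, is_slot_used() silently tests slot 0 again.
 */
static int
find_unused_slot(void *const children[], int nslots)
{
    int slotpos = 0;

    while (slotpos < nslots && is_slot_used(children, slotpos))
        slotpos++;
    return slotpos;
}
```

With every child pointer non-NULL (the state the deletion bug produced), the unbounded original loop can never see a free slot: each increment past 255 wraps back into the occupied range.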
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada wrote: > > [v11] There is one more thing that just now occurred to me: In expanding the use of size classes, that makes rebasing and reworking the shared memory piece more work than it should be. That's important because there are still some open questions about the design around shared memory. To keep unnecessary churn to a minimum, perhaps we should limit size class expansion to just one (or 5 total size classes) for the near future? -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
ASS(n32, RT_CLASS_32_PARTIAL))

This macro doesn't really improve readability -- it obscures what is being tested, and the name implies the "else" branch means "node doesn't need to grow class", which is false. If we want to simplify expressions in this block, I think it'd be more effective to improve the lines that follow:

+			memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
+			new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;

Maybe we can have const variables old_size and new_fanout to break out the array lookup? While I'm thinking of it, these arrays should be const so the compiler can avoid runtime lookups. Speaking of...

+/* Copy both chunks and children/values arrays */
+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+						  uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+	/* For better code generation */
+	if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
+		pg_unreachable();
+
+	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}

When I looked at this earlier, I somehow didn't go far enough -- why are we passing the runtime count in the first place? This function can only be called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last parameter to memcpy should evaluate to a compile-time constant, right? Even when we add node shrinking in the future, the constant should be correct, IIUC?

-	.fanout = 256,
+	/* technically it's 256, but we can't store that in a uint8,
+	 and this is the max size class so it will never grow */
+	.fanout = 0,

-	Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
+	Assert(((rt_node *) n256)->fanout == 0);
+	Assert(chunk_exists || ((rt_node *) n256)->count < 256);

These hacks were my work, but I think we can improve that by having two versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized nodes. 
For that to work, in "init-node" we'd need a branch to set fanout to zero for node256. That should be fine -- it already has to branch for memset'ing node128's indexes to 0xFF. -- John Naylor EDB: http://www.enterprisedb.com
Re: Non-decimal integer literals
On Wed, Nov 23, 2022 at 3:54 PM David Rowley wrote:
>
> Going by [1], clang will actually use multiplication by 16 to
> implement the former. gcc is better and shifts left by 4, so likely
> won't improve things for gcc. It seems worth doing it this way for
> anything that does not have HAVE__BUILTIN_OP_OVERFLOW anyway.

FWIW, gcc 12.2 generates an imul on my system when compiling in situ. I've found it useful to run godbolt locally* and load the entire PG file (nicer to read than plain objdump) -- compilers can make different decisions when going from isolated snippets to within full functions.

* To do that:
clone from https://github.com/compiler-explorer/compiler-explorer
install npm 16
run "make", which when finished will show the localhost url
add the right flags, which in this case were

-Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-format-truncation -Wno-stringop-truncation -O2 -I/path/to/srcdir/src/include -I/path/to/builddir/src/include -D_GNU_SOURCE

-- John Naylor EDB: http://www.enterprisedb.com
Re: Non-decimal integer literals
On Tue, Nov 22, 2022 at 8:36 PM Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > > On 15.11.22 11:31, Peter Eisentraut wrote: > > On 14.11.22 08:25, John Naylor wrote: > >> Regarding the patch, it looks good overall. My only suggestion would > >> be to add a regression test for just below and just above overflow, at > >> least for int2. > > > > ok > > This was a valuable suggestion, because this found some breakage. In > particular, the handling of grammar-level literals that overflow to > "Float" was not correct. (The radix prefix was simply stripped and > forgotten.) So I added a bunch more tests for this. Here is a new patch. Looks good to me. -- John Naylor EDB: http://www.enterprisedb.com
Re: Prefetch the next tuple's memory during seqscans
On Wed, Nov 23, 2022 at 5:00 AM David Rowley wrote: > > On Thu, 3 Nov 2022 at 22:09, John Naylor wrote: > > I tried a similar test, but with text fields of random length, and there is improvement here: > > Thank you for testing that. Can you share which CPU this was on? That was an Intel Core i7-10750H -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Nov 21, 2022 at 3:43 PM Masahiko Sawada wrote: > > On Mon, Nov 21, 2022 at 4:20 PM John Naylor > wrote: > > Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space: > > > > node4: 5 + 4+(7) + 4*8 = 48 bytes > > node3: 5 + 3 + 3*8 = 32 > > IIUC if we store the fanout member only in variable-sized nodes, > rt_node has only count, shift, and chunk, so 4 bytes in total. If so, > the size of node3 (ie. fixed-sized node) is (4 + 3 + (1) + 3*8)? The > size doesn't change but there is 1 byte padding space. I forgot to mention I'm assuming no pointer-tagging for this exercise. You've demonstrated it can be done in a small amount of code, and I hope we can demonstrate a speedup in search. Just in case there is some issue with portability, valgrind, or some other obstacle, I'm being pessimistic in my calculations. > Also, even if we have the node3 a variable-sized node, size class 1 > for node3 could be a good choice since it also doesn't need padding > space and could be a good alternative to path compression. > > node3 : 5 + 3 + 3*8 = 32 bytes > size class 1 : 5 + 3 + 1*8 = 16 bytes Precisely! I have that scenario in my notes as well -- it's quite compelling. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Nov 18, 2022 at 2:48 PM I wrote: > One issue with this patch: The "fanout" member is a uint8, so it can't hold 256 for the largest node kind. That's not an issue in practice, since we never need to grow it, and we only compare that value with the count in an Assert(), so I just set it to zero. That does break an invariant, so it's not great. We could use 2 bytes to be strictly correct in all cases, but that limits what we can do with the smallest node kind. Thinking about this part, there's an easy resolution -- use a different macro for fixed- and variable-sized node kinds to determine if there is a free slot. Also, I wanted to share some results of adjusting the boundary between the two smallest node kinds. In the hackish attached patch, I modified the fixed height search benchmark to search a small (within L1 cache) tree thousands of times. For the first set I modified node4's maximum fanout and filled it up. For the second, I set node4's fanout to 1, which causes 2+ to spill to node32 (actually the partially-filled node15 size class as demoed earlier). 
node4:

NOTICE:  num_keys = 16, height = 3, n4 = 15, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16520 |          0 |            3

NOTICE:  num_keys = 81, height = 3, n4 = 40, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16456 |          0 |           17

NOTICE:  num_keys = 256, height = 3, n4 = 85, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16456 |          0 |           89

NOTICE:  num_keys = 625, height = 3, n4 = 156, n15 = 0, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            16488 |          0 |          327

node32:

NOTICE:  num_keys = 16, height = 3, n4 = 0, n15 = 15, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      2 |    16 |            16488 |          0 |            5
(1 row)

NOTICE:  num_keys = 81, height = 3, n4 = 0, n15 = 40, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      3 |    81 |            16520 |          0 |           28

NOTICE:  num_keys = 256, height = 3, n4 = 0, n15 = 85, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      4 |   256 |            16408 |          0 |           79

NOTICE:  num_keys = 625, height = 3, n4 = 0, n15 = 156, n32 = 0, n128 = 0, n256 = 0
 fanout | nkeys | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+-------+------------------+------------+--------------
      5 |   625 |            24616 |          0 |          199

In this test, node32 seems slightly faster than node4 with 4 elements, at the cost of more memory.

Assuming the smallest node is fixed size (i.e. fanout/capacity member not part of the common set, so only part of variable-sized nodes), 3 has a nice property: no wasted padding space:

node4: 5 + 4+(7) + 4*8 = 48 bytes
node3: 5 + 3 + 3*8 = 32

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada wrote: > > On Thu, Nov 17, 2022 at 12:24 AM Masahiko Sawada wrote: > > > > On Wed, Nov 16, 2022 at 4:39 PM John Naylor > > wrote: > > > That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case. Digging a bit deeper, I see a flaw in my benchmark: Even though the total distribution of node kinds is decently even, the pattern that the benchmark sees is not terribly random: 3,343,352 branch-misses:u #0.85% of all branches 393,204,959 branches:u Recall a previous benchmark [1] where the leaf node was about half node16 and half node32. Randomizing the leaf node between the two caused branch misses to go from 1% to 2%, causing a noticeable slowdown. Maybe in this new benchmark, each level has a skewed distribution of nodes, giving a smart branch predictor something to work with. We will need a way to efficiently generate keys that lead to a relatively unpredictable distribution of node kinds, as seen by a searcher. Especially in the leaves (or just above the leaves), since those are less likely to be cached. > > I'll also run the test on my environment and do the investigation tomorrow. > > > > FYI I've not tested the patch you shared today but here are the > benchmark results I did with the v9 patch in my environment (I used > the second filter). I splitted 0004 patch into two patches: a patch > for pure refactoring patch to introduce rt_node_ptr and a patch to do > pointer tagging. Would you be able to share the refactoring patch? And a fix for the failing tests? I'm thinking I want to try the templating approach fairly soon. 
[1] https://www.postgresql.org/message-id/CAFBsxsFEVckVzsBsfgGzGR4Yz%3DJp%3DUxOtjYvTjOz6fOoLXtOig%40mail.gmail.com -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Nov 18, 2022 at 8:20 PM Masahiko Sawada wrote:
>
> FYI I've not tested the patch you shared today but here are the
> benchmark results I did with the v9 patch in my environment (I used
> the second filter). I splitted 0004 patch into two patches: a patch
> for pure refactoring patch to introduce rt_node_ptr and a patch to do
> pointer tagging.
>
> v9 0003 patch: 1113 1114 1114
> introduce rt_node_ptr: 1127 1128 1128
> pointer tagging : 1085 1087 1086 (equivalent to 0004 patch)
>
> In my environment, rt_node_ptr seemed to lead some overhead but
> pointer tagging had performance benefits. I'm not sure the reason why
> the results are different from yours. The radix tree stats shows the
> same as your tests.

There is less than 2% difference from the median set of results, so it's hard to distinguish from noise. I did a fresh rebuild and retested with the same results: about 15% slowdown in v9 0004. That's strange.

On Wed, Nov 16, 2022 at 10:24 PM Masahiko Sawada wrote:
> > filter = (((uint64) 1<<32) | (0xFF<<24));
> > LOG: num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161
> >
> > 1) Any idea why the tree height would be reported as 7 here? I didn't expect that.
>
> In my environment, (0xFF<<24) is 0xFFFFFFFFFF000000, not 0xFF000000.
> It seems the filter should be (((uint64) 1<<32) | ((uint64)
> 0xFF<<24)).

Ugh, sign extension, brain fade on my part. Thanks, I'm glad there was a straightforward explanation.

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Sep 28, 2022 at 1:18 PM I wrote: > Along those lines, one thing I've been thinking about is the number of size classes. There is a tradeoff between memory efficiency and number of branches when searching/inserting. My current thinking is there is too much coupling between size class and data type. Each size class currently uses a different data type and a different algorithm to search and set it, which in turn requires another branch. We've found that a larger number of size classes leads to poor branch prediction [1] and (I imagine) code density. > > I'm thinking we can use "flexible array members" for the values/pointers, and keep the rest of the control data in the struct the same. That way, we never have more than 4 actual "kinds" to code and branch on. As a bonus, when migrating a node to a larger size class of the same kind, we can simply repalloc() to the next size. While the most important challenge right now is how to best represent and organize the shared memory case, I wanted to get the above idea working and out of the way, to be saved for a future time. I've attached a rough implementation (applies on top of v9 0003) that splits node32 into 2 size classes. They both share the exact same base data type and hence the same search/set code, so the number of "kind"s is still four, but here there are five "size classes", so a new case in the "unlikely" node-growing path. The smaller instance of node32 is a "node15", because that's currently 160 bytes, corresponding to one of the DSA size classes. This idea can be applied to any other node except the max size, as we see fit. (Adding a singleton size class would bring it back in line with the prototype, at least as far as memory consumption.) One issue with this patch: The "fanout" member is a uint8, so it can't hold 256 for the largest node kind. That's not an issue in practice, since we never need to grow it, and we only compare that value with the count in an Assert(), so I just set it to zero. 
That does break an invariant, so it's not great. We could use 2 bytes to be strictly correct in all cases, but that limits what we can do with the smallest node kind.

In the course of working on this, I encountered a pain point. Since it's impossible to repalloc in slab, we have to do alloc/copy/free ourselves. That's fine, but the current coding makes too many assumptions about the use cases: rt_alloc_node and rt_copy_node are too entangled with each other and do too much work unrelated to what the names imply. I seem to remember an earlier version had something like rt_node_copy_common that did only...copying. That was much easier to reason about. In 0002 I resorted to doing my own allocation to show what I really want to do, because the new use case doesn't need zeroing and setting values. It only needs to...allocate (and increase the stats counter if built that way).

Future optimization work while I'm thinking of it: rt_alloc_node should be always-inlined and the memset done separately (i.e. not *AllocZero). That way the compiler should be able to generate more efficient zeroing code for smaller nodes. I'll test the numbers on this sometime in the future.

-- John Naylor EDB: http://www.enterprisedb.com

From 6fcc970ae7e31f44fa6b6aface983cadb023cc50 Mon Sep 17 00:00:00 2001
From: John Naylor
Date: Thu, 17 Nov 2022 16:10:44 +0700
Subject: [PATCH v901 2/2] Make node32 variable sized

Add a size class for 15 elements, which corresponds to 160 bytes, an
allocation size used by DSA. When a 16th element is to be inserted,
allocate a larger area and memcpy the entire old node to it.

NB: Zeroing the new area is only necessary if it's for an inner
node128, since insert logic must check for null child pointers.

This technique allows us to limit the node kinds to 4, which
1. limits the number of cases in switch statements
2. allows a possible future optimization to encode the node kind in a
   pointer tag
---
 src/backend/lib/radixtree.c | 141 +++-
 1 file changed, 108 insertions(+), 33 deletions(-)

diff --git a/src/backend/lib/radixtree.c b/src/backend/lib/radixtree.c
index bef1a438ab..f368e750d5 100644
--- a/src/backend/lib/radixtree.c
+++ b/src/backend/lib/radixtree.c
@@ -130,6 +130,7 @@ typedef enum
 typedef enum rt_size_class
 {
 	RT_CLASS_4_FULL = 0,
+	RT_CLASS_32_PARTIAL,
 	RT_CLASS_32_FULL,
 	RT_CLASS_128_FULL,
 	RT_CLASS_256
@@ -147,6 +148,8 @@ typedef struct rt_node
 	uint16		count;

 	/* Max number of children. We can use uint8 because we never need to store 256 */
+	/* WIP: if we don't have a variable sized node4, this should instead be in the base
+	   types as needed, since saving every byte is crucial for the smallest node kind */
 	uint8		fanout;

 	/*
@@ -166,6 +169,8 @@ typedef struct rt_node
 ((node)->base.n.count < (node)->ba
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Nov 16, 2022 at 12:33 PM Masahiko Sawada wrote:
>
> On Wed, Nov 16, 2022 at 1:46 PM John Naylor wrote:
> >
> > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada wrote:
> > > Thanks! Please let me know if there is something I can help with.
> >
> > I didn't get very far because the tests fail on 0004 in rt_verify_node:
> >
> > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242
>
> Which tests do you use to get this assertion failure? I've confirmed
> there is a bug in 0005 patch but without it, "make check-world"
> passed.

Hmm, I started over and rebuilt and it didn't reproduce. Not sure what happened, sorry for the noise.

I'm attaching a test I wrote to stress test branch prediction in search, and while trying it out I found two possible issues. It's based on the random int load test, but tests search speed. Run like this:

select * from bench_search_random_nodes(10 * 1000 * 1000)

It also takes some care to include all the different node kinds, restricting the possible keys by AND-ing with a filter. Here's a simple demo:

filter = ((uint64)1<<40)-1;
LOG: num_keys = 967, height = 4, n4 = 17513814, n32 = 6320, n128 = 62663, n256 = 3130

Just using random integers leads to >99% using the smallest node. I wanted to get close to having the same number of each, but that's difficult while still using random inputs. I ended up using

filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF)

which gives

LOG: num_keys = 9291812, height = 4, n4 = 262144, n32 = 79603, n128 = 182670, n256 = 1024

Which seems okay for the task. One puzzling thing I found while trying various filters is that sometimes the reported tree height would change. For example:

filter = (((uint64) 1<<32) | (0xFF<<24));
LOG: num_keys = 944, height = 7, n4 = 47515559, n32 = 6209, n128 = 62632, n256 = 3161

1) Any idea why the tree height would be reported as 7 here? I didn't expect that. 
2) It seems that 0004 actually causes a significant slowdown in this test (as in the attached, using the second filter above and with turboboost disabled):

v9 0003: 2062 2051 2050
v9 0004: 2346 2316 2321

That means my idea for the pointer struct might have some problems, at least as currently implemented. Maybe in the course of separating out and polishing that piece, an inefficiency will fall out. Or, it might be another reason to template local and shared separately. Not sure yet. I also haven't tried to adjust this test for the shared memory case.

-- John Naylor EDB: http://www.enterprisedb.com

diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 0874201d7e..e0205b364e 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -43,6 +43,14 @@ returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;

+create function bench_search_random_nodes(
+cnt int8,
+OUT mem_allocated int8,
+OUT search_ms int8)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
 create function bench_fixed_height_search(
 fanout int4,
 OUT fanout int4,
diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 7abb237e96..a43fc61c2d 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -29,6 +29,7 @@ PG_FUNCTION_INFO_V1(bench_seq_search);
 PG_FUNCTION_INFO_V1(bench_shuffle_search);
 PG_FUNCTION_INFO_V1(bench_load_random_int);
 PG_FUNCTION_INFO_V1(bench_fixed_height_search);
+PG_FUNCTION_INFO_V1(bench_search_random_nodes);

 static uint64
 tid_to_key_off(ItemPointer tid, uint32 *off)
@@ -347,6 +348,77 @@ bench_load_random_int(PG_FUNCTION_ARGS)
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }

+/* copy of splitmix64() */
+static uint64
+hash64(uint64 x)
+{
+	x ^= x >> 30;
+	x *= UINT64CONST(0xbf58476d1ce4e5b9);
+	x ^= x >> 27;
+	x *= UINT64CONST(0x94d049bb133111eb);
+	x ^= x >> 31;
+	return x;
+}
+
+/* attempts to have a relatively even population of node kinds */
+Datum
+bench_search_random_nodes(PG_FUNCTION_ARGS)
+{
+	uint64		cnt = (uint64) PG_GETARG_INT64(0);
+	radix_tree *rt;
+	TupleDesc	tupdesc;
+	TimestampTz start_time,
+				end_time;
+	long		secs;
+	int			usecs;
+	int64		search_time_ms;
+	Datum		values[2] = {0};
+	bool		nulls[2] = {0};
+	/* from trial and error */
+	const uint64 filter = (((uint64) 0x7F<<32) | (0x07<<24) | (0xFF<<16) | 0xFF);
+
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Nov 16, 2022 at 11:46 AM John Naylor wrote: > > > On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada wrote: > > Thanks! Please let me know if there is something I can help with. > > I didn't get very far because the tests fail on 0004 in rt_verify_node: > > TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242 Actually I do want to offer some general advice. Upthread I recommended a purely refactoring patch that added the node-pointer struct but did nothing else, so that the DSA changes would be smaller. 0004 attempted pointer tagging in the same commit, which makes it no longer a purely refactoring patch, so that 1) makes it harder to tell what part caused the bug and 2) obscures what is necessary for DSA pointers and what was additionally necessary for pointer tagging. Shared memory support is a prerequisite for a shippable feature, but pointer tagging is (hopefully) a performance optimization. Let's keep them separate. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Nov 15, 2022 at 11:59 AM Masahiko Sawada wrote: > Thanks! Please let me know if there is something I can help with. I didn't get very far because the tests fail on 0004 in rt_verify_node: TRAP: failed Assert("n4->chunks[i - 1] < n4->chunks[i]"), File: "../src/backend/lib/radixtree.c", Line: 2186, PID: 18242 -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Nov 14, 2022 at 3:44 PM Masahiko Sawada wrote:
>
> 0004 patch is a new patch supporting a pointer tagging of the node
> kind. Also, it introduces rt_node_ptr we discussed so that internal
> functions use it rather than having two arguments for encoded and
> decoded pointers. With this intermediate patch, the DSA support patch
> became more readable and understandable. Probably we can make it
> smaller further if we move the change of separating the control object
> from radix_tree to the main patch (0002). The patch still needs to be
> polished but I'd like to check if this idea is worthwhile. If we agree
> on this direction, this patch will be merged into the main radix tree
> implementation patch.

Thanks for the new patch set. I've taken a very brief look at 0004 and I think the broad outlines are okay. As you say it needs polish, but before going further, I'd like to do some experiments of my own as I mentioned earlier:

- See how much performance we actually gain from tagging the node kind.
- Try additional size classes while keeping the node kinds to only four.
- Optimize node128 insert.
- Try templating out the differences between local and shared memory. With local memory, the node-pointer struct would be a union, for example. Templating would also reduce branches and re-simplify some internal APIs, but it's likely that would also make the TID store and/or vacuum more complex, because at least some external functions would be duplicated.

I'll set the patch to "waiting on author", but in this case the author is me.

-- John Naylor EDB: http://www.enterprisedb.com
Re: Non-decimal integer literals
On Mon, Oct 10, 2022 at 9:17 PM Peter Eisentraut <peter.eisentr...@enterprisedb.com> wrote:

> Taking another look around ecpg to see how this interacts with C-syntax
> integer literals. I'm not aware of any particular issues, but it's
> understandably tricky.

I can find no discussion in the archives about the commit that added "xch":

commit 6fb3c3f78fbb2296894424f6e3183d339915eac7
Author: Michael Meskes
Date: Fri Oct 15 19:02:08 1999 +

    *** empty log message ***

...and I can't think of why bounds checking a C literal was done like this.

Regarding the patch, it looks good overall. My only suggestion would be to add a regression test for just below and just above overflow, at least for int2.

Minor nits:

- * Process {integer}. Note this will also do the right thing with {decimal},
+ * Process {*integer}. Note this will also do the right thing with {numeric},

I scratched my head for a while, thinking this was Flex syntax, until I realized my brain was supposed to do shell-globbing first, at which point I could see it was referring to multiple Flex rules. I'd try to rephrase.

+T661	Non-decimal integer literals	YES	SQL:202x draft

Is there an ETA yet?

Also, it's not this patch's job to do it, but there are at least a half dozen places that open-code turning a hex char into a number. It might be a good easy "todo item" to unify that.

-- John Naylor EDB: http://www.enterprisedb.com
Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment
On Tue, Nov 8, 2022 at 8:57 AM David Rowley wrote:
> On Tue, 8 Nov 2022 at 05:24, Andres Freund wrote:
> > I doubt there's ever a need to realloc such a pointer? Perhaps we could just
> > elog(ERROR)?
>
> Are you maybe locked into just thinking about your current planned use
> case that we want to allocate BLCKSZ bytes in each case? It does not
> seem impossible to me that someone will want something more than an
> 8-byte alignment and also might want to enlarge the allocation at some
> point. I thought it might be more dangerous not to implement repalloc.
> It might not be clear to someone using palloc_aligned() that there's
> no code path that can call repalloc on the returned pointer.

I can imagine a use case for arrays of cacheline-sized objects.

> TYPEALIGN() will not work correctly unless the alignment is a power of
> 2. We could modify it to, but that would require doing some modular
> maths instead of bitmasking. That seems pretty horrible when the macro
> is given a value that's not constant at compile time as we'd end up
> with a (slow) divide in the code path. I think the restriction is a
> good idea. I imagine there will never be any need to align to anything
> that's not a power of 2.

+1

> > Should we handle the case where we get a suitably aligned pointer from
> > MemoryContextAllocExtended() differently?
>
> Maybe it would be worth the extra check. I'm trying to imagine future
> use cases. Maybe if someone wanted to ensure that we're aligned to
> CPU cache line boundaries then the chances of the pointer already
> being aligned to 64 bytes is decent enough. The problem is it that
> it's too late to save any memory, it just saves a bit of boxing and
> unboxing of the redirect headers.

To my mind the main point of detecting this case is to save memory, so if that's not possible/convenient, special-casing doesn't seem worth it.

-	Assert((char *) chunk > (char *) block);
+	Assert((char *) chunk >= (char *) block);

Is this related or independent? 
-- John Naylor EDB: http://www.enterprisedb.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Fri, Nov 11, 2022 at 2:12 PM Michael Paquier wrote:
>
> On Fri, Nov 11, 2022 at 11:53:08AM +0530, Bharath Rupireddy wrote:
> > On Fri, Nov 11, 2022 at 11:14 AM John Naylor <john.nay...@enterprisedb.com> wrote:
> >> Was there supposed to be an attachment here?
> >
> > Nope. The patches have already been committed -
> > 3bdbdf5d06f2179d4c17926d77ff734ea9e7d525 and
> > 28cc2976a9cf0ed661dbc55f49f669192cce1c89.
>
> The committed patches are pretty much the same as the last version
> sent on this thread, except that the changes have been split across
> the files they locally impact, with a few simplification tweaks to
> the comments. Hence, I did not see any need to send a new version for
> this case.

Ah, I missed that -- sorry for the noise.

-- John Naylor EDB: http://www.enterprisedb.com
Re: slab allocator performance issues
On Wed, Oct 12, 2022 at 4:37 PM David Rowley wrote:
> [v3]

+	/*
+	 * Compute a shift that guarantees that shifting chunksPerBlock with it
+	 * yields is smaller than SLAB_FREELIST_COUNT - 1 (one freelist is used for full blocks).
+	 */
+	slab->freelist_shift = 0;
+	while ((slab->chunksPerBlock >> slab->freelist_shift) >= (SLAB_FREELIST_COUNT - 1))
+		slab->freelist_shift++;

+	/*
+	 * Ensure, without a branch, that index 0 is only used for blocks entirely
+	 * without free chunks.
+	 * XXX: There probably is a cheaper way to do this. Needing to shift twice
+	 * by slab->freelist_shift isn't great.
+	 */
+	index = (freecount + (1 << slab->freelist_shift) - 1) >> slab->freelist_shift;

How about something like

#define SLAB_FREELIST_COUNT ((1<<3) + 1)
	index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);

and dispense with both freelist_shift and the loop that computes it?

-- John Naylor EDB: http://www.enterprisedb.com
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
On Tue, Nov 8, 2022 at 11:08 AM Michael Paquier wrote: > > So, I have looked at that, and at the end concluded that Andres' > suggestion to use PGAlignedBlock in pg_write_zeros() will serve better > in the long run. Thomas has mentioned upthread that some of the > comments don't need to be that long, so I have tweaked these to be > minimal, and updated a few more areas. Note that this has been split > into two commits: one to introduce the new routine in file_utils.c and > a second for the switch in walmethods.c. Was there supposed to be an attachment here? -- John Naylor EDB: http://www.enterprisedb.com
Re: Direct I/O
On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro wrote: > Memory alignment patches: > > Direct I/O generally needs to be done to/from VM page-aligned > addresses, but only "standard" 4KB pages, even when larger VM pages > are in use (if there is an exotic system where that isn't true, it > won't work). We need to deal with buffers on the stack, the heap and > in shmem. For the stack, see patch 0001. For the heap and shared > memory, see patch 0002, but David Rowley is going to propose that part > separately, as MemoryContext API adjustments are a specialised enough > topic to deserve another thread; here I include a copy as a > dependency. The main direct I/O patch is 0003. One thing to note: Currently, a request to aset above 8kB must go into a dedicated block. Not sure if it's a coincidence that that matches the default PG page size, but if allocating pages on the heap is hot enough, maybe we should consider raising that limit. Although then, aligned-to-4kB requests would result in 16kB chunks requested unless a different allocator was used. -- John Naylor EDB: http://www.enterprisedb.com
Re: Add palloc_aligned() to allow arbitrary power of 2 memory alignment
On Tue, Nov 8, 2022 at 8:57 AM David Rowley wrote:
> Is there anything we could align to CPU cacheline size that would
> speed something up?

InitCatCache() already has this, which could benefit from simpler notation.

	/*
	 * Allocate a new cache structure, aligning to a cacheline boundary
	 *
	 * Note: we rely on zeroing to initialize all the dlist headers correctly
	 */
	sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE;
	cp = (CatCache *) CACHELINEALIGN(palloc0(sz));

-- John Naylor EDB: http://www.enterprisedb.com
Re: Improve logging when using Huge Pages
On Thu, Nov 3, 2022 at 8:11 AM Thomas Munro wrote: > > I wonder if the problem here is that people are reluctant to add noise > to every starting system. There are people who have not configured > their system and don't want to see that noise, and then some people > have configured their system and would like to know about it if it > doesn't work so they can be aware of that, but don't want to use "off" > because they don't want a hard failure. Would it be better if there > were a new level "try_log" (or something), which only logs a message > if it tries and fails? I think the best thing to do is change huge_pages='on' to log a WARNING and fallback to regular pages if the mapping fails. That way, both dev and prod can keep the same settings, since 'on' will have both visibility and robustness. I don't see a good reason to refuse to start -- seems like an anti-pattern. -- John Naylor EDB: http://www.enterprisedb.com
Re: remap the .text segment into huge pages at run time
On Sat, Nov 5, 2022 at 3:27 PM Andres Freund wrote:
> > simplified it. Interestingly enough, looking through the commit history,
> > they used to align the segments via linker flags, but took it out here:
> >
> > https://github.com/intel/iodlr/pull/25#discussion_r397787559
> >
> > ...saying "I'm not sure why we added this". :/
>
> That was about using a linker script, not really linker flags though.

Oops, the commit I was referring to pointed to that discussion, but I should have shown it instead:

--- a/large_page-c/example/Makefile
+++ b/large_page-c/example/Makefile
@@ -28,7 +28,6 @@ OBJFILES= \
   filler16.o \

 OBJS=$(addprefix $(OBJDIR)/,$(OBJFILES))
-LDFLAGS=-Wl,-z,max-page-size=2097152

But from what you're saying, this flag wouldn't have been enough anyway...

> I don't think the dummy functions are a good approach, there were plenty
> things after it when I played with them.

To be technical, the point wasn't to have no code after it, but to have no *hot* code *before* it, since with the iodlr approach the first 1.99MB of .text is below the first aligned boundary within that section. But yeah, I'm happy to ditch that hack entirely.

> > > With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
> > > are padded to the next 2MiB boundary. However the OS / dynamic loader only
> > > maps the necessary part, not all the zero padding.
> > >
> > > This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> > > mremap() to increase the length of the mapping.
> >
> > I see, interesting. What location are you passing for madvise() and
> > mremap()? The beginning of the segment (for me has .init/.plt) or an
> > aligned boundary within .text?

> /*
>  * Make huge pages out of it. Requires at least linux 6.1. We could
>  * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
>  * much in older kernels.
>  */

About madvise(), I take it MADV_HUGEPAGE and MADV_COLLAPSE only work for THP? The man page seems to indicate that.
In the support work I've done, the standard recommendation is to turn THP off, especially if they report sudden performance problems. If explicit HP's are used for shared mem, maybe THP is less of a risk? I need to look back at the tests that led to that advice...

> A real version would have to open /proc/self/maps and do this for at least

I can try and generalize your above sketch into a v2 patch.

> postgres' r-xp mapping. We could do it for libraries too, if they're suitably
> aligned (both in memory and on-disk).

It looks like plpgsql is only 27 standard pages in size...

Regarding glibc, we could try moving a couple of the hotter functions into PG, using smaller and simpler coding, if that has better frontend cache behavior. The paper "Understanding and Mitigating Front-End Stalls in Warehouse-Scale Computers" talks about this, particularly section 4.4 regarding memcmp().

> > I quickly tried to align the segments with the linker and then in my patch
> > have the address for mmap() rounded *down* from the .text start to the
> > beginning of that segment. It refused to start without logging an error.
>
> Hm, what linker was that? I did note that you need some additional flags for
> some of the linkers.

BFD, but I wouldn't worry about that failure too much, since the mremap()/madvise() strategy has a lot fewer moving parts.

On the subject of linkers, though, one thing that tripped me up was trying to change the linker with Meson. First I tried -Dc_args='-fuse-ld=lld' but that led to warnings like this:

/usr/bin/ld: warning: -z separate-loadable-segments ignored

When using this in the top level meson.build:

elif host_system == 'linux'
  sema_kind = 'unnamed_posix'
  cppflags += '-D_GNU_SOURCE'
  # Align the loadable segments to 2MB boundaries to support remapping to
  # huge pages.
  ldflags += cc.get_supported_link_arguments([
    '-Wl,-zmax-page-size=0x200000',
    '-Wl,-zcommon-page-size=0x200000',
    '-Wl,-zseparate-loadable-segments'
  ])

According to https://mesonbuild.com/howtox.html#set-linker I need to add CC_LD=lld to the env vars before invoking, which got rid of the warning. Then I wanted to verify that lld was actually used, and in https://releases.llvm.org/14.0.0/tools/lld/docs/index.html it says I can run this and it should show "Linker: LLD", but that doesn't appear for me:

$ readelf --string-dump .comment inst-perf/bin/postgres

String dump of section '.comment':
  [     0]  GCC: (GNU) 12.2.1 20220819 (Red Hat 12.2.1-2)

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Nov 4, 2022 at 10:25 PM Masahiko Sawada wrote: > > For parallel heap pruning, multiple workers will insert key-value > pairs to the radix tree concurrently. The simplest solution would be a > single lock to protect writes but the performance will not be good. > Another solution would be that we can divide the tables into multiple > ranges so that keys derived from TIDs are not conflicted with each > other and have parallel workers process one or more ranges. That way, > parallel vacuum workers can build *sub-trees* and the leader process > can merge them. In use cases of lazy vacuum, since the write phase and > read phase are separated the readers don't need to worry about > concurrent updates. It's a good idea to use ranges for a different reason -- readahead. See commit 56788d2156fc3, which aimed to improve readahead for sequential scans. It might work to use that as a model: Each worker prunes a range of 64 pages, keeping the dead tids in a local array. At the end of the range: lock the tid store, enter the tids into the store, unlock, free the local array, and get the next range from the leader. It's possible contention won't be too bad, and I suspect using small local arrays as-we-go would be faster and use less memory than merging multiple sub-trees at the end. > I've attached a draft patch for lazy vacuum integration that can be > applied on top of v8 patches. The patch adds a new module called > TIDStore, an efficient storage for TID backed by radix tree. Lazy > vacuum and parallel vacuum use it instead of a TID array. The patch > also introduces rt_detach() that was missed in 0002 patch. It's a very > rough patch but I hope it helps in considering lazy vacuum > integration, radix tree APIs, and shared radix tree functionality. It does help, good to see this. -- John Naylor EDB: http://www.enterprisedb.com
Re: remap the .text segment into huge pages at run time
On Sat, Nov 5, 2022 at 1:33 AM Andres Freund wrote: > > I wonder how far we can get with just using the linker hints to align > > sections. I know that the linux folks are working on promoting sufficiently > > aligned executable pages to huge pages too, and might have succeeded already. > > > > IOW, adding the linker flags might be a good first step. > > Indeed, I did see that that works to some degree on the 5.19 kernel I was > running. However, it never seems to get around to using huge pages > sufficiently to compete with explicit use of huge pages. Oh nice, I didn't know that! There might be some threshold of pages mapped before it does so. At least, that issue is mentioned in that paper linked upthread for FreeBSD. > More interestingly, a few days ago, a new madvise hint, MADV_COLLAPSE, was > added into linux 6.1. That explicitly remaps a region and uses huge pages for > it. Of course that's going to take a while to be widely available, but it > seems like a safer approach than the remapping approach from this thread. I didn't know that either, funny timing. > I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode > the address / length), and it seems to work nicely. > > With the weird caveat that on fs one needs to make sure that the executable > doesn't reflinks to reuse parts of other files, and that the mold linker and > cp do... Not a concern on ext4, but on xfs. I took to copying the postgres > binary with cp --reflink=never What happens otherwise? That sounds like a difficult thing to guard against. > The difference in itlb.itlb_flush between pipelined / non-pipelined cases > unsurprisingly is stark. > > While the pipelined case still sees a good bit reduced itlb traffic, the total > amount of cycles in which a walk is active is just not large enough to matter, > by the looks of it. Good to know, thanks for testing. 
Maybe the pipelined case is something devs should consider when microbenchmarking, to reduce noise from context switches. On Sat, Nov 5, 2022 at 4:21 AM Andres Freund wrote: > > Hi, > > On 2022-11-03 10:21:23 -0700, Andres Freund wrote: > > > - Add a "cold" __asm__ filler function that just takes up space, enough to > > > push the end of the .text segment over the next aligned boundary, or to > > > ~8MB in size. > > > > I don't understand why this is needed - as long as the pages are aligned to > > 2MB, why do we need to fill things up on disk? The in-memory contents are the > > relevant bit, no? > > I now assume it's because you either observed the mappings set up by the > loader to not include the space between the segments? My knowledge is not quite that deep. The iodlr repo has an example "hello world" program, which links with 8 filler objects, each with 32768 __attribute((used)) dummy functions. I just cargo-culted that idea and simplified it. Interestingly enough, looking through the commit history, they used to align the segments via linker flags, but took it out here: https://github.com/intel/iodlr/pull/25#discussion_r397787559 ...saying "I'm not sure why we added this". :/ I quickly tried to align the segments with the linker and then in my patch have the address for mmap() rounded *down* from the .text start to the beginning of that segment. It refused to start without logging an error. BTW, that what I meant before, although I wasn't clear: > > Since the front is all-cold, and there is very little at the end, > > practically all hot pages are now remapped. The biggest problem with the > > hackish filler function (in addition to maintainability) is, if explicit > > huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB > > causes complete startup failure if the .text segment is larger than 8MB. > > I would expect MAP_HUGETLB to always fail if not enabled in the kernel, > independent of the .text segment size? 
With the file-level hack, it would just fail without a trace with .text > 8MB (I have yet to enable core dumps on this new OS I have...), whereas without it I did see the failures in the log, and successful fallback.

> With these flags the "R E" segments all start on a 0x200000/2MiB boundary and
> are padded to the next 2MiB boundary. However the OS / dynamic loader only
> maps the necessary part, not all the zero padding.
>
> This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> mremap() to increase the length of the mapping.

I see, interesting. What location are you passing for madvise() and mremap()? The beginning of the segment (for me has .init/.plt) or an aligned boundary within .text?

-- John Naylor EDB: http://www.enterprisedb.com
Re: Incorrect include file order in guc-file.l
On Fri, Nov 4, 2022 at 5:42 AM Michael Paquier wrote: > > The CI is able to complete without it. Would you mind if it is > removed? If you don't want us to poke more at the bear, that's a nit > so leaving things as they are is also fine by me. I've removed it. -- John Naylor EDB: http://www.enterprisedb.com
Re: Incorrect include file order in guc-file.l
On Thu, Nov 3, 2022 at 6:40 PM Michael Paquier wrote: > > On Thu, Nov 03, 2022 at 12:40:19PM +0700, John Naylor wrote: > > On Wed, Nov 2, 2022 at 1:01 PM Julien Rouhaud wrote: > >> Agreed, it's apparently an oversight in dac048f71eb. +1 for the patch. > > > > I've pushed this, thanks! > > Thanks for the commit. I've wanted to get it done yesterday but life > took over faster than that. Before committing the change, there is > something I have noticed though: this header does not seem to be > necessary at all and it looks that there is nothing in guc-file.l that > needs it. Why did you add it in dac048f to begin with? Because it wouldn't compile otherwise, obviously. :-) I must have been working on it before bfb9dfd93720 -- John Naylor EDB: http://www.enterprisedb.com
Re: Prefetch the next tuple's memory during seqscans
On Tue, Nov 1, 2022 at 5:17 AM David Rowley wrote:
>
> My test is to run 16 queries changing the WHERE clause each time to
> have WHERE a = 0, then WHERE a2 = 0 ... WHERE a16 = 0. I wanted to
> know if prefetching only the first cache line of the tuple would be
> less useful when we require evaluation of say, the "a16" column vs the
> "a" column.

I tried a similar test, but with text fields of random length, and there is improvement here:

Intel laptop, turbo boost off
shared_buffers = '4GB'
huge_pages = 'on'
max_parallel_workers_per_gather = '0'

create table text8 as
select
repeat('X', int4(random() * 20)) a1,
repeat('X', int4(random() * 20)) a2,
repeat('X', int4(random() * 20)) a3,
repeat('X', int4(random() * 20)) a4,
repeat('X', int4(random() * 20)) a5,
repeat('X', int4(random() * 20)) a6,
repeat('X', int4(random() * 20)) a7,
repeat('X', int4(random() * 20)) a8
from generate_series(1,1000) a;

vacuum freeze text8;

psql -c "select pg_prewarm('text8')" && \
for i in a1 a2 a3 a4 a5 a6 a7 a8;
do
echo Testing $i
echo "select * from text8 where $i = 'ZZZ';" > bench.sql
pgbench -f bench.sql -M prepared -n -T 10 postgres | grep latency
done

Master:
Testing a1
latency average = 980.595 ms
Testing a2
latency average = 1045.081 ms
Testing a3
latency average = 1107.736 ms
Testing a4
latency average = 1162.188 ms
Testing a5
latency average = 1213.985 ms
Testing a6
latency average = 1272.156 ms
Testing a7
latency average = 1318.281 ms
Testing a8
latency average = 1363.359 ms

Patch 0001+0003:
Testing a1
latency average = 812.548 ms
Testing a2
latency average = 897.303 ms
Testing a3
latency average = 955.997 ms
Testing a4
latency average = 1023.497 ms
Testing a5
latency average = 1088.494 ms
Testing a6
latency average = 1149.418 ms
Testing a7
latency average = 1213.134 ms
Testing a8
latency average = 1282.760 ms

-- John Naylor EDB: http://www.enterprisedb.com
Re: Incorrect include file order in guc-file.l
On Wed, Nov 2, 2022 at 1:01 PM Julien Rouhaud wrote: > > Hi, > > On Wed, Nov 02, 2022 at 02:29:50PM +0900, Michael Paquier wrote: > > > > While reviewing a different patch, I have noticed that guc-file.l > > includes sys/stat.h in the middle of the PG internal headers. The > > usual practice is to have first postgres[_fe].h, followed by the > > system headers and finally the internal headers. That's a nit, but > > all the other files do that. > > > > {be,fe}-secure-openssl.c include some exceptions though, as documented > > there. > > Agreed, it's apparently an oversight in dac048f71eb. +1 for the patch. I've pushed this, thanks! -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Oct 31, 2022 at 12:47 PM Masahiko Sawada wrote:
>
> I've attached v8 patches. 0001, 0002, and 0003 patches incorporated
> the comments I got so far. 0004 patch is a DSA support patch for PoC.

Thanks for the new patchset. This is not a full review, but I have some comments:

0001 and 0002 look okay on a quick scan -- I will use this as a base for further work that we discussed. However, before I do so I'd like to request another revision regarding the following:

> In 0004 patch, the basic idea is to use rt_node_ptr in all inner nodes
> to point its children, and we use rt_node_ptr as either rt_node* or
> dsa_pointer depending on whether the radix tree is shared or not (ie,
> by checking radix_tree->dsa == NULL).

0004: Looks like a good start, but this patch has a large number of changes like these, making it hard to read:

- if (found && child_p)
- 	*child_p = child;
+ if (found && childp_p)
+ 	*childp_p = childp;

...

  rt_node_inner_32 *new32;
+ rt_node_ptr new32p;

  /* grow node from 4 to 32 */
- new32 = (rt_node_inner_32 *) rt_copy_node(tree, (rt_node *) n4,
-                                           RT_NODE_KIND_32);
+ new32p = rt_copy_node(tree, (rt_node *) n4, RT_NODE_KIND_32);
+ new32 = (rt_node_inner_32 *) node_ptr_get_local(tree, new32p);

It's difficult to keep in my head what all the variables refer to. I thought a bit about how to split this patch up to make this easier to read. Here's what I came up with:

typedef struct rt_node_ptr
{
  uintptr_t encoded;
  rt_node * decoded;
}

Note that there is nothing about "dsa or local". That's deliberate. That way, we can use the "encoded" field for a tagged pointer as well, as I hope we can do (at least for local pointers) in the future. So an intermediate patch would have "static inline void" functions node_ptr_encode() and node_ptr_decode(), which would only copy from one member to another. I suspect that:

1. The actual DSA changes will be *much* smaller and easier to reason about.
2. Experimenting with tagged pointers will be easier.
Also, quick question: 0004 has a new function rt_node_update_inner() -- is that necessary because of DSA, or does this ideally belong in 0002? What's the reason for it?

> Regarding the performance, I've
> added another boolean argument to bench_seq/shuffle_search(),
> specifying whether to use the shared radix tree or not. Here are
> benchmark results in my environment,
> [...]
> In non-shared radix tree cases (the forth argument is false), I don't
> see a visible performance degradation. On the other hand, in shared
> radix tree cases (the forth argument is true), I see visible overheads
> because of dsa_get_address().

Thanks, this is useful.

> Please note that the current shared radix tree implementation doesn't
> support any locking, so it cannot be read while written by someone.

I think at the very least we need a global lock to enforce this.

> Also, only one process can iterate over the shared radix tree. When it
> comes to parallel vacuum, these don't become restriction as the leader
> process writes the radix tree while scanning heap and the radix tree
> is read by multiple processes while vacuuming indexes. And only the
> leader process can do heap vacuum by iterating the key-value pairs in
> the radix tree. If we want to use it for other cases too, we would
> need to support locking, RCU or something.

A useful exercise here is to think about what we'd need to do parallel heap pruning. We don't need to go that far for v16 of course, but what's the simplest thing we can do to make that possible? Other use cases can change to more sophisticated schemes if need be.

-- John Naylor EDB: http://www.enterprisedb.com
Re: Incorrect include file order in guc-file.l
On Wed, Nov 2, 2022 at 1:01 PM Julien Rouhaud wrote: > > Hi, > > On Wed, Nov 02, 2022 at 02:29:50PM +0900, Michael Paquier wrote: > > > > While reviewing a different patch, I have noticed that guc-file.l > > includes sys/stat.h in the middle of the PG internal headers. The > > usual practice is to have first postgres[_fe].h, followed by the > > system headers and finally the internal headers. That's a nit, but > > all the other files do that. > > > > {be,fe}-secure-openssl.c include some exceptions though, as documented > > there. > > Agreed, it's apparently an oversight in dac048f71eb. +1 for the patch. Yeah. +1, thanks. -- John Naylor EDB: http://www.enterprisedb.com
remap the .text segment into huge pages at run time
dback on whether the entire approach in this thread is sound enough to justify working further on. [1] https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf (paper: "On the Impact of Instruction Address Translation Overhead") [2] https://twitter.com/AndresFreundTec/status/1214305610172289024 [3] https://github.com/intel/iodlr -- John Naylor EDB: http://www.enterprisedb.com From 9cde401f87937c1982f2355c8f81449514166376 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Mon, 31 Oct 2022 13:59:30 +0700 Subject: [PATCH v1 2/2] Put all non-cold .text in huge pages Tell linker to align addresses on 2MB boundaries. The .init section will be so aligned, with the .text section soon after that. Therefore, the start address of .text should always be align up to nearly 2MB ahead of the actual start. The first nearly 2MB of .text will not map to huge pages. We count on cold sections linking to the front of the .text segment: Since the cold sections total about 600kB in size, we need ~1.4MB of additional padding to keep non-cold pages mappable to huge pages. Since PG has about 5.0MB of .text, we also need an additional 1MB to push the .text end just past an aligned boundary, so when we align the end down, only a small number of pages will remain un-remapped at their original 4kB size. --- meson.build | 3 +++ src/backend/port/filler.c| 29 + src/backend/port/meson.build | 3 +++ 3 files changed, 35 insertions(+) create mode 100644 src/backend/port/filler.c diff --git a/meson.build b/meson.build index bfacbdc0af..450946370c 100644 --- a/meson.build +++ b/meson.build @@ -239,6 +239,9 @@ elif host_system == 'freebsd' elif host_system == 'linux' sema_kind = 'unnamed_posix' cppflags += '-D_GNU_SOURCE' + # WIP: debug builds are huge + # TODO: add portability check + ldflags += ['-Wl,-zcommon-page-size=2097152', '-Wl,-zmax-page-size=2097152'] elif host_system == 'netbsd' # We must resolve all dynamic linking in the core server at program start. 
diff --git a/src/backend/port/filler.c b/src/backend/port/filler.c new file mode 100644 index 00..de4e33bb05 --- /dev/null +++ b/src/backend/port/filler.c @@ -0,0 +1,29 @@ +/* + * Add enough padding to .text segment to bring the end just + * past a 2MB alignment boundary. In practice, this means .text needs + * to be at least 8MB. It shouldn't be much larger than this, + * because then more hot pages will remain in 4kB pages. + * + * FIXME: With this filler added, if explicit huge pages are turned off + * in the kernel, attempting mmap() with MAP_HUGETLB causes a crash + * instead of reporting failure if the .text segment is larger than 8MB. + * + * See MapStaticCodeToLargePages() in large_page.c + * + * XXX: The exact amount of filler must be determined experimentally + * on platforms of interest, in non-assert builds. + * + */ +static void +__attribute__((used)) +__attribute__((cold)) +fill_function(int x) +{ + /* TODO: More architectures */ +#ifdef __x86_64__ +__asm__( + ".fill 3251000" +); +#endif + (void) x; +} \ No newline at end of file diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build index 5ab65115e9..d876712e0c 100644 --- a/src/backend/port/meson.build +++ b/src/backend/port/meson.build @@ -16,6 +16,9 @@ if cdata.has('USE_WIN32_SEMAPHORES') endif if cdata.has('USE_SYSV_SHARED_MEMORY') + if host_system == 'linux' +backend_sources += files('filler.c') + endif backend_sources += files('large_page.c') backend_sources += files('sysv_shmem.c') endif -- 2.37.3 From 0012baab70779f5fc06c8717392dc76e8f156270 Mon Sep 17 00:00:00 2001 From: John Naylor Date: Mon, 31 Oct 2022 15:24:29 +0700 Subject: [PATCH v1 1/2] Partly remap the .text segment into huge pages at postmaster start Based on MIT licensed libary at https://github.com/intel/iodlr The basic steps are: - read ELF info to get the start/end addresses of the .text segment - calculate addresses therein aligned at huge page boundaries - mmap temporary region and memcpy aligned portion of 
.text segment - mmap start address to new region with huge pages and MAP_FIXED - memcpy over and revoke the PROT_WRITE bit The Postgres .text segment is ~5.0MB in a non-assert build, so this method can put 2-4MB into huge pages. --- src/backend/port/large_page.c | 348 src/backend/port/meson.build| 1 + src/backend/postmaster/postmaster.c | 7 + src/include/port/large_page.h | 18 ++ 4 files changed, 374 insertions(+) create mode 100644 src/backend/port/large_page.c create mode 100644 src/include/port/large_page.h diff --git a/src/backend/port/large_page.c b/src/backend/port/large_page.c new file mode 100644 index 00..66a584f785 --- /dev/null +++ b/src/backend/port/large_page.c @@ -0,0 +1,348 @@ +/*- + * + * large_page.c + * Map .text segment of binary to hu
Re: Documentation for building with meson
+# Run the main pg_regress and isolation tests
+meson test --suite main

This does not work for me in a fresh install until running

meson test --suite setup

In fact, we see in https://wiki.postgresql.org/wiki/Meson

meson test --suite setup --suite main

That was just an eyeball check from a naive user -- it would be good to try running everything documented here.

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Oct 27, 2022 at 9:11 AM Masahiko Sawada wrote: > > True. I'm going to start with 6 bytes and will consider reducing it to > 5 bytes. Okay, let's plan on 6 for now, so we have the worst-case sizes up front. As discussed, I will attempt the size class decoupling after v8 and see how it goes. > Encoding the kind in a pointer tag could be tricky given DSA If it turns out to be unworkable, that's life. If it's just tricky, that can certainly be put off for future work. I hope to at least test it out with local memory. > support so currently I'm thinking to pack the node kind and node > capacity classes to uint8. That won't work, if we need 128 for capacity, leaving no bits left. I want the capacity to be a number we can directly compare with the count (we won't ever need to store 256 because that node will never grow). Also, further to my last message, we need to access the kind quickly, without more cycles. > I've made some progress on investigating DSA support. I've written > draft patch for that and regression tests passed. I'll share it as a > separate patch for discussion with v8 radix tree patch. Great! > While implementing DSA support, I realized that we may not need to use > pointer tagging to distinguish between backend-local address or > dsa_pointer. In order to get a backend-local address from dsa_pointer, > we need to pass dsa_area like: I was not clear -- when I see how much code changes to accommodate DSA pointers, I imagine I will pretty much know the places that would be affected by tagging the pointer with the node kind. Speaking of tests, there is currently no Meson support, but tests pass because this library is not used anywhere in the backend yet, and apparently the CI Meson builds don't know to run the regression test? That will need to be done too. However, it's okay to keep the benchmarking module in autoconf, since it won't be committed. 
> > +static inline void
> > +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> > +						  uint8 *dst_chunks, rt_node **dst_children, int count)
> > +{
> > +	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> > +	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> > +}
> >
> > gcc generates better code with something like this (but not hard-coded) at the top:
> >
> > if (count > 4)
> > 	pg_unreachable();

Actually it just now occurred to me there's a bigger issue here: *We* know this code can only get here iff count == 4, so why doesn't the compiler know that? I believe it boils down to

static rt_node_kind_info_elem rt_node_kind_info[RT_NODE_KIND_COUNT] = {

In the assembly, I see it checks if there is room in the node by doing a runtime lookup in this array, which is not constant. This might not be important just yet, because I want to base the check on the proposed node capacity instead, but I mention it as a reminder to us to make sure we take all opportunities for the compiler to propagate constants.

-- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
remind us.

+ /* there is no key to delete */
+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, NULL))
+ 	return false;
+
+ /* Update the statistics */
+ tree->num_keys--;
+
+ /*
+  * Delete the key from the leaf node and recursively delete the key in
+  * inner nodes if necessary.
+  */
+ Assert(NODE_IS_LEAF(stack[level]));
+ while (level >= 0)
+ {
+ 	rt_node    *node = stack[level--];
+
+ 	if (NODE_IS_LEAF(node))
+ 		rt_node_search_leaf(node, key, RT_ACTION_DELETE, NULL);
+ 	else
+ 		rt_node_search_inner(node, key, RT_ACTION_DELETE, NULL);
+
+ 	/* If the node didn't become empty, we stop deleting the key */
+ 	if (!NODE_IS_EMPTY(node))
+ 		break;
+
+ 	/* The node became empty */
+ 	rt_free_node(tree, node);
+ }

Here we call rt_node_search_leaf() twice -- once to check for existence, and once to delete. All three search calls are inlined, so this wastes space. Let's try to delete the leaf, return if not found, otherwise handle the leaf bookkeeping and loop over the inner nodes. This might require some duplication of code.

+ndoe_inner_128_update(rt_node_inner_128 *node, uint8 chunk, rt_node *child)

Spelling

+static inline void
+chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
+						  uint8 *dst_chunks, rt_node **dst_children, int count)
+{
+	memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
+	memcpy(dst_children, src_children, sizeof(rt_node *) * count);
+}

gcc generates better code with something like this (but not hard-coded) at the top:

if (count > 4)
	pg_unreachable();

This would have to change when we implement shrinking of nodes, but might still be useful.

+ if (!rt_node_search_leaf(node, key, RT_ACTION_FIND, value_p))
+ 	return false;
+
+ return true;

Maybe just "return rt_node_search_leaf(...)" ?

-- John Naylor EDB: http://www.enterprisedb.com
Re: fixing typo in comment for restriction_is_or_clause
On Tue, Oct 25, 2022 at 9:48 AM Richard Guo wrote: > > > On Tue, Oct 25, 2022 at 10:05 AM John Naylor wrote: >> >> It's perfectly clear and simple now, even if it doesn't win at "code golf". > > > Agree with your point. Do you think we can further make the one-line > function a macro or an inline function in the .h file? I think this > function is called quite frequently during planning, so maybe doing that > would bring a little bit of efficiency. My point was misunderstood, which is: I don't think we need to do anything at all here if the goal was purely about aesthetics. If the goal has now changed to efficiency, I have no opinion about that yet since no evidence has been presented. -- John Naylor EDB: http://www.enterprisedb.com
Re: fixing typo in comment for restriction_is_or_clause
On Tue, Oct 25, 2022 at 12:19 AM Zhihong Yu wrote: > > Hi, > When I was looking at src/backend/optimizer/util/restrictinfo.c, I found a typo in one of the comments. Using "t" as an abbreviation for "true" was probably intentional, so not a typo. There is no doubt what the behavior is. > I also took the chance to simplify the code a little bit. It's perfectly clear and simple now, even if it doesn't win at "code golf". -- John Naylor EDB: http://www.enterprisedb.com
Re: Reducing the chunk header sizes on all memory context types
On Thu, Oct 20, 2022 at 1:55 AM Andres Freund wrote: > > Hi, > > On 2022-10-11 10:21:17 +0700, John Naylor wrote: > > On Tue, Oct 11, 2022 at 5:31 AM David Rowley wrote: > > > > > > The proposed patches in [1] do aim to make additional usages of the > > > slab allocator, and I have a feeling that we'll want to fix the > > > performance of slab.c before those. Perhaps the Asserts are a better > > > option if we're to get the proposed radix tree implementation. > > > > Going by [1], that use case is not actually a natural fit for slab because > > of memory fragmentation. > > > > [1] > > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de > > Not so sure about that - IIRC I made one slab for each different size class, > which seemed to work well and suit slab well? If that's the case, then great! The linked message didn't give me that impression, but I won't worry about it. -- John Naylor EDB: http://www.enterprisedb.com
Re: meson PGXS compatibility
On Wed, Oct 12, 2022 at 12:50 PM Peter Eisentraut < peter.eisentr...@enterprisedb.com> wrote: > > On 05.10.22 22:07, Andres Freund wrote: > > 0001: autoconf: Unify CFLAGS_SSE42 and CFLAGS_ARMV8_CRC32C > > I wonder, there has been some work lately to use SIMD and such in other > places. Would that affect what kinds of extra CPU-specific compiler > flags and combinations we might need? In short, no. The functionality added during this cycle only uses instructions available by default on the platform. CRC on Arm depends on armv8-a+crc, and CRC on x86-64 depends on SSE4.2. We can't assume those currently. -- John Naylor EDB: http://www.enterprisedb.com
Re: introduce optimized linear search functions that return index of matching element
On Sat, Sep 17, 2022 at 12:29 PM Nathan Bossart wrote: > > On Fri, Sep 16, 2022 at 02:54:14PM +0700, John Naylor wrote: > > v6 demonstrates why this should have been put off towards the end. (more below) > > Since the SIMD code is fresh in my mind, I wanted to offer my review for > 0001 in the "Improve dead tuple storage for lazy vacuum" thread [0]. > However, I agree with John that the SIMD part of that work should be left > for the end As I mentioned in the radix tree thread, I don't believe this level of abstraction is appropriate for the intended use case. We'll want to incorporate some of the low-level simd.h improvements later, so you should get authorship credit for those. I've marked the entry "returned with feedback". -- John Naylor EDB: http://www.enterprisedb.com
Re: Reducing the chunk header sizes on all memory context types
On Tue, Oct 11, 2022 at 5:31 AM David Rowley wrote: > > The proposed patches in [1] do aim to make additional usages of the > slab allocator, and I have a feeling that we'll want to fix the > performance of slab.c before those. Perhaps the Asserts are a better > option if we're to get the proposed radix tree implementation. Going by [1], that use case is not actually a natural fit for slab because of memory fragmentation. The motivation to use slab there was that the allocation sizes are just over a power of two, leading to a lot of wasted space for aset. FWIW, I have proposed in that thread a scheme to squeeze things into power-of-two sizes without wasting quite as much space. That's not a done deal, of course, but it could work today without adding memory management code. [1] https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de -- John Naylor EDB: http://www.enterprisedb.com
Re: Remove unnecessary commas for goto labels
On Sun, Oct 9, 2022 at 10:08 AM Julien Rouhaud wrote: > > Hi, > > On Sun, Oct 09, 2022 at 07:42:58AM +0800, Japin Li wrote: > > > > Hi hackers, > > > > I find there are some unnecessary commas for goto lables, > > attached a patch to remove them. > > You mean semi-colon? +1, and a quick regex later I don't see any other > occurrence. Interestingly, I did find that in C, some statement is needed after a label, even an empty one, otherwise it won't compile. That's not the case here, though, so pushed. -- John Naylor EDB: http://www.enterprisedb.com
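As a small illustration of that C quirk (pre-C23, a label must be attached to a statement, so a label right before a closing brace needs at least a null statement):

```c
#include <stdbool.h>

static int	work_done = 0;

static void
maybe_work(bool skip)
{
	if (skip)
		goto out;
	work_done++;
out: ;	/* the lone ';' is required here: without a statement after the
		 * label, this does not compile (until C23 relaxed the rule) */
}
```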
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Mon, Oct 10, 2022 at 12:16 PM John Naylor wrote: > Thanks for that! Now I can show clear results on some aspects in a simple way. The attached patches (apply on top of v6) Forgot the patchset... -- John Naylor EDB: http://www.enterprisedb.com radix-v6-addendum-jcn1.tar.gz Description: application/gzip
Re: [PoC] Improve dead tuple storage for lazy vacuum
of, we can discuss further. I don't pretend to know the practical consequences of every change I mention.

- If you have started coding the shared memory case, I'd advise to continue so we can see what that looks like. If that has not gotten beyond the design stage, I'd like to first see an attempt at tearing down some of the clumsier abstractions in the current patch.
- As a "smoke test", there should ideally be nothing as general as rt_node_get_children/values(). We should ideally always know what kind we are if we found out earlier.
- For distinguishing between linear nodes, perhaps some always-inline functions can help hide details. But at the same time, trying to treat them the same is not always worthwhile.
- Start to separate treatment of inner/leaves and see how it goes.
- I firmly believe we only need 4 node *kinds*, and later we can decouple the size classes as a separate concept. I'm willing to put serious time into that once the broad details are right. I will also investigate pointer tagging if we can confirm that can work similarly for dsa pointers.

Regarding size class decoupling, I'll respond to a point made earlier:

On Fri, Sep 30, 2022 at 10:47 PM Masahiko Sawada wrote:
> With this idea, we can just repalloc() to grow to the larger size in a
> pair but I'm slightly concerned that the more size class we use, the
> more frequent the node needs to grow.

Well, yes, but that's orthogonal. For example, v6 has 5 node kinds. Imagine that we have 4 node kinds, but the SIMD node kind used 2 size classes. Then the nodes would grow at *exactly* the same frequency as they do today. I listed many ways a size class could fit into a power-of-two (and there are more), but we have a choice in how many to actually use. It's a trade off between memory usage and complexity.

> If we want to support node
> shrink, the deletion is also affected.

Not necessarily. We don't have to shrink at the same granularity as growing. My evidence is simple: we don't shrink at all now. :-)

--
John Naylor
EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada wrote: > In addition to two patches, I've attached the third patch. It's not > part of radix tree implementation but introduces a contrib module > bench_radix_tree, a tool for radix tree performance benchmarking. It > measures loading and lookup performance of both the radix tree and a > flat array. Hi Masahiko, I've been using these benchmarks, along with my own variations, to try various things that I've mentioned. I'm long overdue for an update, but the picture is not yet complete. For now, I have two questions that I can't figure out on my own: 1. There seems to be some non-obvious limit on the number of keys that are loaded (or at least what the numbers report). This is independent of the number of tids per block. Example below: john=# select * from bench_shuffle_search(0, 8*1000*1000); NOTICE: num_keys = 800, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 25, n256 = 981 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms -+--+-++---+--+- 800 |268435456 |4800 |661 | 29 | 276 | 389 john=# select * from bench_shuffle_search(0, 9*1000*1000); NOTICE: num_keys = 8388608, height = 3, n4 = 0, n16 = 1, n32 = 0, n128 = 262144, n256 = 1028 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms -+--+-++---+--+- 8388608 |276824064 |5400 |718 | 33 | 311 | 446 The array is the right size, but nkeys hasn't kept pace. Can you reproduce this? Attached is the patch I'm using to show the stats when running the test. (Side note: The numbers look unfavorable for radix tree because I'm using 1 tid per block here.) 2. I found that bench_shuffle_search() is much *faster* for traditional binary search on an array than bench_seq_search(). I've found this to be true in every case. This seems counterintuitive to me -- any idea why this is? 
Example: john=# select * from bench_seq_search(0, 100); NOTICE: num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms -+--+-++---+--+- 100 | 10199040 | 18000 |168 | 106 | 827 |3348 john=# select * from bench_shuffle_search(0, 100); NOTICE: num_keys = 100, height = 2, n4 = 0, n16 = 0, n32 = 31251, n128 = 1, n256 = 122 nkeys | rt_mem_allocated | array_mem_allocated | rt_load_ms | array_load_ms | rt_search_ms | array_serach_ms -+--+-++---+--+- 100 | 10199040 | 18000 |171 | 107 | 827 | 1400 -- John Naylor EDB: http://www.enterprisedb.com From 43a50a385930ee340d0a3b003910c704a0ff342c Mon Sep 17 00:00:00 2001 From: John Naylor Date: Thu, 6 Oct 2022 09:07:41 +0700 Subject: [PATCH v65 1/5] Turn on per-node counts in benchmark Also add gitigore, fix whitespace, and change to NOTICE --- contrib/bench_radix_tree/.gitignore | 3 +++ contrib/bench_radix_tree/bench_radix_tree.c | 5 + src/backend/lib/radixtree.c | 2 +- src/include/lib/radixtree.h | 2 +- 4 files changed, 10 insertions(+), 2 deletions(-) create mode 100644 contrib/bench_radix_tree/.gitignore diff --git a/contrib/bench_radix_tree/.gitignore b/contrib/bench_radix_tree/.gitignore new file mode 100644 index 00..8830f5460d --- /dev/null +++ b/contrib/bench_radix_tree/.gitignore @@ -0,0 +1,3 @@ +*data +log/* +results/* diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c index 5806ef7519..36c5218ae7 100644 --- a/contrib/bench_radix_tree/bench_radix_tree.c +++ b/contrib/bench_radix_tree/bench_radix_tree.c @@ -13,6 +13,7 @@ #include "fmgr.h" #include "funcapi.h" #include "lib/radixtree.h" +#include #include "miscadmin.h" #include "utils/timestamp.h" @@ -183,6 +184,8 @@ bench_search(FunctionCallInfo fcinfo, bool shuffle) TimestampDifference(start_time, end_time, , ); rt_load_ms = secs * 1000 + usecs / 1000; + rt_stats(rt); + /* measure the load 
time of the array */ itemptrs = MemoryContextAllocHuge(CurrentMemoryContext, sizeof(ItemPointerData) * ntids); @@ -292,6 +295,8 @@ bench_load_random_int(PG_FUNCTION_ARGS) TimestampDifference(start_time, end_time, , ); load_time_ms = secs * 1000 + usecs / 1000; +
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Oct 6, 2022 at 2:53 PM Masahiko Sawada wrote: > > On Wed, Oct 5, 2022 at 6:40 PM John Naylor wrote: > > > > This wasn't the focus of your current email, but while experimenting with v6 I had another thought about local allocation: If we use the default slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right? If so, since aset and DSA also waste at least a few hundred bytes, we could store a useless 256-byte slot array within node256. That way, node128 and node256 share the same start of pointers/values array, so there would be one less branch for getting that address. In v6, rt_node_get_values and rt_node_get_children are not inlined (asde: gcc uses a jump table for 5 kinds but not for 4), but possibly should be, and the smaller the better. > > It would be good for performance but I'm a bit concerned that it's > highly optimized to the design of aset and DSA. Since size 2088 will > be currently classed as 2616 in DSA, DSA wastes 528 bytes. However, if > we introduce a new class of 2304 (=2048 + 256) bytes we cannot store a > useless 256-byte and the assumption will be broken. A new DSA class is hypothetical. A better argument against my idea is that SLAB_DEFAULT_BLOCK_SIZE is arbitrary. FWIW, I looked at the prototype just now and the slab block sizes are: Max(pg_nextpower2_32((MAXALIGN(inner_class_info[i].size) + 16) * 32), 1024) ...which would be 128kB for nodemax. I'm curious about the difference. > > One concern is, handling both local and dsa cases in the same code requires more (predictable) branches and reduces code density. That might be a reason in favor of templating to handle each case in its own translation unit. > > Right. We also need to support locking for shared radix tree, which > would require more branches. Hmm, now it seems we'll likely want to template local vs. shared as a later step... -- John Naylor EDB: http://www.enterprisedb.com
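To see where that 128kB comes from, here is the same formula with stand-in versions of MAXALIGN and pg_nextpower2_32 (simplified re-implementations for illustration; the real macros live in c.h and pg_bitutils.h, and MAXIMUM_ALIGNOF is assumed to be 8):

```c
#include <stdint.h>

/* round up to an 8-byte boundary, as MAXALIGN does on 64-bit platforms */
#define TOY_MAXALIGN(len)	(((uint32_t) (len) + 7) & ~(uint32_t) 7)

/* smallest power of two >= v (v itself if already a power of two) */
static uint32_t
toy_nextpower2_32(uint32_t v)
{
	v--;
	v |= v >> 1;
	v |= v >> 2;
	v |= v >> 4;
	v |= v >> 8;
	v |= v >> 16;
	return v + 1;
}

/* Max(pg_nextpower2_32((MAXALIGN(size) + 16) * 32), 1024) */
static uint32_t
toy_slab_block_size(uint32_t chunk_size)
{
	uint32_t	sz = toy_nextpower2_32((TOY_MAXALIGN(chunk_size) + 16) * 32);

	return sz > 1024 ? sz : 1024;
}
```

For the 2088-byte node256 chunk this gives (2088 + 16) * 32 = 67328, rounded up to 131072 bytes -- the 128kB in question.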
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Oct 5, 2022 at 1:46 PM Masahiko Sawada wrote:
>
> On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada wrote:
> >
> > On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> > wrote:
> > Yeah, node31 and node256 are bloated. We probably could use slab for
> > node256 independently. It's worth trying a benchmark to see how it
> > affects the performance and the tree size.

This wasn't the focus of your current email, but while experimenting with v6 I had another thought about local allocation: If we use the default slab block size of 8192 bytes, then only 3 chunks of size 2088 can fit, right? If so, since aset and DSA also waste at least a few hundred bytes, we could store a useless 256-byte slot array within node256. That way, node128 and node256 share the same start of pointers/values array, so there would be one less branch for getting that address. In v6, rt_node_get_values and rt_node_get_children are not inlined (aside: gcc uses a jump table for 5 kinds but not for 4), but possibly should be, and the smaller the better.

> Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
> to point to its child nodes, instead of C pointers (ig, backend-local
> address). I'm thinking of a straightforward approach as the first
> step; inner nodes have a union of rt_node* and dsa_pointer and we
> choose either one based on whether the radix tree is shared or not. We
> allocate and free the shared memory for individual nodes by
> dsa_allocate() and dsa_free(), respectively. Therefore we need to get
> a C pointer from dsa_pointer by using dsa_get_address() while
> descending the tree. I'm a bit concerned that calling
> dsa_get_address() for every descent could be performance overhead but
> I'm going to measure it anyway.

Are dsa pointers aligned the same as pointers to locally allocated memory? Meaning, is the offset portion always a multiple of 4 (or 8)? It seems that way from a glance, but I can't say for sure. 
If the lower 2 bits of a DSA pointer are never set, we can tag them the same way as a regular pointer. That same technique could help hide the latency of converting the pointer, by the same way it would hide the latency of loading parts of a node into CPU registers. One concern is, handling both local and dsa cases in the same code requires more (predictable) branches and reduces code density. That might be a reason in favor of templating to handle each case in its own translation unit. But that might be overkill. -- John Naylor EDB: http://www.enterprisedb.com
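As a sketch of the tagging technique itself (not patch code): with at least 4-byte alignment the low two bits of a pointer are always zero, which is enough to encode one of 4 node kinds and strip with a single mask:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TAG_MASK	((uintptr_t) 3)		/* low 2 bits: room for 4 kinds */

static inline uintptr_t
tag_pointer(void *p, unsigned kind)
{
	/* only works if the allocator guarantees >= 4-byte alignment */
	assert(((uintptr_t) p & TAG_MASK) == 0);
	return (uintptr_t) p | (uintptr_t) kind;
}

static inline unsigned
pointer_tag(uintptr_t tagged)
{
	return (unsigned) (tagged & TAG_MASK);
}

static inline void *
untag_pointer(uintptr_t tagged)
{
	return (void *) (tagged & ~TAG_MASK);
}
```

Whether the same works for dsa_pointer hinges exactly on the alignment question above: the offset part must also keep its low bits clear.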
Re: [RFC] building postgres with meson - v13
On Tue, Sep 27, 2022 at 2:41 PM John Naylor wrote: > > On Tue, Sep 27, 2022 at 2:06 AM Andres Freund wrote: > > > > On 2022-09-26 15:18:29 +0700, John Naylor wrote: > > Yea, it's /usr/local on x86-64, based on what was required to make macos CI > > work. I updated the wiki page, half-blindly - it'd be nice if you could > > confirm that that works? > > Not sure if you intended for me to try the full script in your last response or just what's in the wiki page, but for the latter (on commit bed0927aeb0c6), it fails at > > [1656/2199] Linking target src/bin/psql/psql > FAILED: src/bin/psql/psql Per off-list discussion with Andres, the linking failure was caused by some env variables set in my bash profile for the sake of Homebrew. After removing those, the recipe in the wiki worked fine. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Sep 28, 2022 at 1:18 PM John Naylor wrote: > [stuff about size classes] I kind of buried the lede here on one thing: If we only have 4 kinds regardless of the number of size classes, we can use 2 bits of the pointer for dispatch, which would only require 4-byte alignment. That should make that technique more portable. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Wed, Sep 28, 2022 at 10:49 AM Masahiko Sawada wrote:
> BTW We need to consider not only aset/slab but also DSA since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
> [...]

Thanks for that info -- I wasn't familiar with the details of DSA. For the non-parallel case, I plan to at least benchmark using aset because I gather it's the most heavily optimized. I'm thinking that will allow other problem areas to be more prominent. I'll also want to compare total context size compared to slab to see if possibly less fragmentation makes up for other wastage.

Along those lines, one thing I've been thinking about is the number of size classes. There is a tradeoff between memory efficiency and number of branches when searching/inserting. My current thinking is there is too much coupling between size class and data type. Each size class currently uses a different data type and a different algorithm to search and set it, which in turn requires another branch. We've found that a larger number of size classes leads to poor branch prediction [1] and (I imagine) code density.

I'm thinking we can use "flexible array members" for the values/pointers, and keep the rest of the control data in the struct the same. That way, we never have more than 4 actual "kinds" to code and branch on. As a bonus, when migrating a node to a larger size class of the same kind, we can simply repalloc() to the next size.

To show what I mean, consider this new table:

node2:   5 +   6 +(5)+   2*8 =   32 bytes
node6:   5 +   6 +(5)+   6*8 =   64
node12:  5 +  27     +  12*8 =  128
node27:  5 +  27     +  27*8 =  248(->256)
node91:  5 + 256 + 28 +(7)+  91*8 = 1024
node219: 5 + 256 + 28 +(7)+ 219*8 = 2048
node256: 5 +  32 +(3)+ 256*8 = 2088(->4096)

Seven size classes are grouped into the four kinds. 
The common base at the front is here 5 bytes because there is a new uint8 field for "capacity", which we can ignore for node256 since we assume we can always insert/update that node. The control data is the same in each pair, and so the offset to the pointer/value array is the same. Thus, migration would look something like:

case FOO_KIND:
	if (unlikely(count == capacity))
	{
		if (capacity == XYZ)	/* for smaller size class of the pair */
		{
			;
			capacity = next-higher-capacity;
			goto do_insert;
		}
		else
			;
	}
	else
	{
do_insert:
		<...>;
		break;
	}
	/* FALLTHROUGH */
...

One disadvantage is that this wastes some space by reserving the full set of control data in the smaller size class of the pair, but it's usually small compared to array size.

Somewhat unrelated, we could still implement Andres' idea [1] to dispense with the isset array in inner nodes of the indirect array type (now node128), since we can just test if the pointer is null.

[1] https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de

--
John Naylor
EDB: http://www.enterprisedb.com
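The repalloc()-within-a-kind idea can be sketched with a flexible array member and plain realloc (all names here are invented for the illustration; the real nodes carry more control data):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct toy_node
{
	uint8_t		kind;
	uint8_t		count;
	uint8_t		capacity;		/* e.g. 2 or 6: two size classes, one kind */
	uint8_t		chunks[6];
	void	   *children[];		/* flexible array member, length = capacity */
} toy_node;

static toy_node *
toy_node_alloc(uint8_t capacity)
{
	toy_node   *n = calloc(1, sizeof(toy_node) + capacity * sizeof(void *));

	n->capacity = capacity;
	return n;
}

/*
 * Growing to the larger size class of the same kind is just a realloc;
 * the header and existing entries keep their layout, so no per-kind
 * copying code is needed.
 */
static toy_node *
toy_node_grow(toy_node *n, uint8_t new_capacity)
{
	n = realloc(n, sizeof(toy_node) + new_capacity * sizeof(void *));
	n->capacity = new_capacity;
	return n;
}
```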
Re: Insertion Sort Improvements
On Tue, Sep 27, 2022 at 4:39 PM Benjamin Coutu wrote: > > Getting back to improvements for small sort runs, it might make sense to consider using in-register based sorting via sorting networks for very small runs. > It looks like some of the commercial DBMSs do this very successfully. "Looks like"? > They use 4 512bit registers (AVX-512) in this example, which could technically store 4 * 4 sort-elements (given that the sorting key is 64 bit and the tuple pointer is 64 bit). I wonder whether this could also be done with just SSE (instead of AVX), which the project now readily supports thanks to your recent efforts? SortTuples are currently 24 bytes, and supported vector registers are 16 bytes, so not sure how you think that would work. More broadly, the more invasive a change is, the more risk is involved, and the more effort to test and evaluate. If you're serious about trying to improve insertion sort performance, the simple idea we discussed at the start of the thread is a much more modest step that has a good chance of justifying the time put into it. That is not to say it's easy, however, because testing is a non-trivial amount of work. -- John Naylor EDB: http://www.enterprisedb.com
Re: [RFC] building postgres with meson - v13
On Tue, Sep 27, 2022 at 2:06 AM Andres Freund wrote: > > On 2022-09-26 15:18:29 +0700, John Naylor wrote: > > Either way it doesn't exist on this machine. I was able to get a working > > build with > > > > /usr/local/Homebrew/Library/Homebrew/os/mac/pkgconfig > > Hm - what did you need this path for - I don't think that should be needed. I just cargo-culted the pattern from Arm (before I figured out it was Arm) and used the "find" command to look for the directories by name. I tried again without specifying any of the three directory flags, and I can run the tests getting: Ok: 233 Expected Fail: 0 Fail: 0 Unexpected Pass:0 Skipped:2 Timeout:0 ...which is fine for me since I don't do much development on MacOS nowadays. > > 1) /opt/homebrew/ seems to be an "Apple silicon" path? > > Yea, it's /usr/local on x86-64, based on what was required to make macos CI > work. I updated the wiki page, half-blindly - it'd be nice if you could > confirm that that works? Not sure if you intended for me to try the full script in your last response or just what's in the wiki page, but for the latter (on commit bed0927aeb0c6), it fails at [1656/2199] Linking target src/bin/psql/psql FAILED: src/bin/psql/psql clang -o src/bin/psql/psql src/bin/psql/psql.p/meson-generated_.._psqlscanslash.c.o src/bin/psql/psql.p/meson-generated_.._sql_help.c.o src/bin/psql/psql.p/command.c.o src/bin/psql/psql.p/common.c.o src/bin/psql/psql.p/copy.c.o src/bin/psql/psql.p/crosstabview.c.o src/bin/psql/psql.p/describe.c.o src/bin/psql/psql.p/help.c.o src/bin/psql/psql.p/input.c.o src/bin/psql/psql.p/large_obj.c.o src/bin/psql/psql.p/mainloop.c.o src/bin/psql/psql.p/prompt.c.o src/bin/psql/psql.p/startup.c.o src/bin/psql/psql.p/stringutils.c.o src/bin/psql/psql.p/tab-complete.c.o src/bin/psql/psql.p/variables.c.o -L/usr/local/opt/readline/lib -L/usr/local/opt/gettext/lib -L/usr/local/opt/zlib/lib -L/usr/local/opt/openssl/lib -I/usr/local/opt/readline/include -I/usr/local/opt/gettext/include 
-I/usr/local/opt/zlib/include -I/usr/local/opt/openssl/include -Wl,-dead_strip_dylibs -Wl,-headerpad_max_install_names -Wl,-undefined,error -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk -Wl,-rpath,@loader_path/../../interfaces/libpq -Wl,-rpath,/usr/local/lib -Wl,-rpath,/usr/local/Cellar/zstd/1.5.2/lib src/fe_utils/libpgfeutils.a src/common/libpgcommon.a src/common/libpgcommon_ryu.a src/common/libpgcommon_config_info.a src/port/libpgport.a src/port/libpgport_crc.a src/interfaces/libpq/libpq.5.dylib -lm /usr/local/lib/libintl.dylib -ledit -lz /usr/local/Cellar/zstd/1.5.2/lib/libzstd.dylib -lz -lz -lz Undefined symbols for architecture x86_64: "_rl_completion_suppress_quote", referenced from: _psql_completion in tab-complete.c.o _quote_file_name in tab-complete.c.o _complete_from_files in tab-complete.c.o "_rl_filename_dequoting_function", referenced from: _initialize_readline in tab-complete.c.o "_rl_filename_quote_characters", referenced from: _initialize_readline in tab-complete.c.o "_rl_filename_quoting_function", referenced from: _initialize_readline in tab-complete.c.o ld: symbol(s) not found for architecture x86_64 clang: error: linker command failed with exit code 1 (use -v to see invocation) -- John Naylor EDB: http://www.enterprisedb.com
Re: [RFC] building postgres with meson - v13
On Wed, Sep 21, 2022 at 7:11 AM Andres Freund wrote: > > I added installation instructions for meson for a bunch of platforms, but A couple more things for the wiki: 1) /opt/homebrew/ seems to be an "Apple silicon" path? Either way it doesn't exist on this machine. I was able to get a working build with /usr/local/Homebrew/Library/Homebrew/os/mac/pkgconfig (My homebrew install doesn't seem to have anything relevant for extra_include_dirs or extra_lib_dirs.) 2) Also, "ninja -v install" has the same line count as "ninja install" -- are there versions that do something different? -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Sep 22, 2022 at 11:46 AM John Naylor wrote:
> One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in
> https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de

While calculating node sizes that fit within a power-of-two size, I noticed the current base node is a bit wasteful, taking up 8 bytes. The node kind only has a small number of values, so it doesn't really make sense to use an enum here in the struct (in fact, Andres' prototype used a uint8 for node_kind). We could use a bitfield for the count and kind:

uint16 -- kind and count bitfield
uint8 shift;
uint8 chunk;

That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag, the bitfield can just go back to being count only.

Here are the v6 node kinds:

node4:   8 +   4 +(4)+   4*8 =   48 bytes
node16:  8 +  16     +  16*8 =  152
node32:  8 +  32     +  32*8 =  296
node128: 8 + 256 + 128/8 + 128*8 = 1304
node256: 8       + 256/8 + 256*8 = 2088

And here are the possible ways we could optimize nodes for space using aset allocation. Parentheses are padding bytes. Even if my math has mistakes, the numbers shouldn't be too far off:

node3:   4 +   3 +(1)+   3*8 =   32 bytes
node6:   4 +   6 +(6)+   6*8 =   64
node13:  4 +  13 +(7)+  13*8 =  128
node28:  4 +  28     +  28*8 =  256
node31:  4 + 256 +  32/8 +  31*8 =  512 (XXX not good)
node94:  4 + 256 +  96/8 +  94*8 = 1024
node220: 4 + 256 + 224/8 + 220*8 = 2048
node256: = 4096

The main disadvantage is that node256 would balloon in size.

--
John Naylor
EDB: http://www.enterprisedb.com
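A possible shape for that 4-byte base, assuming a 2-bit kind and giving the rest of the uint16 to the count (the exact split and names are guesses, and bitfield packing is ABI-dependent, hence the compile-time check):

```c
#include <assert.h>
#include <stdint.h>

typedef struct toy_node_base
{
	uint16_t	kind:2;			/* 4 node kinds fit in 2 bits */
	uint16_t	count:14;		/* up to 16383, plenty for 256 entries */
	uint8_t		shift;
	uint8_t		chunk;
} toy_node_base;

/* On common ABIs the two bitfields share one uint16 storage unit. */
static_assert(sizeof(toy_node_base) == 4, "base node should be 4 bytes");
```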
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Sep 22, 2022 at 7:52 PM John Naylor wrote: > > > On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada wrote: > > Good point. While keeping the chunks in the small nodes in sorted > > order is useful for visiting all keys in sorted order, additional > > branches and memmove calls could be slow. > > Right, the ordering is a property that some users will need, so best to keep it. Although the node128 doesn't have that property -- too slow to do so, I think. Nevermind, I must have been mixing up keys and values there... -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Sep 22, 2022 at 1:26 PM Masahiko Sawada wrote: > > On Thu, Sep 22, 2022 at 1:46 PM John Naylor > wrote: > > While on the subject, I wonder how important it is to keep the chunks in the small nodes in sorted order. That adds branches and memmove calls, and is the whole reason for the recent "pg_lfind_ge" function. > > Good point. While keeping the chunks in the small nodes in sorted > order is useful for visiting all keys in sorted order, additional > branches and memmove calls could be slow. Right, the ordering is a property that some users will need, so best to keep it. Although the node128 doesn't have that property -- too slow to do so, I think. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Thu, Sep 22, 2022 at 1:01 AM Nathan Bossart wrote: > > On Wed, Sep 21, 2022 at 01:17:21PM +0700, John Naylor wrote: > > > In short, this code needs to be lower level so that we still have full > > control while being portable. I will work on this, and also the related > > code for node dispatch. > > Is it possible to use approach #2 here, too? AFAICT space is allocated for > all of the chunks, so there wouldn't be any danger in searching all them > and discarding any results >= node->count. Sure, the caller could pass the maximum node capacity, and then check if the returned index is within the range of the node count. > Granted, we're depending on the > number of chunks always being a multiple of elements-per-vector in order to > avoid the tail path, but that seems like a reasonably safe assumption that > can be covered with comments. Actually, we don't need to depend on that at all. When I said "junk" above, that can be any bytes, as long as we're not reading off the end of allocated memory. We'll never do that here, since the child pointers/values follow. In that case, the caller can hard-code the size (it would even happen to work now to multiply rt_node_kind by 16, to be sneaky). One thing I want to try soon is storing fewer than 16/32 etc entries, so that the whole node fits comfortably inside a power-of-two allocation. That would allow us to use aset without wasting space for the smaller nodes, which would be faster and possibly would solve the fragmentation problem Andres referred to in https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de While on the subject, I wonder how important it is to keep the chunks in the small nodes in sorted order. That adds branches and memmove calls, and is the whole reason for the recent "pg_lfind_ge" function. -- John Naylor EDB: http://www.enterprisedb.com
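For concreteness, approach #2 for a node16-style search might look like the following (x86-64 assumed, SSE2 only; it relies on all 16 chunk bytes being allocated, so loading past 'count' reads junk but never off the end of the node):

```c
#include <emmintrin.h>			/* SSE2, baseline on x86-64 */
#include <stdint.h>

/*
 * Return the index of 'key' among the first 'count' entries of a
 * 16-byte chunk array, or -1.  The full register is compared, then
 * matches at indexes >= count (the "junk") are masked away.
 */
static int
toy_node16_search(const uint8_t chunks[16], int count, uint8_t key)
{
	__m128i		keys = _mm_set1_epi8((char) key);
	__m128i		data = _mm_loadu_si128((const __m128i *) chunks);
	uint32_t	bits = (uint32_t) _mm_movemask_epi8(_mm_cmpeq_epi8(keys, data));

	bits &= (UINT32_C(1) << count) - 1;	/* discard false positives */
	return bits ? __builtin_ctz(bits) : -1;
}
```

The same masking idea extends to 32-byte nodes with AVX2 or two SSE2 loads; the point is that the tail never takes a scalar path.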
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Tue, Sep 20, 2022 at 3:19 PM Masahiko Sawada wrote:
>
> On Fri, Sep 16, 2022 at 4:54 PM John Naylor
> wrote:
> > Here again, I'd rather put this off and focus on getting the "large
> > details" in good enough shape so we can got towards integrating with
> > vacuum.
>
> Thank you for the comments! These above comments are addressed by
> Nathan in a newly derived thread. I'll work on the patch.

I still seem to be out-voted on when to tackle this particular optimization, so I've extended the v6 benchmark code with a hackish function that populates a fixed number of keys, but with different fanouts. (diff attached as a text file) I didn't take particular care to make this scientific, but the following seems pretty reproducible. Note what happens to load and search performance when node16 has 15 entries versus 16:

 fanout | nkeys  | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
     15 | 327680 |          3776512 |         39 |           20
(1 row)

num_keys = 327680, height = 4, n4 = 1, n16 = 23408, n32 = 0, n128 = 0, n256 = 0

 fanout | nkeys  | rt_mem_allocated | rt_load_ms | rt_search_ms
--------+--------+------------------+------------+--------------
     16 | 327680 |          3514368 |         25 |           11
(1 row)

num_keys = 327680, height = 4, n4 = 0, n16 = 21846, n32 = 0, n128 = 0, n256 = 0

In trying to wrap the SIMD code behind layers of abstraction, the latest patch (and Nathan's cleanup) threw it away in almost all cases. To explain, we need to talk about how vectorized code deals with the "tail" that is too small for the register:

1. Use a one-by-one algorithm, like we do for the pg_lfind* variants.
2. Read some junk into the register and mask off false positives from the result.

There are advantages to both depending on the situation. Patch v5 and earlier used #2. Patch v6 used #1, so if a node16 has 15 elements or less, it will iterate over them one-by-one exactly like a node4. Only when full with 16 will the vector path be taken. When another entry is added, the elements are copied to the next bigger node, so there's a *small* window where it's fast. 
In short, this code needs to be lower level so that we still have full control while being portable. I will work on this, and also the related code for node dispatch. Since v6 has some good infrastructure to do low-level benchmarking, I also want to do some experiments with memory management. (I have further comments about the code, but I will put that off until later)

> I'll consider how to integrate with vacuum as the next step. One
> concern for me is how to limit the memory usage to
> maintenance_work_mem. Unlike using a flat array, memory space for
> adding one TID varies depending on the situation. If we want strictly
> not to allow using memory more than maintenance_work_mem, probably we
> need to estimate the memory consumption in a conservative way.

+1

--
John Naylor
EDB: http://www.enterprisedb.com

commit 18407962e96ccec6c9aeeba97412edd762a5a4fe
Author: John Naylor
Date:   Wed Sep 21 11:44:43 2022 +0700

    Add special benchmark function to test effect of fanout

diff --git a/contrib/bench_radix_tree/Makefile b/contrib/bench_radix_tree/Makefile
index b8f70e12d1..952bb0ceae 100644
--- a/contrib/bench_radix_tree/Makefile
+++ b/contrib/bench_radix_tree/Makefile
@@ -7,7 +7,7 @@ OBJS = \
 EXTENSION = bench_radix_tree
 DATA = bench_radix_tree--1.0.sql
-REGRESS = bench
+REGRESS = bench_fixed_height

 ifdef USE_PGXS
 PG_CONFIG = pg_config

diff --git a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
index 6663abe6a4..f2fee15b17 100644
--- a/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
+++ b/contrib/bench_radix_tree/bench_radix_tree--1.0.sql
@@ -40,3 +40,15 @@
 OUT load_ms int8)
 returns record
 as 'MODULE_PATHNAME'
 LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;
+
+create function bench_fixed_height_search(
+fanout int4,
+OUT fanout int4,
+OUT nkeys int8,
+OUT rt_mem_allocated int8,
+OUT rt_load_ms int8,
+OUT rt_search_ms int8
+)
+returns record
+as 'MODULE_PATHNAME'
+LANGUAGE C STRICT VOLATILE PARALLEL UNSAFE;

diff --git a/contrib/bench_radix_tree/bench_radix_tree.c b/contrib/bench_radix_tree/bench_radix_tree.c
index 5806ef7519..0778da2d7b 100644
--- a/contrib/bench_radix_tree/bench_radix_tree.c
+++ b/contrib/bench_radix_tree/bench_radix_tree.c
@@ -13,6 +13,7 @@
 #include "fmgr.h"
 #include "funcapi.h"
 #include "lib/radixtree.h"
+#include
 #include "miscadmin.h"
 #include "utils/timestamp.h"
@@ -24,6 +25,7 @@
 PG_MODULE_MAGIC;

 PG_FUNCTION_INFO_V1(bench_seq_search);
 PG_FUNCTION_INFO_V1(bench_shuffle_search);
 PG_FUNCTION_INFO_V1(bench_load_random_int);
+PG_FUNCTION_INFO_V1(bench_fixed_height_search);

 static radix_tree *rt = NULL;
 static ItemPointer itemptrs = NULL;
Re: [RFC] building postgres with meson - v13
On Wed, Sep 21, 2022 at 7:11 AM Andres Freund wrote: > > Hi, > > On 2022-09-19 19:16:30 -0700, Andres Freund wrote: > > To have some initial "translation" for other developers I've started a wiki > > page with a translation table. Still very WIP: > > https://wiki.postgresql.org/wiki/Meson > > > > For now, a bit of polishing aside, I'm just planning to add a minimal > > explanation of what's happening, and a reference to this thread. > > I added installation instructions for meson for a bunch of platforms, but Small typo: The homebrew section is still labeled with "find MacPorts libraries". -- John Naylor EDB: http://www.enterprisedb.com
Re: Support for Rust
On Mon, Sep 12, 2022 at 10:29 PM Tom Lane wrote: > > Lev Kokotov writes: > > I took a small part of Postgres to get started, so just as a PoC; it > > compiles and runs though. Larger parts will take more work (deleting code, > > not just swapping object files), and more fancy things like PG_TRY() and > > friends will have to be rewritten, so not a short and easy migration. > > Yeah, that's what I thought. "Allow some parts to be written in > language X" soon turns into "Rewrite the entire system in language X, > including fundamental rethinking of memory management, error handling, > and some other things". That's pretty much a non-starter. Added "Rewrite the code in a different language" to "Features We Do Not Want" section of Wiki, referencing the two threads that came up: https://wiki.postgresql.org/wiki/Todo#Features_We_Do_Not_Want -- John Naylor EDB: http://www.enterprisedb.com
Re: Typo in xact.c
On Fri, Sep 16, 2022 at 10:51 AM John Naylor wrote: > > On Fri, Sep 16, 2022 at 10:11 AM Kyotaro Horiguchi > wrote: > > > > The patch seems to me covering all occurances of PG_PROC as PGPROC. > > +1 since this hinders grep-ability. Pushed this. > > I found several uses of PG_PROC as (pg_catalog.)pg_proc, which is > > quite confusing, too.. > > It's pretty obvious to me what that refers to in primnodes.h, although > the capitalization of (some, but not all) catalog names in comments in > that file is a bit strange. Maybe not worth changing there. I left this alone. It's not wrong, and I don't think it's confusing in context. -- John Naylor EDB: http://www.enterprisedb.com
Re: [PoC] Improve dead tuple storage for lazy vacuum
On Fri, Sep 16, 2022 at 1:01 PM Masahiko Sawada wrote:
>
> On Mon, Aug 15, 2022 at 10:39 PM John Naylor
> wrote:
> >
> > bool, buth = and <=. Should be pretty close. Also, i believe if you
> > left this for last as a possible refactoring, it might save some work.

v6 demonstrates why this should have been put off towards the end. (more below)

> > In any case, I'll take a look at the latest patch next month.

Since the CF entry said "Needs Review", I began looking at v5 again this week. Hopefully not too much has changed, but in the future I strongly recommend setting to "Waiting on Author" if a new version is forthcoming. I realize many here share updated patches at any time, but I'd like to discourage the practice especially for large patches.

> I've updated the radix tree patch. It's now separated into two patches.
>
> 0001 patch introduces pg_lsearch8() and pg_lsearch8_ge() (we may find
> better names) that are similar to the pg_lfind8() family but they
> return the index of the key in the vector instead of true/false. The
> patch includes regression tests.

I don't want to do a full review of this just yet, but I'll just point out some problems from a quick glance.

+/*
+ * Return the index of the first element in the vector that is greater than
+ * or eual to the given scalar. Return sizeof(Vector8) if there is no such
+ * element.

That's a bizarre API to indicate non-existence.

+ *
+ * Note that this function assumes the elements in the vector are sorted.
+ */

That is *completely* unacceptable for a general-purpose function.

+#else							/* USE_NO_SIMD */
+	Vector8		r = 0;
+	uint8	   *rp = (uint8 *)
+
+	for (Size i = 0; i < sizeof(Vector8); i++)
+		rp[i] = (((const uint8 *) )[i] == ((const uint8 *) )[i]) ? 0xFF : 0;

I don't think we should try to force the non-simd case to adopt the special semantics of vector comparisons. It's much easier to just use the same logic as the assert builds.
+#ifdef USE_SSE2
+	return (uint32) _mm_movemask_epi8(v);
+#elif defined(USE_NEON)
+	static const uint8 mask[16] = {
+		1 << 0, 1 << 1, 1 << 2, 1 << 3,
+		1 << 4, 1 << 5, 1 << 6, 1 << 7,
+		1 << 0, 1 << 1, 1 << 2, 1 << 3,
+		1 << 4, 1 << 5, 1 << 6, 1 << 7,
+	};
+
+	uint8x16_t	masked = vandq_u8(vld1q_u8(mask), (uint8x16_t) vshrq_n_s8(v, 7));
+	uint8x16_t	maskedhi = vextq_u8(masked, masked, 8);
+
+	return (uint32) vaddvq_u16((uint16x8_t) vzip1q_u8(masked, maskedhi));

For Arm, we need to be careful here. This article goes into a lot of detail for this situation:

https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon

Here again, I'd rather put this off and focus on getting the "large details" in good enough shape so we can go towards integrating with vacuum.

> In addition to two patches, I've attached the third patch. It's not
> part of radix tree implementation but introduces a contrib module
> bench_radix_tree, a tool for radix tree performance benchmarking. It
> measures loading and lookup performance of both the radix tree and a
> flat array.

Excellent! This was high on my wish list.

--
John Naylor
EDB: http://www.enterprisedb.com
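For readers unfamiliar with the x86 primitive the Neon code above is emulating: `_mm_movemask_epi8` collects the high bit of each of the 16 byte lanes into a 16-bit scalar. A hedged scalar model of those semantics, for illustration only (not proposed code):

```c
#include <stdint.h>

/* Scalar model of x86 _mm_movemask_epi8: bit i of the result is the
 * high bit of byte lane i. */
static uint32_t
movemask_scalar(const uint8_t v[16])
{
	uint32_t	result = 0;

	for (int i = 0; i < 16; i++)
		result |= (uint32_t) (v[i] >> 7) << i;
	return result;
}
```

Neon has no single instruction with these semantics, which is why the quoted code needs the mask table plus the zip/add dance, and why the linked article recommends cheaper alternatives when only "any match" or "first match" is actually needed.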
Re: Typo in xact.c
On Fri, Sep 16, 2022 at 10:11 AM Kyotaro Horiguchi wrote: > > At Thu, 15 Sep 2022 22:38:01 +0800, Japin Li wrote in > > > > Hi hacker, > > > > Recently, I find there might be a typo in xact.c comments. The comments > > say "PG_PROC", however, it actually means "PGPROC" structure. Since we > > have pg_proc catalog, and use PG_PROC to reference the catalog [1], so, > > we should use PGPROC to reference the structure. Any thoughts? > > > > [1] src/include/nodes/primnodes.h > > The patch seems to me covering all occurances of PG_PROC as PGPROC. +1 since this hinders grep-ability. > I found several uses of PG_PROC as (pg_catalog.)pg_proc, which is > quite confusing, too.. It's pretty obvious to me what that refers to in primnodes.h, although the capitalization of (some, but not all) catalog names in comments in that file is a bit strange. Maybe not worth changing there. -- John Naylor EDB: http://www.enterprisedb.com
Re: Inconsistencies in error messages
On Wed, Sep 14, 2022 at 5:25 PM John Naylor wrote: > Will commit this way unless there are objections. I forgot to mention yesterday, but this is done. -- John Naylor EDB: http://www.enterprisedb.com
Re: Cleaning up historical portability baggage
On Thu, Sep 15, 2022 at 3:11 PM Ibrar Ahmed wrote: > The patch does not apply successfully; please rebase the patch. There's a good reason for that -- the latest one was committed two weeks ago. The status should still be waiting on author, though, namely for: On Fri, Aug 26, 2022 at 5:28 AM Thomas Munro wrote: > Remaining things from this thread: > * removing --disable-thread-safety > * removing those vestigial HAVE_XXX macros (one by one analysis and patches) > * making Unix sockets secure for Windows in tests -- John Naylor EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Sep 14, 2022 at 11:33 PM Peter Geoghegan wrote: > > On Wed, Sep 14, 2022 at 3:18 AM John Naylor > > Furthermore, it doesn't have to anticipate the maximum size, so there > > is no up front calculation assuming max-tuples-per-page, so it > > automatically uses less memory for less demanding tables. > > The final number of TIDs doesn't seem like the most interesting > information that VM snapshots could provide us when it comes to > building the dead_items TID data structure -- the *distribution* of > TIDs across heap pages seems much more interesting. The "shape" can be > known ahead of time, at least to some degree. It can help with > compression, which will reduce cache misses. My point here was simply that spilling to disk is an admission of failure to utilize memory efficiently and thus shouldn't be a selling point of VM snapshots. Other selling points could still be valid. -- John Naylor EDB: http://www.enterprisedb.com
Re: Inconsistencies in error messages
On Wed, Sep 14, 2022 at 5:01 PM Ekaterina Kiryanova wrote: > > Hi, > > When translating error messages, Alexander Lakhin > () noticed some inconsistencies so I prepared a > small patch to fix those. +1 This one - errmsg("background worker \"%s\": background worker without shared memory access are not supported", + errmsg("background worker \"%s\": background workers without shared memory access are not supported", is a grammar error so worth backpatching, but the rest are cosmetic. Will commit this way unless there are objections. -- John Naylor EDB: http://www.enterprisedb.com
Re: New strategies for freezing, advancing relfrozenxid early
On Wed, Sep 14, 2022 at 12:53 AM Peter Geoghegan wrote:

> This is still only scratching the surface of what is possible with
> dead_items. The visibility map snapshot concept can enable a far more
> sophisticated approach to resource management in vacuumlazy.c. It
> could help us to replace a simple array of item pointers (the current
> dead_items array) with a faster and more space-efficient data
> structure. Masahiko Sawada has done a lot of work on this recently, so
> this may interest him.

I don't quite see how it helps "enable" that. It'd be more logical to me to say the VM snapshot *requires* you to think harder about resource management, since a palloc'd snapshot should surely be counted as part of the configured memory cap that admins control. (Commonly, it'll be less than a few dozen MB, so I'll leave that aside.)

Since Masahiko hasn't (to my knowledge) gone as far as integrating his ideas into vacuum, I'm not sure if the current state of affairs has some snag that a snapshot will ease, but if there is, you haven't described what it is. I do remember your foreshadowing in the radix tree thread a while back, and I do think it's an intriguing idea to combine pages-to-scan and dead TIDs in the same data structure. The devil is in the details, of course. It's worth looking into.

> VM snapshots could also make it practical for the new data structure
> to spill to disk to avoid multiple index scans/passed by VACUUM.

I'm not sure spilling to disk is solving the right problem (as opposed to the hash join case, or to the proposed conveyor belt system which has a broader aim). I've found several times that a customer will ask if raising maintenance work mem from 1GB to 10GB will make vacuum faster. Looking at the count of index scans, it's pretty much always "1", so even if the current approach could scale above 1GB, "no" it wouldn't help to raise that limit. Your mileage may vary, of course.
Continuing my customer example, searching the dead TID list faster *will* make vacuum faster. The proposed tree structure is more memory efficient, and IIUC could scale beyond 1GB automatically since each node is a separate allocation, so the answer will be "yes" in the rare case the current setting is in fact causing multiple index scans. Furthermore, it doesn't have to anticipate the maximum size, so there is no up front calculation assuming max-tuples-per-page, so it automatically uses less memory for less demanding tables. (But +1 for changing that calculation for as long as we do have the single array.) -- John Naylor EDB: http://www.enterprisedb.com
Re: minimum perl version
On Wed, Sep 14, 2022 at 10:46 AM Tom Lane wrote: > > John Naylor writes: > > On Wed, Sep 14, 2022 at 6:47 AM Tom Lane wrote: > >> I've just switched longfin to use built-from-source perl 5.14.0. > > > In that case, here is a quick update with commit message. Not yet any > > change for MSVC, but I can put together something later. > > Looks reasonable just by eyeball, did not test. > > > Since we're much less willing to support older Windows and Visual > > Studio versions, maybe it's low-enough risk defer the check to the > > Meson conversion? I understand our MSVC process will then go away much > > more quickly than autoconf... > > Agreed --- the MSVC scripts are on a pretty short leash now. > Not clear it's worth fixing them for this point. If we've > failed to get rid of them by the time v16 release approaches, > maybe it'd be worth doing something then. Okay, pushed with no further MSVC changes, after doing a round on CI. -- John Naylor EDB: http://www.enterprisedb.com
Re: failing to build preproc.c on solaris with sun studio
On Wed, Sep 14, 2022 at 5:24 AM Thomas Munro wrote: > > On Tue, Sep 6, 2022 at 9:34 AM Tom Lane wrote: > > Peter Eisentraut writes: > > > Why is this being proposed? > > > > Andres is annoyed by the long build time of ecpg, which he has to > > wait for whether he wants to test it or not. I could imagine that > > I might disable ecpg testing on my slowest buildfarm animals, too. > > This message triggered me to try to teach ccache how to cache > preproc.y -> preproc.{c,h}, and I got that basically working[1], but > upstream doesn't want it (yet). I'll try again if the proposed > refactoring to allow more kinds of compiler-like-things goes > somewhere. I think that started with people's struggles with GCC vs > MSVC. Given the simplicity of this case, though, I suppose we could > have a little not-very-general shell/python/whatever wrapper script -- > just compute a checksum of the input and keep the output files around. If we're going to go to this length, it seems more straightforward to just check the .c/.h files into version control, like every other project that I have such knowledge of. To be fair, our grammar changes much more often. One other possible deal-breaker of that is that it makes it more painful for forks to maintain additional syntax. -- John Naylor EDB: http://www.enterprisedb.com
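Such a wrapper could be as small as the following sketch. All file names, the cache location, and the function name here are assumptions made for illustration; this is not a proposed patch, just the "checksum the input, keep the output around" idea spelled out:

```shell
# cache_gen GRAMMAR GENCMD: run GENCMD (e.g. "bison -d -o preproc.c") only
# when GRAMMAR's checksum is not already in the cache; on a hit, reuse the
# stored preproc.c/preproc.h instead.
cache_gen() {
    in=$1
    gen=$2
    cache=.grammar-cache
    mkdir -p "$cache"
    sum=$(cksum "$in" | cut -d' ' -f1)
    if [ -f "$cache/$sum.c" ] && [ -f "$cache/$sum.h" ]; then
        cp "$cache/$sum.c" preproc.c    # cache hit: skip the generator
        cp "$cache/$sum.h" preproc.h
    else
        $gen "$in"                      # cache miss: run the real generator
        cp preproc.c "$cache/$sum.c"
        cp preproc.h "$cache/$sum.h"
    fi
}
```

Unlike ccache, this knows nothing about the tool being wrapped — it only works because bison's output is a pure function of its input file.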
Re: build remaining Flex files standalone
On Wed, Sep 14, 2022 at 6:10 AM Andres Freund wrote: > > On 2022-09-12 14:49:50 +0700, John Naylor wrote: > > CI builds fine. For v2 I only adjusted the commit message. I'll push > > in a couple days unless there are objections. > > Makes sense to me. Thanks for working on it! This is done. -- John Naylor EDB: http://www.enterprisedb.com
Re: minimum perl version
On Wed, Sep 14, 2022 at 6:47 AM Tom Lane wrote: > > I've just switched longfin to use built-from-source perl 5.14.0. In that case, here is a quick update with commit message. Not yet any change for MSVC, but I can put together something later. Since we're much less willing to support older Windows and Visual Studio versions, maybe it's low-enough risk defer the check to the Meson conversion? I understand our MSVC process will then go away much more quickly than autoconf... -- John Naylor EDB: http://www.enterprisedb.com From d8e241968395c552d667c1eb393cbaea12f0df6c Mon Sep 17 00:00:00 2001 From: John Naylor Date: Wed, 14 Sep 2022 09:58:13 +0700 Subject: [PATCH v2] Bump minimum Perl version to 5.14 The oldest vendor-shipped Perl in the buildfarm is 5.14.2, which is the last version that Debian Wheezy shipped. That OS is EOL, but we keep it running because there is no other convenient way to test certain non-mainstream 32-bit platforms. There is no bugfix in the .2 release that is required, and yet it's also not the latest minor release -- that would be 5.14.4. To clarify the situation, we also have arranged the buildfarm to test 5.14.0. That allows configure scripts and documentation to state 5.14 without fine print. 
Discussion: https://www.postgresql.org/message-id/20220902181553.ev4pgzhubhdkg...@awork3.anarazel.de --- config/perl.m4 | 4 ++-- configure| 6 +++--- doc/src/sgml/install-windows.sgml| 2 +- doc/src/sgml/installation.sgml | 4 ++-- src/pl/plperl/plc_perlboot.pl| 1 - src/test/perl/PostgreSQL/Test/Cluster.pm | 2 +- src/test/perl/README | 10 +++--- src/tools/msvc/gendef.pl | 1 - src/tools/pgindent/pgindent | 1 - 9 files changed, 12 insertions(+), 19 deletions(-) diff --git a/config/perl.m4 b/config/perl.m4 index c9fd91397c..8126e79f67 100644 --- a/config/perl.m4 +++ b/config/perl.m4 @@ -11,11 +11,11 @@ if test "$PERL"; then pgac_perl_version=`$PERL -v 2>/dev/null | sed -n ['s/This is perl.*v[a-z ]*\([0-9]\.[0-9][0-9.]*\).*$/\1/p']` AC_MSG_NOTICE([using perl $pgac_perl_version]) if echo "$pgac_perl_version" | sed ['s/[.a-z_]/ /g'] | \ -$AWK '{ if ([$]1 == 5 && ([$]2 > 8 || ($[2] == 8 && [$]3 >= 3))) exit 1; else exit 0;}' +$AWK '{ if ([$]1 == 5 && ([$]2 >= 14)) exit 1; else exit 0;}' then AC_MSG_WARN([ *** The installed version of Perl, $PERL, is too old to use with PostgreSQL. -*** Perl version 5.8.3 or later is required, but this is $pgac_perl_version.]) +*** Perl version 5.14 or later is required, but this is $pgac_perl_version.]) PERL="" fi fi diff --git a/configure b/configure index fd2a782454..f325bd85b8 100755 --- a/configure +++ b/configure @@ -10491,14 +10491,14 @@ if test "$PERL"; then { $as_echo "$as_me:${as_lineno-$LINENO}: using perl $pgac_perl_version" >&5 $as_echo "$as_me: using perl $pgac_perl_version" >&6;} if echo "$pgac_perl_version" | sed 's/[.a-z_]/ /g' | \ -$AWK '{ if ($1 == 5 && ($2 > 8 || ($2 == 8 && $3 >= 3))) exit 1; else exit 0;}' +$AWK '{ if ($1 == 5 && ($2 >= 14)) exit 1; else exit 0;}' then { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: *** The installed version of Perl, $PERL, is too old to use with PostgreSQL. -*** Perl version 5.8.3 or later is required, but this is $pgac_perl_version." 
>&5 +*** Perl version 5.14 or later is required, but this is $pgac_perl_version." >&5 $as_echo "$as_me: WARNING: *** The installed version of Perl, $PERL, is too old to use with PostgreSQL. -*** Perl version 5.8.3 or later is required, but this is $pgac_perl_version." >&2;} +*** Perl version 5.14 or later is required, but this is $pgac_perl_version." >&2;} PERL="" fi fi diff --git a/doc/src/sgml/install-windows.sgml b/doc/src/sgml/install-windows.sgml index c00ab2b4b2..29d3294dc8 100644 --- a/doc/src/sgml/install-windows.sgml +++ b/doc/src/sgml/install-windows.sgml @@ -190,7 +190,7 @@ $ENV{MSBFLAGS}="/m"; or Cygwin Perl will not work. It must also be present in the PATH. Binaries can be downloaded from https://www.activestate.com;> - (Note: version 5.8.3 or later is required, + (Note: version 5.14 or later is required, the free Standard Distribution is sufficient). diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml index 7c79608e55..319c7e6966 100644 --- a/doc/src/sgml/installation.sgml +++ b/doc/src/sgml/installation.sgml @@ -165,7 +165,7 @@ su - postgres PL/Perl you need a full Perl installation, including the libperl library and the header files. - The minimum required version is Perl 5.8.3. + The minimu
Re: minimum perl version
On Tue, Sep 13, 2022 at 5:53 PM John Naylor wrote: > > Until such time as that happens, here is a draft to require 5.14.2. As soon as I hit send, it occurred to me that we don't check the perl version on Windows, since (I seem to recall) 5.8.3 was too old to be an option on that platform. We'll have to add a new check somewhere. -- John Naylor EDB: http://www.enterprisedb.com
Re: minimum perl version
On Sat, Sep 3, 2022 at 2:11 AM Tom Lane wrote: > > Andres Freund writes: > > 5.14 would be a trivial lift as far as the buildfarm is concerned. > > Yeah, that seems like a reasonable new minimum for Perl. I might > see about setting up an animal running 5.14.0, just so we can say > "5.14" in the docs without fine print. Until such time as that happens, here is a draft to require 5.14.2. -- John Naylor EDB: http://www.enterprisedb.com config/perl.m4 | 4 ++-- configure| 6 +++--- doc/src/sgml/install-windows.sgml| 2 +- doc/src/sgml/installation.sgml | 4 ++-- src/pl/plperl/plc_perlboot.pl| 1 - src/test/perl/PostgreSQL/Test/Cluster.pm | 2 +- src/test/perl/README | 10 +++--- src/tools/msvc/gendef.pl | 1 - src/tools/pgindent/pgindent | 1 - 9 files changed, 12 insertions(+), 19 deletions(-) diff --git a/config/perl.m4 b/config/perl.m4 index c9fd91397c..29f54bbb79 100644 --- a/config/perl.m4 +++ b/config/perl.m4 @@ -11,11 +11,11 @@ if test "$PERL"; then pgac_perl_version=`$PERL -v 2>/dev/null | sed -n ['s/This is perl.*v[a-z ]*\([0-9]\.[0-9][0-9.]*\).*$/\1/p']` AC_MSG_NOTICE([using perl $pgac_perl_version]) if echo "$pgac_perl_version" | sed ['s/[.a-z_]/ /g'] | \ -$AWK '{ if ([$]1 == 5 && ([$]2 > 8 || ($[2] == 8 && [$]3 >= 3))) exit 1; else exit 0;}' +$AWK '{ if ([$]1 == 5 && ([$]2 > 14 || ($[2] == 14 && [$]3 >= 2))) exit 1; else exit 0;}' then AC_MSG_WARN([ *** The installed version of Perl, $PERL, is too old to use with PostgreSQL. 
-*** Perl version 5.8.3 or later is required, but this is $pgac_perl_version.]) +*** Perl version 5.14.2 or later is required, but this is $pgac_perl_version.]) PERL="" fi fi diff --git a/configure b/configure index fd2a782454..160c181441 100755 --- a/configure +++ b/configure @@ -10491,14 +10491,14 @@ if test "$PERL"; then { $as_echo "$as_me:${as_lineno-$LINENO}: using perl $pgac_perl_version" >&5 $as_echo "$as_me: using perl $pgac_perl_version" >&6;} if echo "$pgac_perl_version" | sed 's/[.a-z_]/ /g' | \ -$AWK '{ if ($1 == 5 && ($2 > 8 || ($2 == 8 && $3 >= 3))) exit 1; else exit 0;}' +$AWK '{ if ($1 == 5 && ($2 > 14 || ($2 == 14 && $3 >= 2))) exit 1; else exit 0;}' then { $as_echo "$as_me:${as_lineno-$LINENO}: WARNING: *** The installed version of Perl, $PERL, is too old to use with PostgreSQL. -*** Perl version 5.8.3 or later is required, but this is $pgac_perl_version." >&5 +*** Perl version 5.14.2 or later is required, but this is $pgac_perl_version." >&5 $as_echo "$as_me: WARNING: *** The installed version of Perl, $PERL, is too old to use with PostgreSQL. -*** Perl version 5.8.3 or later is required, but this is $pgac_perl_version." >&2;} +*** Perl version 5.14.2 or later is required, but this is $pgac_perl_version." >&2;} PERL="" fi fi diff --git a/doc/src/sgml/install-windows.sgml b/doc/src/sgml/install-windows.sgml index c00ab2b4b2..2e41d75db0 100644 --- a/doc/src/sgml/install-windows.sgml +++ b/doc/src/sgml/install-windows.sgml @@ -190,7 +190,7 @@ $ENV{MSBFLAGS}="/m"; or Cygwin Perl will not work. It must also be present in the PATH. Binaries can be downloaded from https://www.activestate.com;> - (Note: version 5.8.3 or later is required, + (Note: version 5.14.2 or later is required, the free Standard Distribution is sufficient). 
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml index 7c79608e55..5d7c573729 100644 --- a/doc/src/sgml/installation.sgml +++ b/doc/src/sgml/installation.sgml @@ -165,7 +165,7 @@ su - postgres PL/Perl you need a full Perl installation, including the libperl library and the header files. - The minimum required version is Perl 5.8.3. + The minimum required version is Perl 5.14.2. Since PL/Perl will be a shared library, the libperl libperl library must be a shared library @@ -325,7 +325,7 @@ su - postgres perl - Perl 5.8.3 or later is needed to build from a Git checkout, + Perl 5.14.2 or later is needed to build from a Git checkout, or if you changed the input files for any of the build steps that use Perl scripts. If building on Windows you will need Perl in any case. Perl is diff --git a/src/pl/plperl/plc_perlboot.pl b/src/pl/plperl/plc_perlboot.pl index 8fd7f998bc..72cb53f6e3 100644 --- a/src/pl/plperl/plc_perlboot.pl +++ b/src/pl/plperl/plc_perlboot.pl @@ -6,7 +6,6 @@ use strict; use warnings; -use 5.008001; use vars qw(%_SHARED $_TD); Pos
Re: broken table formatting in psql
On Thu, Sep 8, 2022 at 12:39 PM John Naylor wrote: > > On Fri, Sep 2, 2022 at 3:19 PM Kyotaro Horiguchi > wrote: > > > > At Fri, 2 Sep 2022 13:43:50 +0700, John Naylor > > wrote in > > > If there is any doubt about including all of Cf, we could also just > > > add a branch in wchar.c to hard-code the 200B-200F range. > > > > If every way has defect to the similar extent, I think we will choose > > to use authoritative data at least for the first step. We might want > > to have additional filtering on it but it would be another issue, > > maybe. > > > > Attached is the first cut of that. (The commit messages is not great, > > though.) > > Okay, the patch looks good to me overall. As discussed, I pushed to master only, with only one additional comment in the perl script to describe Me/Mn/Cf. -- John Naylor EDB: http://www.enterprisedb.com
Re: proposal: possibility to read dumped table's name from file
On Mon, Sep 12, 2022 at 8:10 PM Pavel Stehule wrote: > > po 12. 9. 2022 v 9:59 odesílatel Daniel Gustafsson napsal: >> I don't the capabilities of the tool is all that interesting compared to the >> long term maintainability and readability of the source code. With make distprep and maintainer-clean, separate makefile and MSVC build logic a short time before converting to Meson, I'm not sure that even the short term maintainability here is a good trade off for what we're getting. > The parser in bison/flex does the same work and it is true, so code is more > readable. Although for this case, a handy written parser was trivial too. If the hand-written version is trivial, then we should prefer it. -- John Naylor EDB: http://www.enterprisedb.com
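As an illustration of how little code a trivial hand-written parser needs — a hedged sketch only; the line format and keywords below are assumptions for the example, not the actual patch's syntax:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Parse one filter line of the assumed form "include table NAME" or
 * "exclude table NAME".  On success, returns true, sets *is_include,
 * and copies the object name into name[]. */
static bool
parse_filter_line(const char *line, bool *is_include,
				  char *name, size_t namelen)
{
	char		kw[16];
	char		objtype[16];
	char		obj[256];

	if (sscanf(line, "%15s %15s %255s", kw, objtype, obj) != 3)
		return false;			/* wrong number of fields */
	if (strcmp(objtype, "table") != 0)
		return false;			/* only tables in this sketch */
	if (strcmp(kw, "include") == 0)
		*is_include = true;
	else if (strcmp(kw, "exclude") == 0)
		*is_include = false;
	else
		return false;			/* unknown keyword */
	snprintf(name, namelen, "%s", obj);
	return true;
}
```

The trade-off in the thread is exactly this: the hand-written version is short and dependency-free, while bison/flex buys a formal grammar at the cost of build machinery.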
Re: broken table formatting in psql
On Mon, Sep 12, 2022 at 12:44 PM Pavel Stehule wrote: > This is not a critical issue, really. On second thought, I don't see the > point in releasing fresh Postgres with this bug, where there is know bugfix - > and this bugfix should be compatible (at this moment) with 16. I agree the actual logic/data change is low-risk. The patch renames two files, which seems a bit much this late in the cycle. Maybe that's okay, but I'd like someone else to opine before doing so. -- John Naylor EDB: http://www.enterprisedb.com
Re: build remaining Flex files standalone
On Fri, Sep 9, 2022 at 12:18 PM John Naylor wrote: > > It seems gramparse.h isn't installed now? In any case, here's a patch > to move gramparse to the backend dir and stop symlinking/ installing > gram.h. Looking more closely at src/include/Makefile, this is incorrect -- all files in SUBDIRS are copied over. So moving gramparse.h to the backend will automatically not install it. The explicit install rule for gram.h was for vpath builds. CI builds fine. For v2 I only adjusted the commit message. I'll push in a couple days unless there are objections. -- John Naylor EDB: http://www.enterprisedb.com From dcfe0976e6118a2c385b4473ceab76a742a6f1ff Mon Sep 17 00:00:00 2001 From: John Naylor Date: Fri, 9 Sep 2022 11:57:39 +0700 Subject: [PATCH v2] Move gramparse.h to src/backend/parser This header is semi-private, being used only in files related to raw parsing, so move to the backend directory where those files live. This allows removal of makefile rules that symlink gram.h to src/include/parser, since gramparse.h can now include gram.h from within the same directory. While at it, stop installing gram.h. 
Per suggestion from Andres Freund and Peter Eisentraut Discussion: https://www.postgresql.org/message-id/20220904181759.px6uosll6zbxcum5%40awork3.anarazel.de --- src/backend/Makefile| 7 +-- src/backend/parser/gram.y | 2 +- src/{include => backend}/parser/gramparse.h | 4 ++-- src/backend/parser/parser.c | 2 +- src/backend/parser/scan.l | 2 +- src/include/Makefile| 4 ++-- src/include/parser/.gitignore | 1 - src/tools/msvc/Install.pm | 4 src/tools/pginclude/cpluspluscheck | 1 - src/tools/pginclude/headerscheck| 1 - 10 files changed, 8 insertions(+), 20 deletions(-) rename src/{include => backend}/parser/gramparse.h (97%) delete mode 100644 src/include/parser/.gitignore diff --git a/src/backend/Makefile b/src/backend/Makefile index d0d34821d5..181c217fae 100644 --- a/src/backend/Makefile +++ b/src/backend/Makefile @@ -153,12 +153,7 @@ submake-utils-headers: .PHONY: generated-headers -generated-headers: $(top_builddir)/src/include/parser/gram.h $(top_builddir)/src/include/storage/lwlocknames.h submake-catalog-headers submake-nodes-headers submake-utils-headers - -$(top_builddir)/src/include/parser/gram.h: parser/gram.h - prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \ - cd '$(dir $@)' && rm -f $(notdir $@) && \ - $(LN_S) "$$prereqdir/$(notdir $<)" . 
+generated-headers: $(top_builddir)/src/include/storage/lwlocknames.h submake-catalog-headers submake-nodes-headers submake-utils-headers $(top_builddir)/src/include/storage/lwlocknames.h: storage/lmgr/lwlocknames.h prereqdir=`cd '$(dir $<)' >/dev/null && pwd` && \ diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index ea33784316..82f03fc9c9 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -55,9 +55,9 @@ #include "catalog/pg_trigger.h" #include "commands/defrem.h" #include "commands/trigger.h" +#include "gramparse.h" #include "nodes/makefuncs.h" #include "nodes/nodeFuncs.h" -#include "parser/gramparse.h" #include "parser/parser.h" #include "storage/lmgr.h" #include "utils/date.h" diff --git a/src/include/parser/gramparse.h b/src/backend/parser/gramparse.h similarity index 97% rename from src/include/parser/gramparse.h rename to src/backend/parser/gramparse.h index 41b753a96c..c4726c618d 100644 --- a/src/include/parser/gramparse.h +++ b/src/backend/parser/gramparse.h @@ -11,7 +11,7 @@ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * - * src/include/parser/gramparse.h + * src/backend/parser/gramparse.h * *- */ @@ -26,7 +26,7 @@ * NB: include gram.h only AFTER including scanner.h, because scanner.h * is what #defines YYLTYPE. */ -#include "parser/gram.h" +#include "gram.h" /* * The YY_EXTRA data that a flex scanner allows us to pass around. 
diff --git a/src/backend/parser/parser.c b/src/backend/parser/parser.c
index 50227cc098..ef85d3bb68 100644
--- a/src/backend/parser/parser.c
+++ b/src/backend/parser/parser.c
@@ -22,7 +22,7 @@
 #include "postgres.h"
 
 #include "mb/pg_wchar.h"
-#include "parser/gramparse.h"
+#include "gramparse.h"
 #include "parser/parser.h"
 #include "parser/scansup.h"

diff --git a/src/backend/parser/scan.l b/src/backend/parser/scan.l
index 882e081aae..db8b0fe8eb 100644
--- a/src/backend/parser/scan.l
+++ b/src/backend/parser/scan.l
@@ -36,
Re: broken table formatting in psql
On Thu, Sep 8, 2022 at 12:51 PM Pavel Stehule wrote:
>
>> Does anyone want to advocate for backpatching these missing ranges to
>> v12 and up? v12 still has a table in-line so trivial to remedy, but
>> v13 and up use a script, so these exceptions would likely have to use
>> hard-coded branches to keep from bringing in new changes.
>>
>> If so, does anyone want to advocate for including this patch in v15?
>> It claims Unicode 14.0.0, and this would make that claim more
>> technically correct as well as avoiding additional branches.
>
> I think it can be fixed just in v15 and master. This issue has no impact on SQL.

Well, if the regressions from v11 are not important enough to backpatch, there is not as much of a case to backpatch the full fix to v15 either.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: [PATCH] Clarify the comments about varlena header encoding
On Sun, Sep 11, 2022 at 5:06 PM Aleksander Alekseev wrote:
>
> Hi John,
>
> Many thanks for the feedback!
>
> > Or put the now-standard 0b in front.
>
> Good idea.

Now that I look at the results, though, it's distracting and not good for readability. I'm not actually sure we need to do anything here, but I am somewhat in favor of putting [un]aligned in parentheses, as already discussed.

Even there, in the first email you said:

> Also one can read this as "aligned, uncompressed
> data" (which again is wrong).

I'm not sure it rises to the level of "wrong", because a blob of bytes immediately after an aligned uint32 is in fact aligned. The important thing is: a zero byte is always either a padding byte or part of a 4-byte header, so it's the alignment of the header we really care about.

> > I think the problem is ambiguity about what a "toast pointer" is. This
> > comment:
> >
> > * struct varatt_external is a traditional "TOAST pointer", that is, the
>
> Right. The comment for varatt_external says that it IS a TOAST
> pointer.

Well, the word "traditional" is not very informative, but it is there. And afterwards there is also varatt_indirect, varatt_expanded, and varattrib_1b_e, which all mention "TOAST pointer".

> However the comments for varlena headers bit layout
> implicitly include it into a TOAST pointer, which contradicts the
> previous comments. I suggest we fix this ambiguity by explicitly
> enumerating the type tag in the comments for varlena headers.

- * 1000 1-byte length word, unaligned, TOAST pointer
+ * 0b1000 1-byte length word (unaligned), type tag, TOAST pointer

This is distracting from the point of this whole comment, which, I will say again, is: how to look at the first byte to determine what kind of varlena we're looking at. There is no reason to mention the type tag here at all.

- * In TOAST pointers the va_tag field (see varattrib_1b_e) is used to discern
- * the specific type and length of the pointer datum.
+ * For the TOAST pointers the type tag (see varattrib_1b_e.va_tag field) is
+ * used to discern the specific type and length of the pointer datum.

I don't think this clarifies anything, it's just a rephrasing.

More broadly, I think the description of varlenas in this header is at a kind of "local maximum" -- minor adjustments are more likely to make it worse. To significantly improve clarity might require a larger rewriting, but I'm not personally interested in taking part in that.

--
John Naylor
EDB: http://www.enterprisedb.com
Re: proposal: possibility to read dumped table's name from file
On Thu, Sep 8, 2022 at 7:32 PM Daniel Gustafsson wrote:
> [v3]

Note that the grammar has shift-reduce conflicts. If you run a fairly recent Bison, you can show them like this:

bison -Wno-deprecated -Wcounterexamples -d -o filterparse.c filterparse.y
filterparse.y: warning: 2 shift/reduce conflicts [-Wconflicts-sr]
filterparse.y: warning: shift/reduce conflict on token C_INCLUDE [-Wcounterexamples]
  Example: • C_INCLUDE include_object pattern
  Shift derivation
    Filters
    ↳ 3: Filter
         ↳ 4: • C_INCLUDE include_object pattern
  Reduce derivation
    Filters
    ↳ 2: Filters  Filter
         ↳ 1: ε •      ↳ 4: C_INCLUDE include_object pattern
filterparse.y: warning: shift/reduce conflict on token C_EXCLUDE [-Wcounterexamples]
  Example: • C_EXCLUDE exclude_object pattern
  Shift derivation
    Filters
    ↳ 3: Filter
         ↳ 5: • C_EXCLUDE exclude_object pattern
  Reduce derivation
    Filters
    ↳ 2: Filters  Filter
         ↳ 1: ε •      ↳ 5: C_EXCLUDE exclude_object pattern

--
John Naylor
EDB: http://www.enterprisedb.com
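The counterexamples above point at a classic list-grammar ambiguity: when a list nonterminal has both an empty production and a single-element production, a lone Filter can be derived two ways (directly via rule 3, or via the empty rule 1 followed by rule 2). The sketch below shows the shape of the problem and the usual fix; rule numbering follows the counterexample output, and the actual filterparse.y productions may differ.

```yacc
/* Ambiguous: a single Filter parses either via rule 3 directly,
 * or via rule 1 (empty) followed by rule 2. */
Filters:  /* EMPTY */       /* rule 1 */
        | Filters Filter    /* rule 2 */
        | Filter            /* rule 3 */
        ;

/* Conflict-free: drop rule 3; the left-recursive pair alone
 * already derives zero or more Filter items. */
Filters:  /* EMPTY */
        | Filters Filter
        ;
```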
Re: Minimum bison and flex versions
On Fri, Sep 9, 2022 at 12:07 AM Andres Freund wrote:
>
> Hi,
>
> On 2022-09-06 13:32:32 +0700, John Naylor wrote:
> > Here are autoconf-only patches to that effect.
>
> Looks like you did actually include src/tools/msvc as well :)

Ah, in my head I meant "no patches for the Meson branch". :-)

CI fails on MSVC pg_upgrade, but that shows up in some existing CF bot failures and in any case doesn't have a grammar, so I have pushed, thanks for looking!

> > Since the retirement of some older buildfarm members, the oldest Bison
> > that gets regular testing is 2.3. MacOS ships that version, and will
> > continue doing so for the foreseeable future because of Apple's policy
> > regarding GPLv3. While Mac users could use a package manager to install
> > a newer version, there is no compelling reason to do so at this time.
>
> s/to do so/to force them to do so/?

There are good reasons for a dev to install a newer Bison, like better diagnostics, so I used this language.

--
John Naylor
EDB: http://www.enterprisedb.com