Hi, On Mon, Aug 14, 2023 at 8:05 PM John Naylor <john.nay...@enterprisedb.com> wrote: > > On Thu, Jul 13, 2023 at 3:09 PM Masahiko Sawada <sawada.m...@gmail.com> wrote: > > > > 0007, 0008, 0010, and 0011 are straightforward and agree to merge them.
Thank you for updating the patch! > > [Part 1 - clear the deck of earlier performance work etc] > > Thanks for taking a look! I've merged 0007 and 0008. The others need a > performance test to justify them -- an eyeball check is not enough. I've now > made the time to do that. > > ==== sparse loads > > v38 0001-0006 (still using node3 for this test only): > > select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from > bench_load_random_int(100 * 1000 * (1+x-x))) a; > avg > --------------------- > 27.1000000000000000 > > select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from > bench_load_random_int(500 * 1000 * (1+x-x))) a; > avg > ---------------------- > 165.6333333333333333 > > v38-0007-Optimize-RT_EXTEND_DOWN.patch > > select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from > bench_load_random_int(100 * 1000 * (1+x-x))) a; > avg > --------------------- > 25.0900000000000000 > > select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from > bench_load_random_int(500 * 1000 * (1+x-x))) a; > avg > ---------------------- > 157.3666666666666667 > > That seems worth doing. > > v38-0008-Use-4-children-for-node-4-also-attempt-portable-.patch > > This combines two things because I messed up a rebase: Use fanout of 4, and > try some macros for shmem sizes, both 32- and 64-bit. Looking at this much, I > no longer have a goal to have a separate set of size-classes for non-SIMD > platforms, because that would cause global maintenance problems -- it's > probably better to reduce worst-case search time where necessary. That would > be much more localized. > > > I have some questions on 0009 patch: > > > According to the comment, this optimization is for only gcc? > > No, not at all. That tells me the comment is misleading. > > > I think this change reduces > > readability and maintainability. > > Well, that much is obvious. What is not obvious is how much it gains us over > the alternatives. I do have a simpler idea, though... > > ==== load mostly node4 > > select * from bench_search_random_nodes(250*1000, '0xFFFFFF'); > n4 = 42626, n16 = 21492, n32 = 0, n64 = 0, n256 = 257 > mem_allocated | load_ms | search_ms > ---------------+---------+----------- > 7352384 | 25 | 0 > > v38-0009-TEMP-take-out-search-time-from-bench.patch > > This is just to allow LATERAL queries for better measurements. > > select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from > bench_search_random_nodes(250*1000 * (1+x-x), '0xFFFFFF')) a; > > avg > --------------------- > 24.8333333333333333 0007, 0008, and 0009 look good to me. > > v38-0010-Try-a-simpler-way-to-avoid-memmove.patch > > This slightly rewrites the standard loop so that gcc doesn't turn it into a > memmove(). Unlike the patch you didn't like, this *is* gcc-specific. (needs a > comment, which I forgot) > > avg > --------------------- > 21.9600000000000000 > > So, that's not a trivial difference. I wasn't a big fan of Andres' __asm("") > workaround, but that may be just my ignorance about it. We need something > like either of the two. > > v38-0011-Optimize-add_child_4-take-2.patch > avg > --------------------- > 21.3500000000000000 > > This is possibly faster than v38-0010, but looking like not worth the > complexity, assuming the other way avoids the bug going forward. I prefer 0010 but is it worth testing with other compilers such as clang? > > > According to the bugzilla ticket > > referred to in the comment, it's realized as a bug in the community, > > so once the gcc bug fixes, we might no longer need this trick, no? > > No comment in two years... > > v38-0013-Use-constant-for-initial-copy-of-chunks-and-chil.patch > > This is the same as v37-0011. I wasn't quite satisfied with it since it still > has two memcpy() calls, but it actually seems to regress: > > avg > --------------------- > 22.0900000000000000 > > v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch > > This patch uses a single loop for the copy. > > avg > --------------------- > 21.0300000000000000 > > Within noise level of v38-0011, but it's small and simple, so I like it, at > least for small arrays. Agreed. > > v38-0014-node48-Remove-need-for-RIGHTMOST_ONE-in-radix-tr.patch > v38-0015-node48-Remove-dead-code-by-using-loop-local-var.patch > > Just small cleanups. > > v38-0016-Use-memcpy-for-children-when-growing-into-node48.patch > > Makes sense, but untested. Agreed. BTW cfbot reported that some regression tests failed due to OOM. I've attached the patch to fix it. > > =============== > [Part 2] > > Per off-list discussion with Masahiko, it makes sense to take some of the > ideas I've used locally on tidbitmap, and start incorporating them into > earlier vacuum work to get that out the door faster. With that in mind... > > v38-0017-Make-tidstore-more-similar-to-tidbitmap.patch > > This uses a simplified PagetableEntry (unimaginatively called BlocktableEntry > just to avoid confusion), to be replaced with the real thing at a later date. > This is still fixed size, to be replaced with a varlen type. That's more readable. > > Looking at the tidstore tests again after some months, I'm not particularly > pleased with the amount of code required for how little it seems to be > testing, nor the output when something fails. (I wonder how hard it would be > to have SQL functions that add blocks/offsets to the tid store, and emit > tuples of tids found in the store.) It would not be hard to have such SQL functions. I'll try it. > > I'm also concerned about the number of places that have to know if the store > is using shared memory or not. Something to think about later. > > v38-0018-Consolidate-inserting-updating-values.patch > > This is something I coded up to get to an API more similar to one in > simplehash, as used in tidbitmap.c. It seem worth doing on its own to reduce > code duplication, and also simplifies coding of varlen types and > "runtime-embeddable values". Agreed. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
fix_oom.patch
Description: Binary data