Hi,

On Mon, Aug 14, 2023 at 8:05 PM John Naylor
<john.nay...@enterprisedb.com> wrote:
>
> On Thu, Jul 13, 2023 at 3:09 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > 0007, 0008, 0010, and 0011 are straightforward and agree to merge them.

Thank you for updating the patch!

>
> [Part 1 - clear the deck of earlier performance work etc]
>
> Thanks for taking a look! I've merged 0007 and 0008. The others need a 
> performance test to justify them -- an eyeball check is not enough. I've now 
> made the time to do that.
>
> ==== sparse loads
>
> v38 0001-0006 (still using node3 for this test only):
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from 
> bench_load_random_int(100 * 1000 * (1+x-x))) a;
>          avg
> ---------------------
>  27.1000000000000000
>
> select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from 
> bench_load_random_int(500 * 1000 * (1+x-x))) a;
>          avg
> ----------------------
>  165.6333333333333333
>
> v38-0007-Optimize-RT_EXTEND_DOWN.patch
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from 
> bench_load_random_int(100 * 1000 * (1+x-x))) a;
>          avg
> ---------------------
>  25.0900000000000000
>
> select avg(load_ms) from generate_series(1,30) x(x), lateral (select * from 
> bench_load_random_int(500 * 1000 * (1+x-x))) a;
>          avg
> ----------------------
>  157.3666666666666667
>
> That seems worth doing.
>
> v38-0008-Use-4-children-for-node-4-also-attempt-portable-.patch
>
> This combines two things because I messed up a rebase: Use fanout of 4, and 
> try some macros for shmem sizes, both 32- and 64-bit. Looking at this much, I 
> no longer have a goal to have a separate set of size-classes for non-SIMD 
> platforms, because that would cause global maintenance problems -- it's 
> probably better to reduce worst-case search time where necessary. That would 
> be much more localized.
>
> > I have some questions on 0009 patch:
>
> > According to the comment, this optimization is for only gcc?
>
> No, not at all. That tells me the comment is misleading.
>
> > I think this change reduces
> > readability and maintainability.
>
> Well, that much is obvious. What is not obvious is how much it gains us over 
> the alternatives. I do have a simpler idea, though...
>
> ==== load mostly node4
>
> select * from bench_search_random_nodes(250*1000, '0xFFFFFF');
> n4 = 42626, n16 = 21492, n32 = 0, n64 = 0, n256 = 257
>  mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
>        7352384 |      25 |         0
>
> v38-0009-TEMP-take-out-search-time-from-bench.patch
>
> This is just to allow LATERAL queries for better measurements.
>
> select avg(load_ms) from generate_series(1,100) x(x), lateral (select * from 
> bench_search_random_nodes(250*1000 * (1+x-x), '0xFFFFFF')) a;
>
>          avg
> ---------------------
>  24.8333333333333333

0007, 0008, and 0009 look good to me.

>
> v38-0010-Try-a-simpler-way-to-avoid-memmove.patch
>
> This slightly rewrites the standard loop so that gcc doesn't turn it into a 
> memmove(). Unlike the patch you didn't like, this *is* gcc-specific. (needs a 
> comment, which I forgot)
>
>          avg
> ---------------------
>  21.9600000000000000
>
> So, that's not a trivial difference. I wasn't a big fan of Andres' __asm("") 
> workaround, but that may be just my ignorance about it. We need something 
> like either of the two.
>
> v38-0011-Optimize-add_child_4-take-2.patch
>          avg
> ---------------------
>  21.3500000000000000
>
> This is possibly faster than v38-0010, but looking like not worth the 
> complexity, assuming the other way avoids the bug going forward.

I prefer 0010 but is it worth testing with other compilers such as clang?

>
> > According to the bugzilla ticket
> > referred to in the comment, it's realized as a bug in the community,
> > so once the gcc bug fixes, we might no longer need this trick, no?
>
> No comment in two years...
>
> v38-0013-Use-constant-for-initial-copy-of-chunks-and-chil.patch
>
> This is the same as v37-0011. I wasn't quite satisfied with it since it still 
> has two memcpy() calls, but it actually seems to regress:
>
>          avg
> ---------------------
>  22.0900000000000000
>
> v38-0012-Use-branch-free-coding-to-skip-new-element-index.patch
>
> This patch uses a single loop for the copy.
>
>          avg
> ---------------------
>  21.0300000000000000
>
> Within noise level of v38-0011, but it's small and simple, so I like it, at 
> least for small arrays.

Agreed.

>
> v38-0014-node48-Remove-need-for-RIGHTMOST_ONE-in-radix-tr.patch
> v38-0015-node48-Remove-dead-code-by-using-loop-local-var.patch
>
> Just small cleanups.
>
> v38-0016-Use-memcpy-for-children-when-growing-into-node48.patch
>
> Makes sense, but untested.

Agreed.

BTW cfbot reported that some regression tests failed due to OOM. I've
attached the patch to fix it.

>
> ===============
> [Part 2]
>
> Per off-list discussion with Masahiko, it makes sense to take some of the 
> ideas I've used locally on tidbitmap, and start incorporating them into 
> earlier vacuum work to get that out the door faster. With that in mind...
>
> v38-0017-Make-tidstore-more-similar-to-tidbitmap.patch
>
> This uses a simplified PagetableEntry (unimaginatively called BlocktableEntry 
> just to avoid confusion), to be replaced with the real thing at a later date. 
> This is still fixed size, to be replaced with a varlen type.

That's more readable.

>
> Looking at the tidstore tests again after some months, I'm not particularly 
> pleased with the amount of code required for how little it seems to be 
> testing, nor the output when something fails. (I wonder how hard it would be 
> to have SQL functions that add blocks/offsets to the tid store, and emit 
> tuples of tids found in the store.)

It would not be hard to have such SQL functions. I'll try it.

>
> I'm also concerned about the number of places that have to know if the store 
> is using shared memory or not. Something to think about later.
>
> v38-0018-Consolidate-inserting-updating-values.patch
>
> This is something I coded up to get to an API more similar to one in 
> simplehash, as used in tidbitmap.c. It seem worth doing on its own to reduce 
> code duplication, and also simplifies coding of varlen types and 
> "runtime-embeddable values".

Agreed.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment: fix_oom.patch
Description: Binary data

Reply via email to