Re: Improve eviction algorithm in ReorderBuffer
On Thu, Apr 11, 2024 at 10:46 AM Hayato Kuroda (Fujitsu) wrote:
>
> Dear Heikki,
>
> I also prototyped the idea, which has almost the same shape.
> I attached it just in case, but we may not have to look at it.
>
> A few comments based on the experiment.

Thank you for reviewing the patch. I didn't include the following suggestions,
since I first wanted to fix just the binaryheap part while keeping the other
parts. If we need these changes, we can do them in separate commits as fixes.

> ```
> +       /* txn_heap is ordered by transaction size */
> +       buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);
> ```
>
> I think the pairing heap should be in the same MemoryContext as the buffer.
> Can we add MemoryContextSwitchTo()?

pairingheap_allocate() allocates only a tiny amount of memory for the
pairingheap, and its memory usage doesn't grow even when adding more data.
And since it's allocated in the logical decoding context, its lifetime is
also fine. So I'm not sure it's worth including it in rb->context for better
memory accountability.

> ```
> +       /* Update the max-heap */
> +       if (oldsize != 0)
> +               pairingheap_remove(rb->txn_heap, &txn->txn_node);
> +       pairingheap_add(rb->txn_heap, &txn->txn_node);
> ...
> +       /* Update the max-heap */
> +       pairingheap_remove(rb->txn_heap, &txn->txn_node);
> +       if (txn->size != 0)
> +               pairingheap_add(rb->txn_heap, &txn->txn_node);
> ```
>
> Since the number of stored transactions does not affect the insert
> operation, we may be able to add the node while creating the
> ReorderBufferTXN and remove it while cleaning it up. This can reduce
> branches in ReorderBufferChangeMemoryUpdate().

I think it also means that we would need to remove the entry during cleanup
even if the transaction doesn't have any changes, which is done in O(log n).
I feel the current approach, where we don't store transactions with size 0
in the heap, is better, and I'm not sure that reducing these branches really
contributes to the performance improvements.
Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On 11/04/2024 11:20, Masahiko Sawada wrote:
> On Thu, Apr 11, 2024 at 11:52 AM Masahiko Sawada wrote:
>> We can see 2% ~ 3% performance regressions compared to the current
>> HEAD, but it's much smaller than I expected. Given that we can make
>> the code simple, I think we can go with this direction.
>
> Pushed the patch and reverted binaryheap changes.

Thank you!

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Improve eviction algorithm in ReorderBuffer
On Thu, Apr 11, 2024 at 11:52 AM Masahiko Sawada wrote:
>
> We can see 2% ~ 3% performance regressions compared to the current
> HEAD, but it's much smaller than I expected. Given that we can make
> the code simple, I think we can go with this direction.

Pushed the patch and reverted binaryheap changes.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Thu, Apr 11, 2024 at 10:32 AM Masahiko Sawada wrote:
>
> Hi,
>
> Sorry for the late reply, I took two days off.
>
> On Thu, Apr 11, 2024 at 6:20 AM Heikki Linnakangas wrote:
> >
> > On 10/04/2024 08:31, Amit Kapila wrote:
> > > On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas wrote:
> > >>
> > >> On 10/04/2024 07:45, Michael Paquier wrote:
> > >>> On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
> > >>>> On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
> > >>>>> Wouldn't the best way forward be to revert
> > >>>>> 5bec1d6bc5e3 and revisit the whole in v18?
> > >>>>
> > >>>> Also consider commits b840508644 and bcb14f4abc.
> > >>>
> > >>> Indeed. These are also linked.
> > >>
> > >> I don't feel the urge to revert this:
> > >>
> > >> - It's not broken as such, we're just discussing better ways to
> > >> implement it. We could also do nothing, and revisit this in v18. The
> > >> only must-fix issue is some compiler warnings IIUC.
> > >>
> > >> - It's a pretty localized change in reorderbuffer.c, so it's not in the
> > >> way of other patches or reverts. Nothing else depends on the binaryheap
> > >> changes yet either.
> > >>
> > >> - It seems straightforward to repeat the performance tests with whatever
> > >> alternative implementations we want to consider.
> > >>
> > >> My #1 choice would be to write a patch to switch to the pairing heap,
> > >> performance test that, and revert the binary heap changes.
> > >
> > > +1.
> >
> > To move this forward, here's a patch to switch to a pairing heap. In my
> > very quick testing, with the performance test cases posted earlier in
> > this thread [1] [2], I'm seeing no meaningful performance difference
> > between this and what's in master currently.
> >
> > Sawada-san, what do you think of this? To be sure, if you could also
> > repeat the performance tests you performed earlier, that'd be great. If
> > you agree with this direction, and you're happy with this patch, feel
> > free to take it from here and commit this, and also revert commits
> > b840508644 and bcb14f4abc.
>
> Thank you for the patch!
>
> I agree with the direction that we replace binaryheap + index with the
> existing pairing heap and revert the changes for binaryheap. Regarding
> the patch, I'm not sure we can remove the MAX_HEAP_TXN_COUNT_THRESHOLD
> logic, because otherwise we need to remove and add the txn node (i.e.
> O(log n)) for every memory update. I'm concerned it could cause some
> performance degradation in a case where there are not many
> transactions being decoded.
>
> I'll do performance tests, and share the results soon.

Here are some performance test results.

* test case 1 (many subtransactions)

test script:

create or replace function testfn (cnt int) returns void as $$
begin
  for i in 1..cnt loop
    begin
      insert into test values (i);
    exception when division_by_zero then
      raise notice 'caught error';
      return;
    end;
  end loop;
end;
$$ language plpgsql;
select pg_create_logical_replication_slot('s', 'test_decoding');
select testfn(100);
set logical_decoding_work_mem to '4MB';
select from pg_logical_slot_peek_changes('s', null, null);

HEAD: 43128.266 ms, 40116.313 ms, 38790.666 ms
Patched: 43626.316 ms, 44456.234 ms, 39899.753 ms

* test case 2 (single big insertion)

test script:

create table test (c int);
select pg_create_logical_replication_slot('s', 'test_decoding');
insert into test select generate_series(1, 1000);
set logical_decoding_work_mem to '10GB'; -- avoid data spill
select from pg_logical_slot_peek_changes('s', null, null);

HEAD: 7996.476 ms, 8034.022 ms, 8005.583 ms
Patched: 8153.500 ms, 8121.588 ms, 8121.538 ms

* test case 3 (many small transactions)

test script:

pgbench -s -i 300
psql -c "select pg_create_replication_slot('s', 'test_decoding')"
pgbench -t 10 -c 32
psql -c "set logical_decoding_work_mem to '10GB'; select count(*) from pg_logical_slot_peek_changes('s', null, null)"

HEAD: 22586.343 ms, 22507.905 ms, 22504.133 ms
Patched: 23365.142 ms, 23110.651 ms, 23102.170 ms

We can see 2% ~ 3% performance regressions compared to the current
HEAD, but it's much smaller than I expected. Given that we can make
the code simple, I think we can go with this direction.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
RE: Improve eviction algorithm in ReorderBuffer
Dear Heikki,

I also prototyped the idea, which has almost the same shape.
I attached it just in case, but we may not have to look at it.

A few comments based on the experiment.

```
+       /* txn_heap is ordered by transaction size */
+       buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);
```

I think the pairing heap should be in the same MemoryContext as the buffer.
Can we add MemoryContextSwitchTo()?

```
+       /* Update the max-heap */
+       if (oldsize != 0)
+               pairingheap_remove(rb->txn_heap, &txn->txn_node);
+       pairingheap_add(rb->txn_heap, &txn->txn_node);
...
+       /* Update the max-heap */
+       pairingheap_remove(rb->txn_heap, &txn->txn_node);
+       if (txn->size != 0)
+               pairingheap_add(rb->txn_heap, &txn->txn_node);
```

Since the number of stored transactions does not affect the insert
operation, we may be able to add the node while creating the
ReorderBufferTXN and remove it while cleaning it up. This can reduce
branches in ReorderBufferChangeMemoryUpdate().

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Re: Improve eviction algorithm in ReorderBuffer
Hi,

Sorry for the late reply, I took two days off.

On Thu, Apr 11, 2024 at 6:20 AM Heikki Linnakangas wrote:
>
> On 10/04/2024 08:31, Amit Kapila wrote:
> > On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas wrote:
> >>
> >> On 10/04/2024 07:45, Michael Paquier wrote:
> >>> On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
> >>>> On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
> >>>>> Wouldn't the best way forward be to revert
> >>>>> 5bec1d6bc5e3 and revisit the whole in v18?
> >>>>
> >>>> Also consider commits b840508644 and bcb14f4abc.
> >>>
> >>> Indeed. These are also linked.
> >>
> >> I don't feel the urge to revert this:
> >>
> >> - It's not broken as such, we're just discussing better ways to
> >> implement it. We could also do nothing, and revisit this in v18. The
> >> only must-fix issue is some compiler warnings IIUC.
> >>
> >> - It's a pretty localized change in reorderbuffer.c, so it's not in the
> >> way of other patches or reverts. Nothing else depends on the binaryheap
> >> changes yet either.
> >>
> >> - It seems straightforward to repeat the performance tests with whatever
> >> alternative implementations we want to consider.
> >>
> >> My #1 choice would be to write a patch to switch to the pairing heap,
> >> performance test that, and revert the binary heap changes.
> >
> > +1.
>
> To move this forward, here's a patch to switch to a pairing heap. In my
> very quick testing, with the performance test cases posted earlier in
> this thread [1] [2], I'm seeing no meaningful performance difference
> between this and what's in master currently.
>
> Sawada-san, what do you think of this? To be sure, if you could also
> repeat the performance tests you performed earlier, that'd be great. If
> you agree with this direction, and you're happy with this patch, feel
> free to take it from here and commit this, and also revert commits
> b840508644 and bcb14f4abc.

Thank you for the patch!

I agree with the direction that we replace binaryheap + index with the
existing pairing heap and revert the changes for binaryheap. Regarding
the patch, I'm not sure we can remove the MAX_HEAP_TXN_COUNT_THRESHOLD
logic, because otherwise we need to remove and add the txn node (i.e.
O(log n)) for every memory update. I'm concerned it could cause some
performance degradation in a case where there are not many
transactions being decoded.

I'll do performance tests, and share the results soon.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On 11/04/2024 01:37, Michael Paquier wrote:
> On Thu, Apr 11, 2024 at 12:20:55AM +0300, Heikki Linnakangas wrote:
>> To move this forward, here's a patch to switch to a pairing heap. In my
>> very quick testing, with the performance test cases posted earlier in
>> this thread [1] [2], I'm seeing no meaningful performance difference
>> between this and what's in master currently.
>
> Reading through the patch, that's a nice cleanup. It cuts quite some
> code.
>
> +++ b/src/include/replication/reorderbuffer.h
> @@ -12,6 +12,7 @@
>  #include "access/htup_details.h"
>  #include "lib/binaryheap.h"
>  #include "lib/ilist.h"
> +#include "lib/pairingheap.h"
>
> I'm slightly annoyed by the extra amount of information that gets added
> to reorderbuffer.h for stuff that's only local to reorderbuffer.c, but
> that's not something new in this area, so..

We can actually remove the "lib/binaryheap.h" include in this patch; I
missed that. There are no other uses of binaryheap in the file.

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Improve eviction algorithm in ReorderBuffer
On Thu, Apr 11, 2024 at 12:20:55AM +0300, Heikki Linnakangas wrote:
> To move this forward, here's a patch to switch to a pairing heap. In my very
> quick testing, with the performance test cases posted earlier in this thread
> [1] [2], I'm seeing no meaningful performance difference between this and
> what's in master currently.

Reading through the patch, that's a nice cleanup. It cuts quite some
code.

+++ b/src/include/replication/reorderbuffer.h
@@ -12,6 +12,7 @@
 #include "access/htup_details.h"
 #include "lib/binaryheap.h"
 #include "lib/ilist.h"
+#include "lib/pairingheap.h"

I'm slightly annoyed by the extra amount of information that gets added
to reorderbuffer.h for stuff that's only local to reorderbuffer.c, but
that's not something new in this area, so..

--
Michael
Re: Improve eviction algorithm in ReorderBuffer
On 10/04/2024 08:31, Amit Kapila wrote:
> On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas wrote:
>>
>> On 10/04/2024 07:45, Michael Paquier wrote:
>>> On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
>>>> On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
>>>>> Wouldn't the best way forward be to revert
>>>>> 5bec1d6bc5e3 and revisit the whole in v18?
>>>>
>>>> Also consider commits b840508644 and bcb14f4abc.
>>>
>>> Indeed. These are also linked.
>>
>> I don't feel the urge to revert this:
>>
>> - It's not broken as such, we're just discussing better ways to
>> implement it. We could also do nothing, and revisit this in v18. The
>> only must-fix issue is some compiler warnings IIUC.
>>
>> - It's a pretty localized change in reorderbuffer.c, so it's not in the
>> way of other patches or reverts. Nothing else depends on the binaryheap
>> changes yet either.
>>
>> - It seems straightforward to repeat the performance tests with whatever
>> alternative implementations we want to consider.
>>
>> My #1 choice would be to write a patch to switch to the pairing heap,
>> performance test that, and revert the binary heap changes.
>
> +1.

To move this forward, here's a patch to switch to a pairing heap. In my
very quick testing, with the performance test cases posted earlier in
this thread [1] [2], I'm seeing no meaningful performance difference
between this and what's in master currently.

Sawada-san, what do you think of this? To be sure, if you could also
repeat the performance tests you performed earlier, that'd be great. If
you agree with this direction, and you're happy with this patch, feel
free to take it from here and commit this, and also revert commits
b840508644 and bcb14f4abc.
--
Heikki Linnakangas
Neon (https://neon.tech)

From c883d03bd221341b0bbb315d376d9e125b84329a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas
Date: Wed, 10 Apr 2024 08:09:38 +0300
Subject: [PATCH 1/1] Replace binaryheap + index with pairingheap

A pairing heap can perform the same operations as the binary heap +
index, with as good or better algorithmic complexity, and it's an
existing data structure, so we don't need to invent anything new
compared to v16.

This commit makes the new binaryheap functionality that was added in
commits b840508644 and bcb14f4abc unnecessary, but they will be
reverted separately.

Remove the optimization to only build and maintain the heap when the
amount of memory used is close to the limit, because the bookkeeping
overhead with the pairing heap seems to be small enough that it
doesn't matter in practice.
---
 .../replication/logical/reorderbuffer.c | 186 +++---
 src/include/replication/reorderbuffer.h |   8 +-
 2 files changed, 29 insertions(+), 165 deletions(-)

diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 5cf28d4df42..ab2ddabfadd 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -68,19 +68,9 @@
  * memory gets actually freed.
  *
  * We use a max-heap with transaction size as the key to efficiently find
- * the largest transaction. While the max-heap is empty, we don't update
- * the max-heap when updating the memory counter. Therefore, we can get
- * the largest transaction in O(N) time, where N is the number of
- * transactions including top-level transactions and subtransactions.
- *
- * We build the max-heap just before selecting the largest transactions
- * if the number of transactions being decoded is higher than the threshold,
- * MAX_HEAP_TXN_COUNT_THRESHOLD. After building the max-heap, we also
- * update the max-heap when updating the memory counter. The intention is
- * to efficiently find the largest transaction in O(1) time instead of
- * incurring the cost of memory counter updates (O(log N)). Once the number
- * of transactions got lower than the threshold, we reset the max-heap
- * (refer to ReorderBufferMaybeResetMaxHeap() for details).
+ * the largest transaction. We update the max-heap whenever the memory
+ * counter is updated; however transactions with size 0 are not stored in
+ * the heap, because they have no changes to evict.
  *
  * We still rely on max_changes_in_memory when loading serialized changes
  * back into memory. At that point we can't use the memory limit directly
@@ -122,23 +112,6 @@
 #include "utils/rel.h"
 #include "utils/relfilenumbermap.h"

-/*
- * Threshold of the total number of top-level and sub transactions that
- * controls whether we use the max-heap for tracking their sizes. Although
- * using the max-heap to select the largest transaction is effective when
- * there are many transactions being decoded, maintaining the max-heap while
- * updating the memory statistics can be costly. Therefore, we use
- * MaxConnections as the threshold so that we use the max-heap only when
- * using subtransactions.
- */
-#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections
-
-/*
- * A macro to check if the
Re: Improve eviction algorithm in ReorderBuffer
On Wed, 2024-04-10 at 08:30 +0300, Heikki Linnakangas wrote:
> My #1 choice would be to write a patch to switch to the pairing heap,
> performance test that, and revert the binary heap changes.

Sounds good to me. I would expect it to perform better than the extra
hash table, if anything. It also has the advantage that we don't change
the API for binaryheap in 17.

Regards,
Jeff Davis
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas wrote:
>
> On 10/04/2024 07:45, Michael Paquier wrote:
> > On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
> >> On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
> >>> Wouldn't the best way forward be to revert
> >>> 5bec1d6bc5e3 and revisit the whole in v18?
> >>
> >> Also consider commits b840508644 and bcb14f4abc.
> >
> > Indeed. These are also linked.
>
> I don't feel the urge to revert this:
>
> - It's not broken as such, we're just discussing better ways to
> implement it. We could also do nothing, and revisit this in v18. The
> only must-fix issue is some compiler warnings IIUC.
>
> - It's a pretty localized change in reorderbuffer.c, so it's not in the
> way of other patches or reverts. Nothing else depends on the binaryheap
> changes yet either.
>
> - It seems straightforward to repeat the performance tests with whatever
> alternative implementations we want to consider.
>
> My #1 choice would be to write a patch to switch to the pairing heap,
> performance test that, and revert the binary heap changes.

+1.

--
With Regards,
Amit Kapila.
Re: Improve eviction algorithm in ReorderBuffer
On 10/04/2024 07:45, Michael Paquier wrote:
> On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
>> On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
>>> Wouldn't the best way forward be to revert
>>> 5bec1d6bc5e3 and revisit the whole in v18?
>>
>> Also consider commits b840508644 and bcb14f4abc.
>
> Indeed. These are also linked.

I don't feel the urge to revert this:

- It's not broken as such, we're just discussing better ways to
implement it. We could also do nothing, and revisit this in v18. The
only must-fix issue is some compiler warnings IIUC.

- It's a pretty localized change in reorderbuffer.c, so it's not in the
way of other patches or reverts. Nothing else depends on the binaryheap
changes yet either.

- It seems straightforward to repeat the performance tests with whatever
alternative implementations we want to consider.

My #1 choice would be to write a patch to switch to the pairing heap,
performance test that, and revert the binary heap changes.

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
> On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
>> Wouldn't the best way forward be to revert
>> 5bec1d6bc5e3 and revisit the whole in v18?
>
> Also consider commits b840508644 and bcb14f4abc.

Indeed. These are also linked.

--
Michael
Re: Improve eviction algorithm in ReorderBuffer
On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
> Wouldn't the best way forward be to revert
> 5bec1d6bc5e3 and revisit the whole in v18?

That's a reasonable conclusion. Also consider commits b840508644 and
bcb14f4abc.

I had tried to come up with a narrower fix, and I think it's already
been implemented here in approach 2:

https://www.postgresql.org/message-id/CAD21AoAtf12e9Z9NLBuaO1GjHMMo16_8R-yBu9Q9jrk2QLqMEA%40mail.gmail.com

but it does feel wrong to introduce an unnecessary hash table in 17
when we know it's not the right solution.

Regards,
Jeff Davis
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Apr 09, 2024 at 06:24:43PM -0700, Jeff Davis wrote:
> I had suggested that the heap could track the element indexes for
> efficient update/removal, but that would be a change to the
> binaryheap.h API, which would require some discussion (and possibly not
> be acceptable post-freeze).
>
> But I think you're right: a pairing heap already solves the problem
> without modification. (Note: our pairing heap API doesn't explicitly
> support updating a key, so I think it would need to be done with
> remove/add.) So we might as well just do that right now rather than
> trying to fix the way the hash table is being used or trying to extend
> the binaryheap API.
>
> Of course, we should measure to be sure there aren't bad cases around
> updating/removing a key, but I don't see a fundamental reason that it
> should be worse.

This is going to require a rewrite of 5bec1d6bc5e3 with a new
performance study, which strikes me as something that we'd better not
do after feature freeze. Wouldn't the best way forward be to revert
5bec1d6bc5e3 and revisit the whole in v18?

I have added an open item, for now.

--
Michael
Re: Improve eviction algorithm in ReorderBuffer
On Tue, 2024-04-09 at 23:49 +0300, Heikki Linnakangas wrote:
> I wonder why this doesn't use the existing pairing heap
> implementation? I would assume that to be at least as good as the
> binary heap + hash table.

I agree that an additional hash table is not needed -- there's already
a hash table to do a lookup based on xid in reorderbuffer.c.

I had suggested that the heap could track the element indexes for
efficient update/removal, but that would be a change to the
binaryheap.h API, which would require some discussion (and possibly not
be acceptable post-freeze).

But I think you're right: a pairing heap already solves the problem
without modification. (Note: our pairing heap API doesn't explicitly
support updating a key, so I think it would need to be done with
remove/add.) So we might as well just do that right now rather than
trying to fix the way the hash table is being used or trying to extend
the binaryheap API.

Of course, we should measure to be sure there aren't bad cases around
updating/removing a key, but I don't see a fundamental reason that it
should be worse.

Regards,
Jeff Davis
Re: Improve eviction algorithm in ReorderBuffer
On 09/04/2024 21:04, Jeff Davis wrote:
> On Fri, 2024-04-05 at 16:58 +0900, Masahiko Sawada wrote:
>> I have some further comments and I believe changes are required for
>> v17.
>
> I also noticed that the simplehash is declared in binaryheap.h with
> "SH_SCOPE extern", which seems wrong. Attached is a rough patch to
> bring the declarations into binaryheap.c.
>
> Note that the attached patch uses "SH_SCOPE static", which makes sense
> to me in this case, but causes a bunch of warnings in gcc. I will post
> separately about silencing that warning, but for now you can either use:
>
>   SH_SCOPE static inline
>
> which is probably fine, but will encourage the compiler to inline more
> code, when not all callers even use the hash table. Alternatively, you
> can do:
>
>   SH_SCOPE static pg_attribute_unused()
>
> which looks a bit wrong to me, but seems to silence the warnings, and
> lets the compiler decide whether or not to inline.
>
> Also probably needs comment updates, etc.

Sorry I'm late to the party, I didn't pay attention to this thread
earlier. But I wonder why this doesn't use the existing pairing heap
implementation? I would assume that to be at least as good as the
binary heap + hash table. And it's cheap to insert to (O(1)), so we
could probably remove the MAX_HEAP_TXN_COUNT_THRESHOLD, and always
keep the heap up-to-date.

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Improve eviction algorithm in ReorderBuffer
On Fri, 2024-04-05 at 16:58 +0900, Masahiko Sawada wrote:
> I have some further comments and I believe changes are required for
> v17.

I also noticed that the simplehash is declared in binaryheap.h with
"SH_SCOPE extern", which seems wrong. Attached is a rough patch to
bring the declarations into binaryheap.c.

Note that the attached patch uses "SH_SCOPE static", which makes sense
to me in this case, but causes a bunch of warnings in gcc. I will post
separately about silencing that warning, but for now you can either use:

  SH_SCOPE static inline

which is probably fine, but will encourage the compiler to inline more
code, when not all callers even use the hash table. Alternatively, you
can do:

  SH_SCOPE static pg_attribute_unused()

which looks a bit wrong to me, but seems to silence the warnings, and
lets the compiler decide whether or not to inline.

Also probably needs comment updates, etc.

Regards,
Jeff Davis

From 08dcf21646e4ded22b10fd0ed536d3bbf6fc1328 Mon Sep 17 00:00:00 2001
From: Jeff Davis
Date: Tue, 9 Apr 2024 10:45:00 -0700
Subject: [PATCH v1] binaryheap: move hash table out of header into
 binaryheap.c.

Commit b840508644 declared the internal hash table in the header with
scope "extern", which was unnecessary.
---
 src/common/binaryheap.c      | 15 ++++++++++++++-
 src/include/lib/binaryheap.h | 25 +------------------------
 2 files changed, 15 insertions(+), 25 deletions(-)

diff --git a/src/common/binaryheap.c b/src/common/binaryheap.c
index c20ed50acc..2501dad6f2 100644
--- a/src/common/binaryheap.c
+++ b/src/common/binaryheap.c
@@ -25,6 +25,18 @@
 #include "common/hashfn.h"
 #include "lib/binaryheap.h"

+/*
+ * Struct for a hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry
+{
+	bh_node_type key;
+	int			index;		/* entry's index within the node array */
+	char		status;		/* hash status */
+	uint32		hash;		/* hash values (cached) */
+} bh_nodeidx_entry;
+
 /*
  * Define parameters for hash table code generation. The interface is *also*
  * declared in binaryheap.h (to generate the types, which are externally
@@ -37,12 +49,13 @@
 #define SH_HASH_KEY(tb, key) \
 	hash_bytes((const unsigned char *) &key, sizeof(bh_node_type))
 #define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0)
-#define SH_SCOPE extern
+#define SH_SCOPE static
 #ifdef FRONTEND
 #define SH_RAW_ALLOCATOR pg_malloc0
 #endif
 #define SH_STORE_HASH
 #define SH_GET_HASH(tb, a) a->hash
+#define SH_DECLARE
 #define SH_DEFINE
 #include "lib/simplehash.h"

diff --git a/src/include/lib/binaryheap.h b/src/include/lib/binaryheap.h
index 4c1a1bb274..8b47132fc3 100644
--- a/src/include/lib/binaryheap.h
+++ b/src/include/lib/binaryheap.h
@@ -29,29 +29,6 @@ typedef Datum bh_node_type;
  */
 typedef int (*binaryheap_comparator) (bh_node_type a, bh_node_type b, void *arg);

-/*
- * Struct for a hash table element to store the node's index in the bh_nodes
- * array.
- */
-typedef struct bh_nodeidx_entry
-{
-	bh_node_type key;
-	int			index;		/* entry's index within the node array */
-	char		status;		/* hash status */
-	uint32		hash;		/* hash values (cached) */
-} bh_nodeidx_entry;
-
-/* Define parameters necessary to generate the hash table interface. */
-#define SH_PREFIX bh_nodeidx
-#define SH_ELEMENT_TYPE bh_nodeidx_entry
-#define SH_KEY_TYPE bh_node_type
-#define SH_SCOPE extern
-#ifdef FRONTEND
-#define SH_RAW_ALLOCATOR pg_malloc0
-#endif
-#define SH_DECLARE
-#include "lib/simplehash.h"
-
 /*
  * binaryheap
  *
@@ -76,7 +53,7 @@ typedef struct binaryheap
 	 * node's index in bh_nodes. This enables the caller to perform
 	 * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
 	 */
-	bh_nodeidx_hash *bh_nodeidx;
+	struct bh_nodeidx_hash *bh_nodeidx;
 } binaryheap;

 extern binaryheap *binaryheap_allocate(int num_nodes,
--
2.34.1
Re: Improve eviction algorithm in ReorderBuffer
On Mon, 2024-04-08 at 21:29 +0900, Masahiko Sawada wrote:
> For v17, changes for #2 are smaller, but I'm concerned
> that the new API that requires a hash function to be able to use
> binaryheap_update_{up|down} might not be user friendly.

The only API change in #2 is accepting a hash callback rather than a
boolean in binaryheap_allocate(), so I don't see that as worse than
what's there now. It also directly fixes my complaint (hashing the
pointer) and does nothing more, so I think it's the right fix for now.

I do think that the API can be better (templated like simplehash), but
I don't think right now is a great time to change it.

> When it comes to performance overhead, I mentioned that there is some
> regression in the current binaryheap even without indexing.

As far as I can tell, you are just adding a single branch in that path,
and I would expect it to be a predictable branch, right?

Thank you for testing, but small differences in a microbenchmark aren't
terribly worrying for me. If other call sites are that sensitive to
binaryheap performance, the right answer is to have a templated version
that would not only avoid this unnecessary branch, but also inline the
comparator (which probably matters more).

Regards,
Jeff Davis
Re: Improve eviction algorithm in ReorderBuffer
On Sat, Apr 6, 2024 at 5:44 AM Jeff Davis wrote: > > > > > It sounds like a data structure that mixes the hash table and the > > binary heap and we use it as the main storage (e.g. for > > ReorderBufferTXN) instead of using the binary heap as the secondary > > data structure. IIUC with your idea, the indexed binary heap has a > > hash table to store elements each of which has its index within the > > heap node array. I guess it's better to create it as a new data > > structure rather than extending the existing binaryheap, since APIs > > could be very different. I might be missing something, though. > > You are right that this approach starts to feel like a new data > structure and is not v17 material. > > I am interested in this for v18 though -- we could make the API more > like simplehash to be more flexible when using values (rather than > pointers) and to be able to inline the comparator. Interesting project. It would be great if we could support increasing and decreasing the key as APIs. The current binaryheap_update_{up|down} APIs are not very user-friendly. > > > > * remove the hash table from binaryheap.c > > > > > > * supply a new callback to the binary heap with type like: > > > > > > typedef void (*binaryheap_update_index)( > > > bh_node_type node, > > > int new_element_index); > > > > > > * make the remove, update_up, and update_down methods take the > > > element > > > index rather than the pointer > > ... > > > This shows that the current indexed binaryheap is much slower than > > the > > other two implementations, and the xx_binaryheap has a good number in > > spite of also being indexed. > > xx_binaryheap isn't indexed though, right? Well, yes. To be xact, xx_binaryheap isn't indexed but the element indexes are stored in the element itself (see test_elem struct) so the caller still can update the key using xx_binaryheap_update_{up|down}. > > > When it comes to > > implementing the above idea (i.e. 
changing binaryheap to > > xx_binaryheap), it was simple since we just replace the code where we > > update the hash table with the code where we call the callback, if we > > get the consensus on API change. > > That seems reasonable to me. > > > The fact that we use simplehash for the internal hash table might > > make > > this idea complex. If I understand your suggestion correctly, the > > caller needs to tell the hash table the hash function when creating a > > binaryheap but the hash function needs to be specified at a compile > > time. We can use a dynahash instead but it would make the binaryheap > > slow further. > > simplehash.h supports private_data, which makes it easier to track a > callback. > > In binaryheap.c, that would look something like: > > static inline uint32 > binaryheap_hash(bh_nodeidx_hash *tab, uint32 key) > { > binaryheap_hashfunc hashfunc = tab->private_data; > return hashfunc(key); > } > > ... > #define SH_HASH_KEY(tb, key) binaryheap_hash(tb, key) > ... > > binaryheap_allocate(int num_nodes, binaryheap_comparator compare, > void *arg, binaryheap_hashfunc hashfunc) > { > ... > if (hashfunc != NULL) > { >/* could have a new structure, but we only need to > * store one pointer, so don't bother with palloc/pfree */ >void *private_data = (void *)hashfunc; >heap->bh_nodeidx = bh_nodeidx_create(..., private_data); >... > > > And in reorderbuffer.c, define the callback like: > > static uint32 > reorderbuffer_xid_hash(TransactionId xid) > { > /* fasthash32 is 'static inline' so may > * be faster than hash_bytes()? */ > return fasthash32(&xid, sizeof(TransactionId), 0); > } > Thanks, that's a good idea. > > > In summary, there are two viable approaches for addressing the concerns > in v17: > > 1. Provide callback to update ReorderBufferTXN->heap_element_index, and > use that index (rather than the pointer) for updating the heap key > (transaction size) or removing elements from the heap. > > 2. 
Provide callback for hashing, so that binaryheap.c can hash the xid > value rather than the pointer. > > I don't have a strong opinion about which one to use. I prefer > something closer to #1 for v18, but for v17 I suggest whichever one > comes out simpler. I've implemented prototypes of both ideas, and attached the draft patches. I agree with you that something closer to #1 is v18 material. We can probably implement the #1 idea while making the binaryheap code a template like simplehash.h. For v17, changes for #2 are smaller, but I'm concerned that the new API that requires a hash function to be able to use binaryheap_update_{up|down} might not be user friendly. In terms of APIs, I prefer the #1 idea. Also, the changes for #1 make the binaryheap code simpler, although they require adding a variable to ReorderBufferTXN instead. But overall, #1 removes the hash table and some functions, so it looks better to me. When it comes to performance overhead, I mentioned
Re: Improve eviction algorithm in ReorderBuffer
On Fri, 2024-04-05 at 16:58 +0900, Masahiko Sawada wrote: > IIUC for example in ReorderBuffer, the heap key is transaction size > and the hash key is xid. Yes. > > I see your point. It assumes that the bh_node_type is a pointer or at > least unique. So it cannot work with Datum being a value. Right. One option might just be to add some comments explaining the API and limitations, but in general I feel it's confusing to hash a pointer without a good reason. > > It sounds like a data structure that mixes the hash table and the > binary heap and we use it as the main storage (e.g. for > ReorderBufferTXN) instead of using the binary heap as the secondary > data structure. IIUC with your idea, the indexed binary heap has a > hash table to store elements each of which has its index within the > heap node array. I guess it's better to create it as a new data > structure rather than extending the existing binaryheap, since APIs > could be very different. I might be missing something, though. You are right that this approach starts to feel like a new data structure and is not v17 material. I am interested in this for v18 though -- we could make the API more like simplehash to be more flexible when using values (rather than pointers) and to be able to inline the comparator. > > * remove the hash table from binaryheap.c > > > > * supply a new callback to the binary heap with type like: > > > > typedef void (*binaryheap_update_index)( > > bh_node_type node, > > int new_element_index); > > > > * make the remove, update_up, and update_down methods take the > > element > > index rather than the pointer ... > This shows that the current indexed binaryheap is much slower than > the > other two implementations, and the xx_binaryheap has a good number in > spite of also being indexed. xx_binaryheap isn't indexed though, right? > When it comes to > implementing the above idea (i.e. 
changing binaryheap to > xx_binaryheap), it was simple since we just replace the code where we > update the hash table with the code where we call the callback, if we > get the consensus on API change. That seems reasonable to me. > The fact that we use simplehash for the internal hash table might > make > this idea complex. If I understand your suggestion correctly, the > caller needs to tell the hash table the hash function when creating a > binaryheap but the hash function needs to be specified at a compile > time. We can use a dynahash instead but it would make the binaryheap > slow further. simplehash.h supports private_data, which makes it easier to track a callback. In binaryheap.c, that would look something like: static inline uint32 binaryheap_hash(bh_nodeidx_hash *tab, uint32 key) { binaryheap_hashfunc hashfunc = tab->private_data; return hashfunc(key); } ... #define SH_HASH_KEY(tb, key) binaryheap_hash(tb, key) ... binaryheap_allocate(int num_nodes, binaryheap_comparator compare, void *arg, binaryheap_hashfunc hashfunc) { ... if (hashfunc != NULL) { /* could have a new structure, but we only need to * store one pointer, so don't bother with palloc/pfree */ void *private_data = (void *)hashfunc; heap->bh_nodeidx = bh_nodeidx_create(..., private_data); ... And in reorderbuffer.c, define the callback like: static uint32 reorderbuffer_xid_hash(TransactionId xid) { /* fasthash32 is 'static inline' so may * be faster than hash_bytes()? */ return fasthash32(&xid, sizeof(TransactionId), 0); } In summary, there are two viable approaches for addressing the concerns in v17: 1. Provide callback to update ReorderBufferTXN->heap_element_index, and use that index (rather than the pointer) for updating the heap key (transaction size) or removing elements from the heap. 2. Provide callback for hashing, so that binaryheap.c can hash the xid value rather than the pointer. I don't have a strong opinion about which one to use. 
I prefer something closer to #1 for v18, but for v17 I suggest whichever one comes out simpler. Regards, Jeff Davis
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Apr 5, 2024 at 2:55 AM Jeff Davis wrote: > > On Thu, 2024-04-04 at 17:28 +0900, Masahiko Sawada wrote: > > > Perhaps it's not worth the effort though, if performance is already > > > good enough? > > > > Yeah, it would be better to measure the overhead first. I'll do that. > > I have some further comments and I believe changes are required for > v17. > > An indexed binary heap API requires both a comparator and a hash > function to be specified, and has two different kinds of keys: the heap > key (mutable) and the hash key (immutable). It provides heap methods > and hashtable methods, and keep the two internal structures (heap and > HT) in sync. IIUC for example in ReorderBuffer, the heap key is transaction size and the hash key is xid. > > The implementation in b840508644 uses the bh_node_type as the hash key, > which is just a Datum, and it just hashes the bytes. I believe the > implicit assumption is that the Datum is a pointer -- I'm not sure how > one would use that API if the Datum were a value. Hashing a pointer > seems strange to me and, while I see why you did it that way, I think > it reflects that the API boundaries are not quite right. I see your point. It assumes that the bh_node_type is a pointer or at least unique. So it cannot work with Datum being a value. > > One consequence of using the pointer as the hash key is that you need > to find the pointer first: you can't change or remove elements based on > the transaction ID, you have to get the ReorderBufferTXN pointer by > finding it in another structure, first. Currently, that's being done by > searching ReorderBuffer->by_txn. So we actually have two hash tables > for essentially the same purpose: one with xid as the key, and the > other with the pointer as the key. That makes no sense -- let's have a > proper indexed binary heap to look things up by xid (the internal HT) > or by transaction size (using the internal heap). 
> > I suggest: > > * Make a proper indexed binary heap API that accepts a hash function > and provides both heap methods and HT methods that operate based on > values (transaction size and transaction ID, respectively). > * Get rid of ReorderBuffer->by_txn and use the indexed binary heap > instead. > > This will be a net simplification in reorderbuffer.c, which is good, > because that file makes use of a *lot* of data structures. > It sounds like a data structure that mixes the hash table and the binary heap and we use it as the main storage (e.g. for ReorderBufferTXN) instead of using the binary heap as the secondary data structure. IIUC with your idea, the indexed binary heap has a hash table to store elements each of which has its index within the heap node array. I guess it's better to create it as a new data structure rather than extending the existing binaryheap, since APIs could be very different. I might be missing something, though. On Fri, Apr 5, 2024 at 3:55 AM Jeff Davis wrote: > > On Thu, 2024-04-04 at 10:55 -0700, Jeff Davis wrote: > > * Make a proper indexed binary heap API that accepts a hash > > function > > and provides both heap methods and HT methods that operate based on > > values (transaction size and transaction ID, respectively). > > * Get rid of ReorderBuffer->by_txn and use the indexed binary heap > > instead. > > An alternative idea: > > * remove the hash table from binaryheap.c > > * supply a new callback to the binary heap with type like: > > typedef void (*binaryheap_update_index)( > bh_node_type node, > int new_element_index); > > * make the remove, update_up, and update_down methods take the element > index rather than the pointer > > reorderbuffer.c would then do something like: > > void > txn_update_heap_index(ReorderBufferTXN *txn, int new_element_index) > { > txn->heap_element_index = new_element_index; > } > > ... 
> > txn_heap = binaryheap_allocate(..., txn_update_heap_index, ...); > > and then binaryheap.c would effectively maintain txn- > >heap_element_index, so reorderbuffer.c can pass that to the APIs that > require the element index. Thank you for the idea. I was thinking of the same idea when considering your previous comment. With this idea, we still use the binaryheap for ReorderBuffer as the secondary data structure. Since we can implement this idea with relatively small changes to the current binaryheap, I've implemented it and measured performance. I've attached a patch that adds an extension for benchmarking binaryheap implementations. binaryheap_bench.c is the main test module. To make the comparison between different binaryheap implementations possible, the extension includes two additional binaryheap implementations. Therefore, binaryheap_bench.c uses three different binaryheap implementations in total, as the comment on top of the file says: /* * This benchmark tool uses three binary heap implementations. * * "binaryheap" is the current binaryheap implementation in PostgreSQL. That * is, it internally has a hash table to track
Re: Improve eviction algorithm in ReorderBuffer
On Thu, 2024-04-04 at 10:55 -0700, Jeff Davis wrote: > * Make a proper indexed binary heap API that accepts a hash > function > and provides both heap methods and HT methods that operate based on > values (transaction size and transaction ID, respectively). > * Get rid of ReorderBuffer->by_txn and use the indexed binary heap > instead. An alternative idea: * remove the hash table from binaryheap.c * supply a new callback to the binary heap with type like: typedef void (*binaryheap_update_index)( bh_node_type node, int new_element_index); * make the remove, update_up, and update_down methods take the element index rather than the pointer reorderbuffer.c would then do something like: void txn_update_heap_index(ReorderBufferTXN *txn, int new_element_index) { txn->heap_element_index = new_element_index; } ... txn_heap = binaryheap_allocate(..., txn_update_heap_index, ...); and then binaryheap.c would effectively maintain txn- >heap_element_index, so reorderbuffer.c can pass that to the APIs that require the element index. Another alternative is to keep the hash table in binaryheap.c, and supply a hash function that hashes the xid. That leaves us with two hash tables still, but it would be cleaner than hashing the pointer. That might be best for right now, and we can consider these other ideas later. Regards, Jeff Davis
Re: Improve eviction algorithm in ReorderBuffer
On Thu, 2024-04-04 at 17:28 +0900, Masahiko Sawada wrote: > > Perhaps it's not worth the effort though, if performance is already > > good enough? > > Yeah, it would be better to measure the overhead first. I'll do that. I have some further comments and I believe changes are required for v17. An indexed binary heap API requires both a comparator and a hash function to be specified, and has two different kinds of keys: the heap key (mutable) and the hash key (immutable). It provides heap methods and hashtable methods, and keeps the two internal structures (heap and HT) in sync. The implementation in b840508644 uses the bh_node_type as the hash key, which is just a Datum, and it just hashes the bytes. I believe the implicit assumption is that the Datum is a pointer -- I'm not sure how one would use that API if the Datum were a value. Hashing a pointer seems strange to me and, while I see why you did it that way, I think it reflects that the API boundaries are not quite right. One consequence of using the pointer as the hash key is that you need to find the pointer first: you can't change or remove elements based on the transaction ID, you have to get the ReorderBufferTXN pointer by finding it in another structure, first. Currently, that's being done by searching ReorderBuffer->by_txn. So we actually have two hash tables for essentially the same purpose: one with xid as the key, and the other with the pointer as the key. That makes no sense -- let's have a proper indexed binary heap to look things up by xid (the internal HT) or by transaction size (using the internal heap). 
This will be a net simplification in reorderbuffer.c, which is good, because that file makes use of a *lot* of data structures. Regards, Jeff Davis
Re: Improve eviction algorithm in ReorderBuffer
On Thu, Apr 4, 2024 at 1:54 PM Jeff Davis wrote: > > On Thu, 2024-04-04 at 09:31 +0900, Masahiko Sawada wrote: > > IIUC, with your suggestion, sift_{up|down} needs to update the > > heap_index field as well. Does it mean that the caller needs to pass > > the address of heap_index down to sift_{up|down}? > > I'm not sure quite how binaryheap should be changed. Bringing the heap > implementation into reorderbuffer.c would obviously work, but that > would be more code. Right. > Another option might be to make the API of > binaryheap look a little more like simplehash, where some #defines > control optional behavior and can tell the implementation where to find > fields in the structure. Interesting idea. > > Perhaps it's not worth the effort though, if performance is already > good enough? Yeah, it would be better to measure the overhead first. I'll do that. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Thu, 2024-04-04 at 09:31 +0900, Masahiko Sawada wrote: > IIUC, with your suggestion, sift_{up|down} needs to update the > heap_index field as well. Does it mean that the caller needs to pass > the address of heap_index down to sift_{up|down}? I'm not sure quite how binaryheap should be changed. Bringing the heap implementation into reorderbuffer.c would obviously work, but that would be more code. Another option might be to make the API of binaryheap look a little more like simplehash, where some #defines control optional behavior and can tell the implementation where to find fields in the structure. Perhaps it's not worth the effort though, if performance is already good enough? Regards, Jeff Davis
Re: Improve eviction algorithm in ReorderBuffer
Hi, On Thu, Apr 4, 2024 at 2:32 AM Jeff Davis wrote: > > On Wed, 2024-04-03 at 01:45 -0700, Jeff Davis wrote: > > I suggest that you add a "heap_index" field to ReorderBufferTXN that > > would point to the index into the heap's array (the same as > > bh_nodeidx_entry.index in your patch). Each time an element moves > > within the heap array, just follow the pointer to the > > ReorderBufferTXN > > object and update the heap_index -- no hash lookup required. > > It looks like my email was slightly too late, as the work was already > committed. Thank you for the suggestions! I should have mentioned it earlier. > > My suggestion is not required for 17, and so it's fine if this waits > until the next CF. If it turns out to be a win we can consider > backporting to 17 just to keep the code consistent, otherwise it can go > in 18. IIUC, with your suggestion, sift_{up|down} needs to update the heap_index field as well. Does it mean that the caller needs to pass the address of heap_index down to sift_{up|down}? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Wed, 2024-04-03 at 01:45 -0700, Jeff Davis wrote: > I suggest that you add a "heap_index" field to ReorderBufferTXN that > would point to the index into the heap's array (the same as > bh_nodeidx_entry.index in your patch). Each time an element moves > within the heap array, just follow the pointer to the > ReorderBufferTXN > object and update the heap_index -- no hash lookup required. It looks like my email was slightly too late, as the work was already committed. My suggestion is not required for 17, and so it's fine if this waits until the next CF. If it turns out to be a win we can consider backporting to 17 just to keep the code consistent, otherwise it can go in 18. Regards, Jeff Davis
Re: Improve eviction algorithm in ReorderBuffer
On Mon, 2024-04-01 at 12:42 +0900, Masahiko Sawada wrote: > While reviewing the patches, I realized the comment of > binaryheap_allocate() should also be updated. So I've attached the > new patches. In sift_{up|down}, each loop iteration calls set_node(), and each call to set_node does a hash lookup. I didn't measure it, but that feels wasteful. I don't even think you really need the hash table. The key to the hash table is a pointer, so it's not really doing anything that couldn't be done more efficiently by just following the pointer. I suggest that you add a "heap_index" field to ReorderBufferTXN that would point to the index into the heap's array (the same as bh_nodeidx_entry.index in your patch). Each time an element moves within the heap array, just follow the pointer to the ReorderBufferTXN object and update the heap_index -- no hash lookup required. That's not easy to do with the current binaryheap API. But a binary heap is not a terribly complex structure, so you can just do an inline implementation of it where sift_{up|down} know to update the heap_index field of the ReorderBufferTXN. Regards, Jeff Davis
Re: Improve eviction algorithm in ReorderBuffer
On Mon, Apr 1, 2024 at 11:26 AM Masahiko Sawada wrote: > > On Fri, Mar 29, 2024 at 7:37 PM Amit Kapila wrote: > > > > On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada > > wrote: > > > > > > On Fri, Mar 29, 2024 at 2:09 PM vignesh C wrote: > > > > > > > > On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada > > > > wrote: > > > > > > > > > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada > > > > > wrote: > > > > > > > > > > > > > > > > > > I've attached new version patches. > > > > > > > > > > Since the previous patch conflicts with the current HEAD, I've > > > > > attached the rebased patches. > > > > > > > > Thanks for the updated patch. > > > > One comment: > > > > I felt we can mention the improvement where we update memory > > > > accounting info at transaction level instead of per change level which > > > > is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and > > > > ReorderBufferSerializeTXN also in the commit message: > > > > > > Agreed. > > > > > > I think the patch is in good shape. I'll push the patch with the > > > suggestion next week, barring any objections. > > > > > > > Few minor comments: > > 1. > > @@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb) > > Assert(txn->nentries_mem == 0); > > } > > > > + ReorderBufferMaybeResetMaxHeap(rb); > > + > > > > Can we write a comment about why this reset is required here? > > Otherwise, the reason is not apparent. > > Yes, added. > > > > > 2. > > Although using max-heap to select the largest > > + * transaction is effective when there are many transactions being decoded, > > + * there is generally no need to use it as long as all transactions being > > + * decoded are top-level transactions. Therefore, we use MaxConnections as > > the > > + * threshold so we can prevent switching to the state unless we use > > + * subtransactions. 
> > + */ > > +#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections > > > > Isn't using max-heap equally effective in finding the largest > > transaction whether there are top-level or top-level plus > > subtransactions? This comment indicates it is only effective when > > there are subtransactions. > > You're right. Updated the comment. > > I've attached the updated patches. > While reviewing the patches, I realized the comment of binaryheap_allocate() should also be updated. So I've attached the new patches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v12-0001-Make-binaryheap-enlargeable.patch Description: Binary data v12-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch Description: Binary data v12-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch Description: Binary data
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Mar 29, 2024 at 8:48 PM Hayato Kuroda (Fujitsu) wrote: > > Dear Sawada-san, > > > Agreed. > > > > I think the patch is in good shape. I'll push the patch with the > > suggestion next week, barring any objections. > > Thanks for working on this. Agreed it is committable. > Few minor comments: Thank you for the comments! > > ``` > + * Either txn or change must be non-NULL at least. We update the memory > + * counter of txn if it's non-NULL, otherwise change->txn. > ``` > > IIUC no one checks the restriction. Should we add Assert() for it, e.g.: > Assert(txn || change)? Agreed to add it. > > ``` > +/* make sure enough space for a new node */ > ... > +/* make sure enough space for a new node */ > ``` > > Should be started with upper case? I don't think we need to change it. There are other comments in the same file that are one line and start with lowercase. I've just submitted the updated patches[1]. Regards, [1] https://www.postgresql.org/message-id/CAD21AoA6%3D%2BtL%3DbtB_s9N%2BcZK7tKz1W%3DPQyNq72nzjUcdyE%2BwZw%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Mar 29, 2024 at 7:37 PM Amit Kapila wrote: > > On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada > wrote: > > > > On Fri, Mar 29, 2024 at 2:09 PM vignesh C wrote: > > > > > > On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada > > > wrote: > > > > > > > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada > > > > wrote: > > > > > > > > > > > > > > > I've attached new version patches. > > > > > > > > Since the previous patch conflicts with the current HEAD, I've > > > > attached the rebased patches. > > > > > > Thanks for the updated patch. > > > One comment: > > > I felt we can mention the improvement where we update memory > > > accounting info at transaction level instead of per change level which > > > is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and > > > ReorderBufferSerializeTXN also in the commit message: > > > > Agreed. > > > > I think the patch is in good shape. I'll push the patch with the > > suggestion next week, barring any objections. > > > > Few minor comments: > 1. > @@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb) > Assert(txn->nentries_mem == 0); > } > > + ReorderBufferMaybeResetMaxHeap(rb); > + > > Can we write a comment about why this reset is required here? > Otherwise, the reason is not apparent. Yes, added. > > 2. > Although using max-heap to select the largest > + * transaction is effective when there are many transactions being decoded, > + * there is generally no need to use it as long as all transactions being > + * decoded are top-level transactions. Therefore, we use MaxConnections as > the > + * threshold so we can prevent switching to the state unless we use > + * subtransactions. > + */ > +#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections > > Isn't using max-heap equally effective in finding the largest > transaction whether there are top-level or top-level plus > subtransactions? This comment indicates it is only effective when > there are subtransactions. You're right. Updated the comment. 
I've attached the updated patches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v11-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch Description: Binary data v11-0001-Make-binaryheap-enlargeable.patch Description: Binary data v11-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch Description: Binary data
RE: Improve eviction algorithm in ReorderBuffer
Dear Sawada-san, > Agreed. > > I think the patch is in good shape. I'll push the patch with the > suggestion next week, barring any objections. Thanks for working on this. Agreed it is committable. Few minor comments: ``` + * Either txn or change must be non-NULL at least. We update the memory + * counter of txn if it's non-NULL, otherwise change->txn. ``` IIUC no one checks the restriction. Should we add Assert() for it, e.g.: Assert(txn || change)? ``` +/* make sure enough space for a new node */ ... +/* make sure enough space for a new node */ ``` Should be started with upper case? Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada wrote: > > On Fri, Mar 29, 2024 at 2:09 PM vignesh C wrote: > > > > On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada wrote: > > > > > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada > > > wrote: > > > > > > > > > > > > I've attached new version patches. > > > > > > Since the previous patch conflicts with the current HEAD, I've > > > attached the rebased patches. > > > > Thanks for the updated patch. > > One comment: > > I felt we can mention the improvement where we update memory > > accounting info at transaction level instead of per change level which > > is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and > > ReorderBufferSerializeTXN also in the commit message: > > Agreed. > > I think the patch is in good shape. I'll push the patch with the > suggestion next week, barring any objections. > Few minor comments: 1. @@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb) Assert(txn->nentries_mem == 0); } + ReorderBufferMaybeResetMaxHeap(rb); + Can we write a comment about why this reset is required here? Otherwise, the reason is not apparent. 2. Although using max-heap to select the largest + * transaction is effective when there are many transactions being decoded, + * there is generally no need to use it as long as all transactions being + * decoded are top-level transactions. Therefore, we use MaxConnections as the + * threshold so we can prevent switching to the state unless we use + * subtransactions. + */ +#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections Isn't using max-heap equally effective in finding the largest transaction whether there are top-level or top-level plus subtransactions? This comment indicates it is only effective when there are subtransactions. -- With Regards, Amit Kapila.
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Mar 29, 2024 at 2:09 PM vignesh C wrote: > > On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada wrote: > > > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada > > wrote: > > > > > > > > > I've attached new version patches. > > > > Since the previous patch conflicts with the current HEAD, I've > > attached the rebased patches. > > Thanks for the updated patch. > One comment: > I felt we can mention the improvement where we update memory > accounting info at transaction level instead of per change level which > is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and > ReorderBufferSerializeTXN also in the commit message: Agreed. I think the patch is in good shape. I'll push the patch with the suggestion next week, barring any objections. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada wrote: > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada > wrote: > > > > > > I've attached new version patches. > > Since the previous patch conflicts with the current HEAD, I've > attached the rebased patches. Thanks for the updated patch. One comment: I felt we can mention the improvement where we update memory accounting info at transaction level instead of per change level which is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and ReorderBufferSerializeTXN also in the commit message: @@ -1527,7 +1573,7 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) /* Check we're not mixing changes from different transactions. */ Assert(change->txn == txn); - ReorderBufferReturnChange(rb, change, true); + ReorderBufferReturnChange(rb, change, false); } /* @@ -1586,8 +1632,13 @@ ReorderBufferCleanupTXN(ReorderBuffer *rb, ReorderBufferTXN *txn) if (rbtxn_is_serialized(txn)) ReorderBufferRestoreCleanup(rb, txn); + /* Update the memory counter */ + ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size); Regards, Vignesh
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Mar 5, 2024 at 8:50 AM vignesh C wrote: > > On Wed, 28 Feb 2024 at 11:40, Amit Kapila wrote: > > > > On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada > > wrote: > > > > > > > A few comments on 0003: > > === > > 1. > > +/* > > + * Threshold of the total number of top-level and sub transactions > > that controls > > + * whether we switch the memory track state. While the MAINTAIN_HEAP state > > is > > + * effective when there are many transactions being decoded, in many > > systems > > + * there is generally no need to use it as long as all transactions > > being decoded > > + * are top-level transactions. Therefore, we use MaxConnections as > > the threshold > > + * so we can prevent switch to the state unless we use subtransactions. > > + */ > > +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections > > > > The comment seems to imply that MAINTAIN_HEAP is useful for large > > number of transactions but ReorderBufferLargestTXN() switches to this > > state even when there is one transaction. So, basically we use the > > binary_heap technique to get the largest even when we have one > > transaction but we don't maintain that heap unless we have > > REORDER_BUFFER_MEM_TRACK_THRESHOLD number of transactions are > > in-progress. This means there is some additional work when (build and > > reset heap each time when we pick largest xact) we have fewer > > transactions in the system but that may not be impacting us because of > > other costs involved like serializing all the changes. I think once we > > can try to stress test this by setting > > debug_logical_replication_streaming to 'immediate' to see if the new > > mechanism has any overhead. 
> > I ran the test with a transaction having many inserts:
> >
> >       |  5000 |      1 |      2 |     10 |     100 |     1000
> > ------|-------|--------|--------|--------|---------|---------
> > Head  | 26.31 |  48.84 |  93.65 | 480.05 | 4808.29 | 47020.16
> > Patch | 26.35 |  50.8  |  97.99 | 484.8  | 4856.95 | 48108.89
> >
> > The same test with debug_logical_replication_streaming = 'immediate':
> >
> >       |  5000 |      1 |      2 |      10 |      100 |      1000
> > ------|-------|--------|--------|---------|----------|----------
> > Head  | 59.29 | 115.84 | 227.21 | 1156.08 | 11367.42 | 113986.14
> > Patch | 62.45 | 120.48 | 240.56 | 1185.12 | 11855.37 | 119921.81
> >
> > The execution time is in milliseconds. The column header indicates the
> > number of inserts in the transaction.
> > In this case I noticed that the test execution with patch was taking
> > slightly more time.

I ran the tests for which Vignesh had reported an issue; the results with the latest patch are given below:

Without debug_logical_replication_streaming = 'immediate':

Record | 1000      | 100      | 10      | 2     | 1      | 5000
-------|-----------|----------|---------|-------|--------|-------
Head   | 47563.759 | 4917.057 | 478.923 | 97.28 | 50.368 | 25.917
Patch  | 47445.733 | 4722.874 | 472.817 | 95.15 | 48.801 | 26.168
%imp   | 0.248     | 3.949    | 1.274   | 2.189 | 3.111  | -0.968

With debug_logical_replication_streaming = 'immediate':

Record | 1000       | 100       | 10       | 2       | 1       | 5000
-------|------------|-----------|----------|---------|---------|-------
Head   | 106281.236 | 10669.992 | 1073.815 | 214.287 | 107.62  | 54.947
Patch  | 103108.673 | 10603.139 | 1064.98  | 210.229 | 106.321 | 54.218
%imp   | 2.985      | 0.626     | 0.822    | 1.893   | 1.207   | 1.326

The execution time is in milliseconds. The column header indicates the number of inserts in the transaction.

I can see from these results that the issue has been resolved with the new patch.

Thanks and Regards,
Shubham Khanna.
Re: Improve eviction algorithm in ReorderBuffer
On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada wrote: > > > I've attached new version patches. Since the previous patch conflicts with the current HEAD, I've attached the rebased patches. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com v10-0001-Make-binaryheap-enlargeable.patch Description: Binary data v10-0003-Improve-eviction-algorithm-in-Reorderbuffer-usin.patch Description: Binary data v10-0002-Add-functions-to-binaryheap-for-efficient-key-re.patch Description: Binary data
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Mar 13, 2024 at 11:23 AM Peter Smith wrote: > > On Wed, Mar 13, 2024 at 12:48 PM Masahiko Sawada > wrote: > > > > On Wed, Mar 13, 2024 at 10:15 AM Peter Smith wrote: > > > > > > On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada > > > wrote: > > > > > > > > On Fri, Mar 8, 2024 at 12:58 PM Peter Smith > > > > wrote: > > > > > > > > ... > > > > > > > 5. > > > > > > > + * > > > > > > > + * If 'indexed' is true, we create a hash table to track of each > > > > > > > node's > > > > > > > + * index in the heap, enabling to perform some operations such > > > > > > > as removing > > > > > > > + * the node from the heap. > > > > > > > */ > > > > > > > binaryheap * > > > > > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, > > > > > > > void *arg) > > > > > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare, > > > > > > > + bool indexed, void *arg) > > > > > > > > > > > > > > BEFORE > > > > > > > ... enabling to perform some operations such as removing the node > > > > > > > from the heap. > > > > > > > > > > > > > > SUGGESTION > > > > > > > ... to help make operations such as removing nodes more efficient. > > > > > > > > > > > > > > > > > > > But these operations literally require the indexed binary heap as we > > > > > > have an assertion: > > > > > > > > > > > > void > > > > > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d) > > > > > > { > > > > > > bh_nodeidx_entry *ent; > > > > > > > > > > > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); > > > > > > Assert(heap->bh_indexed); > > > > > > > > > > > > > > > > I didn’t quite understand -- the operations mentioned are "operations > > > > > such as removing the node", but binaryheap_remove_node() also removes > > > > > a node from the heap. So I still felt the comment wording of the patch > > > > > is not quite correct. > > > > > > > > Now I understand your point. That's a valid point. 
> > > > > > > > > > > > > > Now, if the removal of a node from an indexed heap can *only* be done > > > > > using binaryheap_remove_node_ptr() then: > > > > > - the other removal functions (binaryheap_remove_*) probably need some > > > > > comments to make sure nobody is tempted to call them directly for an > > > > > indexed heap. > > > > > - maybe some refactoring and assertions are needed to ensure those > > > > > *cannot* be called directly for an indexed heap. > > > > > > > > > > > > > If the 'index' is true, the caller can not only use the existing > > > > functions but also newly added functions such as > > > > binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about > > > > something like below? > > > > > > > > > > You said: "can not only use the existing functions but also..." > > > > > > Hmm. Is that right? IIUC those existing "remove" functions should NOT > > > be called directly if the heap was "indexed" because they'll delete > > > the node from the heap OK, but any corresponding index for that > > > deleted node will be left lying around -- i.e. everything gets out of > > > sync. This was the reason for my original concern. > > > > > > > All existing binaryheap functions should be available even if the > > binaryheap is 'indexed'. 
For instance, with the patch, > > binaryheap_remote_node() is: > > > > void > > binaryheap_remove_node(binaryheap *heap, int n) > > { > > int cmp; > > > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); > > Assert(n >= 0 && n < heap->bh_size); > > > > /* compare last node to the one that is being removed */ > > cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size], > >heap->bh_nodes[n], > >heap->bh_arg); > > > > /* remove the last node, placing it in the vacated entry */ > > replace_node(heap, n, heap->bh_nodes[heap->bh_size]); > > > > /* sift as needed to preserve the heap property */ > > if (cmp > 0) > > sift_up(heap, n); > > else if (cmp < 0) > > sift_down(heap, n); > > } > > > > The replace_node(), sift_up() and sift_down() update node's index as > > well if the binaryheap is indexed. When deleting the node from the > > binaryheap, it will also delete its index from the hash table. > > > > I see now. Thanks for the information. > > ~~~ > > Some more review comments for v8-0002 > > == > > 1. > +/* > + * Remove the node's index from the hash table if the heap is indexed. > + */ > +static bool > +delete_nodeidx(binaryheap *heap, bh_node_type node) > +{ > + if (!binaryheap_indexed(heap)) > + return false; > + > + return bh_nodeidx_delete(heap->bh_nodeidx, node); > +} > > I wasn't sure if having this function was a good idea. Yes, it makes > code more readable, but I felt the heap code ought to be as efficient > as possible so maybe it is better for the index check to be done at > the caller, instead of incurring any overhead of function calls that > might do nothing. > > SUGGESTION > if (binar
Re: Improve eviction algorithm in ReorderBuffer
On Mon, Mar 11, 2024 at 3:04 PM Peter Smith wrote: > > Here are some review comments for v8-0003 > > == > 0. GENERAL -- why the state enum? > > This patch introduced a new ReorderBufferMemTrackState with 2 states > (REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP, > REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP) > > It's the same as having a boolean flag OFF/ON, so I didn't see any > benefit of the enum instead of a simple boolean flag like > 'track_txn_sizes'. > > NOTE: Below in this post (see #11) I would like to propose another > idea, which can simplify much further, eliminating the need for the > state boolean. If adopted that will impact lots of these other review > comments. Good point! We used to use three states in the earlier version patch but now that we have only two we don't necessarily need to use an enum. I've used your idea. > > == > Commit Message > > 1. > Previously, when selecting the transaction to evict during logical > decoding, we check all transactions to find the largest > transaction. Which could lead to a significant replication lag > especially in case where there are many subtransactions. > > ~ > > /Which could/This could/ > > /in case/in the case/ Fixed. > > == > .../replication/logical/reorderbuffer.c > > 2. > * We use a max-heap with transaction size as the key to efficiently find > * the largest transaction. The max-heap state is managed in two states: > * REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and > REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP. > > /The max-heap state is managed in two states:/The max-heap is managed > in two states:/ This part is removed. > > ~~~ > > 3. > +/* > + * Threshold of the total number of top-level and sub transactions > that controls > + * whether we switch the memory track state. 
While using max-heap to select > + * the largest transaction is effective when there are many transactions > being > + * decoded, in many systems there is generally no need to use it as long as > all > + * transactions being decoded are top-level transactions. Therefore, we use > + * MaxConnections as the threshold* so we can prevent switch to the > state unless > + * we use subtransactions. > + */ > +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections > > 3a. > /memory track state./memory tracking state./ > > /While using max-heap/Although using max-heap/ > > "in many systems" (are these words adding anything?) > > /threshold*/threshold/ > > /so we can prevent switch/so we can prevent switching/ > Fixed. > ~ > > 3b. > There's nothing really in this name to indicate the units of the > threshold. Consider if there is some more informative name for this > macro: e.g. > MAXHEAP_TX_COUNT_THRESHOLD (?) Fixed. > > ~~~ > > 4. > + /* > + * Don't start with a lower number than > + * REORDER_BUFFER_MEM_TRACK_THRESHOLD, since we add at least > + * REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once. > + */ > + buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP; > + buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * > 2, > +ReorderBufferTXNSizeCompare, > +true, NULL); > + > > IIUC the comment intends to say: > > Allocate the initial heap size greater than THRESHOLD because the > txn_heap will not be used until the threshold is exceeded. > > Also, maybe the comment should make a point of saying "Note: the > binary heap is INDEXED for faster manipulations". or something > similar. > Fixed. > ~~~ > > 5. 
> static void > ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb, > ReorderBufferChange *change, > + ReorderBufferTXN *txn, > bool addition, Size sz) > { > - ReorderBufferTXN *txn; > ReorderBufferTXN *toptxn; > > - Assert(change->txn); > - > > There seems some trick now where the passed 'change' could be NULL, > which was not possible before. e.g., when change is NULL then 'txn' is > not NULL, and vice versa. Some explanation about this logic and the > meaning of these parameters should be written in this function > comment. Added comments. > > ~ > > 6. > + txn = txn != NULL ? txn : change->txn; > > IMO it's more natural to code the ternary using the same order as the > parameters: > > e.g. txn = change ? change->txn : txn; I see your point. I changed it to: if (txn == NULL) txn = change->txn; so we don't change txn if it's not NULL. > > ~~~ > > 7. > /* > * Build the max-heap and switch the state. We will run a heap assembly step > * at the end, which is more efficient. > */ > static void > ReorderBufferBuildMaxHeap(ReorderBuffer *rb) > > /We will run a heap assembly step at the end, which is more > efficient./The heap assembly step is deferred until the end, for > efficiency./ Fixed. > > ~~~ > > 8. ReorderBufferLargestTXN > > + if (hash_get_num_entries(rb->by_txn) < REORDER_BUFFER_MEM_TRACK_THRESHOLD) > + { > + HASH_SEQ_STATUS hash_seq; > + ReorderBufferTXNByIdEnt *ent; > + > + hash_seq_init(&hash_seq, rb->by_txn); > + while ((ent = hash_seq_search(&hash_seq)) != NULL) > + { > + ReorderBuf
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Mar 13, 2024 at 12:48 PM Masahiko Sawada wrote: > > On Wed, Mar 13, 2024 at 10:15 AM Peter Smith wrote: > > > > On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada > > wrote: > > > > > > On Fri, Mar 8, 2024 at 12:58 PM Peter Smith wrote: > > > > > > ... > > > > > > 5. > > > > > > + * > > > > > > + * If 'indexed' is true, we create a hash table to track of each > > > > > > node's > > > > > > + * index in the heap, enabling to perform some operations such as > > > > > > removing > > > > > > + * the node from the heap. > > > > > > */ > > > > > > binaryheap * > > > > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, > > > > > > void *arg) > > > > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare, > > > > > > + bool indexed, void *arg) > > > > > > > > > > > > BEFORE > > > > > > ... enabling to perform some operations such as removing the node > > > > > > from the heap. > > > > > > > > > > > > SUGGESTION > > > > > > ... to help make operations such as removing nodes more efficient. > > > > > > > > > > > > > > > > But these operations literally require the indexed binary heap as we > > > > > have an assertion: > > > > > > > > > > void > > > > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d) > > > > > { > > > > > bh_nodeidx_entry *ent; > > > > > > > > > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); > > > > > Assert(heap->bh_indexed); > > > > > > > > > > > > > I didn’t quite understand -- the operations mentioned are "operations > > > > such as removing the node", but binaryheap_remove_node() also removes > > > > a node from the heap. So I still felt the comment wording of the patch > > > > is not quite correct. > > > > > > Now I understand your point. That's a valid point. 
> > > > > > > > > > > Now, if the removal of a node from an indexed heap can *only* be done > > > > using binaryheap_remove_node_ptr() then: > > > > - the other removal functions (binaryheap_remove_*) probably need some > > > > comments to make sure nobody is tempted to call them directly for an > > > > indexed heap. > > > > - maybe some refactoring and assertions are needed to ensure those > > > > *cannot* be called directly for an indexed heap. > > > > > > > > > > If the 'index' is true, the caller can not only use the existing > > > functions but also newly added functions such as > > > binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about > > > something like below? > > > > > > > You said: "can not only use the existing functions but also..." > > > > Hmm. Is that right? IIUC those existing "remove" functions should NOT > > be called directly if the heap was "indexed" because they'll delete > > the node from the heap OK, but any corresponding index for that > > deleted node will be left lying around -- i.e. everything gets out of > > sync. This was the reason for my original concern. > > > > All existing binaryheap functions should be available even if the > binaryheap is 'indexed'. For instance, with the patch, > binaryheap_remote_node() is: > > void > binaryheap_remove_node(binaryheap *heap, int n) > { > int cmp; > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); > Assert(n >= 0 && n < heap->bh_size); > > /* compare last node to the one that is being removed */ > cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size], >heap->bh_nodes[n], >heap->bh_arg); > > /* remove the last node, placing it in the vacated entry */ > replace_node(heap, n, heap->bh_nodes[heap->bh_size]); > > /* sift as needed to preserve the heap property */ > if (cmp > 0) > sift_up(heap, n); > else if (cmp < 0) > sift_down(heap, n); > } > > The replace_node(), sift_up() and sift_down() update node's index as > well if the binaryheap is indexed. 
> When deleting the node from the binaryheap, it will also delete its
> index from the hash table.
>

I see now. Thanks for the information.

~~~

Some more review comments for v8-0002

==

1.
+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static bool
+delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+	if (!binaryheap_indexed(heap))
+		return false;
+
+	return bh_nodeidx_delete(heap->bh_nodeidx, node);
+}

I wasn't sure if having this function was a good idea. Yes, it makes
code more readable, but I felt the heap code ought to be as efficient
as possible so maybe it is better for the index check to be done at
the caller, instead of incurring any overhead of function calls that
might do nothing.

SUGGESTION
if (binaryheap_indexed(heap))
    found = bh_nodeidx_delete(heap->bh_nodeidx, node);

~~~

2.
+/*
+ * binaryheap_update_up
+ *
+ * Sift the given node up after the node's key is updated. The caller must
+ * ensure that the given node is in the heap. O(log n) worst case.
+ *
+ * This function can be used only if the heap is indexed.
+ */
+void
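For readers following along, here is a minimal standalone sketch of the idea behind binaryheap_update_up(). The names are hypothetical stand-ins, not the real binaryheap API: with a node-to-slot map maintained alongside the heap (the role bh_nodeidx plays), a key that was increased in place can be repaired by sifting up from the tracked slot in O(log n), with no scan to locate the node.

```c
#include <assert.h>

/* Tiny illustrative max-heap; pos[] stands in for the bh_nodeidx hash table
 * that maps a node id to its current heap slot. Illustrative only. */
typedef struct { int key[8]; int id[8]; int pos[8]; int size; } Heap;

static void swap_slots(Heap *h, int a, int b)
{
    int tk = h->key[a], ti = h->id[a];

    h->key[a] = h->key[b]; h->id[a] = h->id[b];
    h->key[b] = tk;        h->id[b] = ti;
    /* keep the node-to-slot map in sync on every move */
    h->pos[h->id[a]] = a;
    h->pos[h->id[b]] = b;
}

/* In the spirit of binaryheap_update_up(): the caller has already increased
 * the node's key in place, so restore the heap property from the node's
 * tracked position upward. */
static void update_up(Heap *h, int id)
{
    int n = h->pos[id];

    while (n > 0 && h->key[(n - 1) / 2] < h->key[n])
    {
        swap_slots(h, n, (n - 1) / 2);
        n = (n - 1) / 2;
    }
}
```

This is the operation ReorderBuffer needs when a tracked transaction grows: bump the size, then sift from the slot the index already knows, rather than rebuilding or searching the heap.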
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Mar 13, 2024 at 10:15 AM Peter Smith wrote: > > On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada wrote: > > > > On Fri, Mar 8, 2024 at 12:58 PM Peter Smith wrote: > > > > ... > > > > > 5. > > > > > + * > > > > > + * If 'indexed' is true, we create a hash table to track of each > > > > > node's > > > > > + * index in the heap, enabling to perform some operations such as > > > > > removing > > > > > + * the node from the heap. > > > > > */ > > > > > binaryheap * > > > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, > > > > > void *arg) > > > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare, > > > > > + bool indexed, void *arg) > > > > > > > > > > BEFORE > > > > > ... enabling to perform some operations such as removing the node > > > > > from the heap. > > > > > > > > > > SUGGESTION > > > > > ... to help make operations such as removing nodes more efficient. > > > > > > > > > > > > > But these operations literally require the indexed binary heap as we > > > > have an assertion: > > > > > > > > void > > > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d) > > > > { > > > > bh_nodeidx_entry *ent; > > > > > > > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); > > > > Assert(heap->bh_indexed); > > > > > > > > > > I didn’t quite understand -- the operations mentioned are "operations > > > such as removing the node", but binaryheap_remove_node() also removes > > > a node from the heap. So I still felt the comment wording of the patch > > > is not quite correct. > > > > Now I understand your point. That's a valid point. > > > > > > > > Now, if the removal of a node from an indexed heap can *only* be done > > > using binaryheap_remove_node_ptr() then: > > > - the other removal functions (binaryheap_remove_*) probably need some > > > comments to make sure nobody is tempted to call them directly for an > > > indexed heap. 
> > > - maybe some refactoring and assertions are needed to ensure those
> > > *cannot* be called directly for an indexed heap.
> > >
> >
> > If the 'index' is true, the caller can not only use the existing
> > functions but also newly added functions such as
> > binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
> > something like below?
> >
>
> You said: "can not only use the existing functions but also..."
>
> Hmm. Is that right? IIUC those existing "remove" functions should NOT
> be called directly if the heap was "indexed" because they'll delete
> the node from the heap OK, but any corresponding index for that
> deleted node will be left lying around -- i.e. everything gets out of
> sync. This was the reason for my original concern.
>

All existing binaryheap functions should be available even if the
binaryheap is 'indexed'. For instance, with the patch,
binaryheap_remove_node() is:

void
binaryheap_remove_node(binaryheap *heap, int n)
{
	int			cmp;

	Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
	Assert(n >= 0 && n < heap->bh_size);

	/* compare last node to the one that is being removed */
	cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
						   heap->bh_nodes[n],
						   heap->bh_arg);

	/* remove the last node, placing it in the vacated entry */
	replace_node(heap, n, heap->bh_nodes[heap->bh_size]);

	/* sift as needed to preserve the heap property */
	if (cmp > 0)
		sift_up(heap, n);
	else if (cmp < 0)
		sift_down(heap, n);
}

The replace_node(), sift_up() and sift_down() update the node's index as
well if the binaryheap is indexed. When deleting a node from the
binaryheap, it will also delete its index from the hash table.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
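To make the index-maintenance point above concrete, here is a self-contained toy version of the same removal pattern. It is illustrative only: the array idx[] stands in for the bh_nodeidx hash table, and all names are simplified, not the real binaryheap API. Every node movement updates the node-to-slot map, which is what makes O(log n) removal by node value possible.

```c
#include <assert.h>

#define MAXVAL 100

/* Toy indexed max-heap of distinct ints in [0, MAXVAL). Illustrative only:
 * idx[v] plays the role of the bh_nodeidx hash table (node -> heap slot). */
typedef struct { int nodes[32]; int size; int idx[MAXVAL]; } Heap;

static void place(Heap *h, int pos, int v) { h->nodes[pos] = v; h->idx[v] = pos; }

static void swap_nodes(Heap *h, int a, int b)
{
    int t = h->nodes[a];

    place(h, a, h->nodes[b]);
    place(h, b, t);
}

static void sift_up(Heap *h, int n)
{
    while (n > 0 && h->nodes[(n - 1) / 2] < h->nodes[n])
    {
        swap_nodes(h, n, (n - 1) / 2);
        n = (n - 1) / 2;
    }
}

static void sift_down(Heap *h, int n)
{
    for (;;)
    {
        int l = 2 * n + 1, r = 2 * n + 2, m = n;

        if (l < h->size && h->nodes[l] > h->nodes[m]) m = l;
        if (r < h->size && h->nodes[r] > h->nodes[m]) m = r;
        if (m == n) break;
        swap_nodes(h, n, m);
        n = m;
    }
}

static void heap_add(Heap *h, int v) { place(h, h->size, v); sift_up(h, h->size++); }

/* Mirrors the binaryheap_remove_node() pattern quoted above: move the last
 * node into the vacated slot, then sift; every move keeps idx[] in sync,
 * just as replace_node()/sift_up()/sift_down() maintain bh_nodeidx. */
static void remove_at(Heap *h, int n)
{
    int removed = h->nodes[n];
    int last = h->nodes[--h->size];
    int cmp = (last > removed) - (last < removed);

    place(h, n, last);      /* replace_node() */
    h->idx[removed] = -1;   /* delete_nodeidx() for the evicted node */

    if (cmp > 0)
        sift_up(h, n);
    else if (cmp < 0)
        sift_down(h, n);
}

/* O(log n) removal by node value -- the point of indexing: no linear search. */
static void remove_node_ptr(Heap *h, int v) { remove_at(h, h->idx[v]); }
```

Calling remove_at() directly still leaves the map consistent, matching the point that the existing remove functions stay usable on an indexed heap.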
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada wrote: > > On Fri, Mar 8, 2024 at 12:58 PM Peter Smith wrote: > > ... > > > > 5. > > > > + * > > > > + * If 'indexed' is true, we create a hash table to track of each node's > > > > + * index in the heap, enabling to perform some operations such as > > > > removing > > > > + * the node from the heap. > > > > */ > > > > binaryheap * > > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void > > > > *arg) > > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare, > > > > + bool indexed, void *arg) > > > > > > > > BEFORE > > > > ... enabling to perform some operations such as removing the node from > > > > the heap. > > > > > > > > SUGGESTION > > > > ... to help make operations such as removing nodes more efficient. > > > > > > > > > > But these operations literally require the indexed binary heap as we > > > have an assertion: > > > > > > void > > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d) > > > { > > > bh_nodeidx_entry *ent; > > > > > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); > > > Assert(heap->bh_indexed); > > > > > > > I didn’t quite understand -- the operations mentioned are "operations > > such as removing the node", but binaryheap_remove_node() also removes > > a node from the heap. So I still felt the comment wording of the patch > > is not quite correct. > > Now I understand your point. That's a valid point. > > > > > Now, if the removal of a node from an indexed heap can *only* be done > > using binaryheap_remove_node_ptr() then: > > - the other removal functions (binaryheap_remove_*) probably need some > > comments to make sure nobody is tempted to call them directly for an > > indexed heap. > > - maybe some refactoring and assertions are needed to ensure those > > *cannot* be called directly for an indexed heap. 
> > > > If the 'index' is true, the caller can not only use the existing > functions but also newly added functions such as > binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about > something like below? > You said: "can not only use the existing functions but also..." Hmm. Is that right? IIUC those existing "remove" functions should NOT be called directly if the heap was "indexed" because they'll delete the node from the heap OK, but any corresponding index for that deleted node will be left lying around -- i.e. everything gets out of sync. This was the reason for my original concern. > * If 'indexed' is true, we create a hash table to track each node's > * index in the heap, enabling to perform some operations such as > * binaryheap_remove_node_ptr() etc. > Yeah, something like that... I'll wait for the next patch version before commenting further. -- Kind Regards, Peter Smith. Fujitsu Australia
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Mar 8, 2024 at 12:58 PM Peter Smith wrote: > > On Thu, Mar 7, 2024 at 2:16 PM Masahiko Sawada wrote: > > > > On Tue, Mar 5, 2024 at 3:28 PM Peter Smith wrote: > > > > > > > 4a. > > > The comment in simplehash.h says > > > * The following parameters are only relevant when SH_DEFINE is defined: > > > * - SH_KEY - ... > > > * - SH_EQUAL(table, a, b) - ... > > > * - SH_HASH_KEY(table, key) - ... > > > * - SH_STORE_HASH - ... > > > * - SH_GET_HASH(tb, a) - ... > > > > > > So maybe it is nicer to reorder the #defines in that same order? > > > > > > SUGGESTION: > > > +#define SH_PREFIX bh_nodeidx > > > +#define SH_ELEMENT_TYPE bh_nodeidx_entry > > > +#define SH_KEY_TYPE bh_node_type > > > +#define SH_SCOPE extern > > > +#ifdef FRONTEND > > > +#define SH_RAW_ALLOCATOR pg_malloc0 > > > +#endif > > > > > > +#define SH_DEFINE > > > +#define SH_KEY key > > > +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0) > > > +#define SH_HASH_KEY(tb, key) \ > > > + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type)) > > > +#include "lib/simplehash.h" > > > > I'm really not sure it helps increase readability. For instance, for > > me it's readable if SH_DEFINE and SH_DECLARE come to the last before > > #include since it's more obvious whether we want to declare, define or > > both. Other simplehash.h users also do so. > > > > OK. > > > > 5. > > > + * > > > + * If 'indexed' is true, we create a hash table to track of each node's > > > + * index in the heap, enabling to perform some operations such as > > > removing > > > + * the node from the heap. > > > */ > > > binaryheap * > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void > > > *arg) > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare, > > > + bool indexed, void *arg) > > > > > > BEFORE > > > ... enabling to perform some operations such as removing the node from > > > the heap. > > > > > > SUGGESTION > > > ... 
to help make operations such as removing nodes more efficient. > > > > > > > But these operations literally require the indexed binary heap as we > > have an assertion: > > > > void > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d) > > { > > bh_nodeidx_entry *ent; > > > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); > > Assert(heap->bh_indexed); > > > > I didn’t quite understand -- the operations mentioned are "operations > such as removing the node", but binaryheap_remove_node() also removes > a node from the heap. So I still felt the comment wording of the patch > is not quite correct. Now I understand your point. That's a valid point. > > Now, if the removal of a node from an indexed heap can *only* be done > using binaryheap_remove_node_ptr() then: > - the other removal functions (binaryheap_remove_*) probably need some > comments to make sure nobody is tempted to call them directly for an > indexed heap. > - maybe some refactoring and assertions are needed to ensure those > *cannot* be called directly for an indexed heap. > If the 'index' is true, the caller can not only use the existing functions but also newly added functions such as binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about something like below? * If 'indexed' is true, we create a hash table to track each node's * index in the heap, enabling to perform some operations such as * binaryheap_remove_node_ptr() etc. > > And, here are some review comments for v8-0002. > > == > 1. delete_nodeidx > > +/* > + * Remove the node's index from the hash table if the heap is indexed. > + */ > +static bool > +delete_nodeidx(binaryheap *heap, bh_node_type node) > +{ > + if (!binaryheap_indexed(heap)) > + return false; > + > + return bh_nodeidx_delete(heap->bh_nodeidx, node); > +} > > 1a. > In v8 this function was changed to now return bool, so, I think the > function comment should explain the meaning of that return value. > > ~ > > 1b. 
> I felt the function body is better expressed positively: "If this then > do that", instead of "If not this then do nothing otherwise do that" > > SUGGESTION > if (binaryheap_indexed(heap)) > return bh_nodeidx_delete(heap->bh_nodeidx, node); > > return false; > > ~~~ > > 2. > +static void > +replace_node(binaryheap *heap, int index, bh_node_type new_node) > +{ > + bool found PG_USED_FOR_ASSERTS_ONLY; > + > + /* Quick return if not necessary to move */ > + if (heap->bh_nodes[index] == new_node) > + return; > + > + /* > + * Remove overwritten node's index. The overwritten node's position must > + * have been tracked, if enabled. > + */ > + found = delete_nodeidx(heap, heap->bh_nodes[index]); > + Assert(!binaryheap_indexed(heap) || found); > + > + /* > + * Replace it with the given new node. This node's position must also be > + * tracked as we assume to replace the node by the existing node. > + */ > + found = set_node(heap, new_node, index); > + Assert(!binaryheap_indexed(heap) || found); > +} > > 2a.
Re: Improve eviction algorithm in ReorderBuffer
Here are some review comments for v8-0003 == 0. GENERAL -- why the state enum? This patch introduced a new ReorderBufferMemTrackState with 2 states (REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP, REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP) It's the same as having a boolean flag OFF/ON, so I didn't see any benefit of the enum instead of a simple boolean flag like 'track_txn_sizes'. NOTE: Below in this post (see #11) I would like to propose another idea, which can simplify much further, eliminating the need for the state boolean. If adopted that will impact lots of these other review comments. == Commit Message 1. Previously, when selecting the transaction to evict during logical decoding, we check all transactions to find the largest transaction. Which could lead to a significant replication lag especially in case where there are many subtransactions. ~ /Which could/This could/ /in case/in the case/ == .../replication/logical/reorderbuffer.c 2. * We use a max-heap with transaction size as the key to efficiently find * the largest transaction. The max-heap state is managed in two states: * REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP and REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP. /The max-heap state is managed in two states:/The max-heap is managed in two states:/ ~~~ 3. +/* + * Threshold of the total number of top-level and sub transactions that controls + * whether we switch the memory track state. While using max-heap to select + * the largest transaction is effective when there are many transactions being + * decoded, in many systems there is generally no need to use it as long as all + * transactions being decoded are top-level transactions. Therefore, we use + * MaxConnections as the threshold* so we can prevent switch to the state unless + * we use subtransactions. + */ +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections 3a. /memory track state./memory tracking state./ /While using max-heap/Although using max-heap/ "in many systems" (are these words adding anything?) 
/threshold*/threshold/ /so we can prevent switch/so we can prevent switching/ ~ 3b. There's nothing really in this name to indicate the units of the threshold. Consider if there is some more informative name for this macro: e.g. MAXHEAP_TX_COUNT_THRESHOLD (?) ~~~ 4. + /* + * Don't start with a lower number than + * REORDER_BUFFER_MEM_TRACK_THRESHOLD, since we add at least + * REORDER_BUFFER_MEM_TRACK_THRESHOLD entries at once. + */ + buffer->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP; + buffer->txn_heap = binaryheap_allocate(REORDER_BUFFER_MEM_TRACK_THRESHOLD * 2, +ReorderBufferTXNSizeCompare, +true, NULL); + IIUC the comment intends to say: Allocate the initial heap size greater than THRESHOLD because the txn_heap will not be used until the threshold is exceeded. Also, maybe the comment should make a point of saying "Note: the binary heap is INDEXED for faster manipulations". or something similar. ~~~ 5. static void ReorderBufferChangeMemoryUpdate(ReorderBuffer *rb, ReorderBufferChange *change, + ReorderBufferTXN *txn, bool addition, Size sz) { - ReorderBufferTXN *txn; ReorderBufferTXN *toptxn; - Assert(change->txn); - There seems some trick now where the passed 'change' could be NULL, which was not possible before. e.g., when change is NULL then 'txn' is not NULL, and vice versa. Some explanation about this logic and the meaning of these parameters should be written in this function comment. ~ 6. + txn = txn != NULL ? txn : change->txn; IMO it's more natural to code the ternary using the same order as the parameters: e.g. txn = change ? change->txn : txn; ~~~ 7. /* * Build the max-heap and switch the state. We will run a heap assembly step * at the end, which is more efficient. */ static void ReorderBufferBuildMaxHeap(ReorderBuffer *rb) /We will run a heap assembly step at the end, which is more efficient./The heap assembly step is deferred until the end, for efficiency./ ~~~ 8. 
ReorderBufferLargestTXN + if (hash_get_num_entries(rb->by_txn) < REORDER_BUFFER_MEM_TRACK_THRESHOLD) + { + HASH_SEQ_STATUS hash_seq; + ReorderBufferTXNByIdEnt *ent; + + hash_seq_init(&hash_seq, rb->by_txn); + while ((ent = hash_seq_search(&hash_seq)) != NULL) + { + ReorderBufferTXN *txn = ent->txn; + + /* if the current transaction is larger, remember it */ + if ((!largest) || (txn->size > largest->size)) + largest = txn; + } + + Assert(largest); + } That Assert(largest) seems redundant because there is anyway another Assert(largest) immediately after this code. ~~~ 9. + /* Get the largest transaction from the max-heap */ + if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP) + { + Assert(binaryheap_size(rb->txn_heap) > 0); + largest = (ReorderBufferTXN *) + DatumGetPointer(binaryheap_first(rb->txn_heap)); } Assert(binaryheap_size(rb->txn_heap) > 0); seemed like slightly less readable way of saying: Assert(!binaryheap_empty(rb->txn_heap)); ~~~ 10. + +/* + * Comp
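The dual-path selection shown above for ReorderBufferLargestTXN() — a plain O(n) scan while the transaction count is below the threshold, and a bottom-up max-heap assembly otherwise — can be sketched in a self-contained way. This is only an illustrative model, not the PostgreSQL code: Txn, TXN_COUNT_THRESHOLD, sift_down, and largest_txn are hypothetical stand-ins for ReorderBufferTXN, REORDER_BUFFER_MEM_TRACK_THRESHOLD, and the binaryheap machinery.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for ReorderBufferTXN; only the size key matters. */
typedef struct Txn { size_t size; } Txn;

/* Stand-in for REORDER_BUFFER_MEM_TRACK_THRESHOLD (kept tiny for the demo;
 * the patch uses MaxConnections). */
#define TXN_COUNT_THRESHOLD 4

/* Sift-down over an array of Txn pointers, keyed by size (max-heap). */
static void sift_down(Txn **h, int n, int i)
{
    for (;;)
    {
        int l = 2 * i + 1, r = l + 1, m = i;
        Txn *tmp;

        if (l < n && h[l]->size > h[m]->size) m = l;
        if (r < n && h[r]->size > h[m]->size) m = r;
        if (m == i) return;
        tmp = h[i]; h[i] = h[m]; h[m] = tmp;
        i = m;
    }
}

/*
 * Pick the largest transaction. Below the threshold a linear scan is cheap
 * enough; at or above it, assemble a max-heap bottom-up (O(n) total) so the
 * maximum is then available at the root in O(1). Note the heap path reorders
 * the caller's array, mirroring building the heap at eviction time.
 */
static Txn *largest_txn(Txn **txns, int n)
{
    if (n < TXN_COUNT_THRESHOLD)
    {
        Txn *largest = NULL;
        for (int i = 0; i < n; i++)
            if (largest == NULL || txns[i]->size > largest->size)
                largest = txns[i];
        return largest;
    }

    /* Bottom-up heap assembly, as a deferred binaryheap_build() would do. */
    for (int i = n / 2 - 1; i >= 0; i--)
        sift_down(txns, n, i);
    return txns[0];
}
```

The point of deferring heap construction until eviction time (and until enough transactions exist) is that the common few-transaction case pays no per-update heap maintenance cost at all.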
Re: Improve eviction algorithm in ReorderBuffer
On Thu, Mar 7, 2024 at 2:16 PM Masahiko Sawada wrote: > > On Tue, Mar 5, 2024 at 3:28 PM Peter Smith wrote: > > > > 4a. > > The comment in simplehash.h says > > * The following parameters are only relevant when SH_DEFINE is defined: > > * - SH_KEY - ... > > * - SH_EQUAL(table, a, b) - ... > > * - SH_HASH_KEY(table, key) - ... > > * - SH_STORE_HASH - ... > > * - SH_GET_HASH(tb, a) - ... > > > > So maybe it is nicer to reorder the #defines in that same order? > > > > SUGGESTION: > > +#define SH_PREFIX bh_nodeidx > > +#define SH_ELEMENT_TYPE bh_nodeidx_entry > > +#define SH_KEY_TYPE bh_node_type > > +#define SH_SCOPE extern > > +#ifdef FRONTEND > > +#define SH_RAW_ALLOCATOR pg_malloc0 > > +#endif > > > > +#define SH_DEFINE > > +#define SH_KEY key > > +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0) > > +#define SH_HASH_KEY(tb, key) \ > > + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type)) > > +#include "lib/simplehash.h" > > I'm really not sure it helps increase readability. For instance, for > me it's readable if SH_DEFINE and SH_DECLARE come to the last before > #include since it's more obvious whether we want to declare, define or > both. Other simplehash.h users also do so. > OK. > > 5. > > + * > > + * If 'indexed' is true, we create a hash table to track of each node's > > + * index in the heap, enabling to perform some operations such as removing > > + * the node from the heap. > > */ > > binaryheap * > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg) > > +binaryheap_allocate(int capacity, binaryheap_comparator compare, > > + bool indexed, void *arg) > > > > BEFORE > > ... enabling to perform some operations such as removing the node from the > > heap. > > > > SUGGESTION > > ... to help make operations such as removing nodes more efficient. 
> > > > But these operations literally require the indexed binary heap as we > have an assertion: > > void > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d) > { > bh_nodeidx_entry *ent; > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); > Assert(heap->bh_indexed); > I didn’t quite understand -- the operations mentioned are "operations such as removing the node", but binaryheap_remove_node() also removes a node from the heap. So I still felt the comment wording of the patch is not quite correct. Now, if the removal of a node from an indexed heap can *only* be done using binaryheap_remove_node_ptr() then: - the other removal functions (binaryheap_remove_*) probably need some comments to make sure nobody is tempted to call them directly for an indexed heap. - maybe some refactoring and assertions are needed to ensure those *cannot* be called directly for an indexed heap. > > 7b. > > IMO the parameters would be better the other way around (e.g. 'index' > > before the 'node') because that's what the assignments look like: > > > > > > heap->bh_nodes[heap->bh_size] = d; > > > > becomes: > > bh_set_node(heap, heap->bh_size, d); > > > > I think it assumes heap->bh_nodes is an array. But if we change it in > the future, it will no longer make sense. I think it would make more > sense if we define the parameters in an order like "we set the 'node' > at 'index'". What do you think? YMMV. The patch code is also OK by me if you prefer it. // And, here are some review comments for v8-0002. == 1. delete_nodeidx +/* + * Remove the node's index from the hash table if the heap is indexed. + */ +static bool +delete_nodeidx(binaryheap *heap, bh_node_type node) +{ + if (!binaryheap_indexed(heap)) + return false; + + return bh_nodeidx_delete(heap->bh_nodeidx, node); +} 1a. In v8 this function was changed to now return bool, so, I think the function comment should explain the meaning of that return value. ~ 1b. 
I felt the function body is better expressed positively: "If this then do that", instead of "If not this then do nothing otherwise do that" SUGGESTION if (binaryheap_indexed(heap)) return bh_nodeidx_delete(heap->bh_nodeidx, node); return false; ~~~ 2. +static void +replace_node(binaryheap *heap, int index, bh_node_type new_node) +{ + bool found PG_USED_FOR_ASSERTS_ONLY; + + /* Quick return if not necessary to move */ + if (heap->bh_nodes[index] == new_node) + return; + + /* + * Remove overwritten node's index. The overwritten node's position must + * have been tracked, if enabled. + */ + found = delete_nodeidx(heap, heap->bh_nodes[index]); + Assert(!binaryheap_indexed(heap) || found); + + /* + * Replace it with the given new node. This node's position must also be + * tracked as we assume to replace the node by the existing node. + */ + found = set_node(heap, new_node, index); + Assert(!binaryheap_indexed(heap) || found); +} 2a. /Remove overwritten/Remove the overwritten/ /replace the node by the existing node/replace the node with the existing node/ ~ 2b. It might be helpful to declare another local var... bh_node_type
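The indexed-heap behavior under review here — set_node()/replace_node() tracking each node's current position so that removal by value is O(log n) rather than a full scan — can be sketched with a toy model. This is a hypothetical illustration, not the patch's code: it uses small integer keys and a direct-address pos[] array in place of the bh_nodeidx simplehash table, and the IndexedHeap/ih_* names are invented for the sketch.

```c
#include <assert.h>

#define MAX_NODES 64
#define MAX_KEY   256

/*
 * Simplified "indexed" max-heap over small non-negative int keys.
 * pos[key] plays the role of the bh_nodeidx hash table in the patch: it
 * remembers where each key currently sits in nodes[], so a key can be
 * located (and removed) without scanning the whole array.
 */
typedef struct IndexedHeap {
    int nodes[MAX_NODES];
    int pos[MAX_KEY];       /* key -> index in nodes[], or -1 if absent */
    int size;
} IndexedHeap;

static void ih_init(IndexedHeap *h)
{
    h->size = 0;
    for (int k = 0; k < MAX_KEY; k++)
        h->pos[k] = -1;
}

/* Counterpart of bh_set_node(): store the node and track its index. */
static void ih_set(IndexedHeap *h, int index, int node)
{
    h->nodes[index] = node;
    h->pos[node] = index;
}

static void ih_sift_up(IndexedHeap *h, int i)
{
    while (i > 0 && h->nodes[(i - 1) / 2] < h->nodes[i])
    {
        int p = (i - 1) / 2, tmp = h->nodes[i];
        ih_set(h, i, h->nodes[p]);
        ih_set(h, p, tmp);
        i = p;
    }
}

static void ih_sift_down(IndexedHeap *h, int i)
{
    for (;;)
    {
        int l = 2 * i + 1, r = l + 1, m = i, tmp;

        if (l < h->size && h->nodes[l] > h->nodes[m]) m = l;
        if (r < h->size && h->nodes[r] > h->nodes[m]) m = r;
        if (m == i) return;
        tmp = h->nodes[i];
        ih_set(h, i, h->nodes[m]);
        ih_set(h, m, tmp);
        i = m;
    }
}

static void ih_add(IndexedHeap *h, int node)
{
    ih_set(h, h->size, node);
    h->size++;
    ih_sift_up(h, h->size - 1);
}

/* Counterpart of binaryheap_remove_node_ptr(): O(log n) via the index. */
static void ih_remove(IndexedHeap *h, int node)
{
    int i = h->pos[node];

    assert(i >= 0);             /* the node's position must be tracked */
    h->pos[node] = -1;          /* analogue of delete_nodeidx() */
    h->size--;
    if (i < h->size)
    {
        ih_set(h, i, h->nodes[h->size]);    /* analogue of replace_node() */
        ih_sift_up(h, i);
        ih_sift_down(h, i);
    }
}
```

Note how every slot write goes through ih_set(), so the position index can never go stale — which is exactly why the review asks whether the other removal paths of an indexed binaryheap need matching assertions.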
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Mar 5, 2024 at 3:28 PM Peter Smith wrote: > > Hi, here are some review comments for v7-0002 > > == > Commit Message > > 1. > This commit adds a hash table to binaryheap in order to track of > positions of each nodes in the binaryheap. That way, by using newly > added functions such as binaryheap_update_up() etc., both updating a > key and removing a node can be done in O(1) on an average and O(log n) > in worst case. This is known as the indexed binary heap. The caller > can specify to use the indexed binaryheap by passing indexed = true. > > ~ > > /to track of positions of each nodes/to track the position of each node/ > > ~~~ > > 2. > There is no user of it but it will be used by a upcoming patch. > > ~ > > The current code does not use the new indexing logic, but it will be > used by an upcoming patch. Fixed. > > == > src/common/binaryheap.c > > 3. > +/* > + * Define parameters for hash table code generation. The interface is *also*" > + * declared in binaryheaph.h (to generate the types, which are externally > + * visible). > + */ > > Typo: *also*" Fixed. > > ~~~ > > 4. > +#define SH_PREFIX bh_nodeidx > +#define SH_ELEMENT_TYPE bh_nodeidx_entry > +#define SH_KEY_TYPE bh_node_type > +#define SH_KEY key > +#define SH_HASH_KEY(tb, key) \ > + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type)) > +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0) > +#define SH_SCOPE extern > +#ifdef FRONTEND > +#define SH_RAW_ALLOCATOR pg_malloc0 > +#endif > +#define SH_DEFINE > +#include "lib/simplehash.h" > > 4a. > The comment in simplehash.h says > * The following parameters are only relevant when SH_DEFINE is defined: > * - SH_KEY - ... > * - SH_EQUAL(table, a, b) - ... > * - SH_HASH_KEY(table, key) - ... > * - SH_STORE_HASH - ... > * - SH_GET_HASH(tb, a) - ... > > So maybe it is nicer to reorder the #defines in that same order? 
> > SUGGESTION: > +#define SH_PREFIX bh_nodeidx > +#define SH_ELEMENT_TYPE bh_nodeidx_entry > +#define SH_KEY_TYPE bh_node_type > +#define SH_SCOPE extern > +#ifdef FRONTEND > +#define SH_RAW_ALLOCATOR pg_malloc0 > +#endif > > +#define SH_DEFINE > +#define SH_KEY key > +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0) > +#define SH_HASH_KEY(tb, key) \ > + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type)) > +#include "lib/simplehash.h" I'm really not sure it helps increase readability. For instance, for me it's readable if SH_DEFINE and SH_DECLARE come to the last before #include since it's more obvious whether we want to declare, define or both. Other simplehash.h users also do so. > > ~~ > > 4b. > The comment in simplehash.h says that "it's preferable, if possible, > to store the element's hash in the element's data type", so should > SH_STORE_HASH and SH_GET_HASH also be defined here? Good catch. I've used these macros. > > ~~~ > > 5. > + * > + * If 'indexed' is true, we create a hash table to track of each node's > + * index in the heap, enabling to perform some operations such as removing > + * the node from the heap. > */ > binaryheap * > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg) > +binaryheap_allocate(int capacity, binaryheap_comparator compare, > + bool indexed, void *arg) > > BEFORE > ... enabling to perform some operations such as removing the node from the > heap. > > SUGGESTION > ... to help make operations such as removing nodes more efficient. > But these operations literally require the indexed binary heap as we have an assertion: void binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d) { bh_nodeidx_entry *ent; Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property); Assert(heap->bh_indexed); > ~~~ > > 6. 
> + heap->bh_indexed = indexed; > + if (heap->bh_indexed) > + { > +#ifdef FRONTEND > + heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL); > +#else > + heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity, > + NULL); > +#endif > + } > + > > The heap allocation just uses palloc instead of palloc0 so it might be > better to assign "heap->bh_nodeidx = NULL;" up-front, just so you will > never get a situation where bh_indexed is false but bh_nodeidx has > some (garbage) value. Fixed. > > ~~~ > > 7. > +/* > + * Set the given node at the 'index', and updates its position accordingly. > + * > + * Return true if the node's index is already tracked. > + */ > +static bool > +bh_set_node(binaryheap *heap, bh_node_type node, int index) > > 7a. > I felt the 1st sentence should be more like: > > SUGGESTION > Set the given node at the 'index' and track it if required. Fixed. > > ~ > > 7b. > IMO the parameters would be better the other way around (e.g. 'index' > before the 'node') because that's what the assignments look like: > > > heap->bh_nodes[heap->bh_size] = d; > > becomes: > bh_set_node(heap, heap->bh_size, d); > I think it assumes heap->bh_nodes is an array. But if we change it in the futur
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Mar 5, 2024 at 12:20 PM vignesh C wrote: > > On Wed, 28 Feb 2024 at 11:40, Amit Kapila wrote: > > > > On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada > > wrote: > > > > > > > A few comments on 0003: > > === > > 1. > > +/* > > + * Threshold of the total number of top-level and sub transactions > > that controls > > + * whether we switch the memory track state. While the MAINTAIN_HEAP state > > is > > + * effective when there are many transactions being decoded, in many > > systems > > + * there is generally no need to use it as long as all transactions > > being decoded > > + * are top-level transactions. Therefore, we use MaxConnections as > > the threshold > > + * so we can prevent switch to the state unless we use subtransactions. > > + */ > > +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections > > > > The comment seems to imply that MAINTAIN_HEAP is useful for large > > number of transactions but ReorderBufferLargestTXN() switches to this > > state even when there is one transaction. So, basically we use the > > binary_heap technique to get the largest even when we have one > > transaction but we don't maintain that heap unless we have > > REORDER_BUFFER_MEM_TRACK_THRESHOLD number of transactions are > > in-progress. This means there is some additional work when (build and > > reset heap each time when we pick largest xact) we have fewer > > transactions in the system but that may not be impacting us because of > > other costs involved like serializing all the changes. I think once we > > can try to stress test this by setting > > debug_logical_replication_streaming to 'immediate' to see if the new > > mechanism has any overhead. 
> I ran the test with a transaction having many inserts:
>
>       |  5000 | 10000 | 20000 | 100000 | 1000000 | 10000000
> ------|-------|-------|-------|--------|---------|---------
> Head  | 26.31 | 48.84 | 93.65 | 480.05 | 4808.29 | 47020.16
> Patch | 26.35 | 50.8  | 97.99 | 484.8  | 4856.95 | 48108.89
>
> The same test with debug_logical_replication_streaming = 'immediate':
>
>       |  5000 |  10000 |  20000 |  100000 |  1000000 |  10000000
> ------|-------|--------|--------|---------|----------|----------
> Head  | 59.29 | 115.84 | 227.21 | 1156.08 | 11367.42 | 113986.14
> Patch | 62.45 | 120.48 | 240.56 | 1185.12 | 11855.37 | 119921.81
>
> The execution time is in milliseconds. The column header indicates the
> number of inserts in the transaction.
> In this case I noticed that the test execution with patch was taking
> slightly more time.
>

Thank you for testing! With 10M records, I can see 2% regression in the 'buffered' case and 5% regression in the 'immediate' case. I think that in general it makes sense to postpone using a max-heap until the number of transactions is higher than the threshold. I've implemented this idea and here are the results on my environment (with 10M records and debug_logical_replication_streaming = 'immediate'):

HEAD: 68937.887 ms, 69450.174 ms, 68808.248 ms
v7 patch: 71280.783 ms, 71673.101 ms, 71330.898 ms
v8 patch: 68918.259 ms, 68822.330 ms, 68972.452 ms

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

v8-0001-Make-binaryheap-enlargeable.patch (Description: Binary data)
v8-0002-Add-functions-to-binaryheap-for-efficient-key-rem.patch (Description: Binary data)
v8-0003-Improve-eviction-algorithm-in-Reorderbuffer-using.patch (Description: Binary data)
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Mar 5, 2024 at 11:25 AM Peter Smith wrote: > > Hi, Here are some review comments for v7-0001 Thank you for reviewing the patch. > > 1. > /* > * binaryheap_free > * > * Releases memory used by the given binaryheap. > */ > void > binaryheap_free(binaryheap *heap) > { > pfree(heap); > } > > > Shouldn't the above function (not modified by the patch) also firstly > free the memory allocated for the heap->bh_nodes? > > ~~~ > > 2. > +/* > + * Make sure there is enough space for nodes. > + */ > +static void > +bh_enlarge_node_array(binaryheap *heap) > +{ > + heap->bh_space *= 2; > + heap->bh_nodes = repalloc(heap->bh_nodes, > + sizeof(bh_node_type) * heap->bh_space); > +} > > Strictly speaking, this function doesn't really "Make sure" of > anything because the caller does the check whether we need more space. > All that happens here is allocating more space. Maybe this function > comment should say something like "Double the space allocated for > nodes." Agreed with the above two points. I'll fix them in the next version patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
Hi, here are some review comments for v7-0002 == Commit Message 1. This commit adds a hash table to binaryheap in order to track of positions of each nodes in the binaryheap. That way, by using newly added functions such as binaryheap_update_up() etc., both updating a key and removing a node can be done in O(1) on an average and O(log n) in worst case. This is known as the indexed binary heap. The caller can specify to use the indexed binaryheap by passing indexed = true. ~ /to track of positions of each nodes/to track the position of each node/ ~~~ 2. There is no user of it but it will be used by a upcoming patch. ~ The current code does not use the new indexing logic, but it will be used by an upcoming patch. == src/common/binaryheap.c 3. +/* + * Define parameters for hash table code generation. The interface is *also*" + * declared in binaryheaph.h (to generate the types, which are externally + * visible). + */ Typo: *also*" ~~~ 4. +#define SH_PREFIX bh_nodeidx +#define SH_ELEMENT_TYPE bh_nodeidx_entry +#define SH_KEY_TYPE bh_node_type +#define SH_KEY key +#define SH_HASH_KEY(tb, key) \ + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type)) +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0) +#define SH_SCOPE extern +#ifdef FRONTEND +#define SH_RAW_ALLOCATOR pg_malloc0 +#endif +#define SH_DEFINE +#include "lib/simplehash.h" 4a. The comment in simplehash.h says * The following parameters are only relevant when SH_DEFINE is defined: * - SH_KEY - ... * - SH_EQUAL(table, a, b) - ... * - SH_HASH_KEY(table, key) - ... * - SH_STORE_HASH - ... * - SH_GET_HASH(tb, a) - ... So maybe it is nicer to reorder the #defines in that same order? 
SUGGESTION: +#define SH_PREFIX bh_nodeidx +#define SH_ELEMENT_TYPE bh_nodeidx_entry +#define SH_KEY_TYPE bh_node_type +#define SH_SCOPE extern +#ifdef FRONTEND +#define SH_RAW_ALLOCATOR pg_malloc0 +#endif +#define SH_DEFINE +#define SH_KEY key +#define SH_EQUAL(tb, a, b) (memcmp(&a, &b, sizeof(bh_node_type)) == 0) +#define SH_HASH_KEY(tb, key) \ + hash_bytes((const unsigned char *) &key, sizeof(bh_node_type)) +#include "lib/simplehash.h" ~~ 4b. The comment in simplehash.h says that "it's preferable, if possible, to store the element's hash in the element's data type", so should SH_STORE_HASH and SH_GET_HASH also be defined here? ~~~ 5. + * + * If 'indexed' is true, we create a hash table to track of each node's + * index in the heap, enabling to perform some operations such as removing + * the node from the heap. */ binaryheap * -binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg) +binaryheap_allocate(int capacity, binaryheap_comparator compare, + bool indexed, void *arg) BEFORE ... enabling to perform some operations such as removing the node from the heap. SUGGESTION ... to help make operations such as removing nodes more efficient. ~~~ 6. + heap->bh_indexed = indexed; + if (heap->bh_indexed) + { +#ifdef FRONTEND + heap->bh_nodeidx = bh_nodeidx_create(capacity, NULL); +#else + heap->bh_nodeidx = bh_nodeidx_create(CurrentMemoryContext, capacity, + NULL); +#endif + } + The heap allocation just uses palloc instead of palloc0 so it might be better to assign "heap->bh_nodeidx = NULL;" up-front, just so you will never get a situation where bh_indexed is false but bh_nodeidx has some (garbage) value. ~~~ 7. +/* + * Set the given node at the 'index', and updates its position accordingly. + * + * Return true if the node's index is already tracked. + */ +static bool +bh_set_node(binaryheap *heap, bh_node_type node, int index) 7a. I felt the 1st sentence should be more like: SUGGESTION Set the given node at the 'index' and track it if required. ~ 7b. 
IMO the parameters would be better the other way around (e.g. 'index' before the 'node') because that's what the assignments look like: heap->bh_nodes[heap->bh_size] = d; becomes: bh_set_node(heap, heap->bh_size, d); ~~~ 8. +static bool +bh_set_node(binaryheap *heap, bh_node_type node, int index) +{ + bh_nodeidx_entry *ent; + bool found = false; + + /* Set the node to the nodes array */ + heap->bh_nodes[index] = node; + + if (heap->bh_indexed) + { + /* Remember its index in the nodes array */ + ent = bh_nodeidx_insert(heap->bh_nodeidx, node, &found); + ent->idx = index; + } + + return found; +} 8a. That 'ent' declaration can be moved to the inner block scope, so it is closer to where it is needed. ~ 8b. + /* Remember its index in the nodes array */ The comment is worded a bit ambiguously. IMO a simpler comment would be: "/* Keep track of the node index. */" ~~~ 9. +static void +bh_delete_nodeidx(binaryheap *heap, bh_node_type node) +{ + if (!heap->bh_indexed) + return; + + (void) bh_nodeidx_delete(heap->bh_nodeidx, node); +} Since there is only 1 statement IMO it is simpler to write this function like below: if (heap->bh_indexed) (void) bh_nodeidx_delete(heap->bh_nodeidx, node); ~~~ 10. +/* + * Replace the node a
Re: Improve eviction algorithm in ReorderBuffer
On Wed, 28 Feb 2024 at 11:40, Amit Kapila wrote: > > On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada wrote: > > > > A few comments on 0003: > === > 1. > +/* > + * Threshold of the total number of top-level and sub transactions > that controls > + * whether we switch the memory track state. While the MAINTAIN_HEAP state is > + * effective when there are many transactions being decoded, in many systems > + * there is generally no need to use it as long as all transactions > being decoded > + * are top-level transactions. Therefore, we use MaxConnections as > the threshold > + * so we can prevent switch to the state unless we use subtransactions. > + */ > +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections > > The comment seems to imply that MAINTAIN_HEAP is useful for large > number of transactions but ReorderBufferLargestTXN() switches to this > state even when there is one transaction. So, basically we use the > binary_heap technique to get the largest even when we have one > transaction but we don't maintain that heap unless we have > REORDER_BUFFER_MEM_TRACK_THRESHOLD number of transactions are > in-progress. This means there is some additional work when (build and > reset heap each time when we pick largest xact) we have fewer > transactions in the system but that may not be impacting us because of > other costs involved like serializing all the changes. I think once we > can try to stress test this by setting > debug_logical_replication_streaming to 'immediate' to see if the new > mechanism has any overhead. 
I ran the test with a transaction having many inserts:

      |  5000 | 10000 | 20000 | 100000 | 1000000 | 10000000
------|-------|-------|-------|--------|---------|---------
Head  | 26.31 | 48.84 | 93.65 | 480.05 | 4808.29 | 47020.16
Patch | 26.35 | 50.8  | 97.99 | 484.8  | 4856.95 | 48108.89

The same test with debug_logical_replication_streaming = 'immediate':

      |  5000 |  10000 |  20000 |  100000 |  1000000 |  10000000
------|-------|--------|--------|---------|----------|----------
Head  | 59.29 | 115.84 | 227.21 | 1156.08 | 11367.42 | 113986.14
Patch | 62.45 | 120.48 | 240.56 | 1185.12 | 11855.37 | 119921.81

The execution time is in milliseconds. The column header indicates the number of inserts in the transaction.
In this case I noticed that the test execution with patch was taking slightly more time.

Regards,
Vignesh
Re: Improve eviction algorithm in ReorderBuffer
Hi, Here are some review comments for v7-0001 1. /* * binaryheap_free * * Releases memory used by the given binaryheap. */ void binaryheap_free(binaryheap *heap) { pfree(heap); } Shouldn't the above function (not modified by the patch) also firstly free the memory allocated for the heap->bh_nodes? ~~~ 2. +/* + * Make sure there is enough space for nodes. + */ +static void +bh_enlarge_node_array(binaryheap *heap) +{ + heap->bh_space *= 2; + heap->bh_nodes = repalloc(heap->bh_nodes, + sizeof(bh_node_type) * heap->bh_space); +} Strictly speaking, this function doesn't really "Make sure" of anything because the caller does the check whether we need more space. All that happens here is allocating more space. Maybe this function comment should say something like "Double the space allocated for nodes." -- Kind Regards, Peter Smith. Fujitsu Australia
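The two points above — freeing the node array before the heap itself, and the enlarge function merely doubling the allocated space while the caller decides when growth is needed — can be illustrated with a minimal stand-alone sketch. The Heap type and heap_* names here are hypothetical stand-ins, not the actual binaryheap.c API.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal stand-in for an enlargeable binaryheap node array. */
typedef struct Heap {
    int *nodes;
    int  size;      /* nodes currently stored */
    int  space;     /* allocated capacity */
} Heap;

static Heap *heap_allocate(int capacity)
{
    Heap *h = malloc(sizeof(Heap));

    h->nodes = malloc(sizeof(int) * capacity);
    h->size = 0;
    h->space = capacity;
    return h;
}

/*
 * Double the space allocated for nodes. The caller checks whether growth
 * is actually needed, matching the division of labor noted in comment #2.
 */
static void heap_enlarge_node_array(Heap *h)
{
    h->space *= 2;
    h->nodes = realloc(h->nodes, sizeof(int) * h->space);
}

static void heap_add_unordered(Heap *h, int d)
{
    if (h->size >= h->space)
        heap_enlarge_node_array(h);
    h->nodes[h->size++] = d;
}

/* Free the node array before the heap itself, per comment #1. */
static void heap_free(Heap *h)
{
    free(h->nodes);
    free(h);
}
```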
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Feb 28, 2024 at 3:10 PM Amit Kapila wrote:
>
> On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada wrote:
> >
> > Thank you for the comments!
>
> A few comments on 0003:
> ===
> 1.
> +/*
> + * Threshold of the total number of top-level and sub transactions
> that controls
> + * whether we switch the memory track state. While the MAINTAIN_HEAP state is
> + * effective when there are many transactions being decoded, in many systems
> + * there is generally no need to use it as long as all transactions
> being decoded
> + * are top-level transactions. Therefore, we use MaxConnections as
> the threshold
> + * so we can prevent switch to the state unless we use subtransactions.
> + */
> +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections
>
> The comment seems to imply that MAINTAIN_HEAP is useful for large
> number of transactions but ReorderBufferLargestTXN() switches to this
> state even when there is one transaction. So, basically we use the
> binary_heap technique to get the largest even when we have one
> transaction but we don't maintain that heap unless we have
> REORDER_BUFFER_MEM_TRACK_THRESHOLD number of transactions are
> in-progress.

Right.

> This means there is some additional work when (build and
> reset heap each time when we pick largest xact) we have fewer
> transactions in the system but that may not be impacting us because of
> other costs involved like serializing all the changes. I think once we
> can try to stress test this by setting
> debug_logical_replication_streaming to 'immediate' to see if the new
> mechanism has any overhead.

Agreed. I've done performance tests that decode 10k small transactions (pgbench transactions) with debug_logical_replication_streaming = 'immediate':

master: 6263.022 ms
patched: 6403.873 ms

I don't see noticeable regressions.

>
> 2. Can we somehow measure the additional memory that will be consumed
> by each backend/walsender to maintain transactions?
> Because I think
> this also won't be accounted for logical_decoding_work_mem, so if this
> is large, there could be a chance of more complaints on us for not
> honoring logical_decoding_work_mem.

Good point. We initialize the binaryheap with MaxConnections * 2 entries, and the binaryheap entries are pointers. So we use an additional (8 * 100 * 2) bytes with the default max_connections setting even when there is one transaction, and could use more memory when adding more transactions. I think there is still room for considering how to determine the threshold and the number of initial entries. Using MaxConnections seems to work, but it always uses the current MaxConnections value instead of the value that was set at the time when the WAL records were written. As for the initial number of entries in the binaryheap, I think we can use the threshold value as the initial number of entries instead of (threshold * 2). Or we might want to use the same value, 1000, as the one we use for the buffer->by_txn hash table.

>
> 3.
> @@ -3707,11 +3817,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb,
> ReorderBufferTXN *txn)
>
> ReorderBufferSerializeChange(rb, txn, fd, change);
> dlist_delete(&change->node);
> - ReorderBufferReturnChange(rb, change, true);
> + ReorderBufferReturnChange(rb, change, false);
>
> spilled++;
> }
>
> + /* Update the memory counter */
> + ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);
>
> In ReorderBufferSerializeTXN(), we already use a size variable for
> txn->size, we can probably use that for the sake of consistency.

Agreed, will fix it.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Mon, Feb 26, 2024 at 7:54 PM Masahiko Sawada wrote: > A few comments on 0003: === 1. +/* + * Threshold of the total number of top-level and sub transactions that controls + * whether we switch the memory track state. While the MAINTAIN_HEAP state is + * effective when there are many transactions being decoded, in many systems + * there is generally no need to use it as long as all transactions being decoded + * are top-level transactions. Therefore, we use MaxConnections as the threshold + * so we can prevent switch to the state unless we use subtransactions. + */ +#define REORDER_BUFFER_MEM_TRACK_THRESHOLD MaxConnections The comment seems to imply that MAINTAIN_HEAP is useful for large number of transactions but ReorderBufferLargestTXN() switches to this state even when there is one transaction. So, basically we use the binary_heap technique to get the largest even when we have one transaction but we don't maintain that heap unless we have REORDER_BUFFER_MEM_TRACK_THRESHOLD number of transactions are in-progress. This means there is some additional work when (build and reset heap each time when we pick largest xact) we have fewer transactions in the system but that may not be impacting us because of other costs involved like serializing all the changes. I think once we can try to stress test this by setting debug_logical_replication_streaming to 'immediate' to see if the new mechanism has any overhead. 2. Can we somehow measure the additional memory that will be consumed by each backend/walsender to maintain transactions? Because I think this also won't be accounted for logical_decoding_work_mem, so if this is large, there could be a chance of more complaints on us for not honoring logical_decoding_work_mem. 3. 
@@ -3707,11 +3817,14 @@ ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)

 		ReorderBufferSerializeChange(rb, txn, fd, change);
 		dlist_delete(&change->node);
-		ReorderBufferReturnChange(rb, change, true);
+		ReorderBufferReturnChange(rb, change, false);

 		spilled++;
 	}

+	/* Update the memory counter */
+	ReorderBufferChangeMemoryUpdate(rb, NULL, txn, false, txn->size);

In ReorderBufferSerializeTXN(), we already use a size variable for txn->size, we can probably use that for the sake of consistency.

--
With Regards,
Amit Kapila.
Re: Improve eviction algorithm in ReorderBuffer
On Mon, 26 Feb 2024 at 12:33, Masahiko Sawada wrote: > > On Fri, Feb 23, 2024 at 6:24 PM vignesh C wrote: > > > > On Fri, 9 Feb 2024 at 20:51, Masahiko Sawada wrote: > > > > > > > > > I think this performance regression is not acceptable. In this > > > workload, one transaction has 10k subtransactions and the logical > > > decoding becomes quite slow if logical_decoding_work_mem is not big > > > enough. Therefore, it's a legitimate and common approach to increase > > > logical_decoding_work_mem to speedup the decoding. However, with thie > > > patch, the decoding becomes slower than today. It's a bad idea in > > > general to optimize an extreme case while sacrificing the normal (or > > > more common) cases. > > > > > > > Since this same function is used by pg_dump sorting TopoSort functions > > also, we can just verify once if there is no performance impact with > > large number of objects during dump sorting: > > Okay. I've run the pg_dump regression tests with --timer flag (note > that pg_dump doesn't use indexed binary heap): > > master: > [16:00:25] t/001_basic.pl ok 151 ms ( 0.00 usr > 0.00 sys + 0.09 cusr 0.06 csys = 0.15 CPU) > [16:00:25] t/002_pg_dump.pl .. ok10157 ms ( 0.23 usr > 0.01 sys + 1.48 cusr 0.37 csys = 2.09 CPU) > [16:00:36] t/003_pg_dump_with_server.pl .. ok 504 ms ( 0.00 usr > 0.01 sys + 0.10 cusr 0.07 csys = 0.18 CPU) > [16:00:36] t/004_pg_dump_parallel.pl . ok 1044 ms ( 0.00 usr > 0.00 sys + 0.12 cusr 0.08 csys = 0.20 CPU) > [16:00:37] t/005_pg_dump_filterfile.pl ... ok 2390 ms ( 0.00 usr > 0.00 sys + 0.34 cusr 0.19 csys = 0.53 CPU) > [16:00:40] t/010_dump_connstr.pl . ok 4813 ms ( 0.01 usr > 0.00 sys + 2.13 cusr 0.45 csys = 2.59 CPU) > > patched: > [15:59:47] t/001_basic.pl ok 150 ms ( 0.00 usr > 0.00 sys + 0.08 cusr 0.07 csys = 0.15 CPU) > [15:59:47] t/002_pg_dump.pl .. ok10057 ms ( 0.23 usr > 0.02 sys + 1.49 cusr 0.36 csys = 2.10 CPU) > [15:59:57] t/003_pg_dump_with_server.pl .. 
ok 509 ms ( 0.00 usr
> 0.00 sys + 0.09 cusr 0.08 csys = 0.17 CPU)
> [15:59:58] t/004_pg_dump_parallel.pl . ok 1048 ms ( 0.01 usr
> 0.00 sys + 0.11 cusr 0.11 csys = 0.23 CPU)
> [15:59:59] t/005_pg_dump_filterfile.pl ... ok 2398 ms ( 0.00 usr
> 0.00 sys + 0.34 cusr 0.20 csys = 0.54 CPU)
> [16:00:01] t/010_dump_connstr.pl . ok 4762 ms ( 0.01 usr
> 0.00 sys + 2.15 cusr 0.42 csys = 2.58 CPU)
>
> There is no noticeable difference between the two results.

Thanks for verifying it, I have also run in my environment and found no noticeable difference between them:

Head:
[07:29:41] t/001_basic.pl ok 332 ms
[07:29:41] t/002_pg_dump.pl .. ok 11029 ms
[07:29:52] t/003_pg_dump_with_server.pl .. ok 705 ms
[07:29:53] t/004_pg_dump_parallel.pl . ok 1198 ms
[07:29:54] t/005_pg_dump_filterfile.pl ... ok 2822 ms
[07:29:57] t/010_dump_connstr.pl . ok 5582 ms

With Patch:
[07:42:16] t/001_basic.pl ok 328 ms
[07:42:17] t/002_pg_dump.pl .. ok 11044 ms
[07:42:28] t/003_pg_dump_with_server.pl .. ok 719 ms
[07:42:29] t/004_pg_dump_parallel.pl . ok 1188 ms
[07:42:30] t/005_pg_dump_filterfile.pl ... ok 2816 ms
[07:42:33] t/010_dump_connstr.pl . ok 5609 ms

Regards,
Vignesh
Re: Improve eviction algorithm in ReorderBuffer
On Mon, Feb 26, 2024 at 6:43 PM Tomas Vondra wrote: > > > > On 2/26/24 07:46, Masahiko Sawada wrote: > > On Sat, Feb 24, 2024 at 1:29 AM Tomas Vondra > > wrote: > >>... > >> > >> overall design > >> -- > >> > >> As for the design, I agree with the approach of using a binaryheap to > >> track transactions by size. When going over the thread history, > >> describing the initial approach with only keeping "large" transactions > >> above some threshold (e.g. 10%), I was really concerned that'll either > >> lead to abrupt changes in behavior (when transactions move just around > >> the 10%), or won't help with many common cases (with most transactions > >> being below the limit). > >> > >> I was going to suggest some sort of "binning" - keeping lists for > >> transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and > >> evicting transactions from a list, i.e. based on approximate size. But > >> if the indexed binary heap seems to be cheap enough, I think it's a > >> better solution. > > > > I've also considered the binning idea. But it was not clear to me how > > it works well in a case where all transactions belong to the > > particular class. For example, if we need to free up 1MB memory, we > > could end up evicting 2000 transactions consuming 50 bytes instead of > > 100 transactions consuming 1000 bytes, resulting in that we end up > > serializing more transactions. Also, I'm concerned about the cost of > > maintaining the binning lists. > > > > I don't think the list maintenance would be very costly - in particular, > the lists would not need to be sorted by size. You're right in some > extreme cases we might evict the smallest transactions in the list. I > think on average we'd evict transactions with average size, which seems > OK for this use case. > > Anyway, I don't think we need to be distracted with this. 
I mentioned it > merely to show it was considered, but the heap seems to work well > enough, and in the end is even simpler because the complexity is hidden > outside reorderbuffer. > > >> > >> The one thing I'm a bit concerned about is the threshold used to start > >> using binary heap - these thresholds with binary decisions may easily > >> lead to a "cliff" and robustness issues, i.e. abrupt change in behavior > >> with significant runtime change (e.g. you add/remove one transaction and > >> the code takes a much more expensive path). The value (1024) seems > >> rather arbitrary, I wonder if there's something to justify that choice. > > > > True. 1024 seems small to me. In my environment, I started to see a > > big difference from around 4 transactions. But it varies depending > > on environments and workloads. > > > > I think that this performance problem we're addressing doesn't > > normally happen as long as all transactions being decoded are > > top-level transactions. Otherwise, we also need to improve > > ReorderBufferLargestStreamableTopTXN(). Given this fact, I think > > max_connections = 1024 is a possible value in some systems, and I've > > observed such systems sometimes. On the other hand, I've observed > > > 5000 in just a few cases, and having more than 5000 transactions in > > ReorderBuffer seems unlikely to happen without subtransactions. I > > think we can say it's an extreme case, the number is still an > > arbitrary number though. > > > > Or probably we can compute the threshold based on max_connections, > > e.g., max_connections * 10. That way, we can ensure that users won't > > incur the max-heap maintenance costs as long as they don't use > > subtransactions. > > > > Tying this to max_connections seems like an interesting option. It'd > make this adaptive to a system. I haven't thought about the exact value > (m_c * 10), but it seems better than arbitrary hard-coded values. I've updated the patch accordingly, using MaxConnections for now. 
I've also updated some comments and commit messages and added typedef.list changes. > > >> > >> In any case, I agree it'd be good to have some dampening factor, to > >> reduce the risk of trashing because of adding/removing a single > >> transaction to the decoding. > >> > >> > >> related stuff / GenerationContext > >> - > >> > >> It's not the fault of this patch, but this reminds me I have some doubts > >> about how the eviction interferes with using the GenerationContext for > >> some of the data. I suspect we can easily get into a situation where we > >> evict the largest transaction, but that doesn't actually reduce the > >> memory usage at all, because the memory context blocks are shared with > >> some other transactions and don't get 100% empty (so we can't release > >> them). But it's actually worse, because GenerationContext does not even > >> reuse this memory. So do we even gain anything by the eviction? > >> > >> When the earlier patch versions also considered age of the transaction, > >> to try evicting the older ones first, I think that was interesting. I > >>
Re: Improve eviction algorithm in ReorderBuffer
On 2/26/24 07:46, Masahiko Sawada wrote: > On Sat, Feb 24, 2024 at 1:29 AM Tomas Vondra > wrote: >>... >> >> overall design >> -- >> >> As for the design, I agree with the approach of using a binaryheap to >> track transactions by size. When going over the thread history, >> describing the initial approach with only keeping "large" transactions >> above some threshold (e.g. 10%), I was really concerned that'll either >> lead to abrupt changes in behavior (when transactions move just around >> the 10%), or won't help with many common cases (with most transactions >> being below the limit). >> >> I was going to suggest some sort of "binning" - keeping lists for >> transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and >> evicting transactions from a list, i.e. based on approximate size. But >> if the indexed binary heap seems to be cheap enough, I think it's a >> better solution. > > I've also considered the binning idea. But it was not clear to me how > it works well in a case where all transactions belong to the > particular class. For example, if we need to free up 1MB memory, we > could end up evicting 2000 transactions consuming 50 bytes instead of > 100 transactions consuming 1000 bytes, resulting in that we end up > serializing more transactions. Also, I'm concerned about the cost of > maintaining the binning lists. > I don't think the list maintenance would be very costly - in particular, the lists would not need to be sorted by size. You're right in some extreme cases we might evict the smallest transactions in the list. I think on average we'd evict transactions with average size, which seems OK for this use case. Anyway, I don't think we need to be distracted with this. I mentioned it merely to show it was considered, but the heap seems to work well enough, and in the end is even simpler because the complexity is hidden outside reorderbuffer. 
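To make the "binning" alternative discussed above concrete, here is a minimal, illustrative-only C sketch — explicitly NOT what the patch implements, and all names (`size_to_bin`, `largest_nonempty_bin`, `NBINS`) are invented for the example. Transactions would live on lists bucketed by the power of two of their size; eviction pops from the highest non-empty bucket, so an evicted entry is within 2x of the true largest transaction, with O(1) list maintenance per size change:

```c
/*
 * Sketch of size-class binning for approximate largest-first eviction.
 * Hypothetical helper names; not PostgreSQL code.
 */
#define NBINS 32

/* Map a size in bytes to a bucket: bin i holds sizes in [2^i, 2^(i+1)). */
int
size_to_bin(unsigned long size)
{
	int			bin = 0;

	while (size >>= 1)
		bin++;
	return bin < NBINS ? bin : NBINS - 1;
}

/*
 * Given per-bin entry counts, eviction would scan down from the top to
 * find the highest non-empty bucket (a tiny, bounded loop).  Any entry
 * in that bucket is at least half the size of the largest transaction.
 */
int
largest_nonempty_bin(const int counts[NBINS])
{
	for (int i = NBINS - 1; i >= 0; i--)
		if (counts[i] > 0)
			return i;
	return -1;
}
```

This also illustrates the downside Sawada-san raised: within one bucket the evicted entry is chosen without regard to exact size, so in degenerate cases many small transactions from the same bucket get evicted.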
>> >> The one thing I'm a bit concerned about is the threshold used to start >> using binary heap - these thresholds with binary decisions may easily >> lead to a "cliff" and robustness issues, i.e. abrupt change in behavior >> with significant runtime change (e.g. you add/remove one transaction and >> the code takes a much more expensive path). The value (1024) seems >> rather arbitrary, I wonder if there's something to justify that choice. > > True. 1024 seems small to me. In my environment, I started to see a > big difference from around 4 transactions. But it varies depending > on environments and workloads. > > I think that this performance problem we're addressing doesn't > normally happen as long as all transactions being decoded are > top-level transactions. Otherwise, we also need to improve > ReorderBufferLargestStreamableTopTXN(). Given this fact, I think > max_connections = 1024 is a possible value in some systems, and I've > observed such systems sometimes. On the other hand, I've observed > > 5000 in just a few cases, and having more than 5000 transactions in > ReorderBuffer seems unlikely to happen without subtransactions. I > think we can say it's an extreme case, the number is still an > arbitrary number though. > > Or probably we can compute the threshold based on max_connections, > e.g., max_connections * 10. That way, we can ensure that users won't > incur the max-heap maintenance costs as long as they don't use > subtransactions. > Tying this to max_connections seems like an interesting option. It'd make this adaptive to a system. I haven't thought about the exact value (m_c * 10), but it seems better than arbitrary hard-coded values. >> >> In any case, I agree it'd be good to have some dampening factor, to >> reduce the risk of trashing because of adding/removing a single >> transaction to the decoding. 
>> >> >> related stuff / GenerationContext >> - >> >> It's not the fault of this patch, but this reminds me I have some doubts >> about how the eviction interferes with using the GenerationContext for >> some of the data. I suspect we can easily get into a situation where we >> evict the largest transaction, but that doesn't actually reduce the >> memory usage at all, because the memory context blocks are shared with >> some other transactions and don't get 100% empty (so we can't release >> them). But it's actually worse, because GenerationContext does not even >> reuse this memory. So do we even gain anything by the eviction? >> >> When the earlier patch versions also considered age of the transaction, >> to try evicting the older ones first, I think that was interesting. I >> think we may want to do something like this even with the binary heap. > > Thank you for raising this issue. This is one of the highest priority > items in my backlog. We've seen cases where the logical decoding uses > much more memory than logical_decoding_work_mem value[1][2] (e.g. it > used 4GB memory even though the logical_decoding_work_mem was 256kB). > I think that the problem w
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Feb 23, 2024 at 6:24 PM vignesh C wrote: > > On Fri, 9 Feb 2024 at 20:51, Masahiko Sawada wrote: > > > > > > I think this performance regression is not acceptable. In this > > workload, one transaction has 10k subtransactions and the logical > > decoding becomes quite slow if logical_decoding_work_mem is not big > > enough. Therefore, it's a legitimate and common approach to increase > > logical_decoding_work_mem to speedup the decoding. However, with thie > > patch, the decoding becomes slower than today. It's a bad idea in > > general to optimize an extreme case while sacrificing the normal (or > > more common) cases. > > > > Since this same function is used by pg_dump sorting TopoSort functions > also, we can just verify once if there is no performance impact with > large number of objects during dump sorting: Okay. I've run the pg_dump regression tests with --timer flag (note that pg_dump doesn't use indexed binary heap): master: [16:00:25] t/001_basic.pl ok 151 ms ( 0.00 usr 0.00 sys + 0.09 cusr 0.06 csys = 0.15 CPU) [16:00:25] t/002_pg_dump.pl .. ok10157 ms ( 0.23 usr 0.01 sys + 1.48 cusr 0.37 csys = 2.09 CPU) [16:00:36] t/003_pg_dump_with_server.pl .. ok 504 ms ( 0.00 usr 0.01 sys + 0.10 cusr 0.07 csys = 0.18 CPU) [16:00:36] t/004_pg_dump_parallel.pl . ok 1044 ms ( 0.00 usr 0.00 sys + 0.12 cusr 0.08 csys = 0.20 CPU) [16:00:37] t/005_pg_dump_filterfile.pl ... ok 2390 ms ( 0.00 usr 0.00 sys + 0.34 cusr 0.19 csys = 0.53 CPU) [16:00:40] t/010_dump_connstr.pl . ok 4813 ms ( 0.01 usr 0.00 sys + 2.13 cusr 0.45 csys = 2.59 CPU) patched: [15:59:47] t/001_basic.pl ok 150 ms ( 0.00 usr 0.00 sys + 0.08 cusr 0.07 csys = 0.15 CPU) [15:59:47] t/002_pg_dump.pl .. ok10057 ms ( 0.23 usr 0.02 sys + 1.49 cusr 0.36 csys = 2.10 CPU) [15:59:57] t/003_pg_dump_with_server.pl .. ok 509 ms ( 0.00 usr 0.00 sys + 0.09 cusr 0.08 csys = 0.17 CPU) [15:59:58] t/004_pg_dump_parallel.pl . 
ok 1048 ms ( 0.01 usr 0.00 sys + 0.11 cusr 0.11 csys = 0.23 CPU) [15:59:59] t/005_pg_dump_filterfile.pl ... ok 2398 ms ( 0.00 usr 0.00 sys + 0.34 cusr 0.20 csys = 0.54 CPU) [16:00:01] t/010_dump_connstr.pl . ok 4762 ms ( 0.01 usr 0.00 sys + 2.15 cusr 0.42 csys = 2.58 CPU) There is no noticeable difference between the two results. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Sat, Feb 24, 2024 at 1:29 AM Tomas Vondra wrote:
>
> Hi,
>
> I did a basic review and testing of this patch today. Overall I think
> the patch is in very good shape - I agree with the tradeoffs it makes,
> and I like the approach in general. I do have a couple minor comments
> about the code, and then maybe a couple thoughts about the approach.

Thank you for the review comments and tests!

> First, some comments - I'll put them here, but I also kept them in
> "review" commits, because that makes it easier to show the exact place
> in the code the comment is about.
>
> 1) binaryheap_allocate got a new "indexed" argument, but the comment is
> not updated to document it

Fixed.

> 2) I think it's preferable to use descriptive argument names for
> bh_set_node. I don't think there's a good reason to keep it short.

Agreed.

> 3) In a couple places we have code like this:
>
> if (heap->bh_indexed)
>     bh_nodeidx_delete(heap->bh_nodeidx, result);
>
> Maybe it'd be better to have the if condition in bh_nodeidx_delete, so
> that it can be called without it.

Fixed.

> 4) Could we check the "found" flag in bh_set_node, somehow? I mean, we
> either expect to find the node (update of an already tracked transaction)
> or not (when inserting it). The life cycle may be non-trivial (node
> added, updated and removed, ...), would be a useful assert I think.

Agreed.

> 5) Do we actually need the various mem_freed local variables in a couple
> places, when we expect the value to be equal to txn->size (there's even
> an assert enforcing that)?

You're right.

> 6) ReorderBufferCleanupTXN has a comment about maybe not using the same
> threshold both to enable & disable usage of the binaryheap. I agree with
> that, otherwise we could easily end up "thrashing" if we add/remove
> transactions right around the threshold. I think 90-95% for disabling
> the heap would work fine.

Agreed.
> 7) The code disabling binaryheap (based on the threshold) is copied in a
> couple places, perhaps it should be a separate function called from
> those places.

Fixed.

> 8) Similarly to (3), maybe ReorderBufferTXNMemoryUpdate should do the
> memory size check internally, to make the calls simpler.

Agreed.

> 9) The ReorderBufferChangeMemoryUpdate / ReorderBufferTXNMemoryUpdate
> split is maybe not very clear. It's not clear to me why it's divided like
> this, or why we can't simply call ReorderBufferTXNMemoryUpdate directly.

I think that now we have two use cases: updating the memory counter after freeing an individual change, and updating the memory counter after freeing all changes of the transaction (i.e., setting the counter to 0). In the former case, we need to check if the change is REORDER_BUFFER_CHANGE_INTERNAL_TUPLECID, but we don't need to pass the transaction as the change has its transaction. On the other hand, in the latter case, we don't need the change but need to pass the transaction. If we do both things in one function, the function would have two arguments: change and txn, and the callers set whichever one they know. I've updated the patch accordingly.

BTW it might be worth considering a separate patch for the updates around ReorderBufferChangeMemoryUpdate() that batch the memory counter updates, as that seems an independent change from the max-heap stuff.

> performance
> -----------
>
> I did some benchmarks, to see the behavior in simple good/bad cases (see
> the attached scripts.tgz). "large" is one large transaction inserting 1M
> rows, small is 64k single-row inserts, and subxacts is the original case
> with ~100k subxacts. Finally, subxacts-small is many transactions with
> 128 subxacts each (the main transactions are concurrent).
>
> The results are pretty good, I think:
>
> test             master   patched
> ---------------------------------
> large              2587      2459    95%
> small               956       856    89%
> subxacts           1389      1529   112%
> subxacts-small    13632     13187    97%

Thank you for doing the performance test.
I ran the same script you shared on my machine just in case and got similar results:

                 master   patched
large:             2831      2827
small:             1226      1222
subxacts:        134076      2744
subxacts-small:   23384     23127

In my case, the differences seem to be within a noise range.

> This is timing (ms) with logical_work_mem=4MB. I also tried with 64MB,
> where the subxact timing goes way down, but the overall conclusions do
> not change.
>
> I was a bit surprised I haven't seen any clear regression, but in the
> end that's a good thing, right? There's a couple results in this thread
> showing ~10% regression, but I've been unable to reproduce those.
> Perhaps the newer patch versions fix that, I guess.

Yes, the 10% regression is fixed in the v4 patch. We don't update the max-heap at all until the number of transactions reaches the threshold so
Re: Improve eviction algorithm in ReorderBuffer
Hi,

I did a basic review and testing of this patch today. Overall I think the patch is in very good shape - I agree with the tradeoffs it makes, and I like the approach in general. I do have a couple minor comments about the code, and then maybe a couple thoughts about the approach.

First, some comments - I'll put them here, but I also kept them in "review" commits, because that makes it easier to show the exact place in the code the comment is about.

1) binaryheap_allocate got a new "indexed" argument, but the comment is not updated to document it

2) I think it's preferable to use descriptive argument names for bh_set_node. I don't think there's a good reason to keep it short.

3) In a couple places we have code like this:

if (heap->bh_indexed)
    bh_nodeidx_delete(heap->bh_nodeidx, result);

Maybe it'd be better to have the if condition in bh_nodeidx_delete, so that it can be called without it.

4) Could we check the "found" flag in bh_set_node, somehow? I mean, we either expect to find the node (update of an already tracked transaction) or not (when inserting it). The life cycle may be non-trivial (node added, updated and removed, ...), would be a useful assert I think.

5) Do we actually need the various mem_freed local variables in a couple places, when we expect the value to be equal to txn->size (there's even an assert enforcing that)?

6) ReorderBufferCleanupTXN has a comment about maybe not using the same threshold both to enable & disable usage of the binaryheap. I agree with that, otherwise we could easily end up "thrashing" if we add/remove transactions right around the threshold. I think 90-95% for disabling the heap would work fine.

7) The code disabling binaryheap (based on the threshold) is copied in a couple places, perhaps it should be a separate function called from those places.

8) Similarly to (3), maybe ReorderBufferTXNMemoryUpdate should do the memory size check internally, to make the calls simpler.
9) The ReorderBufferChangeMemoryUpdate / ReorderBufferTXNMemoryUpdate split is maybe not very clear. It's not clear to me why it's divided like this, or why we can't simply call ReorderBufferTXNMemoryUpdate directly.

performance
-----------

I did some benchmarks, to see the behavior in simple good/bad cases (see the attached scripts.tgz). "large" is one large transaction inserting 1M rows, small is 64k single-row inserts, and subxacts is the original case with ~100k subxacts. Finally, subxacts-small is many transactions with 128 subxacts each (the main transactions are concurrent).

The results are pretty good, I think:

test             master   patched
---------------------------------
large              2587      2459    95%
small               956       856    89%
subxacts           1389      1529   112%
subxacts-small    13632     13187    97%

This is timing (ms) with logical_work_mem=4MB. I also tried with 64MB, where the subxact timing goes way down, but the overall conclusions do not change.

I was a bit surprised I haven't seen any clear regression, but in the end that's a good thing, right? There's a couple of results in this thread showing ~10% regression, but I've been unable to reproduce those. Perhaps the newer patch versions fix that, I guess.

Anyway, I think that at some point we'd have to accept that some cases may have a slight regression. I think that's inherent for almost any heuristic - there's always going to be some rare case that defeats it. What's important is that the case needs to be rare and/or the impact very limited. And I think that's true here.

overall design
--------------

As for the design, I agree with the approach of using a binaryheap to track transactions by size. When going over the thread history, describing the initial approach with only keeping "large" transactions above some threshold (e.g. 10%), I was really concerned that'll either lead to abrupt changes in behavior (when transactions move just around the 10%), or won't help with many common cases (with most transactions being below the limit).
I was going to suggest some sort of "binning" - keeping lists for transactions of similar size (e.g. <1kB, 1-2kB, 2-4kB, 4-8kB, ...) and evicting transactions from a list, i.e. based on approximate size. But if the indexed binary heap seems to be cheap enough, I think it's a better solution.

The one thing I'm a bit concerned about is the threshold used to start using the binary heap - these thresholds with binary decisions may easily lead to a "cliff" and robustness issues, i.e. an abrupt change in behavior with a significant runtime change (e.g. you add/remove one transaction and the code takes a much more expensive path). The value (1024) seems rather arbitrary, I wonder if there's something to justify that choice.

In any case, I agree it'd be good to have some dampening factor, to reduce the risk of thrashing because of adding/removing a single transaction to the decoding.
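To make the endorsed design concrete, here is a minimal, self-contained C sketch of the idea: a max-heap keyed on transaction size, with each transaction remembering its heap index so that a size change is an O(log n) sift rather than a rebuild, and the largest transaction always available at the root for eviction. All names (`txn_t`, `txheap_*`) are invented for illustration; this is not the PostgreSQL binaryheap code itself.

```c
#include <stddef.h>

/* A transaction tracks its own heap position for O(log n) updates. */
typedef struct txn_t
{
	size_t		size;			/* memory accounted to this transaction */
	int			heapidx;		/* position in the heap array */
} txn_t;

typedef struct txheap_t
{
	txn_t	  **nodes;			/* caller-supplied array of capacity 'space' */
	int			nnodes;
	int			space;
} txheap_t;

static void
txheap_swap(txheap_t *h, int a, int b)
{
	txn_t	   *tmp = h->nodes[a];

	h->nodes[a] = h->nodes[b];
	h->nodes[b] = tmp;
	h->nodes[a]->heapidx = a;
	h->nodes[b]->heapidx = b;
}

static void
txheap_sift_up(txheap_t *h, int i)
{
	while (i > 0 && h->nodes[i]->size > h->nodes[(i - 1) / 2]->size)
	{
		txheap_swap(h, i, (i - 1) / 2);
		i = (i - 1) / 2;
	}
}

static void
txheap_sift_down(txheap_t *h, int i)
{
	for (;;)
	{
		int			largest = i;
		int			l = 2 * i + 1;
		int			r = 2 * i + 2;

		if (l < h->nnodes && h->nodes[l]->size > h->nodes[largest]->size)
			largest = l;
		if (r < h->nnodes && h->nodes[r]->size > h->nodes[largest]->size)
			largest = r;
		if (largest == i)
			break;
		txheap_swap(h, i, largest);
		i = largest;
	}
}

void
txheap_add(txheap_t *h, txn_t *txn)
{
	h->nodes[h->nnodes] = txn;
	txn->heapidx = h->nnodes++;
	txheap_sift_up(h, txn->heapidx);
}

/* A transaction grew or shrank: restore the heap property from its index. */
void
txheap_update(txheap_t *h, txn_t *txn)
{
	txheap_sift_up(h, txn->heapidx);
	txheap_sift_down(h, txn->heapidx);
}

/* The largest transaction (the eviction victim) is always at the root. */
txn_t *
txheap_largest(txheap_t *h)
{
	return h->nnodes > 0 ? h->nodes[0] : NULL;
}
```

The key point is the `heapidx` back-pointer: without it, finding an arbitrary transaction inside the heap to update its size would cost O(n), which is what the indexed binaryheap in the patch avoids.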
Re: Improve eviction algorithm in ReorderBuffer
On Fri, 9 Feb 2024 at 20:51, Masahiko Sawada wrote: > > On Thu, Feb 8, 2024 at 6:33 PM Hayato Kuroda (Fujitsu) > wrote: > > > > Dear Sawada-san, > > > > Thanks for making v3 patchset. I have also benchmarked the case [1]. > > Below results are the average of 5th, there are almost the same result > > even when median is used for the comparison. On my env, the regression > > cannot be seen. > > > > HEAD (1e285a5) HEAD + v3 patches difference > > 10910.722 ms10714.540 msaround 1.8% > > > > Thank you for doing the performance test! > > > Also, here are mino comments for v3 set. > > > > 01. > > bh_nodeidx_entry and ReorderBufferMemTrackState is missing in typedefs.list. > > Will add them. > > > > > 02. ReorderBufferTXNSizeCompare > > Should we assert {ta, tb} are not NULL? > > Not sure we really need it as other binaryheap users don't have such checks. > > On Tue, Feb 6, 2024 at 2:45 PM Hayato Kuroda (Fujitsu) > wrote: > > > > > I've run a benchmark test that I shared before[1]. Here are results of > > > decoding a transaction that has 1M subtransaction each of which has 1 > > > INSERT: > > > > > > HEAD: > > > 1810.192 ms > > > > > > HEAD w/ patch: > > > 2001.094 ms > > > > > > I set a large enough value to logical_decoding_work_mem not to evict > > > any transactions. I can see about about 10% performance regression in > > > this case. > > > > Thanks for running. I think this workload is the worst and an extreme case > > which > > would not be occurred on the real system (Such a system should be fixed), > > so we > > can say that the regression is up to -10%. I felt it could be negligible > > but how > > do other think? > > I think this performance regression is not acceptable. In this > workload, one transaction has 10k subtransactions and the logical > decoding becomes quite slow if logical_decoding_work_mem is not big > enough. Therefore, it's a legitimate and common approach to increase > logical_decoding_work_mem to speedup the decoding. 
> However, with this patch, the decoding becomes slower than today. It's
> a bad idea in general to optimize an extreme case while sacrificing the
> normal (or more common) cases.

Since this same function is used by the pg_dump sorting TopoSort functions also, we can just verify once that there is no performance impact with a large number of objects during dump sorting:

 binaryheap *
 binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
 {
-    int sz;
     binaryheap *heap;

-    sz = offsetof(binaryheap, bh_nodes) + sizeof(bh_node_type) * capacity;
-    heap = (binaryheap *) palloc(sz);
+    heap = (binaryheap *) palloc(sizeof(binaryheap));
     heap->bh_space = capacity;
     heap->bh_compare = compare;
     heap->bh_arg = arg;
     heap->bh_size = 0;
     heap->bh_has_heap_property = true;
+    heap->bh_nodes = (bh_node_type *) palloc(sizeof(bh_node_type) * capacity);

Regards,
Vignesh
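The diff above moves bh_nodes out of the binaryheap struct into its own allocation. A hedged sketch of why that matters (plain malloc/realloc standing in for palloc/repalloc, struct fields simplified): once the node array is a separate allocation, the heap can grow by doubling just that array, which a flexible array member at the end of the struct would not allow without reallocating, and thus moving, the whole heap.

```c
#include <stdlib.h>

typedef int bh_node_type;		/* simplified element type for the sketch */

typedef struct binaryheap
{
	int			bh_space;		/* allocated length of bh_nodes */
	int			bh_size;		/* number of nodes currently stored */
	bh_node_type *bh_nodes;		/* separately allocated node array */
} binaryheap;

binaryheap *
bh_allocate(int capacity)
{
	binaryheap *heap = malloc(sizeof(binaryheap));

	heap->bh_space = capacity;
	heap->bh_size = 0;
	heap->bh_nodes = malloc(sizeof(bh_node_type) * capacity);
	return heap;
}

/*
 * Double the node array when it is full.  Cheap and pointer-stable for
 * the struct itself, because the struct and the array are separate
 * allocations.
 */
static void
bh_enlarge_node_array(binaryheap *heap)
{
	if (heap->bh_size < heap->bh_space)
		return;
	heap->bh_space *= 2;
	heap->bh_nodes = realloc(heap->bh_nodes,
							 sizeof(bh_node_type) * heap->bh_space);
}

/* Append without maintaining the heap property (as before a heapify). */
void
bh_add_unordered(binaryheap *heap, bh_node_type d)
{
	bh_enlarge_node_array(heap);
	heap->bh_nodes[heap->bh_size++] = d;
}
```

This also touches Ajin's later comment: the capacity check could equally live at the call site (`if (bh_size >= bh_space)`) so the function call is skipped in the common non-full case.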
Re: Improve eviction algorithm in ReorderBuffer
On Sat, Feb 10, 2024 at 2:23 AM Masahiko Sawada wrote: > On Fri, Feb 9, 2024 at 7:35 PM Ajin Cherian wrote: > > > > > > > > On Tue, Feb 6, 2024 at 5:06 PM Masahiko Sawada > wrote: > >> > >> > >> I've attached the new version patch set. > >> > >> Regards, > >> > >> > >> -- > >> Masahiko Sawada > >> Amazon Web Services: https://aws.amazon.com > > > > > > Thanks for the patch. I reviewed that patch and did minimal testing and > it seems to show the speed up as claimed. Some minor comments: > > Thank you for the comments! > > > patch 0001: > > > > +static void > > +bh_enlarge_node_array(binaryheap *heap) > > +{ > > + if (heap->bh_size < heap->bh_space) > > + return; > > > > why not check "if (heap->bh_size >= heap->bh_space)" outside this > function to avoid calling this function when not necessary? This check was > there in code before the patch. > > Agreed. > > > > > patch 0003: > > > > +/* > > + * The threshold of the number of transactions in the max-heap > (rb->txn_heap) > > + * to switch the state. > > + */ > > +#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024 > > > > Typo: I think you meant REORDER_ and not REORDE_ > > Fixed. > > These comments are addressed in the v4 patch set I just shared[1]. > > > These changes look good to me. I've done some tests with a few varying levels of subtransaction and I could see that the patch was at least 5% better in all of them. regards, Ajin Cherian Fujitsu Australia
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Feb 9, 2024 at 7:35 PM Ajin Cherian wrote: > > > > On Tue, Feb 6, 2024 at 5:06 PM Masahiko Sawada wrote: >> >> >> I've attached the new version patch set. >> >> Regards, >> >> >> -- >> Masahiko Sawada >> Amazon Web Services: https://aws.amazon.com > > > Thanks for the patch. I reviewed that patch and did minimal testing and it > seems to show the speed up as claimed. Some minor comments: Thank you for the comments! > patch 0001: > > +static void > +bh_enlarge_node_array(binaryheap *heap) > +{ > + if (heap->bh_size < heap->bh_space) > + return; > > why not check "if (heap->bh_size >= heap->bh_space)" outside this function to > avoid calling this function when not necessary? This check was there in code > before the patch. Agreed. > > patch 0003: > > +/* > + * The threshold of the number of transactions in the max-heap (rb->txn_heap) > + * to switch the state. > + */ > +#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024 > > Typo: I think you meant REORDER_ and not REORDE_ Fixed. These comments are addressed in the v4 patch set I just shared[1]. Regards, [1] https://www.postgresql.org/message-id/CAD21AoDhuybyryVkmVkgPY8uVrjGLYchL8EY8-rBm1hbZJpwLw%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Thu, Feb 8, 2024 at 6:33 PM Hayato Kuroda (Fujitsu) wrote:
>
> Dear Sawada-san,
>
> Thanks for making the v3 patchset. I have also benchmarked the case [1].
> The results below are the average of 5 runs; they are almost the same
> even when the median is used for the comparison. On my env, the
> regression cannot be seen.
>
> HEAD (1e285a5)   HEAD + v3 patches   difference
> 10910.722 ms     10714.540 ms        around 1.8%

Thank you for doing the performance test!

> Also, here are minor comments for the v3 set.
>
> 01.
> bh_nodeidx_entry and ReorderBufferMemTrackState are missing in typedefs.list.

Will add them.

> 02. ReorderBufferTXNSizeCompare
> Should we assert {ta, tb} are not NULL?

Not sure we really need it as other binaryheap users don't have such checks.

On Tue, Feb 6, 2024 at 2:45 PM Hayato Kuroda (Fujitsu) wrote:
> > I've run a benchmark test that I shared before[1]. Here are results of
> > decoding a transaction that has 1M subtransactions each of which has 1
> > INSERT:
> >
> > HEAD:
> > 1810.192 ms
> >
> > HEAD w/ patch:
> > 2001.094 ms
> >
> > I set a large enough value of logical_decoding_work_mem not to evict
> > any transactions. I can see about 10% performance regression in
> > this case.
>
> Thanks for running. I think this workload is the worst and an extreme case
> which would not occur on a real system (such a system should be fixed), so
> we can say that the regression is up to -10%. I felt it could be negligible,
> but what do others think?

I think this performance regression is not acceptable. In this workload, one transaction has 10k subtransactions and the logical decoding becomes quite slow if logical_decoding_work_mem is not big enough. Therefore, it's a legitimate and common approach to increase logical_decoding_work_mem to speed up the decoding. However, with this patch, the decoding becomes slower than today. It's a bad idea in general to optimize an extreme case while sacrificing the normal (or more common) cases.
Therefore, I've improved the algorithm so that we don't touch the max-heap at all if the number of transactions is small enough. I've run a benchmark test with two workloads:

workload-1, decode a single transaction with 800k tuples (normal.sql):

* without spill
HEAD: 13235.136 ms
v3 patch: 14320.082 ms
v4 patch: 13300.665 ms

* with spill
HEAD: 22970.204 ms
v3 patch: 23625.649 ms
v4 patch: 23304.366 ms

workload-2, decode one transaction with 100k subtransactions (many-subtxn.sql):

* without spill
HEAD: 345.718 ms
v3 patch: 409.686 ms
v4 patch: 353.026 ms

* with spill
HEAD: 136718.313 ms
v3 patch: 2675.539 ms
v4 patch: 2734.981 ms

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

v4-0001-Make-binaryheap-enlareable.patch Description: Binary data
v4-0002-Add-functions-to-binaryheap-to-efficiently-remove.patch Description: Binary data
v4-0003-Use-max-heap-to-evict-largest-transactions-in-Reo.patch Description: Binary data
normal.sql Description: Binary data
many-subtxn.sql Description: Binary data
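The v4 behavior described above — skip all max-heap maintenance until the transaction count crosses a threshold, and disable it again only below a lower watermark so the heap isn't thrashed around the boundary (Tomas's dampening-factor point) — can be sketched as follows. The names (`MemTracker`, `track_txn_*`) and the 1024 / 90% constants are illustrative only, not the patch's actual code.

```c
#include <stdbool.h>

#define MEM_TRACK_THRESHOLD 1024	/* illustrative enable threshold */

typedef struct MemTracker
{
	int			ntxns;			/* number of tracked transactions */
	bool		heap_enabled;	/* are we maintaining the max-heap? */
	int			heap_builds;	/* how many times the heap was (re)built */
} MemTracker;

/* Called whenever a transaction starts being tracked. */
void
track_txn_added(MemTracker *t)
{
	t->ntxns++;
	if (!t->heap_enabled && t->ntxns >= MEM_TRACK_THRESHOLD)
	{
		/* Build the heap from scratch, O(n), only on this transition. */
		t->heap_enabled = true;
		t->heap_builds++;
	}
}

/*
 * Called whenever a transaction is removed; disable below a *lower*
 * watermark (90% here) than the enable threshold, so adding/removing a
 * single transaction near the boundary cannot repeatedly rebuild the heap.
 */
void
track_txn_removed(MemTracker *t)
{
	t->ntxns--;
	if (t->heap_enabled && t->ntxns < MEM_TRACK_THRESHOLD * 9 / 10)
		t->heap_enabled = false;
}
```

Below the threshold, eviction can fall back to a plain linear scan for the largest transaction, which is cheap when there are few transactions; the heap only starts paying for itself once n is large.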
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Feb 6, 2024 at 5:06 PM Masahiko Sawada wrote:
>
> I've attached the new version patch set.
>
> Regards,
>
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com

Thanks for the patch. I reviewed the patch and did minimal testing, and it seems to show the speed-up as claimed. Some minor comments:

patch 0001:

+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+    if (heap->bh_size < heap->bh_space)
+        return;

why not check "if (heap->bh_size >= heap->bh_space)" outside this function to avoid calling this function when not necessary? This check was there in the code before the patch.

patch 0003:

+/*
+ * The threshold of the number of transactions in the max-heap (rb->txn_heap)
+ * to switch the state.
+ */
+#define REORDE_BUFFER_MEM_TRACK_THRESHOLD 1024

Typo: I think you meant REORDER_ and not REORDE_

regards,
Ajin Cherian
Fujitsu Australia
RE: Improve eviction algorithm in ReorderBuffer
Dear Sawada-san,

Thanks for making the v3 patchset. I have also benchmarked the case [1]. The results below are the average of 5 runs; they are almost the same even when the median is used for the comparison. On my env, the regression cannot be seen.

HEAD (1e285a5)   HEAD + v3 patches   difference
10910.722 ms     10714.540 ms        around 1.8%

Also, here are minor comments for the v3 set.

01.
bh_nodeidx_entry and ReorderBufferMemTrackState are missing in typedefs.list.

02. ReorderBufferTXNSizeCompare
Should we assert {ta, tb} are not NULL?

[1]: https://www.postgresql.org/message-id/CAD21AoB-7mPpKnLmBNfzfavG8AiTwEgAdVMuv%3DjzmAp9ex7eyQ%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Feb 2, 2024 at 5:16 PM Masahiko Sawada wrote: > > On Fri, Feb 2, 2024 at 1:59 PM Shubham Khanna > wrote: > > > > On Fri, Jan 26, 2024 at 2:07 PM Masahiko Sawada > > wrote: > > > > > > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila > > > wrote: > > > > > > > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada > > > > wrote: > > > > > > > > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila > > > > > wrote: > > > > > > > > > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada > > > > > > wrote: > > > > > > > > > > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > The individual transactions shouldn't cross > > > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your > > > > > > > > proposal to > > > > > > > > maintain the lists: "...splitting it into two lists: > > > > > > > > transactions > > > > > > > > consuming 5% < and 5% >= of the memory limit, and checking the > > > > > > > > 5% >= > > > > > > > > list preferably.". In the previous sentence, what did you mean > > > > > > > > by > > > > > > > > transactions consuming 5% >= of the memory limit? I got the > > > > > > > > impression > > > > > > > > that you are saying to maintain them in a separate transaction > > > > > > > > list > > > > > > > > which doesn't seems to be the case. > > > > > > > > > > > > > > I wanted to mean that there are three lists in total: the first > > > > > > > one > > > > > > > maintain the transactions consuming more than 10% of > > > > > > > logical_decoding_work_mem, > > > > > > > > > > > > > > > > > > > How can we have multiple transactions in the list consuming more > > > > > > than > > > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization > > > > > > before any xact reaches logical_decoding_work_mem? > > > > > > > > > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions > > > > > consuming more than 6.4MB are added to the list. 
So for example, it's > > > > > possible that the list has three transactions each of which are > > > > > consuming 10MB while the total memory usage in the reorderbuffer is > > > > > still 30MB (less than logical_decoding_work_mem). > > > > > > > > > > > > > Thanks for the clarification. I misunderstood the list to have > > > > transactions greater than 70.4 MB (64 + 6.4) in your example. But one > > > > thing to note is that maintaining these lists by default can also have > > > > some overhead unless the list of open transactions crosses a certain > > > > threshold. > > > > > > > > > > On further analysis, I realized that the approach discussed here might > > > not be the way to go. The idea of dividing transactions into several > > > subgroups is to divide a large number of entries into multiple > > > sub-groups so we can reduce the complexity to search for the > > > particular entry. Since we assume that there are no big differences in > > > entries' sizes within a sub-group, we can pick the entry to evict in > > > O(1). However, what we really need to avoid here is that we end up > > > increasing the number of times to evict entries because serializing an > > > entry to the disk is more costly than searching an entry on memory in > > > general. > > > > > > I think that it's no problem in a large-entries subgroup but when it > > > comes to the smallest-entries subgroup, like for entries consuming > > > less than 5% of the limit, it could end up evicting many entries. For > > > example, there would be a huge difference between serializing 1 entry > > > consuming 5% of the memory limit and serializing 5000 entries > > > consuming 0.001% of the memory limit. Even if we can select 5000 > > > entries quickly, I think the latter would be slower in total. The more > > > subgroups we create, the more the algorithm gets complex and the > > > overheads could cause. 
So I think we need to search for the largest > > > entry in order to minimize the number of evictions anyway. > > > > > > Looking for data structures and algorithms, I think binaryheap with > > > some improvements could be promising. I mentioned before why we cannot > > > use the current binaryheap[1]. The missing pieces are efficient ways > > > to remove the arbitrary entry and to update the arbitrary entry's key. > > > The current binaryheap provides binaryheap_remove_node(), which is > > > O(log n), but it requires the entry's position in the binaryheap. We > > > can know the entry's position just after binaryheap_add_unordered() > > > but it might be changed after heapify. Searching the node's position > > > is O(n). So the improvement idea is to add a hash table to the > > > binaryheap so that it can track the positions for each entry so that > > > we can remove the arbitrary entry in O(log n) and also update the > > > arbitrary entry's key in O(log n). This is known as the indexed > > > priority queue. I've attached the patch for that (0001 and 0002). > > > > > > That way, in terms of reord
RE: Improve eviction algorithm in ReorderBuffer
Dear Sawada-san, > Thank you for the review comments! > > > > > A comment for 0001: > > > > 01. > > ``` > > +static void > > +bh_enlarge_node_array(binaryheap *heap) > > +{ > > +if (heap->bh_size < heap->bh_space) > > +return; > > + > > +heap->bh_space *= 2; > > +heap->bh_nodes = repalloc(heap->bh_nodes, > > + sizeof(bh_node_type) * heap->bh_space); > > +} > > ``` > > > > I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This data > > structure public one and arbitrary codes and extensions can directly refer > > bh_nodes. But if the array is repalloc()'d, the pointer would be updated so > > that > > their referring would be a dangling pointer. > > Hmm I'm not sure this is the case that we need to really worry about, > and cannot come up with a good use case where extensions refer to > bh_nodes directly rather than binaryheap. In PostgreSQL codes, many > Nodes already have pointers and are exposed. Actually, me neither. I could not come up with the use-case - I just said the possibility. If it is not a real issue, we can ignore. > > > > > 04. > > ``` > > extern binaryheap *binaryheap_allocate(int capacity, > > binaryheap_comparator compare, > > - void *arg); > > + bool indexed, void *arg); > > ``` > > > > I felt pre-existing API should not be changed. How about adding > > binaryheap_allocate_extended() or something which can specify the `bool > indexed`? > > binaryheap_allocate() sets heap->bh_indexed to false. > > I'm really not sure it's worth inventing a > binaryheap_allocate_extended() function just for preserving API > compatibility. I think it's generally a good idea to have > xxx_extended() function to increase readability and usability, for > example, for the case where the same (kind of default) arguments are > passed in most cases and the function is called from many places. > However, we have a handful binaryheap_allocate() callers, and I > believe that it would not hurt the existing callers. 
I kept (external) extensions which uses binaryheap APIs in my mind. I thought we could avoid to raise costs for updating their codes. But I could understand the change is small, so ... up to you. > > 05. > > ``` > > +extern void binaryheap_update_up(binaryheap *heap, bh_node_type d); > > +extern void binaryheap_update_down(binaryheap *heap, bh_node_type d); > > ``` > > > > IIUC, callers must consider whether the node should be shift up/down and use > > appropriate function, right? I felt it may not be user-friendly. > > Right, I couldn't come up with a better interface. > > Another idea I've considered was that the caller provides a callback > function where it can compare the old and new keys. For example, in > reorderbuffer case, we call like: > > binaryheap_update(rb->txn_heap, PointerGetDatum(txn), > ReorderBufferTXNUpdateCompare, (void *) &old_size); > > Then in ReorderBufferTXNUpdateCompare(), > ReorderBufferTXN *txn = (ReorderBufferTXN *) a;Size old_size = *(Size *) b; > (compare txn->size to "b" ...) > > However it seems complicated... > I considered similar approach: accept old node as an argument of a compare function. But it requires further memory allocation. Do someone have better idea? > > > > > 08. > > IIUC, if more than 1024 transactions are running but they have small amount > > of > > changes, the performance may be degraded, right? Do you have a result in > sucha > > a case? > > I've run a benchmark test that I shared before[1]. Here are results of > decoding a transaction that has 1M subtransaction each of which has 1 > INSERT: > > HEAD: > 1810.192 ms > > HEAD w/ patch: > 2001.094 ms > > I set a large enough value to logical_decoding_work_mem not to evict > any transactions. I can see about about 10% performance regression in > this case. Thanks for running. I think this workload is the worst and an extreme case which would not be occurred on the real system (Such a system should be fixed), so we can say that the regression is up to -10%. 
I felt it could be negligible, but what do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Feb 2, 2024 at 1:59 PM Shubham Khanna wrote:

[quoted text snipped: this message repeats, verbatim, the earlier discussion of the sub-group approach and the indexed binaryheap proposal already shown in full above]
Re: Improve eviction algorithm in ReorderBuffer
Hi,

On Wed, Jan 31, 2024 at 5:32 PM vignesh C wrote:

[quoted text snipped: verbatim repeat of the sub-group discussion and the indexed binaryheap proposal quoted earlier in the thread]
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Jan 31, 2024 at 2:18 PM Hayato Kuroda (Fujitsu) wrote: > > Dear Sawada-san, > > I have started to read your patches. Here are my initial comments. > At least, all subscription tests were passed on my env. Thank you for the review comments! > > A comment for 0001: > > 01. > ``` > +static void > +bh_enlarge_node_array(binaryheap *heap) > +{ > +if (heap->bh_size < heap->bh_space) > +return; > + > +heap->bh_space *= 2; > +heap->bh_nodes = repalloc(heap->bh_nodes, > + sizeof(bh_node_type) * heap->bh_space); > +} > ``` > > I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This data > structure public one and arbitrary codes and extensions can directly refer > bh_nodes. But if the array is repalloc()'d, the pointer would be updated so > that > their referring would be a dangling pointer. Hmm I'm not sure this is the case that we need to really worry about, and cannot come up with a good use case where extensions refer to bh_nodes directly rather than binaryheap. In PostgreSQL codes, many Nodes already have pointers and are exposed. > I think the internal of the structure should be a private one in this case. > > Comments for 0002: > > 02. > ``` > +#include "utils/palloc.h" > ``` > > Is it really needed? I'm not sure who referrs it. Seems not, will remove. > > 03. > ``` > typedef struct bh_nodeidx_entry > { > bh_node_typekey; > charstatus; > int idx; > } bh_nodeidx_entry; > ``` > > Sorry if it is a stupid question. Can you tell me how "status" is used? > None of binaryheap and reorderbuffer components refer it. It's required by simplehash.h > > 04. > ``` > extern binaryheap *binaryheap_allocate(int capacity, > binaryheap_comparator compare, > - void *arg); > + bool indexed, void *arg); > ``` > > I felt pre-existing API should not be changed. How about adding > binaryheap_allocate_extended() or something which can specify the `bool > indexed`? > binaryheap_allocate() sets heap->bh_indexed to false. 
I'm really not sure it's worth inventing a binaryheap_allocate_extended() function just for preserving API compatibility. I think it's generally a good idea to have xxx_extended() function to increase readability and usability, for example, for the case where the same (kind of default) arguments are passed in most cases and the function is called from many places. However, we have a handful binaryheap_allocate() callers, and I believe that it would not hurt the existing callers. > > 05. > ``` > +extern void binaryheap_update_up(binaryheap *heap, bh_node_type d); > +extern void binaryheap_update_down(binaryheap *heap, bh_node_type d); > ``` > > IIUC, callers must consider whether the node should be shift up/down and use > appropriate function, right? I felt it may not be user-friendly. Right, I couldn't come up with a better interface. Another idea I've considered was that the caller provides a callback function where it can compare the old and new keys. For example, in reorderbuffer case, we call like: binaryheap_update(rb->txn_heap, PointerGetDatum(txn), ReorderBufferTXNUpdateCompare, (void *) &old_size); Then in ReorderBufferTXNUpdateCompare(), ReorderBufferTXN *txn = (ReorderBufferTXN *) a;Size old_size = *(Size *) b; (compare txn->size to "b" ...) However it seems complicated... > > Comments for 0003: > > 06. > ``` > This commit changes the eviction algorithm in ReorderBuffer to use > max-heap with transaction size,a nd use two strategies depending on > the number of transactions being decoded. > ``` > > s/a nd/ and/ > > 07. > ``` > It could be too expensive to pudate max-heap while preserving the heap > property each time the transaction's memory counter is updated, as it > could happen very frquently. So when the number of transactions being > decoded is small, we add the transactions to max-heap but don't > preserve the heap property, which is O(1). We heapify the max-heap > just before picking the largest transaction, which is O(n). 
This > strategy minimizes the overheads of updating the transaction's memory > counter. > ``` > > s/pudate/update/ Will fix them. > > 08. > IIUC, if more than 1024 transactions are running but they have small amount of > changes, the performance may be degraded, right? Do you have a result in sucha > a case? I've run a benchmark test that I shared before[1]. Here are the results of decoding a transaction that has 1M subtransactions, each of which has 1 INSERT:

HEAD: 1810.192 ms
HEAD w/ patch: 2001.094 ms

I set a large enough value of logical_decoding_work_mem not to evict any transactions. I can see about a 10% performance regression in this case.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoAfKTgrBrLq96GcTv9d6k97zaQcDM-rxfKEt4GSe0qnaQ%40mail.g
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Jan 26, 2024 at 2:07 PM Masahiko Sawada wrote: > > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila wrote: > > > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada > > wrote: > > > > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila > > > wrote: > > > > > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada > > > > wrote: > > > > > > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila > > > > > wrote: > > > > > > > > > > > > > > > > > > The individual transactions shouldn't cross > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal > > > > > > to > > > > > > maintain the lists: "...splitting it into two lists: transactions > > > > > > consuming 5% < and 5% >= of the memory limit, and checking the 5% > > > > > > >= > > > > > > list preferably.". In the previous sentence, what did you mean by > > > > > > transactions consuming 5% >= of the memory limit? I got the > > > > > > impression > > > > > > that you are saying to maintain them in a separate transaction list > > > > > > which doesn't seems to be the case. > > > > > > > > > > I wanted to mean that there are three lists in total: the first one > > > > > maintain the transactions consuming more than 10% of > > > > > logical_decoding_work_mem, > > > > > > > > > > > > > How can we have multiple transactions in the list consuming more than > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization > > > > before any xact reaches logical_decoding_work_mem? > > > > > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions > > > consuming more than 6.4MB are added to the list. So for example, it's > > > possible that the list has three transactions each of which are > > > consuming 10MB while the total memory usage in the reorderbuffer is > > > still 30MB (less than logical_decoding_work_mem). > > > > > > > Thanks for the clarification. I misunderstood the list to have > > transactions greater than 70.4 MB (64 + 6.4) in your example. 
But one > > thing to note is that maintaining these lists by default can also have > > some overhead unless the list of open transactions crosses a certain > > threshold. > > > > On further analysis, I realized that the approach discussed here might > not be the way to go. The idea of dividing transactions into several > subgroups is to divide a large number of entries into multiple > sub-groups so we can reduce the complexity to search for the > particular entry. Since we assume that there are no big differences in > entries' sizes within a sub-group, we can pick the entry to evict in > O(1). However, what we really need to avoid here is that we end up > increasing the number of times to evict entries because serializing an > entry to the disk is more costly than searching an entry on memory in > general. > > I think that it's no problem in a large-entries subgroup but when it > comes to the smallest-entries subgroup, like for entries consuming > less than 5% of the limit, it could end up evicting many entries. For > example, there would be a huge difference between serializing 1 entry > consuming 5% of the memory limit and serializing 5000 entries > consuming 0.001% of the memory limit. Even if we can select 5000 > entries quickly, I think the latter would be slower in total. The more > subgroups we create, the more the algorithm gets complex and the > overheads could cause. So I think we need to search for the largest > entry in order to minimize the number of evictions anyway. > > Looking for data structures and algorithms, I think binaryheap with > some improvements could be promising. I mentioned before why we cannot > use the current binaryheap[1]. The missing pieces are efficient ways > to remove the arbitrary entry and to update the arbitrary entry's key. > The current binaryheap provides binaryheap_remove_node(), which is > O(log n), but it requires the entry's position in the binaryheap. 
We > can know the entry's position just after binaryheap_add_unordered() > but it might be changed after heapify. Searching the node's position > is O(n). So the improvement idea is to add a hash table to the > binaryheap so that it can track the positions for each entry so that > we can remove the arbitrary entry in O(log n) and also update the > arbitrary entry's key in O(log n). This is known as the indexed > priority queue. I've attached the patch for that (0001 and 0002). > > That way, in terms of reorderbuffer, we can update and remove the > transaction's memory usage in O(log n) (in worst case and O(1) in > average) and then pick the largest transaction in O(1). Since we might > need to call ReorderBufferSerializeTXN() even in non-streaming case, > we need to maintain the binaryheap anyway. I've attached the patch for > that (0003). > > Here are test script for many sub-transactions case: > > create table test (c int); > create or replace function testfn (cnt int) returns void as $$ > begin > for i in 1..cnt loop > begin > insert into test valu
Re: Improve eviction algorithm in ReorderBuffer
On Tue, 30 Jan 2024 at 13:37, Masahiko Sawada wrote:

[quoted text snipped: verbatim repeat of the sub-group discussion and the indexed binaryheap proposal quoted earlier in the thread]
RE: Improve eviction algorithm in ReorderBuffer
Dear Sawada-san, I have started to read your patches. Here are my initial comments. At least, all subscription tests passed on my env. A comment for 0001: 01. ``` +static void +bh_enlarge_node_array(binaryheap *heap) +{ +if (heap->bh_size < heap->bh_space) +return; + +heap->bh_space *= 2; +heap->bh_nodes = repalloc(heap->bh_nodes, + sizeof(bh_node_type) * heap->bh_space); +} ``` I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This data structure is a public one, and arbitrary code and extensions can directly refer to bh_nodes. But if the array is repalloc()'d, the pointer would be updated, leaving their references as dangling pointers. I think the internals of the structure should be private in this case. Comments for 0002: 02. ``` +#include "utils/palloc.h" ``` Is it really needed? I'm not sure what refers to it. 03. ``` typedef struct bh_nodeidx_entry { bh_node_type key; char status; int idx; } bh_nodeidx_entry; ``` Sorry if it is a stupid question. Can you tell me how "status" is used? Neither the binaryheap nor the reorderbuffer components refer to it. 04. ``` extern binaryheap *binaryheap_allocate(int capacity, binaryheap_comparator compare, - void *arg); + bool indexed, void *arg); ``` I felt the pre-existing API should not be changed. How about adding binaryheap_allocate_extended() or something which can specify the `bool indexed`? binaryheap_allocate() sets heap->bh_indexed to false. 05. ``` +extern void binaryheap_update_up(binaryheap *heap, bh_node_type d); +extern void binaryheap_update_down(binaryheap *heap, bh_node_type d); ``` IIUC, callers must consider whether the node should be shifted up/down and use the appropriate function, right? I felt it may not be user-friendly. Comments for 0003: 06. ``` This commit changes the eviction algorithm in ReorderBuffer to use max-heap with transaction size,a nd use two strategies depending on the number of transactions being decoded. ``` s/a nd/ and/ 07. 
``` It could be too expensive to pudate max-heap while preserving the heap property each time the transaction's memory counter is updated, as it could happen very frquently. So when the number of transactions being decoded is small, we add the transactions to max-heap but don't preserve the heap property, which is O(1). We heapify the max-heap just before picking the largest transaction, which is O(n). This strategy minimizes the overheads of updating the transaction's memory counter. ``` s/pudate/update/ 08. IIUC, if more than 1024 transactions are running but they have a small amount of changes, the performance may be degraded, right? Do you have a result in such a case? Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Jan 26, 2024 at 5:36 PM Masahiko Sawada wrote: > > On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila wrote: > > > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada > > wrote: > > > > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila > > > wrote: > > > > > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada > > > > wrote: > > > > > > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila > > > > > wrote: > > > > > > > > > > > > > > > > > > The individual transactions shouldn't cross > > > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal > > > > > > to > > > > > > maintain the lists: "...splitting it into two lists: transactions > > > > > > consuming 5% < and 5% >= of the memory limit, and checking the 5% > > > > > > >= > > > > > > list preferably.". In the previous sentence, what did you mean by > > > > > > transactions consuming 5% >= of the memory limit? I got the > > > > > > impression > > > > > > that you are saying to maintain them in a separate transaction list > > > > > > which doesn't seems to be the case. > > > > > > > > > > I wanted to mean that there are three lists in total: the first one > > > > > maintain the transactions consuming more than 10% of > > > > > logical_decoding_work_mem, > > > > > > > > > > > > > How can we have multiple transactions in the list consuming more than > > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization > > > > before any xact reaches logical_decoding_work_mem? > > > > > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions > > > consuming more than 6.4MB are added to the list. So for example, it's > > > possible that the list has three transactions each of which are > > > consuming 10MB while the total memory usage in the reorderbuffer is > > > still 30MB (less than logical_decoding_work_mem). > > > > > > > Thanks for the clarification. I misunderstood the list to have > > transactions greater than 70.4 MB (64 + 6.4) in your example. 
But one > > thing to note is that maintaining these lists by default can also have > > some overhead unless the list of open transactions crosses a certain > > threshold. > > > > On further analysis, I realized that the approach discussed here might > not be the way to go. The idea of dividing transactions into several > subgroups is to divide a large number of entries into multiple > sub-groups so we can reduce the complexity to search for the > particular entry. Since we assume that there are no big differences in > entries' sizes within a sub-group, we can pick the entry to evict in > O(1). However, what we really need to avoid here is that we end up > increasing the number of times to evict entries because serializing an > entry to the disk is more costly than searching an entry on memory in > general. > > I think that it's no problem in a large-entries subgroup but when it > comes to the smallest-entries subgroup, like for entries consuming > less than 5% of the limit, it could end up evicting many entries. For > example, there would be a huge difference between serializing 1 entry > consuming 5% of the memory limit and serializing 5000 entries > consuming 0.001% of the memory limit. Even if we can select 5000 > entries quickly, I think the latter would be slower in total. The more > subgroups we create, the more the algorithm gets complex and the > overheads could cause. So I think we need to search for the largest > entry in order to minimize the number of evictions anyway. > > Looking for data structures and algorithms, I think binaryheap with > some improvements could be promising. I mentioned before why we cannot > use the current binaryheap[1]. The missing pieces are efficient ways > to remove the arbitrary entry and to update the arbitrary entry's key. > The current binaryheap provides binaryheap_remove_node(), which is > O(log n), but it requires the entry's position in the binaryheap. 
We > can know the entry's position just after binaryheap_add_unordered() > but it might be changed after heapify. Searching the node's position > is O(n). So the improvement idea is to add a hash table to the > binaryheap so that it can track the positions for each entry so that > we can remove the arbitrary entry in O(log n) and also update the > arbitrary entry's key in O(log n). This is known as the indexed > priority queue. I've attached the patch for that (0001 and 0002). > > That way, in terms of reorderbuffer, we can update and remove the > transaction's memory usage in O(log n) (in worst case and O(1) in > average) and then pick the largest transaction in O(1). Since we might > need to call ReorderBufferSerializeTXN() even in non-streaming case, > we need to maintain the binaryheap anyway. Since if the number of transactions being decoded is small, updating max-heap for each memory counter update could lead to some regressions, I've measured it with the case where updating memory counter happens frequently: setup script: create table test (c int); selec
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Dec 20, 2023 at 12:11 PM Amit Kapila wrote: > > On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada wrote: > > > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila wrote: > > > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada > > > wrote: > > > > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila > > > > wrote: > > > > > > > > > > > > > > > The individual transactions shouldn't cross > > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to > > > > > maintain the lists: "...splitting it into two lists: transactions > > > > > consuming 5% < and 5% >= of the memory limit, and checking the 5% >= > > > > > list preferably.". In the previous sentence, what did you mean by > > > > > transactions consuming 5% >= of the memory limit? I got the impression > > > > > that you are saying to maintain them in a separate transaction list > > > > > which doesn't seems to be the case. > > > > > > > > I wanted to mean that there are three lists in total: the first one > > > > maintain the transactions consuming more than 10% of > > > > logical_decoding_work_mem, > > > > > > > > > > How can we have multiple transactions in the list consuming more than > > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization > > > before any xact reaches logical_decoding_work_mem? > > > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions > > consuming more than 6.4MB are added to the list. So for example, it's > > possible that the list has three transactions each of which are > > consuming 10MB while the total memory usage in the reorderbuffer is > > still 30MB (less than logical_decoding_work_mem). > > > > Thanks for the clarification. I misunderstood the list to have > transactions greater than 70.4 MB (64 + 6.4) in your example. But one > thing to note is that maintaining these lists by default can also have > some overhead unless the list of open transactions crosses a certain > threshold. 
On further analysis, I realized that the approach discussed here might not be the way to go. The idea of dividing transactions into several subgroups is to divide a large number of entries into multiple sub-groups so we can reduce the complexity to search for the particular entry. Since we assume that there are no big differences in entries' sizes within a sub-group, we can pick the entry to evict in O(1). However, what we really need to avoid here is that we end up increasing the number of times we evict entries, because serializing an entry to disk is generally more costly than searching for an entry in memory. I think that it's no problem in a large-entries subgroup, but when it comes to the smallest-entries subgroup, like for entries consuming less than 5% of the limit, it could end up evicting many entries. For example, there would be a huge difference between serializing 1 entry consuming 5% of the memory limit and serializing 5000 entries consuming 0.001% of the memory limit. Even if we can select 5000 entries quickly, I think the latter would be slower in total. The more subgroups we create, the more complex the algorithm gets and the more overhead it could cause. So I think we need to search for the largest entry in order to minimize the number of evictions anyway. Looking for data structures and algorithms, I think binaryheap with some improvements could be promising. I mentioned before why we cannot use the current binaryheap[1]. The missing pieces are efficient ways to remove an arbitrary entry and to update an arbitrary entry's key. The current binaryheap provides binaryheap_remove_node(), which is O(log n), but it requires the entry's position in the binaryheap. We can know the entry's position just after binaryheap_add_unordered() but it might be changed after heapify. Searching for the node's position is O(n). 
So the improvement idea is to add a hash table to the binaryheap so that it can track the positions for each entry so that we can remove the arbitrary entry in O(log n) and also update the arbitrary entry's key in O(log n). This is known as the indexed priority queue. I've attached the patch for that (0001 and 0002). That way, in terms of reorderbuffer, we can update and remove the transaction's memory usage in O(log n) (in worst case and O(1) in average) and then pick the largest transaction in O(1). Since we might need to call ReorderBufferSerializeTXN() even in non-streaming case, we need to maintain the binaryheap anyway. I've attached the patch for that (0003). Here are test script for many sub-transactions case: create table test (c int); create or replace function testfn (cnt int) returns void as $$ begin for i in 1..cnt loop begin insert into test values (i); exception when division_by_zero then raise notice 'caught error'; return; end; end loop; end; $$ language plpgsql; select pg_create_logical_replication_slot('s', 'test_decoding'); select testfn(5); set logical_decoding_work_mem to '4MB'; select count(*) from pg
Re: Improve eviction algorithm in ReorderBuffer
2024-01 Commitfest. Hi, This patch has a CF status of "Needs Review" [1], but it seems there were some CFbot test failures the last time it was run [2]. Please have a look and post an updated version if necessary. == [1] https://commitfest.postgresql.org/46/4699/ [2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4699 Kind Regards, Peter Smith.
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Dec 20, 2023 at 6:49 AM Masahiko Sawada wrote: > > On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila wrote: > > > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada > > wrote: > > > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila > > > wrote: > > > > > > > > > > > > The individual transactions shouldn't cross > > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to > > > > maintain the lists: "...splitting it into two lists: transactions > > > > consuming 5% < and 5% >= of the memory limit, and checking the 5% >= > > > > list preferably.". In the previous sentence, what did you mean by > > > > transactions consuming 5% >= of the memory limit? I got the impression > > > > that you are saying to maintain them in a separate transaction list > > > > which doesn't seems to be the case. > > > > > > I wanted to mean that there are three lists in total: the first one > > > maintain the transactions consuming more than 10% of > > > logical_decoding_work_mem, > > > > > > > How can we have multiple transactions in the list consuming more than > > 10% of logical_decoding_work_mem? Shouldn't we perform serialization > > before any xact reaches logical_decoding_work_mem? > > Well, suppose logical_decoding_work_mem is set to 64MB, transactions > consuming more than 6.4MB are added to the list. So for example, it's > possible that the list has three transactions each of which are > consuming 10MB while the total memory usage in the reorderbuffer is > still 30MB (less than logical_decoding_work_mem). > Thanks for the clarification. I misunderstood the list to have transactions greater than 70.4 MB (64 + 6.4) in your example. But one thing to note is that maintaining these lists by default can also have some overhead unless the list of open transactions crosses a certain threshold. -- With Regards, Amit Kapila.
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Dec 19, 2023 at 8:02 PM Amit Kapila wrote: > > On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada wrote: > > > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila > > wrote: > > > > > > > > > The individual transactions shouldn't cross > > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to > > > maintain the lists: "...splitting it into two lists: transactions > > > consuming 5% < and 5% >= of the memory limit, and checking the 5% >= > > > list preferably.". In the previous sentence, what did you mean by > > > transactions consuming 5% >= of the memory limit? I got the impression > > > that you are saying to maintain them in a separate transaction list > > > which doesn't seems to be the case. > > > > I wanted to mean that there are three lists in total: the first one > > maintain the transactions consuming more than 10% of > > logical_decoding_work_mem, > > > > How can we have multiple transactions in the list consuming more than > 10% of logical_decoding_work_mem? Shouldn't we perform serialization > before any xact reaches logical_decoding_work_mem? Well, suppose logical_decoding_work_mem is set to 64MB, transactions consuming more than 6.4MB are added to the list. So for example, it's possible that the list has three transactions each of which are consuming 10MB while the total memory usage in the reorderbuffer is still 30MB (less than logical_decoding_work_mem). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Dec 19, 2023 at 8:31 AM Masahiko Sawada wrote: > > On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila wrote: > > > > > > The individual transactions shouldn't cross > > 'logical_decoding_work_mem'. I got a bit confused by your proposal to > > maintain the lists: "...splitting it into two lists: transactions > > consuming 5% < and 5% >= of the memory limit, and checking the 5% >= > > list preferably.". In the previous sentence, what did you mean by > > transactions consuming 5% >= of the memory limit? I got the impression > > that you are saying to maintain them in a separate transaction list > > which doesn't seems to be the case. > > I wanted to mean that there are three lists in total: the first one > maintain the transactions consuming more than 10% of > logical_decoding_work_mem, > How can we have multiple transactions in the list consuming more than 10% of logical_decoding_work_mem? Shouldn't we perform serialization before any xact reaches logical_decoding_work_mem? -- With Regards, Amit Kapila.
Re: Improve eviction algorithm in ReorderBuffer
On Sun, Dec 17, 2023 at 11:40 AM Amit Kapila wrote: > > On Fri, Dec 15, 2023 at 11:29 AM Masahiko Sawada > wrote: > > > > On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila > > wrote: > > > > > > On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada > > > wrote: > > > > > > > > > > IIUC, you are giving preference to multiple list ideas as compared to > > > (a) because you don't need to adjust the list each time the > > > transaction size changes, is that right? > > > > Right. > > > > > If so, I think there is a > > > cost to keep that data structure up-to-date but it can help in > > > reducing the number of times we need to serialize. > > > > Yes, there is a trade-off. > > > > What I don't want to do is to keep all transactions ordered since it's > > too costly. The proposed idea uses multiple lists to keep all > > transactions roughly ordered. The maintenance cost would be cheap > > since each list is unordered. > > > > It might be a good idea to have a threshold to switch how to pick the > > largest transaction based on the number of transactions in the > > reorderbuffer. If there are many transactions, we can use the proposed > > algorithm to find a possibly-largest transaction, otherwise use the > > current way. > > > > Yeah, that makes sense. > > > > > > > > > > > > > 1) A scenario where suppose you have one very large transaction that > > > > > is consuming ~40% of the memory and 5-6 comparatively smaller > > > > > transactions that are just above 10% of the memory limit. And now for > > > > > coming under the memory limit instead of getting 1 large transaction > > > > > evicted out, we are evicting out multiple times. > > > > > > > > Given the large transaction list will have up to 10 transactions, I > > > > think it's cheap to pick the largest transaction among them. It's O(N) > > > > but N won't be large. 
> > > > > > > > > 2) Another scenario where all the transactions are under 10% of the > > > > > memory limit but let's say there are some transactions are consuming > > > > > around 8-9% of the memory limit each but those are not very old > > > > > transactions whereas there are certain old transactions which are > > > > > fairly small and consuming under 1% of memory limit and there are many > > > > > such transactions. So how it would affect if we frequently select > > > > > many of these transactions to come under memory limit instead of > > > > > selecting a couple of large transactions which are consuming 8-9%? > > > > > > > > Yeah, probably we can do something for small transactions (i.e. small > > > > and on-memory transactions). One idea is to pick the largest > > > > transaction among them by iterating over all of them. Given that the > > > > more transactions are evicted, the less transactions the on-memory > > > > transaction list has, unlike the current algorithm, we would still > > > > win. Or we could even split it into several sub-lists in order to > > > > reduce the number of transactions to check. For example, splitting it > > > > into two lists: transactions consuming 5% < and 5% >= of the memory > > > > limit, and checking the 5% >= list preferably. > > > > > > > > > > Which memory limit are you referring to here? Is it > > > logical_decoding_work_mem? > > > > logical_decoding_work_mem. > > > > > > > > > The cost for > > > > maintaining these lists could increase, though. > > > > > > > > > > Yeah, can't we maintain a single list of all xacts that are consuming > > > equal to or greater than the memory limit? Considering that the memory > > > limit is logical_decoding_work_mem, then I think just picking one > > > transaction to serialize would be sufficient. > > > > IIUC we serialize a transaction when the sum of all transactions' > > memory usage in the reorderbuffer exceeds logical_decoding_work_mem. 
> > In what cases are multiple transactions consuming equal to or greater > > than the logical_decoding_work_mem? > > > > The individual transactions shouldn't cross > 'logical_decoding_work_mem'. I got a bit confused by your proposal to > maintain the lists: "...splitting it into two lists: transactions > consuming 5% < and 5% >= of the memory limit, and checking the 5% >= > list preferably.". In the previous sentence, what did you mean by > transactions consuming 5% >= of the memory limit? I got the impression > that you are saying to maintain them in a separate transaction list > which doesn't seems to be the case. I wanted to mean that there are three lists in total: the first one maintain the transactions consuming more than 10% of logical_decoding_work_mem, the second one maintains other transactions consuming more than or equal to 5% of logical_decoding_work_mem, and the third one maintains other transactions consuming more than 0 and less than 5% of logical_decoding_work_mem. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Dec 15, 2023 at 11:29 AM Masahiko Sawada wrote: > > On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila wrote: > > > > On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada > > wrote: > > > > > > > IIUC, you are giving preference to multiple list ideas as compared to > > (a) because you don't need to adjust the list each time the > > transaction size changes, is that right? > > Right. > > > If so, I think there is a > > cost to keep that data structure up-to-date but it can help in > > reducing the number of times we need to serialize. > > Yes, there is a trade-off. > > What I don't want to do is to keep all transactions ordered since it's > too costly. The proposed idea uses multiple lists to keep all > transactions roughly ordered. The maintenance cost would be cheap > since each list is unordered. > > It might be a good idea to have a threshold to switch how to pick the > largest transaction based on the number of transactions in the > reorderbuffer. If there are many transactions, we can use the proposed > algorithm to find a possibly-largest transaction, otherwise use the > current way. > Yeah, that makes sense. > > > > > > > > > 1) A scenario where suppose you have one very large transaction that > > > > is consuming ~40% of the memory and 5-6 comparatively smaller > > > > transactions that are just above 10% of the memory limit. And now for > > > > coming under the memory limit instead of getting 1 large transaction > > > > evicted out, we are evicting out multiple times. > > > > > > Given the large transaction list will have up to 10 transactions, I > > > think it's cheap to pick the largest transaction among them. It's O(N) > > > but N won't be large. 
> > > > > > > 2) Another scenario where all the transactions are under 10% of the > > > > memory limit but let's say there are some transactions are consuming > > > > around 8-9% of the memory limit each but those are not very old > > > > transactions whereas there are certain old transactions which are > > > > fairly small and consuming under 1% of memory limit and there are many > > > > such transactions. So how it would affect if we frequently select > > > > many of these transactions to come under memory limit instead of > > > > selecting a couple of large transactions which are consuming 8-9%? > > > > > > Yeah, probably we can do something for small transactions (i.e. small > > > and on-memory transactions). One idea is to pick the largest > > > transaction among them by iterating over all of them. Given that the > > > more transactions are evicted, the less transactions the on-memory > > > transaction list has, unlike the current algorithm, we would still > > > win. Or we could even split it into several sub-lists in order to > > > reduce the number of transactions to check. For example, splitting it > > > into two lists: transactions consuming 5% < and 5% >= of the memory > > > limit, and checking the 5% >= list preferably. > > > > > > > Which memory limit are you referring to here? Is it > > logical_decoding_work_mem? > > logical_decoding_work_mem. > > > > > > The cost for > > > maintaining these lists could increase, though. > > > > > > > Yeah, can't we maintain a single list of all xacts that are consuming > > equal to or greater than the memory limit? Considering that the memory > > limit is logical_decoding_work_mem, then I think just picking one > > transaction to serialize would be sufficient. > > IIUC we serialize a transaction when the sum of all transactions' > memory usage in the reorderbuffer exceeds logical_decoding_work_mem. > In what cases are multiple transactions consuming equal to or greater > than the logical_decoding_work_mem? 
The individual transactions shouldn't cross 'logical_decoding_work_mem'. I got a bit confused by your proposal to maintain the lists: "...splitting it into two lists: transactions consuming 5% < and 5% >= of the memory limit, and checking the 5% >= list preferably.". In the previous sentence, what did you mean by transactions consuming 5% >= of the memory limit? I got the impression that you are saying to maintain them in a separate transaction list which doesn't seem to be the case. -- With Regards, Amit Kapila.
Re: Improve eviction algorithm in ReorderBuffer
On Sat, Dec 16, 2023 at 1:36 AM Euler Taveira wrote: > > On Fri, Dec 15, 2023, at 9:11 AM, Masahiko Sawada wrote: > > > I assume you mean to add ReorderBufferTXN entries to the binaryheap > and then build it by comparing their sizes (i.e. txn->size). But > ReorderBufferTXN entries are removed and deallocated once the > transaction finished. How can we efficiently remove these entries from > binaryheap? I guess it would be O(n) to find the entry among the > unordered entries, where n is the number of transactions in the > reorderbuffer. > > > O(log n) for both functions: binaryheap_remove_first() and > binaryheap_remove_node(). Right. binaryheap_remove_first() removes the topmost entry in O(log n), but the ReorderBufferTXN being removed is not necessarily the topmost entry, since we remove the entry when the transaction completes (committed or aborted). binaryheap_remove_node() removes the entry at the given index in O(log n), but I'm not sure how we can know the indexes of each entry. I think we can remember the index of a newly added entry after calling binaryheap_add_unordered(), but once we call binaryheap_build() the index is out-of-date. So I think that in the worst case we would need to check all entries in order to remove an arbitrary entry from the binaryheap. It's O(n). I might be missing something though. > I didn't read your patch but do you really need to > free entries one by one? If not, binaryheap_free(). The patch doesn't touch on how to free entries. ReorderBufferTXN entries are freed one by one as each transaction completes (committed or aborted). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Dec 15, 2023, at 9:11 AM, Masahiko Sawada wrote: > > I assume you mean to add ReorderBufferTXN entries to the binaryheap > and then build it by comparing their sizes (i.e. txn->size). But > ReorderBufferTXN entries are removed and deallocated once the > transaction finished. How can we efficiently remove these entries from > binaryheap? I guess it would be O(n) to find the entry among the > unordered entries, where n is the number of transactions in the > reorderbuffer. O(log n) for both functions: binaryheap_remove_first() and binaryheap_remove_node(). I didn't read your patch but do you really need to free entries one by one? If not, binaryheap_free(). -- Euler Taveira EDB https://www.enterprisedb.com/
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Dec 15, 2023 at 2:59 PM Masahiko Sawada wrote: > > On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila wrote: > > > > On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada > > wrote: > > > > > > On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar wrote: > > > > > > > > On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada > > > > wrote: > > > > > > > > > > I've heard the report internally that replication lag became huge when > > > > > decoding transactions each consisting of 500k sub transactions. Note > > > > > that ReorderBufferLargestTXN() is used only in non-streaming mode. > > > > > > > > > Can't you suggest them to use streaming mode to avoid this problem or > > do you see some problem with that? > > Yeah, that's one option. But I can suggest > Sorry, it was still in the middle of editing. Yeah, that's one option. But since there is a trade-off I cannot suggest using streaming mode for every user. Also, the logical replication client (e.g. a third-party tool receiving the logical change set) might not support the streaming mode yet. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Dec 15, 2023 at 7:10 PM Alvaro Herrera wrote: > > On 2023-Dec-12, Masahiko Sawada wrote: > > > To deal with this problem, I initially thought of the idea (a) > > mentioned in the comment; use a binary heap to maintain the > > transactions sorted by the amount of changes or the size. But it seems > > not a good idea to try maintaining all transactions by its size since > > the size of each transaction could be changed frequently. > > Hmm, maybe you can just use binaryheap_add_unordered and just let the > sizes change, and do binaryheap_build() at the point where the eviction > is needed. I assume you mean to add ReorderBufferTXN entries to the binaryheap and then build it by comparing their sizes (i.e. txn->size). But ReorderBufferTXN entries are removed and deallocated once the transaction finished. How can we efficiently remove these entries from binaryheap? I guess it would be O(n) to find the entry among the unordered entries, where n is the number of transactions in the reorderbuffer. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Re: Improve eviction algorithm in ReorderBuffer
On 2023-Dec-12, Masahiko Sawada wrote: > To deal with this problem, I initially thought of the idea (a) > mentioned in the comment; use a binary heap to maintain the > transactions sorted by the amount of changes or the size. But it seems > not a good idea to try maintaining all transactions by its size since > the size of each transaction could be changed frequently. Hmm, maybe you can just use binaryheap_add_unordered and just let the sizes change, and do binaryheap_build() at the point where the eviction is needed. -- Álvaro Herrera, Breisgau, Deutschland — https://www.EnterpriseDB.com/ "No necesitamos banderas No reconocemos fronteras" (Jorge González)
Re: Improve eviction algorithm in ReorderBuffer
On Fri, Dec 15, 2023 at 12:37 PM Amit Kapila wrote: > > On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada wrote: > > > > On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar wrote: > > > > > > On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada > > > wrote: > > > > > > > > I've heard the report internally that replication lag became huge when > > > > decoding transactions each consisting of 500k sub transactions. Note > > > > that ReorderBufferLargetstTXN() is used only in non-streaming mode. > > > > > > Can't you suggest them to use streaming mode to avoid this problem or > do you see some problem with that? Yeah, that's one option. But I can suggest > > > > > Here is a test script for a many subtransactions scenario. In my > > > > environment, the logical decoding took over 2min to decode one top > > > > transaction having 100k subtransctions. > > > > > > > > - > > > > create table test (c int); > > > > create or replace function testfn (cnt int) returns void as $$ > > > > begin > > > > for i in 1..cnt loop > > > > begin > > > > insert into test values (i); > > > > exception when division_by_zero then > > > > raise notice 'caught error'; > > > > return; > > > > end; > > > > end loop; > > > > end; > > > > $$ > > > > language plpgsql; > > > > select testfn(10) > > > > set logical_decoding_work_mem to '4MB'; > > > > select count(*) from pg_logical_slot_peek_changes('s', null, null) > > > > > > > > > > > > To deal with this problem, I initially thought of the idea (a) > > > > mentioned in the comment; use a binary heap to maintain the > > > > transactions sorted by the amount of changes or the size. But it seems > > > > not a good idea to try maintaining all transactions by its size since > > > > the size of each transaction could be changed frequently. 
> > > > > > > > The attached patch uses a different approach that consists of three > > > > strategies; (1) maintain the list of transactions whose size is larger > > > > than 10% of logical_decoding_work_mem, and preferentially evict a > > > > transaction from this list. > > > > > > IIUC, you are giving preference to multiple list ideas as compared to > (a) because you don't need to adjust the list each time the > transaction size changes, is that right? Right. > If so, I think there is a > cost to keep that data structure up-to-date but it can help in > reducing the number of times we need to serialize. Yes, there is a trade-off. What I don't want to do is to keep all transactions ordered since it's too costly. The proposed idea uses multiple lists to keep all transactions roughly ordered. The maintenance cost would be cheap since each list is unordered. It might be a good idea to have a threshold to switch how to pick the largest transaction based on the number of transactions in the reorderbuffer. If there are many transactions, we can use the proposed algorithm to find a possibly-largest transaction, otherwise use the current way. > > If the list is empty, all transactions are > > > > small enough, (2) so we evict the oldest top-level transaction from > > > > rb->toplevel_by_lsn list. Evicting older transactions would help in > > > > freeing memory blocks in GenerationContext. Finally, if this is also > > > > empty, (3) we evict a transaction that size is > 0. Here, we need to > > > > note the fact that even if a transaction is evicted the > > > > ReorderBufferTXN entry is not removed from rb->by_txn but its size is > > > > 0. In the worst case where all (quite a few) transactions are smaller > > > > than 10% of the memory limit, we might end up checking many > > > > transactions to find non-zero size transaction entries to evict. So > > > > the patch adds a new list to maintain all transactions that have at > > > > least one change in memory. 
> > > > > > > > Summarizing the algorithm I've implemented in the patch, > > > > > > > > 1. pick a transaction from the list of large transactions (larger than > > > > 10% of memory limit). > > > > 2. pick a transaction from the top-level transaction list in LSN order. > > > > 3. pick a transaction from the list of transactions that have at least > > > > one change in memory. > > > > > > > > With the patch, the above test case completed within 3 seconds in my > > > > environment. > > > > > > Thanks for working on this, I think it would be good to test other > > > scenarios as well where this might have some negative impact and see > > > where we stand. > > > > Agreed. > > > > > 1) A scenario where suppose you have one very large transaction that > > > is consuming ~40% of the memory and 5-6 comparatively smaller > > > transactions that are just above 10% of the memory limit. And now for > > > coming under the memory limit instead of getting 1 large transaction > > > evicted out, we are evicting out multiple times. > > > > Given the large transaction list will have up to 10 transactions, I > > think it's cheap to pick the largest transaction among them. It's O(N) > > but N won't be large.
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada wrote: > > On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar wrote: > > > > On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada > > wrote: > > > > > > I've heard the report internally that replication lag became huge when > > > decoding transactions each consisting of 500k sub transactions. Note > > > that ReorderBufferLargetstTXN() is used only in non-streaming mode. > > > Can't you suggest them to use streaming mode to avoid this problem or do you see some problem with that? > > > Here is a test script for a many subtransactions scenario. In my > > > environment, the logical decoding took over 2min to decode one top > > > transaction having 100k subtransctions. > > > > > > - > > > create table test (c int); > > > create or replace function testfn (cnt int) returns void as $$ > > > begin > > > for i in 1..cnt loop > > > begin > > > insert into test values (i); > > > exception when division_by_zero then > > > raise notice 'caught error'; > > > return; > > > end; > > > end loop; > > > end; > > > $$ > > > language plpgsql; > > > select testfn(10) > > > set logical_decoding_work_mem to '4MB'; > > > select count(*) from pg_logical_slot_peek_changes('s', null, null) > > > > > > > > > To deal with this problem, I initially thought of the idea (a) > > > mentioned in the comment; use a binary heap to maintain the > > > transactions sorted by the amount of changes or the size. But it seems > > > not a good idea to try maintaining all transactions by its size since > > > the size of each transaction could be changed frequently. > > > > > > The attached patch uses a different approach that consists of three > > > strategies; (1) maintain the list of transactions whose size is larger > > > than 10% of logical_decoding_work_mem, and preferentially evict a > > > transaction from this list. 
> > > IIUC, you are giving preference to multiple list ideas as compared to (a) because you don't need to adjust the list each time the transaction size changes, is that right? If so, I think there is a cost to keep that data structure up-to-date but it can help in reducing the number of times we need to serialize. If the list is empty, all transactions are > > > small enough, (2) so we evict the oldest top-level transaction from > > > rb->toplevel_by_lsn list. Evicting older transactions would help in > > > freeing memory blocks in GenerationContext. Finally, if this is also > > > empty, (3) we evict a transaction that size is > 0. Here, we need to > > > note the fact that even if a transaction is evicted the > > > ReorderBufferTXN entry is not removed from rb->by_txn but its size is > > > 0. In the worst case where all (quite a few) transactions are smaller > > > than 10% of the memory limit, we might end up checking many > > > transactions to find non-zero size transaction entries to evict. So > > > the patch adds a new list to maintain all transactions that have at > > > least one change in memory. > > > > > > Summarizing the algorithm I've implemented in the patch, > > > > > > 1. pick a transaction from the list of large transactions (larger than > > > 10% of memory limit). > > > 2. pick a transaction from the top-level transaction list in LSN order. > > > 3. pick a transaction from the list of transactions that have at least > > > one change in memory. > > > > > > With the patch, the above test case completed within 3 seconds in my > > > environment. > > > > Thanks for working on this, I think it would be good to test other > > scenarios as well where this might have some negative impact and see > > where we stand. > > Agreed. > > > 1) A scenario where suppose you have one very large transaction that > > is consuming ~40% of the memory and 5-6 comparatively smaller > > transactions that are just above 10% of the memory limit. 
And now for > > coming under the memory limit instead of getting 1 large transaction > > evicted out, we are evicting out multiple times. > > Given the large transaction list will have up to 10 transactions, I > think it's cheap to pick the largest transaction among them. It's O(N) > but N won't be large. > > > 2) Another scenario where all the transactions are under 10% of the > > memory limit but let's say there are some transactions are consuming > > around 8-9% of the memory limit each but those are not very old > > transactions whereas there are certain old transactions which are > > fairly small and consuming under 1% of memory limit and there are many > > such transactions. So how it would affect if we frequently select > > many of these transactions to come under memory limit instead of > > selecting a couple of large transactions which are consuming 8-9%? > > Yeah, probably we can do something for small transactions (i.e. small > and on-memory transactions). One idea is to pick the largest > transaction among them by iterating over all of them. Given that the > more transactions are evicted, the less transactions the on-memory > transaction list has, unlike the current algorithm, we would still > win.
Re: Improve eviction algorithm in ReorderBuffer
On Wed, Dec 13, 2023 at 6:01 AM Masahiko Sawada wrote: > > > Thanks for working on this, I think it would be good to test other > > scenarios as well where this might have some negative impact and see > > where we stand. > > Agreed. > > > 1) A scenario where suppose you have one very large transaction that > > is consuming ~40% of the memory and 5-6 comparatively smaller > > transactions that are just above 10% of the memory limit. And now for > > coming under the memory limit instead of getting 1 large transaction > > evicted out, we are evicting out multiple times. > > Given the large transaction list will have up to 10 transactions, I > think it's cheap to pick the largest transaction among them. It's O(N) > but N won't be large. Yeah, that makes sense. > > 2) Another scenario where all the transactions are under 10% of the > > memory limit but let's say there are some transactions are consuming > > around 8-9% of the memory limit each but those are not very old > > transactions whereas there are certain old transactions which are > > fairly small and consuming under 1% of memory limit and there are many > > such transactions. So how it would affect if we frequently select > > many of these transactions to come under memory limit instead of > > selecting a couple of large transactions which are consuming 8-9%? > > Yeah, probably we can do something for small transactions (i.e. small > and on-memory transactions). One idea is to pick the largest > transaction among them by iterating over all of them. Given that the > more transactions are evicted, the less transactions the on-memory > transaction list has, unlike the current algorithm, we would still > win. Or we could even split it into several sub-lists in order to > reduce the number of transactions to check. For example, splitting it > into two lists: transactions consuming 5% < and 5% >= of the memory > limit, and checking the 5% >= list preferably. 
The cost for > maintaining these lists could increase, though. > > Do you have any ideas? Yeah, something like what you mention might be good: we maintain 3 lists for large, medium, and small transactions. In the large transaction list, suppose we allow transactions that consume more than 10% of the memory limit; there could be at most 10 such transactions, so we can do a sequential search and spill the largest of them all. In the medium list, suppose we keep transactions ranging from e.g. 3-10%; then it's fine to pick from the head, because the size differences between the largest and smallest transactions in this list are not very significant. The remaining transactions go into the small transaction list, and from the small transaction list we can choose to spill multiple transactions at a time. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar wrote: > > On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada wrote: > > > > Hi all, > > > > As the comment of ReorderBufferLargestTXN() says, it's very slow with > > many subtransactions: > > > > /* > > * Find the largest transaction (toplevel or subxact) to evict (spill to > > disk). > > * > > * XXX With many subtransactions this might be quite slow, because we'll > > have > > * to walk through all of them. There are some options how we could improve > > * that: (a) maintain some secondary structure with transactions sorted by > > * amount of changes, (b) not looking for the entirely largest transaction, > > * but e.g. for transaction using at least some fraction of the memory > > limit, > > * and (c) evicting multiple transactions at once, e.g. to free a given > > portion > > * of the memory limit (e.g. 50%). > > */ > > > > This is because the reorderbuffer has transaction entries for each > > top-level and sub transaction, and ReorderBufferLargestTXN() walks > > through all transaction entries to pick the transaction to evict. > > I've heard the report internally that replication lag became huge when > > decoding transactions each consisting of 500k sub transactions. Note > > that ReorderBufferLargetstTXN() is used only in non-streaming mode. > > > > Here is a test script for a many subtransactions scenario. In my > > environment, the logical decoding took over 2min to decode one top > > transaction having 100k subtransctions. 
> > > > - > > create table test (c int); > > create or replace function testfn (cnt int) returns void as $$ > > begin > > for i in 1..cnt loop > > begin > > insert into test values (i); > > exception when division_by_zero then > > raise notice 'caught error'; > > return; > > end; > > end loop; > > end; > > $$ > > language plpgsql; > > select testfn(10) > > set logical_decoding_work_mem to '4MB'; > > select count(*) from pg_logical_slot_peek_changes('s', null, null) > > > > > > To deal with this problem, I initially thought of the idea (a) > > mentioned in the comment; use a binary heap to maintain the > > transactions sorted by the amount of changes or the size. But it seems > > not a good idea to try maintaining all transactions by its size since > > the size of each transaction could be changed frequently. > > > > The attached patch uses a different approach that consists of three > > strategies; (1) maintain the list of transactions whose size is larger > > than 10% of logical_decoding_work_mem, and preferentially evict a > > transaction from this list. If the list is empty, all transactions are > > small enough, (2) so we evict the oldest top-level transaction from > > rb->toplevel_by_lsn list. Evicting older transactions would help in > > freeing memory blocks in GenerationContext. Finally, if this is also > > empty, (3) we evict a transaction that size is > 0. Here, we need to > > note the fact that even if a transaction is evicted the > > ReorderBufferTXN entry is not removed from rb->by_txn but its size is > > 0. In the worst case where all (quite a few) transactions are smaller > > than 10% of the memory limit, we might end up checking many > > transactions to find non-zero size transaction entries to evict. So > > the patch adds a new list to maintain all transactions that have at > > least one change in memory. > > > > Summarizing the algorithm I've implemented in the patch, > > > > 1. 
pick a transaction from the list of large transactions (larger than > > 10% of memory limit). > > 2. pick a transaction from the top-level transaction list in LSN order. > > 3. pick a transaction from the list of transactions that have at least > > one change in memory. > > > > With the patch, the above test case completed within 3 seconds in my > > environment. > > Thanks for working on this, I think it would be good to test other > scenarios as well where this might have some negative impact and see > where we stand. Agreed. > 1) A scenario where suppose you have one very large transaction that > is consuming ~40% of the memory and 5-6 comparatively smaller > transactions that are just above 10% of the memory limit. And now for > coming under the memory limit instead of getting 1 large transaction > evicted out, we are evicting out multiple times. Given the large transaction list will have up to 10 transactions, I think it's cheap to pick the largest transaction among them. It's O(N) but N won't be large. > 2) Another scenario where all the transactions are under 10% of the > memory limit but let's say there are some transactions are consuming > around 8-9% of the memory limit each but those are not very old > transactions whereas there are certain old transactions which are > fairly small and consuming under 1% of memory limit and there are many > such transactions. So how it would affect if we frequently select > many of these transactions to come under memory limit instead of > selecting a couple of large transactions which are consuming 8-9%?
Re: Improve eviction algorithm in ReorderBuffer
On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada wrote: > > Hi all, > > As the comment of ReorderBufferLargestTXN() says, it's very slow with > many subtransactions: >
> /*
>  * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
>  *
>  * XXX With many subtransactions this might be quite slow, because we'll have
>  * to walk through all of them. There are some options how we could improve
>  * that: (a) maintain some secondary structure with transactions sorted by
>  * amount of changes, (b) not looking for the entirely largest transaction,
>  * but e.g. for transaction using at least some fraction of the memory limit,
>  * and (c) evicting multiple transactions at once, e.g. to free a given portion
>  * of the memory limit (e.g. 50%).
>  */
>
> This is because the reorderbuffer has transaction entries for each > top-level and sub transaction, and ReorderBufferLargestTXN() walks > through all transaction entries to pick the transaction to evict. > I've heard the report internally that replication lag became huge when > decoding transactions each consisting of 500k sub transactions. Note > that ReorderBufferLargestTXN() is used only in non-streaming mode. > > Here is a test script for a many subtransactions scenario. In my > environment, the logical decoding took over 2min to decode one top > transaction having 100k subtransactions. >
> -
> create table test (c int);
> create or replace function testfn (cnt int) returns void as $$
> begin
>   for i in 1..cnt loop
>     begin
>       insert into test values (i);
>     exception when division_by_zero then
>       raise notice 'caught error';
>       return;
>     end;
>   end loop;
> end;
> $$ language plpgsql;
> select testfn(10)
> set logical_decoding_work_mem to '4MB';
> select count(*) from pg_logical_slot_peek_changes('s', null, null)
>
> To deal with this problem, I initially thought of the idea (a) > mentioned in the comment; use a binary heap to maintain the > transactions sorted by the amount of changes or the size.
But it seems > not a good idea to try maintaining all transactions by its size since > the size of each transaction could be changed frequently. > > The attached patch uses a different approach that consists of three > strategies; (1) maintain the list of transactions whose size is larger > than 10% of logical_decoding_work_mem, and preferentially evict a > transaction from this list. If the list is empty, all transactions are > small enough, (2) so we evict the oldest top-level transaction from > rb->toplevel_by_lsn list. Evicting older transactions would help in > freeing memory blocks in GenerationContext. Finally, if this is also > empty, (3) we evict a transaction that size is > 0. Here, we need to > note the fact that even if a transaction is evicted the > ReorderBufferTXN entry is not removed from rb->by_txn but its size is > 0. In the worst case where all (quite a few) transactions are smaller > than 10% of the memory limit, we might end up checking many > transactions to find non-zero size transaction entries to evict. So > the patch adds a new list to maintain all transactions that have at > least one change in memory. > > Summarizing the algorithm I've implemented in the patch, > > 1. pick a transaction from the list of large transactions (larger than > 10% of memory limit). > 2. pick a transaction from the top-level transaction list in LSN order. > 3. pick a transaction from the list of transactions that have at least > one change in memory. > > With the patch, the above test case completed within 3 seconds in my > environment. Thanks for working on this, I think it would be good to test other scenarios as well where this might have some negative impact and see where we stand. I mean 1) A scenario where suppose you have one very large transaction that is consuming ~40% of the memory and 5-6 comparatively smaller transactions that are just above 10% of the memory limit. 
And now, for coming under the memory limit, instead of getting 1 large transaction evicted, we are evicting multiple times. 2) Another scenario where all the transactions are under 10% of the memory limit, but let's say some transactions are consuming around 8-9% of the memory limit each; those are not very old transactions, whereas there are certain old transactions which are fairly small, consuming under 1% of the memory limit, and there are many such transactions. So how would it affect things if we frequently select many of these transactions to come under the memory limit, instead of selecting a couple of large transactions which are consuming 8-9%? > > As a side note, the idea (c) mentioned in the comment, evicting > multiple transactions at once to free a given portion of the memory, > would also help in avoiding going back and forth across the memory > threshold. It's also worth considering. Yes, I think it is worth considering. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com