On Tue, Dec 12, 2023 at 1:33 PM Dilip Kumar <dilipbal...@gmail.com> wrote:
>
> On Tue, Dec 12, 2023 at 9:01 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > Hi all,
> >
> > As the comment of ReorderBufferLargestTXN() says, it's very slow with
> > many subtransactions:
> >
> > /*
> >  * Find the largest transaction (toplevel or subxact) to evict (spill to disk).
> >  *
> >  * XXX With many subtransactions this might be quite slow, because we'll have
> >  * to walk through all of them. There are some options how we could improve
> >  * that: (a) maintain some secondary structure with transactions sorted by
> >  * amount of changes, (b) not looking for the entirely largest transaction,
> >  * but e.g. for transaction using at least some fraction of the memory limit,
> >  * and (c) evicting multiple transactions at once, e.g. to free a given portion
> >  * of the memory limit (e.g. 50%).
> >  */
> >
> > This is because the reorderbuffer has a transaction entry for each
> > top-level and sub transaction, and ReorderBufferLargestTXN() walks
> > through all of those entries to pick the transaction to evict. I've
> > heard a report internally that replication lag became huge when
> > decoding transactions each consisting of 500k subtransactions. Note
> > that ReorderBufferLargestTXN() is used only in non-streaming mode.
> >
> > Here is a test script for a many-subtransactions scenario. In my
> > environment, logical decoding took over 2 min to decode one top
> > transaction having 100k subtransactions.
> >
> > -----
> > -- a logical replication slot 's' is assumed to exist already
> > create table test (c int);
> > create or replace function testfn (cnt int) returns void as $$
> > begin
> >   for i in 1..cnt loop
> >     begin
> >       insert into test values (i);
> >     exception when division_by_zero then
> >       raise notice 'caught error';
> >       return;
> >     end;
> >   end loop;
> > end;
> > $$
> > language plpgsql;
> > select testfn(100000);
> > set logical_decoding_work_mem to '4MB';
> > select count(*) from pg_logical_slot_peek_changes('s', null, null);
> > -----
> >
> > To deal with this problem, I initially thought of idea (a) mentioned
> > in the comment: use a binary heap to maintain the transactions sorted
> > by the amount of changes or by size. But it seems not a good idea to
> > try to keep all transactions ordered by size, since the size of each
> > transaction can change frequently.
> >
> > The attached patch uses a different approach that consists of three
> > strategies: (1) maintain a list of transactions whose size is larger
> > than 10% of logical_decoding_work_mem, and preferentially evict a
> > transaction from this list. If the list is empty, all transactions
> > are small enough, (2) so we evict the oldest top-level transaction
> > from the rb->toplevel_by_lsn list. Evicting older transactions would
> > help in freeing memory blocks in the GenerationContext. Finally, if
> > this is also empty, (3) we evict a transaction whose size is > 0.
> > Here, we need to note that even after a transaction is evicted, its
> > ReorderBufferTXN entry is not removed from rb->by_txn; its size just
> > becomes 0. In the worst case, where all (quite a few) transactions
> > are smaller than 10% of the memory limit, we might end up checking
> > many transactions to find entries with a non-zero size to evict. So
> > the patch adds a new list to maintain all transactions that have at
> > least one change in memory.
> >
> > Summarizing the algorithm I've implemented in the patch,
> >
> > 1. pick a transaction from the list of large transactions (larger
> >    than 10% of the memory limit).
> > 2. pick a transaction from the top-level transaction list in LSN order.
> > 3. pick a transaction from the list of transactions that have at
> >    least one change in memory.
> >
> > With the patch, the above test case completed within 3 seconds in my
> > environment.
>
> Thanks for working on this, I think it would be good to test other
> scenarios as well where this might have some negative impact and see
> where we stand.

Agreed.

> 1) A scenario where you have one very large transaction that is
> consuming ~40% of the memory and 5-6 comparatively smaller
> transactions that are just above 10% of the memory limit. Now, to
> come under the memory limit, instead of getting the one large
> transaction evicted, we end up evicting multiple times.

Given that the large transaction list will have up to 10 transactions,
I think it's cheap to pick the largest transaction among them. It's
O(N), but N won't be large.
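To illustrate, a rough sketch of that scan as it would sit in
reorderbuffer.c (the names rb->large_txns and txn->large_node are
placeholders I'm using for the new list and its dlist member, not the
patch's actual identifiers):

/*
 * Rough sketch: pick the largest transaction from the list of
 * transactions exceeding 10% of logical_decoding_work_mem.  By
 * construction the list can hold at most about 10 entries, so the
 * linear scan stays cheap.
 */
static ReorderBufferTXN *
ReorderBufferLargestLargeTXN(ReorderBuffer *rb)
{
    ReorderBufferTXN *largest = NULL;
    dlist_iter  iter;

    dlist_foreach(iter, &rb->large_txns)        /* placeholder list */
    {
        ReorderBufferTXN *txn =
            dlist_container(ReorderBufferTXN, large_node, iter.cur);

        if (largest == NULL || txn->size > largest->size)
            largest = txn;
    }

    return largest;
}

Since the list is bounded, this scan is effectively constant cost,
unlike walking every entry in rb->by_txn.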
> 2) Another scenario where all the transactions are under 10% of the
> memory limit, but let's say some transactions are consuming around
> 8-9% of the memory limit each and those are not very old
> transactions, whereas there are certain old transactions which are
> fairly small, consuming under 1% of the memory limit, and there are
> many such transactions. So how would it affect things if we
> frequently select many of these small transactions to come under the
> memory limit, instead of selecting a couple of large transactions
> which are consuming 8-9%?

Yeah, probably we can do something for small transactions (i.e. small
and on-memory transactions). One idea is to pick the largest
transaction among them by iterating over all of them. Given that the
on-memory transaction list shrinks as transactions are evicted (unlike
the structure the current algorithm walks), we would still win. Or we
could even split it into several sub-lists in order to reduce the
number of transactions to check. For example, we could split it into
two lists: transactions consuming less than 5% of the memory limit and
those consuming 5% or more, and check the >= 5% list first.
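As a rough sketch of what maintaining such sub-lists would involve
(again, the names medium_txns, small_txns, mem_node, and in_medium_list
are placeholders for illustration, not actual patch identifiers), every
size change would have to re-bucket the transaction:

/*
 * Rough sketch: move a transaction between the two sub-lists whenever
 * its size crosses 5% of logical_decoding_work_mem.  Assumes the
 * transaction is already linked into one of the lists via the
 * (hypothetical) mem_node member, with in_medium_list tracking which.
 */
static void
ReorderBufferRebucketTXN(ReorderBuffer *rb, ReorderBufferTXN *txn)
{
    Size    threshold = (Size) logical_decoding_work_mem * 1024 / 20;  /* 5% */
    bool    is_medium = (txn->size >= threshold);

    if (is_medium == txn->in_medium_list)
        return;                 /* still in the right bucket */

    dlist_delete(&txn->mem_node);
    dlist_push_tail(is_medium ? &rb->medium_txns : &rb->small_txns,
                    &txn->mem_node);
    txn->in_medium_list = is_medium;
}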
The cost of maintaining these lists could increase, though. Do you
have any ideas?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com