> On 8 May 2020, at 21:36, Andrey M. Borodin <x4...@yandex-team.ru> wrote:
> 
> *** The problem ***
> I'm investigating some cases of reduced database performance due to 
> MultiXactOffsetLock contention (80% MultiXactOffsetLock, 20% IO DataFileRead).
> The problem manifested itself during index repack and constraint validation, 
> both of which are effectively full table scans.
> The database workload contains a lot of select for share/select for update 
> queries. I've tried to construct a synthetic workload generator but could not 
> achieve a similar lock configuration: I see a lot of different locks in wait 
> events, in particular many more MultiXactMemberLocks. Still, from my 
> experiments with the synthetic workload, contention on MultiXactOffsetLock 
> can be reduced by increasing NUM_MXACTOFFSET_BUFFERS (default 8) to bigger numbers.
> 
> *** Question 1 ***
> Is it safe to increase the number of buffers of the MultiXact (or all) SLRUs, 
> recompile, and run the database as usual?
> I cannot experiment much with production, but I'm fairly sure that bigger 
> buffers will solve the problem.
> 
> *** Question 2 ***
> Perhaps we could add GUCs for the SLRU sizes? Are there any reasons not to 
> make them configurable? I think multixacts, clog, subtransactions and others 
> would benefit from bigger buffers. But, admittedly, too many knobs can be confusing.
> 
> *** Question 3 ***
> The MultiXact offset lock is always taken as an exclusive lock, which makes 
> the MultiXact offset subsystem effectively single-threaded. If someone has a 
> good idea how to make it more concurrency-friendly, I'm willing to put some 
> effort into this.
> Probably I could just add an LWLock per offset buffer page. Is this worth 
> doing, or are there hidden caveats and difficulties?

I've created a benchmark[0] imitating MultiXact pressure on my laptop: 7 clients 
concurrently run "select * from table where primary_key = ANY ($1) for share", 
where $1 is an array of identifiers, so that each tuple in the table ends up 
locked by a different set of XIDs. During this benchmark I observe contention 
on MultiXactOffsetControlLock in pg_stat_activity.
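
The listing below can be reproduced with roughly the following psql invocation 
(the backend_type filter is an assumption on my part; the column list matches 
the output):

  select pid, wait_event, wait_event_type, state, query
    from pg_stat_activity
   where backend_type = 'client backend'
  \watch 1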

                                     Friday, 8 May 2020 15:08:37 (every 1s)

  pid  |         wait_event         | wait_event_type | state  |                        query
-------+----------------------------+-----------------+--------+-----------------------------------------------------
 41344 | ClientRead                 | Client          | idle   | insert into t1 select generate_series(1,1000000,1)
 41375 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41377 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41378 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41379 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41381 |                            |                 | active | select * from t1 where i = ANY ($1) for share
 41383 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
 41385 | MultiXactOffsetControlLock | LWLock          | active | select * from t1 where i = ANY ($1) for share
(8 rows)

Finally, the benchmark measures the time it takes to execute the select for update 42 times.
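
A minimal sketch of the workload (the real generator is in [0]; the literal 
array here stands in for the $1 parameter and the table definition is 
simplified, the statements themselves are the ones visible in the listing above):

  -- setup: a table whose primary key is the column i
  create table t1 (i int primary key);
  insert into t1 select generate_series(1,1000000,1);

  -- each of the 7 clients repeatedly locks batches of rows in overlapping
  -- transactions, so every tuple ends up shared-locked by several XIDs
  begin;
  select * from t1 where i = ANY ('{1,2,3,4,5,6,7,8,9,10}') for share;
  commit;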

I went ahead and created 3 patches:
1. Configurable SLRU buffer sizes for MultiXactOffsets and MultiXactMembers
2. Reduce the locking level to shared on read of MultiXactId members
3. Configurable cache size

I've found out that:
1. When the MultiXact working set does not fit into the buffers, benchmark 
times grow very high. Yet very large buffers slow the benchmark down too. For 
this benchmark the optimal SLRU size is 32 pages for offsets and 64 pages for 
members (the defaults are 8 and 16 respectively).
2. The lock optimisation increases performance by 5% at the default SLRU sizes. 
The benchmark does not explicitly read MultiXactId members, but when it 
replaces one multixact with another it has to read the previous member set. I 
understand that one can construct a benchmark demonstrating the dominance of 
any algorithm, and 5% on a synthetic workload is not a very big number. But it 
simply makes sense to take a shared lock for reading.
3. Changing the cache size does not affect the benchmark at all. This is 
somewhat expected: the benchmark is designed to defeat the cache; otherwise 
OffsetControlLock would not be stressed.

For our workload, I think we will just increase the SLRU sizes. But the 
patchset may be useful for tuning and as a performance optimisation of 
MultiXact.
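
For illustration, with patch 0001 applied the tuning from point 1 above would 
look roughly like this (the GUC names below are placeholders, the actual names 
are in the patch):

  -- placeholder GUC names; see v1-0001 for the real ones
  ALTER SYSTEM SET multixact_offsets_slru_buffers = 32;  -- pages, default 8
  ALTER SYSTEM SET multixact_members_slru_buffers = 64;  -- pages, default 16
  -- SLRU buffers live in shared memory, so a server restart is presumably
  -- needed for the new setting to take effect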

Also, MultiXacts do not seem to be a very good fit for the SLRU design. I think 
it would be better to use a B-tree as a container, or at least to make 
MultiXact members extendable in place (reserve some space when a multixact is 
created).
Currently, when we want to add one more locker to a tuple, we will:
1. Iterate through all SLRU buffers for offsets to read the current offset 
(holding the offsets lock exclusively)
2. Iterate through all buffers for members to find the current members 
(holding the members lock exclusively)
3. Create a new members array with one more XID
4. Search the local MultiXact cache for an existing multixact identical to the 
one we are about to create
5. Repeat step 1 for the write
6. Repeat step 2 for the write

Obviously this does not scale well; we cannot keep increasing the SLRU sizes forever.
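
To make this concrete, the whole cycle above is triggered from SQL as simply as 
this (session names are arbitrary; any row of the benchmark table will do):

  -- session A
  begin;
  select * from t1 where i = 1 for share;  -- xmax holds A's XID alone

  -- session B, while A is still open
  begin;
  select * from t1 where i = 1 for share;  -- a multixact {A, B} is created

  -- session C, while A and B are still open
  begin;
  select * from t1 where i = 1 for share;  -- {A, B} is read back and replaced
                                           -- by a new {A, B, C}: all six steps above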

Thanks! I'd be happy to hear any feedback.

Best regards, Andrey Borodin.

[0] https://github.com/x4m/multixact_stress

Attachment: v1-0001-Add-GUCs-to-tune-MultiXact-SLRUs.patch
Description: Binary data

Attachment: v1-0002-Use-shared-lock-in-GetMultiXactIdMembers-for-offs.patch
Description: Binary data

Attachment: v1-0003-Make-MultiXact-local-cache-size-configurable.patch
Description: Binary data
