On 30.05.2025 14:30, Zhou, Zhiguo wrote:
> Hi Hackers,
> 
> I am reaching out to solicit your insights and comments on this patch 
> addressing a significant performance bottleneck in LWLock acquisition 
> observed on high-core-count systems. During performance analysis of 
> HammerDB/TPCC (192 virtual users, 757 warehouses) on a 384-vCPU Intel 
> system, we found that LWLockAttemptLock consumed 7.12% of total CPU 
> cycles. This bottleneck becomes even more pronounced (up to 30% of 
> cycles) after applying lock-free WAL optimizations[1][2].
> 
> Problem Analysis:
> The current LWLock implementation uses separate atomic operations for 
> state checking and modification. For shared locks (84% of 
> LWLockAttemptLock calls), this requires:
> 1. Atomic read of lock->state
> 2. State modification
> 3. Atomic compare-exchange (with retries on contention)
> 
> This design causes excessive atomic operations on contended locks, which 
> are particularly expensive on high-core-count systems where cache-line 
> bouncing amplifies synchronization costs.
> 
> Optimization Approach:
> The patch optimizes shared lock acquisition by:
> 1. Merging state read and update into a single atomic add operation
> 2. Extending LW_SHARED_MASK by 1 bit and shifting LW_VAL_EXCLUSIVE
> 3. Adding a willwait parameter to control optimization usage
> 
> Key implementation details:
> - For LW_SHARED with willwait=true: Uses atomic fetch-add to increment 
> reference count
> - Maintains backward compatibility through state mask adjustments
> - Preserves existing behavior for:
>    1) Exclusive locks
>    2) Non-waiting cases (LWLockConditionalAcquire)
> - Bounds shared lock count to MAX_BACKENDS*2 (handled via mask extension)
> 
> Performance Impact:
> Testing on a 384-vCPU Intel system shows:
> - *8%* NOPM improvement in HammerDB/TPCC with this optimization alone
> - *46%* cumulative improvement when combined with lock-free WAL 
> optimizations[1][2]
> 
> Patch Contents:
> 1. Extends shared mask and shifts exclusive lock value
> 2. Adds willwait parameter to control optimization
> 3. Updates lock acquisition/release logic
> 4. Maintains all existing assertions and safety checks
> 
> The optimization is particularly effective for contended shared locks, 
> which are common in buffer mapping, lock manager, and shared buffer 
> access patterns.
> 
> Please review this patch for consideration in upcoming PostgreSQL releases.
> 
> [1] Lock-free XLog Reservation from WAL: 
> https://www.postgresql.org/message-id/flat/PH7PR11MB5796659F654F9BE983F3AD97EF142%40PH7PR11MB5796.namprd11.prod.outlook.com
> [2] Increase NUM_XLOGINSERT_LOCKS: 
> https://www.postgresql.org/message-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru

Good day, Zhou.

Could you explain why your patch is correct?

As far as I understand, it is clearly not correct at this time:
- the SHARED lock count may be incremented many times, because of the
for(;;) loop in LWLockAcquire and because LWLockAttemptLock is called
twice per loop iteration when the lock is held in Exclusive mode by
someone else.
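
For reference, here is a simplified sketch of the retry structure in
LWLockAcquire (queueing/wakeup details elided).  LWLockAttemptLock runs
twice per iteration, and the loop can repeat many times while the lock
stays exclusive, so a shared-path fetch_add that is not undone on failure
would leave one stale increment per failed attempt:

    /* simplified sketch of LWLockAcquire's retry loop, not the full code */
    for (;;)
    {
        bool    mustwait;

        mustwait = LWLockAttemptLock(lock, mode);   /* attempt #1 */
        if (!mustwait)
            break;                                  /* got the lock */

        LWLockQueueSelf(lock, mode);

        mustwait = LWLockAttemptLock(lock, mode);   /* attempt #2 */
        if (!mustwait)
        {
            LWLockDequeueSelf(lock);                /* acquired after all */
            break;
        }

        /* block on the semaphore until woken, then loop and retry */
    }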

If your patch is correct, then where am I wrong?

When I tried to do the same thing, I did a sub_fetch immediately in case
of acquisition failure, and did no add_fetch at all if the lock is held
in Exclusive mode.
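
In sketch form, the SHARED path then looks roughly like this (the attached
patch has the real code; return values follow LWLockAttemptLock's
convention of false = acquired, true = must wait):

    old_state = pg_atomic_read_u32(&lock->state);
    while (true)
    {
        if ((old_state & LW_VAL_EXCLUSIVE) != 0)
        {
            pg_read_barrier();
            return true;        /* held exclusively: don't touch the state */
        }

        old_state = pg_atomic_fetch_add_u32(&lock->state, LW_VAL_SHARED);
        if ((old_state & LW_VAL_EXCLUSIVE) == 0)
            return false;       /* got the shared lock */

        /* raced with an exclusive acquirer: undo the increment and retry */
        old_state = pg_atomic_sub_fetch_u32(&lock->state, LW_VAL_SHARED);
    }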

BTW, there's a way to optimize the EXCLUSIVE lock as well, since there's
no need to do a CAS if the lock is already held by someone else.
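
Roughly, again in sketch form (see the attached patch for the real code;
LOCK_DEBUG bookkeeping omitted):

    old_state = pg_atomic_read_u32(&lock->state);
    while (true)
    {
        if ((old_state & LW_LOCK_MASK) != 0)
        {
            pg_read_barrier();
            return true;        /* already held: no write to the cache line */
        }

        /* looks free: CAS either acquires it or refreshes old_state */
        if (pg_atomic_compare_exchange_u32(&lock->state, &old_state,
                                           old_state | LW_VAL_EXCLUSIVE))
            return false;       /* got the exclusive lock */
    }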

See my version in the attachment...

-- 
regards
Yura Sokolov aka funny-falcon
From a87fba3b9a26d2fb5d56bcde709b3315631dd8f6 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Wed, 9 Jul 2025 10:49:38 +0300
Subject: [PATCH v1] Optimise LWLockAttemptLock with separate SHARED and
 EXCLUSIVE acquisition code.

Don't try to write anything if the lock doesn't look free. LWLockAcquire
and other functions call AttemptLock at least twice for correctness, so
"always do the CAS" doesn't pay.

Use atomic_fetch_add for SHARED lock acquisition.
---
 src/backend/storage/lmgr/lwlock.c | 67 ++++++++++++++++---------------
 1 file changed, 34 insertions(+), 33 deletions(-)

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 46f44bc4511..b591831961d 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -796,6 +796,8 @@ static bool
 LWLockAttemptLock(LWLock *lock, LWLockMode mode)
 {
 	uint32		old_state;
+	uint32		desired_state;
+	bool		lock_free;
 
 	Assert(mode == LW_EXCLUSIVE || mode == LW_SHARED);
 
@@ -805,51 +807,50 @@ LWLockAttemptLock(LWLock *lock, LWLockMode mode)
 	 */
 	old_state = pg_atomic_read_u32(&lock->state);
 
-	/* loop until we've determined whether we could acquire the lock or not */
-	while (true)
+	if (mode == LW_SHARED)
 	{
-		uint32		desired_state;
-		bool		lock_free;
-
-		desired_state = old_state;
-
-		if (mode == LW_EXCLUSIVE)
-		{
-			lock_free = (old_state & LW_LOCK_MASK) == 0;
-			if (lock_free)
-				desired_state += LW_VAL_EXCLUSIVE;
-		}
-		else
+		while (true)
 		{
 			lock_free = (old_state & LW_VAL_EXCLUSIVE) == 0;
 			if (lock_free)
-				desired_state += LW_VAL_SHARED;
-		}
+			{
+				old_state = pg_atomic_fetch_add_u32(&lock->state,
+													LW_VAL_SHARED);
+				if ((old_state & LW_VAL_EXCLUSIVE) == 0)
+					return false;
 
-		/*
-		 * Attempt to swap in the state we are expecting. If we didn't see
-		 * lock to be free, that's just the old value. If we saw it as free,
-		 * we'll attempt to mark it acquired. The reason that we always swap
-		 * in the value is that this doubles as a memory barrier. We could try
-		 * to be smarter and only swap in values if we saw the lock as free,
-		 * but benchmark haven't shown it as beneficial so far.
-		 *
-		 * Retry if the value changed since we last looked at it.
-		 */
-		if (pg_atomic_compare_exchange_u32(&lock->state,
-										   &old_state, desired_state))
+				old_state = pg_atomic_sub_fetch_u32(&lock->state,
+													LW_VAL_SHARED);
+			}
+			else
+			{
+				pg_read_barrier();
+				return true;
+			}
+		}
+	}
+
+	while (true)
+	{
+		lock_free = (old_state & LW_LOCK_MASK) == 0;
+		desired_state = old_state | LW_VAL_EXCLUSIVE;
+
+		if (lock_free)
 		{
-			if (lock_free)
+			if (pg_atomic_compare_exchange_u32(&lock->state, &old_state,
+											   desired_state))
 			{
 				/* Great! Got the lock. */
 #ifdef LOCK_DEBUG
-				if (mode == LW_EXCLUSIVE)
-					lock->owner = MyProc;
+				lock->owner = MyProc;
 #endif
 				return false;
 			}
-			else
-				return true;	/* somebody else has the lock */
+		}
+		else
+		{
+			pg_read_barrier();
+			return true;
 		}
 	}
 	pg_unreachable();
-- 
2.43.0
