Re: LogwrtResult contended spinlock
Okay, so I've pushed the last version after confirming with Alexander
that it works on the Windows x86 machine. I hope nothing breaks now in
the buildfarm.

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
Re: LogwrtResult contended spinlock
On 2024-Jul-02, Andres Freund wrote:

> On 2024-07-01 21:12:25 +0200, Alvaro Herrera wrote:
> > On 2024-Jul-01, Andres Freund wrote:
>
> > I'm pretty sure the Microsoft docs I linked to are saying it must be
> > aligned.
>
> I don't think so:
> https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange64
>
> LONG64 InterlockedCompareExchange64(
>   [in, out] LONG64 volatile *Destination,
>   [in]      LONG64 ExChange,
>   [in]      LONG64 Comperand
> );
>
> Note that Destination is the only argument passed by reference (and thus the
> caller controls alignment of the in-memory value). ExChange is passed by
> value, so we don't control alignment in any way.

Hmm, you're right, assuming LONG64 is passed by value. Here
https://learn.microsoft.com/en-us/windows/win32/winprog/windows-data-types
it says that the type is declared as

typedef __int64 LONG64;

and
https://learn.microsoft.com/en-us/cpp/cpp/int8-int16-int32-int64?view=msvc-170
says that __int64 is a normal integer type. So yes, 'ExChange' is passed
by value and therefore testing it for alignment is useless on this
platform.

> > There are in some of them, but not in pg_atomic_compare_exchange_u64_impl.
>
> But there's one in pg_atomic_read_u64_impl().

Sure, but pg_atomic_read_u64 is given 'ptr', not 'currval'.

> But I actually think it's wrong for pg_atomic_monotonic_advance_u64()
> to use _impl(), that's just for the wrapper functions around the
> implementation. Whereas pg_atomic_monotonic_advance_u64() should just
> use the generic interface.

True. Well, I can remove the assertion from
pg_atomic_monotonic_advance_u64 and use pg_atomic_compare_exchange_u64
instead.
But that one does this:

static inline bool
pg_atomic_compare_exchange_u64(volatile pg_atomic_uint64 *ptr,
							   uint64 *expected, uint64 newval)
{
#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
	AssertPointerAlignment(ptr, 8);
	AssertPointerAlignment(expected, 8);
#endif
	return pg_atomic_compare_exchange_u64_impl(ptr, expected, newval);
}

AFAICS this is still going to fail, because uint64 *expected comes from
our &currval, which was not aligned before so it'll still be unaligned
now. The only difference is that the assertion failure will be in
pg_atomic_compare_exchange_u64 instead of in
pg_atomic_monotonic_advance_u64.

Other platforms do have the 'expected' argument as a pointer, so the
assertion there is not completely stupid. I think we could move the
alignment assertions to appear inside the platform-specific _impl
routines that need it, and refrain from adding it to the MSVC one.

> > > > -	return Max(target_, currval);
> > > > +	return Max(target_, currval.u64);
> > >
> > > What does the Max() actually achieve here? Shouldn't it be impossible to
> > > reach this with currval < target_?
> >
> > When two processes are hitting the cmpxchg concurrently, we need to
> > return the highest value that was written, even if it was the other
> > process that did it.
>
> Sure. That explains needing to use currval. But not really needing to use
> Max(). If cmpxchg succeeds, we need to return target_, if the loop terminates
> otherwise we need to return currval. No?

Oh, you're suggesting to change the break statement with a return. Seems
reasonable.

> > > And why does target_ end in an underscore?
> >
> > Heh, you tell me -- I just copied the style of the functions just above.
>
> IIRC using plain "and" "or" "add" caused conflicts with some headers or such.

Ah, that makes sense. It should be no problem to remove the underscore.

-- 
Álvaro Herrera        48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La primera ley de las demostraciones en vivo es: no trate de usar el sistema.
Escriba un guión que no toque nada para no causar daños." (Jakob Nielsen)

>From f9df825c8c61f3e4c8597af3cfbccc5d69967872 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Tue, 2 Jul 2024 13:42:46 +0200
Subject: [PATCH v5] Remove bogus assertion in pg_atomic_monotonic_advance_u64

This code wanted to ensure that the 'exchange' variable passed to
pg_atomic_compare_exchange_u64 has correct alignment, but apparently
platforms don't actually require anything that doesn't come naturally.

While messing with pg_atomic_monotonic_advance_u64: instead of using
Max() to determine the value to return, just use
pg_atomic_compare_exchange_u64()'s return value to decide; also, use
pg_atomic_compare_exchange_u64 instead of the _impl version; also
remove the unnecessary underscore at the end of variable name "target".

Backpatch to 17.

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/36796438-a718-cf9b-2071-b2c1b947c...@gmail.com
---
 src/include/port/atomics.h                | 17 ++---
 src/include/port/atomics/arch-ppc.h       |  2 ++
 src/include/port/atomics/arch-x86.h       |  2 ++
 src/include/port/atomics/generic-gcc.h    |  3 +++
 src/include/port/atomics/generic-sunp
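For reference, the AssertPointerAlignment() checks debated above boil down to a pointer-modulus test. A minimal standalone sketch of that check (the helper name here is invented, not PostgreSQL's actual macro, which aborts rather than returning a result):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Invented stand-in for the AssertPointerAlignment() checks discussed in
 * this thread: report whether 'ptr' is aligned to 'align' bytes.  Real
 * alignments are powers of two, though plain modulus works for any value.
 */
static int
pointer_is_aligned(const volatile void *ptr, uintptr_t align)
{
	return ((uintptr_t) ptr % align) == 0;
}
```

The thread's whole bug is that, on 32-bit Windows, `&currval` for a bare local `uint64` can fail this test with `align == 8`.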
Re: LogwrtResult contended spinlock
Hi,

On 2024-07-01 21:12:25 +0200, Alvaro Herrera wrote:
> On 2024-Jul-01, Andres Freund wrote:
>
> > On 2024-07-01 11:10:24 +0200, Alvaro Herrera wrote:
> > > In the meantime I noticed that pg_attribute_aligned() is not supported
> > > in every platform/compiler, so for safety sake I think it's better to go
> > > with what we do for PGAlignedBlock: use a union with a double member.
> > > That should be 8-byte aligned on x86 as well, unless I misunderstand.
> >
> > If a platform wants to support 8 byte atomics, it better provides a way to
> > make variables 8 bytes aligned. We already rely on that, actually. See use of
> > pg_attribute_aligned in e.g. src/include/port/atomics/generic-msvc.h
>
> Well, pg_atomic_uint64 is a struct. Passing pointers to it is fine,
> which is what non-platform-specific code does; but because the
> declaration of the type is in each platform-specific file, it might not
> work to use it directly in generic code. I didn't actually try, but it
> seems a bit of a layering violation. (I didn't find any place where
> the struct is used that way.)
>
> If that works, then I think we could simply declare currval as a
> pg_atomic_uint64 and it'd be prettier.

I don't think that'd make sense at all. The expected value isn't an
atomic, so it shouldn't be the type of an atomic.

> > +	/*
> > +	 * On 32-bit machines, declaring a bare uint64 variable doesn't promise
> > +	 * the alignment we need, so coerce the compiler this way.
> > +	 */
> > +	union
> > +	{
> > +		uint64		u64;
> > +		double		force_align_d;
> > +	} currval;
>
> > I wonder if we should just relax the alignment requirement for currval. It's
> > crucial that the pointer is atomically aligned (atomic ops across pages are
> > either forbidden or extremely slow), but it's far from obvious that it's
> > crucial for comparator value to be aligned.
>
> I'm pretty sure the Microsoft docs I linked to are saying it must be
> aligned.
I don't think so:
https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange64

LONG64 InterlockedCompareExchange64(
  [in, out] LONG64 volatile *Destination,
  [in]      LONG64 ExChange,
  [in]      LONG64 Comperand
);

Note that Destination is the only argument passed by reference (and thus
the caller controls alignment of the in-memory value). ExChange is passed
by value, so we don't control alignment in any way.

> > > #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
> > > 	AssertPointerAlignment(ptr, 8);
> > > #endif
> >
> > What's the point of this assert, btw? This stuff is already asserted in
> > lower level routines, so it just seems redundant to have it here?
>
> There are in some of them, but not in pg_atomic_compare_exchange_u64_impl.

But there's one in pg_atomic_read_u64_impl().

But I actually think it's wrong for pg_atomic_monotonic_advance_u64() to
use _impl(), that's just for the wrapper functions around the
implementation. Whereas pg_atomic_monotonic_advance_u64() should just use
the generic interface.

> > > -	return Max(target_, currval);
> > > +	return Max(target_, currval.u64);
> >
> > What does the Max() actually achieve here? Shouldn't it be impossible to
> > reach this with currval < target_?
>
> When two processes are hitting the cmpxchg concurrently, we need to
> return the highest value that was written, even if it was the other
> process that did it.

Sure. That explains needing to use currval. But not really needing to use
Max(). If cmpxchg succeeds, we need to return target_, if the loop
terminates otherwise we need to return currval. No?

> > And why does target_ end in an underscore?
>
> Heh, you tell me -- I just copied the style of the functions just above.

IIRC using plain "and" "or" "add" caused conflicts with some headers or such.

Greetings,

Andres Freund
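The control flow suggested here can be sketched with C11 atomics (the names are invented; PostgreSQL uses its own pg_atomic_* wrappers, and the real function also issues a memory barrier on the early-exit path): no Max() is needed, because on CAS success we installed `target` and return it, and on CAS failure the compare-exchange stores the observed value back into `currval`, so once the loop exits `currval` is already the right result.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Sketch of a monotonic-advance operation in C11 atomics, following the
 * suggestion above: advance *ptr to 'target' unless it is already higher,
 * and return whichever value ends up winning -- without a final Max().
 */
static uint64_t
monotonic_advance_u64(_Atomic uint64_t *ptr, uint64_t target)
{
	uint64_t	currval = atomic_load(ptr);

	while (currval < target)
	{
		/* on failure, C11 CAS reloads the observed value into currval */
		if (atomic_compare_exchange_strong(ptr, &currval, target))
			return target;
	}
	return currval;
}
```

Note that `currval` is passed by address here (as in the generic PostgreSQL wrapper), which is why its alignment came up at all on MSVC-like ABIs where the comparand is instead passed by value.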
Re: LogwrtResult contended spinlock
On 2024-Jul-01, Alvaro Herrera wrote:

> On 2024-Jul-01, Andres Freund wrote:
>
> > On 2024-07-01 11:10:24 +0200, Alvaro Herrera wrote:
> > > In the meantime I noticed that pg_attribute_aligned() is not supported
> > > in every platform/compiler, so for safety sake I think it's better to go
> > > with what we do for PGAlignedBlock: use a union with a double member.
> > > That should be 8-byte aligned on x86 as well, unless I misunderstand.
> >
> > If a platform wants to support 8 byte atomics, it better provides a way to
> > make variables 8 bytes aligned. We already rely on that, actually. See use of
> > pg_attribute_aligned in e.g. src/include/port/atomics/generic-msvc.h

I noticed that all definitions have the value struct member named
"value", so I gave this a quick try, and hit another issue: the
"expected" argument of pg_atomic_compare_exchange_u64_impl() is not
volatile, which pg_atomic_uint64.value is, so we get this warning:

In file included from /pgsql/source/master/src/include/storage/lwlock.h:21,
                 from /pgsql/source/master/src/include/storage/lock.h:23,
                 from /pgsql/source/master/src/include/access/twophase.h:20,
                 from /pgsql/source/master/src/backend/access/transam/xlog.c:57:
/pgsql/source/master/src/include/port/atomics.h: In function ‘pg_atomic_monotonic_advance_u64’:
/pgsql/source/master/src/include/port/atomics.h:600:62: warning: passing argument 2 of ‘pg_atomic_compare_exchange_u64_impl’ discards ‘volatile’ qualifier from pointer target type [-Wdiscarded-qualifiers]
  600 |   if (pg_atomic_compare_exchange_u64_impl(ptr, &currval.value, target_))
      |                                                ^~
In file included from /pgsql/source/master/src/include/port/atomics.h:69:
/pgsql/source/master/src/include/port/atomics/arch-x86.h:206:81: note: expected ‘uint64 *’ {aka ‘long unsigned int *’} but argument is of type ‘volatile uint64 *’ {aka ‘volatile long unsigned int *’}
  206 |                                     uint64 *expected, uint64 newval)
      |                                                                    ^~~~

-- 
Álvaro Herrera        48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Most hackers will be
perfectly comfortable conceptualizing users as entropy sources, so let's move on." (Nathaniel Smith) https://mail.gnu.org/archive/html/monotone-devel/2007-01/msg00080.html
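The warning above is ordinary C behavior: implicitly converting a pointer-to-volatile into a plain pointer discards a qualifier and draws a diagnostic. A minimal illustration (all names invented; the cast is what a caller has to write to make the conversion explicit):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Invented stand-in for pg_atomic_compare_exchange_u64_impl(): the
 * 'expected' parameter is a plain (non-volatile) pointer, which is what
 * triggers -Wdiscarded-qualifiers when a volatile lvalue's address is
 * passed to it.
 */
static int
fake_cas_impl(uint64_t *expected, uint64_t newval)
{
	*expected = newval;			/* placeholder body, not a real CAS */
	return 1;
}

static int
call_cas(uint64_t newval)
{
	volatile uint64_t currval = 0;	/* like pg_atomic_uint64.value */

	/*
	 * fake_cas_impl(&currval, ...) would warn: passing the argument
	 * discards the 'volatile' qualifier.  The explicit cast silences it.
	 */
	return fake_cas_impl((uint64_t *) &currval, newval);
}
```

This is exactly why declaring currval as a pg_atomic_uint64 (whose `value` member is volatile) is awkward for the generic code.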
Re: LogwrtResult contended spinlock
01.07.2024 20:04, Alvaro Herrera wrote:
> OK, so it's
>
> #if ALIGNOF_LONG_LONG_INT >= 8
>
> Alexander, can you please confirm whether this works for you?

Yes, the v4 patch makes `meson test --suite setup` pass in the x86
environment. And the complete `meson test` fails on
pg_amcheck/004_verify_heapam only (I think it's a separate issue, just
want to announce what else is broken on that platform).

Thank you!

Best regards,
Alexander
Re: LogwrtResult contended spinlock
Hello,

Thanks for your attention here.

On 2024-Jul-01, Andres Freund wrote:

> On 2024-07-01 11:10:24 +0200, Alvaro Herrera wrote:
> > In the meantime I noticed that pg_attribute_aligned() is not supported
> > in every platform/compiler, so for safety sake I think it's better to go
> > with what we do for PGAlignedBlock: use a union with a double member.
> > That should be 8-byte aligned on x86 as well, unless I misunderstand.
>
> If a platform wants to support 8 byte atomics, it better provides a way to
> make variables 8 bytes aligned. We already rely on that, actually. See use of
> pg_attribute_aligned in e.g. src/include/port/atomics/generic-msvc.h

Well, pg_atomic_uint64 is a struct. Passing pointers to it is fine,
which is what non-platform-specific code does; but because the
declaration of the type is in each platform-specific file, it might not
work to use it directly in generic code. I didn't actually try, but it
seems a bit of a layering violation. (I didn't find any place where
the struct is used that way.)

If that works, then I think we could simply declare currval as a
pg_atomic_uint64 and it'd be prettier.

> > +	/*
> > +	 * On 32-bit machines, declaring a bare uint64 variable doesn't promise
> > +	 * the alignment we need, so coerce the compiler this way.
> > +	 */
> > +	union
> > +	{
> > +		uint64		u64;
> > +		double		force_align_d;
> > +	} currval;
>
> I wonder if we should just relax the alignment requirement for currval. It's
> crucial that the pointer is atomically aligned (atomic ops across pages are
> either forbidden or extremely slow), but it's far from obvious that it's
> crucial for comparator value to be aligned.

I'm pretty sure the Microsoft docs I linked to are saying it must be
aligned.

> > #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
> > 	AssertPointerAlignment(ptr, 8);
> > #endif
>
> What's the point of this assert, btw? This stuff is already asserted in lower
> level routines, so it just seems redundant to have it here?
There are in some of them, but not in pg_atomic_compare_exchange_u64_impl.

> > -	return Max(target_, currval);
> > +	return Max(target_, currval.u64);
>
> What does the Max() actually achieve here? Shouldn't it be impossible to reach
> this with currval < target_?

When two processes are hitting the cmpxchg concurrently, we need to
return the highest value that was written, even if it was the other
process that did it. The assertions in the calling site are quickly
broken if we don't do this. I admit this aspect took me by surprise.

> And why does target_ end in an underscore?

Heh, you tell me -- I just copied the style of the functions just above.

-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No nos atrevemos a muchas cosas porque son difíciles, pero son difíciles
 porque no nos atrevemos a hacerlas" (Séneca)
Re: LogwrtResult contended spinlock
Hi,

On 2024-07-01 11:10:24 +0200, Alvaro Herrera wrote:
> In the meantime I noticed that pg_attribute_aligned() is not supported
> in every platform/compiler, so for safety sake I think it's better to go
> with what we do for PGAlignedBlock: use a union with a double member.
> That should be 8-byte aligned on x86 as well, unless I misunderstand.

If a platform wants to support 8 byte atomics, it better provides a way to
make variables 8 bytes aligned. We already rely on that, actually. See use of
pg_attribute_aligned in e.g. src/include/port/atomics/generic-msvc.h

> From 9d240e90f87bf8b53bd5d92b623e33419db64717 Mon Sep 17 00:00:00 2001
> From: Alvaro Herrera
> Date: Mon, 1 Jul 2024 10:41:06 +0200
> Subject: [PATCH v2] Fix alignment of variable in
>  pg_atomic_monotonic_advance_u64
>
> Reported-by: Alexander Lakhin
> Discussion: https://postgr.es/m/36796438-a718-cf9b-2071-b2c1b947c...@gmail.com
> ---
>  src/include/port/atomics.h | 24 
>  1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
> index 78987f3154..964732e660 100644
> --- a/src/include/port/atomics.h
> +++ b/src/include/port/atomics.h
> @@ -580,30 +580,38 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
>  static inline uint64
>  pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
>  {
> -	uint64		currval;
> +	/*
> +	 * On 32-bit machines, declaring a bare uint64 variable doesn't promise
> +	 * the alignment we need, so coerce the compiler this way.
> +	 */
> +	union
> +	{
> +		uint64		u64;
> +		double		force_align_d;
> +	} currval;

I wonder if we should just relax the alignment requirement for currval. It's
crucial that the pointer is atomically aligned (atomic ops across pages are
either forbidden or extremely slow), but it's far from obvious that it's
crucial for comparator value to be aligned.
> #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
> 	AssertPointerAlignment(ptr, 8);
> #endif

What's the point of this assert, btw? This stuff is already asserted in lower
level routines, so it just seems redundant to have it here?

> -	currval = pg_atomic_read_u64_impl(ptr);
> -	if (currval >= target_)
> +	currval.u64 = pg_atomic_read_u64_impl(ptr);
> +	if (currval.u64 >= target_)
>  	{
>  		pg_memory_barrier();
> -		return currval;
> +		return currval.u64;
>  	}
>
>  #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
> -	AssertPointerAlignment(&currval, 8);
> +	AssertPointerAlignment(&currval.u64, 8);
>  #endif
>
> -	while (currval < target_)
> +	while (currval.u64 < target_)
>  	{
> -		if (pg_atomic_compare_exchange_u64_impl(ptr, &currval, target_))
> +		if (pg_atomic_compare_exchange_u64_impl(ptr, &currval.u64, target_))
>  			break;
>  	}
>
> -	return Max(target_, currval);
> +	return Max(target_, currval.u64);

What does the Max() actually achieve here? Shouldn't it be impossible to reach
this with currval < target_?

And why does target_ end in an underscore?

Greetings,

Andres Freund
Re: LogwrtResult contended spinlock
On 2024-Jul-01, Tom Lane wrote:

> Alvaro Herrera writes:
> > Maybe we can do something like this,
>
> > +#if MAXIMUM_ALIGNOF >= 8
> > 	uint64		currval;
>
> This should probably be testing the alignment of int64 specifically,
> rather than assuming that MAXIMUM_ALIGNOF applies to it. At least
> historically, there have been platforms where MAXIMUM_ALIGNOF is
> determined by float quantities and integer quantities are different.

OK, so it's

#if ALIGNOF_LONG_LONG_INT >= 8

Alexander, can you please confirm whether this works for you?

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"La virtud es el justo medio entre dos defectos" (Aristóteles)

>From 65e9e813859fec436805b1ada7be9339f8ecd63e Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Mon, 1 Jul 2024 16:44:22 +0200
Subject: [PATCH v4] Fix alignment in pg_atomic_monotonic_advance_u64

Unsurprisingly, it's possible for Windows x86 to have 8-byte atomics but
4-byte-aligned long long ints. This breaks an assertion in recently
introduced pg_atomic_monotonic_advance_u64 with commit f3ff7bf83bce.

Try to fix that blindly by adding pg_attribute_aligned(8). Unfortunately
not all compilers support that, so it needs to be protected by #ifdef;
fortunately not all compilers _need_ it. Hopefully the combinations we
now support cover all interesting cases.
Reported-by: Alexander Lakhin
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/36796438-a718-cf9b-2071-b2c1b947c...@gmail.com
---
 src/include/port/atomics.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 78987f3154..eeb47dee30 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -580,7 +580,21 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 static inline uint64
 pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
 {
+	/*
+	 * When using actual (not simulated) atomics, the target variable for
+	 * pg_atomic_compare_exchange_u64 must have suitable alignment.  This
+	 * occurs naturally on platforms where long long int is 8-byte aligned.
+	 * On others, we must explicitly coerce the compiler into applying
+	 * suitable alignment, or abort the compile if we cannot.  When using
+	 * simulated atomics, the alignment doesn't matter.
+	 */
+#if ALIGNOF_LONG_LONG_INT >= 8 || defined(PG_HAVE_ATOMIC_U64_SIMULATION)
 	uint64		currval;
+#elif defined(pg_attribute_aligned)
+	pg_attribute_aligned(8) uint64 currval;
+#else
+#error "Must have pg_attribute_aligned or simulated atomics on this platform"
+#endif
 
 #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
 	AssertPointerAlignment(ptr, 8);
-- 
2.39.2
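For comparison, C11 spells the same guarantee with the `_Alignas` keyword, without a compiler-specific attribute. A standalone sketch of what the #if branches above are buying (not a drop-in for the patch, which must also support pre-C11 compilers):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the misalignment (modulo 8) of a local forced to 8-byte
 * alignment, mirroring what AssertPointerAlignment(&currval, 8) checks in
 * pg_atomic_monotonic_advance_u64.  With _Alignas(8) this is always 0,
 * even on 32-bit ABIs where a bare 8-byte integer local may only get
 * 4-byte alignment.
 */
static uintptr_t
currval_misalignment(void)
{
	_Alignas(8) uint64_t currval = 0;

	(void) currval;
	return (uintptr_t) &currval % 8;
}
```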
Re: LogwrtResult contended spinlock
Alvaro Herrera writes: > Maybe we can do something like this, > +#if MAXIMUM_ALIGNOF >= 8 > uint64 currval; This should probably be testing the alignment of int64 specifically, rather than assuming that MAXIMUM_ALIGNOF applies to it. At least historically, there have been platforms where MAXIMUM_ALIGNOF is determined by float quantities and integer quantities are different. regards, tom lane
Re: LogwrtResult contended spinlock
Maybe we can do something like this,

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 78987f3154..f6fa90bad8 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -580,7 +580,20 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 static inline uint64
 pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
 {
+	/*
+	 * When using actual (not simulated) atomics, the target variable for
+	 * pg_atomic_compare_exchange_u64 must have suitable alignment, which
+	 * is acquired naturally on most platforms, but not on 32-bit ones;
+	 * persuade the compiler in that case, but fail if we cannot.
+	 */
+#if MAXIMUM_ALIGNOF >= 8
 	uint64		currval;
+#elif defined(pg_attribute_aligned) && !defined(PG_HAVE_ATOMIC_U64_SIMULATION)
+	pg_attribute_aligned(8) uint64 currval;
+#else
+#error "Must have pg_attribute aligned or simulated atomics"
+#endif
 
 #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
 	AssertPointerAlignment(ptr, 8);

-- 
Álvaro Herrera        48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Just treat us the way you want to be treated + some extra allowance
 for ignorance."                                    (Michael Brusser)
Re: LogwrtResult contended spinlock
On 2024-Jul-01, Tom Lane wrote: > Alvaro Herrera writes: > >> because the failed assertion is: > >> #ifndef PG_HAVE_ATOMIC_U64_SIMULATION > >> AssertPointerAlignment(&currval, 8); > >> #endif > > Perhaps this assertion is what is wrong? If the platform has no > native 8-byte alignment requirement, why do we think that atomics > need it? Oh, that's a good question. TBH I just copied the assertion from the other routines for 64-bit variables in the same file. But I think that's correct. We're gating the assertion on _not_ having emulation, which must mean we have native atomics; on MSVC, if I read the #ifdef maze correctly, that's implemented using _InterlockedCompareExchange, whose docs state: : The variables for this function must be aligned on a 64-bit boundary; : otherwise, this function will behave unpredictably on multiprocessor x86 : systems and any non-x86 systems. See _aligned_malloc. https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange64 So I think the assertion is correct. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "[PostgreSQL] is a great group; in my opinion it is THE best open source development communities in existence anywhere."(Lamar Owen)
Re: LogwrtResult contended spinlock
Alvaro Herrera writes: >> because the failed assertion is: >> #ifndef PG_HAVE_ATOMIC_U64_SIMULATION >> AssertPointerAlignment(&currval, 8); >> #endif Perhaps this assertion is what is wrong? If the platform has no native 8-byte alignment requirement, why do we think that atomics need it? regards, tom lane
Re: LogwrtResult contended spinlock
Hello,

On 2024-Jun-29, Alexander Lakhin wrote:

> It doesn't, but the following works for me:
> static inline uint64
> pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
> {
> -	uint64		currval;
> +	pg_attribute_aligned(8) uint64 currval;
>
> because the failed assertion is:
> #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
> 	AssertPointerAlignment(&currval, 8);
> #endif

Hah, thank you. In the meantime I noticed that pg_attribute_aligned() is
not supported in every platform/compiler, so for safety sake I think
it's better to go with what we do for PGAlignedBlock: use a union with a
double member. That should be 8-byte aligned on x86 as well, unless I
misunderstand.

BTW if things work after this fix, I suggest you get a buildfarm member
running with this configuration. Otherwise it's quite likely that we'll
break it again. Or we could just decide we don't care about this
particular platform ... but AFAICS the buildfarm does have other 32-bit
animals running.

[ looks at buildfarm ]

Oh! while looking at Adder's config, I noticed this line:

Checking for alignment of "double" : 4

So, I do misunderstand doubles. So my patch is not going to work. There
seems to be nothing that has alignment 8 there, so I guess we're back to
using pg_attribute_aligned() and abort the compile if that doesn't
exist.
-- 
Álvaro Herrera               Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"La gente vulgar sólo piensa en pasar el tiempo;
 el que tiene talento, en aprovecharlo"

>From 9d240e90f87bf8b53bd5d92b623e33419db64717 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Mon, 1 Jul 2024 10:41:06 +0200
Subject: [PATCH v2] Fix alignment of variable in
 pg_atomic_monotonic_advance_u64

Reported-by: Alexander Lakhin
Discussion: https://postgr.es/m/36796438-a718-cf9b-2071-b2c1b947c...@gmail.com
---
 src/include/port/atomics.h | 24 
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h
index 78987f3154..964732e660 100644
--- a/src/include/port/atomics.h
+++ b/src/include/port/atomics.h
@@ -580,30 +580,38 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_)
 static inline uint64
 pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
 {
-	uint64		currval;
+	/*
+	 * On 32-bit machines, declaring a bare uint64 variable doesn't promise
+	 * the alignment we need, so coerce the compiler this way.
+	 */
+	union
+	{
+		uint64		u64;
+		double		force_align_d;
+	} currval;
 
 #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
 	AssertPointerAlignment(ptr, 8);
 #endif
 
-	currval = pg_atomic_read_u64_impl(ptr);
-	if (currval >= target_)
+	currval.u64 = pg_atomic_read_u64_impl(ptr);
+	if (currval.u64 >= target_)
 	{
 		pg_memory_barrier();
-		return currval;
+		return currval.u64;
 	}
 
 #ifndef PG_HAVE_ATOMIC_U64_SIMULATION
-	AssertPointerAlignment(&currval, 8);
+	AssertPointerAlignment(&currval.u64, 8);
 #endif
 
-	while (currval < target_)
+	while (currval.u64 < target_)
 	{
-		if (pg_atomic_compare_exchange_u64_impl(ptr, &currval, target_))
+		if (pg_atomic_compare_exchange_u64_impl(ptr, &currval.u64, target_))
 			break;
 	}
 
-	return Max(target_, currval);
+	return Max(target_, currval.u64);
 }
 
 #undef INSIDE_ATOMICS_H
-- 
2.39.2
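The gotcha discovered above can be checked directly: the alignment of this union is just the maximum of its members' alignments, i.e. at best `alignof(double)`, which the thread's buildfarm check shows is only 4 on member adder's 32-bit ABI. A quick standalone check of that assumption (C11 `alignof`):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/*
 * The union from the v2 patch: its alignment is the maximum of its
 * members' alignments -- which is not necessarily 8, since some 32-bit
 * ABIs align both double and 8-byte integers on 4-byte boundaries.
 */
union currval_u
{
	uint64_t	u64;
	double		force_align_d;
};

static size_t
currval_union_alignment(void)
{
	return alignof(union currval_u);
}
```

On an i386-style ABI this returns 4, which is exactly why the union approach had to be abandoned in favor of pg_attribute_aligned(8).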
Re: LogwrtResult contended spinlock
Hi Alvaro,

Thank you for looking at this!

29.06.2024 13:23, Alvaro Herrera wrote:
> > TRAP: failed Assert("TYPEALIGN(8, (uintptr_t)(&currval)) == (uintptr_t)(&currval)"), File: "...\src\include\port/atomics.h", Line: 597, PID: 7556
> > child process was terminated by exception 0xC409
>
> Oh. This is the new assertion in pg_atomic_monotonic_advance_u64() and
> therefore the only possible culprit could be logInsertResult in
> XLogCtlData. Does it work if you do this?
>
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index 8dcdf5a764..e581488d57 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -468,7 +468,7 @@ typedef struct XLogCtlData
>  	XLogRecPtr	lastSegSwitchLSN;
>
>  	/* These are accessed using atomics -- info_lck not needed */
> -	pg_atomic_uint64 logInsertResult;	/* last byte + 1 inserted to buffers */
> +	pg_atomic_uint64 logInsertResult pg_attribute_aligned(8);	/* last byte + 1 inserted to buffers */
>  	pg_atomic_uint64 logWriteResult;	/* last byte + 1 written out */
>  	pg_atomic_uint64 logFlushResult;	/* last byte + 1 flushed */

It doesn't, but the following works for me:

static inline uint64
pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_)
{
-	uint64		currval;
+	pg_attribute_aligned(8) uint64 currval;

because the failed assertion is:

#ifndef PG_HAVE_ATOMIC_U64_SIMULATION
	AssertPointerAlignment(&currval, 8);
#endif

Best regards,
Alexander
Re: LogwrtResult contended spinlock
Hi

On 2024-Jun-29, Alexander Lakhin wrote:

> --- stderr ---
> TRAP: failed Assert("TYPEALIGN(8, (uintptr_t)(&currval)) == (uintptr_t)(&currval)"), File: "...\src\include\port/atomics.h", Line: 597, PID: 7556
> child process was terminated by exception 0xC409

Oh. This is the new assertion in pg_atomic_monotonic_advance_u64() and
therefore the only possible culprit could be logInsertResult in
XLogCtlData. Does it work if you do this?

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8dcdf5a764..e581488d57 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -468,7 +468,7 @@ typedef struct XLogCtlData
 	XLogRecPtr	lastSegSwitchLSN;
 
 	/* These are accessed using atomics -- info_lck not needed */
-	pg_atomic_uint64 logInsertResult;	/* last byte + 1 inserted to buffers */
+	pg_atomic_uint64 logInsertResult pg_attribute_aligned(8);	/* last byte + 1 inserted to buffers */
 	pg_atomic_uint64 logWriteResult;	/* last byte + 1 written out */
 	pg_atomic_uint64 logFlushResult;	/* last byte + 1 flushed */

Thanks

-- 
Álvaro Herrera               PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Use it up, wear it out, make it do, or do without"
Re: LogwrtResult contended spinlock
Hello Alvaro and Bharath, 07.04.2024 15:19, Alvaro Herrera wrote: I pushed the "copy" pointer now, except that I renamed it to "insert", which is what we call the operation being tracked. I also added some comments. I've discovered that Windows x86 build fails tests after commit f3ff7bf83. Using "x86 Native Tools Command Prompt for VS 2022", I do: meson setup build -Dcassert=true cd build ninja meson test --suite setup and get: 1/3 postgresql:setup / tmp_install OK 1.56s 2/3 postgresql:setup / install_test_files OK 0.09s 3/3 postgresql:setup / initdb_cache FAIL 1.88s exit status 1 testlog.txt contains: running bootstrap script ... --- stderr --- TRAP: failed Assert("TYPEALIGN(8, (uintptr_t)(&currval)) == (uintptr_t)(&currval)"), File: "...\src\include\port/atomics.h", Line: 597, PID: 7556 child process was terminated by exception 0xC409 On f3ff7bf83~1, `meson test --suite setup` passes for me. (I could not find if support for 32-bit Windows ended, so decided to report this.) Best regards, Alexander
Re: LogwrtResult contended spinlock
On Mon, 2024-04-08 at 10:24 +0200, Alvaro Herrera wrote: > My trouble with the "copy" term is that we don't use that term > anywhere > in relation to WAL. I got the term from CopyXLogRecordToWAL(). > This "copy" is in > reality just the insertion, after it's finished. The "Result" suffix > is intended to convey that it's the point where the operation has > successfully finished. That's a convincing point. > Maybe we can add some commentary to make this clearer. For now, I'd just suggest a comment on GetXLogInsertRecPtr() explaining what it's returning and how that's different from logInsertResult. In the future, it would be nice to clarify where variable names like reservedUpto and CurrBytePos fit in. Also the external documentation for pg_current_wal_insert_lsn() is a bit ambiguous. > Lastly, I just noticed that I forgot credits and discussion link in > that > commit message. I would have attributed authorship to Bharath > though, > because I had not realized that this patch had actually been written > by > you initially[1]. My apologies. No worries. Thank you for reviewing and committing it. Regards, Jeff Davis
Re: LogwrtResult contended spinlock
On 2024-Apr-07, Jeff Davis wrote:

> On Sun, 2024-04-07 at 14:19 +0200, Alvaro Herrera wrote:
> > I pushed the "copy" pointer now, except that I renamed it to
> > "insert", which is what we call the operation being tracked. I also
> > added some comments.
>
> The "copy" pointer, as I called it, is not the same as the "insert"
> pointer (as returned by GetXLogInsertRecPtr()).
>
> LSNs before the "insert" pointer are reserved for WAL inserters (and
> may or may not have made it to WAL buffers yet). LSNs before the "copy"
> pointer are written to WAL buffers with CopyXLogRecordToWAL() (and may
> or may not have been evicted to the OS file yet).

It seems a matter of interpretation. WAL insertion starts with reserving
the space (in ReserveXLogInsertLocation) and moving the CurrBytePos
point forward; the same WAL insertion completes when the actual data has
been copied to the buffers. It is this process as a whole that we call
an insertion. CurrBytePos (what GetXLogInsertRecPtr returns) is the
point where the next insertions are going to occur; the new
logInsertResult is the point where previous insertions are known to have
completed.

We could think about insertions as something that's happening to a range
of bytes. CurrBytePos is the high end of that range, logInsertResult is
its low end. (Although in reality we don't know the true low end,
because the process writing the oldest WAL doesn't advertise as soon as
it finishes its insertion, because it doesn't know that it is writing
the oldest. We only know that X is this "oldest known" after we have
waited for all those that were inserting earlier than X to have
finished.)

My trouble with the "copy" term is that we don't use that term anywhere
in relation to WAL. It's a term we don't need. This "copy" is in
reality just the insertion, after it's finished. The "Result" suffix is
intended to convey that it's the point where the operation has
successfully finished.

Maybe we can add some commentary to make this clearer.
Now, if you're set on renaming the variable back to Copy, I won't object. Lastly, I just noticed that I forgot credits and discussion link in that commit message. I would have attributed authorship to Bharath though, because I had not realized that this patch had actually been written by you initially[1]. My apologies. [1] https://postgr.es/m/727449f3151c6b9ab76ba706fa4d30bf7b03ad4f.ca...@j-davis.com -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: LogwrtResult contended spinlock
On Sun, 2024-04-07 at 14:19 +0200, Alvaro Herrera wrote: > I pushed the "copy" pointer now, except that I renamed it to > "insert", > which is what we call the operation being tracked. I also added some > comments. The "copy" pointer, as I called it, is not the same as the "insert" pointer (as returned by GetXLogInsertRecPtr()). LSNs before the "insert" pointer are reserved for WAL inserters (and may or may not have made it to WAL buffers yet). LSNs before the "copy" pointer are written to WAL buffers with CopyXLogRecordToWAL() (and may or may not have been evicted to the OS file yet). Other than that, looks good, thank you. Regards, Jeff Davis
Re: LogwrtResult contended spinlock
I pushed the "copy" pointer now, except that I renamed it to "insert", which is what we call the operation being tracked. I also added some comments. One perhaps significant change from what Bharath submitted is that we now use the return value from monotonic advance to return an updated value of finishedUpto from WaitXLogInsertionsToFinish. I think this is a useful optimization: the only caller of WaitXLogInsertionsToFinish which cares about its return value is XLogFlush, which AFAICS benefits from having a more up-to-date value. Also, I did adopt your (Jeff's) idea of using a barrier when reading logInsertResult in WaitXLogInsertionsToFinish. If I understand correctly, this is actually useful: if some other CPU core was also recently busy waiting for xlog insertions and wrote an updated version of logInsertResult that's past what we last saw, we benefit from an updated value because we can skip the spinlock dance and WAL-insert-var stuff altogether. With no barrier, we would potentially see an outdated value of logInsertResult and waste a lot of effort to get it updated. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Those who use electric razors are infidels destined to burn in hell while we drink from rivers of beer, download free vids and mingle with naked well shaved babes." (http://slashdot.org/comments.pl?sid=44793&cid=4647152)
Re: LogwrtResult contended spinlock
On Sat, 2024-04-06 at 18:13 +0200, Alvaro Herrera wrote: > my understanding of pg_atomic_compare_exchange_u64 is that > *expected is set to the value that's stored in *ptr prior to the > exchange. Sorry, my mistake. Your version looks good. Regards, Jeff Davis
Re: LogwrtResult contended spinlock
On 2024-Apr-05, Jeff Davis wrote:

> Minor comments:
> * Also, I assume that the Max() call in
> pg_atomic_monotonic_advance_u64() is just for clarity? Surely the
> currval cannot be less than _target when it returns. I'd probably just
> do Assert(currval >= _target) and then return currval.

Uhh ... my understanding of pg_atomic_compare_exchange_u64 is that *expected is set to the value that's stored in *ptr prior to the exchange. Its comment

 * Atomically compare the current value of ptr with *expected and store newval
 * iff ptr and *expected have the same value. The current value of *ptr will
 * always be stored in *expected.

is actually not very clear, because what does "current" mean in this context? Is it the original "current" value, or is it the "current" value after the exchange? Anyway, looking at the spinlock-emulated code for pg_atomic_compare_exchange_u32_impl in atomics.c,

    /* perform compare/exchange logic */
    ret = ptr->value == *expected;
    *expected = ptr->value;
    if (ret)
        ptr->value = newval;

it's clear that *expected receives the original value, not the new one. Because of this behavior, not doing the Max() would return the obsolete value rather than the one we just installed. (It would only work correctly without Max() when our cmpxchg operation "fails", that is, when somebody else incremented the value past the one we want to install.)

Another reason is that when I used pg_atomic_monotonic_advance_u64 with the Write and Flush pointers and did not have the Max(), the assertion failed pretty quickly :-)

-- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "In the beginning there was UNIX, and UNIX spoke and said: "Hello world\n". It did not say "Hello New Jersey\n", nor "Hello USA\n"."
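The behavior described here, that *expected receives the value stored before the exchange, is also the contract of C11's atomic_compare_exchange_strong, and is easy to demonstrate in isolation. This is a standalone C11 sketch, not PostgreSQL's pg_atomic API:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Show that a failed compare-exchange writes the value it actually
 * observed back through the 'expected' pointer, matching the
 * spinlock-emulated code quoted above. */
static uint64_t
failed_cas_leaves_current(void)
{
    _Atomic uint64_t ptr = 42;   /* the value currently stored */
    uint64_t expected = 7;       /* a stale guess, so the CAS must fail */

    bool swapped = atomic_compare_exchange_strong(&ptr, &expected, 100);

    /* swapped is false, and 'expected' now holds 42 -- the value that
     * was stored before the attempt, not the proposed new value 100. */
    (void) swapped;
    return expected;
}
```

This is exactly why a caller's local copy (currval in the monotonic-advance loop) is stale after a successful exchange but fresh after a failed one.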
Re: LogwrtResult contended spinlock
On Sat, Apr 6, 2024 at 9:21 AM Thomas Munro wrote: > > On Sat, Apr 6, 2024 at 6:55 AM Alvaro Herrera wrote: > > Pushed 0001. > > Could that be related to the 3 failures on parula that look like this? > > TRAP: failed Assert("node->next == 0 && node->prev == 0"), File: > "../../../../src/include/storage/proclist.h", Line: 63, PID: 29119 > 2024-04-05 16:16:26.812 UTC [29114:15] pg_regress/drop_operator LOG: > statement: DROP OPERATOR <|(bigint, bigint); > postgres: postgres regression [local] CREATE > ROLE(ExceptionalCondition+0x4c)[0x9b3fdc] > postgres: postgres regression [local] CREATE ROLE[0x8529e4] > postgres: postgres regression [local] CREATE > ROLE(LWLockWaitForVar+0xec)[0x8538fc] > postgres: postgres regression [local] CREATE ROLE[0x54c7d4] > postgres: postgres regression [local] CREATE ROLE(XLogFlush+0xf0)[0x552600] > postgres: postgres regression [local] CREATE ROLE[0x54a9b0] > postgres: postgres regression [local] CREATE ROLE[0x54bbdc] > > Hmm, the comments for LWLockWaitForVar say: > > * Be aware that LWLockConflictsWithVar() does not include a memory barrier, > * hence the caller of this function may want to rely on an explicit barrier > or > * an implied barrier via spinlock or LWLock to avoid memory ordering issues. > > But that seems to be more likely to make LWLockWaitForVar suffer data > races (ie hang), not break assertions about LWLock sanity, so I don't > know what's going on there. I happened to have a shell on a Graviton > box, but I couldn't reproduce it after a while... Thanks for reporting. I'll try to spin up a similar instance like parula and reproduce. Meanwhile, I'm wondering if it is somehow related to what's discussed in "Why is parula failing?" https://www.postgresql.org/message-id/4009739.1710878318%40sss.pgh.pa.us. It seems like parula is behaving unexpectedly because of the compiler and other stuff. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: LogwrtResult contended spinlock
On Sat, Apr 6, 2024 at 6:55 AM Alvaro Herrera wrote: > Pushed 0001. Could that be related to the 3 failures on parula that look like this? TRAP: failed Assert("node->next == 0 && node->prev == 0"), File: "../../../../src/include/storage/proclist.h", Line: 63, PID: 29119 2024-04-05 16:16:26.812 UTC [29114:15] pg_regress/drop_operator LOG: statement: DROP OPERATOR <|(bigint, bigint); postgres: postgres regression [local] CREATE ROLE(ExceptionalCondition+0x4c)[0x9b3fdc] postgres: postgres regression [local] CREATE ROLE[0x8529e4] postgres: postgres regression [local] CREATE ROLE(LWLockWaitForVar+0xec)[0x8538fc] postgres: postgres regression [local] CREATE ROLE[0x54c7d4] postgres: postgres regression [local] CREATE ROLE(XLogFlush+0xf0)[0x552600] postgres: postgres regression [local] CREATE ROLE[0x54a9b0] postgres: postgres regression [local] CREATE ROLE[0x54bbdc] Hmm, the comments for LWLockWaitForVar say: * Be aware that LWLockConflictsWithVar() does not include a memory barrier, * hence the caller of this function may want to rely on an explicit barrier or * an implied barrier via spinlock or LWLock to avoid memory ordering issues. But that seems to be more likely to make LWLockWaitForVar suffer data races (ie hang), not break assertions about LWLock sanity, so I don't know what's going on there. I happened to have a shell on a Graviton box, but I couldn't reproduce it after a while...
Re: LogwrtResult contended spinlock
On Fri, 2024-04-05 at 13:54 +0200, Alvaro Herrera wrote:

> Couldn't push: I tested with --disable-atomics --disable-spinlocks and
> the tests fail because the semaphore for the atomic variables is not
> always initialized. This is weird -- it's like a client process is
> running at a time when StartupXLOG has not initialized the variable ...
> so the initialization in the other place was not completely wrong.

It looks like you solved the problem and pushed the first patch. Looks good to me.

Patch 0002 also looks good to me, after a mostly-trivial rebase (just remember to initialize logCopyResult).

Minor comments:

* You could consider using a read barrier before reading the Copy ptr, or using the pg_atomic_read_membarrier_u64() variant. I don't see a correctness problem, but it might be slightly more clear and I don't see a lot of downside.

* Also, I assume that the Max() call in pg_atomic_monotonic_advance_u64() is just for clarity? Surely the currval cannot be less than _target when it returns. I'd probably just do Assert(currval >= _target) and then return currval.

Regards, Jeff Davis
Re: LogwrtResult contended spinlock
Pushed 0001. Here's the patch that adds the Copy position one more time, with the monotonic_advance function returning the value. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ >From 3f5c860576245b92701e7bfc517947c418c68510 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 5 Apr 2024 13:52:44 +0200 Subject: [PATCH v17] Add logCopyResult Updated using monotonic increment --- src/backend/access/transam/xlog.c | 32 ++- src/include/port/atomics.h| 36 +++ 2 files changed, 67 insertions(+), 1 deletion(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index b4499cda7b..a40c1fb79e 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -469,6 +469,7 @@ typedef struct XLogCtlData XLogRecPtr lastSegSwitchLSN; /* These are accessed using atomics -- info_lck not needed */ + pg_atomic_uint64 logCopyResult; /* last byte + 1 copied to WAL buffers */ pg_atomic_uint64 logWriteResult; /* last byte + 1 written out */ pg_atomic_uint64 logFlushResult; /* last byte + 1 flushed */ @@ -1499,6 +1500,7 @@ static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto) { uint64 bytepos; + XLogRecPtr copyptr; XLogRecPtr reservedUpto; XLogRecPtr finishedUpto; XLogCtlInsert *Insert = &XLogCtl->Insert; @@ -1507,6 +1509,11 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto) if (MyProc == NULL) elog(PANIC, "cannot wait without a PGPROC structure"); + /* check if there's any work to do */ + copyptr = pg_atomic_read_u64(&XLogCtl->logCopyResult); + if (upto <= copyptr) + return copyptr; + /* Read the current insert position */ SpinLockAcquire(&Insert->insertpos_lck); bytepos = Insert->CurrBytePos; @@ -1586,6 +1593,10 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto) if (insertingat != InvalidXLogRecPtr && insertingat < finishedUpto) finishedUpto = insertingat; } + + finishedUpto = + pg_atomic_monotonic_advance_u64(&XLogCtl->logCopyResult, finishedUpto); + return finishedUpto; } @@ -1727,13 +1738,24 @@ 
WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count, { char *pdst = dstbuf; XLogRecPtr recptr = startptr; + XLogRecPtr copyptr; Size nbytes = count; if (RecoveryInProgress() || tli != GetWALInsertionTimeLine()) return 0; Assert(!XLogRecPtrIsInvalid(startptr)); - Assert(startptr + count <= LogwrtResult.Write); + + /* + * Caller should ensure that the requested data has been copied to WAL + * buffers before we try to read it. + */ + copyptr = pg_atomic_read_u64(&XLogCtl->logCopyResult); + if (startptr + count > copyptr) + ereport(ERROR, +errmsg("request to read past end of generated WAL; requested %X/%X, current position %X/%X", + LSN_FORMAT_ARGS(startptr + count), + LSN_FORMAT_ARGS(copyptr))); /* * Loop through the buffers without a lock. For each buffer, atomically @@ -2571,13 +2593,19 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) { XLogRecPtr Flush; XLogRecPtr Write; + XLogRecPtr Copy; Flush = pg_atomic_read_u64(&XLogCtl->logFlushResult); pg_read_barrier(); Write = pg_atomic_read_u64(&XLogCtl->logWriteResult); + pg_read_barrier(); + Copy = pg_atomic_read_u64(&XLogCtl->logCopyResult); /* WAL written to disk is always ahead of WAL flushed */ Assert(Write >= Flush); + + /* WAL copied to buffers is always ahead of WAL written */ + Assert(Copy >= Write); } #endif } @@ -4951,6 +4979,7 @@ XLOGShmemInit(void) SpinLockInit(&XLogCtl->Insert.insertpos_lck); SpinLockInit(&XLogCtl->info_lck); + pg_atomic_init_u64(&XLogCtl->logCopyResult, InvalidXLogRecPtr); pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr); pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr); pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr); @@ -5979,6 +6008,7 @@ StartupXLOG(void) * because no other process can be reading or writing WAL yet. 
*/ LogwrtResult.Write = LogwrtResult.Flush = EndOfLog; + pg_atomic_write_u64(&XLogCtl->logCopyResult, EndOfLog); pg_atomic_write_u64(&XLogCtl->logWriteResult, EndOfLog); pg_atomic_write_u64(&XLogCtl->logFlushResult, EndOfLog); XLogCtl->LogwrtRqst.Write = EndOfLog; diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h index ff47782cdb..78987f3154 100644 --- a/src/include/port/atomics.h +++ b/src/include/port/atomics.h @@ -570,6 +570,42 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_) return pg_atomic_sub_fetch_u64_impl(ptr, sub_); } +/* + * Monotonically advance the given variable using only atomic operations until + * it's at least the target value. Returns the latest value observed, which + * may or may not be the target value. + * + * Full barrier semantics (even when value is unchanged). + */ +static inline uint64 +pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_) +{ + uint64 c
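The control flow of the pg_atomic_monotonic_advance_u64 function added by the patch can be sketched in portable C11. This is a simplified approximation, not the pg_atomic implementation; in particular, the real function provides full barrier semantics that plain C11 compare-exchange does not guarantee here:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Advance *ptr to at least 'target' using only atomic operations.
 * Returns the latest value observed: 'target' if this caller installed
 * it, or whatever higher value another process installed first. */
static uint64_t
monotonic_advance(_Atomic uint64_t *ptr, uint64_t target)
{
    uint64_t currval = atomic_load(ptr);

    while (currval < target)
    {
        /* On failure, 'currval' is refreshed with the value actually
         * stored, so the loop terminates once *ptr >= target. */
        if (atomic_compare_exchange_strong(ptr, &currval, target))
            return target;
    }
    return currval;
}

/* Advancing past the current value installs the target; trying to move
 * the pointer backwards is a no-op that reports the higher value. */
static uint64_t
monotonic_advance_demo(void)
{
    _Atomic uint64_t pos = 10;

    monotonic_advance(&pos, 20);        /* installs 20 */
    return monotonic_advance(&pos, 5);  /* no change, reports 20 */
}
```

Note how the two return paths make the Max() discussed earlier in the thread unnecessary: on success we return the target we installed, otherwise the (necessarily higher) observed value.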
Re: LogwrtResult contended spinlock
Couldn't push: I tested with --disable-atomics --disable-spinlocks and the tests fail because the semaphore for the atomic variables is not always initialized. This is weird -- it's like a client process is running at a time when StartupXLOG has not initialized the variable ... so the initialization in the other place was not completely wrong. I'll investigate after lunch. Here's v16. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "XML!" Exclaimed C++. "What are you doing here? You're not a programming language." "Tell that to the people who use me," said XML. https://burningbird.net/the-parable-of-the-languages/ >From 2164778b74681910544a64ba6c2ae36e5204ed9e Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 5 Apr 2024 13:21:39 +0200 Subject: [PATCH v16 1/2] Make XLogCtl->log{Write,Flush}Result accessible with atomics This removes the need to hold both the info_lck spinlock and WALWriteLock to update them. We use stock atomic write instead, with WALWriteLock held. Readers can use atomic read, without any locking. This allows for some code to be reordered: some places were a bit contorted to avoid repeated spinlock acquisition, but that's no longer a concern, so we can turn them to more natural coding. Some further changes are likely possible with a positive performance impact, but in this commit I did rather minimal ones only, to avoid increasing the blast radius. 
Reviewed-by: Bharath Rupireddy Reviewed-by: Jeff Davis Reviewed-by: Andres Freund (earlier versions) Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql --- src/backend/access/transam/xlog.c | 105 -- 1 file changed, 57 insertions(+), 48 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 3bdd9a2ddd..748ed1f2ef 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -292,12 +292,7 @@ static bool doPageWrites; * LogwrtRqst indicates a byte position that we need to write and/or fsync * the log up to (all records before that point must be written or fsynced). * The positions already written/fsynced are maintained in logWriteResult - * and logFlushResult. - * - * To read XLogCtl->logWriteResult or ->logFlushResult, you must hold either - * info_lck or WALWriteLock. To update them, you need to hold both locks. - * The point of this arrangement is that the value can be examined by code - * that already holds WALWriteLock without needing to grab info_lck as well. + * and logFlushResult using atomic access. * In addition to the shared variable, each backend has a private copy of * both in LogwrtResult, which is updated when convenient. * @@ -473,12 +468,9 @@ typedef struct XLogCtlData pg_time_t lastSegSwitchTime; XLogRecPtr lastSegSwitchLSN; - /* - * Protected by info_lck and WALWriteLock (you must hold either lock to - * read it, but both to update) - */ - XLogRecPtr logWriteResult; /* last byte + 1 written out */ - XLogRecPtr logFlushResult; /* last byte + 1 flushed */ + /* These are accessed using atomics -- info_lck not needed */ + pg_atomic_uint64 logWriteResult; /* last byte + 1 written out */ + pg_atomic_uint64 logFlushResult; /* last byte + 1 flushed */ /* * Latest initialized page in the cache (last byte position + 1). 
@@ -616,11 +608,15 @@ static XLogwrtResult LogwrtResult = {0, 0}; /* * Update local copy of shared XLogCtl->log{Write,Flush}Result + * + * It's critical that Flush always trails Write, so the order of the reads is + * important, as is the barrier. See also XLogWrite. */ #define RefreshXLogWriteResult(_target) \ do { \ - _target.Write = XLogCtl->logWriteResult; \ - _target.Flush = XLogCtl->logFlushResult; \ + _target.Flush = pg_atomic_read_u64(&XLogCtl->logFlushResult); \ + pg_read_barrier(); \ + _target.Write = pg_atomic_read_u64(&XLogCtl->logWriteResult); \ } while (0) /* @@ -968,9 +964,8 @@ XLogInsertRecord(XLogRecData *rdata, /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRqst.Write < EndPos) XLogCtl->LogwrtRqst.Write = EndPos; - /* update local result copy while I have the chance */ - RefreshXLogWriteResult(LogwrtResult); SpinLockRelease(&XLogCtl->info_lck); + RefreshXLogWriteResult(LogwrtResult); } /* @@ -1989,17 +1984,17 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) if (opportunistic) break; - /* Before waiting, get info_lck and update LogwrtResult */ + /* Advance shared memory write request position */ SpinLockAcquire(&XLogCtl->info_lck); if (XLogCtl->LogwrtRqst.Write < OldPageRqstPtr) XLogCtl->LogwrtRqst.Write = OldPageRqstPtr; - RefreshXLogWriteResult(LogwrtResult); SpinLockRelease(&XLogCtl->info_lck); /* - * Now that we have an up-to-date LogwrtResult value, see if we - * still need to write it or if someone else already did. + * Acquire an up-to-date LogwrtResu
Re: LogwrtResult contended spinlock
On 2024-Apr-05, Bharath Rupireddy wrote: > 1. > /* > * Update local copy of shared XLogCtl->log{Write,Flush}Result > + * > + * It's critical that Flush always trails Write, so the order of the reads is > + * important, as is the barrier. > */ > #define RefreshXLogWriteResult(_target) \ > do { \ > -_target.Write = XLogCtl->logWriteResult; \ > -_target.Flush = XLogCtl->logFlushResult; \ > +_target.Flush = pg_atomic_read_u64(&XLogCtl->logFlushResult); \ > +pg_read_barrier(); \ > +_target.Write = pg_atomic_read_u64(&XLogCtl->logWriteResult); \ > } while (0) > > Is it "Flush always trails Write" or "Flush always leades Write"? I > guess the latter, no? Well, Flush cannot lead Write. We cannot Flush what hasn't been Written! To me, "Flush trails Write" means that flush is behind. That seems supported by Merriam-Webster[1], which says in defining it as a transitive verb 3 a : to follow upon the scent or trace of : TRACK b : to follow in the footsteps of : PURSUE c : to follow along behind d : to lag behind (someone, such as a competitor) Maybe it's not super clear as a term. We could turn it around and say "Write always leads Flush". [1] https://www.merriam-webster.com/dictionary/trail The reason we have the barrier, and the reason we write and read them in this order, is that we must never read a Flush value that is newer than the Write value we read. That is: the values are written in pairs (w1,f1) first, (w2,f2) next, and so on. It's okay if we obtain (w2,f1) (new write, old flush), but it's not okay if we obtain (w1,f2) (old write, new flush). If we did, we might end up with a Flush value that's ahead of Write, violating the invariant. > 2. > + > +pg_atomic_write_u64(&XLogCtl->logWriteResult, LogwrtResult.Write); > +pg_write_barrier(); > +pg_atomic_write_u64(&XLogCtl->logFlushResult, LogwrtResult.Flush); > } > > Maybe add the reason as to why we had to write logWriteResult first > and then logFlushResult, similar to the comment atop > RefreshXLogWriteResult? 
Hmm, I had one there and removed it. I'll put something back.

> > For 0002, did you consider having pg_atomic_monotonic_advance_u64()
> > return the currval?
>
> +1 for returning currval here.

Oh yeah, I had that when the monotonic stuff was used by 0001 but lost it when I moved to 0002. I'll push 0001 now and send an updated 0002 with the return value added.

-- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Rebellion is man's original virtue" (Arthur Schopenhauer)
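The pairing rule argued for in this exchange (store Write before Flush with a barrier between them, and read them back in the opposite order) can be sketched with C11 fences. The fences here are an approximation of pg_write_barrier()/pg_read_barrier(), and the struct is a toy stand-in, not the actual xlog.c code:

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct
{
    _Atomic uint64_t write;   /* last byte + 1 written out */
    _Atomic uint64_t flush;   /* last byte + 1 flushed */
} wal_results;

/* Writer side: store Write first, then a write barrier, then Flush, so
 * no reader can observe a Flush newer than the Write it pairs with. */
static void
publish_results(wal_results *r, uint64_t write, uint64_t flush)
{
    atomic_store_explicit(&r->write, write, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);  /* ~ pg_write_barrier() */
    atomic_store_explicit(&r->flush, flush, memory_order_relaxed);
}

/* Reader side: load Flush first, then a read barrier, then Write.
 * Seeing (new write, old flush) is harmless; the fences rule out
 * (old write, new flush), which would break Write >= Flush. */
static void
refresh_results(wal_results *r, uint64_t *write, uint64_t *flush)
{
    *flush = atomic_load_explicit(&r->flush, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);  /* ~ pg_read_barrier() */
    *write = atomic_load_explicit(&r->write, memory_order_relaxed);
}

/* Single-process demo: the pair read back preserves the invariant. */
static int
ordering_demo(void)
{
    wal_results r;
    uint64_t w, f;

    atomic_init(&r.write, 0);
    atomic_init(&r.flush, 0);
    publish_results(&r, 200, 100);
    refresh_results(&r, &w, &f);
    return (w >= f) && (w == 200) && (f == 100);
}
```

With a third Copy pointer, as in the 0002 patch, the same rule extends: stores go in increasing-strictness order (Copy, Write, Flush is never needed since Copy is stored elsewhere), and loads go lowest-first (Flush, barrier, Write, barrier, Copy) so each value read can only lag, never lead, the one read after it.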
Re: LogwrtResult contended spinlock
On Fri, Apr 5, 2024 at 4:21 AM Jeff Davis wrote: > > > 4. If we're not modifying any callers of WALReadFromBuffers(), then > > AFAICS the added check that the request is not past the Copy pointer > > is > > useless. In a quick look at that code, I think we only try to read > > data > > that's been flushed, not written, so the stricter check that we don't > > read data that hasn't been Copied does nothing. > > Bharath has indicated that he may call WALReadFromBuffers() in an > extension, so I believe some error checking is appropriate there. > > > (Honestly I'm not sure > > that we want XLogSendPhysical to be reading data that has not been > > written, or do we?) > > Not yet, but there has been some discussion[1][2] about future work to > allow replicating data before it's been flushed locally. Right. Although callers of WALReadFromBuffers() in core postgres doesn't need it now (in future, they will), but it allows one to write something up externally - for example, see 0004 patch in https://www.postgresql.org/message-id/CALj2ACWmW+1ZdYE0BD5-4KVLzYGn3=pudomwphuhmi4yquo...@mail.gmail.com where I implemented a xlogreader page_read callback that just keeps reading the WAL that's fully copied to WAL buffers. > Regarding the patches themselves, 0001 looks good to me. A few comments on 0001: 1. /* * Update local copy of shared XLogCtl->log{Write,Flush}Result + * + * It's critical that Flush always trails Write, so the order of the reads is + * important, as is the barrier. */ #define RefreshXLogWriteResult(_target) \ do { \ -_target.Write = XLogCtl->logWriteResult; \ -_target.Flush = XLogCtl->logFlushResult; \ +_target.Flush = pg_atomic_read_u64(&XLogCtl->logFlushResult); \ +pg_read_barrier(); \ +_target.Write = pg_atomic_read_u64(&XLogCtl->logWriteResult); \ } while (0) Is it "Flush always trails Write" or "Flush always leades Write"? I guess the latter, no? 2. 
+ +pg_atomic_write_u64(&XLogCtl->logWriteResult, LogwrtResult.Write); +pg_write_barrier(); +pg_atomic_write_u64(&XLogCtl->logFlushResult, LogwrtResult.Flush); } Maybe add the reason as to why we had to write logWriteResult first and then logFlushResult, similar to the comment atop RefreshXLogWriteResult? > For 0002, did you consider having pg_atomic_monotonic_advance_u64() > return the currval? +1 for returning currval here. Except for the above comments, the patches look good to me. I've also run the test loop for any assertion failure - for i in {1..100}; do make check PROVE_TESTS="t/027_stream_regress.pl"; if [ $? -ne 0 ]; then echo "The command failed on iteration $i"; break; fi; done. -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: LogwrtResult contended spinlock
On Thu, 2024-04-04 at 19:45 +0200, Alvaro Herrera wrote: > 1. Using pg_atomic_write_membarrier_u64 is useless and it imposes > mora > barriers than we actually need. So I switched back to > pg_atomic_write_u64 and add one barrier between the two writes. Same > for reads. +1. This looks correct to me. Just before the writes there's a spinlock, which acts as a full barrier; and just afterwards, the function returns and the WALWriteLock is released, again acting as a full barrier. The write barrier in between enforces the Write >= Flush invariant. > 2. Using monotonic_advance for Write and Flush is useless. +1. > 3. Testing the invariant that the Copy pointer cannot be 0 is > useless, > because we initialize that pointer to EndOfLog during StartupXLOG. > So, removed. +1. > 4. If we're not modifying any callers of WALReadFromBuffers(), then > AFAICS the added check that the request is not past the Copy pointer > is > useless. In a quick look at that code, I think we only try to read > data > that's been flushed, not written, so the stricter check that we don't > read data that hasn't been Copied does nothing. Bharath has indicated that he may call WALReadFromBuffers() in an extension, so I believe some error checking is appropriate there. > (Honestly I'm not sure > that we want XLogSendPhysical to be reading data that has not been > written, or do we?) Not yet, but there has been some discussion[1][2] about future work to allow replicating data before it's been flushed locally. > Please argue why we need this patch. I'm not sure what you mean by "this patch"? > 5. The existing weird useless-looking block at the end of XLogWrite > is > there because we used to have it to declare a volatile pointer to > XLogCtl (cf. commit 6ba4ecbf477e). That's no longer needed, so we > could remove it. Or we could leave it alone (just because it's > ancient > and it doesn't hurt anything), but there's no reason to have the new > invariant-testing block inside the same block. 
So I added another > weird > useless-looking block, except that this one does have two variable > declaration at its top. That didn't bother me, but it could be cleaned up a bit in a later patch. > 6. In a few places, we read both Write and Flush to only use one of > them. This is wasteful and we could dial this back to reading only > the > one we need. Andres suggested as much in [1]. I didn't touch this > in > the current patch, and I don't necessarily think we need to address > it > right now. Addressing this should probably done similar to what I > posted in [2]'s 0002. I agree that it should be a separate patch. I haven't thought about the consequences of making them fully independent -- I think that means we give up the invariant that Copy >= Write >= Flush? Regarding the patches themselves, 0001 looks good to me. For 0002, did you consider having pg_atomic_monotonic_advance_u64() return the currval? Regards, Jeff Davis [1] https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com [2] https://www.postgresql.org/message-id/20230125211540.zylu74dj2uuh3k7w%40awork3.anarazel.de [3] https://www.postgresql.org/message-id/CALj2ACW65mqn6Ukv57SqDTMzAJgd1N_AdQtDgy%2BgMDqu6v618Q%40mail.gmail.com
Re: LogwrtResult contended spinlock
I've noticed a few things here, v16 attached with some rather largish changes.

1. Using pg_atomic_write_membarrier_u64 is useless and it imposes more barriers than we actually need. So I switched back to pg_atomic_write_u64 and added one barrier between the two writes. Same for reads.

2. Using monotonic_advance for Write and Flush is useless. We can use a simple atomic_write with a write barrier in between. The reason is that, as Andres said[1], we only write those with WALWriteLock held, so it's not possible for them to move forward while we aren't looking. All callers of XLogWrite do RefreshXLogWriteResult() with the WALWriteLock held. Therefore we can just use pg_atomic_write_u64. Consequently I moved the addition of the monotonic advance function to the patch that adds Copy.

3. Testing the invariant that the Copy pointer cannot be 0 is useless, because we initialize that pointer to EndOfLog during StartupXLOG. So, removed.

4. If we're not modifying any callers of WALReadFromBuffers(), then AFAICS the added check that the request is not past the Copy pointer is useless. In a quick look at that code, I think we only try to read data that's been flushed, not written, so the stricter check that we don't read data that hasn't been Copied does nothing. (Honestly I'm not sure that we want XLogSendPhysical to be reading data that has not been written, or do we?) Please argue why we need this patch.

5. The existing weird useless-looking block at the end of XLogWrite is there because we used to have it to declare a volatile pointer to XLogCtl (cf. commit 6ba4ecbf477e). That's no longer needed, so we could remove it. Or we could leave it alone (just because it's ancient and it doesn't hurt anything), but there's no reason to have the new invariant-testing block inside the same block. So I added another weird useless-looking block, except that this one does have two variable declarations at its top.

6. In a few places, we read both Write and Flush to only use one of them. 
This is wasteful and we could dial this back to reading only the one we need. Andres suggested as much in [1]. I didn't touch this in the current patch, and I don't necessarily think we need to address it right now. Addressing this should probably done similar to what I posted in [2]'s 0002. [1] https://postgr.es/m/20210130023011.n545o54j65t4k...@alap3.anarazel.de [2] https://postgr.es/m/202203221611.hqbjdinzsbu2@alvherre.pgsql -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ >From f791c7ee9815871a1843116febffc0077d29a9a3 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 3 Apr 2024 11:48:27 +0200 Subject: [PATCH v16 1/2] Make XLogCtl->log{Write,Flush}Result accessible with atomics This removes the need to hold both the info_lck spinlock and WALWriteLock to update them. --- src/backend/access/transam/xlog.c | 91 +-- 1 file changed, 49 insertions(+), 42 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f26533ec01..9b68afd400 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -292,12 +292,7 @@ static bool doPageWrites; * LogwrtRqst indicates a byte position that we need to write and/or fsync * the log up to (all records before that point must be written or fsynced). * The positions already written/fsynced are maintained in logWriteResult - * and logFlushResult. - * - * To read XLogCtl->logWriteResult or ->logFlushResult, you must hold either - * info_lck or WALWriteLock. To update them, you need to hold both locks. - * The point of this arrangement is that the value can be examined by code - * that already holds WALWriteLock without needing to grab info_lck as well. + * and logFlushResult using atomic access. * In addition to the shared variable, each backend has a private copy of * both in LogwrtResult, which is updated when convenient. 
* @@ -473,12 +468,9 @@ typedef struct XLogCtlData pg_time_t lastSegSwitchTime; XLogRecPtr lastSegSwitchLSN; - /* - * Protected by info_lck and WALWriteLock (you must hold either lock to - * read it, but both to update) - */ - XLogRecPtr logWriteResult; /* last byte + 1 written out */ - XLogRecPtr logFlushResult; /* last byte + 1 flushed */ + /* These are accessed using atomics -- info_lck not needed */ + pg_atomic_uint64 logWriteResult; /* last byte + 1 written out */ + pg_atomic_uint64 logFlushResult; /* last byte + 1 flushed */ /* * Latest initialized page in the cache (last byte position + 1). @@ -616,11 +608,15 @@ static XLogwrtResult LogwrtResult = {0, 0}; /* * Update local copy of shared XLogCtl->log{Write,Flush}Result + * + * It's critical that Flush always trails Write, so the order of the reads is + * important, as is the barrier. */ #define RefreshXLogWriteResult(_target) \ do { \ - _target.Write = XLogCtl->logWriteResult; \ - _target.Flush = XLogCtl->logFlushResult; \ + _target.F
Re: LogwrtResult contended spinlock
On Wed, 2024-04-03 at 13:19 +0200, Alvaro Herrera wrote: > So what I do in the attached 0001 is stop using the XLogwrtResult > struct > in XLogCtl and replace it with separate Write and Flush values, and > add > the macro XLogUpdateLocalLogwrtResult() that copies the values of > Write > and Flush from the shared XLogCtl to the local variable given as > macro > argument. +1 > 0002 then adds pg_atomic_monotonic_advance_u64. (I don't add the > _u32 > variant, because I don't think it's a great idea to add dead code. > If > later we see a need for it we can put it in.) It also changes the > two > new members to be atomics, changes the macro to use atomic read, and > XLogWrite now uses monotonic increment. A couple of other places can > move the macro calls to occur outside the spinlock. Also, XLogWrite > gains the invariant checking that involves Write and Flush. To maintain the invariant that Write >= Flush, I believe you need to always store to Write first, then Flush; and load from Flush first, then from Write. So RefreshXLogWriteResult should load Flush then load Write, and the same for the Assert code. And in 0003, loading from Copy should happen last. Also, should pg_atomic_monotonic_advance_u64() return currval? I don't think it's important for the current patches, but I could imagine a caller wanting the most up-to-date value possible, even if it's beyond what the caller requested. Returning void seems slightly wasteful. Other than that, it looks good to me. > Finally, 0003 adds the Copy pointer to XLogCtl alongside Write and > Flush, and updates WALReadFromBuffers to test that instead of the > Write > pointer, and adds in XLogWrite the invariant checks that involve the > Copy pointer. Looks good to me. Regards, Jeff Davis
Re: LogwrtResult contended spinlock
On 2024-Apr-03, Bharath Rupireddy wrote: > On Wed, Apr 3, 2024 at 4:49 PM Alvaro Herrera wrote: > > So what I do in the attached 0001 is stop using the XLogwrtResult struct > > in XLogCtl and replace it with separate Write and Flush values, and add > > the macro XLogUpdateLocalLogwrtResult() that copies the values of Write > > and Flush from the shared XLogCtl to the local variable given as macro > > argument. (I also added our idiomatic do {} while(0) to the macro > > definition, for safety). The new members are XLogCtl->logWriteResult > > and XLogCtl->logFlushResult and as of 0001 are just XLogRecPtr, so > > essentially identical semantics as the previous code. No atomic access > > yet! > > +1. > Also, I'm fine with renaming XLogUpdateLocalLogwrtResult() to > RefreshXLogWriteResult(). Okay, I have pushed 0001 with the name change, will see about getting the others in tomorrow. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: LogwrtResult contended spinlock
On Wed, Apr 3, 2024 at 4:49 PM Alvaro Herrera wrote: > > Thanks for keeping this moving forward. I gave your proposed patches a > look. One thing I didn't like much is that we're adding a new member > (Copy) to XLogwrtAtomic -- but this struct is supposed to be a mirror of > XLogwrtResult for use with atomic access. Since this new member is not > added to XLogwrtResult (because it's not needed there), the whole idea > of there being symmetry between those two structs crumbles down. > Because we later stop using struct-assign anyway, meaning we no longer > need the structs to match, we can instead spell out the members in > XLogCtl and call it a day. Hm, I have no objection to having separate variables in XLogCtl. > So what I do in the attached 0001 is stop using the XLogwrtResult struct > in XLogCtl and replace it with separate Write and Flush values, and add > the macro XLogUpdateLocalLogwrtResult() that copies the values of Write > and Flush from the shared XLogCtl to the local variable given as macro > argument. (I also added our idiomatic do {} while(0) to the macro > definition, for safety). The new members are XLogCtl->logWriteResult > and XLogCtl->logFlushResult and as of 0001 are just XLogRecPtr, so > essentially identical semantics as the previous code. No atomic access > yet! +1. > 0002 then adds pg_atomic_monotonic_advance_u64. (I don't add the _u32 > variant, because I don't think it's a great idea to add dead code. If > later we see a need for it we can put it in.) It also changes the two > new members to be atomics, changes the macro to use atomic read, and > XLogWrite now uses monotonic increment. A couple of other places can > move the macro calls to occur outside the spinlock. Also, XLogWrite > gains the invariant checking that involves Write and Flush. I'm fine with not having the 32 bit variant of pg_atomic_monotonic_advance. 
However, a recent commit bd5132db5 added both 32 and 64 bit versions of pg_atomic_read_membarrier even though 32 bit isn't being used. > Finally, 0003 adds the Copy pointer to XLogCtl alongside Write and > Flush, and updates WALReadFromBuffers to test that instead of the Write > pointer, and adds in XLogWrite the invariant checks that involve the > Copy pointer. +1. The attached patches look good to me. Also, I'm fine with renaming XLogUpdateLocalLogwrtResult() to RefreshXLogWriteResult(). > I haven't rerun Bharath test loop yet; will do so shortly. I ran it a 100 times [1] on top of all the 3 patches, it looks fine. [1] for i in {1..100}; do make check PROVE_TESTS="t/027_stream_regress.pl"; if [ $? -ne 0 ]; then echo "The command failed on iteration $i"; break; fi; done ./configure --prefix=$PWD/pg17/ --enable-debug --enable-tap-tests --enable-cassert CC=/usr/bin/clang-14 > install.log && make -j 8 install > install.log 2>&1 & -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: LogwrtResult contended spinlock
On 2024-Apr-03, Alvaro Herrera wrote: > So what I do in the attached 0001 is stop using the XLogwrtResult struct > in XLogCtl and replace it with separate Write and Flush values, and add > the macro XLogUpdateLocalLogwrtResult() that copies the values of Write > and Flush from the shared XLogCtl to the local variable given as macro > argument. (I also added our idiomatic do {} while(0) to the macro > definition, for safety). The new members are XLogCtl->logWriteResult > and XLogCtl->logFlushResult and as of 0001 are just XLogRecPtr, so > essentially identical semantics as the previous code. No atomic access > yet! BTW I forgot. I didn't like the name XLogUpdateLocalLogwrtResult() name much. What do you think about RefreshXLogWriteResult() instead? -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "How strange it is to find the words "Perl" and "saner" in such close proximity, with no apparent sense of irony. I doubt that Larry himself could have managed it." (ncm, http://lwn.net/Articles/174769/)
Re: LogwrtResult contended spinlock
Thanks for keeping this moving forward. I gave your proposed patches a look. One thing I didn't like much is that we're adding a new member (Copy) to XLogwrtAtomic -- but this struct is supposed to be a mirror of XLogwrtResult for use with atomic access. Since this new member is not added to XLogwrtResult (because it's not needed there), the whole idea of there being symmetry between those two structs crumbles down. Because we later stop using struct-assign anyway, meaning we no longer need the structs to match, we can instead spell out the members in XLogCtl and call it a day. So what I do in the attached 0001 is stop using the XLogwrtResult struct in XLogCtl and replace it with separate Write and Flush values, and add the macro XLogUpdateLocalLogwrtResult() that copies the values of Write and Flush from the shared XLogCtl to the local variable given as macro argument. (I also added our idiomatic do {} while(0) to the macro definition, for safety). The new members are XLogCtl->logWriteResult and XLogCtl->logFlushResult and as of 0001 are just XLogRecPtr, so essentially identical semantics as the previous code. No atomic access yet! 0002 then adds pg_atomic_monotonic_advance_u64. (I don't add the _u32 variant, because I don't think it's a great idea to add dead code. If later we see a need for it we can put it in.) It also changes the two new members to be atomics, changes the macro to use atomic read, and XLogWrite now uses monotonic increment. A couple of other places can move the macro calls to occur outside the spinlock. Also, XLogWrite gains the invariant checking that involves Write and Flush. Finally, 0003 adds the Copy pointer to XLogCtl alongside Write and Flush, and updates WALReadFromBuffers to test that instead of the Write pointer, and adds in XLogWrite the invariant checks that involve the Copy pointer. I haven't rerun Bharath test loop yet; will do so shortly. 
-- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "You're _really_ hosed if the person doing the hiring doesn't understand relational systems: you end up with a whole raft of programmers, none of whom has had a Date with the clue stick." (Andrew Sullivan) https://postgr.es/m/20050809113420.gd2...@phlogiston.dyndns.org >From e6ddbe87d598c6a1090e39845a4a308060f0af1e Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Wed, 3 Apr 2024 11:23:12 +0200 Subject: [PATCH v15 1/3] split XLogCtl->LogwrtResult into separate struct members --- src/backend/access/transam/xlog.c | 59 ++- 1 file changed, 35 insertions(+), 24 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 87ea03954b..6213d99561 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -291,16 +291,15 @@ static bool doPageWrites; * * LogwrtRqst indicates a byte position that we need to write and/or fsync * the log up to (all records before that point must be written or fsynced). - * LogwrtResult indicates the byte positions we have already written/fsynced. - * These structs are identical but are declared separately to indicate their - * slightly different functions. + * The positions already written/fsynced are maintained in logWriteResult + * and logFlushResult. * - * To read XLogCtl->LogwrtResult, you must hold either info_lck or - * WALWriteLock. To update it, you need to hold both locks. The point of - * this arrangement is that the value can be examined by code that already - * holds WALWriteLock without needing to grab info_lck as well. In addition - * to the shared variable, each backend has a private copy of LogwrtResult, - * which is updated when convenient. + * To read XLogCtl->logWriteResult or ->logFlushResult, you must hold either + * info_lck or WALWriteLock. To update them, you need to hold both locks. 
+ * The point of this arrangement is that the value can be examined by code + * that already holds WALWriteLock without needing to grab info_lck as well. + * In addition to the shared variable, each backend has a private copy of + * both in LogwrtResult, which is updated when convenient. * * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst * (protected by info_lck), but we don't need to cache any copies of it. @@ -478,7 +477,8 @@ typedef struct XLogCtlData * Protected by info_lck and WALWriteLock (you must hold either lock to * read it, but both to update) */ - XLogwrtResult LogwrtResult; + XLogRecPtr logWriteResult; /* last byte + 1 written out */ + XLogRecPtr logFlushResult; /* last byte + 1 flushed */ /* * Latest initialized page in the cache (last byte position + 1). @@ -614,6 +614,15 @@ static int UsableBytesInSegment; */ static XLogwrtResult LogwrtResult = {0, 0}; +/* + * Update local copy of shared XLogCtl->log{Write,Flush}Result + */ +#define XLogUpdateLocalLogwrtResult(_target) \ + do { \ + _target.Write = XLogCtl->logWrit
Re: LogwrtResult contended spinlock
On Mon, Mar 4, 2024 at 9:15 PM Bharath Rupireddy wrote: > > > 0003: > > > > * We need to maintain the invariant that Copy >= Write >= Flush. I > > believe that's always satisfied, because the > > XLogWaitInsertionsToFinish() is always called before XLogWrite(). But > > we should add an assert or runtime check of this invariant somewhere. > > Yes, that invariant is already maintained by the server. Although I don't > fully agree, I added an assertion to WaitXLogInsertionsToFinish > after updating XLogCtl->LogwrtResult.Copy. CF bot is happy with it - > https://github.com/BRupireddy2/postgres/tree/atomic_LogwrtResult_v13. I've now separated these invariants out into the 0004 patch. With the assertions placed in WaitXLogInsertionsToFinish after updating the Copy ptr, I observed the assertion failing in one of the CF bot machines - https://cirrus-ci.com/build/6202112288227328. I could reproduce it locally with [1]. I guess the reason is that since the Write and Flush ptrs are now updated independently and atomically without a lock, they might drift and appear out of order for a while when accessed concurrently in WaitXLogInsertionsToFinish. So, I guess the right place to verify the invariant Copy >= Write >= Flush is in XLogWrite, once the Write and Flush ptrs in shared memory are updated (note that only one process at a time can do this). Accordingly, I've moved the assertions to XLogWrite in the attached v14-0004 patch. > Please see the attached v13 patch set for further review. Earlier versions of the patches removed a piece of code ensuring shared WAL 'request' values did not fall behind the 'result' values. There's a good reason for us to have it. So, I restored it. - /* -* Update shared-memory status -* -* We make sure that the shared 'request' values do not fall behind the -* 'result' values. This is not absolutely essential, but it saves some -* code in a couple of places. -*/ Please see the attached v14 patch set.
[1] for i in {1..100}; do make check PROVE_TESTS="t/027_stream_regress.pl"; if [ $? -ne 0 ]; then echo "The command failed on iteration $i"; break; fi; done -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com v14-0001-Add-monotonic-advancement-functions-for-atomics.patch Description: Binary data v14-0002-Make-XLogCtl-LogwrtResult-accessible-with-atomic.patch Description: Binary data v14-0003-Add-Copy-pointer-to-track-data-copied-to-WAL-buf.patch Description: Binary data v14-0004-Add-invariants-for-shared-LogwrtResult-members.patch Description: Binary data
Re: LogwrtResult contended spinlock
Thanks for looking into this. On Thu, Feb 22, 2024 at 1:54 AM Jeff Davis wrote: > > > 3. > > @@ -6371,7 +6373,9 @@ GetFlushRecPtr(TimeLineID *insertTLI) > > If at all, a read > > barrier is warranted here, we can use atomic read with full barrier > > I don't think we need a full barrier but I'm fine with using > pg_atomic_read_membarrier_u64() if it's better for whatever reason. For the sake of clarity and correctness, I've used pg_atomic_read_membarrier_u64 everywhere for reading XLogCtl->LogwrtResult.Write and XLogCtl->LogwrtResult.Flush. > > 5. I guess we'd better use pg_atomic_read_u64_impl and > > pg_atomic_compare_exchange_u64_impl in > > pg_atomic_monotonic_advance_u64 > > to reduce one level of function indirections. > > Aren't they inlined? Yes, all of them are inlined. But it seems the XXX_impl functions are used, as a convention, to implement the exposed functions. Therefore, having pg_atomic_read_u64_impl and pg_atomic_compare_exchange_u64_impl doesn't sound bad IMV. > > 6. > > + * Full barrier semantics (even when value is unchanged). > > > > +currval = pg_atomic_read_u64(ptr); > > +if (currval >= target_) > > +{ > > +pg_memory_barrier(); > > I don't think they are exactly equivalent: in the current patch, the > first pg_atomic_read_u64() could be reordered with earlier reads; > whereas if using pg_atomic_read_membarrier_u64() it could not be. I'm not sure > whether that could create a performance > problem or not. I left it as-is. > > 9. > > +copyptr = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Copy); > > +if (startptr + count > copyptr) > > +ereport(WARNING, > > +(errmsg("request to read past end of generated WAL; > > request %X/%X, current position %X/%X", > > +LSN_FORMAT_ARGS(startptr + count), > > LSN_FORMAT_ARGS(copyptr)))); > > > > Any specific reason for this to be a WARNING rather than an ERROR? > > Good question. WaitXLogInsertionsToFinish() uses a LOG level message > for the same situation.
They should probably be the same log level, and > I would think it would be either PANIC or WARNING. I have no idea why > LOG was chosen. WaitXLogInsertionsToFinish adjusts upto after LOG so that the wait is never past the current insert position even if a caller asks for reading WAL that doesn't yet exist. And the comment there says "Here we just assume that to mean that all WAL that has been reserved needs to be finished." In contrast, WALReadFromBuffers kind of enforces callers to do WaitXLogInsertionsToFinish (IOW asks callers to send in the WAL that exists in the server). Therefore, an ERROR seems a reasonable choice to me, if PANIC sounds rather strong affecting all the postgres processes. FWIW, a PANIC when requested to flush past the end of WAL in WaitXLogInsertionsToFinish instead of LOG seems to be good. CF bot animals don't complain - https://github.com/BRupireddy2/postgres/tree/be_harsh_when_request_to_flush_past_end_of_WAL_WIP. > 0001: > > * The comments on the two versions of the functions are redundant, and > the style in that header seems to be to omit the comment from the u64 > version. Removed comments atop 64-bit version. > * I'm not sure the AssertPointerAlignment is needed in the u32 version? Borrowed them from pg_atomic_read_u32 and pg_atomic_compare_exchange_u32, just like how they assert before calling XXX_impl versions. I don't see any problem with them. > 0002: > > * All updates to the non-shared LogwrtResult should update both values. > It's confusing to update those local values independently, because it > violates the invariant that LogwrtResult.Flush <= LogwrtResult.Write. > > > 2. I guess we need to update both the Write and Flush local copies in > > AdvanceXLInsertBuffer. > > I agree. Whenever we update the non-shared LogwrtResult, let's update > the whole thing. Otherwise it's confusing. Yes, it's done that way now with a macro XLogUpdateLocalLogWrtResult using pg_atomic_read_membarrier_u64 to read both Write and Flush ptrs. 
> * pg_memory_barrier() is not needed right before a spinlock Got rid of it as we read both Flush and Write local copies with pg_atomic_read_membarrier_u64. > * As mentioned above, I think GetFlushRecPtr() needs two read barriers. > Also, I think the same for GetXLogWriteRecPtr(). Yes, it's done that way now with a macro XLogUpdateLocalLogWrtResult using pg_atomic_read_membarrier_u64 to read both Write and Flush ptrs. > * In general, for any place using both Write and Flush, I think Flush > should be loaded first, followed by a read barrier, followed by a load > of the Write pointer. Why read Flush first rather than Write? I think it's enough to do {read Write, read barrier, read Flush}. This works because Write is monotonically advanced first before Flush using full barriers and we don't get reordering issues between the readers and writers no? Am I missing anything here? > And I think in most of those places there shou
Re: LogwrtResult contended spinlock
On Fri, Feb 23, 2024 at 1:18 AM Jeff Davis wrote: > I don't see the global non-shared variable as a huge problem, so if it > serves a purpose then I'm fine keeping it. Perhaps we could make it a > bit safer by using some wrapper functions. I actually really hate these kinds of variables. I think they likely mask various bugs (where the value might not be up to date where we think it is) and anti-bugs (where we actually rely on the value being out of date but we don't know that we do because the code is so opaque). My vintage 2021 adventures in getting rid of the global variable ThisTimeLineID and a few other global variables found some actual but minor bugs, also found a bunch of confused code that didn't really do what it thought it did, and took up a huge amount of my time trying to analyze what was happening. I'm not prepared to recommend what we should do in this particular case. I would like to believe that the local copies can be eliminated somehow, but I haven't studied the code or done benchmarking so I don't really know enough to guess what sort of code changes would or would not be good enough. -- Robert Haas EDB: http://www.enterprisedb.com
Re: LogwrtResult contended spinlock
On Thu, 2024-02-22 at 10:17 +0530, Robert Haas wrote: > I think the problems tend to be worst when you have some bit of data > that's being frequently modified by multiple backends. Every backend > that wants to modify the value needs to steal the cache line, and > eventually you spend most of your CPU time stealing cache lines from > other sockets and not much of it doing any actual work. If you have a > value that's just being read by a lot of backends without > modification, I think the cache line can be shared in read only mode > by all the CPUs and it's not too bad. That makes sense. I guess they'd be on the same cache line as well, which means a write to either will invalidate both. Some places (XLogWrite, XLogInsertRecord, XLogSetAsyncXactLSN, GetFlushRecPtr, GetXLogWriteRecPtr) already update it from shared memory anyway, so those are non-issues. The potential problem areas (unless I missed something) are: * AdvanceXLInsertBuffer reads it as an early check to see if a buffer is already written out. * XLogFlush / XLogNeedsFlush use it for an early return * WALReadFromBuffers reads it for an error check (currently an Assert, but we might want to make that an elog). * We are discussing adding a Copy pointer, which would be advanced by WaitXLogInsertionsToFinish(), and if we do replication-before-flush, we may want to be more eager about advancing it. That could cause an issue, as well. I don't see the global non-shared variable as a huge problem, so if it serves a purpose then I'm fine keeping it. Perhaps we could make it a bit safer by using some wrapper functions. Something like: bool IsWriteRecPtrAtLeast(XLogRecPtr recptr) { XLogRecPtr writeptr; if (LogwrtResult.Write >= recptr) return true; writeptr = GetXLogWriteRecPtr(); return (writeptr >= recptr); } That would reduce the number of direct references to LogwrtResult, callers would never see a stale value, and would avoid the cache miss problem that you're concerned about. 
Not sure if they'd need to be inline or not. Regards, Jeff Davis
Re: LogwrtResult contended spinlock
On Thu, Feb 22, 2024 at 1:54 AM Jeff Davis wrote: > On Wed, 2024-02-21 at 20:00 +0530, Bharath Rupireddy wrote: > > The local copy of LogwrtResult is so frequently used in the backends, > > if we were to replace it with atomic accesses, won't the atomic reads > > be costly and start showing up in perf profiles? > > I don't see exactly where the extra cost would come from. What is your > concern? > > In functions that access it several times, it may make sense to copy it > to a function-local variable, of course. But having a global non-shared > LogwrtResult doesn't seem particularly useful to me. I would love to get rid of the global variable; all of our code relies too much on global variables, but the xlog code is some of the worst. The logic is complex and difficult to reason about. But I do think that there's room for concern about performance, too. I don't know if there is an actual problem, but there could be a problem. My impression is that accessing shared memory isn't intrinsically more costly than accessing backend-private memory, but accessing heavily contended memory can definitely be more expensive than accessing uncontended memory -- and not just slightly more expensive, but WAY more expensive. I think the problems tend to be worst when you have some bit of data that's being frequently modified by multiple backends. Every backend that wants to modify the value needs to steal the cache line, and eventually you spend most of your CPU time stealing cache lines from other sockets and not much of it doing any actual work. If you have a value that's just being read by a lot of backends without modification, I think the cache line can be shared in read only mode by all the CPUs and it's not too bad. -- Robert Haas EDB: http://www.enterprisedb.com
Re: LogwrtResult contended spinlock
On Wed, 2024-02-21 at 20:00 +0530, Bharath Rupireddy wrote: > The local copy of LogwrtResult is so frequently used in the backends, > if we were to replace it with atomic accesses, won't the atomic reads > be costly and start showing up in perf profiles? I don't see exactly where the extra cost would come from. What is your concern? In functions that access it several times, it may make sense to copy it to a function-local variable, of course. But having a global non-shared LogwrtResult doesn't seem particularly useful to me. > In any case, I prefer > discussing this separately once we get to a conclusion on converting > shared memory Write and Flush ptrs to atomics. I agree that it shouldn't block the earlier patches, but it seems on- topic for this thread. > > 1. I guess we need to initialize the new atomics with > pg_atomic_init_u64 initially in XLOGShmemInit: Thank you. > 2. I guess we need to update both the Write and Flush local copies in > AdvanceXLInsertBuffer. I agree. Whenever we update the non-shared LogwrtResult, let's update the whole thing. Otherwise it's confusing. > 3. > @@ -6371,7 +6373,9 @@ GetFlushRecPtr(TimeLineID *insertTLI) > { > Assert(XLogCtl->SharedRecoveryState == RECOVERY_STATE_DONE); > > + pg_read_barrier(); > LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl- > >LogwrtResult.Flush); > + LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl- > >LogwrtResult.Write); > > Do we need a read barrier here to not reorder things when more than > one process is accessing the flush and write ptrs? The above seems wrong: we don't want to reorder a load of Write before a load of Flush, otherwise it could look like Write < Flush. (Similarly, we never want to reorder a store of Flush before a store of Write.) So I think we should have another read barrier between those two atomic reads. I also think we need the read barrier at the beginning because the caller doesn't expect a stale value of the Flush pointer. 
It might not be strictly required for correctness, because I don't think that stale value can be inconsistent with anything else, but getting an up-to-date value is better. > If at all, a read > barrier is warranted here, we can use atomic read with full barrier I don't think we need a full barrier but I'm fine with using pg_atomic_read_membarrier_u64() if it's better for whatever reason. > + pg_atomic_write_u64(&XLogCtl->LogwrtResult.Write, EndOfLog); > + pg_atomic_write_u64(&XLogCtl->LogwrtResult.Flush, EndOfLog); > XLogCtl->LogwrtRqst.Write = EndOfLog; > > pg_atomic_write_u64 here seems fine to me as no other process is > active writing WAL yet, otherwise, we need write with full barrier I don't see any memory ordering hazard. In any case, there's an LWLockAcquire a few lines later, which acts as a full barrier. > something like pg_write_barrier + pg_atomic_write_u64 or > pg_atomic_exchange_u64 like in LWLockUpdateVar or use atomic write > with full barrier semantics as proposed here (when that's in) - > https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13 > . > > 5. I guess we'd better use pg_atomic_read_u64_impl and > pg_atomic_compare_exchange_u64_impl in > pg_atomic_monotonic_advance_u64 > to reduce one level of function indirections. Aren't they inlined? > 6. > + * Full barrier semantics (even when value is unchanged). > > + currval = pg_atomic_read_u64(ptr); > + if (currval >= target_) > + { > + pg_memory_barrier(); > > Perhaps, we can use atomic read with full barrier semantics proposed > here - > https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13 I don't think they are exactly equivalent: in the current patch, the first pg_atomic_read_u64() could be reordered with earlier reads; whereas if using pg_atomic_read_membarrier_u64() it could not be. I'm not sure whether that could create a performance problem or not.
> Just for the sake of completeness, do we need > pg_atomic_monotonic_advance_u32 as well? +1. > 8. I'd split the addition of these new monotonic functions into 0001, > then 0002 adding XLogwrtAtomic and 0003 adding Copy to XLogwrtAtomic. +1 > 9. > + copyptr = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Copy); > + if (startptr + count > copyptr) > + ereport(WARNING, > + (errmsg("request to read past end of generated WAL; > request %X/%X, current position %X/%X", > + LSN_FORMAT_ARGS(startptr + count), > LSN_FORMAT_ARGS(copyptr; > > Any specific reason for this to be a WARNING rather than an ERROR? Good question. WaitXLogInsertionsToFinish() uses a LOG level message for the same situation. They should probably be the same log level, and I would think it would be either PANIC or WARNING. I have no idea why LOG was chosen. > I've addressed some of my review comments (#1, #5, #7 and #8) above, > merged the v11j-0002-fixup.patch into 0002, ran pgindent. Please see > the attached v12 p
Re: LogwrtResult contended spinlock
On Sat, Feb 17, 2024 at 2:24 AM Jeff Davis wrote: > > Though it looks like we can remove the non-shared LogwrtResult > entirely. Andres expressed some concern here: > > https://www.postgresql.org/message-id/20210130020211.rtu5ir3dpjrbi...@alap3.anarazel.de > > But then seemed to favor removing it here: > > https://www.postgresql.org/message-id/20240213001150.4uqzh7tinuhvo...@awork3.anarazel.de > > I'm inclined to think we can get rid of the non-shared copy. The local copy of LogwrtResult is so frequently used in the backends, if we were to replace it with atomic accesses, won't the atomic reads be costly and start showing up in perf profiles? In any case, I prefer discussing this separately once we get to a conclusion on converting shared memory Write and Flush ptrs to atomics. > A few other comments: > > * Should GetFlushRecPtr()/GetXLogWriteRecPtr() use a read memory > barrier? > * Why did you add pg_memory_barrier() right before a spinlock > acquisition? > * Is it an invariant that Write >= Flush at all times? Are there > guaranteed to be write barriers in the right place to ensure that? I'll continue to think about these points. > I would also like it if we could add a new "Copy" pointer indicating > how much WAL data has been copied to the WAL buffers. That would be set > by WaitXLogInsertionsToFinish() so that subsequent calls are cheap. > Attached a patch (0003) for illustration purposes. It adds to the size > of XLogCtlData, but it's fairly large already, so I'm not sure if > that's a problem. If we do add this, there would be an invariant that > Copy >= Write at all times. Thanks. I have a few comments on v11j patches. 1. I guess we need to initialize the new atomics with pg_atomic_init_u64 initially in XLOGShmemInit: 2. I guess we need to update both the Write and Flush local copies in AdvanceXLInsertBuffer. Overall, I guess we need to update both the Write and Flush local copies whenever we previously read LogwrtResult, no? 
Failing to do that caused regression tests to fail, and since we were able to catch it, we ended up with v11j-0002-fixup.patch. There might be cases we wouldn't be able to catch if we didn't update both Write and Flush local copies. I'd suggest we just wrap both of these under a macro: LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush); LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write); and just use it wherever we did LogwrtResult = XLogCtl->LogwrtResult; previously but not under spinlock. 3. @@ -6371,7 +6373,9 @@ GetFlushRecPtr(TimeLineID *insertTLI) { Assert(XLogCtl->SharedRecoveryState == RECOVERY_STATE_DONE); +pg_read_barrier(); LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush); +LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write); Do we need a read barrier here to not reorder things when more than one process is accessing the flush and write ptrs? If at all, a read barrier is warranted here, we can use atomic read with full barrier semantics as proposed here (when that's in) - https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13. 4. +/* + * Update local and shared status. This is OK to do without any locks + * because no other process can be reading or writing WAL yet. + */ LogwrtResult.Write = LogwrtResult.Flush = EndOfLog; +pg_atomic_write_u64(&XLogCtl->LogwrtResult.Write, EndOfLog); +pg_atomic_write_u64(&XLogCtl->LogwrtResult.Flush, EndOfLog); XLogCtl->LogwrtRqst.Write = EndOfLog; pg_atomic_write_u64 here seems fine to me as no other process is active writing WAL yet, otherwise, we need write with full barrier something like pg_write_barrier + pg_atomic_write_u64 or pg_atomic_exchange_u64 like in LWLockUpdateVar or use atomic write with full barrier semantics as proposed here (when that's in) - https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13. 5.
I guess we'd better use pg_atomic_read_u64_impl and pg_atomic_compare_exchange_u64_impl in pg_atomic_monotonic_advance_u64 to reduce one level of function indirection. Of course, we'd then need the pointer alignment assertions that pg_atomic_compare_exchange_u64 performs in the new monotonic function. 6. + * Full barrier semantics (even when value is unchanged). +currval = pg_atomic_read_u64(ptr); +if (currval >= target_) +{ +pg_memory_barrier(); Perhaps we can use an atomic read with full barrier semantics as proposed here - https://www.postgresql.org/message-id/20231127210030.GA140335%40nathanxps13 7. +static inline void +pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_) Just for the sake of completeness, do we need pg_atomic_monotonic_advance_u32 as well? 8. I'd split the addition of these new monotonic functions into 0001, then 0002 adding XLogwrtAtomic and 0003 adding Copy to XLogwrtAtomic. 9. +copyptr = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Copy); +if (startptr + count > co
Re: LogwrtResult contended spinlock
On Mon, 2024-02-12 at 17:44 -0800, Jeff Davis wrote: > It looks like there's some renewed interest in this patch: After rebasing (attached as 0001), I'm seeing some test failures. It looks like the local LogwrtResult is not being updated in as many places, and that's hitting the Assert that I recently added. The fix is easy (attached as 0002). Though it looks like we can remove the non-shared LogwrtResult entirely. Andres expressed some concern here: https://www.postgresql.org/message-id/20210130020211.rtu5ir3dpjrbi...@alap3.anarazel.de But then seemed to favor removing it here: https://www.postgresql.org/message-id/20240213001150.4uqzh7tinuhvo...@awork3.anarazel.de I'm inclined to think we can get rid of the non-shared copy. A few other comments: * Should GetFlushRecPtr()/GetXLogWriteRecPtr() use a read memory barrier? * Why did you add pg_memory_barrier() right before a spinlock acquisition? * Is it an invariant that Write >= Flush at all times? Are there guaranteed to be write barriers in the right place to ensure that? I would also like it if we could add a new "Copy" pointer indicating how much WAL data has been copied to the WAL buffers. That would be set by WaitXLogInsertionsToFinish() so that subsequent calls are cheap. Attached a patch (0003) for illustration purposes. It adds to the size of XLogCtlData, but it's fairly large already, so I'm not sure if that's a problem. If we do add this, there would be an invariant that Copy >= Write at all times. Regards, Jeff Davis From cd48125afc61faa46f15f444390ed01f64988320 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 2 Feb 2021 14:03:43 -0300 Subject: [PATCH v11j 1/3] Make XLogCtl->LogwrtResult accessible with atomics MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, access to LogwrtResult is protected by a spinlock. 
This becomes severely contended in some scenarios, such as with a largish replication flock: walsenders all calling GetFlushRecPtr repeatedly cause the processor to heat up to the point where eggs can be fried on top. This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult with a struct containing a pair of atomically accessed variables. Do so. In a few places, we can adjust the exact location where the locals are updated to account for the fact that we no longer need the spinlock. Author: Álvaro Herrera Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql --- src/backend/access/transam/xlog.c | 106 ++ src/include/port/atomics.h| 29 src/tools/pgindent/typedefs.list | 1 + 3 files changed, 78 insertions(+), 58 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 50c347a679..46394c5d39 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -294,16 +294,13 @@ static bool doPageWrites; * * LogwrtRqst indicates a byte position that we need to write and/or fsync * the log up to (all records before that point must be written or fsynced). - * LogwrtResult indicates the byte positions we have already written/fsynced. - * These structs are identical but are declared separately to indicate their - * slightly different functions. + * LogWrtResult indicates the byte positions we have already written/fsynced. + * These structs are similar but are declared separately to indicate their + * slightly different functions; in addition, the latter is read and written + * using atomic operations. * - * To read XLogCtl->LogwrtResult, you must hold either info_lck or - * WALWriteLock. To update it, you need to hold both locks. The point of - * this arrangement is that the value can be examined by code that already - * holds WALWriteLock without needing to grab info_lck as well. 
In addition - * to the shared variable, each backend has a private copy of LogwrtResult, - * which is updated when convenient. + * In addition to the shared variable, each backend has a private copy of + * LogwrtResult, each member of which is separately updated when convenient. * * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst * (protected by info_lck), but we don't need to cache any copies of it. @@ -326,6 +323,12 @@ static bool doPageWrites; *-- */ +typedef struct XLogwrtAtomic +{ + pg_atomic_uint64 Write; /* last byte + 1 written out */ + pg_atomic_uint64 Flush; /* last byte + 1 flushed */ +} XLogwrtAtomic; + typedef struct XLogwrtRqst { XLogRecPtr Write; /* last byte + 1 to write out */ @@ -461,6 +464,7 @@ typedef struct XLogCtlData { XLogCtlInsert Insert; + XLogwrtAtomic LogwrtResult; /* uses atomics */ /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; XLogRecPtr RedoRecPtr; /* a recent copy of Insert->RedoRecPtr */ @@ -478,12 +482,6 @@ typedef struct XLogCtlData pg_time_t lastSegSwitchTime; XLogRecPtr lastSegSwitchLSN; - /* - * Protected by info_l
Re: LogwrtResult contended spinlock
On Fri, 2022-09-23 at 10:49 +0200, Alvaro Herrera wrote: > On 2022-Jul-28, Alvaro Herrera wrote: > > > v10 is just a trivial rebase. No changes. Moved to next > > commitfest. > > I realized that because of commit e369f3708636 this change is no > longer > as critical as it used to be, so I'm withdrawing this patch from the > commitfest. It looks like there's some renewed interest in this patch: https://www.postgresql.org/message-id/20240213001150.4uqzh7tinuhvo...@awork3.anarazel.de Even if there's no pressing performance concern now, it could clean up the complexity of having a non-shared copy of LogwrtResult. Regards, Jeff Davis
Re: LogwrtResult contended spinlock
On 2022-Jul-28, Alvaro Herrera wrote: > v10 is just a trivial rebase. No changes. Moved to next commitfest. I realized that because of commit e369f3708636 this change is no longer as critical as it used to be, so I'm withdrawing this patch from the commitfest. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "No necesitamos banderas No reconocemos fronteras" (Jorge González)
Re: LogwrtResult contended spinlock
v10 is just a trivial rebase. No changes. Moved to next commitfest. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ >From 86655a0231e277c3f1bc907a0f0eb669943d4c71 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 2 Feb 2021 14:03:43 -0300 Subject: [PATCH v10] Make XLogCtl->LogwrtResult accessible with atomics MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, access to LogwrtResult is protected by a spinlock. This becomes severely contended in some scenarios, such as with a largish replication flock: walsenders all calling GetFlushRecPtr repeatedly cause the processor to heat up to the point where eggs can be fried on top. This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult with a struct containing a pair of atomically accessed variables. Do so. In a few places, we can adjust the exact location where the locals are updated to account for the fact that we no longer need the spinlock. Author: Álvaro Herrera Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql --- src/backend/access/transam/xlog.c | 106 ++ src/include/port/atomics.h| 29 src/tools/pgindent/typedefs.list | 1 + 3 files changed, 78 insertions(+), 58 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 15ab8d90d4..29ebd38103 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -287,16 +287,13 @@ static bool doPageWrites; * * LogwrtRqst indicates a byte position that we need to write and/or fsync * the log up to (all records before that point must be written or fsynced). - * LogwrtResult indicates the byte positions we have already written/fsynced. 
+ * These structs are similar but are declared separately to indicate their + * slightly different functions; in addition, the latter is read and written + * using atomic operations. * - * To read XLogCtl->LogwrtResult, you must hold either info_lck or - * WALWriteLock. To update it, you need to hold both locks. The point of - * this arrangement is that the value can be examined by code that already - * holds WALWriteLock without needing to grab info_lck as well. In addition - * to the shared variable, each backend has a private copy of LogwrtResult, - * which is updated when convenient. + * In addition to the shared variable, each backend has a private copy of + * LogwrtResult, each member of which is separately updated when convenient. * * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst * (protected by info_lck), but we don't need to cache any copies of it. @@ -319,6 +316,12 @@ static bool doPageWrites; *-- */ +typedef struct XLogwrtAtomic +{ + pg_atomic_uint64 Write; /* last byte + 1 written out */ + pg_atomic_uint64 Flush; /* last byte + 1 flushed */ +} XLogwrtAtomic; + typedef struct XLogwrtRqst { XLogRecPtr Write; /* last byte + 1 to write out */ @@ -456,6 +459,7 @@ typedef struct XLogCtlData { XLogCtlInsert Insert; + XLogwrtAtomic LogwrtResult; /* uses atomics */ /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; XLogRecPtr RedoRecPtr; /* a recent copy of Insert->RedoRecPtr */ @@ -473,12 +477,6 @@ typedef struct XLogCtlData pg_time_t lastSegSwitchTime; XLogRecPtr lastSegSwitchLSN; - /* - * Protected by info_lck and WALWriteLock (you must hold either lock to - * read it, but both to update) - */ - XLogwrtResult LogwrtResult; - /* * Latest initialized page in the cache (last byte position + 1). * @@ -598,7 +596,7 @@ static ControlFileData *ControlFile = NULL; static int UsableBytesInSegment; /* - * Private, possibly out-of-date copy of shared LogwrtResult. + * Private, possibly out-of-date copy of shared XLogCtl->LogwrtResult. 
* See discussion above. */ static XLogwrtResult LogwrtResult = {0, 0}; @@ -907,8 +905,6 @@ XLogInsertRecord(XLogRecData *rdata, /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRqst.Write < EndPos) XLogCtl->LogwrtRqst.Write = EndPos; - /* update local result copy while I have the chance */ - LogwrtResult = XLogCtl->LogwrtResult; SpinLockRelease(&XLogCtl->info_lck); } @@ -1786,6 +1782,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) * Now that we have the lock, check if someone initialized the page * already. */ + LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write); while (upto >= XLogCtl->InitializedUpTo || opportunistic) { nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo); @@ -1805,17 +1802,18 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) if (opportunistic) break; -
Re: LogwrtResult contended spinlock
On 2022-Apr-05, Alvaro Herrera wrote: > Apologies -- I selected the wrong commit to extract the commit message > from. Here it is again. I also removed an obsolete /* XXX */ comment. I spent a lot of time staring at this to understand the need for memory barriers in these interactions. In the end I decided not to get this out for this cycle because I don't want to create subtle bugs in WAL. I'll come back with this for pg16. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El que vive para el futuro es un iluso, y el que vive para el pasado, un imbécil" (Luis Adler, "Los tripulantes de la noche")
Re: LogwrtResult contended spinlock
Apologies -- I selected the wrong commit to extract the commit message from. Here it is again. I also removed an obsolete /* XXX */ comment. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ >From 0b26f90f1b95f8e9b932eb34bbf9c2a50729cf60 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 2 Feb 2021 14:03:43 -0300 Subject: [PATCH v9] Make XLogCtl->LogwrtResult accessible with atomics MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, access to LogwrtResult is protected by a spinlock. This becomes severely contended in some scenarios, such as with a largish replication flock: walsenders all calling GetFlushRecPtr repeatedly cause the processor to heat up to the point where eggs can be fried on top. This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult with a struct containing a pair of atomically accessed variables. Do so. In a few places, we can adjust the exact location where the locals are updated to account for the fact that we no longer need the spinlock. Author: Álvaro Herrera Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql --- src/backend/access/transam/xlog.c | 106 ++ src/include/port/atomics.h| 29 src/tools/pgindent/typedefs.list | 1 + 3 files changed, 78 insertions(+), 58 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 17a56152f1..50fdfcbc90 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -285,16 +285,13 @@ static bool doPageWrites; * * LogwrtRqst indicates a byte position that we need to write and/or fsync * the log up to (all records before that point must be written or fsynced). - * LogwrtResult indicates the byte positions we have already written/fsynced. - * These structs are identical but are declared separately to indicate their - * slightly different functions. + * LogWrtResult indicates the byte positions we have already written/fsynced. 
+ * These structs are similar but are declared separately to indicate their + * slightly different functions; in addition, the latter is read and written + * using atomic operations. * - * To read XLogCtl->LogwrtResult, you must hold either info_lck or - * WALWriteLock. To update it, you need to hold both locks. The point of - * this arrangement is that the value can be examined by code that already - * holds WALWriteLock without needing to grab info_lck as well. In addition - * to the shared variable, each backend has a private copy of LogwrtResult, - * which is updated when convenient. + * In addition to the shared variable, each backend has a private copy of + * LogwrtResult, each member of which is separately updated when convenient. * * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst * (protected by info_lck), but we don't need to cache any copies of it. @@ -317,6 +314,12 @@ static bool doPageWrites; *-- */ +typedef struct XLogwrtAtomic +{ + pg_atomic_uint64 Write; /* last byte + 1 written out */ + pg_atomic_uint64 Flush; /* last byte + 1 flushed */ +} XLogwrtAtomic; + typedef struct XLogwrtRqst { XLogRecPtr Write; /* last byte + 1 to write out */ @@ -480,6 +483,7 @@ typedef struct XLogCtlData { XLogCtlInsert Insert; + XLogwrtAtomic LogwrtResult; /* uses atomics */ /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; XLogRecPtr RedoRecPtr; /* a recent copy of Insert->RedoRecPtr */ @@ -497,12 +501,6 @@ typedef struct XLogCtlData pg_time_t lastSegSwitchTime; XLogRecPtr lastSegSwitchLSN; - /* - * Protected by info_lck and WALWriteLock (you must hold either lock to - * read it, but both to update) - */ - XLogwrtResult LogwrtResult; - /* * Latest initialized page in the cache (last byte position + 1). * @@ -622,7 +620,7 @@ static ControlFileData *ControlFile = NULL; static int UsableBytesInSegment; /* - * Private, possibly out-of-date copy of shared LogwrtResult. + * Private, possibly out-of-date copy of shared XLogCtl->LogwrtResult. 
* See discussion above. */ static XLogwrtResult LogwrtResult = {0, 0}; @@ -932,8 +930,6 @@ XLogInsertRecord(XLogRecData *rdata, /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRqst.Write < EndPos) XLogCtl->LogwrtRqst.Write = EndPos; - /* update local result copy while I have the chance */ - LogwrtResult = XLogCtl->LogwrtResult; SpinLockRelease(&XLogCtl->info_lck); } @@ -1811,6 +1807,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) * Now that we have the lock, check if someone initialized the page * already. */ + LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write); while (upto >= XLogCtl->InitializedUpTo || opportunistic) { nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo); @@ -1830,17 +1827,18 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, Ti
Re: LogwrtResult contended spinlock
Here's a v8, where per my previous comment I removed some code that I believe is no longer necessary. I've omitted the patch that renames LogwrtResult subvariables into LogWriteResult/LogWriteFlush; I still think the end result is better after that one, but it's a pretty trivial change that can be dealt with separately afterwards. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "Saca el libro que tu religión considere como el indicado para encontrar la oración que traiga paz a tu alma. Luego rebootea el computador y ve si funciona" (Carlos Duclós) >From fdca5f32573342d950feefbdd4d64da6dd1cfb34 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 22 Mar 2022 16:49:40 +0100 Subject: [PATCH v8] Split LogwrtResult into separate variables Since the schedule of updating each portion is independent, it's not very useful to have them both as a single struct. Before we used atomics for the corresponding XLogCtl struct this made sense because we could use struct assignment, but that's no longer the case. Suggested by Andres Freund. --- src/backend/access/transam/xlog.c | 99 ++- src/include/port/atomics.h| 29 + src/tools/pgindent/typedefs.list | 1 + 3 files changed, 74 insertions(+), 55 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 17a56152f1..f1cb725bed 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -285,16 +285,13 @@ static bool doPageWrites; * * LogwrtRqst indicates a byte position that we need to write and/or fsync * the log up to (all records before that point must be written or fsynced). - * LogwrtResult indicates the byte positions we have already written/fsynced. - * These structs are identical but are declared separately to indicate their - * slightly different functions. + * LogWrtResult indicates the byte positions we have already written/fsynced. 
+ * These structs are similar but are declared separately to indicate their + * slightly different functions; in addition, the latter is read and written + * using atomic operations. * - * To read XLogCtl->LogwrtResult, you must hold either info_lck or - * WALWriteLock. To update it, you need to hold both locks. The point of - * this arrangement is that the value can be examined by code that already - * holds WALWriteLock without needing to grab info_lck as well. In addition - * to the shared variable, each backend has a private copy of LogwrtResult, - * which is updated when convenient. + * In addition to the shared variable, each backend has a private copy of + * LogwrtResult, each member of which is separately updated when convenient. * * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst * (protected by info_lck), but we don't need to cache any copies of it. @@ -317,6 +314,12 @@ static bool doPageWrites; *-- */ +typedef struct XLogwrtAtomic +{ + pg_atomic_uint64 Write; /* last byte + 1 written out */ + pg_atomic_uint64 Flush; /* last byte + 1 flushed */ +} XLogwrtAtomic; + typedef struct XLogwrtRqst { XLogRecPtr Write; /* last byte + 1 to write out */ @@ -480,6 +483,7 @@ typedef struct XLogCtlData { XLogCtlInsert Insert; + XLogwrtAtomic LogwrtResult; /* uses atomics */ /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; XLogRecPtr RedoRecPtr; /* a recent copy of Insert->RedoRecPtr */ @@ -497,12 +501,6 @@ typedef struct XLogCtlData pg_time_t lastSegSwitchTime; XLogRecPtr lastSegSwitchLSN; - /* - * Protected by info_lck and WALWriteLock (you must hold either lock to - * read it, but both to update) - */ - XLogwrtResult LogwrtResult; - /* * Latest initialized page in the cache (last byte position + 1). * @@ -622,7 +620,7 @@ static ControlFileData *ControlFile = NULL; static int UsableBytesInSegment; /* - * Private, possibly out-of-date copy of shared LogwrtResult. + * Private, possibly out-of-date copy of shared XLogCtl->LogwrtResult. 
* See discussion above. */ static XLogwrtResult LogwrtResult = {0, 0}; @@ -932,8 +930,6 @@ XLogInsertRecord(XLogRecData *rdata, /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRqst.Write < EndPos) XLogCtl->LogwrtRqst.Write = EndPos; - /* update local result copy while I have the chance */ - LogwrtResult = XLogCtl->LogwrtResult; SpinLockRelease(&XLogCtl->info_lck); } @@ -1811,6 +1807,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) * Now that we have the lock, check if someone initialized the page * already. */ + LogwrtResult.Write = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write); while (upto >= XLogCtl->InitializedUpTo || opportunistic) { nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo); @@ -1830,17 +1827,18 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) if (opportunistic) break; - /* Before waiting, ge
Re: LogwrtResult contended spinlock
So I've been wondering about this block at the bottom of XLogWrite: /* * Make sure that the shared 'request' values do not fall behind the * 'result' values. This is not absolutely essential, but it saves some * code in a couple of places. */ { SpinLockAcquire(&XLogCtl->info_lck); if (XLogCtl->LogwrtRqst.Write < LogwrtResult.Write) XLogCtl->LogwrtRqst.Write = LogwrtResult.Write; if (XLogCtl->LogwrtRqst.Flush < LogwrtResult.Flush) XLogCtl->LogwrtRqst.Flush = LogwrtResult.Flush; SpinLockRelease(&XLogCtl->info_lck); } I just noticed that my 0001 makes the comment a lie: it is now quite possible that 'result' is advanced beyond 'request'. Before the patch that never happened, because they were both advanced in the region locked by the spinlock. I think we could still maintain this promise if we just moved this entire block before the first pg_atomic_monotonic_advance_u64 setting XLogCtl->LogwrtResult.Write. Or we could split the block in two, and put one acquire/test/set/release stanza before each monotonic increase of the corresponding variable. However, I wonder if this is still necessary. This code was added in 4d14fe0048c (March 2001), and while everything else was quite different back then, this hasn't changed at all. I can't quite figure out which "couple of places" would need additional code if this block is just removed. I tried running the tests (including wal_consistency_checking), and nothing breaks. Reading the code surrounding the other accesses of XLogCtl->LogwrtRqst, there's nothing that looks to me like it depends on these values not lagging behind LogwrtResult. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "If you have nothing to say, maybe you need just the right tool to help you not say it." (New York Times, about Microsoft PowerPoint)
Re: LogwrtResult contended spinlock
On 2022-Mar-22, Tom Lane wrote: > I looked briefly at 0001, and I've got to say that I disagree with > your decision to rearrange the representation of the local LogwrtResult > copy. It clutters the patch tremendously and makes it hard to > understand what the actual functional change is. Moreover, I'm > not entirely convinced that it's a notational improvement in the > first place. > > Perhaps it'd help if you split 0001 into two steps, one to do the > mechanical change of the representation and then a second patch that > converts the shared variable to atomics. Since you've moved around > the places that read the shared variable, that part is subtler than > one could wish and really needs to be studied on its own. Hmm, I did it the other way around: first change to use atomics, then the mechanical change. I think that makes the usefulness of the change more visible, because before the switch to atomics, the use of the combined struct as a unit remains sensible. -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ Officer Krupke, what are we to do? Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke") >From dd9b53878faeedba921ea7027e98ddbb433e8647 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 2 Feb 2021 14:03:43 -0300 Subject: [PATCH v7 1/3] Change XLogCtl->LogwrtResult to use atomic ops MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, access to LogwrtResult is protected by a spinlock. This becomes severely contended in some scenarios, such as with a largish replication flock: walsenders all calling GetFlushRecPtr repeatedly cause the processor to heat up to the point where eggs can be fried on top. This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult with a struct containing a pair of atomically accessed variables. 
In addition, this commit splits the process-local copy of these values (kept in the freestanding LogwrtResult struct) into two separate LogWriteResult and LogFlushResult. This is not strictly necessary, but it makes it clearer that these are updated separately, each on their own schedule. Author: Álvaro Herrera Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql --- src/backend/access/transam/xlog.c | 85 +++ src/include/port/atomics.h| 29 +++ src/tools/pgindent/typedefs.list | 1 + 3 files changed, 72 insertions(+), 43 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 4ac3871c74..6f2eb494fe 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -285,16 +285,13 @@ static bool doPageWrites; * * LogwrtRqst indicates a byte position that we need to write and/or fsync * the log up to (all records before that point must be written or fsynced). - * LogwrtResult indicates the byte positions we have already written/fsynced. - * These structs are identical but are declared separately to indicate their - * slightly different functions. + * LogWrtResult indicates the byte positions we have already written/fsynced. + * These structs are similar but are declared separately to indicate their + * slightly different functions; in addition, the latter is read and written + * using atomic operations. * - * To read XLogCtl->LogwrtResult, you must hold either info_lck or - * WALWriteLock. To update it, you need to hold both locks. The point of - * this arrangement is that the value can be examined by code that already - * holds WALWriteLock without needing to grab info_lck as well. In addition - * to the shared variable, each backend has a private copy of LogwrtResult, - * which is updated when convenient. + * In addition to the shared variable, each backend has a private copy of + * LogwrtResult, each member of which is separately updated when convenient. 
* * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst * (protected by info_lck), but we don't need to cache any copies of it. @@ -317,6 +314,12 @@ static bool doPageWrites; *-- */ +typedef struct XLogwrtAtomic +{ + pg_atomic_uint64 Write; /* last byte + 1 written out */ + pg_atomic_uint64 Flush; /* last byte + 1 flushed */ +} XLogwrtAtomic; + typedef struct XLogwrtRqst { XLogRecPtr Write; /* last byte + 1 to write out */ @@ -480,6 +483,7 @@ typedef struct XLogCtlData { XLogCtlInsert Insert; + XLogwrtAtomic LogwrtResult; /* uses atomics */ /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; XLogRecPtr RedoRecPtr; /* a recent copy of Insert->RedoRecPtr */ @@ -497,12 +501,6 @@ typedef struct XLogCtlData pg_time_t lastSegSwitchTime; XLogRecPtr lastSegSwitchLSN; - /* - * Protected by info_lck and WALWriteLock (you must hold either lock to - * read it, but both to update) - */ - XLogwrtResult LogwrtResult; - /* * Latest initialized page in the cache (last byte position + 1). * @@ -622,7 +620,7
Re: LogwrtResult contended spinlock
Alvaro Herrera writes: > On 2022-Mar-21, Andres Freund wrote: >> Are you aiming this for v15? Otherwise I'd like to move the entry to the next >> CF. Marked as waiting-on-author. > I'd like to get 0001 pushed to pg15, yes. I'll let 0002 sit here for > discussion, but I haven't seen any evidence that we need it. If others > vouch for it, I can push that one too, but I'd rather have it be a > separate thing. I looked briefly at 0001, and I've got to say that I disagree with your decision to rearrange the representation of the local LogwrtResult copy. It clutters the patch tremendously and makes it hard to understand what the actual functional change is. Moreover, I'm not entirely convinced that it's a notational improvement in the first place. Perhaps it'd help if you split 0001 into two steps, one to do the mechanical change of the representation and then a second patch that converts the shared variable to atomics. Since you've moved around the places that read the shared variable, that part is subtler than one could wish and really needs to be studied on its own. regards, tom lane
Re: LogwrtResult contended spinlock
On 2022-Mar-21, Andres Freund wrote: > This patch currently doesn't apply: http://cfbot.cputube.org/patch_37_2716.log Updated. > Are you aiming this for v15? Otherwise I'd like to move the entry to the next > CF. Marked as waiting-on-author. I'd like to get 0001 pushed to pg15, yes. I'll let 0002 sit here for discussion, but I haven't seen any evidence that we need it. If others vouch for it, I can push that one too, but I'd rather have it be a separate thing. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/ "El miedo atento y previsor es la madre de la seguridad" (E. Burke) >From 4d5a8d915b497b79bbe62e18352f7b9d8c1b9bca Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 2 Feb 2021 14:03:43 -0300 Subject: [PATCH v6 1/2] Change XLogCtl->LogwrtResult to use atomic ops MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Currently, access to LogwrtResult is protected by a spinlock. This becomes severely contended in some scenarios, such as with a largish replication flock: walsenders all calling GetFlushRecPtr repeatedly cause the processor to heat up to the point where eggs can be fried on top. This can be reduced to a non-problem by replacing XLogCtl->LogwrtResult with a struct containing a pair of atomically accessed variables. In addition, this commit splits the process-local copy of these values (kept in the freestanding LogwrtResult struct) into two separate LogWriteResult and LogFlushResult. This is not strictly necessary, but it makes it clearer that these are updated separately, each on their own schedule. 
Author: Álvaro Herrera Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql --- src/backend/access/transam/xlog.c | 195 +++--- src/include/port/atomics.h| 29 + src/tools/pgindent/typedefs.list | 1 + 3 files changed, 125 insertions(+), 100 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 4ac3871c74..311a8a1192 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -285,16 +285,14 @@ static bool doPageWrites; * * LogwrtRqst indicates a byte position that we need to write and/or fsync * the log up to (all records before that point must be written or fsynced). - * LogwrtResult indicates the byte positions we have already written/fsynced. - * These structs are identical but are declared separately to indicate their - * slightly different functions. + * LogWrtResult indicates the byte positions we have already written/fsynced. + * These structs are similar but are declared separately to indicate their + * slightly different functions; in addition, the latter is read and written + * using atomic operations. * - * To read XLogCtl->LogwrtResult, you must hold either info_lck or - * WALWriteLock. To update it, you need to hold both locks. The point of - * this arrangement is that the value can be examined by code that already - * holds WALWriteLock without needing to grab info_lck as well. In addition - * to the shared variable, each backend has a private copy of LogwrtResult, - * which is updated when convenient. + * In addition to the shared variable, each backend has a private copy of + * each member of LogwrtResult (LogWriteResult and LogFlushResult), each of + * which is separately updated when convenient. * * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst * (protected by info_lck), but we don't need to cache any copies of it. 
@@ -317,18 +315,18 @@ static bool doPageWrites; *-- */ +typedef struct XLogwrtAtomic +{ + pg_atomic_uint64 Write; /* last byte + 1 of write position */ + pg_atomic_uint64 Flush; /* last byte + 1 of flush position */ +} XLogwrtAtomic; + typedef struct XLogwrtRqst { XLogRecPtr Write; /* last byte + 1 to write out */ XLogRecPtr Flush; /* last byte + 1 to flush */ } XLogwrtRqst; -typedef struct XLogwrtResult -{ - XLogRecPtr Write; /* last byte + 1 written out */ - XLogRecPtr Flush; /* last byte + 1 flushed */ -} XLogwrtResult; - /* * Inserting to WAL is protected by a small fixed number of WAL insertion * locks. To insert to the WAL, you must hold one of the locks - it doesn't @@ -480,6 +478,7 @@ typedef struct XLogCtlData { XLogCtlInsert Insert; + XLogwrtAtomic LogwrtResult; /* uses atomics */ /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; XLogRecPtr RedoRecPtr; /* a recent copy of Insert->RedoRecPtr */ @@ -497,12 +496,6 @@ typedef struct XLogCtlData pg_time_t lastSegSwitchTime; XLogRecPtr lastSegSwitchLSN; - /* - * Protected by info_lck and WALWriteLock (you must hold either lock to - * read it, but both to update) - */ - XLogwrtResult LogwrtResult; - /* * Latest initialized page in the cache (last byte position + 1). * @@ -622,10 +615,11 @@ static ControlFileData *ControlFile = NULL; static int UsableBytesInSegment; /* - * Private
Re: LogwrtResult contended spinlock
Hi, On 2021-11-22 18:56:43 -0300, Alvaro Herrera wrote: > There was an earlier comment by Andres that asyncXactLSN should also be > atomic, to avoid an ugly spinlock interaction with the new atomic-based > logwrtResult. The 0002 here is an attempt at doing that; I found that > it also needed to change WalWriterSleeping to use atomics, to avoid > XLogSetAsyncXactLSN having to grab the spinlock for that. This patch currently doesn't apply: http://cfbot.cputube.org/patch_37_2716.log Are you aiming this for v15? Otherwise I'd like to move the entry to the next CF. Marked as waiting-on-author. Greetings, Andres Freund
Re: LogwrtResult contended spinlock
There was an earlier comment by Andres that asyncXactLSN should also be atomic, to avoid an ugly spinlock interaction with the new atomic-based logwrtResult. The 0002 here is an attempt at doing that; I found that it also needed to change WalWriterSleeping to use atomics, to avoid XLogSetAsyncXactLSN having to grab the spinlock for that. -- Álvaro Herrera Valdivia, Chile — https://www.EnterpriseDB.com/ "Learn about compilers. Then everything looks like either a compiler or a database, and now you have two problems but one of them is fun." https://twitter.com/thingskatedid/status/1456027786158776329 >From 0d48422df1a0e738721bfba5cd2d3f1ede5db423 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 2 Feb 2021 14:03:43 -0300 Subject: [PATCH v5 1/2] make LogwrtResult atomic --- src/backend/access/transam/xlog.c | 192 +++--- src/include/port/atomics.h| 29 + src/tools/pgindent/typedefs.list | 1 + 3 files changed, 123 insertions(+), 99 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 221e4cb34f..225ac8bc2d 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -382,16 +382,14 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr; * * LogwrtRqst indicates a byte position that we need to write and/or fsync * the log up to (all records before that point must be written or fsynced). - * LogwrtResult indicates the byte positions we have already written/fsynced. + * LogWriteResult, LogFlushResult indicate the byte positions we have already + * written/fsynced. * These structs are identical but are declared separately to indicate their * slightly different functions. * - * To read XLogCtl->LogwrtResult, you must hold either info_lck or - * WALWriteLock. To update it, you need to hold both locks. The point of - * this arrangement is that the value can be examined by code that already - * holds WALWriteLock without needing to grab info_lck as well. 
In addition - * to the shared variable, each backend has a private copy of LogwrtResult, - * which is updated when convenient. + * XLogCtl->LogwrtResult is read and written using atomic operations. + * In addition to the shared variable, each backend has a private copy of + * LogwrtResult, each member of which is separately updated when convenient. * * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst * (protected by info_lck), but we don't need to cache any copies of it. @@ -414,18 +412,18 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr; *-- */ +typedef struct XLogwrtAtomic +{ + pg_atomic_uint64 Write; /* last byte + 1 of write position */ + pg_atomic_uint64 Flush; /* last byte + 1 of flush position */ +} XLogwrtAtomic; + typedef struct XLogwrtRqst { XLogRecPtr Write; /* last byte + 1 to write out */ XLogRecPtr Flush; /* last byte + 1 to flush */ } XLogwrtRqst; -typedef struct XLogwrtResult -{ - XLogRecPtr Write; /* last byte + 1 written out */ - XLogRecPtr Flush; /* last byte + 1 flushed */ -} XLogwrtResult; - /* * Inserting to WAL is protected by a small fixed number of WAL insertion * locks. To insert to the WAL, you must hold one of the locks - it doesn't @@ -577,6 +575,7 @@ typedef struct XLogCtlData { XLogCtlInsert Insert; + XLogwrtAtomic LogwrtResult; /* uses atomics */ /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; XLogRecPtr RedoRecPtr; /* a recent copy of Insert->RedoRecPtr */ @@ -594,12 +593,6 @@ typedef struct XLogCtlData pg_time_t lastSegSwitchTime; XLogRecPtr lastSegSwitchLSN; - /* - * Protected by info_lck and WALWriteLock (you must hold either lock to - * read it, but both to update) - */ - XLogwrtResult LogwrtResult; - /* * Latest initialized page in the cache (last byte position + 1). * @@ -770,10 +763,11 @@ static ControlFileData *ControlFile = NULL; static int UsableBytesInSegment; /* - * Private, possibly out-of-date copy of shared LogwrtResult. 
+ * Private, possibly out-of-date copy of shared XLogCtl->LogwrtResult. * See discussion above. */ -static XLogwrtResult LogwrtResult = {0, 0}; +static XLogRecPtr LogWriteResult = 0; +static XLogRecPtr LogFlushResult = 0; /* * Codes indicating where we got a WAL file from during recovery, or where @@ -1201,8 +1195,6 @@ XLogInsertRecord(XLogRecData *rdata, /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRqst.Write < EndPos) XLogCtl->LogwrtRqst.Write = EndPos; - /* update local result copy while I have the chance */ - LogwrtResult = XLogCtl->LogwrtResult; SpinLockRelease(&XLogCtl->info_lck); } @@ -2174,6 +2166,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) * Now that we have the lock, check if someone initialized the page * already. */ + LogWriteResult = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write); while (upto >= XLogCtl->Ini
Re: LogwrtResult contended spinlock
Here's a further attempt at this. Sorry it took so long. In this version, I replaced the coupled-in-a-struct representation of Write&Flush with two separate global variables. The reason to do this is to cater to Andres' idea to keep them up-to-date separately. Of course, I could have kept them together, but it seems more sensible this way. Andres also suggested to not use monotonic-advance in XLogWrite, because these updates are always done with WALWriteLock held in exclusive mode. However, that doesn't work (and tests fail pretty quickly) because XLogWrite seems to be called possibly out of order, so you still have to check if the current value is newer than what you have to write anyway; using stock atomic write doesn't work. So this is v4. -- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/ diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 1616448368..ad74862cb6 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -382,16 +382,14 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr; * * LogwrtRqst indicates a byte position that we need to write and/or fsync * the log up to (all records before that point must be written or fsynced). - * LogwrtResult indicates the byte positions we have already written/fsynced. + * LogWriteResult, LogFlushResult indicate the byte positions we have already + * written/fsynced. * These structs are identical but are declared separately to indicate their * slightly different functions. * - * To read XLogCtl->LogwrtResult, you must hold either info_lck or - * WALWriteLock. To update it, you need to hold both locks. The point of - * this arrangement is that the value can be examined by code that already - * holds WALWriteLock without needing to grab info_lck as well. In addition - * to the shared variable, each backend has a private copy of LogwrtResult, - * which is updated when convenient. 
+ * XLogCtl->LogwrtResult is read and written using atomic operations. + * In addition to the shared variable, each backend has a private copy of + * LogwrtResult, each member of which is separately updated when convenient. * * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst * (protected by info_lck), but we don't need to cache any copies of it. @@ -414,18 +412,18 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr; *-- */ +typedef struct XLogwrtAtomic +{ + pg_atomic_uint64 Write; /* last byte + 1 of write position */ + pg_atomic_uint64 Flush; /* last byte + 1 of flush position */ +} XLogwrtAtomic; + typedef struct XLogwrtRqst { XLogRecPtr Write; /* last byte + 1 to write out */ XLogRecPtr Flush; /* last byte + 1 to flush */ } XLogwrtRqst; -typedef struct XLogwrtResult -{ - XLogRecPtr Write; /* last byte + 1 written out */ - XLogRecPtr Flush; /* last byte + 1 flushed */ -} XLogwrtResult; - /* * Inserting to WAL is protected by a small fixed number of WAL insertion * locks. To insert to the WAL, you must hold one of the locks - it doesn't @@ -577,6 +575,7 @@ typedef struct XLogCtlData { XLogCtlInsert Insert; + XLogwrtAtomic LogwrtResult; /* uses atomics */ /* Protected by info_lck: */ XLogwrtRqst LogwrtRqst; XLogRecPtr RedoRecPtr; /* a recent copy of Insert->RedoRecPtr */ @@ -594,12 +593,6 @@ typedef struct XLogCtlData pg_time_t lastSegSwitchTime; XLogRecPtr lastSegSwitchLSN; - /* - * Protected by info_lck and WALWriteLock (you must hold either lock to - * read it, but both to update) - */ - XLogwrtResult LogwrtResult; - /* * Latest initialized page in the cache (last byte position + 1). * @@ -770,10 +763,11 @@ static ControlFileData *ControlFile = NULL; static int UsableBytesInSegment; /* - * Private, possibly out-of-date copy of shared LogwrtResult. + * Private, possibly out-of-date copy of shared XLogCtl->LogwrtResult. * See discussion above. 
*/ -static XLogwrtResult LogwrtResult = {0, 0}; +static XLogRecPtr LogWriteResult = 0; +static XLogRecPtr LogFlushResult = 0; /* * Codes indicating where we got a WAL file from during recovery, or where @@ -1201,8 +1195,6 @@ XLogInsertRecord(XLogRecData *rdata, /* advance global request to include new block(s) */ if (XLogCtl->LogwrtRqst.Write < EndPos) XLogCtl->LogwrtRqst.Write = EndPos; - /* update local result copy while I have the chance */ - LogwrtResult = XLogCtl->LogwrtResult; SpinLockRelease(&XLogCtl->info_lck); } @@ -2174,6 +2166,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) * Now that we have the lock, check if someone initialized the page * already. */ + LogWriteResult = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write); while (upto >= XLogCtl->InitializedUpTo || opportunistic) { nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo); @@ -2184,7 +2177,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic) * alread
Re: LogwrtResult contended spinlock
> On 8 Sep 2021, at 17:14, Jaime Casanova wrote: > > On Tue, Feb 02, 2021 at 08:19:19PM -0300, Alvaro Herrera wrote: >> Hello, >> >> So I addressed about half of your comments in this version merely by >> fixing silly bugs. The problem I had which I described as >> "synchronization fails" was one of those silly bugs. >> > > Hi Álvaro, > > Are we waiting for another version of the patch based on Andres' > comments? Or this version is good enough for testing? Any update on this? Should the patch be reset back to "Needs review" or will there be a new version? -- Daniel Gustafsson https://vmware.com/
Re: LogwrtResult contended spinlock
On Tue, Feb 02, 2021 at 08:19:19PM -0300, Alvaro Herrera wrote: > Hello, > > So I addressed about half of your comments in this version merely by > fixing silly bugs. The problem I had which I described as > "synchronization fails" was one of those silly bugs. > Hi Álvaro, Are we waiting for another version of the patch based on Andres' comments? Or this version is good enough for testing? BTW, current patch still applies cleanly. regards, -- Jaime Casanova Director de Servicios Profesionales SystemGuards - Consultores de PostgreSQL
Re: LogwrtResult contended spinlock
Hi, On 2021-02-02 20:19:19 -0300, Alvaro Herrera wrote: > > > @@ -1169,6 +1169,8 @@ XLogInsertRecord(XLogRecData *rdata, > > > { > > > /* advance global request to include new block(s) > > > */ > > > pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, > > > EndPos); > > > + pg_memory_barrier(); > > > > That's not really useful - the path that actually updates already > > implies a barrier. It'd probably be better to add a barrier to a "never > > executed cmpxchg" fastpath. > > Got it. Do you mean in pg_atomic_monotonic_advance_u64() itself? Yes. > I'm not sure which is the nicer semantics. (If it's got to be at the > caller, then we'll need to return a boolean from there, which sounds > worse.) Nearly all other modifying atomic operations have full barrier semantics, so I think it'd be better to have it inside the pg_atomic_monotonic_advance_u64(). > +/* > + * Monotonically advance the given variable using only atomic operations > until > + * it's at least the target value. > + */ > +static inline void > +pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 > target_) > +{ > + uint64 currval; > + > +#ifndef PG_HAVE_ATOMIC_U64_SIMULATION > + AssertPointerAlignment(ptr, 8); > +#endif > + > + currval = pg_atomic_read_u64(ptr); > + while (currval < target_) > + { > + if (pg_atomic_compare_exchange_u64(ptr, &currval, target_)) > + break; > + } > +} So I think it'd be static inline void pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_) { uint64 currval; #ifndef PG_HAVE_ATOMIC_U64_SIMULATION AssertPointerAlignment(ptr, 8); #endif currval = pg_atomic_read_u64(ptr); if (currval >= target_) { pg_memory_barrier(); return; } while (currval < target_) { if (pg_atomic_compare_exchange_u64(ptr, &currval, target_)) break; } } > @@ -1172,9 +1170,10 @@ XLogInsertRecord(XLogRecData *rdata, > /* advance global request to include new block(s) */ > if (XLogCtl->LogwrtRqst.Write < EndPos) > XLogCtl->LogwrtRqst.Write = EndPos; > - /* 
update local result copy while I have the chance */ > - LogwrtResult = XLogCtl->LogwrtResult; > SpinLockRelease(&XLogCtl->info_lck); > + /* update local result copy */ > + LogwrtResult.Write = > pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write); > + LogwrtResult.Flush = > pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush); > } As mentioned before - it's not clear to me why this is a valid thing to do without verifying all LogwrtResult.* usages. You can get updates completely out of order / independently. > @@ -2675,8 +2673,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible) >* code in a couple of places. >*/ > { > + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtResult.Write, > LogwrtResult.Write); > + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtResult.Flush, > LogwrtResult.Flush); > + pg_memory_barrier(); > SpinLockAcquire(&XLogCtl->info_lck); > - XLogCtl->LogwrtResult = LogwrtResult; I still don't see why we need "locked" atomic operations here, rather than just a pg_atomic_write_u64(). They can only be modified with WALWriteLock held. There's two reasons for using a spinlock in this place: 1) it avoids torn reads of 64bit values - pg_atomic_write_u64()/pg_atomic_read_u64() avoid that already. 2) it ensures that Write/Flush are updated in unison - but that's not useful anymore, given that other places now read the variables separately. > @@ -3064,8 +3063,10 @@ XLogBackgroundFlush(void) > WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ; > > /* if we have already flushed that far, consider async commit records */ > + LogwrtResult.Flush = pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush); > if (WriteRqst.Write <= LogwrtResult.Flush) > { > + pg_memory_barrier(); > SpinLockAcquire(&XLogCtl->info_lck); > WriteRqst.Write = XLogCtl->asyncXactLSN; > SpinLockRelease(&XLogCtl->info_lck); A SpinLockAcquire() is a full memory barrier on its own I think. This'd probably be better solved by just making asyncXactLSN atomic... Greetings, Andres Freund
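Andres' suggested shape for the helper -- a full barrier on the "never executed cmpxchg" fast path, so the function keeps the full-barrier semantics of other modifying atomic operations -- can be illustrated outside the tree. The following is a hypothetical standalone sketch using C11 <stdatomic.h>; the function name and the seq_cst fence standing in for pg_memory_barrier() are assumptions for illustration, not PostgreSQL code:

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Standalone sketch of a monotonic-advance helper in the shape discussed
 * above.  If the variable is already at or past the target, no
 * compare-and-exchange runs, so a fence is issued explicitly to preserve
 * full-barrier semantics on that fast path as well.
 */
static inline void
monotonic_advance_u64(_Atomic uint64_t *ptr, uint64_t target)
{
	uint64_t	currval = atomic_load(ptr);

	if (currval >= target)
	{
		/* fast path: the cmpxchg below never executes, so fence here */
		atomic_thread_fence(memory_order_seq_cst);
		return;
	}

	while (currval < target)
	{
		/* on failure, currval is refreshed with the value actually seen */
		if (atomic_compare_exchange_strong(ptr, &currval, target))
			break;
	}
}
```

Single-threaded behaviour is easy to check: advancing past the current value stores the target, while "advancing" to a smaller value leaves the variable untouched.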
Re: LogwrtResult contended spinlock
Hello, So I addressed about half of your comments in this version merely by fixing silly bugs. The problem I had which I described as "synchronization fails" was one of those silly bugs. So in further thinking, it seems simpler to make only LogwrtResult atomic, and leave LogwrtRqst as currently, using the spinlock. This should solve the contention problem we saw at the customer (but I've asked Jaime very nicely to do a test run, if possible, to confirm). For things like logical replication, which call GetFlushRecPtr() very frequently (causing the spinlock issue we saw) it is good, because we're no longer hitting the spinlock at all in that case. I have another (pretty mechanical) patch that renames LogwrtResult.Write to LogWriteResult and LogwrtResult.Flush to LogFlushResult. That more clearly shows that we're no longer updating them in unison. Didn't want to attach here because I didn't rebase on current one. But it seems logical: there's no longer any point in doing struct assignment, which is the only thing that stuff was good for. On 2021-Jan-29, Andres Freund wrote: > > @@ -1169,6 +1169,8 @@ XLogInsertRecord(XLogRecData *rdata, > > { > > /* advance global request to include new block(s) > > */ > > pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, > > EndPos); > > + pg_memory_barrier(); > > That's not really useful - the path that actually updates already > implies a barrier. It'd probably be better to add a barrier to a "never > executed cmpxchg" fastpath. Got it. Do you mean in pg_atomic_monotonic_advance_u64() itself? I'm not sure which is the nicer semantics. (If it's got to be at the caller, then we'll need to return a boolean from there, which sounds worse.)
> > @@ -2905,6 +2909,7 @@ XLogFlush(XLogRecPtr record) > > WriteRqstPtr = pg_atomic_read_u64(&XLogCtl->LogwrtRqst.Write); > > LogwrtResult.Write = > > pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write); > > LogwrtResult.Flush = > > pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush); > > + pg_read_barrier(); > > I'm not sure you really can get away with just a read barrier in these > places. We can't e.g. have later updates to other shared memory > variables be "moved" to before the barrier (which a read barrier > allows). Ah, that makes sense. I have not really studied the barrier locations terribly closely in this version of the patch. It probably misses some (eg. in GetFlushRecPtr and GetXLogWriteRecPtr). It is passing the tests for me, but alone that's probably not enough. I'm gonna try and study the generated assembly and see if I can make sense of things ... -- Álvaro Herrera 39°49'30"S 73°17'W >From ca05447f41270d6b3ac3b6fcf816f86aba86d08f Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 29 Jan 2021 22:34:23 -0300 Subject: [PATCH 1/2] add pg_atomic_monotonic_advance_u64 --- src/include/port/atomics.h | 21 + 1 file changed, 21 insertions(+) diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h index 856338f161..a1c4764d18 100644 --- a/src/include/port/atomics.h +++ b/src/include/port/atomics.h @@ -519,6 +519,27 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_) return pg_atomic_sub_fetch_u64_impl(ptr, sub_); } +/* + * Monotonically advance the given variable using only atomic operations until + * it's at least the target value. 
+ */ +static inline void +pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_) +{ + uint64 currval; + +#ifndef PG_HAVE_ATOMIC_U64_SIMULATION + AssertPointerAlignment(ptr, 8); +#endif + + currval = pg_atomic_read_u64(ptr); + while (currval < target_) + { + if (pg_atomic_compare_exchange_u64(ptr, &currval, target_)) + break; + } +} + #undef INSIDE_ATOMICS_H #endif /* ATOMICS_H */ -- 2.20.1 >From d185cf50a47a0f9e346e49ccda038bb016ce246b Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Tue, 2 Feb 2021 14:03:43 -0300 Subject: [PATCH 2/2] make LogwrtResult atomic --- src/backend/access/transam/xlog.c | 66 +++ 1 file changed, 32 insertions(+), 34 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f03bd473e2..10802bd56f 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -405,12 +405,9 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr; * These structs are identical but are declared separately to indicate their * slightly different functions. * - * To read XLogCtl->LogwrtResult, you must hold either info_lck or - * WALWriteLock. To update it, you need to hold both locks. The point of - * this arrangement is that the value can be examined by code that already - * holds WALWriteLock without needing to grab info_lck as well. In addition - * to the shared variable, each backend has a private copy of LogwrtResult, - * which
Re: LogwrtResult contended spinlock
Hi, On 2021-01-29 22:47:56 -0300, Alvaro Herrera wrote: > So I tried this, but -- perhaps not suprisingly -- I can't get it to > work properly; the synchronization fails. What do you mean by "synchronization fails"? > Strangely, all tests work for me, but the pg_upgrade one in particular > fails. It's one of the few tests using fsync=on. > +static inline void > +pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 > target_) > +{ > +#ifndef PG_HAVE_ATOMIC_U64_SIMULATION > + AssertPointerAlignment(ptr, 8); > +#endif > + /* FIXME is this algorithm correct if we have u64 simulation? */ I don't see a problem. > + while (true) > + { > + uint64 currval; > + > + currval = pg_atomic_read_u64(ptr); > + if (currval > target_) > + break; /* already done by somebody else */ > + if (pg_atomic_compare_exchange_u64(ptr, &currval, target_)) > + break; /* successfully done */ > + } > +} I suggest writing this as currval = pg_atomic_read_u64(ptr); while (currval < target_) { if (pg_atomic_compare_exchange_u64(ptr, &currval, target_)) break; } > /* > * Inserting to WAL is protected by a small fixed number of WAL insertion > @@ -596,8 +599,10 @@ typedef struct XLogCtlData > { > XLogCtlInsert Insert; > > + XLogwrtAtomic LogwrtRqst; > + XLogwrtAtomic LogwrtResult; > + > /* Protected by info_lck: */ > - XLogwrtRqst LogwrtRqst; Not sure putting these into the same cacheline is a good idea. 
> @@ -2166,12 +2163,11 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool > opportunistic) > if (opportunistic) > break; > > - /* Before waiting, get info_lck and update LogwrtResult > */ > - SpinLockAcquire(&XLogCtl->info_lck); > - if (XLogCtl->LogwrtRqst.Write < OldPageRqstPtr) > - XLogCtl->LogwrtRqst.Write = OldPageRqstPtr; > - LogwrtResult = XLogCtl->LogwrtResult; > - SpinLockRelease(&XLogCtl->info_lck); > + /* Before waiting, update LogwrtResult */ > + > pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, OldPageRqstPtr); > + > + LogwrtResult.Write = > pg_atomic_read_u64(&XLogCtl->LogwrtResult.Write); > + LogwrtResult.Flush = > pg_atomic_read_u64(&XLogCtl->LogwrtResult.Flush); I don't think it's quite as easy as this. Write/Flush now aren't guaranteed to be coherent with each other - previously they were. And because it's in a global variable used everywhere, we can't just be more careful about update protocols in one place... We also shouldn't re-read a variable that we just did via the pg_atomic_monotonic_advance_u64(). I think we should stop updating both the Write/Flush position at the same time. That way we don't have an expectation of them being coherent with each other. Most places really don't need both anyway. > { > - SpinLockAcquire(&XLogCtl->info_lck); > - XLogCtl->LogwrtResult = LogwrtResult; > - if (XLogCtl->LogwrtRqst.Write < LogwrtResult.Write) > - XLogCtl->LogwrtRqst.Write = LogwrtResult.Write; > - if (XLogCtl->LogwrtRqst.Flush < LogwrtResult.Flush) > - XLogCtl->LogwrtRqst.Flush = LogwrtResult.Flush; > - SpinLockRelease(&XLogCtl->info_lck); > + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtResult.Write, > LogwrtResult.Write); > + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtResult.Flush, > LogwrtResult.Flush); > + > + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, > LogwrtResult.Write); > + pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Flush, > LogwrtResult.Flush); Hm. 
We went from one cheap atomic operation (SpinLockAcquire) to four expensive ones in the happy path. That's not free... I don't think we need to manipulate XLogCtl->LogwrtResult.* using atomic ops - they can only be updated with WALWriteLock held, right? XLogCtl->LogwrtResult was updated with plain assignment before, why did you change it to pg_atomic_monotonic_advance_u64()? > @@ -1169,6 +1169,8 @@ XLogInsertRecord(XLogRecData *rdata, > { > /* advance global request to include new block(s) */ > pg_atomic_monotonic_advance_u64(&XLogCtl->LogwrtRqst.Write, > EndPos); > + pg_memory_barrier(); > + That's not really useful - the path that actually updates already implies a barrier. It'd probably be better to add a barrier to a "never executed cmpxchg" fastpath. > @@ -2905,6 +2909,7 @@ XLogFlush(XLogRecPtr record) > WriteRqstPtr = pg_atomic_read_u64(&XLogCtl->LogwrtRqst.Write); > LogwrtResult.Write = > pg_atomic_read
Re: LogwrtResult contended spinlock
Hi, On 2021-01-29 12:40:18 -0300, Alvaro Herrera wrote: > On 2020-Aug-31, Andres Freund wrote: > > > Wouldn't the better fix here be to allow reading of individual members > > without a lock? E.g. by wrapping each in a 64bit atomic. > > So I've been playing with this and I'm annoyed about having two > datatypes to represent Write/Flush positions: > > typedef struct XLogwrtRqst > { > XLogRecPtr Write; /* last byte + 1 to write out */ > XLogRecPtr Flush; /* last byte + 1 to flush */ > } XLogwrtRqst; > > typedef struct XLogwrtResult > { > XLogRecPtr Write; /* last byte + 1 written out */ > XLogRecPtr Flush; /* last byte + 1 flushed */ > } XLogwrtResult; > > Don't they look, um, quite similar? I am strongly tempted to remove > that distinction, since it seems quite pointless, and introduce a > different one: Their spelling drives me nuts. Like, one is *Rqst, the other *Result? Come on. > Now, I do wonder if there's a point in keeping LogwrtResult as a local > variable at all; maybe since the ones in shared memory are going to use > unlocked access, we don't need it anymore? I'd prefer to defer that > decision to after this patch is done, since ISTM that it'd merit more > careful benchmarking. I think doing that might be a bit harmful - we update LogwrtResult fairly granularly in XLogWrite(). Doing that in those small steps in shared memory will increase the likelihood of cache misses in other backends. Greetings, Andres Freund
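The cache-miss argument for keeping a process-local copy can be sketched in a few lines. This is a hypothetical C11 illustration (not PostgreSQL code; the names and the step loop are invented) of why XLogWrite-style code wants to advance a private variable in small steps and publish to shared memory only once:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Process-local cache of the write position (invented name). */
static uint64_t local_write_result = 0;

/*
 * The write position advances in many small steps, but only the final
 * value is stored to shared memory.  Other backends' cached copies of the
 * shared cache line are therefore invalidated once, not once per step.
 */
static void
write_in_steps(_Atomic uint64_t *shared_write,
			   const uint64_t *step_ends, int nsteps)
{
	for (int i = 0; i < nsteps; i++)
	{
		/* ... imagine WAL pages being written up to step_ends[i] ... */
		local_write_result = step_ends[i];	/* cheap, process-local */
	}

	/* publish once at the end */
	atomic_store(shared_write, local_write_result);
}
```

Updating the shared atomic inside the loop instead would be correct but would turn every intermediate step into cross-backend cache traffic, which is the cost Andres points at above.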
Re: LogwrtResult contended spinlock
So I tried this, but -- perhaps not surprisingly -- I can't get it to work properly; the synchronization fails. I suspect I need some barriers, but I tried adding a few (probably some that are not really necessary) and that didn't have the expected effect. Strangely, all tests work for me, but the pg_upgrade one in particular fails. (The attached is of course POC quality at best.) I'll have another look next week. -- Álvaro Herrera 39°49'30"S 73°17'W >From 90507185357391a661cd856fc231b140079c8d78 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 29 Jan 2021 22:34:23 -0300 Subject: [PATCH v2 1/3] add pg_atomic_monotonic_advance_u64 --- src/include/port/atomics.h | 19 +++ 1 file changed, 19 insertions(+) diff --git a/src/include/port/atomics.h b/src/include/port/atomics.h index 856338f161..752f39d66e 100644 --- a/src/include/port/atomics.h +++ b/src/include/port/atomics.h @@ -519,6 +519,25 @@ pg_atomic_sub_fetch_u64(volatile pg_atomic_uint64 *ptr, int64 sub_) return pg_atomic_sub_fetch_u64_impl(ptr, sub_); } +static inline void +pg_atomic_monotonic_advance_u64(volatile pg_atomic_uint64 *ptr, uint64 target_) +{ +#ifndef PG_HAVE_ATOMIC_U64_SIMULATION + AssertPointerAlignment(ptr, 8); +#endif + /* FIXME is this algorithm correct if we have u64 simulation? 
*/ + while (true) + { + uint64 currval; + + currval = pg_atomic_read_u64(ptr); + if (currval > target_) + break; /* already done by somebody else */ + if (pg_atomic_compare_exchange_u64(ptr, &currval, target_)) + break; /* successfully done */ + } +} + #undef INSIDE_ATOMICS_H #endif /* ATOMICS_H */ -- 2.20.1 >From 84a62ac0594f8f0b13c06958cf6e94e2a2723082 Mon Sep 17 00:00:00 2001 From: Alvaro Herrera Date: Fri, 29 Jan 2021 22:34:04 -0300 Subject: [PATCH v2 2/3] atomics in xlog.c --- src/backend/access/transam/xlog.c | 129 ++ 1 file changed, 61 insertions(+), 68 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index f03bd473e2..8073a92ceb 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -413,7 +413,8 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr; * which is updated when convenient. * * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst - * (protected by info_lck), but we don't need to cache any copies of it. + * (which use atomic access, so no locks are needed), but we don't need to + * cache any copies of it. * * info_lck is only held long enough to read/update the protected variables, * so it's a plain spinlock. 
The other locks are held longer (potentially @@ -433,17 +434,19 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr; *-- */ -typedef struct XLogwrtRqst +/* Write/Flush position to be used in shared memory */ +typedef struct XLogwrtAtomic { - XLogRecPtr Write; /* last byte + 1 to write out */ - XLogRecPtr Flush; /* last byte + 1 to flush */ -} XLogwrtRqst; + pg_atomic_uint64 Write; /* last byte + 1 of write position */ + pg_atomic_uint64 Flush; /* last byte + 1 of flush position */ +} XLogwrtAtomic; -typedef struct XLogwrtResult +/* Write/Flush position to be used in process-local memory */ +typedef struct XLogwrt { XLogRecPtr Write; /* last byte + 1 written out */ XLogRecPtr Flush; /* last byte + 1 flushed */ -} XLogwrtResult; +} XLogwrt; /* * Inserting to WAL is protected by a small fixed number of WAL insertion @@ -596,8 +599,10 @@ typedef struct XLogCtlData { XLogCtlInsert Insert; + XLogwrtAtomic LogwrtRqst; + XLogwrtAtomic LogwrtResult; + /* Protected by info_lck: */ - XLogwrtRqst LogwrtRqst; XLogRecPtr RedoRecPtr; /* a recent copy of Insert->RedoRecPtr */ FullTransactionId ckptFullXid; /* nextXid of latest checkpoint */ XLogRecPtr asyncXactLSN; /* LSN of newest async commit/abort */ @@ -613,12 +618,6 @@ typedef struct XLogCtlData pg_time_t lastSegSwitchTime; XLogRecPtr lastSegSwitchLSN; - /* - * Protected by info_lck and WALWriteLock (you must hold either lock to - * read it, but both to update) - */ - XLogwrtResult LogwrtResult; - /* * Latest initialized page in the cache (last byte position + 1). * @@ -779,7 +778,7 @@ static int UsableBytesInSegment; * Private, possibly out-of-date copy of shared LogwrtResult. * See discussion above. 
*/ -static XLogwrtResult LogwrtResult = {0, 0}; +static XLogwrt LogwrtResult = {0, 0}; /* * Codes indicating where we got a WAL file from during recovery, or where @@ -910,7 +909,7 @@ static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void); static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic); static bool XLogCheckpointNeeded(XLogSegNo new_segno); -static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible); +static void XLogWrite(XLogwrt WriteRqst, bool flexible); static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath, bool find_free, XLogS
Re: LogwrtResult contended spinlock
On 2020-Aug-31, Andres Freund wrote:

> Wouldn't the better fix here be to allow reading of individual members
> without a lock? E.g. by wrapping each in a 64bit atomic.

So I've been playing with this and I'm annoyed about having two
datatypes to represent Write/Flush positions:

typedef struct XLogwrtRqst
{
	XLogRecPtr	Write;			/* last byte + 1 to write out */
	XLogRecPtr	Flush;			/* last byte + 1 to flush */
} XLogwrtRqst;

typedef struct XLogwrtResult
{
	XLogRecPtr	Write;			/* last byte + 1 written out */
	XLogRecPtr	Flush;			/* last byte + 1 flushed */
} XLogwrtResult;

Don't they look, um, quite similar?  I am strongly tempted to remove
that distinction, since it seems quite pointless, and introduce a
different one:

typedef struct XLogwrtAtomic
{
	pg_atomic_uint64	Write;	/* last byte + 1 of write position */
	pg_atomic_uint64	Flush;	/* last byte + 1 of flush position */
} XLogwrtAtomic;

This one, with atomics, would be used for the XLogCtl struct members
LogwrtRqst and LogwrtResult, and would always be accessed using atomic
ops.  On the other hand we would have

typedef struct XLogwrt
{
	XLogRecPtr	Write;			/* last byte + 1 of write position */
	XLogRecPtr	Flush;			/* last byte + 1 of flush position */
} XLogwrt;

to be used for the local-memory-only LogwrtResult, using normal
assignment.

Now, I do wonder if there's a point in keeping LogwrtResult as a local
variable at all; maybe since the ones in shared memory are going to use
unlocked access, we don't need it anymore?  I'd prefer to defer that
decision to after this patch is done, since ISTM that it'd merit more
careful benchmarking.

Thoughts?

-- 
Álvaro Herrera                Valdivia, Chile
Re: LogwrtResult contended spinlock
On Fri, Jan 22, 2021 at 10:39 PM Alvaro Herrera wrote:
>
> On 2021-Jan-22, Masahiko Sawada wrote:
>
> > On Wed, Nov 25, 2020 at 12:02 AM Alvaro Herrera wrote:
> >
> > > Yes, please move it forward.  I'll post an update sometime before the
> > > next CF.
> >
> > Any update on this?
>
> I'll update this one early next week.

Great, thanks! I'll look at that.

Regards,

-- 
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: LogwrtResult contended spinlock
On 2021-Jan-22, Masahiko Sawada wrote:

> On Wed, Nov 25, 2020 at 12:02 AM Alvaro Herrera wrote:
>
> > Yes, please move it forward.  I'll post an update sometime before the
> > next CF.
>
> Any update on this?

I'll update this one early next week.  Thanks!

-- 
Álvaro Herrera                Valdivia, Chile
"I would rather have GNU than GNOT."  (ccchips, lwn.net/Articles/37595/)
Re: LogwrtResult contended spinlock
Hi Alvaro,

On Wed, Nov 25, 2020 at 12:02 AM Alvaro Herrera wrote:
>
> On 2020-Nov-24, Anastasia Lubennikova wrote:
>
> > On 04.09.2020 20:13, Andres Freund wrote:
> > > Re general routine: On second thought, it might actually be worth having
> > > it.  Even just for LSNs - there's plenty places where it's useful to
> > > ensure a variable is at least a certain size.  I think I would be in
> > > favor of a general helper function.
>
> > Do you mean by general helper function something like this?
> >
> > void
> > swap_lsn(XLogRecPtr old_value, XLogRecPtr new_value, bool to_largest)
>
> Something like that, yeah, though maybe name it "pg_atomic_increase_lsn"
> or some similar name that makes it clear that
>
> 1. it is supposed to use atomics
> 2. it can only be used to *advance* a value rather than a generic swap.
>
> (I'm not 100% clear that that's the exact API we need.)
>
> > This CF entry was inactive for a while.  Alvaro, are you going to
> > continue working on it?
>
> Yes, please move it forward.  I'll post an update sometime before the
> next CF.

Any update on this?

Regards,

-- 
Masahiko Sawada
EDB: https://www.enterprisedb.com/
Re: LogwrtResult contended spinlock
On 2020-Nov-24, Anastasia Lubennikova wrote:

> On 04.09.2020 20:13, Andres Freund wrote:
> > Re general routine: On second thought, it might actually be worth having
> > it.  Even just for LSNs - there's plenty places where it's useful to
> > ensure a variable is at least a certain size.  I think I would be in
> > favor of a general helper function.

> Do you mean by general helper function something like this?
>
> void
> swap_lsn(XLogRecPtr old_value, XLogRecPtr new_value, bool to_largest)

Something like that, yeah, though maybe name it "pg_atomic_increase_lsn"
or some similar name that makes it clear that

1. it is supposed to use atomics
2. it can only be used to *advance* a value rather than a generic swap.

(I'm not 100% clear that that's the exact API we need.)

> This CF entry was inactive for a while.  Alvaro, are you going to
> continue working on it?

Yes, please move it forward.  I'll post an update sometime before the
next CF.
Re: LogwrtResult contended spinlock
On 04.09.2020 20:13, Andres Freund wrote:
> Hi,
>
> On 2020-09-04 10:05:45 -0700, Andres Freund wrote:
> > On 2020-09-03 14:34:52 -0400, Alvaro Herrera wrote:
> > > Looking at patterns like this
> > >
> > > 	if (XLogCtl->LogwrtRqst.Write < EndPos)
> > > 		XLogCtl->LogwrtRqst.Write = EndPos;
> > >
> > > It seems possible to implement with
> > >
> > > 	do {
> > > 		XLogRecPtr	currwrite;
> > >
> > > 		currwrite = pg_atomic_read_u64(LogwrtRqst.Write);
> > > 		if (currwrite > EndPos)
> > > 			break;	// already done by somebody else
> > > 		if (pg_atomic_compare_exchange_u64(LogwrtRqst.Write,
> > > 										   currwrite, EndPos))
> > > 			break;	// successfully updated
> > > 	} while (true);
> > >
> > > This assumes that LogwrtRqst.Write never goes backwards, so it doesn't
> > > seem good material for a general routine.
> > >
> > > This *seems* correct to me, though this is muddy territory to me.
> > > Also, are there better ways to go about this?
> >
> > Hm, I was thinking that we'd first go for reading it without a spinlock,
> > but continuing to write it as we currently do.
> >
> > But yea, I can't see an issue with what you propose here.  I personally
> > find do {} while () weird and avoid it when not explicitly useful, but
> > that's extremely minor, obviously.
>
> Re general routine: On second thought, it might actually be worth having
> it.  Even just for LSNs - there's plenty places where it's useful to
> ensure a variable is at least a certain size.  I think I would be in
> favor of a general helper function.

Do you mean by general helper function something like this?

void
swap_lsn(XLogRecPtr old_value, XLogRecPtr new_value, bool to_largest)
{
	while (true)
	{
		XLogRecPtr	currwrite;

		currwrite = pg_atomic_read_u64(old_value);

		if (to_largest)
		{
			if (currwrite > new_value)
				break;	/* already done by somebody else */
		}
		else
		{
			if (currwrite < new_value)
				break;	/* already done by somebody else */
		}

		if (pg_atomic_compare_exchange_u64(old_value, currwrite, new_value))
			break;	/* successfully updated */
	}
}

which will be called like

	swap_lsn(XLogCtl->LogwrtRqst.Write, EndPos, true);

> Greetings,
> Andres Freund

This CF entry was inactive for a while.  Alvaro, are you going to
continue working on it?

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: LogwrtResult contended spinlock
Hi,

On 2020-09-04 10:05:45 -0700, Andres Freund wrote:
> On 2020-09-03 14:34:52 -0400, Alvaro Herrera wrote:
> > Looking at patterns like this
> >
> > 	if (XLogCtl->LogwrtRqst.Write < EndPos)
> > 		XLogCtl->LogwrtRqst.Write = EndPos;
> >
> > It seems possible to implement with
> >
> > 	do {
> > 		XLogRecPtr	currwrite;
> >
> > 		currwrite = pg_atomic_read_u64(LogwrtRqst.Write);
> > 		if (currwrite > EndPos)
> > 			break;	// already done by somebody else
> > 		if (pg_atomic_compare_exchange_u64(LogwrtRqst.Write,
> > 										   currwrite, EndPos))
> > 			break;	// successfully updated
> > 	} while (true);
> >
> > This assumes that LogwrtRqst.Write never goes backwards, so it doesn't
> > seem good material for a general routine.
> >
> > This *seems* correct to me, though this is muddy territory to me.
> > Also, are there better ways to go about this?
>
> Hm, I was thinking that we'd first go for reading it without a spinlock,
> but continuing to write it as we currently do.
>
> But yea, I can't see an issue with what you propose here.  I personally
> find do {} while () weird and avoid it when not explicitly useful, but
> that's extremely minor, obviously.

Re general routine: On second thought, it might actually be worth having
it.  Even just for LSNs - there's plenty places where it's useful to
ensure a variable is at least a certain size.  I think I would be in
favor of a general helper function.

Greetings,

Andres Freund
Re: LogwrtResult contended spinlock
Hi,

On 2020-09-03 14:34:52 -0400, Alvaro Herrera wrote:
> Looking at patterns like this
>
> 	if (XLogCtl->LogwrtRqst.Write < EndPos)
> 		XLogCtl->LogwrtRqst.Write = EndPos;
>
> It seems possible to implement with
>
> 	do {
> 		XLogRecPtr	currwrite;
>
> 		currwrite = pg_atomic_read_u64(LogwrtRqst.Write);
> 		if (currwrite > EndPos)
> 			break;	// already done by somebody else
> 		if (pg_atomic_compare_exchange_u64(LogwrtRqst.Write,
> 										   currwrite, EndPos))
> 			break;	// successfully updated
> 	} while (true);
>
> This assumes that LogwrtRqst.Write never goes backwards, so it doesn't
> seem good material for a general routine.
>
> This *seems* correct to me, though this is muddy territory to me.
> Also, are there better ways to go about this?

Hm, I was thinking that we'd first go for reading it without a spinlock,
but continuing to write it as we currently do.

But yea, I can't see an issue with what you propose here.  I personally
find do {} while () weird and avoid it when not explicitly useful, but
that's extremely minor, obviously.

Greetings,

Andres Freund
Re: LogwrtResult contended spinlock
Looking at patterns like this

	if (XLogCtl->LogwrtRqst.Write < EndPos)
		XLogCtl->LogwrtRqst.Write = EndPos;

It seems possible to implement with

	do {
		XLogRecPtr	currwrite;

		currwrite = pg_atomic_read_u64(LogwrtRqst.Write);
		if (currwrite > EndPos)
			break;	// already done by somebody else
		if (pg_atomic_compare_exchange_u64(LogwrtRqst.Write,
										   currwrite, EndPos))
			break;	// successfully updated
	} while (true);

This assumes that LogwrtRqst.Write never goes backwards, so it doesn't
seem good material for a general routine.

This *seems* correct to me, though this is muddy territory to me.  Also,
are there better ways to go about this?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: LogwrtResult contended spinlock
Hi,

On August 31, 2020 11:34:45 AM PDT, Alvaro Herrera wrote:
>On 2020-Aug-31, Andres Freund wrote:
>
>> Hi,
>>
>> On August 31, 2020 11:21:56 AM PDT, Alvaro Herrera wrote:
>> >At first I wanted to make the new LWLock cover only LogwrtResult
>> >proper, and leave LogwrtRqst alone.  However on doing it, it seemed
>> >that that might change the locking protocol in a nontrivial way.  So
>> >I decided to make it cover both and call it a day.  We did verify
>> >that the patch solves the reported problem, at any rate.
>>
>> Wouldn't the better fix here be to allow reading of individual members
>> without a lock?  E.g. by wrapping each in a 64bit atomic.
>
>Heh, Simon said the same.  It's not clear to me due to the lack of
>general availability of 64-bit atomics.  If they are spinlock-protected
>when emulated, I think that would make the problem worse.
>
>IIRC Thomas wanted to start relying on atomic 64-bit vars in some patch,
>but I don't remember what it was.

All relevant platforms have 64bit atomics.  So I don't think there's
much point in worrying about the emulated performance.  Correctness,
sure.  Performance, not so much.

Andres
-- 
Sent from my Android device with K-9 Mail.  Please excuse my brevity.
Re: LogwrtResult contended spinlock
On 2020-Aug-31, Andres Freund wrote:

> Hi,
>
> On August 31, 2020 11:21:56 AM PDT, Alvaro Herrera wrote:
> >At first I wanted to make the new LWLock cover only LogwrtResult
> >proper, and leave LogwrtRqst alone.  However on doing it, it seemed
> >that that might change the locking protocol in a nontrivial way.  So
> >I decided to make it cover both and call it a day.  We did verify
> >that the patch solves the reported problem, at any rate.
>
> Wouldn't the better fix here be to allow reading of individual members
> without a lock?  E.g. by wrapping each in a 64bit atomic.

Heh, Simon said the same.  It's not clear to me due to the lack of
general availability of 64-bit atomics.  If they are spinlock-protected
when emulated, I think that would make the problem worse.

IIRC Thomas wanted to start relying on atomic 64-bit vars in some patch,
but I don't remember what it was.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: LogwrtResult contended spinlock
Hi,

On August 31, 2020 11:21:56 AM PDT, Alvaro Herrera wrote:
>Jaime Casanova recently reported a situation where pglogical
>replicating from 64 POS sites to a single central (64-core) node, each
>with two replication sets, causes XLog's info_lck to become highly
>contended because of frequently reading LogwrtResult.  We tested the
>simple fix of adding a new LWLock that protects LogwrtResult and
>LogwrtRqst; that seems to solve the problem easily enough.
>
>At first I wanted to make the new LWLock cover only LogwrtResult
>proper, and leave LogwrtRqst alone.  However on doing it, it seemed
>that that might change the locking protocol in a nontrivial way.  So I
>decided to make it cover both and call it a day.  We did verify that
>the patch solves the reported problem, at any rate.

Wouldn't the better fix here be to allow reading of individual members
without a lock?  E.g. by wrapping each in a 64bit atomic.

Andres
-- 
Sent from my Android device with K-9 Mail.  Please excuse my brevity.
LogwrtResult contended spinlock
Jaime Casanova recently reported a situation where pglogical replicating
from 64 POS sites to a single central (64-core) node, each with two
replication sets, causes XLog's info_lck to become highly contended
because of frequently reading LogwrtResult.  We tested the simple fix of
adding a new LWLock that protects LogwrtResult and LogwrtRqst; that
seems to solve the problem easily enough.

At first I wanted to make the new LWLock cover only LogwrtResult proper,
and leave LogwrtRqst alone.  However on doing it, it seemed that that
might change the locking protocol in a nontrivial way.  So I decided to
make it cover both and call it a day.  We did verify that the patch
solves the reported problem, at any rate.

-- 
Álvaro Herrera                PostgreSQL Expert, https://www.2ndQuadrant.com/

From 3eb871f235c1b6005ff5dc88561173e4e92c1314 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Mon, 17 Aug 2020 19:05:22 -0400
Subject: [PATCH] use an LWLock rather than spinlock for Logwrt{Result,Rqst}

---
 src/backend/access/transam/xlog.c        | 76 +---
 src/backend/storage/lmgr/lwlocknames.txt |  2 +-
 2 files changed, 43 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae..d838739bc4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -403,15 +403,13 @@ static XLogRecPtr RedoStartLSN = InvalidXLogRecPtr;
  * These structs are identical but are declared separately to indicate their
  * slightly different functions.
  *
- * To read XLogCtl->LogwrtResult, you must hold either info_lck or
- * WALWriteLock.  To update it, you need to hold both locks.  The point of
- * this arrangement is that the value can be examined by code that already
- * holds WALWriteLock without needing to grab info_lck as well.  In addition
- * to the shared variable, each backend has a private copy of LogwrtResult,
- * which is updated when convenient.
+ * To read/write XLogCtl->LogwrtResult, you must hold WALWriteResultLock.
+ * In addition to the shared variable, each backend has a private copy of
+ * LogwrtResult, which is updated when convenient.
  *
  * The request bookkeeping is simpler: there is a shared XLogCtl->LogwrtRqst
- * (protected by info_lck), but we don't need to cache any copies of it.
+ * (also protected by WALWriteResultLock), but we don't need to cache any
+ * copies of it.
  *
  * info_lck is only held long enough to read/update the protected variables,
  * so it's a plain spinlock.  The other locks are held longer (potentially
@@ -615,10 +613,7 @@ typedef struct XLogCtlData
 	pg_time_t	lastSegSwitchTime;
 	XLogRecPtr	lastSegSwitchLSN;
 
-	/*
-	 * Protected by info_lck and WALWriteLock (you must hold either lock to
-	 * read it, but both to update)
-	 */
+	/* Protected by WALWriteResultLock. */
 	XLogwrtResult LogwrtResult;
 
@@ -1158,13 +1153,12 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
 	{
-		SpinLockAcquire(&XLogCtl->info_lck);
 		/* advance global request to include new block(s) */
+		LWLockAcquire(WALWriteResultLock, LW_EXCLUSIVE);
 		if (XLogCtl->LogwrtRqst.Write < EndPos)
 			XLogCtl->LogwrtRqst.Write = EndPos;
-		/* update local result copy while I have the chance */
 		LogwrtResult = XLogCtl->LogwrtResult;
-		SpinLockRelease(&XLogCtl->info_lck);
+		LWLockRelease(WALWriteResultLock);
 	}
 
@@ -2155,12 +2149,12 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		if (opportunistic)
 			break;
 
-		/* Before waiting, get info_lck and update LogwrtResult */
-		SpinLockAcquire(&XLogCtl->info_lck);
+		/* Before waiting, update LogwrtResult */
+		LWLockAcquire(WALWriteResultLock, LW_EXCLUSIVE);
 		if (XLogCtl->LogwrtRqst.Write < OldPageRqstPtr)
 			XLogCtl->LogwrtRqst.Write = OldPageRqstPtr;
 		LogwrtResult = XLogCtl->LogwrtResult;
-		SpinLockRelease(&XLogCtl->info_lck);
+		LWLockRelease(WALWriteResultLock);
 
 		/*
 		 * Now that we have an up-to-date LogwrtResult value, see if we
@@ -2178,9 +2172,11 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 			WaitXLogInsertionsToFinish(OldPageRqstPtr);
 
-			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
-
+			LWLockAcquire(WALWriteResultLock, LW_SHARED);
 			LogwrtResult = XLogCtl->LogwrtResult;
+			LWLockRelease(WALWriteResultLock);
+
+			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 			if (LogwrtResult.Write >= OldPageRqstPtr)
 			{
 				/* OK, someone wrote it already */
@@ -2426,7 +2422,9 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	/*
 	 * Update local LogwrtResult (caller probably did this already, but...)
 	 */
+	LWLockAcquire(WALWriteResultLock, LW_SHARED);
 	LogwrtResult = XLogCtl->LogwrtResult;
+	LWLockRelease(WALWriteResultLock);
 
 	/*
 	 * Since successive pages in the xlog cache are consecutively allocated,
@@ -2663,13 +2661,14 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	 * code in a couple of