Re: [HACKERS] LWLock optimization for multicore Power machines

2017-08-30 Thread Sokolov Yura

On 2017-08-30 16:24, Tom Lane wrote:

Alexander Korotkov  writes:
It doesn't seem to make sense to consider this patch unless we get access
to a suitable Power machine to reproduce the benefits.
This is why I'm going to mark this patch "Returned with feedback".
Once we get access to an appropriate machine, I will resubmit this patch.


What about hydra (that 60-core POWER7 machine we have community
access to)?

regards, tom lane


Anyway, I encourage considering another LWLock patch first:
https://commitfest.postgresql.org/14/1166/
It positively affects performance on any platform (I've tested it
on Power as well).

I also have a prototype (not yet published) of a further adaptation of that
patch for the POWER platform (i.e. using inline assembly).

--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company




Re: [HACKERS] LWLock optimization for multicore Power machines

2017-08-30 Thread Tom Lane
Alexander Korotkov  writes:
> It doesn't seem to make sense to consider this patch unless we get access
> to a suitable Power machine to reproduce the benefits.
> This is why I'm going to mark this patch "Returned with feedback".
> Once we get access to an appropriate machine, I will resubmit this patch.

What about hydra (that 60-core POWER7 machine we have community
access to)?

regards, tom lane




Re: [HACKERS] LWLock optimization for multicore Power machines

2017-08-30 Thread Alexander Korotkov
On Thu, Apr 6, 2017 at 5:38 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Apr 6, 2017 at 5:37 PM, Alexander Korotkov <
> a.korot...@postgrespro.ru> wrote:
>
>> On Thu, Apr 6, 2017 at 2:16 AM, Andres Freund  wrote:
>>
>>> On 2017-04-03 11:56:13 -0700, Andres Freund wrote:
>>> > Have you done x86 benchmarking?
>>>
>>> I think unless such benchmarking is done in the next 24h we need to move
>>> this patch to the next CF...
>>>
>>
>> Thank you for your comments.
>> I didn't do x86 benchmarking.  I didn't even manage to reproduce the previous
>> results on Power8.
>> Presumably, that's because the previous benchmarks were done on bare metal,
>> while now I only have some kind of virtual machine on an IBM E880, where I
>> can't reproduce any win from the Power8 LWLock optimization.  But perhaps
>> there is another reason.
>> Thus, I'm moving this patch to the next CF.
>>
>
> I see it's already moved. OK!
>

It doesn't seem to make sense to consider this patch unless we get access
to a suitable Power machine to reproduce the benefits.
This is why I'm going to mark this patch "Returned with feedback".
Once we get access to an appropriate machine, I will resubmit this patch.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-04-06 Thread Alexander Korotkov
On Thu, Apr 6, 2017 at 5:37 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Thu, Apr 6, 2017 at 2:16 AM, Andres Freund  wrote:
>
>> On 2017-04-03 11:56:13 -0700, Andres Freund wrote:
>> > Have you done x86 benchmarking?
>>
>> I think unless such benchmarking is done in the next 24h we need to move
>> this patch to the next CF...
>>
>
> Thank you for your comments.
> I didn't do x86 benchmarking.  I didn't even manage to reproduce the previous
> results on Power8.
> Presumably, that's because the previous benchmarks were done on bare metal,
> while now I only have some kind of virtual machine on an IBM E880, where I
> can't reproduce any win from the Power8 LWLock optimization.  But perhaps
> there is another reason.
> Thus, I'm moving this patch to the next CF.
>

I see it's already moved. OK!

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-04-06 Thread Alexander Korotkov
On Thu, Apr 6, 2017 at 2:16 AM, Andres Freund  wrote:

> On 2017-04-03 11:56:13 -0700, Andres Freund wrote:
> >
> > > +/*
> > > + * Generic implementation of pg_atomic_fetch_mask_add_u32() via loop
> > > + * of compare & exchange.
> > > + */
> > > +static inline uint32
> > > +pg_atomic_fetch_mask_add_u32_impl(volatile pg_atomic_uint32 *ptr,
> > > +                                  uint32 mask_, uint32 add_)
> > > +{
> > > +   uint32      old_value;
> > > +
> > > +   /*
> > > +    * Read once outside the loop, later iterations will get the newer value
> > > +    * via compare & exchange.
> > > +    */
> > > +   old_value = pg_atomic_read_u32_impl(ptr);
> > > +
> > > +   /* loop until we've determined whether we could make an increment or not */
> > > +   while (true)
> > > +   {
> > > +       uint32      desired_value;
> > > +       bool        free;
> > > +
> > > +       desired_value = old_value;
> > > +       free = (old_value & mask_) == 0;
> > > +       if (free)
> > > +           desired_value += add_;
> > > +
> > > +       /*
> > > +        * Attempt to swap in the value we are expecting. If we didn't see
> > > +        * masked bits to be clear, that's just the old value. If we saw them
> > > +        * as clear, we'll attempt to make an increment. The reason that we
> > > +        * always swap in the value is that this doubles as a memory barrier.
> > > +        * We could try to be smarter and only swap in values if we saw the
> > > +        * masked bits as clear, but benchmarks haven't shown it to be
> > > +        * beneficial so far.
> > > +        *
> > > +        * Retry if the value changed since we last looked at it.
> > > +        */
> > > +       if (pg_atomic_compare_exchange_u32_impl(ptr, &old_value, desired_value))
> > > +           return old_value;
> > > +   }
> > > +   pg_unreachable();
> > > +}
> > > +#endif
> >
> > Have you done x86 benchmarking?
>
> I think unless such benchmarking is done in the next 24h we need to move
> this patch to the next CF...
>

Thank you for your comments.
I didn't do x86 benchmarking.  I didn't even manage to reproduce the previous
results on Power8.
Presumably, that's because the previous benchmarks were done on bare metal,
while now I only have some kind of virtual machine on an IBM E880, where I
can't reproduce any win from the Power8 LWLock optimization.  But perhaps there
is another reason.
Thus, I'm moving this patch to the next CF.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-04-05 Thread Tom Lane
Andres Freund  writes:
> On 2017-04-03 11:56:13 -0700, Andres Freund wrote:
>> Have you done x86 benchmarking?

> I think unless such benchmarking is done in the next 24h we need to move
> this patch to the next CF...

In theory, inlining the _impl function should lead to exactly the same
assembly code as before, on all non-PPC platforms.  Confirming that would
be a good substitute for benchmarking.
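
For reference, the call site under discussion: the patch routes
LWLockAttemptLock through the new primitive, and on non-PPC platforms the
generic _impl should inline back into essentially the compare-and-exchange
loop the function contained before.  A rough, hypothetical sketch of a
shared-mode fast path (LW_VAL_EXCLUSIVE and LW_VAL_SHARED are the existing
lwlock.c constants; the actual patch may use different masks and handles both
lock modes):

static bool
LWLockAttemptLockShared(LWLock *lock)
{
    uint32      old_state;

    /*
     * Atomically: if no exclusive holder is present (masked bits clear),
     * add one shared holder; otherwise leave the state unchanged.
     */
    old_state = pg_atomic_fetch_mask_add_u32(&lock->state,
                                             LW_VAL_EXCLUSIVE,
                                             LW_VAL_SHARED);

    /* the acquisition succeeded if the exclusive bit was clear */
    return (old_state & LW_VAL_EXCLUSIVE) == 0;
}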

Having said that, I'm not sure this patch is solid enough to push in
right now, and would be happy with seeing it pushed to v11.

regards, tom lane




Re: [HACKERS] LWLock optimization for multicore Power machines

2017-04-05 Thread Andres Freund
Hi,

On 2017-04-03 11:56:13 -0700, Andres Freund wrote:
> 
> > +/*
> > + * Generic implementation of pg_atomic_fetch_mask_add_u32() via loop
> > + * of compare & exchange.
> > + */
> > +static inline uint32
> > +pg_atomic_fetch_mask_add_u32_impl(volatile pg_atomic_uint32 *ptr,
> > +                                  uint32 mask_, uint32 add_)
> > +{
> > +   uint32      old_value;
> > +
> > +   /*
> > +    * Read once outside the loop, later iterations will get the newer value
> > +    * via compare & exchange.
> > +    */
> > +   old_value = pg_atomic_read_u32_impl(ptr);
> > +
> > +   /* loop until we've determined whether we could make an increment or not */
> > +   while (true)
> > +   {
> > +       uint32      desired_value;
> > +       bool        free;
> > +
> > +       desired_value = old_value;
> > +       free = (old_value & mask_) == 0;
> > +       if (free)
> > +           desired_value += add_;
> > +
> > +       /*
> > +        * Attempt to swap in the value we are expecting. If we didn't see
> > +        * masked bits to be clear, that's just the old value. If we saw them
> > +        * as clear, we'll attempt to make an increment. The reason that we
> > +        * always swap in the value is that this doubles as a memory barrier.
> > +        * We could try to be smarter and only swap in values if we saw the
> > +        * masked bits as clear, but benchmarks haven't shown it to be
> > +        * beneficial so far.
> > +        *
> > +        * Retry if the value changed since we last looked at it.
> > +        */
> > +       if (pg_atomic_compare_exchange_u32_impl(ptr, &old_value, desired_value))
> > +           return old_value;
> > +   }
> > +   pg_unreachable();
> > +}
> > +#endif
> 
> Have you done x86 benchmarking?

I think unless such benchmarking is done in the next 24h we need to move
this patch to the next CF...

Greetings,

Andres Freund




Re: [HACKERS] LWLock optimization for multicore Power machines

2017-04-03 Thread Andres Freund
Hi,

On 2017-03-31 13:38:31 +0300, Alexander Korotkov wrote:
> > It seems that on this platform definition of atomics should be provided by
> > fallback.h.  But it doesn't because I already defined 
> > PG_HAVE_ATOMIC_U32_SUPPORT
> > in arch-ppc.h.  I think in this case we shouldn't provide ppc-specific
> > implementation of pg_atomic_fetch_mask_add_u32().  However, I don't know
> > how to do this assuming arch-ppc.h is included before compiler-specific
> > headers.  Thus, in arch-ppc.h we don't know yet if we would find
> > implementation of atomics for this platform.  One possible solution is to
> > provide assembly implementation for all atomics in arch-ppc.h.

That'd probably not be good use of effort.


> BTW, implementation for all atomics in arch-ppc.h would be too invasive and
> shouldn't be considered for v10.
> However, I made the following workaround: declare pg_atomic_uint32 and
> pg_atomic_fetch_mask_add_u32_impl() only when we know that generic-gcc.h
> would declare gcc-based atomics.
> Could you please check it on Apple PPC?

That doesn't sound crazy to me.


> +/*
> + * pg_atomic_fetch_mask_add_u32 - atomically check that masked bits in 
> variable
> + * and if they are clear then add to variable.

Hm, expanding on this wouldn't hurt.


> + * Returns the value of ptr before the atomic operation.

Sounds a bit like it'd return the pointer itself, not what it points to...
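
A hedged sketch of how that header comment could be expanded to address both
remarks (the wording here is the editor's illustration, not the patch
author's):

/*
 * pg_atomic_fetch_mask_add_u32 - atomically read *ptr and, only if none of
 * the bits in mask_ are set in it, add add_ to it.
 *
 * In either case the value that *ptr held before the operation is returned
 * (not the pointer itself), so the caller can test the returned value against
 * mask_ to find out whether the addition happened.  Full barrier semantics.
 */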


> + * Full barrier semantics.

Does it actually have full barrier semantics?  Based on a quick skim I
don't think the ppc implementation does?



> +#if defined(HAVE_ATOMICS) \
> +     && (defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS))
> +
> +/*
> + * Declare pg_atomic_uint32 structure before generic-gcc.h does it in order to
> + * use it in function arguments.
> + */

If we do this, then the other side needs a pointer to keep this up2date.

> +#define PG_HAVE_ATOMIC_U32_SUPPORT
> +typedef struct pg_atomic_uint32
> +{
> + volatile uint32 value;
> +} pg_atomic_uint32;
> +
> +/*
> + * Optimized implementation of pg_atomic_fetch_mask_add_u32() for Power
> + * processors.  Atomic operations on Power processors are implemented using
> + * optimistic locking.  The 'lwarx' instruction makes a reservation, but that
> + * reservation can be broken before 'stwcx.' completes, and then we have to
> + * retry.  Thus, each CAS operation is itself a loop, so a loop of CAS
> + * operations becomes a two-level nested loop.  Experiments on multicore Power
> + * machines show that doing this operation in a single assembly loop gives a
> + * substantial benefit.
> + */
> +#define PG_HAVE_ATOMIC_FETCH_MASK_ADD_U32
> +static inline uint32
> +pg_atomic_fetch_mask_add_u32_impl(volatile pg_atomic_uint32 *ptr,
> +   uint32 mask, 
> uint32 increment)
> +{
> + uint32  result,
> + tmp;
> +
> + __asm__ __volatile__(
> + /* read *ptr and reserve index */
> +#ifdef USE_PPC_LWARX_MUTEX_HINT
> + "   lwarx   %0,0,%5,1   \n"
> +#else
> + "   lwarx   %0,0,%5 \n"
> +#endif

Hm, we should remove that hint bit stuff at some point...


> +     "   and     %1,%0,%3    \n" /* calculate '*ptr & mask' */
> +     "   cmpwi   %1,0        \n" /* compare '*ptr & mask' vs 0 */
> +     "   bne-    $+16        \n" /* exit on '*ptr & mask != 0' */
> +     "   add     %1,%0,%4    \n" /* calculate '*ptr + increment' */
> +     "   stwcx.  %1,0,%5     \n" /* try to store '*ptr + increment' into *ptr */
> +     "   bne-    $-24        \n" /* retry if index reservation is broken */
> +#ifdef USE_PPC_LWSYNC
> + "   lwsync  \n"
> +#else
> + "   isync   \n"
> +#endif
> + : "=&r"(result), "=&r"(tmp), "+m"(*ptr)
> + : "r"(mask), "r"(increment), "r"(ptr)
> + : "memory", "cc");
> + return result;
> +}

I'm not sure that the barrier semantics here are sufficient.


> +/*
> + * Generic implementation of pg_atomic_fetch_mask_add_u32() via loop
> + * of compare & exchange.
> + */
> +static inline uint32
> +pg_atomic_fetch_mask_add_u32_impl(volatile pg_atomic_uint32 *ptr,
> +   uint32 mask_, 
> uint32 add_)
> +{
> + uint32  old_value;
> +
> + /*
> +  * Read once outside the loop, later iterations will get the newer value
> +  * via compare & exchange.
> +  */
> + old_value = pg_atomic_read_u32_impl(ptr);
> +
> + /* loop until we've determined whether we could make an increment or 
> not */
> + while (true)
> + {
> + uint32  desired_value;
> + boolfree;
> +
> + desired_value = old_value;
> + free = (old_value & mask_) == 0;
> + if (free)
> + desired_value += add_;
> +
> + /*
> +   

Re: [HACKERS] LWLock optimization for multicore Power machines

2017-03-31 Thread Tom Lane
Alexander Korotkov  writes:
>> It seems that on this platform definition of atomics should be provided by
>> fallback.h.  But it doesn't because I already defined 
>> PG_HAVE_ATOMIC_U32_SUPPORT
>> in arch-ppc.h.  I think in this case we shouldn't provide ppc-specific
>> implementation of pg_atomic_fetch_mask_add_u32().  However, I don't know
>> how to do this assuming arch-ppc.h is included before compiler-specific
>> headers.  Thus, in arch-ppc.h we don't know yet if we would find
>> implementation of atomics for this platform.  One possible solution is to
>> provide assembly implementation for all atomics in arch-ppc.h.

> BTW, implementation for all atomics in arch-ppc.h would be too invasive and
> shouldn't be considered for v10.
> However, I made the following workaround: declare pg_atomic_uint32 and
> pg_atomic_fetch_mask_add_u32_impl() only when we know that generic-gcc.h
> would declare gcc-based atomics.

I don't have a well-informed opinion on whether this is a reasonable thing
to do, but I imagine Andres does.

> Could you please check it on Apple PPC?

It does compile and pass "make check" on prairiedog.

regards, tom lane




Re: [HACKERS] LWLock optimization for multicore Power machines

2017-03-31 Thread Alexander Korotkov
On Sun, Mar 26, 2017 at 12:29 AM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Sat, Mar 25, 2017 at 11:32 PM, Tom Lane  wrote:
>
>> Alexander Korotkov  writes:
>> > I moved PPC implementation of pg_atomic_fetch_mask_add_u32() into
>> > port/atomics/arch-ppc.h.  I also had to declare pg_atomic_uint32 there
>> to
>> > satisfy usage of this type as argument
>> > of pg_atomic_fetch_mask_add_u32_impl().
>>
>> Hm, you did something wrong there, because now I get a bunch of failures:
>>
>> ccache gcc -Wall -Wmissing-prototypes -Wpointer-arith
>> -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
>> -Wformat-security -fno-strict-aliasing -fwrapv -g -O2
>> -I../../../../src/include-c -o brin.o brin.c
>> In file included from ../../../../src/include/port/atomics.h:123,
>>  from ../../../../src/include/utils/dsa.h:17,
>>  from ../../../../src/include/nodes/tidbitmap.h:26,
>>  from ../../../../src/include/access/genam.h:19,
>>  from ../../../../src/include/nodes/execnodes.h:17,
>>  from ../../../../src/include/access/brin.h:14,
>>  from brin.c:18:
>> ../../../../src/include/port/atomics/generic.h:154:3: error: #error "No
>> pg_atomic_test_and_set provided"
>> ../../../../src/include/port/atomics.h: In function
>> 'pg_atomic_init_flag':
>> ../../../../src/include/port/atomics.h:178: warning: implicit
>> declaration of function 'pg_atomic_init_flag_impl'
>> ../../../../src/include/port/atomics.h: In function
>> 'pg_atomic_test_set_flag':
>> ../../../../src/include/port/atomics.h:193: warning: implicit
>> declaration of function 'pg_atomic_test_set_flag_impl'
>> ../../../../src/include/port/atomics.h: In function
>> 'pg_atomic_unlocked_test_flag':
>> ../../../../src/include/port/atomics.h:208: warning: implicit
>> declaration of function 'pg_atomic_unlocked_test_flag_impl'
>> ... and so on.
>>
>> I'm not entirely sure what the intended structure of these header files
>> is.  Maybe Andres can comment.
>>
>
> It seems that on this platform definition of atomics should be provided by
> fallback.h.  But it doesn't because I already defined 
> PG_HAVE_ATOMIC_U32_SUPPORT
> in arch-ppc.h.  I think in this case we shouldn't provide ppc-specific
> implementation of pg_atomic_fetch_mask_add_u32().  However, I don't know
> how to do this assuming arch-ppc.h is included before compiler-specific
> headers.  Thus, in arch-ppc.h we don't know yet if we would find
> implementation of atomics for this platform.  One possible solution is to
> provide assembly implementation for all atomics in arch-ppc.h.
>

BTW, implementing all atomics in arch-ppc.h would be too invasive and
shouldn't be considered for v10.
However, I made the following workaround: declare pg_atomic_uint32 and
pg_atomic_fetch_mask_add_u32_impl() only when we know that generic-gcc.h
would declare gcc-based atomics.
Could you please check it on Apple PPC?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


lwlock-power-5.patch
Description: Binary data



Re: [HACKERS] LWLock optimization for multicore Power machines

2017-03-25 Thread Alexander Korotkov
On Sat, Mar 25, 2017 at 11:32 PM, Tom Lane  wrote:

> Alexander Korotkov  writes:
> > I moved PPC implementation of pg_atomic_fetch_mask_add_u32() into
> > port/atomics/arch-ppc.h.  I also had to declare pg_atomic_uint32 there to
> > satisfy usage of this type as argument
> > of pg_atomic_fetch_mask_add_u32_impl().
>
> Hm, you did something wrong there, because now I get a bunch of failures:
>
> ccache gcc -Wall -Wmissing-prototypes -Wpointer-arith
> -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute
> -Wformat-security -fno-strict-aliasing -fwrapv -g -O2
> -I../../../../src/include-c -o brin.o brin.c
> In file included from ../../../../src/include/port/atomics.h:123,
>  from ../../../../src/include/utils/dsa.h:17,
>  from ../../../../src/include/nodes/tidbitmap.h:26,
>  from ../../../../src/include/access/genam.h:19,
>  from ../../../../src/include/nodes/execnodes.h:17,
>  from ../../../../src/include/access/brin.h:14,
>  from brin.c:18:
> ../../../../src/include/port/atomics/generic.h:154:3: error: #error "No
> pg_atomic_test_and_set provided"
> ../../../../src/include/port/atomics.h: In function 'pg_atomic_init_flag':
> ../../../../src/include/port/atomics.h:178: warning: implicit declaration
> of function 'pg_atomic_init_flag_impl'
> ../../../../src/include/port/atomics.h: In function
> 'pg_atomic_test_set_flag':
> ../../../../src/include/port/atomics.h:193: warning: implicit declaration
> of function 'pg_atomic_test_set_flag_impl'
> ../../../../src/include/port/atomics.h: In function
> 'pg_atomic_unlocked_test_flag':
> ../../../../src/include/port/atomics.h:208: warning: implicit declaration
> of function 'pg_atomic_unlocked_test_flag_impl'
> ... and so on.
>
> I'm not entirely sure what the intended structure of these header files
> is.  Maybe Andres can comment.
>

It seems that on this platform the definition of atomics should be provided by
fallback.h.  But it isn't, because I already defined PG_HAVE_ATOMIC_U32_SUPPORT
in arch-ppc.h.  I think in this case we shouldn't provide a ppc-specific
implementation of pg_atomic_fetch_mask_add_u32().  However, I don't know how
to do that, given that arch-ppc.h is included before the compiler-specific
headers; in arch-ppc.h we don't yet know whether an implementation of atomics
will be found for this platform.  One possible solution is to provide an
assembly implementation of all atomics in arch-ppc.h.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-03-25 Thread Tom Lane
Alexander Korotkov  writes:
> I moved PPC implementation of pg_atomic_fetch_mask_add_u32() into
> port/atomics/arch-ppc.h.  I also had to declare pg_atomic_uint32 there to
> satisfy usage of this type as argument
> of pg_atomic_fetch_mask_add_u32_impl().

Hm, you did something wrong there, because now I get a bunch of failures:

ccache gcc -Wall -Wmissing-prototypes -Wpointer-arith 
-Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute 
-Wformat-security -fno-strict-aliasing -fwrapv -g -O2 -I../../../../src/include 
   -c -o brin.o brin.c
In file included from ../../../../src/include/port/atomics.h:123,
 from ../../../../src/include/utils/dsa.h:17,
 from ../../../../src/include/nodes/tidbitmap.h:26,
 from ../../../../src/include/access/genam.h:19,
 from ../../../../src/include/nodes/execnodes.h:17,
 from ../../../../src/include/access/brin.h:14,
 from brin.c:18:
../../../../src/include/port/atomics/generic.h:154:3: error: #error "No 
pg_atomic_test_and_set provided"
../../../../src/include/port/atomics.h: In function 'pg_atomic_init_flag':
../../../../src/include/port/atomics.h:178: warning: implicit declaration of 
function 'pg_atomic_init_flag_impl'
../../../../src/include/port/atomics.h: In function 'pg_atomic_test_set_flag':
../../../../src/include/port/atomics.h:193: warning: implicit declaration of 
function 'pg_atomic_test_set_flag_impl'
../../../../src/include/port/atomics.h: In function 
'pg_atomic_unlocked_test_flag':
../../../../src/include/port/atomics.h:208: warning: implicit declaration of 
function 'pg_atomic_unlocked_test_flag_impl'
... and so on.

I'm not entirely sure what the intended structure of these header files
is.  Maybe Andres can comment.

regards, tom lane




Re: [HACKERS] LWLock optimization for multicore Power machines

2017-03-25 Thread Alexander Korotkov
On Sat, Mar 25, 2017 at 8:44 PM, Tom Lane  wrote:

> Alexander Korotkov  writes:
> > [ lwlock-power-3.patch ]
>
> I experimented with this patch a bit.  I can't offer help on testing it
> on large PPC machines, but I can report that it breaks the build on
> Apple PPC machines, apparently because of nonstandard assembly syntax.
> I get "Parameter syntax error" on these four lines of assembly:
>
> and 3,r0,r4
> cmpwi   3,0
> add 3,r0,r5
> stwcx.  3,0,r2
>
> I am not sure what the "3"s are meant to be --- if that's a hard-coded
> register number, then it's unacceptable code to begin with, IMO.
> You should be able to get gcc to give you a scratch register of its
> choosing via appropriate use of the register assignment part of the
> asm syntax.  I think there are examples in s_lock.h.
>

Right.  Now I use a variable bound to a register instead of a hard-coded
register for that purpose.

> I'm also unhappy that this patch changes the generic implementation of
> LWLockAttemptLock.  That means that we'd need to do benchmarking on *every*
> machine type to see what we're paying elsewhere for the benefit of PPC.
> It seems unlikely that putting an extra level of function call into that
> hot-spot is zero cost.
>
> Lastly, putting machine-specific code into atomics.c is a really bad idea.
> We have a convention for where to put such code, and that is not it.
>
> You could address both the second and third concerns, I think, by putting
> the PPC asm function into port/atomics/arch-ppc.h (which is where it
> belongs anyway) and making the generic implementation be a "static inline"
> function in port/atomics/fallback.h.  In that way, at least with compilers
> that do inlines sanely, we could expect that this patch doesn't change the
> generated code for LWLockAttemptLock at all on machines other than PPC.
>

I moved the PPC implementation of pg_atomic_fetch_mask_add_u32() into
port/atomics/arch-ppc.h.  I also had to declare pg_atomic_uint32 there so that
the type can be used as an argument of pg_atomic_fetch_mask_add_u32_impl().

I moved the generic implementation of pg_atomic_fetch_mask_add_u32() into
port/atomics/generic.h.  That seems a more appropriate place than fallback.h,
because fallback.h provides definitions for the case when the platform has no
atomics at all, while generic.h defines missing atomics in terms of the ones
that are available.
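
To illustrate the arrangement (a minimal sketch under the assumption that the
macro and function names match those quoted elsewhere in this thread; it is
not the patch itself): arch-ppc.h claims the operation first, and generic.h
only supplies the compare-and-exchange fallback when no platform-specific
version exists.

/* port/atomics/arch-ppc.h (sketch): announce the Power-specific version */
#define PG_HAVE_ATOMIC_FETCH_MASK_ADD_U32
/* ... inline-assembly pg_atomic_fetch_mask_add_u32_impl() goes here ... */

/* port/atomics/generic.h (sketch): CAS-loop fallback, used only if needed */
#ifndef PG_HAVE_ATOMIC_FETCH_MASK_ADD_U32
#define PG_HAVE_ATOMIC_FETCH_MASK_ADD_U32
static inline uint32
pg_atomic_fetch_mask_add_u32_impl(volatile pg_atomic_uint32 *ptr,
                                  uint32 mask_, uint32 add_)
{
    uint32      old_value = pg_atomic_read_u32_impl(ptr);

    /* retry the compare-and-exchange until it succeeds */
    while (true)
    {
        uint32      desired_value = old_value;

        if ((old_value & mask_) == 0)
            desired_value += add_;
        if (pg_atomic_compare_exchange_u32_impl(ptr, &old_value,
                                                desired_value))
            return old_value;
    }
}
#endif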

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


lwlock-power-4.patch
Description: Binary data



Re: [HACKERS] LWLock optimization for multicore Power machines

2017-03-25 Thread Tom Lane
Alexander Korotkov  writes:
> [ lwlock-power-3.patch ]

I experimented with this patch a bit.  I can't offer help on testing it
on large PPC machines, but I can report that it breaks the build on
Apple PPC machines, apparently because of nonstandard assembly syntax.
I get "Parameter syntax error" on these four lines of assembly:

and 3,r0,r4
cmpwi   3,0
add 3,r0,r5
stwcx.  3,0,r2

I am not sure what the "3"s are meant to be --- if that's a hard-coded
register number, then it's unacceptable code to begin with, IMO.
You should be able to get gcc to give you a scratch register of its
choosing via appropriate use of the register assignment part of the
asm syntax.  I think there are examples in s_lock.h.
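
To make the suggestion concrete, here is a hedged sketch (the editor's
illustration, not the committed patch) of letting GCC choose the scratch
registers via "=&r" early-clobber output constraints instead of hard-coding
register 3; the revised patch quoted elsewhere in this thread uses constraints
of this form.  Local labels stand in for the byte-offset branch targets:

#include <stdint.h>

static inline uint32_t
ppc_fetch_mask_add(volatile uint32_t *ptr, uint32_t mask, uint32_t add)
{
    uint32_t    result,
                tmp;    /* both scratch registers chosen by the compiler */

    __asm__ __volatile__(
        "1: lwarx   %0,0,%5     \n"     /* load *ptr and reserve it */
        "   and     %1,%0,%3    \n"     /* tmp = old & mask */
        "   cmpwi   %1,0        \n"
        "   bne-    2f          \n"     /* masked bits set: skip the add */
        "   add     %1,%0,%4    \n"     /* tmp = old + add */
        "   stwcx.  %1,0,%5     \n"     /* conditional store back to *ptr */
        "   bne-    1b          \n"     /* reservation broken: retry */
        "2: isync               \n"
        : "=&r"(result), "=&r"(tmp), "+m"(*ptr)
        : "r"(mask), "r"(add), "r"(ptr)
        : "memory", "cc");

    return result;
}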

I'm also unhappy that this patch changes the generic implementation of
LWLockAttemptLock.  That means that we'd need to do benchmarking on *every*
machine type to see what we're paying elsewhere for the benefit of PPC.
It seems unlikely that putting an extra level of function call into that
hot-spot is zero cost.

Lastly, putting machine-specific code into atomics.c is a really bad idea.
We have a convention for where to put such code, and that is not it.

You could address both the second and third concerns, I think, by putting
the PPC asm function into port/atomics/arch-ppc.h (which is where it
belongs anyway) and making the generic implementation be a "static inline"
function in port/atomics/fallback.h.  In that way, at least with compilers
that do inlines sanely, we could expect that this patch doesn't change the
generated code for LWLockAttemptLock at all on machines other than PPC.

regards, tom lane




Re: [HACKERS] LWLock optimization for multicore Power machines

2017-03-25 Thread Alexander Korotkov
On Thu, Mar 16, 2017 at 8:35 PM, David Steele  wrote:

> On 2/21/17 9:54 AM, Bernd Helmle wrote:
> > On Tuesday, 14.02.2017 at 15:53 +0300, Alexander Korotkov wrote:
> >> +1
> >> And you could try to use pg_wait_sampling
> >>  to sampling of wait
> >> events.
> >
> > I've tried this with your example from your blog post[1] and got this:
> >
> > (pgbench scale 1000)
> >
> > pgbench -Mprepared -S -n -c 100 -j 100 -T 300 -P2 pgbench2
> >
> > SELECT-only:
> >
> > SELECT * FROM profile_log ;
> >  ts |  event_type   | event | count
> > +---+---+---
> >  2017-02-21 15:21:52.45719  | LWLockNamed   | ProcArrayLock | 8
> >  2017-02-21 15:22:11.19594  | LWLockTranche | lock_manager  | 1
> >  2017-02-21 15:22:11.19594  | LWLockNamed   | ProcArrayLock |24
> >  2017-02-21 15:22:31.220803 | LWLockNamed   | ProcArrayLock | 1
> >  2017-02-21 15:23:01.255969 | LWLockNamed   | ProcArrayLock | 1
> >  2017-02-21 15:23:11.272254 | LWLockNamed   | ProcArrayLock | 2
> >  2017-02-21 15:23:41.313069 | LWLockNamed   | ProcArrayLock | 1
> >  2017-02-21 15:24:31.37512  | LWLockNamed   | ProcArrayLock |19
> >  2017-02-21 15:24:41.386974 | LWLockNamed   | ProcArrayLock | 1
> >  2017-02-21 15:26:41.530399 | LWLockNamed   | ProcArrayLock | 1
> > (10 rows)
> >
> > writes pgbench runs have far more events logged, see the attached text
> > file. Maybe this is of interest...
> >
> >
> > [1] http://akorotkov.github.io/blog/2016/03/25/wait_monitoring_9_6/
>
> This patch applies cleanly at cccbdde.  It doesn't break compilation on
> amd64 but I can't test on a Power-based machine
>
> Alexander, have you had a chance to look at Bernd's results?
>

Bernd's results look good.  But it seems that his machine is not big
enough, so the effect of the patch isn't huge.
I will get access to a big enough Power8 machine on Monday, 27 March.  I will
run benchmarks there and post the results immediately.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-03-24 Thread David Steele

Hi Alexander,

On 3/16/17 1:35 PM, David Steele wrote:

On 2/21/17 9:54 AM, Bernd Helmle wrote:

On Tuesday, 14.02.2017 at 15:53 +0300, Alexander Korotkov wrote:

+1
And you could try to use pg_wait_sampling
 to sampling of wait
events.


I've tried this with your example from your blog post[1] and got this:

(pgbench scale 1000)

pgbench -Mprepared -S -n -c 100 -j 100 -T 300 -P2 pgbench2

SELECT-only:

SELECT * FROM profile_log ;
 ts |  event_type   | event | count
+---+---+---
 2017-02-21 15:21:52.45719  | LWLockNamed   | ProcArrayLock | 8
 2017-02-21 15:22:11.19594  | LWLockTranche | lock_manager  | 1
 2017-02-21 15:22:11.19594  | LWLockNamed   | ProcArrayLock |24
 2017-02-21 15:22:31.220803 | LWLockNamed   | ProcArrayLock | 1
 2017-02-21 15:23:01.255969 | LWLockNamed   | ProcArrayLock | 1
 2017-02-21 15:23:11.272254 | LWLockNamed   | ProcArrayLock | 2
 2017-02-21 15:23:41.313069 | LWLockNamed   | ProcArrayLock | 1
 2017-02-21 15:24:31.37512  | LWLockNamed   | ProcArrayLock |19
 2017-02-21 15:24:41.386974 | LWLockNamed   | ProcArrayLock | 1
 2017-02-21 15:26:41.530399 | LWLockNamed   | ProcArrayLock | 1
(10 rows)

writes pgbench runs have far more events logged, see the attached text
file. Maybe this is of interest...


[1] http://akorotkov.github.io/blog/2016/03/25/wait_monitoring_9_6/


This patch applies cleanly at cccbdde.  It doesn't break compilation on
amd64 but I can't test on a Power-based machine

Alexander, have you had a chance to look at Bernd's results?


I'm marking this submission "Waiting for Author" as your input seems to 
be required.


This thread has been idle for a week.  Please respond and/or post a new 
patch by 2017-03-28 00:00 AoE (UTC-12) or this submission will be marked 
"Returned with Feedback".


--
-David
da...@pgmasters.net




Re: [HACKERS] LWLock optimization for multicore Power machines

2017-03-16 Thread David Steele
On 2/21/17 9:54 AM, Bernd Helmle wrote:
> On Tuesday, 14.02.2017 at 15:53 +0300, Alexander Korotkov wrote:
>> +1
>> And you could try to use pg_wait_sampling
>>  to sampling of wait
>> events.
> 
> I've tried this with your example from your blog post[1] and got this:
> 
> (pgbench scale 1000)
> 
> pgbench -Mprepared -S -n -c 100 -j 100 -T 300 -P2 pgbench2
> 
> SELECT-only:
> 
> SELECT * FROM profile_log ;
>  ts |  event_type   | event | count 
> +---+---+---
>  2017-02-21 15:21:52.45719  | LWLockNamed   | ProcArrayLock | 8
>  2017-02-21 15:22:11.19594  | LWLockTranche | lock_manager  | 1
>  2017-02-21 15:22:11.19594  | LWLockNamed   | ProcArrayLock |24
>  2017-02-21 15:22:31.220803 | LWLockNamed   | ProcArrayLock | 1
>  2017-02-21 15:23:01.255969 | LWLockNamed   | ProcArrayLock | 1
>  2017-02-21 15:23:11.272254 | LWLockNamed   | ProcArrayLock | 2
>  2017-02-21 15:23:41.313069 | LWLockNamed   | ProcArrayLock | 1
>  2017-02-21 15:24:31.37512  | LWLockNamed   | ProcArrayLock |19
>  2017-02-21 15:24:41.386974 | LWLockNamed   | ProcArrayLock | 1
>  2017-02-21 15:26:41.530399 | LWLockNamed   | ProcArrayLock | 1
> (10 rows)
> 
> writes pgbench runs have far more events logged, see the attached text
> file. Maybe this is of interest...
> 
> 
> [1] http://akorotkov.github.io/blog/2016/03/25/wait_monitoring_9_6/

This patch applies cleanly at cccbdde.  It doesn't break compilation on
amd64, but I can't test on a Power-based machine.

Alexander, have you had a chance to look at Bernd's results?

-- 
-David
da...@pgmasters.net




Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-21 Thread Bernd Helmle
On Tuesday, 14.02.2017 at 15:53 +0300, Alexander Korotkov wrote:
> +1
> And you could try to use pg_wait_sampling
>  to sampling of wait
> events.

I've tried this with your example from your blog post[1] and got this:

(pgbench scale 1000)

pgbench -Mprepared -S -n -c 100 -j 100 -T 300 -P2 pgbench2

SELECT-only:

SELECT * FROM profile_log ;
 ts |  event_type   | event | count 
+---+---+---
 2017-02-21 15:21:52.45719  | LWLockNamed   | ProcArrayLock | 8
 2017-02-21 15:22:11.19594  | LWLockTranche | lock_manager  | 1
 2017-02-21 15:22:11.19594  | LWLockNamed   | ProcArrayLock |24
 2017-02-21 15:22:31.220803 | LWLockNamed   | ProcArrayLock | 1
 2017-02-21 15:23:01.255969 | LWLockNamed   | ProcArrayLock | 1
 2017-02-21 15:23:11.272254 | LWLockNamed   | ProcArrayLock | 2
 2017-02-21 15:23:41.313069 | LWLockNamed   | ProcArrayLock | 1
 2017-02-21 15:24:31.37512  | LWLockNamed   | ProcArrayLock |19
 2017-02-21 15:24:41.386974 | LWLockNamed   | ProcArrayLock | 1
 2017-02-21 15:26:41.530399 | LWLockNamed   | ProcArrayLock | 1
(10 rows)

The read-write pgbench runs have far more events logged; see the attached text
file.  Maybe this is of interest...


[1] http://akorotkov.github.io/blog/2016/03/25/wait_monitoring_9_6/

 ts |  event_type   |event | count 
+---+--+---
 2017-02-21 15:45:18.875842 | Lock  | tuple|18
 2017-02-21 15:45:18.875842 | LWLockTranche | lock_manager |35
 2017-02-21 15:45:18.875842 | LWLockTranche | buffer_mapping   | 1
 2017-02-21 15:45:18.875842 | LWLockTranche | buffer_content   |   402
 2017-02-21 15:45:18.875842 | LWLockTranche | wal_insert   |  1880
 2017-02-21 15:45:18.875842 | Lock  | transactionid|   716
 2017-02-21 15:45:18.875842 | LWLockNamed   | ProcArrayLock|   644
 2017-02-21 15:45:18.875842 | LWLockNamed   | CLogControlLock  |   255
 2017-02-21 15:45:18.875842 | LWLockNamed   | XidGenLock   |   104
 2017-02-21 15:45:18.875842 | Lock  | extend   |   647
 2017-02-21 15:45:28.889785 | Lock  | tuple|65
 2017-02-21 15:45:28.889785 | LWLockTranche | lock_manager |96
 2017-02-21 15:45:28.889785 | LWLockTranche | buffer_content   |   811
 2017-02-21 15:45:28.889785 | LWLockTranche | wal_insert   |  2542
 2017-02-21 15:45:28.889785 | Lock  | transactionid|  1686
 2017-02-21 15:45:28.889785 | LWLockNamed   | ProcArrayLock|  1194
 2017-02-21 15:45:28.889785 | LWLockNamed   | CLogControlLock  |  1244
 2017-02-21 15:45:28.889785 | LWLockNamed   | XidGenLock   |   401
 2017-02-21 15:45:28.889785 | Lock  | extend   |   818
 2017-02-21 15:45:38.904761 | Lock  | tuple|54
 2017-02-21 15:45:38.904761 | LWLockTranche | lock_manager |87
 2017-02-21 15:45:38.904761 | LWLockTranche | buffer_content   |   756
 2017-02-21 15:45:38.904761 | LWLockTranche | wal_insert   |  2161
 2017-02-21 15:45:38.904761 | Lock  | transactionid|  1624
 2017-02-21 15:45:38.904761 | LWLockNamed   | ProcArrayLock|  1154
 2017-02-21 15:45:38.904761 | LWLockNamed   | CLogControlLock  |  1418
 2017-02-21 15:45:38.904761 | LWLockNamed   | XidGenLock   |   450
 2017-02-21 15:45:38.904761 | Lock  | extend   |   650
 2017-02-21 15:45:48.917774 | Lock  | tuple|51
 2017-02-21 15:45:48.917774 | LWLockTranche | lock_manager |85
 2017-02-21 15:45:48.917774 | LWLockTranche | buffer_content   |   947
 2017-02-21 15:45:48.917774 | LWLockTranche | wal_insert   |  2050
 2017-02-21 15:45:48.917774 | Lock  | transactionid|  1625
 2017-02-21 15:45:48.917774 | LWLockNamed   | ProcArrayLock|  1094
 2017-02-21 15:45:48.917774 | LWLockNamed   | CLogControlLock  |  2575
 2017-02-21 15:45:48.917774 | LWLockNamed   | XidGenLock   |   372
 2017-02-21 15:45:48.917774 | Lock  | extend   |   528
 2017-02-21 15:45:58.930732 | Lock  | tuple|63
 2017-02-21 15:45:58.930732 | LWLockTranche | lock_manager |86
 2017-02-21 15:45:58.930732 | LWLockTranche | buffer_content   |  1026
 2017-02-21 15:45:58.930732 | LWLockTranche | wal_insert   |  1543
 2017-02-21 15:45:58.930732 | Lock  | transactionid|  1794
 2017-02-21 15:45:58.930732 | LWLockNamed   | ProcArrayLock|  1154
 2017-02-21 15:45:58.930732 | LWLockNamed   | CLogControlLock  |  3850
 2017-02-21 15:45:58.930732 | LWLockNamed   | XidGenLock   |   315
 2017-02-21 15:45:58.930732 | Lock  | extend   |   

Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-21 Thread Alexander Korotkov
On Tue, Feb 21, 2017 at 1:47 PM, Bernd Helmle  wrote:

> Am Samstag, den 11.02.2017, 15:42 +0300 schrieb Alexander Korotkov:
> > I think it would make sense to run more kinds of tests.  Could you
> > try set
> > of tests provided by Tomas Vondra?
> > If even we wouldn't see win some of the tests, it would be still
> > valuable
> > to see that there is no regression there.
>
> Sorry for the delay.
>
> But here are the results after having run Tomas' benchmarks scripts.
> The pgbench-ro benchmark shows a clear win of the lwlock-power-3 patch,
> also you see a slight advantage of this patch in the pgbench-rw graph.
> At least, i don't see any signifikant regression somewhere.
>
> The graphs are showing the average values over all five runs for each
> client count.
>

Thank you very much for testing; the results look good.
I'll keep trying to get access to a bigger Power machine, where the advantage
of the patch is expected to be larger.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-16 Thread Bernd Helmle
On Tuesday, 14.02.2017 at 15:53 +0300, Alexander Korotkov wrote:
> +1
> And you could try to use pg_wait_sampling
>  to sampling of wait
> events.

Okay, I'm going to try this.  Currently Tomas' scripts are still
running; I'll provide updates as soon as they are finished.




Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-14 Thread Alexander Korotkov
On Mon, Feb 13, 2017 at 10:17 PM, Tomas Vondra  wrote:

> On 02/13/2017 03:16 PM, Bernd Helmle wrote:
>
>> On Saturday, 11.02.2017 at 15:42 +0300, Alexander Korotkov wrote:
>>
>>> Thus, I see reasons why in your tests absolute results are lower than
>>> in my
>>> previous tests.
>>> 1.  You use 28 physical cores while I was using 32 physical cores.
>>> 2.  You run tests in PowerVM while I was running test on bare metal.
>>> PowerVM could have some overhead.
>>> 3.  I guess you run pgbench on the same machine.  While in my tests
>>> pgbench
>>> was running on another node of IBM E880.
>>>
>>>
>> Yeah, pgbench was running locally. Maybe we can get some resources to
>> run them remotely. Interesting side note: If you run a second postgres
>> instance with the same pgbench in parallel, you get nearly the same
>> transaction throughput as a single instance.
>>
>> Short side note:
>>
>> If you run two Postgres instances concurrently with the same pgbench
>> parameters, you get nearly the same transaction throughput for both
>> instances each as when running against a single instance, e.g.
>>
>>
> That strongly suggests you're hitting some kind of lock. It'd be good to
> know which one. I see you're doing "pgbench -S" which also updates branches
> and other tiny tables - it's possible the sessions are trying to update the
> same row in those tiny tables. You're running with scale 1000, but with 100
> it's still possible thanks to the birthday paradox.
>
> Otherwise it might be interesting to look at sampling wait events, which
> might tell us more about the locks.
>

+1
And you could try using pg_wait_sampling to sample wait events.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-13 Thread Tomas Vondra

On 02/13/2017 03:16 PM, Bernd Helmle wrote:

On Saturday, 11.02.2017 at 15:42 +0300, Alexander Korotkov wrote:

Thus, I see reasons why in your tests absolute results are lower than
in my
previous tests.
1.  You use 28 physical cores while I was using 32 physical cores.
2.  You run tests in PowerVM while I was running test on bare metal.
PowerVM could have some overhead.
3.  I guess you run pgbench on the same machine.  While in my tests
pgbench
was running on another node of IBM E880.



Yeah, pgbench was running locally. Maybe we can get some resources to
run them remotely. Interesting side note: If you run a second postgres
instance with the same pgbench in parallel, you get nearly the same
transaction throughput as a single instance.

Short side note:

If you run two Postgres instances concurrently with the same pgbench
parameters, you get nearly the same transaction throughput for both
instances each as when running against a single instance, e.g.



That strongly suggests you're hitting some kind of lock. It'd be good to 
know which one. I see you're doing "pgbench -S" which also updates 
branches and other tiny tables - it's possible the sessions are trying 
to update the same row in those tiny tables. You're running with scale 
1000, but with 100 it's still possible thanks to the birthday paradox.


Otherwise it might be interesting to look at sampling wait events, which 
might tell us more about the locks.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-13 Thread Bernd Helmle
On Saturday, 11.02.2017 at 15:42 +0300, Alexander Korotkov wrote:
> Thus, I see reasons why in your tests absolute results are lower than
> in my
> previous tests.
> 1.  You use 28 physical cores while I was using 32 physical cores.
> 2.  You run tests in PowerVM while I was running test on bare metal.
> PowerVM could have some overhead.
> 3.  I guess you run pgbench on the same machine.  While in my tests
> pgbench
> was running on another node of IBM E880.
> 

Yeah, pgbench was running locally. Maybe we can get some resources to
run them remotely. Interesting side note: If you run a second postgres
instance with the same pgbench in parallel, you get nearly the same
transaction throughput as a single instance.

Short side note:

If you run two Postgres instances concurrently with the same pgbench
parameters, you get nearly the same transaction throughput for both
instances each as when running against a single instance, e.g.


- single

transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 112
number of threads: 112
duration: 300 s
number of transactions actually processed: 121523797
latency average = 0.276 ms
latency stddev = 0.096 ms
tps = 405075.282309 (including connections establishing)
tps = 405114.299174 (excluding connections establishing)

instance-1/instance-2 concurrently run:

transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 112
number of threads: 56
duration: 300 s
number of transactions actually processed: 120645351
latency average = 0.278 ms
latency stddev = 0.158 ms
tps = 402148.536087 (including connections establishing)
tps = 402199.952824 (excluding connections establishing)

transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 112
number of threads: 56
duration: 300 s
number of transactions actually processed: 121959772
latency average = 0.275 ms
latency stddev = 0.110 ms
tps = 406530.139080 (including connections establishing)
tps = 406556.658638 (excluding connections establishing)

So it looks like the machine has plenty of power, but PostgreSQL is
hitting a limit somewhere.

> Therefore, having lower absolute numbers in your tests, win of LWLock
> optimization is also lower.  That is understandable.  But win of
> LWLock
> optimization is clearly visible definitely exceeds variation.
> 
> I think it would make sense to run more kinds of tests.  Could you
> try set
> of tests provided by Tomas Vondra?
> If even we wouldn't see win some of the tests, it would be still
> valuable
> to see that there is no regression there.

Unfortunately there are some tests for AIX scheduled, which will assign
resources to that LPAR... I've just talked to the people responsible for
the machine, and we can get more time for the Linux tests ;)





Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-11 Thread Tomas Vondra

On 02/11/2017 01:42 PM, Alexander Korotkov wrote:


I think it would make sense to run more kinds of tests.  Could you try
set of tests provided by Tomas Vondra?
If even we wouldn't see win some of the tests, it would be still
valuable to see that there is no regression there.



FWIW it shouldn't be difficult to tweak my scripts and run them on 
another machine. You'd have to customize the parameters (scales, client 
counts, ...) and there are a few hard-coded paths, but that's about it.


regards
Tomas

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-11 Thread Alexander Korotkov
On Wed, Feb 8, 2017 at 5:00 PM, Bernd Helmle  wrote:

> On Tuesday, 07.02.2017 at 16:48 +0300, Alexander Korotkov wrote:
> > But win isn't
> > as high as I observed earlier.  And I wonder why absolute numbers are
> > lower
> > than in our earlier experiments.  We used IBM E880 which is actually
> > two
>
> Did you run your tests on bare metal or were they also virtualized?
>

I ran the tests on bare metal.

> nodes with interconnect.  However interconnect is not fast enough to
> > make
> > one PostgreSQL instance work on both nodes.  Thus, used half of IBM
> > E880
> > which has 4 sockets and 32 physical cores.  While you use IBM E850
> > which is
> > really single node with 4 sockets and 48 physical cores.  Thus, it
> > seems
> > that you have lower absolute numbers on more powerful hardware.  That
> > makes
> > me uneasy and I think we probably don't get the best from this
> > hardware.
> > Just in case, do you use SMT=8?
>
> Yes, SMT=8 was used.
>
> The machine has 4 sockets, 8 Core each, 3.7 GHz clock frequency. The
> Ubuntu LPAR running on PowerVM isn't using all physical cores,
> currently 28 cores are assigned (=224 SMT Threads). The other cores are
> dedicated to the PowerVM hypervisor and a (very) small AIX LPAR.
>

Thank you very much for the explanation.

Thus, I see reasons why the absolute results in your tests are lower than in
my previous tests.
1.  You use 28 physical cores while I was using 32 physical cores.
2.  You run the tests in PowerVM while I was running them on bare metal.
PowerVM could have some overhead.
3.  I guess you run pgbench on the same machine, while in my tests pgbench
was running on another node of the IBM E880.

Therefore, with lower absolute numbers in your tests, the win from the LWLock
optimization is also lower.  That is understandable.  But the win from the
LWLock optimization is clearly visible and definitely exceeds the variation.

I think it would make sense to run more kinds of tests.  Could you try the set
of tests provided by Tomas Vondra?
Even if we don't see a win in some of the tests, it would still be valuable
to see that there is no regression there.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-08 Thread Bernd Helmle
On Tuesday, 07.02.2017 at 16:48 +0300, Alexander Korotkov wrote:
> But win isn't
> as high as I observed earlier.  And I wonder why absolute numbers are
> lower
> than in our earlier experiments.  We used IBM E880 which is actually
> two

Did you run your tests on bare metal or were they also virtualized?

> nodes with interconnect.  However interconnect is not fast enough to
> make
> one PostgreSQL instance work on both nodes.  Thus, used half of IBM
> E880
> which has 4 sockets and 32 physical cores.  While you use IBM E850
> which is
> really single node with 4 sockets and 48 physical cores.  Thus, it
> seems
> that you have lower absolute numbers on more powerful hardware.  That
> makes
> me uneasy and I think we probably don't get the best from this
> hardware.
> Just in case, do you use SMT=8?

Yes, SMT=8 was used.

The machine has 4 sockets, 8 Core each, 3.7 GHz clock frequency. The
Ubuntu LPAR running on PowerVM isn't using all physical cores,
currently 28 cores are assigned (=224 SMT Threads). The other cores are
dedicated to the PowerVM hypervisor and a (very) small AIX LPAR.

I've done more pgbenches this morning with SMT-4, too, fastest result
with master was 

SMT-4

transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 300 s
number of transactions actually processed: 167306423
latency average = 0.179 ms
latency stddev = 0.072 ms
tps = 557685.144912 (including connections establishing)
tps = 557835.683204 (excluding connections establishing)

compared with SMT-8:

transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 300 s
number of transactions actually processed: 173476449
latency average = 0.173 ms
latency stddev = 0.059 ms
tps = 578250.676019 (including connections establishing)
tps = 578412.159601 (excluding connections establishing)

and retried with lwlocks-power-3, SMT-4:

transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 300 s
number of transactions actually processed: 185991995
latency average = 0.161 ms
latency stddev = 0.059 ms
tps = 619970.030069 (including connections establishing)
tps = 620112.263770 (excluding connections establishing)
credativ@iicl183:~/git/postgres$ 

...and SMT-8

transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 300 s
number of transactions actually processed: 185878717
latency average = 0.161 ms
latency stddev = 0.047 ms
tps = 619591.476154 (including connections establishing)
tps = 619655.867280 (excluding connections establishing)

Interestingly the lwlocks patch seems to decrease the SMT influence
factor.

Side note: the system makes around 2 million context switches during the
benchmarks, e.g.

awk '{print $12;}' /tmp/vmstat.out 

cs
10
2153533
2134864
2141623
2126845
2128330
2127454
2145325
2126769
2134492
2130246
2130071
2142660
2136077
2126783
2126107
2125823
2136511
2137752
2146307
2141127

I've also tried a more recent kernel this morning (4.4 vs. 4.8), but
this didn't change the picture.  Is there anything more I can do?






Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-07 Thread Alexander Korotkov
On Tue, Feb 7, 2017 at 3:16 PM, Bernd Helmle  wrote:

> On Monday, 06.02.2017 at 22:44 +0300, Alexander Korotkov wrote:
>
>    2. Also, could you run each test longer (3-5 mins), and run it with a
>    variety of client counts?
>
>
> So here are some other results. I've changed max_connections to 300. The
> bench was prewarmed and run 300s each.
> I could run more benches, if necessary.
>

Thank you very much for benchmarks!

There is a clear win for both lwlock-power-1.patch and lwlock-power-3.patch in
comparison to master.  The difference between lwlock-power-1.patch and
lwlock-power-3.patch seems to be within the margin of error.  But the win isn't
as high as I observed earlier, and I wonder why the absolute numbers are lower
than in our earlier experiments.  We used an IBM E880, which is actually two
nodes with an interconnect.  However, the interconnect is not fast enough to
make one PostgreSQL instance work across both nodes, so we used half of the
IBM E880, which has 4 sockets and 32 physical cores.  You use an IBM E850,
which is really a single node with 4 sockets and 48 physical cores.  Thus, it
seems that you have lower absolute numbers on more powerful hardware.  That
makes me uneasy, and I think we probably aren't getting the best out of this
hardware.  Just in case, do you use SMT=8?

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-07 Thread Bernd Helmle
On Monday, 06.02.2017 at 22:44 +0300, Alexander Korotkov wrote:
>    2. Also, could you run each test longer (3-5 mins), and run it with a
>    variety of client counts?

So here are some other results. I've changed max_connections to 300.
The benchmarks were prewarmed and each ran for 300 s.
I could run more benches, if necessary.

>> clients 10 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 10
number of threads: 10
duration: 300 s
number of transactions actually processed: 47435397
latency average = 0.063 ms
tps = 158117.858124 (including connections establishing)
tps = 158121.264103 (excluding connections establishing)
>> clients 20 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 20
number of threads: 20
duration: 300 s
number of transactions actually processed: 85894461
latency average = 0.070 ms
tps = 286314.516993 (including connections establishing)
tps = 286325.924487 (excluding connections establishing)
>> clients 80 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 80
number of threads: 80
duration: 300 s
number of transactions actually processed: 173185067
latency average = 0.139 ms
tps = 577280.609099 (including connections establishing)
tps = 577482.251431 (excluding connections establishing)
>> clients 100 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 300 s
number of transactions actually processed: 182337951
latency average = 0.165 ms
tps = 607788.998215 (including connections establishing)
tps = 607879.190538 (excluding connections establishing)
>> clients 140 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 140
number of threads: 140
duration: 300 s
number of transactions actually processed: 167425515
latency average = 0.251 ms
tps = 558080.131402 (including connections establishing)
tps = 558230.505974 (excluding connections establishing)
>> clients 200 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 200
number of threads: 200
duration: 300 s
number of transactions actually processed: 150990552
latency average = 0.397 ms
tps = 503290.086636 (including connections establishing)
tps = 503486.345115 (excluding connections establishing)
>> clients 10 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 10
number of threads: 10
duration: 300 s
number of transactions actually processed: 49190440
latency average = 0.061 ms
tps = 163967.958649 (including connections establishing)
tps = 163970.305002 (excluding connections establishing)
>> clients 20 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 20
number of threads: 20
duration: 300 s
number of transactions actually processed: 81753716
latency average = 0.073 ms
tps = 272511.936932 (including connections establishing)
tps = 272518.387577 (excluding connections establishing)
>> clients 80 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 80
number of threads: 80
duration: 300 s
number of transactions actually processed: 173734287
latency average = 0.138 ms
tps = 579110.743767 (including connections establishing)
tps = 579171.270968 (excluding connections establishing)
>> clients 100 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 300 s
number of transactions actually processed: 183013962
latency average = 0.164 ms
tps = 610042.164629 (including connections establishing)
tps = 610124.661236 (excluding connections establishing)
>> clients 140 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 140
number of threads: 140
duration: 300 s
number of transactions actually processed: 169105547
latency average = 0.248 ms
tps = 563679.957119 (including connections establishing)
tps = 563925.230369 (excluding connections establishing)
>> clients 200 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 200
number of threads: 200
duration: 300 s
number of transactions actually processed: 152377373
latency average = 0.394 ms
tps = 507918.997085 (including connections establishing)
tps = 508259.090735 (excluding connections establishing)
>> clients 10 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 10
number of threads: 10
duration: 300 s
number of transactions actually processed: 49556864
latency average = 0.061 ms
tps = 165189.402451 (including connections establishing)
tps = 165191.717347 (excluding connections establishing)
>> clients 20 <
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 20
number of threads: 20
duration: 300 s
number of transactions actually processed: 82017648
latency average = 0.073 ms
tps = 273391.829687 (incl

Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-06 Thread Bernd Helmle
On Mon, 2017-02-06 at 22:44 +0300, Alexander Korotkov wrote:
> Results looks strange for me.  I wonder why there is difference
> between
> lwlock-power-1.patch and lwlock-power-3.patch?  From my intuition, it
> shouldn't be there because it's not much difference between them. 
> Thus, I
> have following questions.
> 
> 

Yeah, i've realized that as well.

>    1. Have you warm up database?  I.e. could you do "SELECT sum(x.x)
> FROM
>    (SELECT pg_prewarm(oid) AS x FROM pg_class WHERE relkind IN ('i',
> 'r')
>    ORDER BY oid) x;" before each run?
>    2. Also could you run each test longer: 3-5 mins, and run them
> with
>    variety of clients count?

The results I've posted were the last 3 runs of 9 in total. I hoped that
would be enough to prewarm the system. I'm going to repeat the tests with
the changes you've requested, though.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-06 Thread Andres Freund
Hi,

On 2017-02-03 20:01:03 +0300, Alexander Korotkov wrote:
> Using assembly in lwlock.c looks rough.  This is why I refactored it by
> introducing new atomic operation pg_atomic_fetch_mask_add_u32 (see
> lwlock-power-2.patch).  It checks that all masked bits are clear and then
> adds to variable.  This atomic have special assembly implementation for
> Power, and generic implementation for other platforms with loop of CAS.
> Probably we would have other implementations for other architectures in
> future.  This level of abstraction is the best I managed to invent.

I think that's a reasonable approach.  And I think it might be worth
experimenting with a more efficient implementation on x86 too, using
hardware lock elision (HLE) and/or TSX.
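
For reference, a minimal sketch of HLE-elided lock acquisition with GCC's
__atomic builtins is below.  It follows the pattern from the GCC manual
rather than anything in the patch, needs -mhle and TSX-capable hardware,
and the names are purely illustrative.

#include <immintrin.h>          /* _mm_pause() */

static volatile int lockvar = 0;        /* 0 = free, 1 = held */

static void
hle_lock(void)
{
    /* try an elided (transactional) acquisition; spin while it is held */
    while (__atomic_exchange_n(&lockvar, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();            /* aborts the failed elision, then spin */
}

static void
hle_unlock(void)
{
    /* writing back the original value commits the elided region */
    __atomic_store_n(&lockvar, 0,
                     __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}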

Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-06 Thread Alexander Korotkov
On Mon, Feb 6, 2017 at 8:28 PM, Bernd Helmle  wrote:

> Am Montag, den 06.02.2017, 16:45 +0300 schrieb Alexander Korotkov:
> > I tried lwlock-power-2.patch on multicore Power machine we have in
> > PostgresPro.
> > I realized that using labels in assembly isn't safe.  Thus, I removed
> > labels and use relative jumps instead (lwlock-power-2.patch).
> > Unfortunately, I didn't manage to make any reasonable benchmarks.
> > This
> > machine runs AIX, and there are a lot of problems which prevents
> > PostgreSQL
> > to show high TPS.  Installing Linux there is not an option too,
> > because
> > that machine is used for tries to make Postgres work properly on AIX.
> > So, benchmarking help is very relevant.  I would very appreciate
> > that.
>
> Okay, so here are some results. The bench runs against
> current PostgreSQL master, 24 GByte shared_buffers configured (128
> GByte physical RAM), max_wal_size=8GB and effective_cache_size=100GB.
>

Thank you very much for testing!

The results look strange to me.  I wonder why there is a difference between
lwlock-power-1.patch and lwlock-power-3.patch?  Intuitively, there
shouldn't be one, because there isn't much difference between the patches.
Thus, I have the following questions.


   1. Have you warmed up the database?  I.e. could you run "SELECT sum(x.x) FROM
   (SELECT pg_prewarm(oid) AS x FROM pg_class WHERE relkind IN ('i', 'r')
   ORDER BY oid) x;" before each run?
   2. Also, could you run each test longer (3-5 mins), and run it with a
   variety of client counts?


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-06 Thread Bernd Helmle
Am Montag, den 06.02.2017, 16:45 +0300 schrieb Alexander Korotkov:
> I tried lwlock-power-2.patch on multicore Power machine we have in
> PostgresPro.
> I realized that using labels in assembly isn't safe.  Thus, I removed
> labels and use relative jumps instead (lwlock-power-2.patch).
> Unfortunately, I didn't manage to make any reasonable benchmarks. 
> This
> machine runs AIX, and there are a lot of problems which prevents
> PostgreSQL
> to show high TPS.  Installing Linux there is not an option too,
> because
> that machine is used for tries to make Postgres work properly on AIX.
> So, benchmarking help is very relevant.  I would very appreciate
> that.

Okay, so here are some results. The bench runs against
current PostgreSQL master, with 24 GByte shared_buffers configured (128
GByte physical RAM), max_wal_size=8GB and effective_cache_size=100GB.

I've just discovered that max_connections was accidentally set to 601;
normally I'd have set something near 110 or so...



transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 16910687
latency average = 0.177 ms
tps = 563654.968585 (including connections establishing)
tps = 563991.459659 (excluding connections establishing)
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 16523247
latency average = 0.182 ms
tps = 550744.748084 (including connections establishing)
tps = 552069.267389 (excluding connections establishing)
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 16796056
latency average = 0.179 ms
tps = 559830.986738 (including connections establishing)
tps = 560333.682010 (excluding connections establishing)



transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 14563500
latency average = 0.206 ms
tps = 485420.764515 (including connections establishing)
tps = 485720.606371 (excluding connections establishing)
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 14618457
latency average = 0.205 ms
tps = 487246.817758 (including connections establishing)
tps = 488117.718816 (excluding connections establishing)
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 14522462
latency average = 0.207 ms
tps = 484052.194063 (including connections establishing)
tps = 485434.771590 (excluding connections establishing)



transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 17946058
latency average = 0.167 ms
tps = 598164.841490 (including connections establishing)
tps = 598582.503248 (excluding connections establishing)
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 17719648
latency average = 0.169 ms
tps = 590621.671588 (including connections establishing)
tps = 591093.333153 (excluding connections establishing)
transaction type: 
scaling factor: 1000
query mode: prepared
number of clients: 100
number of threads: 100
duration: 30 s
number of transactions actually processed: 17722941
latency average = 0.169 ms
tps = 590728.715465 (including connections establishing)
tps = 591619.817043 (excluding connections establishing)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-06 Thread Alexander Korotkov
On Fri, Feb 3, 2017 at 11:31 PM, Bernd Helmle  wrote:

> > UPD: It appears that Postgres Pro have access to big Power machine
> > now.
> > So, I can do testing/benchmarking myself.
>
> We currently also have access to a LPAR on an E850 machine with 4
> sockets POWER8 running a Ubuntu 16.04 LTS Server ppc64el OS. I can do
> some tests next week, if you need to verify your findings.
>

Very good, thank you!

I tried lwlock-power-2.patch on the multicore Power machine we have at
PostgresPro.
I realized that using labels in assembly isn't safe.  Thus, I removed the
labels and use relative jumps instead (lwlock-power-3.patch, attached).
Unfortunately, I didn't manage to run any reasonable benchmarks.  That
machine runs AIX, and there are a lot of problems which prevent PostgreSQL
from showing high TPS there.  Installing Linux there is not an option
either, because that machine is used for efforts to make Postgres work
properly on AIX.
So, benchmarking help is very relevant.  I would very much appreciate it.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


lwlock-power-3.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-03 Thread Robert Haas
On Fri, Feb 3, 2017 at 12:01 PM, Alexander Korotkov
 wrote:
> Hi everybody!
>
> During FOSDEM/PGDay 2017 developer meeting I said that I have some special
> assembly optimization for multicore Power machines.  From the answers of
> other hackers I realized following.
>
> There are some big Power machines with PostgreSQL in production use.  Not as
> many as Intel, but some of them.
> Community could be interested in special assembly optimization for Power
> machines despite cost of maintaining it.
>
> Power processors use specific implementation of atomic operations.  This
> implementation is some kind of optimistic locking. 'lwarx' instruction
> 'reserves index', but that reservation could be broken on 'stwcx', and then
> we have to retry.  So, for instance CAS operation on Power processor is a
> loop.  So, loop of CAS operations is two level nested loop.  Benchmarks
> showed that it becomes real problem for LWLockAttemptLock().  However, one
> actually can put arbitrary logic between 'lwarx' and 'stwcx' and make it a
> single loop.  The downside is that this logic has to be implemented in
> assembly.  See [1] for experiment details.
>
> Results in [1] have a lot of junk which isn't relevant anymore.  This is why
> I draw a separate graph.
>
> power8-lwlock-asm-ro.png – results of read-only pgbench test on IBM E880
> which have 32 physical cores and 256 virtual thread via SMT.  The curves
> have following meaning.
>  * 9.5: unpatched PostgreSQL 9.5
>  * pinunpin-cas: PostgreSQL 9.5 + earlier version of 48354581
>  * pinunpin-lwlock-asm: PostgreSQL 9.5 + earlier version of 48354581 +
> LWLock implementation in assembly.

Cool work.  Obviously there's some work to do before we can merge this
-- vetting the abstraction, performance testing -- but it seems pretty
neat.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-03 Thread Bernd Helmle
On Fri, 2017-02-03 at 20:11 +0300, Alexander Korotkov wrote:
> On Fri, Feb 3, 2017 at 8:01 PM, Alexander Korotkov <
> a.korot...@postgrespro.ru> wrote:
> 
> > Unfortunately, I have no big enough Power machine at hand to
> > reproduce
> > that results.  Actually, I have no Power machine at hand at
> > all.  So,
> > lwlock-power-2.patch was written "blindly".  I would very
> > appreciate if
> > someone would help me with testing and benchmarking.
> > 
> 
> UPD: It appears that Postgres Pro have access to big Power machine
> now.
> So, I can do testing/benchmarking myself.
> 

Hi Alexander,

We currently also have access to an LPAR on an E850 machine with 4
POWER8 sockets, running Ubuntu 16.04 LTS Server ppc64el. I can do
some tests next week, if you need to verify your findings.

Thanks,

Bernd


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] LWLock optimization for multicore Power machines

2017-02-03 Thread Alexander Korotkov
On Fri, Feb 3, 2017 at 8:01 PM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> Unfortunately, I have no big enough Power machine at hand to reproduce
> that results.  Actually, I have no Power machine at hand at all.  So,
> lwlock-power-2.patch was written "blindly".  I would very appreciate if
> someone would help me with testing and benchmarking.
>

UPD: It appears that Postgres Pro has access to a big Power machine now.
So, I can do the testing/benchmarking myself.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] LWLock optimization for multicore Power machines

2017-02-03 Thread Alexander Korotkov
Hi everybody!

During the FOSDEM/PGDay 2017 developer meeting I said that I have a special
assembly optimization for multicore Power machines.  From the answers of
other hackers I realized the following.

   1. There are some big Power machines with PostgreSQL in production use.
   Not as many as on Intel, but some.
   2. The community could be interested in a special assembly optimization
   for Power machines despite the cost of maintaining it.

Power processors use a specific implementation of atomic operations, which
is a kind of optimistic locking: the 'lwarx' instruction places a
reservation on the target address, but that reservation can be lost before
the paired 'stwcx.' completes, and then we have to retry.  So, for
instance, a CAS operation on a Power processor is itself a loop, and a loop
of CAS operations becomes a two-level nested loop.  Benchmarks showed that
this becomes a real problem for LWLockAttemptLock().  However, one can
actually put arbitrary logic between 'lwarx' and 'stwcx.' and make it a
single loop.  The downside is that this logic has to be implemented in
assembly.  See [1] for experiment details.
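
For illustration only (this is not code from any of the attached patches),
a bare 32-bit compare-and-swap on POWER might look roughly like the sketch
below.  The lwarx/stwcx. pair is already a retry loop by itself, so
wrapping such a CAS in another retry loop at the C level produces exactly
the two-level nesting described above.  Memory barriers are omitted for
brevity, and the function name is mine.

#include <stdint.h>
#include <stdbool.h>

/* Sketch of a 32-bit CAS on POWER; returns true if the swap happened. */
static inline bool
cas_u32(volatile uint32_t *ptr, uint32_t expected, uint32_t newval)
{
    uint32_t seen;

    __asm__ __volatile__(
"1:     lwarx   %0,0,%2     \n"     /* load value and take a reservation */
"       cmpw    %0,%3       \n"     /* does it match the expected value? */
"       bne-    2f          \n"     /* no: fail without storing */
"       stwcx.  %4,0,%2     \n"     /* store only if reservation still held */
"       bne-    1b          \n"     /* reservation was lost: retry */
"2:                         \n"
        : "=&r"(seen), "+m"(*ptr)
        : "r"(ptr), "r"(expected), "r"(newval)
        : "cc", "memory");

    return seen == expected;
}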

The results in [1] contain a lot of junk which isn't relevant anymore.
This is why I drew a separate graph.

power8-lwlock-asm-ro.png – results of a read-only pgbench test on an IBM
E880, which has 32 physical cores and 256 virtual threads via SMT.  The
curves have the following meaning.
 * 9.5: unpatched PostgreSQL 9.5
 * pinunpin-cas: PostgreSQL 9.5 + earlier version of 48354581
 * pinunpin-lwlock-asm: PostgreSQL 9.5 + earlier version of 48354581 +
LWLock implementation in assembly.

lwlock-power-1.patch – the patch with the assembly implementation of LWLock
which I used at that time, rebased onto current master.

Using assembly directly in lwlock.c looks rough.  This is why I refactored
it by introducing a new atomic operation, pg_atomic_fetch_mask_add_u32 (see
lwlock-power-2.patch).  It checks that all masked bits are clear and then
adds to the variable.  This atomic has a special assembly implementation
for Power, and a generic implementation for other platforms based on a CAS
loop.  Probably we would have other implementations for other architectures
in the future.  This level of abstraction is the best I managed to invent.
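
To make the generic fallback concrete, here is a minimal sketch of such a
mask-gated fetch-and-add built as a CAS loop.  The function name is mine,
and it uses GCC's __atomic builtins instead of PostgreSQL's pg_atomic_*
layer, so it only approximates what the patch does: add 'add_' to the
variable while none of the bits in 'mask_' are set, and return the
previously observed value either way.

#include <stdint.h>
#include <stdbool.h>

/* Sketch: fetch-and-add gated on a mask, implemented as a CAS loop. */
static inline uint32_t
fetch_mask_add_u32(volatile uint32_t *ptr, uint32_t mask_, uint32_t add_)
{
    uint32_t old_val = __atomic_load_n(ptr, __ATOMIC_RELAXED);

    for (;;)
    {
        uint32_t new_val;

        if (old_val & mask_)
            return old_val;     /* some masked bit is set: do not add */

        new_val = old_val + add_;

        /* on failure, old_val is refreshed with the currently stored value */
        if (__atomic_compare_exchange_n(ptr, &old_val, new_val, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
            return old_val;     /* value observed just before the addition */
    }
}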

Unfortunately, I have no big enough Power machine at hand to reproduce those
results.  Actually, I have no Power machine at hand at all.  So,
lwlock-power-2.patch was written "blindly".  I would very much appreciate it
if someone could help me with testing and benchmarking.

1. https://www.postgresql.org/message-id/CAPpHfdsogj38HTDhNMLE56uJy9N8-
=gya2nnuwbpujgp2n1...@mail.gmail.com

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


lwlock-power-1.patch
Description: Binary data


lwlock-power-2.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers