Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Nils Goroll

On 10/06/15 16:05, Andres Freund wrote:
> it'll nearly always be beneficial to spin

Trouble is that postgres cannot know whether the process holding the lock is
actually running, so if it isn't, all we're doing is burning cycles and making
the problem worse.

By contrast, the kernel does know, so for a (f|m)utex which fails to acquire
immediately and thus needs a syscall, the kernel has the option to spin only
while the lock holder is running (the adaptive mutex).
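
For illustration, a minimal sketch of how such a mutex is requested from glibc
on Linux - an assumption on my part about the variant under discussion:
PTHREAD_MUTEX_ADAPTIVE_NP is a non-portable GNU extension whose glibc
implementation spins a bounded, self-tuning number of times in userspace
before falling back to the futex syscall:

#define _GNU_SOURCE
#include <pthread.h>

pthread_mutex_t lock;

void
init_adaptive_lock(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	/* spin briefly on contention before sleeping in futex_wait() */
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
	pthread_mutex_init(&lock, &attr);
	pthread_mutexattr_destroy(&attr);
}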

Nils




Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Nils Goroll
On larger Linux machines, we have been running with spin locks replaced by
generic posix mutexes for years now. I personally haven't looked at the code for
ages, but we maintain a patch which pretty much does the same thing still:

Ref: http://www.postgresql.org/message-id/4fede0bf.7080...@schokola.de

I understand that there are systems out there which have less efficient posix
mutex implementations than Linux (which uses futexes), but I think it would
still be worth considering doing away with the roll-your-own spinlocks on
systems whose posix mutexes are known to behave.
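
For readers who have not seen it: the core of such a replacement is simply a
mutex living in shared memory, marked PTHREAD_PROCESS_SHARED so that separate
backend processes can use it. A minimal sketch (error handling omitted; the
anonymous mmap merely stands in for postgres' shared memory segment, and the
experimental patch later in this digest additionally tracks a 'held' flag):

#include <pthread.h>
#include <sys/mman.h>

struct shared_lock {
	pthread_mutex_t mutex;
};

struct shared_lock *
shared_lock_create(void)
{
	pthread_mutexattr_t attr;
	struct shared_lock *s;

	s = mmap(NULL, sizeof(*s), PROT_READ | PROT_WRITE,
	    MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutex_init(&s->mutex, &attr);
	pthread_mutexattr_destroy(&attr);
	return (s);
}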

Nils





Re: [HACKERS] s_lock() seems too aggressive for machines with many sockets

2015-06-10 Thread Nils Goroll
>>> As in 200%+ slower.
>> Have you tried PTHREAD_MUTEX_ADAPTIVE_NP ?
> Yes.

Ok, if this can be validated, we might now have a case for which my suggestion
would not be helpful. Reviewed, optimized code with short critical sections and
no hotspots by design could indeed be an exception where the slocks should be
kept as they are.

> Hm, ok. Any chance you have profiles from back then?

IIUC I had shared all relevant data on the list. Does this help?
http://www.postgresql.org/message-id/4fe9eb27.9020...@schokola.de

Thanks, Nils




Re: [HACKERS] xid wrap / optimize frozen tables?

2015-06-04 Thread Nils Goroll
Just FYI: We have worked around these issues by running regular (scripted and
thus controlled) vacuums on all tables but the active ones and adding L2 ZFS
caching (l2arc). I hope to get back to this again soon.




Re: [HACKERS] xid wrap / optimize frozen tables?

2015-05-24 Thread Nils Goroll
Hi Jeff and all,

On 23/05/15 22:13, Jeff Janes wrote:
> Are you sure it is the read IO that causes the problem?

Yes. Trouble is here that we are talking about a 361 GB table

                         List of relations
 Schema |        Name         | Type  |  Owner   |  Size  | Description
--------+---------------------+-------+----------+--------+-------------
 public | *redacted*_y2015m04 | table | postgres | 361 GB |

and while we have

shared_buffers = 325GB
huge_pages = on

this is not the only table of this size (total db size is 1.8 TB) and more
current data got written to *redacted*_y2015m05 (the manually-partitioned table
for May), so most of the m04 data would have been evicted from the cache when
this issue initially surfaced.

There is one application pushing data (bulk inserts) and we have transaction
rates for this app in a log. The moment the vacuum started, these rates dropped.
Unfortunately I cannot present helpful log excerpts here, as the autovacuum has
never finished so far (because the admin killed the db), so we have zero logging
about past autovac events.

At the moment, the application is shut down and the machine is only running the
vacs:

query_start  | 2015-05-22 19:33:52.44334+02
waiting  | f
query| autovacuum: VACUUM public.*redacted*_y2015m04 (to prevent
wraparound)
query_start  | 2015-05-22 19:34:02.46004+02
waiting  | f
query| autovacuum: VACUUM ANALYZE public.*redacted*_y2015m05 (to
prevent wraparound)

so we know that any io must be caused by the vacs:

shell# uptime
 13:33:33 up 1 day, 18:01,  2 users,  load average: 5.75, 12.71, 8.43
shell# zpool iostat
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank1        358G  6.90T    872     55  15.1M  3.08M


Again, we know IO capacity is insufficient - the pool is on just 2 magnetic
disks atm - so an average read rate of 872 IOPS over 42 hours is not even bad.

> I don't know what happened to that, but there is another patch waiting for
> review and testing:
>
> https://commitfest.postgresql.org/5/221/

This is really interesting, thank you very much for the pointer.

Cheers, Nils




[HACKERS] xid wrap / optimize frozen tables?

2015-05-23 Thread Nils Goroll
Hi,

as many before, I ran into the issue of a postgresql database (8.4.1)
- committing many transactions
- to huge volume tables (3-figure GB in size)
- running the xid wrap vacuum (to freeze tuples)

where the additional read IO load has negative impact to the extent of the
system becoming unusable.

Besides considering the fact that this can be worked around by exchanging
printed sheets of paper or plastic (hello to .au) for hardware, I'd very much
appreciate answers to these questions:

* have I missed any more recent improvements regarding this problem? My
understanding is that the full scan for unfrozen tuples can be made less likely
(by reducing the number of transactions and tuning the autovac), but that it is
still required. Is this correct?

* A pretty obvious idea seems to be to add special casing for fully frozen
tables: If we could register the fact that a table is fully frozen (has no new
tuples after the complete-freeze xid), a vacuum would be reduced to just
advancing that last-frozen xid.

It seems like Alvaro Herrera had implemented a patch along the lines of this
idea but I fail to find any other references to it:
http://grokbase.com/t/postgresql/pgsql-hackers/0666gann3t/how-to-avoid-transaction-id-wrap#200606113hlzxtcuzrcsfwc4pxjimyvwgu

Does anyone have pointers what happened to the patch?

Thanks, Nils




Re: [HACKERS] xid wrap / optimize frozen tables?

2015-05-23 Thread Nils Goroll
On 23/05/15 16:50, Tom Lane wrote:
>> as many before, I ran into the issue of a postgresql database (8.4.1)
> *Please* tell us that was a typo.

Yes it was, my sincere apologies. It's 9.4.1


Nils




Re: [HACKERS] reviewing the Reduce sinval synchronization overhead patch / b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4

2012-09-10 Thread Nils Goroll

This is really late, but ...

On 08/21/12 11:20 PM, Robert Haas wrote:

> Our sinval synchronization mechanism has a somewhat weird design that
> makes this OK.

... I don't want to miss the chance to thank you, Robert, for the detailed
explanation. I have backported b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4 to 9.1.3
and it has been working well in production for ~2 weeks, but I must admit that I
had put an unnecessary read barrier into SIGetDataEntries just to be on the safe
side. I will take it out for the next builds.


Thanks, Nils





[HACKERS] reviewing the Reduce sinval synchronization overhead patch / b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4

2012-08-21 Thread Nils Goroll

Hi,

I am reviewing this one year old change again before backporting it to 9.1.3 for 
production use.


ATM, I believe the code is correct, but I don't want to miss the chance to spot
possible errors, so please let me dump my brain on some points:


- IIUC, SIGetDataEntries() can return 0 when in fact there _are_ messages,
  because stateP->hasMessages could come from a stale cache (iow, there is no
  read-membar used, and because we return before acquiring SInvalReadLock (which
  the patch is all about in the first place), we don't get an implicit
  read-membar from a lock op any more).

  What I can't judge on: Would this cause any harm? What are the consequences
  of SIGetDataEntries returning 0 after another process has posted a message
  (looking at global temporal ordering)?

  I don't quite understand the significance of the respective comment in the
  code that the incoherence should be acceptable because the cached read can't
  migrate to before a previous lock acquisition (which itself is clear).

  AcceptInvalidationMessages has a comment that it should be the first thing
  to do in a transaction, and I am not sure if all the consumers have a
  read-membar equivalent operation in place.

  How bad would a missed cache invalidation be? Should we have a read-membar
  in SIGetDataEntries just to be safe?
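
To make the pattern concrete in isolation, here is a minimal C11 sketch of the
flag-publish idiom under discussion (using stdatomic rather than pg's own
primitives, so an illustration, not the sinval code): the release/acquire pair
plays the role of the write/read membars, and a stale 'false' on the reader
side delays - but does not lose - delivery, provided something else eventually
orders the reads:

#include <stdatomic.h>
#include <stdbool.h>

static int msg;                  /* payload, protected by ordering only */
static atomic_bool has_messages; /* the lock-free flag */

void
publish(int m)
{
	msg = m;
	/* release: the payload becomes visible no later than the flag */
	atomic_store_explicit(&has_messages, true, memory_order_release);
}

bool
consume(int *out)
{
	/* acquire: pairs with the release above (the "read membar") */
	if (!atomic_load_explicit(&has_messages, memory_order_acquire))
		return (false);  /* possibly stale false: delayed, not lost */
	*out = msg;
	return (true);
}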

Other notes on points which appear correct to me (really more a note to myself):

- stateP->hasMessages = false in SIGetDataEntries is membar'ed by
  SpinLockAcquire(&vsegP->msgnumLock), so it shouldn't happen that
  clearing hasMessages moves behind reading msgnumLock

  (in which case we could lose the hasMessages flag)

- but it can happen that hasMessages gets set when in fact there is
  nothing to read (which is fine because we then check maxMsgNum)

Nils




Re: [HACKERS] SIGFPE handler is naive

2012-08-14 Thread Nils Goroll

> Should we do something to plug this, and if so, what?  If not, should
> we document the danger?


I am not sure if I really understood the intention of the question correctly,
but if the question was whether pg should try to work around misuse of signals,
then my answer would be a definite no.


IMHO, the signal handler should check if the signal was received for good 
reasons (as proposed by Noah) and handle it appropriately, but otherwise ignore it.
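
A minimal sketch of that idea (my reading of the proposal, not code from pg):
with SA_SIGINFO, si_code distinguishes a genuine arithmetic fault (si_code > 0,
e.g. FPE_INTDIV) from a SIGFPE sent by a user process via kill() (si_code <= 0,
e.g. SI_USER), and the latter can simply be ignored:

#include <signal.h>

static void
fpe_handler(int signo, siginfo_t *info, void *uctx)
{
	if (info->si_code <= 0)
		return;		/* user-sent SIGFPE: ignore it */

	/* a genuine fault (FPE_INTDIV, FPE_FLTDIV, ...): handle it
	 * appropriately, e.g. report division by zero */
}

void
install_fpe_handler(void)
{
	struct sigaction sa = { 0 };

	sa.sa_sigaction = fpe_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGFPE, &sa, NULL);
}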


Nils




Re: [HACKERS] spinlock-pthread_mutex : real world results

2012-08-06 Thread Nils Goroll

Robert,


> 1. How much we're paying for this in the uncontended case?


Using glibc, we have the overhead of an additional library function call, which
we could eliminate by pulling in the code from glibc/nptl or another source of
proven reference code.


The pgbench results I had posted before
(http://archives.postgresql.org/pgsql-hackers/2012-07/msg00061.php) could give
an indication of the higher base cost of the simple approach.



I have mentioned this before: While I agree that minimizing the base overhead is 
good, IMHO, optimizing the worst case is the important part here.


Nils



Re: [HACKERS] Update on the spinlock-pthread_mutex patch experimental: replace s_lock spinlock code with pthread_mutex on linux

2012-07-02 Thread Nils Goroll
Jeff,

without further ado: Thank you, I will go away, run pgbench according to your
advice and report back.

Nils



Re: [HACKERS] Update on the spinlock-pthread_mutex patch experimental: replace s_lock spinlock code with pthread_mutex on linux

2012-07-02 Thread Nils Goroll
just a quick note: I got really interesting results, but the writeup is not done
yet. Will get back to this ASAP.



Re: [HACKERS] spinlock-pthread_mutex : first results with Jeff's pgbench+plsql

2012-07-02 Thread Nils Goroll
> 3.1.7?

Sorry, that was a typo. 9.1.3.

Yes, I had mentioned the version in my initial posting. This version is the one
I need to work on as long as 9.2 is beta.

> A major scalability bottleneck caused by spinlock contention was fixed
> in 9.2 - see commit b4fbe392f8ff6ff1a66b488eb7197eef9e1770a4.  I'm not
> sure that it's very meaningful to do performance testing on versions
> that are known to be out of date.

Apparently I have not pointed this out clearly enough. Sorry.

Nils



[HACKERS] away soon - spinlock-pthread_mutex : first results with Jeff's pgbench+plsql

2012-07-02 Thread Nils Goroll
btw, I really need to let go of this topic to catch up before going away at the
end of the week.

Thanks, Nils



Re: [HACKERS] Update on the spinlock-pthread_mutex patch experimental: replace s_lock spinlock code with pthread_mutex on linux

2012-07-01 Thread Nils Goroll
Thank you, Robert.

As this patch was not targeted towards increasing tps, I am happy to hear
that your benchmarks also suggest that performance is comparable.

But my main question is: how about resource consumption? For the issue I am
working on, my current working hypothesis is that spinning on locks saturates
resources and brings down overall performance in a high-contention situation.

Do you have any getrusage figures or anything equivalent?
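
To clarify what I mean by resource consumption, a sketch of the accounting I
have in mind - sample getrusage() around a run and compare CPU time and
context switches rather than only tps (for the postmaster and its children,
RUSAGE_CHILDREN, or /usr/bin/time as in my earlier mails, gives the aggregate):

#include <stdio.h>
#include <sys/resource.h>

static void
report_usage(const char *label, const struct rusage *a,
    const struct rusage *b)
{
	printf("%s: user %lds sys %lds vol-cs %ld invol-cs %ld\n", label,
	    (long)(b->ru_utime.tv_sec - a->ru_utime.tv_sec),
	    (long)(b->ru_stime.tv_sec - a->ru_stime.tv_sec),
	    b->ru_nvcsw - a->ru_nvcsw,
	    b->ru_nivcsw - a->ru_nivcsw);
}

/* usage:
 *	struct rusage before, after;
 *	getrusage(RUSAGE_CHILDREN, &before);
 *	... run the benchmark ...
 *	getrusage(RUSAGE_CHILDREN, &after);
 *	report_usage("children", &before, &after);
 */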

Thanks, Nils



Re: [HACKERS] Update on the spinlock-pthread_mutex patch experimental: replace s_lock spinlock code with pthread_mutex on linux

2012-07-01 Thread Nils Goroll
> test runs on an IBM POWER7 system with 16 cores, 64 hardware threads.

Could you add the CPU type / clock speed please?



Re: [HACKERS] Update on the spinlock-pthread_mutex patch experimental: replace s_lock spinlock code with pthread_mutex on linux

2012-07-01 Thread Nils Goroll
Hi Robert,

> Spinlock contentions cause tps to go down.  The fact that tps didn't
> change much in this case suggests that either these workloads don't
> generate enough spinlock contention to benefit from your patch, or
> your patch doesn't meaningfully reduce it, or both.  We might need a
> test case that is more spinlock-bound to observe an effect.

Agree. My understanding is that

- for no contention, acquiring a futex should be almost as fast as acquiring a
  spinlock, so we should observe

  - comparable tps
  - comparable resource consumption

  I believe this is what your test has shown for the low concurrency tests.


- for light contention, spinning will be faster than syscalling, so
  we should observe with the patch

  - slightly worse tps
  - more syscalls, otherwise comparable resource consumption

  I believe your test supports the first point for high concurrency tests.


- for high contention, spinning should be
  - unfair (because the time to acquire a lock is not deterministic -
individual threads could starve)
  - much less efficient

  and we should see with the patch

  - slightly better tps if the system is not saturated, because
the next process to acquire a contended futex gets scheduled immediately,
rather than when a process returns from sleeping

- much better tps if the system is saturated / oversubscribed due to
  increased scheduling latency for spinning processes

  - significantly lower resource consumption
- so we should have much more headroom before running into saturation
  as described above


So would it be possible for you to record resource consumption and rerun the 
test?

Thank you, Nils



Re: [HACKERS] Update on the spinlock-pthread_mutex patch experimental: replace s_lock spinlock code with pthread_mutex on linux

2012-07-01 Thread Nils Goroll
Hi Jeff,

>>> It looks like the hacked code is slower than the original.  That
>>> doesn't seem so good to me.  Am I misreading this?
>>
>> No, you are right - in a way. This is not about maximizing tps, this is about
>> maximizing efficiency under load situations
>
> But why wouldn't this maximized efficiency present itself as increased TPS?

Because the latency of lock acquisition influences TPS, but this is only
marginally related to the cost in terms of cpu cycles to acquire the locks.

See my posting as of Sun, 01 Jul 2012 21:02:05 +0200 for an overview of my
understanding.

>>> Also, 20 transactions per connection is not enough of a run to make
>>> any evaluation on.
>>
>> As you can see I've repeated the tests 10 times. I've tested slight variations
>> as mentioned above, so I was looking for quick results with acceptable
>> variation.
>
> Testing it 10 times doesn't necessarily improve things.

My intention was to average over the imperfections of rusage accounting, because
I was mainly interested in lowering rusage, not maximizing tps.

Yes, in order to get reliable results, I'd have to run longer tests, but
interestingly the results from my quick tests already approximated those from
the huge tests Robert has run with respect to the differences between unpatched
and patched.

> You should use at least -T30, rather than -t20.

Thanks for the advice - it is really appreciated, and I will take it when I run
more tests.

But I don't understand yet how to best provoke high spinlock concurrency with
pgbench. Or are there any other test tools out there for this case?

> Anyway, your current benchmark speed of around 600 TPS over such a
> short time period suggests you are limited by fsyncs.

Definitely. I described the setup in my initial posting (why roll-your-own
s_lock? / improving scalability - Tue, 26 Jun 2012 19:02:31 +0200)

> pgbench does as long as that is the case.  You could turn --fsync=off,
> or just change your benchmark to a read-only one like -S, or better
> the -P option I've been trying to get into pgbench.

I don't like to make assumptions which I haven't validated. The system showing
the behavior is designed to write to persistent SSD storage in order to reduce
the risk of data loss by a (BBU) cache failure. Running a test with fsync=off
would diverge even further from reality.

> Does your production server have fast fsyncs (BBU) while your test
> server does not?

No, we're writing directly to SSDs (ref: initial posting).

> The users probably don't care about the load average.  Presumably they
> are unhappy because of lowered throughput (TPS) or higher peak latency
> (-l switch in pgbench).  So I think the only use of load average is to
> verify that your benchmark is nothing like your production workload.
> (But it doesn't go the other way around, just because the load
> averages are similar doesn't mean the actual workloads are.)

Fully agree.


>> Rank  Total duration  Times executed  Av. duration (s)  Query
>> 1     3m39s           83,667          0.00              COMMIT;
>
> So fsync's probably are not totally free on production, but I still
> think they must be much cheaper than on your test box.

Oh, the two are the same. I ran the tests on the prod machine during quiet 
periods.

>> 2     54.4s           2               27.18             SELECT ...
>
> That is interesting.  Maybe those two queries are hammering everything
> else to death.

With 64 cores?

I should have mentioned that these were simply the result of a missing index
when the data was collected.

> But how does the 9th rank through the final rank, cumulatively, stack up?
>
> In other words, how many query-seconds worth of time transpired during
> the 137 wall seconds?  That would give an estimate of how many
> simultaneously active connections the production server has.

Sorry, I should have given you the stats from pgFouine:

Number of unique normalized queries: 507
Number of queries: 295,949
Total query duration: 8m38s
First query: 2012-06-23 14:51:01
Last query: 2012-06-23 14:53:17
Query peak: 6,532 queries/s at 2012-06-23 14:51:33

>> Sorry for having omitted that detail. I had initialized pgbench with -i -s 100
>
> Are you sure?  In an earlier email you reported the entire output of
> pgbench, and it said it was using 10.  Maybe you've changed it since
> then...

good catch, I was wrong in the email you quoted. Sorry.

-bash-4.1$ rsync -av --delete /tmp/test_template_data/ /tmp/data/
...
-bash-4.1$ ./postgres -D /tmp/data -p 55502 &
[1] 38303
-bash-4.1$ LOG:  database system was shut down at 2012-06-26 23:18:42 CEST
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
-bash-4.1$ ./psql -p 55502
psql (9.1.3)
Type "help" for help.
postgres=# select count(*) from pgbench_branches;
 count
-------
    10
(1 row)


Thank you very much, Jeff! The one question remains: Do we really have all we
need to provoke very high lock contention?

Nils


[HACKERS] Update on the spinlock-pthread_mutex patch experimental: replace s_lock spinlock code with pthread_mutex on linux

2012-06-29 Thread Nils Goroll
> connections.  So you need to jack up
> the pgbench scale, or switch to using -N mode.

Sorry for having omitted that detail. I had initialized pgbench with -i -s 100

> Also, you should use -M prepared, otherwise you spend more time
> parsing and planning the statements than executing them.

Ah, good point, thank you. As you will have noticed, I don't have years' worth
of background with pgbench yet.

On 06/28/12 05:29 PM, Robert Haas wrote:

> FWIW, I kicked off a looong benchmarking run on this a couple of days
> ago on the IBM POWER7 box, testing pgbench -S, regular pgbench, and
> pgbench --unlogged-tables at various client counts with and without
> the patch; three half-hour test runs for each test configuration.  It
> should be done tonight and I will post the results once they're in.

Sounds great! I am really curious.

Nils
From 7d615831acfdece846a1b40c4d9e6a3f1af364ef Mon Sep 17 00:00:00 2001
From: Nils Goroll <nils.gor...@uplex.de>
Date: Wed, 27 Jun 2012 13:30:48 +0200
Subject: [PATCH] experimental: use pthread mutexes instead of spinlocks.

Should be advantageous on Linux systems with an efficient futex()-based posix
mutex implementation.

compile with CPPFLAGS=-DUSE_PTHREAD_SLOCK
use -DUSE_PTHREAD_MUTEX_ADAPTIVE_NP to use linux non-portable adaptive mutexes
(which are probably less efficient than normal mutexes)
---
 src/backend/storage/lmgr/s_lock.c |   92 +
 src/include/storage/s_lock.h  |   31 -
 2 files changed, 121 insertions(+), 2 deletions(-)

diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index 13f1dae..1e13631 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -20,6 +20,8 @@
 
#include "storage/s_lock.h"
 
+#ifndef USE_PTHREAD_SLOCK
+
slock_t		dummy_spinlock;
 
 static int spins_per_delay = DEFAULT_SPINS_PER_DELAY;
@@ -201,9 +203,98 @@ update_spins_per_delay(int shared_spins_per_delay)
  * because the definitions for these are split between this file and s_lock.h.
  */
 
+#endif /* !USE_PTHREAD_SLOCK */
 
 #ifdef HAVE_SPINLOCKS  /* skip spinlocks if requested */
 
+#ifdef USE_PTHREAD_SLOCK
+#include <pthread.h>
+void
+posix_lock(volatile slock_t *lock, const char *file, int line)
+{
+   struct slock *s_lock = (slock_t *)lock;
+   int ret;
+
+   /* XXX error recovery ! */
+
+   do {
+       ret = pthread_mutex_lock(&s_lock->mutex);
+   } while (ret != 0);
+   s_lock->held = 1;
+}
+
+void
+posix_unlock(volatile slock_t *lock, const char *file, int line)
+{
+   struct slock *s_lock = (slock_t *)lock;
+   int ret;
+
+   /* XXX error recovery ! */
+
+   s_lock->held = 0;
+   do {
+       ret = pthread_mutex_unlock(&s_lock->mutex);
+   } while (ret != 0);
+}
+
+void
+posix_init(volatile slock_t *lock)
+{
+   struct slock *s_lock = (slock_t *)lock;
+   pthread_mutexattr_t attr;
+   int ret;
+
+   if (pthread_mutexattr_init(&attr))
+       ereport(FATAL,
+           (errmsg("pthread_mutexattr_init failed")));
+
+#ifdef USE_PTHREAD_MUTEX_ADAPTIVE_NP
+   (void)pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
+#else
+   (void)pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_NORMAL);
+#endif
+
+#ifdef USE_PRIO_INHERIT
+   /*
+    * looks like this incurs significant syscall overhead on linux
+    */
+   (void)pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
+#endif
+
+   if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED))
+       ereport(FATAL,
+           (errmsg("pthread_mutexattr_setpshared failed")));
+
+   s_lock->held = 0;
+
+   if (pthread_mutex_init(&s_lock->mutex, &attr))
+       ereport(FATAL,
+           (errmsg("pthread_mutex_init failed")));
+
+   (void)pthread_mutexattr_destroy(&attr);
+}
+
+int
+posix_lock_free(volatile slock_t *lock)
+{
+   struct slock *s_lock = (slock_t *)lock;
+
+   return (s_lock->held == 0);
+}
+
+#if 0
+void
+posix_lock_destroy(slock_t *lock)
+{
+   struct slock *s_lock = lock;
+   int ret;
+
+   ret = pthread_mutex_destroy(&s_lock->mutex);
+   Assert(ret == 0);
+}
+#endif
+
+#else /* !USE_PTHREAD_SLOCK */
 
 #if defined(__GNUC__)
 
@@ -283,6 +374,7 @@ tas_dummy() /* really means: extern int tas(slock_t
 }
 #endif   /* sun3 */
 #endif   /* not __GNUC__ */
+#endif   /* USE_PTHREAD_SLOCK */
 #endif   /* HAVE_SPINLOCKS */
 
 
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 48dc4de..ef4b3fa 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -98,6 +98,30 @@
 
 #ifdef HAVE_SPINLOCKS  /* skip spinlocks if requested */
 
+#ifdef USE_PTHREAD_SLOCK
+#include <pthread.h>
+struct slock {
+   pthread_mutex_t mutex;
+   int held;
+};
+
+typedef struct slock slock_t;
+
+extern void posix_lock(volatile

Re: [HACKERS] Update on the spinlock-pthread_mutex patch experimental: replace s_lock spinlock code with pthread_mutex on linux

2012-06-29 Thread Nils Goroll
> You need at the very, very least 10s.
ok, thanks.



Re: [HACKERS] experimental: replace s_lock spinlock code with pthread_mutex on linux

2012-06-27 Thread Nils Goroll
> Using futexes directly could be even cheaper.
> Note that below this you only have the futex(2) system call.
I was only referring to the fact that we could save one function and one library
call, which could make a difference for the uncontended case.
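
To illustrate what using futexes directly could look like, here is a minimal
lock in the style of Ulrich Drepper's "Futexes Are Tricky" - a sketch, not the
code I am proposing: both uncontended paths are a single atomic op, only the
contended paths enter the kernel, and unlock wakes exactly one waiter:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

/* lock states: 0 = free, 1 = locked, 2 = locked, waiters possible */

static long
sys_futex(atomic_int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void
futex_lock(atomic_int *f)
{
	int c = 0;

	if (atomic_compare_exchange_strong(f, &c, 1))
		return;		/* uncontended: one atomic op, no syscall */
	/* contended: mark "waiters possible" and sleep until we own it */
	while (atomic_exchange(f, 2) != 0)
		sys_futex(f, FUTEX_WAIT, 2);
}

void
futex_unlock(atomic_int *f)
{
	if (atomic_exchange(f, 0) == 2)
		sys_futex(f, FUTEX_WAKE, 1);	/* wake exactly one waiter */
}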



[HACKERS] why roll-your-own s_lock? / improving scalability

2012-06-26 Thread Nils Goroll
Hi,

I am currently trying to understand what looks like really bad scalability of
9.1.3 on a 64-core 512GB RAM system: the system runs OK at 30% usr, but
marginal amounts of additional load seem to push it to 70%, and the application
becomes highly unresponsive.

My current understanding basically matches the issues being addressed by various
9.2 improvements, well summarized in
http://wiki.postgresql.org/images/e/e8/FOSDEM2012-Multi-CPU-performance-in-9.2.pdf

An additional aspect is that, in order to address the latent risk of data loss /
corruption with WBCs and async replication, we have deliberately moved the db
from a similar system with WB-cached storage to ssd-based storage without a WBC,
which, by design, has approx. 100x higher latencies (compared to the best WBC
case), but much higher sustained throughput.


On the new system, even at an acceptable load of 30% user, oprofile makes
significant lock contention apparent:

opreport --symbols --merge tgid -l /mnt/db1/hdd/pgsql-9.1/bin/postgres


Profiling through timer interrupt
samples  %        image name    symbol name
30240    27.9720  postgres      s_lock
5069      4.6888  postgres      GetSnapshotData
3743      3.4623  postgres      AllocSetAlloc
3167      2.9295  libc-2.12.so  strcoll_l
2662      2.4624  postgres      SearchCatCache
2495      2.3079  postgres      hash_search_with_hash_value
2143      1.9823  postgres      nocachegetattr
1860      1.7205  postgres      LWLockAcquire
1642      1.5189  postgres      base_yyparse
1604      1.4837  libc-2.12.so  __strcmp_sse42
1543      1.4273  libc-2.12.so  __strlen_sse42
1156      1.0693  libc-2.12.so  memcpy

Unfortunately I don't have profiling data for the high-load / contention
condition yet, but I fear the picture will be worse and pointing in the same
direction.

<pure speculation>
In particular, the _impression_ is that lock contention could also be related to
I/O latencies, making me fear that cases could exist where spin locks are being
held while blocking on IO.
</pure speculation>


Looking at the code, it appears to me that the roll-your-own s_lock code cannot
handle a couple of cases, for instance it will also spin when the lock holder is
not running at all or blocking on IO (which could even be implicit, e.g. for a
page flush). These issues have long been addressed by adaptive mutexes and 
futexes.

Also, the s_lock code tries to be somewhat adaptive using spins_per_delay (when
having spun for long (but not blocked), spin even longer in the future), which
appears to me to have the potential of becoming highly counter-productive.
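
Schematically (a paraphrase from memory, not the actual s_lock.c), the
behavior in question looks like this:

#include <stdlib.h>
#include <unistd.h>
#include <stdatomic.h>

static int spins_per_delay = 100;	/* the adaptive knob */

void
schematic_s_lock(atomic_flag *lock)
{
	int spins = 0, delays = 0, cur_delay_us = 0;

	while (atomic_flag_test_and_set(lock)) {	/* TAS */
		if (++spins < spins_per_delay)
			continue;			/* keep spinning */
		if (++delays > 1000)
			abort();			/* "stuck spinlock" */
		if (cur_delay_us == 0)
			cur_delay_us = 1000;		/* start at 1ms */
		usleep(cur_delay_us);
		/* randomized exponential backoff, capped around 1s */
		cur_delay_us += (int)(cur_delay_us *
		    (rand() / (double)RAND_MAX));
		if (cur_delay_us > 1000000)
			cur_delay_us = 1000000;
		spins = 0;
	}
	/* the adaptivity: never slept -> spin longer next time,
	 * slept -> spin a bit less */
	if (delays == 0)
		spins_per_delay += 100;
	else
		spins_per_delay -= 1;
	if (spins_per_delay > 1000)
		spins_per_delay = 1000;
	if (spins_per_delay < 10)
		spins_per_delay = 10;
}

If the lock holder is not running at all, every iteration of the spin phase
above is wasted work.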


Now that the scene is set, here's the simple question: Why all this? Why not
simply use posix mutexes which, on modern platforms, will map to efficient
implementations like adaptive mutexes or futexes?

Thanks, Nils



Re: [HACKERS] why roll-your-own s_lock? / improving scalability

2012-06-26 Thread Nils Goroll
Hi Merlin,

> _POSIX_THREAD_PROCESS_SHARED

sure.

> Also, it's forbidden to do things like invoke i/o in the backend while
> holding only a spinlock. As to your larger point, it's an interesting
> assertion -- some data to back it up would help.

Let's see if I can get any. ATM I've only got indications, but no proof.

Nils



Re: [HACKERS] why roll-your-own s_lock? / improving scalability

2012-06-26 Thread Nils Goroll

> But if you start with "let's not support any platforms that don't have this
> feature"

This will never be my intention.

Nils



[HACKERS] experimental: replace s_lock spinlock code with pthread_mutex on linux

2012-06-26 Thread Nils Goroll
> It's
> still unproven whether it'd be an improvement, but you could expect to
> prove it one way or the other with a well-defined amount of testing.

I've hacked the code to use adaptive pthread mutexes instead of spinlocks; see
the attached patch. The patch is for the git head, but it can easily be applied
to 9.1.3, which is what I did for my tests.

This had disastrous effects on Solaris because it does not use anything similar
to futexes for PTHREAD_PROCESS_SHARED mutexes (only the _PRIVATE mutexes do
without syscalls for the simple case).

But I was surprised to see that it works relatively well on linux. Here's a
glimpse of my results:

hacked code 9.1.3:

-bash-4.1$ rsync -av --delete /tmp/test_template_data/ ../data/ ; /usr/bin/time
./postgres -D ../data -p 55502 & ppid=$! ; pid=$(pgrep -P $ppid ) ; sleep 15 ;
./pgbench -c 768 -t 20 -j 128 -p 55502 postgres ; kill $pid
sending incremental file list
...
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 768
number of threads: 128
number of transactions per client: 20
number of transactions actually processed: 15360/15360
tps = 476.873261 (including connections establishing)
tps = 485.964355 (excluding connections establishing)
LOG:  received smart shutdown request
LOG:  autovacuum launcher shutting down
-bash-4.1$ LOG:  shutting down
LOG:  database system is shut down
210.58user 78.88system 0:50.64elapsed 571%CPU (0avgtext+0avgdata
1995968maxresident)k
0inputs+1153872outputs (0major+2464649minor)pagefaults 0swaps

original code (vanilla build on amd64) 9.1.3:

-bash-4.1$ rsync -av --delete /tmp/test_template_data/ ../data/ ; /usr/bin/time
./postgres -D ../data -p 55502 & ppid=$! ; pid=$(pgrep -P $ppid ) ; sleep 15 ;
./pgbench -c 768 -t 20 -j 128 -p 55502 postgres ; kill $pid
sending incremental file list
...
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 768
number of threads: 128
number of transactions per client: 20
number of transactions actually processed: 15360/15360
tps = 499.993685 (including connections establishing)
tps = 510.410883 (excluding connections establishing)
LOG:  received smart shutdown request
-bash-4.1$ LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
196.21user 71.38system 0:47.99elapsed 557%CPU (0avgtext+0avgdata
1360800maxresident)k
0inputs+1147904outputs (0major+2375965minor)pagefaults 0swaps


config:

-bash-4.1$ egrep '^[a-z]' /tmp/test_template_data/postgresql.conf
max_connections = 1800  # (change requires restart)
shared_buffers = 10GB   # min 128kB
temp_buffers = 64MB # min 800kB
work_mem = 256MB                # min 64kB, default 1MB
maintenance_work_mem = 2GB      # min 1MB, default 16MB
bgwriter_delay = 10ms           # 10-10000ms between rounds
bgwriter_lru_maxpages = 1000    # 0-1000 max buffers written/round
bgwriter_lru_multiplier = 10.0  # 0-10.0 multiplier on buffers scanned/round
wal_level = hot_standby         # minimal, archive, or hot_standby
wal_buffers = 64MB              # min 32kB, -1 sets based on shared_buffers
commit_delay = 1                # range 0-100000, in microseconds
datestyle = 'iso, mdy'
lc_messages = 'en_US.UTF-8'     # locale for system error message
lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
lc_numeric = 'en_US.UTF-8'  # locale for number formatting
lc_time = 'en_US.UTF-8' # locale for time formatting
default_text_search_config = 'pg_catalog.english'
seq_page_cost = 1.0 # measured on an arbitrary scale
random_page_cost = 1.5  # same scale as above (default: 4.0)
cpu_tuple_cost = 0.005
cpu_index_tuple_cost = 0.0025
cpu_operator_cost = 0.0001
effective_cache_size = 192GB



So it looks like using pthread_mutexes could at least be an option on Linux.

Using futexes directly could be even cheaper.


As a side note, it looks like I have not expressed myself clearly:

I did not intend to suggest replacing proven, working code (which probably is
the best you can get for some platforms) with posix calls. I apologize for the
provocative question.


Regarding the actual production issue, I did not manage to synthetically provoke
the saturation we are seeing in production using pgbench - I could not even get
anywhere near the production load. So I cannot currently test if reducing the
amount of spinning and waking up exactly one waiter (which is what linux/nptl
pthread_mutex_unlock does) would solve/mitigate the production issue I am
working on, and I'd highly appreciate any pointers in this direction.

Cheers, Nils
diff --git a/src/backend/storage/lmgr/s_lock.c b/src/backend/storage/lmgr/s_lock.c
index bc8d89f..a45fdf6 100644
--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -20,6 +20,8 @@