Re: pgbench cpu overhead (was Re: [HACKERS] lazy vxid locks, v1)
On 07/24/2011 03:50 AM, Jeff Janes wrote:

On Mon, Jun 13, 2011 at 7:03 AM, Stefan Kaltenbrunner ste...@kaltenbrunner.cc wrote:

On 06/13/2011 01:55 PM, Stefan Kaltenbrunner wrote: [...] all those tests are done with pgbench running on the same box - which has a noticeable impact on the results because pgbench is using ~1 core per 8 cores of the backend tested in CPU resources - though I don't think it causes any changes in the results that would show the performance behaviour in a different light.

actual testing against sysbench with the very same workload shows the following performance behaviour:

with 40 threads (aka the peak performance point):
pgbench:  223308 tps
sysbench: 311584 tps

with 160 threads (backend contention dominated):
pgbench:  57075 tps
sysbench: 43437 tps

so it seems that sysbench has significantly less overhead than pgbench, and the lower throughput at the higher concurrency seems to be caused by sysbench being able to stress the backend even more than pgbench can.

for those curious - the profile for pgbench looks like:

samples  %        symbol name
29378    41.9087  doCustom
17502    24.9672  threadRun
 7629    10.8830  pg_strcasecmp
 5871     8.3752  compareVariables
 2568     3.6633  getVariable
 2167     3.0913  putVariable
 2065     2.9458  replaceVariable
 1971     2.8117  parseVariable
  534     0.7618  xstrdup
  278     0.3966  xrealloc
  137     0.1954  xmalloc

Hi Stefan,

How was this profile generated? I get a similar profile using --enable-profiling and gprof, but I find it not believable. The complete absence of any calls to libpq is not credible. I don't know about your profiler, but with gprof they should be listed in the call graph even if they take a negligible amount of time. So I think pgbench is linking to libpq libraries that do not themselves support profiling (I have no idea how that could happen though). If the call graphs are not getting recorded correctly, surely the timing can't be reliable either.
hmm - the profile was generated using oprofile, but now that you are mentioning this aspect, I suppose that this was a profile run without opcontrol --separate=lib... I'm not currently in a position to retest that - but maybe you could do a run?

(I also tried profiling pgbench with perf, but in that case I get nothing other than kernel and libc calls showing up. I don't know what that means.)

To support this, I've dummied up doCustom so that it does all the work of deciding what needs to happen, executing the metacommands, interpolating the variables into the SQL string, but then simply refrains from calling the PQ functions to send and receive the query and response. (I had to make a few changes around the select loop in threadRun to support this.) The result is that the dummy pgbench is charged with only 57% more CPU time than the stock one, but it gets over 10 times as many transactions done. I think this supports the notion that the CPU bottleneck is not in pgbench.c, but somewhere in libpq or the kernel.

interesting - iirc we actually had some reports about current libpq behaviour causing scaling issues on some OSes - see http://archives.postgresql.org/pgsql-hackers/2009-06/msg00748.php and some related threads. Iirc the final patch for that was never applied though and the original author lost interest; I think I was able to measure some noticeable performance gains back then, but I don't think I still have the numbers anywhere.

Stefan

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: pgbench cpu overhead (was Re: [HACKERS] lazy vxid locks, v1)
Jeff Janes jeff.ja...@gmail.com writes:

How was this profile generated? I get a similar profile using --enable-profiling and gprof, but I find it not believable. The complete absence of any calls to libpq is not credible. I don't know about your profiler, but with gprof they should be listed in the call graph even if they take a negligible amount of time. So I think pgbench is linking to libpq libraries that do not themselves support profiling (I have no idea how that could happen though). If the call graphs are not getting recorded correctly, surely the timing can't be reliable either.

Last I checked, gprof simply does not work for shared libraries on Linux --- is that what you're testing on? If so, try oprofile or some other Linux-specific solution.

regards, tom lane
Re: pgbench cpu overhead (was Re: [HACKERS] lazy vxid locks, v1)
Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:

interesting - iirc we actually had some reports about current libpq behaviour causing scaling issues on some OSes - see http://archives.postgresql.org/pgsql-hackers/2009-06/msg00748.php and some related threads. Iirc the final patch for that was never applied though and the original author lost interest, I think that I was able to measure some noticable performance gains back in the days but I don't think I still have the numbers somewhere.

Huh? That patch did get applied in some form or other -- at least, libpq does contain references to both SO_NOSIGPIPE and MSG_NOSIGNAL these days.

regards, tom lane
Re: pgbench cpu overhead (was Re: [HACKERS] lazy vxid locks, v1)
On 07/24/2011 05:55 PM, Tom Lane wrote:

Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes: interesting - iirc we actually had some reports about current libpq behaviour causing scaling issues on some OSes - see http://archives.postgresql.org/pgsql-hackers/2009-06/msg00748.php and some related threads. Iirc the final patch for that was never applied though and the original author lost interest, I think that I was able to measure some noticable performance gains back in the days but I don't think I still have the numbers somewhere.

Huh? That patch did get applied in some form or other -- at least, libpq does contain references to both SO_NOSIGPIPE and MSG_NOSIGNAL these days.

hmm yeah - you are right; when I looked that up a few hours ago I failed to find the right commit, but it was indeed committed: http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=cea80e726edd42a39bb0220290738f7825de8e57

I think I mentally mixed that up with the "compare word-at-a-time in bcTruelen" patch that was also discussed for affecting query rates for trivial queries. I actually wonder if -HEAD would show that issue even more clearly now that we have parts of Robert's performance work in the tree...

Stefan
Re: pgbench cpu overhead (was Re: [HACKERS] lazy vxid locks, v1)
On Mon, Jun 13, 2011 at 7:03 AM, Stefan Kaltenbrunner ste...@kaltenbrunner.cc wrote:

On 06/13/2011 01:55 PM, Stefan Kaltenbrunner wrote: [...] all those tests are done with pgbench running on the same box - which has a noticeable impact on the results because pgbench is using ~1 core per 8 cores of the backend tested in CPU resources - though I don't think it causes any changes in the results that would show the performance behaviour in a different light.

actual testing against sysbench with the very same workload shows the following performance behaviour:

with 40 threads (aka the peak performance point):
pgbench:  223308 tps
sysbench: 311584 tps

with 160 threads (backend contention dominated):
pgbench:  57075 tps
sysbench: 43437 tps

so it seems that sysbench has significantly less overhead than pgbench, and the lower throughput at the higher concurrency seems to be caused by sysbench being able to stress the backend even more than pgbench can.

for those curious - the profile for pgbench looks like:

samples  %        symbol name
29378    41.9087  doCustom
17502    24.9672  threadRun
 7629    10.8830  pg_strcasecmp
 5871     8.3752  compareVariables
 2568     3.6633  getVariable
 2167     3.0913  putVariable
 2065     2.9458  replaceVariable
 1971     2.8117  parseVariable
  534     0.7618  xstrdup
  278     0.3966  xrealloc
  137     0.1954  xmalloc

Hi Stefan,

How was this profile generated? I get a similar profile using --enable-profiling and gprof, but I find it not believable. The complete absence of any calls to libpq is not credible. I don't know about your profiler, but with gprof they should be listed in the call graph even if they take a negligible amount of time. So I think pgbench is linking to libpq libraries that do not themselves support profiling (I have no idea how that could happen though). If the call graphs are not getting recorded correctly, surely the timing can't be reliable either.

(I also tried profiling pgbench with perf, but in that case I get nothing other than kernel and libc calls showing up. I don't know what that means.)

To support this, I've dummied up doCustom so that it does all the work of deciding what needs to happen, executing the metacommands, interpolating the variables into the SQL string, but then simply refrains from calling the PQ functions to send and receive the query and response. (I had to make a few changes around the select loop in threadRun to support this.) The result is that the dummy pgbench is charged with only 57% more CPU time than the stock one, but it gets over 10 times as many transactions done. I think this supports the notion that the CPU bottleneck is not in pgbench.c, but somewhere in libpq or the kernel.

Cheers, Jeff
Re: [HACKERS] lazy vxid locks, v1
On Jun 12, 2011, at 23:39, Robert Haas wrote:

So, the majority (60%) of the excess spinning appears to be due to SInvalReadLock. A good chunk are due to ProcArrayLock (25%).

Hm, sizeof(LWLock) is 24 on x86-64, making sizeof(LWLockPadded) 32. However, cache lines are 64 bytes large on recent Intel CPUs AFAIK, so I guess that two adjacent LWLocks currently share one cache line. Currently, the ProcArrayLock has index 4 while SInvalReadLock has index 5, so if I'm not mistaken, exactly the two locks where you saw the largest contention are on the same cache line... Might make sense to try and see if these numbers change if you either make LWLockPadded 64 bytes or arrange the locks differently...

best regards, Florian Pflug
Re: [HACKERS] lazy vxid locks, v1
On Mon, Jun 13, 2011 at 8:10 PM, Jeff Janes jeff.ja...@gmail.com wrote:

On Sun, Jun 12, 2011 at 2:39 PM, Robert Haas robertmh...@gmail.com wrote: ... Profiling reveals that the system spends enormous amounts of CPU time in s_lock. LWLOCK_STATS reveals that the only lwlock with significant amounts of blocking is the BufFreelistLock;

This is curious. Clearly the entire working set fits in RAM, or you wouldn't be getting numbers like this. But does the entire working set fit in shared_buffers? If so, you shouldn't see any traffic on BufFreelistLock once all the data is read in. I've only seen contention here when all data fits in OS cache memory but not in shared_buffers.

Yeah, that does seem odd:

rhaas=# select pg_size_pretty(pg_database_size(current_database()));
 pg_size_pretty
 1501 MB
(1 row)

rhaas=# select pg_size_pretty(pg_table_size('pgbench_accounts'));
 pg_size_pretty
 1281 MB
(1 row)

rhaas=# select pg_size_pretty(pg_table_size('pgbench_accounts_pkey'));
 pg_size_pretty
 214 MB
(1 row)

rhaas=# show shared_buffers;
 shared_buffers
 8GB
(1 row)

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: pgbench cpu overhead (was Re: [HACKERS] lazy vxid locks, v1)
On Mon, Jun 13, 2011 at 9:09 PM, Alvaro Herrera alvhe...@commandprompt.com wrote:

I noticed that pgbench's doCustom (the function highest in the profile posted) returns doing nothing if the connection is supposed to be sleeping; seems an open door for busy waiting. I didn't check the rest of the code to see if there's something avoiding that condition.

Yes, there is a select in threadRun that avoids that. Also, I don't think anyone would put a sleep in this particular type of pgbench run.

I also noticed that it seems to be very liberal about calling INSTR_TIME_SET_CURRENT in the same function, which perhaps could be optimized by calling it a single time at entry and reusing the value, but I guess that would show up in the profile as a kernel call so it's maybe not a problem.

I think that only gets called when you specifically ask for latencies or for logging, or when making a new connection (which should be rare).

Cheers, Jeff
Re: pgbench cpu overhead (was Re: [HACKERS] lazy vxid locks, v1)
On 06/14/2011 02:27 AM, Jeff Janes wrote:

On Mon, Jun 13, 2011 at 7:03 AM, Stefan Kaltenbrunner ste...@kaltenbrunner.cc wrote: ... so it seems that sysbench has significantly less overhead than pgbench, and the lower throughput at the higher concurrency seems to be caused by sysbench being able to stress the backend even more than pgbench can.

Hi Stefan,

pgbench sends each query (per connection) and waits for the reply before sending another. Do we know whether sysbench does that, or if it just stuffs the kernel's IPC buffer full of queries without synchronously waiting for individual replies? I can't get sysbench to make for me, or I'd strace it in single client mode and see what kind of messages are going back and forth.

yeah - sysbench compiled from a release tarball needs some autoconf/makefile hackery to get running on a modern system - but I can provide you with the data you are interested in if you tell me exactly what you are looking for...

Stefan
Re: [HACKERS] lazy vxid locks, v1
On 06/12/2011 11:39 PM, Robert Haas wrote:

Here is a patch that applies over the "reducing the overhead of frequent table locks" (fastlock-v3) patch and allows heavyweight VXID locks to spring into existence only when someone wants to wait on them. I believe there is a large benefit to be had from this optimization, because the combination of these two patches virtually eliminates lock manager traffic on pgbench -S workloads. However, there are several flies in the ointment.

1. It's a bit of a kludge. I leave it to readers of the patch to determine exactly what about this patch they think is kludgey, but it's likely not the empty set. I suspect that MyProc-fpLWLock needs to be renamed to something a bit more generic if we're going to use it like this, but I don't immediately know what to call it. Also, the mechanism whereby we take SInvalWriteLock to work out the mapping from BackendId to PGPROC * is not exactly awesome. I don't think it matters from a performance point of view, because operations that need VXID locks are sufficiently rare that the additional lwlock traffic won't matter a bit. However, we could avoid this altogether if we rejiggered the mechanism for allocating PGPROCs and backend IDs. Right now, we allocate PGPROCs off of linked lists, except for auxiliary procs which allocate them by scanning a three-element array for an empty slot. Then, when the PGPROC subscribes to sinval, the sinval mechanism allocates a backend ID by scanning for the lowest unused backend ID in the ProcState array. If we changed the logic for allocating PGPROCs to mimic what the sinval queue currently does, then the backend ID could be defined as the offset into the PGPROC array. Translating between a backend ID and a PGPROC * now becomes a matter of pointer arithmetic. Not sure if this is worth doing.

2. Bad things happen with large numbers of connections.
This patch increases peak performance, but as you increase the number of concurrent connections beyond the number of CPU cores, performance drops off faster with the patch than without it. For example, on the 32-core loaner from Nate Boley, using 80 pgbench -S clients, unpatched HEAD runs at ~36K TPS; with fastlock, it jumps up to about ~99K TPS; with this patch also applied, it drops down to about ~64K TPS, despite the fact that nearly all the contention on the lock manager locks has been eliminated. On Stefan Kaltenbrunner's 40-core box, he was actually able to see performance drop down below unpatched HEAD with this applied! This is immensely counterintuitive. What is going on?

just to add actual new numbers to the discussion (pgbench -n -S -T 120 -c X -j X) on that particular 40 cores / 80 threads box:

unpatched:
c1:   tps = 7808.098053 (including connections establishing)
c4:   tps = 29941.444359 (including connections establishing)
c8:   tps = 58930.293850 (including connections establishing)
c16:  tps = 106911.385826 (including connections establishing)
c24:  tps = 117401.654430 (including connections establishing)
c32:  tps = 110659.627803 (including connections establishing)
c40:  tps = 107689.945323 (including connections establishing)
c64:  tps = 104835.182183 (including connections establishing)
c80:  tps = 101885.549081 (including connections establishing)
c160: tps = 92373.395791 (including connections establishing)
c200: tps = 90614.141246 (including connections establishing)

fast locks:
c1:   tps = 7710.824723 (including connections establishing)
c4:   tps = 29653.578364 (including connections establishing)
c8:   tps = 58827.195578 (including connections establishing)
c16:  tps = 112814.382204 (including connections establishing)
c24:  tps = 154559.012960 (including connections establishing)
c32:  tps = 189281.391250 (including connections establishing)
c40:  tps = 215807.263233 (including connections establishing)
c64:  tps = 180644.527322 (including connections establishing)
c80:  tps = 118266.615543 (including connections establishing)
c160: tps = 68957.22 (including connections establishing)
c200: tps = 68803.801091 (including connections establishing)

fast locks + lazy vxid:
c1:   tps = 7828.644389 (including connections establishing)
c4:   tps = 30520.558169 (including connections establishing)
c8:   tps = 60207.396385 (including connections establishing)
c16:  tps = 117923.775435 (including connections establishing)
c24:  tps = 158775.317590 (including connections establishing)
c32:  tps = 195768.530589 (including connections establishing)
c40:  tps = 223308.779212 (including connections establishing)
c64:  tps = 152848.742883 (including connections establishing)
c80:  tps = 65738.046558 (including connections establishing)
c160: tps = 57075.304457 (including connections establishing)
c200: tps = 59107.675182 (including connections establishing)

so my reading of that is that we currently only scale well to ~12
Re: [HACKERS] lazy vxid locks, v1
Stefan Kaltenbrunner wrote:

on that particular 40 cores / 80 threads box:

unpatched:
c40: tps = 107689.945323 (including connections establishing)
c80: tps = 101885.549081 (including connections establishing)

fast locks:
c40: tps = 215807.263233 (including connections establishing)
c80: tps = 118266.615543 (including connections establishing)

fast locks + lazy vxid:
c40: tps = 223308.779212 (including connections establishing)
c80: tps = 65738.046558 (including connections establishing)

Is there any way to disable the HT (or whatever technology attempts to make each core look like 2)? In my benchmarking that has kept performance from tanking as badly when a large number of processes are contending for CPU.

-Kevin
Re: [HACKERS] lazy vxid locks, v1
On 06/13/2011 02:29 PM, Kevin Grittner wrote:

Stefan Kaltenbrunner wrote: on that particular 40 cores / 80 threads box:

unpatched:
c40: tps = 107689.945323 (including connections establishing)
c80: tps = 101885.549081 (including connections establishing)

fast locks:
c40: tps = 215807.263233 (including connections establishing)
c80: tps = 118266.615543 (including connections establishing)

fast locks + lazy vxid:
c40: tps = 223308.779212 (including connections establishing)
c80: tps = 65738.046558 (including connections establishing)

Is there any way to disable the HT (or whatever technology attempts to make each core look like 2)? In my benchmarking that has kept performance from tanking as badly when a large number of processes are contending for CPU.

I can do that tomorrow, but I have now done a fair amount of benchmarking on that box using various tests, and for CPU-intense workloads (various math stuff, parallel compiles of the linux kernel, some inhouse stuff, and some other database) I usually get a 60-70x speedup over just using a single core, and most recent CPUs (this one is actually a brand new Westmere-EX) showed pretty good scaling with HT/threading. I'm actually pretty sure that at least in some benchmarks it was not HT that was the real problem but rather our general inability to scale much beyond 10-12 cores for reads, and even worse for writes (due to WAL contention).

Stefan
Re: [HACKERS] lazy vxid locks, v1
On 06/12/2011 11:39 PM, Robert Haas wrote:

Here is a patch that applies over the "reducing the overhead of frequent table locks" (fastlock-v3) patch and allows heavyweight VXID locks to spring into existence only when someone wants to wait on them. I believe there is a large benefit to be had from this optimization, because the combination of these two patches virtually eliminates lock manager traffic on pgbench -S workloads. However, there are several flies in the ointment.

1. It's a bit of a kludge. I leave it to readers of the patch to determine exactly what about this patch they think is kludgey, but it's likely not the empty set. I suspect that MyProc-fpLWLock needs to be renamed to something a bit more generic if we're going to use it like this, but I don't immediately know what to call it. Also, the mechanism whereby we take SInvalWriteLock to work out the mapping from BackendId to PGPROC * is not exactly awesome. I don't think it matters from a performance point of view, because operations that need VXID locks are sufficiently rare that the additional lwlock traffic won't matter a bit. However, we could avoid this altogether if we rejiggered the mechanism for allocating PGPROCs and backend IDs. Right now, we allocate PGPROCs off of linked lists, except for auxiliary procs which allocate them by scanning a three-element array for an empty slot. Then, when the PGPROC subscribes to sinval, the sinval mechanism allocates a backend ID by scanning for the lowest unused backend ID in the ProcState array. If we changed the logic for allocating PGPROCs to mimic what the sinval queue currently does, then the backend ID could be defined as the offset into the PGPROC array. Translating between a backend ID and a PGPROC * now becomes a matter of pointer arithmetic. Not sure if this is worth doing.

2. Bad things happen with large numbers of connections.
This patch increases peak performance, but as you increase the number of concurrent connections beyond the number of CPU cores, performance drops off faster with the patch than without it. For example, on the 32-core loaner from Nate Boley, using 80 pgbench -S clients, unpatched HEAD runs at ~36K TPS; with fastlock, it jumps up to about ~99K TPS; with this patch also applied, it drops down to about ~64K TPS, despite the fact that nearly all the contention on the lock manager locks has been eliminated. On Stefan Kaltenbrunner's 40-core box, he was actually able to see performance drop down below unpatched HEAD with this applied! This is immensely counterintuitive. What is going on? Profiling reveals that the system spends enormous amounts of CPU time in s_lock.

just to reiterate that with numbers - at 160 threads with both patches applied the profile looks like:

samples  %        image name  symbol name
828794   75.8662  postgres    s_lock
 51672    4.7300  postgres    LWLockAcquire
 51145    4.6817  postgres    LWLockRelease
 17636    1.6144  postgres    GetSnapshotData
  7521    0.6885  postgres    hash_search_with_hash_value
  6193    0.5669  postgres    AllocSetAlloc
  4527    0.4144  postgres    SearchCatCache
  4521    0.4138  postgres    PinBuffer
  3385    0.3099  postgres    SIGetDataEntries
  3160    0.2893  postgres    PostgresMain
  2706    0.2477  postgres    _bt_compare
  2687    0.2460  postgres    fmgr_info_cxt_security
  1963    0.1797  postgres    UnpinBuffer
  1846    0.1690  postgres    LockAcquireExtended
  1770    0.1620  postgres    exec_bind_message
  1730    0.1584  postgres    hash_any
  1644    0.1505  postgres    ExecInitExpr

even at the peak performance spot of the combined patch-set (-c40) the contention is noticeable in the profile:

samples  %        image name  symbol name
1497826  22.0231  postgres    s_lock
 592104   8.7059  postgres    LWLockAcquire
 512213   7.5313  postgres    LWLockRelease
 230050   3.3825  postgres    GetSnapshotData
 176252   2.5915  postgres    AllocSetAlloc
 155122   2.2808  postgres    hash_search_with_hash_value
 116235   1.7091  postgres    SearchCatCache
 110197   1.6203  postgres    _bt_compare
  94101   1.3836  postgres    PinBuffer
  80119   1.1780  postgres    PostgresMain
  65584   0.9643  postgres    fmgr_info_cxt_security
  55198   0.8116  postgres    hash_any
  52872   0.7774  postgres    exec_bind_message
  48438   0.7122  postgres    LockReleaseAll
  46631   0.6856  postgres    MemoryContextAlloc
  45909   0.6750  postgres    ExecInitExpr
  42293   0.6219  postgres
pgbench cpu overhead (was Re: [HACKERS] lazy vxid locks, v1)
On 06/13/2011 01:55 PM, Stefan Kaltenbrunner wrote: [...] all those tests are done with pgbench running on the same box - which has a noticeable impact on the results because pgbench is using ~1 core per 8 cores of the backend tested in CPU resources - though I don't think it causes any changes in the results that would show the performance behaviour in a different light.

actual testing against sysbench with the very same workload shows the following performance behaviour:

with 40 threads (aka the peak performance point):
pgbench:  223308 tps
sysbench: 311584 tps

with 160 threads (backend contention dominated):
pgbench:  57075 tps
sysbench: 43437 tps

so it seems that sysbench has significantly less overhead than pgbench, and the lower throughput at the higher concurrency seems to be caused by sysbench being able to stress the backend even more than pgbench can.

for those curious - the profile for pgbench looks like:

samples  %        symbol name
29378    41.9087  doCustom
17502    24.9672  threadRun
 7629    10.8830  pg_strcasecmp
 5871     8.3752  compareVariables
 2568     3.6633  getVariable
 2167     3.0913  putVariable
 2065     2.9458  replaceVariable
 1971     2.8117  parseVariable
  534     0.7618  xstrdup
  278     0.3966  xrealloc
  137     0.1954  xmalloc

Stefan
Re: [HACKERS] lazy vxid locks, v1
Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes:

On 06/12/2011 11:39 PM, Robert Haas wrote: Profiling reveals that the system spends enormous amounts of CPU time in s_lock.

just to reiterate that with numbers - at 160 threads with both patches applied the profile looks like:

samples  %        image name  symbol name
828794   75.8662  postgres    s_lock

Do you know exactly which spinlocks are being contended on here? The next few entries

51672  4.7300  postgres  LWLockAcquire
51145  4.6817  postgres  LWLockRelease
17636  1.6144  postgres  GetSnapshotData

suggest that it might be the ProcArrayLock as a result of a huge amount of snapshot-fetching, but this is very weak evidence for that theory.

regards, tom lane
Re: [HACKERS] lazy vxid locks, v1
On Mon, Jun 13, 2011 at 10:29 AM, Tom Lane t...@sss.pgh.pa.us wrote: Stefan Kaltenbrunner ste...@kaltenbrunner.cc writes: On 06/12/2011 11:39 PM, Robert Haas wrote: Profiling reveals that the system spends enormous amounts of CPU time in s_lock. just to reiterate that with numbers - at 160 threads with both patches applied the profile looks like: samples % image name symbol name 828794 75.8662 postgres s_lock Do you know exactly which spinlocks are being contended on here? The next few entries 51672 4.7300 postgres LWLockAcquire 51145 4.6817 postgres LWLockRelease 17636 1.6144 postgres GetSnapshotData suggest that it might be the ProcArrayLock as a result of a huge amount of snapshot-fetching, but this is very weak evidence for that theory. I don't know for sure what is happening on Stefan's system, but I did post the results of some research on this exact topic in my original post. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] lazy vxid locks, v1
On Sun, Jun 12, 2011 at 2:39 PM, Robert Haas robertmh...@gmail.com wrote: ... Profiling reveals that the system spends enormous amounts of CPU time in s_lock. LWLOCK_STATS reveals that the only lwlock with significant amounts of blocking is the BufFreelistLock;

This is curious. Clearly the entire working set fits in RAM, or you wouldn't be getting numbers like this. But does the entire working set fit in shared_buffers? If so, you shouldn't see any traffic on BufFreelistLock once all the data is read in. I've only seen contention here when all data fits in OS cache memory but not in shared_buffers.

Cheers, Jeff
Re: pgbench cpu overhead (was Re: [HACKERS] lazy vxid locks, v1)
On Mon, Jun 13, 2011 at 7:03 AM, Stefan Kaltenbrunner ste...@kaltenbrunner.cc wrote: ... so it seems that sysbench has significantly less overhead than pgbench, and the lower throughput at the higher concurrency seems to be caused by sysbench being able to stress the backend even more than pgbench can.

Hi Stefan,

pgbench sends each query (per connection) and waits for the reply before sending another. Do we know whether sysbench does that, or if it just stuffs the kernel's IPC buffer full of queries without synchronously waiting for individual replies? I can't get sysbench to make for me, or I'd strace it in single client mode and see what kind of messages are going back and forth.

Cheers, Jeff
Re: pgbench cpu overhead (was Re: [HACKERS] lazy vxid locks, v1)
On Tue, Jun 14, 2011 at 09:27, Jeff Janes jeff.ja...@gmail.com wrote:

pgbench sends each query (per connection) and waits for the reply before sending another.

We can use the -j option to run pgbench in multiple threads to avoid request starvation. What setting did you use, Stefan?

for those curious - the profile for pgbench looks like:

samples  %        symbol name
29378    41.9087  doCustom
17502    24.9672  threadRun
 7629    10.8830  pg_strcasecmp

If the bench itself is the bottleneck, it would be better to reduce the pg_strcasecmp calls by holding the meta-command names as integer values (a sub-META_COMMAND code) instead of doing string comparisons on each loop.

-- Itagaki Takahiro
Re: pgbench cpu overhead (was Re: [HACKERS] lazy vxid locks, v1)
On 06/13/2011 08:27 PM, Jeff Janes wrote: pgbench sends each query (per connection) and waits for the reply before sending another. Do we know whether sysbench does that, or if it just stuffs the kernel's IPC buffer full of queries without synchronously waiting for individual replies? sysbench creates a thread for each client and lets them go at things at whatever speed they can handle. You have to set up pgbench with a worker per core to get them even close to level footing. And even in that case, sysbench has a significant advantage, because it's got the commands it runs more or less hard-coded in the program. pgbench is constantly parsing things in its internal command language and then turning them into SQL requests. That's flexible and allows it to be used for some neat custom things, but it uses a lot more resources to drive the same number of clients. I can't get sysbench to make for me, or I'd strace in single client mode and see what kind of messages are going back and forth. If you're using a sysbench tarball, no surprise. It doesn't build on lots of platforms now. If you grab http://projects.2ndquadrant.it/sites/default/files/bottom-up-benchmarking.pdf it has my sysbench notes starting on page 34. I had to check out the latest version from their development repo to get it to compile on any recent system. The attached wrapper script may be helpful to you as well to help get over the learning curve for how to run the program; it iterates sysbench over a number of database sizes and thread counts running the complicated-to-set-up OLTP test. 
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#!/bin/bash
SB=$HOME/0.4/sysbench/sysbench

DB="--pgsql-user=postgres --pgsql-password=password --pgsql-db=sysbench --pgsql-host=localhost --db-driver=pgsql"
#DB="--mysql-user=root --db-driver=mysql --mysql-table-engine=innodb --mysql-db=sysbench"

#THREADS="1 2 3 4 5 6 7 8 9 10 11 12 16 24 32 48 64 96"
THREADS="1"

TIME=10

TEST_PARAM="--oltp-read-only=off --oltp-test-mode=complex"
#TEST_PARAM="--oltp-read-only=on --oltp-test-mode=simple"

SIZES="1"
#SIZES="1 10 100 1000 5000 1 5 10"

# In complex, non read-only mode, there will be duplicate key issues, and
# threads will fail with deadlock--causing no value to be returned.
#TEST_PARAM="--oltp-test-mode=complex"

for s in $SIZES
do
  SIZE_PARAM="--oltp-table-size=$s"
  # Just in case!
  $SB $DB --test=oltp cleanup
  $SB $DB $SIZE_PARAM --test=oltp prepare
  for t in $THREADS
  do
    echo -n "$t threads: "
    $SB $DB --test=oltp $TEST_PARAM $SIZE_PARAM --init-rng --max-requests=0 \
      --max-time=$TIME --num-threads=$t run 2>&1 # | grep "read/write requests"
  done
  $SB $DB --test=oltp cleanup
done
Re: [HACKERS] lazy vxid locks, v1
On 06/13/2011 07:55 AM, Stefan Kaltenbrunner wrote: all those tests are done with pgbench running on the same box - which has a noticeable impact on the results because pgbench is using ~1 core per 8 cores of the backend tested in CPU resources - though I don't think it causes any changes in the results that would show the performance behaviour in a different light. Yeah, this used to make a much bigger difference, but nowadays it's not so important. So long as you have enough cores that you can spare a chunk of them to drive the test with, and you crank -j up to a lot, there doesn't seem to be much of an advantage to moving the clients to a remote system now. You end up trading off CPU time for everything going through the network stack, which adds yet another source of uncertainty to the whole thing anyway. I'm glad to see so many people have jumped onto doing these SELECT-only tests now. The performance farm idea I've been working on runs a test just like what's proven useful here. I'd suggested that because it's been really sensitive to changes in locking and buffer management for me. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
Re: pgbench cpu overhead (was Re: [HACKERS] lazy vxid locks, v1)
Excerpts from Jeff Janes's message of Mon Jun 13 20:27:15 -0400 2011: On Mon, Jun 13, 2011 at 7:03 AM, Stefan Kaltenbrunner ste...@kaltenbrunner.cc wrote: ... so it seems that sysbench actually has significantly less overhead than pgbench and the lower throughput at the higher concurrency seems to be caused by sysbench being able to stress the backend even more than pgbench can. Hi Stefan, pgbench sends each query (per connection) and waits for the reply before sending another. I noticed that pgbench's doCustom (the function highest in the posted profile) returns doing nothing if the connection is supposed to be sleeping; that seems like an open door for busy waiting. I didn't check the rest of the code to see if there's something avoiding that condition. I also noticed that it seems to be very liberal about calling INSTR_TIME_SET_CURRENT in the same function, which could perhaps be optimized by calling it a single time at entry and reusing the value, but I guess that would show up in the profile as a kernel call, so it's maybe not a problem. -- Álvaro Herrera alvhe...@commandprompt.com The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
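Alvaro's second point — take the clock reading once at entry and pass the cached value around — amounts to replacing repeated gettimeofday() calls with plain arithmetic on a saved struct timeval. A minimal sketch of that arithmetic (the helper name is invented; INSTR_TIME_SET_CURRENT in pgbench is a macro over this same idea):

```c
#include <sys/time.h>

/* Given one timestamp cached at function entry and one taken later,
 * elapsed time is pure integer arithmetic -- no extra kernel calls
 * in between, no matter how many intervals the caller computes. */
static long
usecs_between(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1000000L
         + (end->tv_usec - start->tv_usec);
}
```

Whether this matters in practice depends on how cheap the clock source is on the platform; as Alvaro notes, the cost would show up as kernel time in the profile if it were significant.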
Re: pgbench cpu overhead (was Re: [HACKERS] lazy vxid locks, v1)
On Tue, Jun 14, 2011 at 13:09, Alvaro Herrera alvhe...@commandprompt.com wrote: I noticed that pgbench's doCustom (the function highest in the profile posted) returns doing nothing if the connection is supposed to be sleeping; seems an open door for busy waiting. pgbench uses select() with or without a timeout in those cases, no? -- Itagaki Takahiro
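Itagaki's point is that doCustom returning immediately is harmless if the surrounding loop in threadRun blocks in select() with a timeout, so the sleeping connection costs no CPU. The pattern — a select() with no file descriptors used purely as a kernel-side sleep — can be sketched like this (a generic illustration, not pgbench's actual threadRun code):

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Wait ms milliseconds by handing select() a timeout and no fds:
 * the process sleeps in the kernel instead of busy-looping, which is
 * why a do-nothing return from the command dispatcher isn't a spin. */
static long
sleep_with_select(long ms)
{
    struct timeval start, end, timeout;

    gettimeofday(&start, NULL);
    timeout.tv_sec = ms / 1000;
    timeout.tv_usec = (ms % 1000) * 1000;
    select(0, NULL, NULL, NULL, &timeout);
    gettimeofday(&end, NULL);

    /* Return the actual elapsed milliseconds. */
    return (end.tv_sec - start.tv_sec) * 1000
         + (end.tv_usec - start.tv_usec) / 1000;
}
```

In the real client loop the fd sets would contain the connection sockets, so the select() also wakes early when a server reply arrives.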
[HACKERS] lazy vxid locks, v1
Here is a patch that applies over the "reducing the overhead of frequent table locks" (fastlock-v3) patch and allows heavyweight VXID locks to spring into existence only when someone wants to wait on them. I believe there is a large benefit to be had from this optimization, because the combination of these two patches virtually eliminates lock manager traffic on pgbench -S workloads. However, there are several flies in the ointment. 1. It's a bit of a kludge. I leave it to readers of the patch to determine exactly what about this patch they think is kludgey, but it's likely not the empty set. I suspect that MyProc->fpLWLock needs to be renamed to something a bit more generic if we're going to use it like this, but I don't immediately know what to call it. Also, the mechanism whereby we take SInvalWriteLock to work out the mapping from BackendId to PGPROC * is not exactly awesome. I don't think it matters from a performance point of view, because operations that need VXID locks are sufficiently rare that the additional lwlock traffic won't matter a bit. However, we could avoid this altogether if we rejiggered the mechanism for allocating PGPROCs and backend IDs. Right now, we allocate PGPROCs off of linked lists, except for auxiliary procs which allocate them by scanning a three-element array for an empty slot. Then, when the PGPROC subscribes to sinval, the sinval mechanism allocates a backend ID by scanning for the lowest unused backend ID in the ProcState array. If we changed the logic for allocating PGPROCs to mimic what the sinval queue currently does, then the backend ID could be defined as the offset into the PGPROC array. Translating between a backend ID and a PGPROC * then becomes a matter of pointer arithmetic. Not sure if this is worth doing. 2. Bad things happen with large numbers of connections. 
This patch increases peak performance, but as you increase the number of concurrent connections beyond the number of CPU cores, performance drops off faster with the patch than without it. For example, on the 32-core loaner from Nate Boley, using 80 pgbench -S clients, unpatched HEAD runs at ~36K TPS; with fastlock, it jumps up to about ~99K TPS; with this patch also applied, it drops down to about ~64K TPS, despite the fact that nearly all the contention on the lock manager locks has been eliminated. On Stefan Kaltenbrunner's 40-core box, he was actually able to see performance drop below unpatched HEAD with this applied! This is immensely counterintuitive. What is going on? Profiling reveals that the system spends enormous amounts of CPU time in s_lock. LWLOCK_STATS reveals that the only lwlock with significant amounts of blocking is the BufFreelistLock; but that doesn't explain the high CPU utilization. In fact, it appears that the problem is with the LWLocks that are frequently acquired in *shared* mode. There is no actual lock conflict, but each LWLock is protected by a spinlock which must be acquired and released to bump the shared locker counts. In HEAD, everything bottlenecks on the lock manager locks and so it's not really possible for enough traffic to build up on any single spinlock to have a serious impact on performance. The locks being sought there are exclusive, so when they are contended, processes just get descheduled. But with the exclusive locks out of the way, everyone very quickly lines up to acquire shared buffer manager locks, buffer content locks, etc., and large pile-ups ensue, leading to massive cache-line contention and tons of CPU usage. My initial thought was that this was contention over the root block of the index on the pgbench_accounts table and the buf mapping lock protecting it, but instrumentation showed otherwise. I hacked up the system to report how often each lwlock's spinlock exceeded spins_per_delay. 
The following is the end of a report showing the locks with the greatest amounts of excess spinning:

lwlock 0:   shacq 0        exacq 191032  blk 42554  spin 272
lwlock 41:  shacq 5982347  exacq 11937   blk 1825   spin 4217
lwlock 38:  shacq 6443278  exacq 11960   blk 1726   spin 4440
lwlock 47:  shacq 6106601  exacq 12096   blk 1555   spin 4497
lwlock 34:  shacq 6423317  exacq 11896   blk 1863   spin 4776
lwlock 45:  shacq 6455173  exacq 12052   blk 1825   spin 4926
lwlock 39:  shacq 6867446  exacq 12067   blk 1899   spin 5071
lwlock 44:  shacq 6824502  exacq 12040   blk 1655   spin 5153
lwlock 37:  shacq 6727304  exacq 11935   blk 2077   spin 5252
lwlock 46:  shacq 6862206  exacq 12017   blk 2046   spin 5352
lwlock 36:  shacq 6854326  exacq 11920   blk 1914   spin 5441
lwlock 43:  shacq 7184761  exacq 11874   blk 1863   spin 5625
lwlock 48:  shacq 7612458  exacq 12109   blk 2029   spin 5780
lwlock 35:  shacq 7150616  exacq 11916   blk 2026   spin 5782
lwlock 33:  shacq 7536878  exacq 11985   blk 2105   spin 6273
lwlock 40:  shacq 7199089  exacq 12068   blk 2305   spin 6290
lwlock 456: shacq 36258224 exacq 0       blk 0      spin 54264
lwlock 42:  shacq 43012736 exacq 11851   blk 10675  spin 62017
lwlock 4:   shacq 72516569 exacq 190     blk 196    spin 341914
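Returning to point 1 above: the idea of allocating PGPROCs from a flat array, so that the backend ID is simply the array offset, reduces both directions of the BackendId/PGPROC translation to pointer arithmetic — no SInvalWriteLock and no scan of the ProcState array. A minimal sketch under invented names (PGPROCSketch is a stand-in, not the real struct):

```c
typedef struct PGPROCSketch
{
    int pid;                    /* stand-in for the real PGPROC contents */
} PGPROCSketch;

#define MAX_BACKENDS 64

/* A flat, fixed-size array of proc slots, allocated by slot index. */
static PGPROCSketch proc_array[MAX_BACKENDS];

/* With array allocation, the backend ID *is* the slot offset, so both
 * translations are O(1) pointer arithmetic and need no lock at all. */
static int
backend_id_of(const PGPROCSketch *proc)
{
    return (int) (proc - proc_array);
}

static PGPROCSketch *
proc_of_backend_id(int backend_id)
{
    return &proc_array[backend_id];
}
```

The trade-off the mail alludes to is that PGPROC allocation would have to scan for a free slot (as sinval does today) rather than popping a freelist.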
Re: [HACKERS] lazy vxid locks, v1
On Sun, Jun 12, 2011 at 10:39 PM, Robert Haas robertmh...@gmail.com wrote: I hacked up the system to report how often each lwlock spinlock exceeded spins_per_delay. I don't doubt the rest of your analysis but one thing to note, number of spins on a spinlock is not the same as the amount of time spent waiting for it. When there's contention on a spinlock the actual test-and-set instruction ends up taking a long time while cache lines are copied around. In theory you could have processes spending an inordinate amount of time waiting on a spinlock even though they never actually hit spins_per_delay or you could have processes that quickly exceed spins_per_delay. I think in practice the results are the same because the code the spinlocks protect is always short so it's hard to get the second case on a multi-core box without actually having contention anyways. -- greg
Re: [HACKERS] lazy vxid locks, v1
On Sun, Jun 12, 2011 at 5:58 PM, Greg Stark st...@mit.edu wrote: On Sun, Jun 12, 2011 at 10:39 PM, Robert Haas robertmh...@gmail.com wrote: I hacked up the system to report how often each lwlock spinlock exceeded spins_per_delay. I don't doubt the rest of your analysis but one thing to note, number of spins on a spinlock is not the same as the amount of time spent waiting for it. When there's contention on a spinlock the actual test-and-set instruction ends up taking a long time while cache lines are copied around. In theory you could have processes spending an inordinate amount of time waiting on a spinlock even though they never actually hit spins_per_delay or you could have processes that quickly exceed spins_per_delay. I think in practice the results are the same because the code the spinlocks protect is always short so it's hard to get the second case on a multi-core box without actually having contention anyways. All good points. I don't immediately have a better way of measuring what's going on. Maybe dtrace could do it, but I don't really know how to use it and am not sure it's set up on any of the boxes I have for testing. Throwing gettimeofday() calls into SpinLockAcquire() seems likely to change the overall system behavior enough to make the results utterly meaningless. It wouldn't be real difficult to count the number of times that we TAS() rather than just counting the number of times we TAS() more than spins-per-delay, but I'm not sure whether that would really address your concern. Hopefully, further experimentation will make things more clear. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
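Counting every TAS() attempt, as Robert suggests, rather than only the loops that exceed spins_per_delay, could be instrumented roughly like this with C11 atomics — a hypothetical sketch, not PostgreSQL's actual s_lock.c (which uses platform-specific assembly):

```c
#include <stdatomic.h>

/* Instrumentation counters: every TAS attempt vs. actual re-spins.
 * Counting all attempts captures contention even for lockers that
 * never come close to spins_per_delay. */
static long tas_attempts;
static long tas_spins;

typedef struct SpinLockSketch
{
    atomic_flag locked;
} SpinLockSketch;

static void
spin_acquire(SpinLockSketch *s)
{
    long attempts = 0;

    /* test_and_set returns the previous value: nonzero means someone
     * else holds the lock and this attempt was a wasted cache-line pull. */
    while (atomic_flag_test_and_set_explicit(&s->locked, memory_order_acquire))
        attempts++;

    tas_attempts += attempts + 1;   /* the final, successful TAS counts too */
    tas_spins += attempts;
}

static void
spin_release(SpinLockSketch *s)
{
    atomic_flag_clear_explicit(&s->locked, memory_order_release);
}
```

As Greg notes, even this undercounts the cost: a single contended TAS can stall for a long time on cache-line transfer, which no attempt counter will show.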