Re: [HACKERS] WIP: dynahash replacement for buffer table
On Tue, Jan 27, 2015 at 9:50 AM, Robert Haas wrote:
> This developed a slight merge conflict. I've rebased the attached
> version, and I also took the step of getting rid of buf_table.c, as I
> think I proposed somewhere upthread. This avoids the overhead of
> constructing a BufferTag only to copy it into a BufferLookupEnt, plus
> some function calls and so forth. A quick-and-dirty test suggests
> this might not have cut down on the 1-client overhead much, but I
> think it's worth doing anyway: it's certainly saving a few cycles, and
> I don't think it's complicating anything measurably.

Performance data at some of the configurations.

Configuration and Db Details
--
IBM POWER-8, 24 cores, 192 hardware threads
RAM = 492GB
checkpoint_segments = 300
checkpoint_timeout = 25min
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5min
Scale_factor = 5000
HEAD (commit id - 168a809d)

Below is the data for the median of 3 runs with the pgbench read-only (using -M prepared) configuration.

Shared_buffers = 8GB (tps per client count)

Client Count     1      8       16      32      64      128     256
HEAD             17748  119106  164949  246632  216763  183177  173055
HEAD + patch     17802  119721  167422  298746  457863  422621  356756

Shared_buffers = 16GB (tps per client count)

Client Count     1      8       16      32      64      128     256
HEAD             18139  113265  169217  270172  310936  238490  215308
HEAD + patch     17900  119960  174196  314866  448238  425916  347919

Observations as per the data
--
a. It improves the tps by a great margin at higher client counts.
b. It is evident that as we increase shared_buffers, the gain is relatively smaller (the gain at shared_buffers = 16GB is smaller than at shared_buffers = 8GB).

I think the patch is valuable for such loads even though it will show less benefit at higher shared_buffers values, although we might want to verify once that it doesn't topple at configurations such as (shared_buffers = scale_factor).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
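[Editorial illustration] The size of the gain at higher client counts can be checked directly from the 8GB figures quoted above; a quick sketch (the numbers come from the table, the `speedup` helper is ours):

```python
# Speedup of HEAD+patch over HEAD, using the shared_buffers=8GB
# median tps figures from the mail above.
head = {1: 17748, 8: 119106, 16: 164949, 32: 246632,
        64: 216763, 128: 183177, 256: 173055}
patched = {1: 17802, 8: 119721, 16: 167422, 32: 298746,
           64: 457863, 128: 422621, 256: 356756}

def speedup(clients):
    """Ratio of patched tps to master tps (hypothetical helper name)."""
    return patched[clients] / head[clients]

for c in sorted(head):
    print(c, round(speedup(c), 2))
```

At 1 client the ratio is essentially 1.0, while at 64 clients it is above 2x, which is observation (a) above in numeric form.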
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2015-01-27 10:36:35 -0500, Robert Haas wrote:
> On Mon, Jan 26, 2015 at 11:20 PM, Robert Haas wrote:
> > This developed a slight merge conflict. I've rebased the attached
> > version, and I also took the step of getting rid of buf_table.c, as I
> > think I proposed somewhere upthread. This avoids the overhead of
> > constructing a BufferTag only to copy it into a BufferLookupEnt, plus
> > some function calls and so forth. A quick-and-dirty test suggests
> > this might not have cut down on the 1-client overhead much, but I
> > think it's worth doing anyway: it's certainly saving a few cycles, and
> > I don't think it's complicating anything measurably.
>
> So here are median-of-three results for 5-minute read-only pgbench
> runs at scale factor 1000, shared_buffers = 8GB, on hydra (POWER7) at
> various client counts.
>
> That's extremely unimpressive. You (Andres) showed a much bigger
> benefit on a different machine with a much higher scale factor (5000)
> so I don't know whether the modest benefit here is because of the
> different machine, the different scale factor, or what.

Based on my test on hydra some time back, the bottleneck is nearly entirely in GetSnapshotData() at higher client counts. So I'm not that surprised you don't see the big benefit here. IIRC on hydra, the results for a fully cached run at that scale, using a large enough shared_buffers setting, are pretty close to your master results - so there's really not much performance increase to be expected by making the buf table lockless.

I guess you would see a slightly bigger difference at a different shared_buffers/scale combination. IIRC a scale-1000 database is about 15GB? So about half of the data set fit in shared buffers. In the scale-5k results I posted it was about 1/5.

I had also tested on a four-socket x86 machine where GetSnapshotData() was a big, but not the sole, bottleneck.
Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Mon, Jan 26, 2015 at 11:20 PM, Robert Haas wrote:
> This developed a slight merge conflict. I've rebased the attached
> version, and I also took the step of getting rid of buf_table.c, as I
> think I proposed somewhere upthread. This avoids the overhead of
> constructing a BufferTag only to copy it into a BufferLookupEnt, plus
> some function calls and so forth. A quick-and-dirty test suggests
> this might not have cut down on the 1-client overhead much, but I
> think it's worth doing anyway: it's certainly saving a few cycles, and
> I don't think it's complicating anything measurably.

So here are median-of-three results for 5-minute read-only pgbench runs at scale factor 1000, shared_buffers = 8GB, on hydra (POWER7) at various client counts.

clients - master@4b2a25 - master+chash-buftable-v2
 1      8319.254199      8105.479632
 2     15485.561948     15138.227533
 3     23690.186576     23264.702784
 4     32157.346740     31536.870881
 5     40879.532747     40455.794841
 6     49063.279970     49625.681573
 7     57518.374517     57492.275197
 8     65351.415323     65622.211763
16    126166.416498    126668.793798
24    155727.916112    155587.414299
32    180329.012019    179543.424754
40    201384.186317    200109.614362
48    222352.265595    225688.574611
56    240400.659338    254398.144976
64    253699.031075    266624.224699
72    261198.336625    270652.578322
80    264569.862257    270332.738084

That's extremely unimpressive. You (Andres) showed a much bigger benefit on a different machine with a much higher scale factor (5000) so I don't know whether the modest benefit here is because of the different machine, the different scale factor, or what.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: dynahash replacement for buffer table
This developed a slight merge conflict. I've rebased the attached version, and I also took the step of getting rid of buf_table.c, as I think I proposed somewhere upthread. This avoids the overhead of constructing a BufferTag only to copy it into a BufferLookupEnt, plus some function calls and so forth. A quick-and-dirty test suggests this might not have cut down on the 1-client overhead much, but I think it's worth doing anyway: it's certainly saving a few cycles, and I don't think it's complicating anything measurably.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/contrib/Makefile b/contrib/Makefile
index 195d447..141ef0a 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
 		earthdistance	\
 		file_fdw	\
 		fuzzystrmatch	\
+		hashtest	\
 		hstore		\
 		intagg		\
 		intarray	\
diff --git a/contrib/hashtest/Makefile b/contrib/hashtest/Makefile
new file mode 100644
index 000..3ee42f8
--- /dev/null
+++ b/contrib/hashtest/Makefile
@@ -0,0 +1,18 @@
+# contrib/hashtest/Makefile
+
+MODULE_big = hashtest
+OBJS = hashtest.o
+
+EXTENSION = hashtest
+DATA = hashtest--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/hashtest
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/hashtest/hashtest--1.0.sql b/contrib/hashtest/hashtest--1.0.sql
new file mode 100644
index 000..e271baf
--- /dev/null
+++ b/contrib/hashtest/hashtest--1.0.sql
@@ -0,0 +1,52 @@
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION hashtest" to load this file. \quit
+
+CREATE FUNCTION chash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_collision_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_collision_test'
+LANGUAGE C;
diff --git a/contrib/hashtest/hashtest.c b/contrib/hashtest/hashtest.c
new file mode 100644
index 000..172a5bb
--- /dev/null
+++ b/contrib/hashtest/hashtest.c
@@ -0,0 +1,527 @@
+/*-
+ * hashtest.c
+ *-
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "libpq/auth.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "portability/instr_time.h"
+#include "storage/ipc.h"
+#include "utils/chash.h"
+
+PG_MODULE_MAGIC;
+
+void _PG_init(void);
+Datum chash_insert_test(PG_FUNCTION_ARGS);
+Datum chash_search_test(PG_FUNCTION_ARGS);
+Datum chash_delete_test(PG_FUNCTION_ARGS);
+Datum chash_concurrent_test(PG_FUNCTION_ARGS);
+Datum chash_collision_test(PG_FUNCTION_ARGS);
+Datum dynahash_insert_test(PG_FUNCTION_ARGS);
+Datum dynahash_search_test(PG_FUNCTION_ARGS);
+Datum dynahash_delete_test(PG_FUNCTION_ARGS);
+Datum dynahash_concurrent_test(PG_FUNCTION_ARGS);
+Datum dynahash_collision_test(PG_FUNCTION_ARGS);
+static void hashtest_shmem_startup(void);
+
+PG_FUNCTION_INFO_V1(chash_insert_test);
+PG_FUNCTION_INFO_V1(chash_search_test);
+PG_FUNCTION_INFO_V1(chash_delete_test);
+PG_FUNCTION_INFO_V1(chash_concurrent_test);
+PG_FUNCTION_INFO_V1(chash_collision_test);
+PG_FUNCTION_INFO_V1(dynahash_insert_test);
+PG_FUNCTION_INFO_V1(dynahash_search_test);
+PG_FUNCTION_INFO_V1(dynahash_delete_test);
+PG_FUNCTION_INFO_V1(dynahash_concurrent_test);
+PG_FUNCTION_INFO_V1(dynahash_collision_test);
+
+typedef struct
+{
+	uint32		key;
+	uint32		val;
+} hentry;
+
+static CHashDescriptor cdesc = {
+	"hashtest-chash",			/* name */
+	1048576,					/* capacity */
+	sizeof(hentry),				/* element size */
+	sizeof(uint32)				/* key size */
+};
+
+#define DYNAHASH_PARTITIONS 16
+
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+static CHashTable chash;
+static HTAB *dynahash;
+static LWLockId dynahash_lock[DYNAHASH_PARTITIONS];
+static ClientAuthentication_hook_type original_client_auth_hook
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Sun, Nov 9, 2014 at 3:58 PM, Andres Freund wrote:
>> > * There are several cases where it's noticeable that chash creates more
>> > cacheline misses - that's imo a good part of why the single-threaded
>> > performance regresses. There are good reasons for the current design
>> > without an inline bucket, namely that it makes the concurrency handling
>> > easier. But I think that can be countered by still storing a pointer -
>> > just to an element inside the bucket. Afaics that'd allow this to be
>> > introduced quite easily?
>>
>> It's not obvious to me how that would work. If you just throw those
>> elements on the garbage lists with everything else, it will soon be
>> the case that one bucket header ends up using the cell stored in some
>> other bucket, which isn't going to be awesome. So you need some kind
>> of more complex scheme to preserve locality.
>
> Treating the element inside the bucket as a kind of one-element freelist
> seems quite workable to me. Test whether it's marked deleted, scan the
> hazard pointer list to see whether it can be used. If yes, grand. If
> not, go to the current code. The fact that the element is cacheline
> local will make the test for its deletedness almost free. Yes, the
> hazard pointer scan surely ain't free - but at least for cases like
> bufmgr where reads are never less frequent than writes and very often
> *much* more frequent I'm pretty sure it'd be a noticeable win.

Maybe. I'd expect that to radically increase the time spent doing hazard pointer scans. The performance of this system depends heavily on garbage collection not happening too often, and ISTR that the performance changes significantly if you vary the amount of overallocation.

>> I'm not sure. We're talking about something on the order of half a
>> percent on my tests, and lots of changes cause things to bounce around
>> by that much. I'm not sure it makes sense to get too worked up about
>> that if the alternative is to add a big pile of new complexity.
>
> I saw things in the range of 3-4% on my laptop.

Mrmpf. Well, that sucks.

>> > * I don't understand right now (but then I'm in a hotel bar) why it's
>> > safe for CHashAddToGarbage() to clobber the hash value?
>> > CHashBucketScan() relies on the chain being ordered by hashcode. No?
>> > Another backend might just have been past the IsMarked() bit and look
>> > at the hashcode?
>>
>> I think the worst case scenario is that we falsely conclude that
>> there's a hash match when there really isn't. The ensuing memcmp()
>> will clarify matters.
>
> Hm. Let me think about it for a while.
>
> I was thinking that a spurious cmp < 0 could cause us to miss a
> match - but that could only happen if the match has just been removed
> from the list. Which is fine. Perhaps that case deserves to be mentioned
> in the comment?

Maybe. I mean, the general principle here is that there may be some difference between the apparent order of execution on one CPU as opposed to another, and the code that uses the hash table has to be OK with that - at least, unless it has independent methods of assuring that things happen in the right order, but in that case that other thing may well become the bottleneck. This is just one example of that. You might also fail to see an insert that's just happened on some other node but is not committed to main memory yet, which is not really an issue we need to fix; it's just how things are. A general discussion of this somewhere might be worthwhile, but it's pretty much the same as any other highly concurrent hashtable implementation you'll find out there.

(It's also not really different from surrounding the hash table operation, and only the hash table operation, with a big lock. Then things can't change while you are scanning the bucket list, but they can change by the time you can do anything with the returned value, which is the same problem we have to cope with here.)

> * Another thing I'm wondering about here is whether it's possible that
> somebody is currently walking an "older version" of the bucket list
> (i.e. is currently at an already-unlinked element) and then visits a
> newly inserted element with an 'apparently' out-of-order hash
> value. That seems possible because the inserter might not have seen the
> unlinked element. Afaics that's not a problem for searches - but I'm not
> sure whether it couldn't cause a problem for concurrent inserts and
> deletes.

This is why the delete-mark bit has to be part of the same atomic word as the next-pointer. If somebody applies a delete-mark, a subsequent attempt to insert an entry at that position will fail because the CAS() of the next-word won't find the right value there - it will find a delete-marked value, which will never be the value it's expecting.

> * One thing I noticed while looking at that part of code is:
> + h = target_node->un.hashcode;
> + if (h == hashcode)
> + cmp = memcmp(
Re: [HACKERS] WIP: dynahash replacement for buffer table
There hasn't been much activity on this patch these days, and Andres has provided some feedback, hence switching to Returned with feedback.

Regards,
--
Michael
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2014-11-07 11:08:57 -0500, Robert Haas wrote:
> On Wed, Nov 5, 2014 at 6:19 PM, Andres Freund wrote:
> > * In benchmarks it becomes apparent that the dynamic element width makes
> > some macros like CHashTableGetNode() and
> > CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86
> > can't figure out how to build an efficient expression for the
> > target. There are two ways to address this:
> > a) Actually morph chash into something that will be included into the
> > user sites. Since I think there'll only be a limited number of those
> > that's probably acceptable.
>
> Have you benchmarked this? How much does it help?

I've done quick benchmarks, and it helped in the 2-3% range in some tests, and was a wash in others. What I did wasn't actually including it in buf_table.c, but hardcoding the size in chash. That's obviously not the real solution.

> > b) Simplify the access rules. I quite seriously doubt that the
> > interleaving of garbage and freelists is actually necessary/helpful.
>
> That wasn't added for nothing. I did a bunch of performance testing
> on this vs. dynahash (I think the test code is included in the patch)
> and the region of memory containing the freelists practically caught
> fire. Spreading them out like this greatly improved the performance
> under heavy concurrency, but even with those changes the freelists
> were extremely hot. Now, since this is the buffer mapping table
> rather than a tight loop around hash table manipulation, the
> difference may not be enough to measure. But on a microbenchmark it
> *definitely* was.

I think it'd be much less visible for the buffer manager, but it might still be visible...

> > * There are several cases where it's noticeable that chash creates more
> > cacheline misses - that's imo a good part of why the single-threaded
> > performance regresses. There are good reasons for the current design
> > without an inline bucket, namely that it makes the concurrency handling
> > easier. But I think that can be countered by still storing a pointer -
> > just to an element inside the bucket. Afaics that'd allow this to be
> > introduced quite easily?
>
> It's not obvious to me how that would work. If you just throw those
> elements on the garbage lists with everything else, it will soon be
> the case that one bucket header ends up using the cell stored in some
> other bucket, which isn't going to be awesome. So you need some kind
> of more complex scheme to preserve locality.

Treating the element inside the bucket as a kind of one-element freelist seems quite workable to me. Test whether it's marked deleted, scan the hazard pointer list to see whether it can be used. If yes, grand. If not, go to the current code. The fact that the element is cacheline local will make the test for its deletedness almost free. Yes, the hazard pointer scan surely ain't free - but at least for cases like bufmgr where reads are never less frequent than writes and very often *much* more frequent I'm pretty sure it'd be a noticeable win.

> > I'm afraid that I can't see us going forward unless we address the
> > noticeable single-threaded penalties. Do you see that differently?
>
> I'm not sure. We're talking about something on the order of half a
> percent on my tests, and lots of changes cause things to bounce around
> by that much. I'm not sure it makes sense to get too worked up about
> that if the alternative is to add a big pile of new complexity.

I saw things in the range of 3-4% on my laptop.

> > * There's some whitespace damage. (space followed by tab, followed by
> > actual contents)
>
> That's the least of our problems at this point.

Sure, but still worth cleaning up ;)

> > * I still think it's a good idea to move the full memory barriers into
> > the cleanup/writing process by doing write memory barriers when
> > acquiring a hazard pointer and RMW atomic operations (e.g. atomic add)
> > in the process testing for the hazard pointer.
>
> Can you cite any existing precedent in other system software? Does
> Linux do anything like that, for example?

No, I can't right now. I think it follows from the cache coherency rules, but I can understand suspicion there.

> > * Shouldn't we relax the CPU in a couple of places like CHashAllocate()'s
> > and CHashBucketScan()'s loops?
>
> You mean like, have it sleep the way SpinLockAcquire() does?

Not actually like in s_lock(), but rather like the SPIN_DELAY() used in the spinlock code for retries.

> Yeah, possibly, but that adds non-trivial code complexity which may
> not be worth it if - as is hoped for - there's no real contention
> there.

I think just adding a pg_spin_delay() before retrying should be trivial.

> > * I don't understand right now (but then I'm in a hotel bar) why it's
> > safe for CHashAddToGarbage() to clobber the hash value?
> > CHashBucketScan() relies on the chain being ordered by hashcode. No?
> > Another backend might just hav
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Wed, Nov 5, 2014 at 6:19 PM, Andres Freund wrote:
> * In benchmarks it becomes apparent that the dynamic element width makes
> some macros like CHashTableGetNode() and
> CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86
> can't figure out how to build an efficient expression for the
> target. There are two ways to address this:
> a) Actually morph chash into something that will be included into the
> user sites. Since I think there'll only be a limited number of those
> that's probably acceptable.

Have you benchmarked this? How much does it help?

> b) Simplify the access rules. I quite seriously doubt that the
> interleaving of garbage and freelists is actually necessary/helpful.

That wasn't added for nothing. I did a bunch of performance testing on this vs. dynahash (I think the test code is included in the patch) and the region of memory containing the freelists practically caught fire. Spreading them out like this greatly improved the performance under heavy concurrency, but even with those changes the freelists were extremely hot. Now, since this is the buffer mapping table rather than a tight loop around hash table manipulation, the difference may not be enough to measure. But on a microbenchmark it *definitely* was.

> * There are several cases where it's noticeable that chash creates more
> cacheline misses - that's imo a good part of why the single-threaded
> performance regresses. There are good reasons for the current design
> without an inline bucket, namely that it makes the concurrency handling
> easier. But I think that can be countered by still storing a pointer -
> just to an element inside the bucket. Afaics that'd allow this to be
> introduced quite easily?

It's not obvious to me how that would work. If you just throw those elements on the garbage lists with everything else, it will soon be the case that one bucket header ends up using the cell stored in some other bucket, which isn't going to be awesome. So you need some kind of more complex scheme to preserve locality.

> I'm afraid that I can't see us going forward unless we address the
> noticeable single-threaded penalties. Do you see that differently?

I'm not sure. We're talking about something on the order of half a percent on my tests, and lots of changes cause things to bounce around by that much. I'm not sure it makes sense to get too worked up about that if the alternative is to add a big pile of new complexity.

> * There's some whitespace damage. (space followed by tab, followed by
> actual contents)

That's the least of our problems at this point.

> * I still think it's a good idea to move the full memory barriers into
> the cleanup/writing process by doing write memory barriers when
> acquiring a hazard pointer and RMW atomic operations (e.g. atomic add)
> in the process testing for the hazard pointer.

Can you cite any existing precedent in other system software? Does Linux do anything like that, for example?

> * Shouldn't we relax the CPU in a couple of places like CHashAllocate()'s
> and CHashBucketScan()'s loops?

You mean like, have it sleep the way SpinLockAcquire() does? Yeah, possibly, but that adds non-trivial code complexity which may not be worth it if - as is hoped for - there's no real contention there.

> * I don't understand right now (but then I'm in a hotel bar) why it's
> safe for CHashAddToGarbage() to clobber the hash value?
> CHashBucketScan() relies on the chain being ordered by hashcode. No?
> Another backend might just have been past the IsMarked() bit and look
> at the hashcode?

I think the worst case scenario is that we falsely conclude that there's a hash match when there really isn't. The ensuing memcmp() will clarify matters.

> * We really should find a way to sensibly print the stats.

Yep.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: dynahash replacement for buffer table
Hi,

> git branch also available at:
> http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash2014

A minor review of this:

* Should be rebased on top of the atomics API.

* In benchmarks it becomes apparent that the dynamic element width makes some macros like CHashTableGetNode() and CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86 can't figure out how to build an efficient expression for the target. There are two ways to address this:
  a) Actually morph chash into something that will be included into the user sites. Since I think there'll only be a limited number of those, that's probably acceptable.
  b) Simplify the access rules. I quite seriously doubt that the interleaving of garbage and freelists is actually necessary/helpful.

* There are several cases where it's noticeable that chash creates more cacheline misses - that's imo a good part of why the single-threaded performance regresses. There are good reasons for the current design without an inline bucket, namely that it makes the concurrency handling easier. But I think that can be countered by still storing a pointer - just to an element inside the bucket. Afaics that'd allow this to be introduced quite easily?

  I'm afraid that I can't see us going forward unless we address the noticeable single-threaded penalties. Do you see that differently?

* There's some whitespace damage. (space followed by tab, followed by actual contents)

* I still think it's a good idea to move the full memory barriers into the cleanup/writing process by doing write memory barriers when acquiring a hazard pointer and RMW atomic operations (e.g. atomic add) in the process testing for the hazard pointer.

* Shouldn't we relax the CPU in a couple of places like CHashAllocate()'s and CHashBucketScan()'s loops?

* I don't understand right now (but then I'm in a hotel bar) why it's safe for CHashAddToGarbage() to clobber the hash value? CHashBucketScan() relies on the chain being ordered by hashcode. No? Another backend might just have been past the IsMarked() bit and look at the hashcode?

* We really should find a way to sensibly print the stats.

Greetings,

Andres Freund
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2014-10-16 20:22:24 -0400, Robert Haas wrote:
> On Thu, Oct 16, 2014 at 6:53 PM, Andres Freund wrote:
> > When using shared_buffers = 96GB there's a performance benefit, but not
> > huge:
> > master (f630b0dd5ea2de52972d456f5978a012436115e): 153621.8
> > master + LW_SHARED + lockless StrategyGetBuffer(): 477118.4
> > master + LW_SHARED + lockless StrategyGetBuffer() + chash: 496788.6
> > master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 499562.7
> >
> > But with shared_buffers = 16GB:
> > master (f630b0dd5ea2de52972d456f5978a012436115e): 177302.9
> > master + LW_SHARED + lockless StrategyGetBuffer(): 206172.4
> > master + LW_SHARED + lockless StrategyGetBuffer() + chash: 413344.1
> > master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 426405.8
>
> Very interesting. This doesn't show that chash is the right solution,
> but it definitely shows that doing nothing is the wrong solution.

Absolutely. The thing worrying me most (but not all that much on an absolute scale) about chash is that lots of the solutions to memory management it builds are specific to it... And generalizing afterwards will be hard because we'll have to prove that the general solution is as performant as the special-case code...

> It shows that, even with the recent bump to 128 lock manager
> partitions, and LW_SHARED on top of that, workloads that actually
> update the buffer mapping table still produce a lot of contention
> there.

FWIW, I couldn't see much of a benefit without LW_SHARED even though I tried a *few* things. The bottleneck simply is completely elsewhere.

> This hasn't been obvious to me from profiling, but the numbers
> above make it pretty clear.

So I'm not super surprised about that. There very well might be cases where it was a bad bottleneck before, but I've not seen them.

> It also seems to suggest that trying to get rid of the memory barriers
> isn't a very useful optimization project.

I'm not yet convinced of that. Yes, it's not showing up hugely in the numbers here, but that's simply because the workload is again completely dominated by the kernel copying data for the read()s, GetSnapshotData(), the buffer mapping cache misses, and a few other familiar friends.

> We might get a couple of
> percent out of it, but it's pretty small potatoes, so unless it can be
> done more easily than I suspect, it's probably not worth bothering
> with.

I still think that moving the barrier to the reading side would be simple (implementation-wise) and correct for x86. If you think about it, otherwise our spinlock implementation for x86 would be broken: as an unlock is just a compiler barrier, the full barrier on acquire had better be a full synchronization point. Am I missing something?

> An approach I think might have more promise is to have bufmgr.c
> call the CHash stuff directly instead of going through buf_table.c.

I don't see all that much point in buf_table.c currently - on the other hand, its existence has made replacing the buffer mapping slightly easier... Maybe it should just go to a header...

> Right now, for example, BufferAlloc() creates and initializes a
> BufferTag and passes a pointer to that buffer tag to BufTableLookup,
> which copies it into a BufferLookupEnt. But it would be just as easy
> for BufferAlloc() to put the BufferLookupEnt on its own stack, and
> then you wouldn't need to copy the data an extra time. Now a 20-byte
> copy isn't a lot, but it's completely unnecessary and looks easy to
> get rid of.

Worthwhile to try.

> > I had to play with setting max_connections+1 sometimes to get halfway
> > comparable results for master - unaligned data otherwise causes weird
> > results. Without doing that the performance gap between master
> > 96/16G was even bigger. We really need to fix that...
>
> This is pretty awesome.
>
> Thanks. I wasn't quite sure how to test this or where the workloads
> that it would benefit would be found, so I appreciate you putting time
> into it. And I'm really glad to hear that it delivers good results.

I wasn't sure either ;). Lucky that it played out so impressively. After the first results I was nearly ready to send out a 'Meh.' type of message ;)

Btw, CHashTableGetNode isn't exactly cheap. It shows up noticeably in profiles. It results in several non-pipelineable instructions in an already penalized section of the code... Replacing arena_stride by a constant provided measurable improvements (no imul is required anymore; instead you get shr;lea or so). I'm not sure how to deal with that. If it shows up even on my quite new laptop, it'll be more noticeable on older x86 platforms.

> I think it would be useful to plumb the chash statistics into the
> stats collector or at least a debugging dump of some kind for testing.

I'm not sure it's solid enough at this point to be in the stats collector. But I surely would like to access it somehow. I'm e.g. absolutely not s
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Thu, Oct 16, 2014 at 6:53 PM, Andres Freund wrote: > When using shared_buffers = 96GB there's a performance benefit, but not > huge: > master (f630b0dd5ea2de52972d456f5978a012436115e): 153621.8 > master + LW_SHARED + lockless StrategyGetBuffer(): 477118.4 > master + LW_SHARED + lockless StrategyGetBuffer() + chash: 496788.6 > master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 499562.7 > > But with shared_buffers = 16GB: > master (f630b0dd5ea2de52972d456f5978a012436115e): 177302.9 > master + LW_SHARED + lockless StrategyGetBuffer(): 206172.4 > master + LW_SHARED + lockless StrategyGetBuffer() + chash: 413344.1 > master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 426405.8 Very interesting. This doesn't show that chash is the right solution, but it definitely shows that doing nothing is the wrong solution. It shows that, even with the recent bump to 128 lock manager partitions, and LW_SHARED on top of that, workloads that actually update the buffer mapping table still produce a lot of contention there. This hasn't been obvious to me from profiling, but the numbers above make it pretty clear. It also seems to suggest that trying to get rid of the memory barriers isn't a very useful optimization project. We might get a couple of percent out of it, but it's pretty small potatoes, so unless it can be done more easily than I suspect, it's probably not worth bothering with. An approach I think might have more promise is to have bufmgr.c call the CHash stuff directly instead of going through buf_table.c. Right now, for example, BufferAlloc() creates and initializes a BufferTag and passes a pointer to that buffer tag to BufTableLookup, which copies it into a BufferLookupEnt. But it would be just as easy for BufferAlloc() to put the BufferLookupEnt on its own stack, and then you wouldn't need to copy the data an extra time. Now a 20-byte copy isn't a lot, but it's completely unnecessary and looks easy to get rid of. 
> I had to play with setting max_connections+1 sometimes to get halfway > comparable results for master - unaligned data otherwise causes weird > results. Without doing that the performance gap between master > 96/16G was even bigger. We really need to fix that... > > This is pretty awesome. Thanks. I wasn't quite sure how to test this or where the workloads that it would benefit would be found, so I appreciate you putting time into it. And I'm really glad to hear that it delivers good results. I think it would be useful to plumb the chash statistics into the stats collector or at least a debugging dump of some kind for testing. They include a number of useful contention measures, and I'd be interested to see what those look like on this workload. (If we're really desperate for every last ounce of performance, we could also disable those statistics in production builds. That's probably worth testing at least once to see if it matters much, but I kind of hope it doesn't.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] WIP: dynahash replacement for buffer table
Hi, On 2014-10-14 09:30:58 -0400, Robert Haas wrote: > A few years ago I started working on a concurrent hash table for > PostgreSQL. The hash table part of it worked, but I never did > anything with it, really. Amit mentioned to me earlier this week that > he was seeing contention inside the dynahash machinery, which inspired > me to go back and update the patch. I took the basic infrastructure > from before and used it to replace the buffer table. Patch is > attached. So. I ran some quick tests on a larger x86 machine. 4xE5-4620, 256GB RAM. The hotel wifi is too flaky to get detailed results, but some tasty bits. The test I used is readonly pgbench on a scale 5000 DB - about 73GB. I'm benchmarking chash on top of my LW_SHARED and StrategyGetBuffer() optimizations because otherwise the bottleneck is completely elsewhere. When using shared_buffers = 96GB there's a performance benefit, but not huge: master (f630b0dd5ea2de52972d456f5978a012436115e): 153621.8 master + LW_SHARED + lockless StrategyGetBuffer(): 477118.4 master + LW_SHARED + lockless StrategyGetBuffer() + chash: 496788.6 master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 499562.7 But with shared_buffers = 16GB: master (f630b0dd5ea2de52972d456f5978a012436115e): 177302.9 master + LW_SHARED + lockless StrategyGetBuffer(): 206172.4 master + LW_SHARED + lockless StrategyGetBuffer() + chash: 413344.1 master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 426405.8 All the numbers here are -P5 output taken once it looks like it has stabilized - it takes a couple of minutes to get to that point, so pgbench runs would have to be really long to not be skewed by the startup. I don't think my manual selection of measurements matters much at this scale of differences ;) I had to play with setting max_connections+1 sometimes to get halfway comparable results for master - unaligned data otherwise causes weird results. Without doing that the performance gap between master 96/16G was even bigger.
We really need to fix that... This is pretty awesome. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund wrote: > Yes, I can see that now. I do wonder if that's not going to prove quite > expensive... I think I can see some ways around needing that. But I > think that needs some benchmarking first - no need to build an even more > complex solution if not needed. I did a bit of testing on my MacBook Pro last night. Unfortunately, I don't have access to a large x86 machine at the moment.[1] I tried four configurations: (1) master (2) master with chash patch (3) master with chash patch, pg_memory_barrier() changed from lock addl to mfence (4) same as (3), plus barrier at the end of CHashSearch changed to a compiler barrier (which should be safe on x86) I tested pgbench -S, scale factor 100, shared_buffers 400MB. 3 trials per configuration; runs were interleaved. Percentages indicate the average difference between that line and master. At 1 client: (1) 11689.388986 11426.653176 11297.775005 (2) 11618.306822 11331.805396 11273.272945 -0.55% (3) 11813.664402 11300.082928 11231.265030 -0.20% (4) 11428.097384 11266.979165 11294.422376 -1.23% At 16 clients: (1) 56716.465893 56992.590587 56755.298362 (2) 57126.787712 56848.527712 57351.559138 +0.51% (3) 56779.624757 57036.610925 56878.036445 +0.13% (4) 56818.522369 56885.544509 56977.810413 +0.13% I think this tends to confirm that the patch is a small loss (for unknown reasons) at 1 client, but that it picks up speed with more contention, even on a machine with only 4 cores. But if there's an important difference between the different fencing techniques, it doesn't show up in this test. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company [1] Yes, this is a hint.
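The fencing variants being compared in configurations (2) through (4) can be sketched as follows. This is an illustration of the x86 instructions under discussion, not the actual pg_memory_barrier() definitions; the non-x86 fallback is an assumption added so the sketch compiles anywhere.

```c
#include <assert.h>

/* Three barrier flavors discussed above.  On x86, both mfence and a
 * locked add to the stack act as full fences; a pure compiler barrier
 * emits no instruction at all, only constraining the compiler. */
#if defined(__x86_64__)
static inline void
barrier_mfence(void)
{
    __asm__ __volatile__ ("mfence" ::: "memory");
}

static inline void
barrier_lock_addl(void)
{
    __asm__ __volatile__ ("lock; addl $0,0(%%rsp)" ::: "memory", "cc");
}
#else
/* portable fallback (assumption) so the sketch compiles elsewhere */
static inline void barrier_mfence(void)    { __sync_synchronize(); }
static inline void barrier_lock_addl(void) { __sync_synchronize(); }
#endif

/* configuration (4): the barrier at the end of CHashSearch weakened to
 * a compiler barrier, which should be safe on x86 per the message */
static inline void
barrier_compiler(void)
{
    __asm__ __volatile__ ("" ::: "memory");
}
```

Which of mfence and lock; addl is cheaper is microarchitecture-dependent, which is presumably why both were benchmarked.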
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Thu, Oct 16, 2014 at 12:26 PM, Ryan Johnson wrote: > The only metric where RCU loses to > hazard pointers is if you have really tight timing constraints on resource > reclamation. I think we do have that problem. It's certainly completely unacceptable for a query to prevent buffer reclaim on any significant number of buffers even until the end of the query, let alone the end of the transaction. But, hey, if somebody wants to try writing a patch different than the one that I wrote and see whether it works better than mine, I'm totally cool with that. This is something I came up with, and we're here to evaluate whether it works better than any other option that we have now or that someone wants to develop. I'm not saying this is the best solution; it's just something I've got that seems to basically work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Oct 15, 2014 7:32 PM, "Ants Aasma" wrote: > I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way > associativity). This allows us to fit the bucket onto 2 regular sized > cache lines and have 8 bytes left over. Buckets would be protected by > seqlocks stored in the extra space. On the read side we would only > need 2 read barriers (basically free on x86), and we are guaranteed to > have an answer by fetching two pairs of cache lines. We can trade > memory bandwidth for latency by issuing prefetches (once we add > primitives for this). Alternatively we can trade insert speed for > lookup speed by using asymmetrically sized tables. I was thinking about this. It came to me that we might not even need BufferTags to be present in the hash table. In the hash table itself we store just a pair of 4-byte values: the buffer tag's hash and the buffer id. This way we can store 8 pairs in one cache line. Lookup must account for the fact that due to hash collisions we may find multiple matches, one of which may be the buffer we are looking for. We can make this condition exceedingly unlikely by taking advantage of the fact that we have two hashes anyway: in each table we use the lookup hash of the other table for tagging (a 32-bit hash collision within 8 items). By having a reasonably high load factor (75% should work) and 8 bytes per key we even have a pretty good chance of fitting the hotter parts of the buffer lookup table in CPU caches. We use a separate array for the locks (spinlocks should be ok here) for fine grained locking, this may be 1:1 with the buckets or we can map many buckets to a single lock. Otherwise the scheme should work mostly the same. Using this scheme we can get by on the lookup side using 64 bit atomic reads with no barriers, we only need atomic ops for pinning the found buffer. I haven't had the chance to check out how second-chance hashing works and if this scheme would be applicable to it.
Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
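The 8-byte entry layout Ants describes can be sketched like this. Type and function names are invented for illustration; in particular, real code would need a way to distinguish empty slots from a buffer whose hash happens to be zero.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the proposed lookup-table entry: a 4-byte tag hash plus a
 * 4-byte buffer id, so 8 entries fit in one 64-byte cache line.
 * Names are illustrative, not from any actual patch. */
typedef struct
{
    uint32_t tag_hash;      /* per the proposal, the *other* table's
                             * lookup hash, used for disambiguation */
    uint32_t buf_id;        /* index into the buffer descriptor array */
} ToyLookupEnt;

#define TOY_ENTRIES_PER_BUCKET 8    /* 8 * 8 bytes = one cache line */

typedef struct
{
    ToyLookupEnt ent[TOY_ENTRIES_PER_BUCKET];
} ToyLookupBucket;

/* A lookup may return several candidate buffer ids whose hashes all
 * match; the caller must recheck the full BufferTag in the buffer
 * descriptor itself, exactly as the message says. */
static int
toy_bucket_lookup(const ToyLookupBucket *b, uint32_t hash,
                  uint32_t *out, int max_out)
{
    int n = 0;

    for (int i = 0; i < TOY_ENTRIES_PER_BUCKET && n < max_out; i++)
        if (b->ent[i].tag_hash == hash)
            out[n++] = b->ent[i].buf_id;
    return n;
}
```

The point of the layout is the size arithmetic: eight 8-byte pairs exactly fill a 64-byte cache line, so one line fetch answers a bucket probe.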
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 16/10/2014 7:59 AM, Robert Haas wrote: On Thu, Oct 16, 2014 at 9:30 AM, Andres Freund wrote: On 2014-10-16 09:19:16 -0400, Robert Haas wrote: On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson wrote: Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like an ideal fit... In brief, RCU has the following requirements: Read-heavy access pattern Writers must be able to make dead objects unreachable to new readers (easily done for most data structures) Writers must be able to mark dead objects in such a way that existing readers know to ignore their contents but can still traverse the data structure properly (usually straightforward) Readers must occasionally inform the system that they are not currently using any RCU-protected pointers (to allow resource reclamation) Have a look at http://lwn.net/Articles/573424/ and specifically the "URCU overview" section. Basically, that last requirement - that readers inform the system that they are not currently using any RCU-protected pointers - turns out to require either memory barriers or signals. Well, there's the "quiescent-state-based RCU" - that's actually something that could reasonably be used inside postgres. Put something around socket reads (syscalls are synchronization points) and non-nested lwlock acquires. That'd mean it's nearly no additional overhead. Sure, so, you reuse your existing barriers instead of adding new ones. Making it work sounds like a lot of work for an uncertain benefit though. Perhaps RCU is too much work and/or too little benefit to justify replacing the current latch-based code. I personally am not convinced, but have no hard data to offer other than to point at the rapid (even accelerating) uptake of RCU throughout the Linux kernel (cf. Fig. 1 of http://www2.rdrop.com/users/paulmck/techreports/RCUUsage.2013.02.24a.pdf).
However--- If we are convinced the latch-based code needs to go and the question is whether to replace it with RCU or hazard pointers, then RCU wins hands-down IMO. It's comparable work to implement, easier to reason about (RCU read protocol is significantly simpler), and a clearer performance benefit (hazard pointers require more barriers, more atomic ops, more validating, and more pointer chasing than RCU). The only metric where RCU loses to hazard pointers is if you have really tight timing constraints on resource reclamation. Hazard pointers give a tight bound on the number of dead objects that cannot be reclaimed at any given moment, while RCU does not. I've not heard that this is a major concern, though, and in practice it doesn't seem to be a problem unless you forget to annotate an important quiescent point (like a blocking syscall). Ryan
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 16/10/2014 7:19 AM, Robert Haas wrote: On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson wrote: Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like an ideal fit... In brief, RCU has the following requirements: Read-heavy access pattern Writers must be able to make dead objects unreachable to new readers (easily done for most data structures) Writers must be able to mark dead objects in such a way that existing readers know to ignore their contents but can still traverse the data structure properly (usually straightforward) Readers must occasionally inform the system that they are not currently using any RCU-protected pointers (to allow resource reclamation) Have a look at http://lwn.net/Articles/573424/ and specifically the "URCU overview" section. Basically, that last requirement - that readers inform the system that they are not currently using any RCU-protected pointers - turns out to require either memory barriers or signals. All of the many techniques that have been developed in this area are merely minor variations on a very old theme: set some kind of flag variable in shared memory to let people know that you are reading a shared data structure, and clear it when you are done. Then, other people can figure out when it's safe to recycle memory that was previously part of that data structure. Sure, but RCU has the key benefit of decoupling its machinery (esp. that flag update) from the actual critical section(s) it protects. In a DBMS setting, for example, once per transaction or SQL statement would do just fine. The notification can be much better than a simple flag---you want to know whether the thread has ever quiesced since the last reclaim cycle began, not whether it is currently quiesced (which it usually isn't). In the implementation I use, a busy thread (e.g. not about to go idle) can "chain" its RCU "transactions."
In the common case, a chained quiesce call comes when the RCU epoch is not trying to change, and the "flag update" degenerates to a simple load. Further, the only time it's critical to have that memory barrier is if the quiescing thread is about to go idle. Otherwise, missing a flag just imposes a small delay on resource reclamation (and that's assuming the flag in question even belonged to a straggler process). How you implement epoch management, especially the handling of stragglers, is the deciding factor in whether RCU works well. The early URCU techniques were pretty terrible, and maybe general-purpose URCU is doomed to stay that way, but in a DBMS core it can be done very cleanly and efficiently because we can easily add the quiescent points at appropriate locations in the code. In Linux's RCU, the flag variable is "whether the process is currently scheduled on a CPU", which is obviously not workable from user-space. Lacking that, you need an explicit flag variable, which means you need memory barriers, since the protected operation is a load and the flag variable is updated via a store. You can try to avoid some of the overhead by updating the flag variable less often (say, when a signal arrives) or you can make it more fine-grained (in my case, we only prevent reclaim of a fraction of the data structure at a time, rather than all of it) or various other variants, but none of this is unfortunately so simple as "apply technique X and your problem just goes away". Magic wand, no (does nothing for update contention, for example, and requires some care to apply). But from a practical perspective RCU, properly implemented, does make an awful lot of problems an awful lot simpler to tackle. Especially for the readers. Ryan
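The "chained" quiescent point described above, where the flag update degenerates to a load when the epoch has not advanced, can be sketched in a few lines. This is a single-threaded toy with invented names, not Ryan's actual implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Toy sketch of chained quiescence: a reader re-advertises its epoch
 * only when the global epoch has actually advanced, so the common-case
 * cost is one load and one compare, with no store and no fence. */
#define CHAIN_NREADERS 4

static volatile uint64_t chain_global_epoch = 1;
static volatile uint64_t chain_reader_epoch[CHAIN_NREADERS];

static void
chain_quiesce(int id)
{
    uint64_t e = chain_global_epoch;    /* plain load (common case) */

    if (chain_reader_epoch[id] != e)
    {
        chain_reader_epoch[id] = e;     /* rare store when epoch moved */
        __sync_synchronize();           /* make it visible to reclaimer */
    }
}
```

A reclaimer would advance chain_global_epoch and then wait for every reader to advertise the new value before freeing garbage retired under the old one.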
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Thu, Oct 16, 2014 at 9:30 AM, Andres Freund wrote: > On 2014-10-16 09:19:16 -0400, Robert Haas wrote: >> On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson >> wrote: >> > Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like >> > an ideal fit... >> > >> > In brief, RCU has the following requirements: >> > >> > Read-heavy access pattern >> > Writers must be able to make dead objects unreachable to new readers >> > (easily >> > done for most data structures) >> > Writers must be able to mark dead objects in such a way that existing >> > readers know to ignore their contents but can still traverse the data >> > structure properly (usually straightforward) >> > Readers must occasionally inform the system that they are not currently >> > using any RCU-protected pointers (to allow resource reclamation) >> >> Have a look at http://lwn.net/Articles/573424/ and specifically the >> "URCU overview" section. Basically, that last requirement - that >> readers inform the system that they are not currently using any >> RCU-protected pointers - turns out to require either memory barriers >> or signals. > > Well, there's the "quiescent-state-based RCU" - that's actually > something that could reasonably be used inside postgres. Put something > around socket reads (syscalls are synchronization points) and non-nested > lwlock acquires. That'd mean it's nearly no additional overhead. Sure, so, you reuse your existing barriers instead of adding new ones. Making it work sounds like a lot of work for an uncertain benefit though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2014-10-16 09:19:16 -0400, Robert Haas wrote: > On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson > wrote: > > Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like > > an ideal fit... > > > > In brief, RCU has the following requirements: > > > > Read-heavy access pattern > > Writers must be able to make dead objects unreachable to new readers (easily > > done for most data structures) > > Writers must be able to mark dead objects in such a way that existing > > readers know to ignore their contents but can still traverse the data > > structure properly (usually straightforward) > > Readers must occasionally inform the system that they are not currently > > using any RCU-protected pointers (to allow resource reclamation) > > Have a look at http://lwn.net/Articles/573424/ and specifically the > "URCU overview" section. Basically, that last requirement - that > readers inform the system that they are not currently using any > RCU-protected pointers - turns out to require either memory barriers > or signals. Well, there's the "quiescent-state-based RCU" - that's actually something that could reasonably be used inside postgres. Put something around socket reads (syscalls are synchronization points) and non-nested lwlock acquires. That'd mean it's nearly no additional overhead. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson wrote: > Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like > an ideal fit... > > In brief, RCU has the following requirements: > > Read-heavy access pattern > Writers must be able to make dead objects unreachable to new readers (easily > done for most data structures) > Writers must be able to mark dead objects in such a way that existing > readers know to ignore their contents but can still traverse the data > structure properly (usually straightforward) > Readers must occasionally inform the system that they are not currently > using any RCU-protected pointers (to allow resource reclamation) Have a look at http://lwn.net/Articles/573424/ and specifically the "URCU overview" section. Basically, that last requirement - that readers inform the system that they are not currently using any RCU-protected pointers - turns out to require either memory barriers or signals. All of the many techniques that have been developed in this area are merely minor variations on a very old theme: set some kind of flag variable in shared memory to let people know that you are reading a shared data structure, and clear it when you are done. Then, other people can figure out when it's safe to recycle memory that was previously part of that data structure. In Linux's RCU, the flag variable is "whether the process is currently scheduled on a CPU", which is obviously not workable from user-space. Lacking that, you need an explicit flag variable, which means you need memory barriers, since the protected operation is a load and the flag variable is updated via a store.
You can try to avoid some of the overhead by updating the flag variable less often (say, when a signal arrives) or you can make it more fine-grained (in my case, we only prevent reclaim of a fraction of the data structure at a time, rather than all of it) or various other variants, but none of this is unfortunately so simple as "apply technique X and your problem just goes away". -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
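The "flag variable in shared memory" theme described above can be sketched as an epoch scheme: each reader advertises which epoch it is reading under, and a reclaimer may free objects retired in epoch E only once no reader still advertises an epoch <= E. This is a minimal single-process illustration with invented names, not any postgres code.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NREADERS 4

static volatile uint64_t toy_global_epoch = 1;
static volatile uint64_t toy_reader_epoch[TOY_NREADERS];    /* 0 = idle */

/* reader side: set the flag before reading, clear it after; the
 * fences are exactly the memory barriers the message says this
 * scheme requires (the flag is a store, the protected op a load) */
static void
toy_read_lock(int id)
{
    toy_reader_epoch[id] = toy_global_epoch;
    __sync_synchronize();
}

static void
toy_read_unlock(int id)
{
    __sync_synchronize();
    toy_reader_epoch[id] = 0;
}

/* reclaimer side: may objects retired in epoch e be recycled? */
static int
toy_safe_to_reclaim(uint64_t e)
{
    for (int i = 0; i < TOY_NREADERS; i++)
    {
        uint64_t r = toy_reader_epoch[i];

        if (r != 0 && r <= e)
            return 0;       /* a reader may still see the old object */
    }
    return 1;
}
```

The per-read-side fence in toy_read_lock() is the cost being debated in this thread; quiescent-state schemes move that cost to points the code already pays for.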
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 15/10/2014 11:41 AM, Robert Haas wrote: On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund wrote: On 2014-10-14 17:53:10 -0400, Robert Haas wrote: On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund wrote: The code in CHashSearch shows the problem there: you need to STORE the hazard pointer before you begin to do the LOAD operations to scan the bucket, and you must finish all of those LOADs before you STORE the NULL hazard pointer. A read or write barrier won't provide store/load or load/store ordering. I'm not sure I understand the problem here - but a read + write barrier should suffice? To avoid falling back to two full barriers, we'd need a separate define pg_read_write_barrier(), but that's not a problem. IIUC that should allow us to avoid emitting any actual code on x86. Well, thinking about x86 specifically, it definitely needs at least one mfence, after setting the hazard pointer and before beginning to read the bucket chain. Yes, I can see that now. I do wonder if that's not going to prove quite expensive... I think I can see some ways around needing that. But I think that needs some benchmarking first - no need to build an even more complex solution if not needed. The solution I'm thinking of is to essentially move away from hazard pointers and store something like a generation counter per backend. Which is updated less often, and in combination with other operations. When a backend needs to clean up and sees that there's a straggler with an old generation it sends that backend a signal to ensure it sets the latest generation. It's possible that might work ... but on the timescale we're talking about here, that's asking the garbage collecting process to wait for practically geologic time. Back when I first wrote this code, I spent a fair amount of time looking at existing work in the area of lock-free hash tables.
Essentially all of the work I was able to find on this topic assumes a threaded model - or more precisely, it assumes that shared memory allocation is cheap and easy and you'll have no problem getting as much of it as you need whenever you need it. This assumption often isn't even spelled out explicitly: it's just assumed that that's the programming environment you're working in. Finding highly parallel algorithms that don't rely on memory allocation as a primitive is hard. Hazard pointers are one of the few techniques I found that seems like it can work in our architecture. I'm quite reluctant to leave aside techniques that have been proven to work well in favor of inventing something entirely novel to PostgreSQL. That having been said, there is some literature on generation numbers, and I think something like that could be made to work. It does have some significant disadvantages, though. One is that a single process which fails to update its generation number prevents all reclamation, system-wide. In my algorithm, a process whose execution is suspended while a hazard pointer is set prevents recycling of only one of many garbage lists. A process searching for a reusable element can most likely find some other garbage list to reclaim instead. Also, a generation number implies that we update the value periodically, rather than before and after each critical section. Anything that forces garbage collection to be postponed longer than absolutely necessary seems to me likely to be a loser. It might be a little faster as long as we have free elements to allocate, but as soon as we're out and have to wait longer than we otherwise would for garbage collection, and all system activity halts as a result, even for a few milliseconds, it's going to be a whole lot slower. Or at least, I think. That having been said, I don't know what to do about the fact that the fence is too expensive.
I don't know that we've really established that that's the true root of the problem rather than some other pedestrian optimization failure. But the existing code is using an atomic operation to acquire a spinlock, then releasing it, walking the bucket chain, and then using another atomic operation to acquire a spinlock and then releasing it again. Surely a pure fence shouldn't cost more than a spinlock cycle? Even with arguably one extra cache line touch, that seems like it ought to be a win. But my intuitions in this area are shaky. Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like an ideal fit... In brief, RCU has the following requirements: * Read-heavy access pattern * Writers must be able to make dead objects unreachable to new readers (easily done for most data structures) * Writers must be able to mark dead objects in such a way that existing readers know to ignore their contents but can still traverse the data structure properly (usually straightforward) * Readers must occasionally inform the system that they are not currently using any RCU-protected pointers (to allow resource reclamation) [1] http://www.rdrop.com/~paulmck/RCU/ I
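The STORE-hazard-pointer-before-LOADs constraint quoted at the top of this message can be sketched with a toy single-bucket search. This is a gross simplification of the real CHashSearch (one bucket, one hazard slot, no insertion/deletion machinery), with invented names; it only illustrates where the two full fences sit and why.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyNode
{
    uint32_t        key;
    struct ToyNode *next;
} ToyNode;

static ToyNode * volatile toy_bucket_head;  /* one bucket */
static ToyNode * volatile toy_hazard;       /* one backend's hazard slot */

static ToyNode *
toy_search(uint32_t key)
{
    ToyNode *n;

restart:
    n = toy_bucket_head;
    toy_hazard = n;             /* STORE: advertise the chain we scan */
    __sync_synchronize();       /* full fence: the store must be visible
                                 * before any chain LOADs happen; a plain
                                 * read or write barrier gives neither
                                 * store/load nor load/store ordering */
    if (n != toy_bucket_head)   /* head changed underneath us? */
        goto restart;
    for (; n != NULL; n = n->next)
        if (n->key == key)
            break;
    __sync_synchronize();       /* all chain LOADs must finish ... */
    toy_hazard = NULL;          /* ... before the NULL hazard STORE */
    return n;
}

/* reclaimer side (grossly simplified to one slot): a chain's nodes may
 * be recycled only if no hazard pointer references that chain */
static int
toy_may_recycle(ToyNode *chain)
{
    __sync_synchronize();
    return toy_hazard != chain;
}
```

The second fence is the one the thread suggests could be weakened on x86, since x86 does not reorder stores ahead of earlier loads.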
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2014-10-14 17:53:10 -0400, Robert Haas wrote: > On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund wrote: > >> The code in CHashSearch shows the problem there: you need to STORE the > >> hazard pointer before you begin to do the LOAD operations to scan the > >> bucket, and you must finish all of those LOADs before you STORE the > >> NULL hazard pointer. A read or write barrier won't provide store/load > >> or load/store ordering. > > > > I'm not sure I understand the problem here - but a read + write barrier > > should suffice? To avoid falling back to two full barriers, we'd need a > > separate define pg_read_write_barrier(), but that's not a problem. IIUC > > that should allow us to avoid emitting any actual code on x86. > > Well, thinking about x86 specifically, it definitely needs at least > one mfence, after setting the hazard pointer and before beginning to > read the bucket chain. It probably doesn't need the second mfence, > before clearing the hazard pointer, because x86 allows loads to be > reordered before stores but not stores before loads. We could > certainly try removing the second fence on x86 for benchmarking > purposes, and we could also check whether mfence outperforms lock; > addl in that scenario. Hm. So I thought about this for a while this morning after I wasn't able to unprogram my hotel room's alarm clock that a previous occupant set way too early. Given that premise, take the following with a grain of salt. The reason for needing an mfence is that a read/write barrier doesn't guarantee that the store is visible to other processes - just in which order they become visible. Right? And that's essentially because the write might sit in the process's store buffer. So, how about *not* using a full barrier on the reader side (i.e. the side that does the hazard pointer writes). But instead do something like an atomic_fetch_add(hazard_ptr, 0) (i.e. lock; xaddl) on the side that needs to scan the hazard pointers.
That, combined with the write memory barrier, should guarantee that the store buffers are emptied. Which is pretty nice, because it moves the overhead to the rather infrequent scanning of the hazard pointers - which needs to do lots of other atomic ops anyway. Sounds sane? That's something that should best be simulated in an exhaustive x86 simulator, but it sounds sane - and it'd be quite awesome imo :) Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
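The asymmetric scheme proposed here can be sketched as follows. Names are invented, and whether this is actually sufficient on real hardware is precisely the open question Andres raises ("should best be simulated in an exhaustive x86 simulator"); treat it as a sketch of the idea only.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NBACKENDS 4

static volatile uintptr_t toy_hazard_slot[TOY_NBACKENDS];

/* hot path, per lookup: publish the hazard pointer with only a
 * compiler barrier - no mfence, no locked instruction */
static inline void
toy_publish(int backend, uintptr_t ptr)
{
    toy_hazard_slot[backend] = ptr;
    __asm__ __volatile__ ("" ::: "memory");     /* compiler barrier */
}

/* cold path, per garbage-collection scan: an atomic fetch-and-add of
 * 0 (lock; xadd on x86) both fetches the slot's current value and is
 * a full memory barrier, which is the proposal's mechanism for
 * ensuring the publisher's store is observed */
static inline uintptr_t
toy_scan_slot(int backend)
{
    return __sync_fetch_and_add(&toy_hazard_slot[backend], 0);
}
```

The attraction is that the full-barrier cost lands on the infrequent scanning side, which is doing plenty of atomic operations anyway, while the per-lookup publish stays cheap.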
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2014-10-15 13:41:25 -0400, Robert Haas wrote: > On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund wrote: > > The solution I'm thinking of is to essentially move away from hazard > > pointers and store something like a generation counter per > > backend. Which is updated less often, and in combination with other > > operations. When a backend needs to clean up and sees that there's a > > straggler with an old generation it sends that backend a signal to ensure > > it sets the latest generation. > > It's possible that might work ... but on the timescale we're talking > about here, that's asking the garbage collecting process to wait for > practically geologic time. I think it depends on what we're tying the generation increase to. We very well could add an implicit generation increment to, say, lwlock acquisition - which are full barriers anyway. And I don't think we'll normally have that high a rate of garbage collection anyway - there should be plenty of headroom. > Back when I first wrote this code, I spent a fair amount of time > looking at existing work in the area of lock-free hash tables. > Essentially all of the work I was able to find on this topic assumes a > threaded model - or more precisely, it assumes that shared memory > allocation is cheap and easy and you'll have no problem getting as > much of it as you need whenever you need it. FWIW, I think many of the solutions that are actually used in practice don't rely on that heavily. I know at least some that require memory to be reserved beforehand, in special contexts. > This assumption often > isn't even spelled out explicitly: it's just assumed that that's the > programming environment you're working in. Finding highly parallel > algorithms that don't rely on memory allocation as a primitive is > hard. Hazard pointers are one of the few techniques I found that > seems like it can work in our architecture.
I'm quite reluctant to > leave aside techniques that have been proven to work well in favor of > inventing something entirely novel to PostgreSQL. I don't think something like generation numbers is really that new and unproven - it's essentially a more trivial version of RCU. Which is used quite heavily, and under different names. That said, I'm far from convinced that they are the solution - they just were the first thing that came to my mind thinking about the problem. > That having been said, there is some literature on generation numbers, > and I think something like that could be made to work. It does have > some significant disadvantages, though. One is that a single process > which fails to update its generation number prevents all reclamation, > system-wide. In my algorithm, a process whose execution is > suspended while a hazard pointer is set prevents recycling of only one > of many garbage lists. A process searching for a reusable element can > most likely find some other garbage list to reclaim instead. Hm. Couldn't a similar scheme with several lists be used with generation numbers? > Also, a generation number implies that we update the value > periodically, rather than before and after each critical section. Hm. Don't think it necessarily has to do that. > Anything that forces garbage collection to be postponed longer than > absolutely necessary seems to me likely to be a loser. It might be a > little faster as long as we have free elements to allocate, but as > soon as we're out and have to wait longer than we otherwise would for > garbage collection, and all system activity halts as a result, even > for a few milliseconds, it's going to be a whole lot slower. Or at > least, I think. I think it really depends on the user of the facility. If the generation were e.g. also tied to lwlocks I couldn't see bufmgr.c outrunning that realistically. > That having been said, I don't know what to do about the fact that the > fence is too expensive. 
I'm far from sure that it's the problem at hand here - the reason I'm wondering about the barriers is primarily that the buffer mapping hash table is one of the top profile entries in all concurrent workloads. Often even the top one, after removing some locking bottleneck. Nearly all of the time is spent there due to cache misses. While I think we can get the table smaller and more efficient for the same NBuffers value, realistically we need to cope with much bigger NBuffers values. Since cache misses are a problem that we're going to have to deal with, restricting the processor's tools for efficiently dealing with that (out of order execution, SMT) doesn't seem like a wise choice. > I don't know that we've really established > that that's the true root of the problem rather than some other > pedestrian optimization failure. Absolutely. > But the existing code is using an > atomic operation to acquire a spinlock, then releasing it, walking the > bucket chain, and then using another atomic operation to acquire a > spinlock and then releasing it again. Well, we don
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund wrote: > On 2014-10-14 17:53:10 -0400, Robert Haas wrote: >> On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund >> wrote: >> >> The code in CHashSearch shows the problem there: you need to STORE the >> >> hazard pointer before you begin to do the LOAD operations to scan the >> >> bucket, and you must finish all of those LOADs before you STORE the >> >> NULL hazard pointer. A read or write barrier won't provide store/load >> >> or load/store ordering. >> > >> > I'm not sure I understand the problem here - but a read + write barrier >> > should suffice? To avoid falling back to two full barriers, we'd need a >> > separate define pg_read_write_barrier(), but that's not a problem. IIUC >> > that should allow us to avoid emitting any actual code on x86. >> >> Well, thinking about x86 specifically, it definitely needs at least >> one mfence, after setting the hazard pointer and before beginning to >> read the bucket chain. > > Yes, I can see that now. I do wonder if that's not going to prove quite > expensive... I think I can see some ways around needing that. But I > think that needs some benchmarking first - no need to build an even more > complex solution if not needed. > > The solution I'm thinking of is to essentially move away from hazard > pointers and store something like a generation counter per > backend. Which is updated less often, and in combination with other > operations. When a backend needs to clean up and sees that there's a > straggler with an old generation it sends that backend a signal to ensure > it sets the latest generation. It's possible that might work ... but on the timescale we're talking about here, that's asking the garbage collecting process to wait for practically geologic time. Back when I first wrote this code, I spent a fair amount of time looking at existing work in the area of lock-free hash tables. 
Essentially all of the work I was able to find on this topic assumes a threaded model - or more precisely, it assumes that shared memory allocation is cheap and easy and you'll have no problem getting as much of it as you need whenever you need it. This assumption often isn't even spelled out explicitly: it's just assumed that that's the programming environment you're working in. Finding highly parallel algorithms that don't rely on memory allocation as a primitive is hard. Hazard pointers are one of the few techniques I found that seems like it can work in our architecture. I'm quite reluctant to leave aside techniques that have been proven to work well in favor of inventing something entirely novel to PostgreSQL. That having been said, there is some literature on generation numbers, and I think something like that could be made to work. It does have some significant disadvantages, though. One is that a single process which fails to update its generation number prevents all reclamation, system-wide. In my algorithm, a process whose execution is suspended while a hazard pointer is set prevents recycling of only one of many garbage lists. A process searching for a reusable element can most likely find some other garbage list to reclaim instead. Also, a generation number implies that we update the value periodically, rather than before and after each critical section. Anything that forces garbage collection to be postponed longer than absolutely necessary seems to me likely to be a loser. It might be a little faster as long as we have free elements to allocate, but as soon as we're out and have to wait longer than we otherwise would for garbage collection, and all system activity halts as a result, even for a few milliseconds, it's going to be a whole lot slower. Or at least, I think. That having been said, I don't know what to do about the fact that the fence is too expensive. 
I don't know that we've really established that that's the true root of the problem rather than some other pedestrian optimization failure. But the existing code is using an atomic operation to acquire a spinlock, then releasing it, walking the bucket chain, and then using another atomic operation to acquire a spinlock and then releasing it again. Surely a pure fence shouldn't cost more than a spinlock cycle? Even with arguably one extra cache line touch, that seems like it ought to be a win. But my intuitions in this area are shaky. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
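The store/load ordering problem debated in this subthread can be sketched with C11 atomics. This is a hypothetical single-bucket simplification for illustration only - names like hp_search and my_hazard are invented, and this is not the actual CHashSearch code:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node
{
    int key;
    int value;
    _Atomic(struct Node *) next;
} Node;

static _Atomic(Node *) bucket_head;  /* one bucket chain */
static _Atomic(Node *) my_hazard;    /* this reader's published hazard pointer */

/* Lock-free insert at the head of the chain. */
void hp_insert(Node *n)
{
    Node *h = atomic_load(&bucket_head);
    do
        atomic_store(&n->next, h);
    while (!atomic_compare_exchange_weak(&bucket_head, &h, n));
}

/* Lock-free search guarded by a hazard pointer. */
Node *hp_search(int key)
{
    Node *head;

    for (;;)
    {
        head = atomic_load(&bucket_head);
        atomic_store(&my_hazard, head);         /* STORE the hazard pointer */
        /*
         * The store must be globally visible before the LOADs that walk the
         * chain: that is the store/load ordering a read or write barrier
         * cannot provide, hence a full fence (mfence on x86).
         */
        atomic_thread_fence(memory_order_seq_cst);
        if (atomic_load(&bucket_head) == head)  /* did a deleter race us? */
            break;
    }

    Node *result = NULL;
    for (Node *p = head; p != NULL; p = atomic_load(&p->next))
    {
        if (p->key == key)
        {
            result = p;
            break;
        }
    }

    /*
     * Clearing the hazard pointer is a STORE after LOADs; x86 does not
     * reorder stores before earlier loads, so per the analysis above the
     * second full fence could plausibly be dropped there.
     */
    atomic_store(&my_hazard, NULL);
    return result;
}
```

The full fence between publishing the hazard pointer and re-reading the head is exactly the per-lookup cost the thread is trying to avoid.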
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 15/10/2014 10:32 AM, Ants Aasma wrote: On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas wrote: With regard for using a hash table for the buffer mapping lock I'm doubtful that any form of separate chaining is the right one. We currently have a quite noticeable problem with the number of cache misses in the buffer mapping hash (and also the heavyweight lock mgr) - if we stay with hashes that's only going to be fundamentally lower than dynahash if we change the type of hashing. I've had good, *preliminary*, results using an open addressing + linear probing approach. I'm very skeptical of open addressing + linear probing. Such hash tables tend to degrade over time, because you have to leave tombstones behind to ensure that future scans advance far enough. There's really no way to recover without rebuilding the whole hash table, and eventually it degrades to linear search. If we're spending too much time walking hash chains, I think the solution is to increase the number of buckets so that the chains get shorter. What about cuckoo hashing? There was a recent paper on how to do fine grained locking with cuckoo hashes. [1] I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way associativity). This allows us to fit the bucket onto 2 regular sized cache lines and have 8 bytes left over. Buckets would be protected by seqlocks stored in the extra space. On the read side we would only need 2 read barriers (basically free on x86), and we are guaranteed to have an answer by fetching two pairs of cache lines. We can trade memory bandwidth for latency by issuing prefetches (once we add primitives for this). Alternatively we can trade insert speed for lookup speed by using asymmetrically sized tables. [1] https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf Actually, I'd go with second-chance hashing [2], same number of hash functions but it's more stable (no infinite loops, for example). Most probably the techniques from [1] would apply equally well. 
[2] http://www.eecs.harvard.edu/~michaelm/postscripts/infocom_hardware_submit.pdf Ryan
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas wrote: >> With regard for using a hash table for the buffer mapping lock I'm >> doubtful that any form of separate chaining is the right one. We >> currently have a quite noticeable problem with the number of cache >> misses in the buffer mapping hash (and also the heavyweight lock mgr) - >> if we stay with hashes that's only going to be fundamentally lower than >> dynahash if we change the type of hashing. I've had good, *preliminary*, >> results using an open addressing + linear probing approach. > > I'm very skeptical of open addressing + linear probing. Such hash > tables tend to degrade over time, because you have to leave tombstones > behind to ensure that future scans advance far enough. There's really > no way to recover without rebuilding the whole hash table, and > eventually it degrades to linear search. If we're spending too much > time walking hash chains, I think the solution is to increase the > number of buckets so that the chains get shorter. What about cuckoo hashing? There was a recent paper on how to do fine grained locking with cuckoo hashes. [1] I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way associativity). This allows us to fit the bucket onto 2 regular sized cache lines and have 8 bytes left over. Buckets would be protected by seqlocks stored in the extra space. On the read side we would only need 2 read barriers (basically free on x86), and we are guaranteed to have an answer by fetching two pairs of cache lines. We can trade memory bandwidth for latency by issuing prefetches (once we add primitives for this). Alternatively we can trade insert speed for lookup speed by using asymmetrically sized tables. 
[1] https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf Regards, Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
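The seqlock-protected bucket read that Ants describes could look roughly like this C11 sketch. The field layout and names are invented for illustration; a real implementation would use PostgreSQL's own barrier primitives rather than C11 fences:

```c
#include <stdatomic.h>

typedef struct
{
    _Atomic unsigned seq;  /* odd while a write is in progress */
    int entries[5];        /* the 5-way associative bucket */
} Bucket;

/* Writer: bump seq to odd, modify the bucket, bump back to even. */
void bucket_write(Bucket *b, int slot, int val)
{
    unsigned s = atomic_load(&b->seq);
    atomic_store(&b->seq, s + 1);                  /* odd: writer active */
    atomic_thread_fence(memory_order_release);
    b->entries[slot] = val;
    atomic_thread_fence(memory_order_release);
    atomic_store(&b->seq, s + 2);                  /* even: write complete */
}

/* Reader: two read barriers per attempt, retrying on a torn read.
 * On x86 these barriers are essentially free, as Ants notes. */
int bucket_read(Bucket *b, int slot)
{
    for (;;)
    {
        unsigned s1 = atomic_load(&b->seq);
        if (s1 & 1)
            continue;                              /* writer in progress */
        atomic_thread_fence(memory_order_acquire); /* first read barrier */
        int v = b->entries[slot];
        atomic_thread_fence(memory_order_acquire); /* second read barrier */
        if (atomic_load(&b->seq) == s1)
            return v;                              /* consistent snapshot */
    }
}
```

The crucial property for the cuckoo-hash proposal is that readers never write shared memory at all, unlike hazard pointers.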
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2014-10-14 17:53:10 -0400, Robert Haas wrote: > On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund wrote: > >> The code in CHashSearch shows the problem there: you need to STORE the > >> hazard pointer before you begin to do the LOAD operations to scan the > >> bucket, and you must finish all of those LOADs before you STORE the > >> NULL hazard pointer. A read or write barrier won't provide store/load > >> or load/store ordering. > > > > I'm not sure I understand the problem here - but a read + write barrier > > should suffice? To avoid falling back to two full barriers, we'd need a > > separate define pg_read_write_barrier(), but that's not a problem. IIUC > > that should allow us to avoid emitting any actual code on x86. > > Well, thinking about x86 specifically, it definitely needs at least > one mfence, after setting the hazard pointer and before beginning to > read the bucket chain. Yes, I can see that now. I do wonder if that's not going to prove quite expensive... I think I can see some ways around needing that. But I think that needs some benchmarking first - no need to build an even more complex solution if not needed. The solution I'm thinking of is to essentially move away from hazard pointers and store something like a generation counter per backend. Which is updated less often, and in combination with other operations. When a backend needs to clean up and sees that there's a straggler with an old generation it sends that backend a signal to ensure it sets the latest generation. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
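The generation-counter scheme Andres sketches here is essentially epoch-based reclamation. A minimal single-process illustration follows; the names, the fixed backend array, and the "0 means idle" convention are all invented for the example:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NBACKENDS 4

static _Atomic long global_gen = 1;
static _Atomic long backend_gen[NBACKENDS]; /* 0 = not inside a read section */

/* A backend republishes its generation at convenient points - Andres
 * suggests piggybacking on operations that are full barriers anyway,
 * such as lwlock acquisition. */
void note_generation(int backend)
{
    atomic_store(&backend_gen[backend], atomic_load(&global_gen));
}

/* Garbage retired under generation g may be recycled once every active
 * backend has published a generation newer than g. A single straggler
 * blocks reclamation system-wide - the drawback Robert points out. */
bool can_reclaim(long retired_gen)
{
    for (int i = 0; i < NBACKENDS; i++)
    {
        long g = atomic_load(&backend_gen[i]);
        if (g != 0 && g <= retired_gen)
            return false;           /* straggler still in an old generation */
    }
    return true;
}

long advance_generation(void)
{
    return atomic_fetch_add(&global_gen, 1) + 1;
}
```

In the real proposal, a reclaimer that finds a straggler would signal that backend to republish, rather than just waiting.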
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund wrote: >> The code in CHashSearch shows the problem there: you need to STORE the >> hazard pointer before you begin to do the LOAD operations to scan the >> bucket, and you must finish all of those LOADs before you STORE the >> NULL hazard pointer. A read or write barrier won't provide store/load >> or load/store ordering. > > I'm not sure I understand the problem here - but a read + write barrier > should suffice? To avoid falling back to two full barriers, we'd need a > separate define pg_read_write_barrier(), but that's not a problem. IIUC > that should allow us to avoid emitting any actual code on x86. Well, thinking about x86 specifically, it definitely needs at least one mfence, after setting the hazard pointer and before beginning to read the bucket chain. It probably doesn't need the second mfence, before clearing the hazard pointer, because x86 allows loads to be reordered before stores but not stores before loads. We could certainly try removing the second fence on x86 for benchmarking purposes, and we could also check whether mfence outperforms lock; addl in that scenario. I tested PPC, because that's what I have. I think we're emitting two sync instructions there, but maybe lwsync or isync would be enough in one or both cases. The first of these links indicates that lwsync ought to be enough for both cases; the second says that we need a sync after setting the hazard pointer but that an lwsync is enough before clearing it. Or that's my reading anyway. http://www.ibm.com/developerworks/systems/articles/powerpc.html http://peeterjoot.wordpress.com/2010/07/23/use-of-lwsync-instead-of-isync-as-a-mutex-acquire-memory-barrier/ >> > My guess is that the additional indirection via the arena explains the >> > difference in cache misses? But I might be completely off here. 
>> The arena makes the calculation of the pointer address less >> predictable, which might make things more difficult for the CPU >> pipeline. But aside from that, I think it's the same basic idea: you >> bounce from some sort of bucket header to the first element of the >> chain and then, hopefully, you're done. Am I missing something? > > I haven't really read much of the code yet (wrote most of this email on > my workstation before embarking on a plane), but afaics there's one > indirection to the bucket, and then one to the first node in the linked > list. Right? > In dynahash it's first to the segment, but there are only a few, so it's > likely to be cached. Then to the bucket - which, afaics, already can > contain the key. > > So it's not really the arena, but that you chose not to store the first > element in a chain inline... Hmm, you have a point. I think the reason I did it that way is because I don't know how to make the memory management work out otherwise. If you store the first element inline, then even an empty bucket uses up as much memory space as one with a single element. More seriously, it breaks the deletion algorithm. There's no good way to atomically swap out the bucket header if it is located in a fixed position and bigger than the size of an object the machine can manipulate atomically. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
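Robert's point about the deletion algorithm can be illustrated with a small sketch: when the bucket header is just a pointer, unlinking the first element of the chain is a single CAS, whereas an inline first element would be wider than what the hardware can swap atomically. Names here are invented, and mid-chain deletion and ABA hazards are deliberately ignored:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct DNode
{
    int key;
    _Atomic(struct DNode *) next;
} DNode;

static _Atomic(DNode *) dbucket;  /* pointer-sized bucket header */

/* Lock-free push at the head of the chain. */
void d_push(DNode *n)
{
    DNode *h = atomic_load(&dbucket);
    do
        atomic_store(&n->next, h);
    while (!atomic_compare_exchange_weak(&dbucket, &h, n));
}

/* Unlink and return the head node if its key matches, else NULL.
 * The whole "swap out the bucket header" step is one CAS precisely
 * because the header is only a pointer. */
DNode *d_pop_if(int key)
{
    DNode *h = atomic_load(&dbucket);
    while (h != NULL && h->key == key)
    {
        DNode *nxt = atomic_load(&h->next);
        if (atomic_compare_exchange_weak(&dbucket, &h, nxt))
            return h;
        /* failed CAS reloaded h; the loop re-checks the key */
    }
    return NULL;
}
```

If the first element lived inline in the header, this unlink would require atomically replacing a multi-word structure, which is exactly what's not possible on common hardware.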
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2014-10-14 11:19:16 -0400, Robert Haas wrote: > On Tue, Oct 14, 2014 at 10:25 AM, Andres Freund > wrote: > >> The key idea here is that lookups are done without any locks, only > >> memory barriers; and inserts and deletes are done using atomic ops. > > > > Hm. I quickly looked and I see that you use two full barriers for every > > lookup. That's pretty expensive. I think this should be doable using > > only read/write barriers on the lookup side? Even on architectures where > > read/write memory barriers have to be something more than a compiler barrier, > > they're still usually a magnitude or so cheaper than full barriers. > > The code in CHashSearch shows the problem there: you need to STORE the > hazard pointer before you begin to do the LOAD operations to scan the > bucket, and you must finish all of those LOADs before you STORE the > NULL hazard pointer. A read or write barrier won't provide store/load > or load/store ordering. I'm not sure I understand the problem here - but a read + write barrier should suffice? To avoid falling back to two full barriers, we'd need a separate define pg_read_write_barrier(), but that's not a problem. IIUC that should allow us to avoid emitting any actual code on x86. > > With regard for using a hash table for the buffer mapping lock I'm > > doubtful that any form of separate chaining is the right one. We > > currently have a quite noticeable problem with the number of cache > > misses in the buffer mapping hash (and also the heavyweight lock mgr) - > > if we stay with hashes that's only going to be fundamentally lower than > > dynahash if we change the type of hashing. I've had good, *preliminary*, > > results using an open addressing + linear probing approach. > > I'm very skeptical of open addressing + linear probing. Such hash > tables tend to degrade over time, because you have to leave tombstones > behind to ensure that future scans advance far enough. You can try to be a bit smarter than a plain open addressing approach. 
But yes, it has disadvantages. > There's really > no way to recover without rebuilding the whole hash table, and > eventually it degrades to linear search. If we're spending too much > time walking hash chains, I think the solution is to increase the > number of buckets so that the chains get shorter. Might be worthwhile to try. I know that both my hand-rolled open addressing and my hand-rolled chaining approach have significantly fewer cache misses than dynahash for the same amount of work. Hm. Possibly that's at least partially because of the segment stuff in dynahash? Anyway, you can get the size of the entire hashtable down quite a bit. Primarily because there's no need to store a next pointer. There's also not really any need for the 'hashvalue' in the bufmgr case imo. > > My guess is that the additional indirection via the arena explains the > > difference in cache misses? But I might be completely off here. > > The arena makes the calculation of the pointer address less > predictable, which might make things more difficult for the CPU > pipeline. But aside from that, I think it's the same basic idea: you > bounce from some sort of bucket header to the first element of the > chain and then, hopefully, you're done. Am I missing something? I haven't really read much of the code yet (wrote most of this email on my workstation before embarking on a plane), but afaics there's one indirection to the bucket, and then one to the first node in the linked list. Right? In dynahash it's first to the segment, but there are only a few, so it's likely to be cached. Then to the bucket - which, afaics, already can contain the key. So it's not really the arena, but that you chose not to store the first element in a chain inline... Greetings, Andres Freund
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Tue, Oct 14, 2014 at 11:31 AM, Andres Freund wrote: >> It doesn't look particularly dangerous to me. Famous last words. > >> Basically, I think what we're doing right now is holding the buffer >> mapping lock so that the buffer can't be renamed under us while we're >> pinning it. > > What I'm afraid of is that there are hidden assumptions about the > consistency provided by the mapping locks. That's certainly worth checking for, but I think the only code that needs to be checked is the code that would formerly have run while holding said lock. And there isn't that much of that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2014-10-14 11:08:08 -0400, Robert Haas wrote: > On Tue, Oct 14, 2014 at 10:47 AM, Andres Freund > wrote: > > On 2014-10-14 09:30:58 -0400, Robert Haas wrote: > >> I took the basic infrastructure from before and used it to replace the > >> buffer table. Patch is attached. > > > > Hm. Unless I missed something you pretty much completely eradicated > > locks from the buffer mapping code? That'd be pretty cool, but it's also > > scary. > > > > How confident are you that your conversion is correct? Not in the sense > > that there aren't any buglets, but fundamentally. > > It doesn't look particularly dangerous to me. Famous last words. > Basically, I think what we're doing right now is holding the buffer > mapping lock so that the buffer can't be renamed under us while we're > pinning it. What I'm afraid of is that there are hidden assumptions about the consistency provided by the mapping locks. > If we don't do that, I think the only consequence is > that, by the time we get the buffer pin, the buffer might no longer be > the one we looked up. So we need to recheck. But assuming we do > that, I don't see an issue. In fact, it occurred to me while I was > cobbling this together that we might want to experiment with that > change independently of chash. We already know (from the > StrategyGetBuffer locking changes) that holding system-wide locks to > prevent people from twaddling the state of individual buffers can be a > loser. This case isn't nearly as bad, because we're only pinning one > buffer, rather than potentially all of them multiple times, and the > lock we're holding only affects 1/128th of the buffers, not all of > them. Also IIRC at least linux has targeted wakeup/time transfer logic when using semaphores - and doesn't for spinlocks. So if you don't happen to sleep while the lwlock's spinlock is held, the result will be better. Only that we'll frequently sleep within that for very frequently taken locks... 
> The other application I had in mind for chash was SLRU stuff. I > haven't tried to write the code yet, but fundamentally, the same > considerations apply there. You do the lookup locklessly to find out > which buffer is thought to contain the SLRU page you want, but by the > time you lock the page the answer might be out of date. Assuming that > this doesn't happen many times in a row and that lookups are > relatively cheap, you could get rid of having any centralized lock for > SLRU, and just have per-page locks. Hm. I have to admit I haven't really looked enough into the slru code to judge this. My impression so far wasn't that the locking itself was the problem in most scenarios - that's not what you've seen? > I'm not quite sure what fundamental dangers you're thinking about > here Oh, only in the context of the bufmgr.c conversion. Not more generally. I agree that a lockfree hashtable is something quite worthwhile to have. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Tue, Oct 14, 2014 at 10:25 AM, Andres Freund wrote: >> The key idea here is that lookups are done without any locks, only >> memory barriers; and inserts and deletes are done using atomic ops. > > Hm. I quickly looked and I see that you use two full barriers for every > lookup. That's pretty expensive. I think this should be doable using > only read/write barriers on the lookup side? Even on architectures where > read/write memory barriers have to be something more than a compiler barrier, > they're still usually a magnitude or so cheaper than full barriers. The code in CHashSearch shows the problem there: you need to STORE the hazard pointer before you begin to do the LOAD operations to scan the bucket, and you must finish all of those LOADs before you STORE the NULL hazard pointer. A read or write barrier won't provide store/load or load/store ordering. > I don't see much reason to strive for full lock-freeness. If the locks > aren't noticeable in real world scenarios - who cares? That's my feeling too. ISTR that when I stress-tested the hash table back in 2012 when I started this code, the concurrency was far better than dynahash with 16 lwlock partitions. I don't remember off hand exactly how I tested that, but it was with the code in hashtest.c. I also seem to recall that it was possible to make the freelists get VERY hot, but of course that was a workload where hash table updates were the only thing happening, so I assume that on a workload where we're actually doing work (like, copying the data in and out of kernel buffers) that's not going to be a real problem unless you have thousands of cores. But we'll see. > With regard for using a hash table for the buffer mapping lock I'm > doubtful that any form of separate chaining is the right one. 
We > currently have a quite noticeable problem with the number of cache > misses in the buffer mapping hash (and also the heavyweight lock mgr) - > if we stay with hashes that's only going to be fundamentally lower than > dynahash if we change the type of hashing. I've had good, *preliminary*, > results using an open addressing + linear probing approach. I'm very skeptical of open addressing + linear probing. Such hash tables tend to degrade over time, because you have to leave tombstones behind to ensure that future scans advance far enough. There's really no way to recover without rebuilding the whole hash table, and eventually it degrades to linear search. If we're spending too much time walking hash chains, I think the solution is to increase the number of buckets so that the chains get shorter. > My guess is that the additional indirection via the arena explains the > difference in cache misses? But I might be completely off here. The arena makes the calculation of the pointer address less predictable, which might make things more difficult for the CPU pipeline. But aside from that, I think it's the same basic idea: you bounce from some sort of bucket header to the first element of the chain and then, hopefully, you're done. Am I missing something? > It'd be quite interesting to see a perf stat -ddd of dynahash.c/chash > employing builds for some comparable load. I'm hoping you can test this out on x86. All I have to work with are the POWER machines, and the characteristics of those are somewhat different. I can test it there, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
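The tombstone degradation Robert describes is easy to see in a toy open-addressing table (illustrative only, not code from either patch): deletions must leave markers that lookups probe past, so probe chains never shrink without rebuilding the table.

```c
#define TSIZE 8
#define EMPTY 0
#define TOMB  (-1)

/* 0 = empty, -1 = tombstone, otherwise a (positive) key */
static int slots[TSIZE];

static int thash(int key) { return key % TSIZE; }

/* Insert a positive key; returns the slot index, or -1 if full.
 * Inserts may reuse tombstone slots. */
int t_insert(int key)
{
    for (int i = 0; i < TSIZE; i++)
    {
        int idx = (thash(key) + i) % TSIZE;
        if (slots[idx] == EMPTY || slots[idx] == TOMB)
        {
            slots[idx] = key;
            return idx;
        }
    }
    return -1;
}

/* Returns the slot holding key, or -1. Note that a tombstone does NOT
 * stop the probe - only a truly empty slot does. That is why deleted
 * entries keep costing lookups forever. */
int t_lookup(int key)
{
    for (int i = 0; i < TSIZE; i++)
    {
        int idx = (thash(key) + i) % TSIZE;
        if (slots[idx] == key)
            return idx;
        if (slots[idx] == EMPTY)
            return -1;
    }
    return -1;
}

/* Delete must leave a tombstone, not EMPTY: marking the slot empty would
 * cut off probe chains that ran through it. */
void t_delete(int key)
{
    int idx = t_lookup(key);
    if (idx >= 0)
        slots[idx] = TOMB;
}
```

Keys 1 and 9 both hash to slot 1 here, so after deleting 1, finding 9 still requires probing through the tombstone at slot 1.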
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Tue, Oct 14, 2014 at 8:38 PM, Andres Freund wrote: > > On 2014-10-14 20:30:45 +0530, Amit Kapila wrote: > > On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund > > wrote: > > > > > > On 2014-10-14 09:30:58 -0400, Robert Haas wrote: > > > > A few years ago I started working on a concurrent hash table for > > > > PostgreSQL. The hash table part of it worked, but I never did > > > > anything with it, really. Amit mentioned to me earlier this week that > > > > he was seeing contention inside the dynahash machinery, which inspired > > > > me to go back and update the patch. > > > > > > Interestingly I've benchmarked similar loads, even on the same machine > > > as Amit, > > > > There is one catch here, for these profiles I am using Power-8 m/c > > and the load is slightly higher (5000 scale factor). > > Ah, right. I don't think the scale factor changes much, but the > different architecture certainly does. As I said elsewhere, I would not > believe these profiles much until they're actually done with optimized > code... Today, that m/c is not accessible, so will take a day or so to get the optimized profiles and will post it once I am able to take the same. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Tue, Oct 14, 2014 at 10:47 AM, Andres Freund wrote: > On 2014-10-14 09:30:58 -0400, Robert Haas wrote: >> I took the basic infrastructure from before and used it to replace the >> buffer table. Patch is attached. > > Hm. Unless I missed something you pretty much completely eradicated > locks from the buffer mapping code? That'd be pretty cool, but it's also > scary. > > How confident are you that your conversion is correct? Not in the sense > that there aren't any buglets, but fundamentally. It doesn't look particularly dangerous to me. Famous last words. Basically, I think what we're doing right now is holding the buffer mapping lock so that the buffer can't be renamed under us while we're pinning it. If we don't do that, I think the only consequence is that, by the time we get the buffer pin, the buffer might no longer be the one we looked up. So we need to recheck. But assuming we do that, I don't see an issue. In fact, it occurred to me while I was cobbling this together that we might want to experiment with that change independently of chash. We already know (from the StrategyGetBuffer locking changes) that holding system-wide locks to prevent people from twaddling the state of individual buffers can be a loser. This case isn't nearly as bad, because we're only pinning one buffer, rather than potentially all of them multiple times, and the lock we're holding only affects 1/128th of the buffers, not all of them. The other application I had in mind for chash was SLRU stuff. I haven't tried to write the code yet, but fundamentally, the same considerations apply there. You do the lookup locklessly to find out which buffer is thought to contain the SLRU page you want, but by the time you lock the page the answer might be out of date. Assuming that this doesn't happen many times in a row and that lookups are relatively cheap, you could get rid of having any centralized lock for SLRU, and just have per-page locks. 
I'm not quite sure what fundamental dangers you're thinking about here, but from what I understand from reading the literature, doing lookups in a lock-free hash table needs to be thought about in a sort of relativistic way. The different CPUs have slightly different views of the world, so any answer you get may be obsolete by the time you get it, and may, according to some other CPU's view of the world, have been obsolete even at the time that it was copied. That's OK, because it's just equivalent to some other CPU doing its hash table update slightly sooner or later than it actually did it, within the bounds imposed by memory barriers, which it could have done anyway. There is no one source of truth.

The result of all this is that some of the synchronization responsibility is transferred to the caller. That's obviously a bit more complex to reason about, but it hasn't stopped lock-free hash tables from being wildly popular data structures.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
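The "bounds imposed by memory barriers" above can be made concrete with C11 atomics. This is a hypothetical sketch, not the patch's chash.c: a writer publishes a node into a bucket chain with a release CAS, and readers walk the chain with acquire loads, so each reader sees fully initialized nodes without any lock or full fence:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical sketch: a single bucket chain maintained with
 * release-publish / acquire-read ordering, not PostgreSQL code. */
typedef struct Node
{
	unsigned	key;
	int			value;
	_Atomic(struct Node *) next;
} Node;

static _Atomic(Node *) bucket_head;

/* Writer: link a new node at the head; the release CAS makes the
 * node's key/value visible before the pointer to it is. */
static void chain_insert(Node *n)
{
	Node	   *head = atomic_load_explicit(&bucket_head, memory_order_relaxed);

	do
		atomic_store_explicit(&n->next, head, memory_order_relaxed);
	while (!atomic_compare_exchange_weak_explicit(&bucket_head, &head, n,
												  memory_order_release,
												  memory_order_relaxed));
}

/* Reader: acquire loads pair with the writer's release CAS, so any
 * node we reach is fully initialized from our point of view. */
static int chain_lookup(unsigned key, int *value)
{
	Node	   *n = atomic_load_explicit(&bucket_head, memory_order_acquire);

	while (n != NULL)
	{
		if (n->key == key)
		{
			*value = n->value;
			return 1;
		}
		n = atomic_load_explicit(&n->next, memory_order_acquire);
	}
	return 0;
}
```

A reader may miss an insert that is concurrently in flight; that is exactly the "relativistic" behavior described above, and it is the caller's job to cope with it.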
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2014-10-14 20:30:45 +0530, Amit Kapila wrote:
> On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund
> wrote:
> >
> > On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> > > A few years ago I started working on a concurrent hash table for
> > > PostgreSQL. The hash table part of it worked, but I never did
> > > anything with it, really. Amit mentioned to me earlier this week that
> > > he was seeing contention inside the dynahash machinery, which inspired
> > > me to go back and update the patch.
> >
> > Interestingly I've benchmarked similar loads, even on the same machine
> > as Amit,
>
> There is one catch here, for these profiles I am using Power-8 m/c
> and the load is slightly higher (5000 scale factor).

Ah, right. I don't think the scale factor changes much, but the different architecture certainly does. As I said elsewhere, I would not believe these profiles much until they're actually done with optimized code...

I also think we need to be a bit careful about optimizing too much for stock pgbench with working set >> s_b. The uniform readonly access pattern you're profiling here isn't something super realistic. Still valuable, but we should take it with a grain of salt.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] WIP: dynahash replacement for buffer table
On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund wrote:
>
> On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> > A few years ago I started working on a concurrent hash table for
> > PostgreSQL. The hash table part of it worked, but I never did
> > anything with it, really. Amit mentioned to me earlier this week that
> > he was seeing contention inside the dynahash machinery, which inspired
> > me to go back and update the patch.
>
> Interestingly I've benchmarked similar loads, even on the same machine
> as Amit,

There is one catch here: for these profiles I am using a Power-8 m/c and the load is slightly higher (5000 scale factor).

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> I took the basic infrastructure from before and used it to replace the
> buffer table. Patch is attached.

Hm. Unless I missed something you pretty much completely eradicated locks from the buffer mapping code? That'd be pretty cool, but it's also scary.

How confident are you that your conversion is correct? Not in the sense that there aren't any buglets, but fundamentally.

Greetings,

Andres Freund
Re: [HACKERS] WIP: dynahash replacement for buffer table
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
> A few years ago I started working on a concurrent hash table for
> PostgreSQL. The hash table part of it worked, but I never did
> anything with it, really. Amit mentioned to me earlier this week that
> he was seeing contention inside the dynahash machinery, which inspired
> me to go back and update the patch.

Interestingly I've benchmarked similar loads, even on the same machine as Amit, and I do see tremendous time spent in dynahash.c. It's nearly all cache misses in my tests though.

> I took the basic infrastructure
> from before and used it to replace the buffer table. Patch is
> attached.

That's pretty cool. The algorithm is complex enough that I haven't fully understood it yet, but it sounds sane on a first glance.

> The key idea here is that lookups are done without any locks, only
> memory barriers; and inserts and deletes are done using atomic ops.

Hm. I quickly looked and I see that you use two full barriers for every lookup. That's pretty expensive. I think this should be doable using only read/write barriers on the lookup side? Even on architectures where read/write memory barriers have to be something more than a compiler barrier, they're still usually a magnitude or so cheaper than full barriers.

> The algorithm is not strictly lock-free for reasons explained in the
> comments in chash.c, but it's a lot less locky than what we have now,
> so in theory you might think that would be a good thing.

I don't see much reason to strive for full lock-freeness. If the locks aren't noticeable in real-world scenarios - who cares?

> I haven't had time to do much performance testing yet, but it looks
> like this may be slower at low client counts and faster at high client
> counts. However, my results weren't really reproducible, and I haven't
> done comprehensive testing yet.
> What was really bizarre is that I
> couldn't really pin down the cause of the slowness at low client
> counts; a quick perf profile showed overhead concentrated in
> CHashBucketScan, basically memory access latency for walking the
> bucket chain. But the table is built to have a load factor of 1, so I
> can't see why that should amount to much, or why it should be
> significantly worse than for dynahash.

With regard to using a hash table for the buffer mapping, I'm doubtful that any form of separate chaining is the right one. We currently have a quite noticeable problem with the number of cache misses in the buffer mapping hash (and also the heavyweight lock mgr) - if we stay with hashes, that's only going to get fundamentally lower than dynahash if we change the type of hashing. I've had good, *preliminary*, results using an open addressing + linear probing approach.

My guess is that the additional indirection via the arena explains the difference in cache misses? But I might be completely off here. It'd be quite interesting to see a perf stat -ddd of builds employing dynahash.c/chash for some comparable load.

Greetings,

Andres Freund
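The open-addressing + linear-probing layout mentioned above keeps entries inline in one flat array, so a probe sequence walks adjacent memory instead of chasing bucket-chain pointers through an arena. A minimal, non-concurrent sketch of that layout (names and sizes are illustrative, not from the experiment described here):

```c
#include <assert.h>

/* Hypothetical sketch of open addressing with linear probing: entries
 * live inline in one array, so probing stays cache-friendly. */
#define TABLE_SIZE 16			/* power of two, so we can mask */

typedef struct
{
	unsigned	key;
	int			value;
	int			used;
} Slot;

static Slot slots[TABLE_SIZE];

static unsigned slot_for(unsigned key) { return key & (TABLE_SIZE - 1); }

static int oa_insert(unsigned key, int value)
{
	unsigned	s = slot_for(key);

	for (unsigned i = 0; i < TABLE_SIZE; i++)
	{
		Slot	   *sl = &slots[(s + i) & (TABLE_SIZE - 1)];

		/* claim the first free slot, or overwrite an existing key */
		if (!sl->used || sl->key == key)
		{
			sl->key = key;
			sl->value = value;
			sl->used = 1;
			return 1;
		}
	}
	return 0;					/* table full */
}

static int oa_lookup(unsigned key, int *value)
{
	unsigned	s = slot_for(key);

	for (unsigned i = 0; i < TABLE_SIZE; i++)
	{
		Slot	   *sl = &slots[(s + i) & (TABLE_SIZE - 1)];

		if (!sl->used)
			return 0;			/* empty slot terminates the probe: absent */
		if (sl->key == key)
		{
			*value = sl->value;
			return 1;
		}
	}
	return 0;
}
```

The cache argument is that a collision here costs a scan of neighboring slots, usually in the same or the next cache line, whereas a chained table costs a pointer dereference to a potentially cold location per collision.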
[HACKERS] WIP: dynahash replacement for buffer table
A few years ago I started working on a concurrent hash table for PostgreSQL. The hash table part of it worked, but I never did anything with it, really. Amit mentioned to me earlier this week that he was seeing contention inside the dynahash machinery, which inspired me to go back and update the patch. I took the basic infrastructure from before and used it to replace the buffer table. Patch is attached.

The key idea here is that lookups are done without any locks, only memory barriers; and inserts and deletes are done using atomic ops. The algorithm is not strictly lock-free for reasons explained in the comments in chash.c, but it's a lot less locky than what we have now, so in theory you might think that would be a good thing.

I haven't had time to do much performance testing yet, but it looks like this may be slower at low client counts and faster at high client counts. However, my results weren't really reproducible, and I haven't done comprehensive testing yet. What was really bizarre is that I couldn't really pin down the cause of the slowness at low client counts; a quick perf profile showed overhead concentrated in CHashBucketScan, basically memory access latency for walking the bucket chain. But the table is built to have a load factor of 1, so I can't see why that should amount to much, or why it should be significantly worse than for dynahash.

This patch contains assorted leftovers and is grotty in various ways, but I'm sharing it anyway just to get it out there.
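The "inserts and deletes are done using atomic ops" idea can be illustrated, for the easy case only, with a CAS-maintained bucket head. This is a hypothetical toy, not chash.c: the mail notes the real algorithm is not strictly lock-free, and a production version must also handle interior deletion, the ABA problem, and safe memory reclamation, all of which this sketch ignores:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Toy sketch: insert at, and delete from, the head of a bucket chain
 * using compare-and-swap only.  The hard parts of a real concurrent
 * hash table (interior deletes, ABA, reclamation) are ignored. */
typedef struct DNode
{
	int			key;
	_Atomic(struct DNode *) next;
} DNode;

static _Atomic(DNode *) chain;

static void head_insert(DNode *n)
{
	DNode	   *h = atomic_load(&chain);

	do
		atomic_store_explicit(&n->next, h, memory_order_relaxed);
	while (!atomic_compare_exchange_weak(&chain, &h, n));
}

/* Remove and return the current head, or NULL if the chain is empty.
 * On CAS failure, h is reloaded with the current head and we retry. */
static DNode *head_delete(void)
{
	DNode	   *h = atomic_load(&chain);

	while (h != NULL &&
		   !atomic_compare_exchange_weak(&chain, &h,
				atomic_load_explicit(&h->next, memory_order_relaxed)))
		;
	return h;
}
```

Both operations are retry loops around a single CAS, which is the basic shape of "less locky" updates; what makes a full table tricky is everything this sketch leaves out.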
git branch also available at:

http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash2014

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..0b91ac1 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -20,6 +20,7 @@ SUBDIRS = \
 		earthdistance \
 		file_fdw \
 		fuzzystrmatch \
+		hashtest \
 		hstore \
 		intagg \
 		intarray \
diff --git a/contrib/hashtest/Makefile b/contrib/hashtest/Makefile
new file mode 100644
index 000..3ee42f8
--- /dev/null
+++ b/contrib/hashtest/Makefile
@@ -0,0 +1,18 @@
+# contrib/hashtest/Makefile
+
+MODULE_big = hashtest
+OBJS = hashtest.o
+
+EXTENSION = hashtest
+DATA = hashtest--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/hashtest
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/hashtest/hashtest--1.0.sql b/contrib/hashtest/hashtest--1.0.sql
new file mode 100644
index 000..e271baf
--- /dev/null
+++ b/contrib/hashtest/hashtest--1.0.sql
@@ -0,0 +1,52 @@
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION hashtest" to load this file. \quit
+
+CREATE FUNCTION chash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_collision_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_collision_test'
+LANGUAGE C;
diff --git a/contrib/hashtest/hashtest.c b/contrib/hashtest/hashtest.c
new file mode 100644
index 000..172a5bb
--- /dev/null
+++ b/contrib/hashtest/hashtest.c
@@ -0,0 +1,527 @@
+/*-
+ * hashtest.c
+ *-
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "libpq/auth.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "portability/instr_time.h"
+#include "storage/ipc.h"
+#include "utils/chash.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+Datum		chash_insert_test(PG_FUNCTION_ARGS);
+Datum		chash_search_test(PG_FUNCTION_ARGS);
+Datum		chash_delete_test(PG_FUNCTION_ARGS);
+Datum		chash_concurrent_test(PG_FUNCTION_ARGS);
+Datum		chash_collision_test(PG_FUNCTION_ARGS);
+Datum		dynahash_insert_test(PG_FUNCTION_ARGS);
+Datum		dynahash_search_test(PG_FUNCTION_ARGS);
+Datum		dynahash_delete_test(P