On Thu, Feb 11, 2016 at 9:53 AM, Robert Haas <robertmh...@gmail.com> wrote:
> The fact that InitLocks() doesn't do this has been discussed before and
> there's no consensus on changing it.  It is, at any rate, a separate
> issue.  I'll go through the rest of this patch again now.
I did a little bit of cleanup on this patch and the results are
attached.  I also did some benchmarking of this version.  I tested
this on two systems, in each case using five-minute, read-only pgbench
runs at scale factor 3000, varying the client count and taking the
median of three runs.  First, I tested it on hydra, a 2-socket,
16-processor, 64-thread POWER box.  This is a community resource
hosted at OSUOSL.  Second, I tested it on cthulhu, an 8-socket,
64-processor, 128-thread x86_64 box.  This is an EnterpriseDB
resource.  Here are the results:

hydra, master vs. patched
1 client: 8042.872109 vs. 7839.587491 (-2.5%)
64 clients: 213311.852043 vs. 214002.314071 (+0.3%)
96 clients: 219551.356907 vs. 221908.397489 (+1.1%)
128 clients: 210279.022760 vs. 217974.079171 (+3.7%)

cthulhu, master vs. patched
1 client: 3407.705820 vs. 3645.129360 (+7.0%)
64 clients: 88667.681890 vs. 82636.914343 (-6.8%)
96 clients: 79303.750250 vs. 105442.993869 (+33.0%)
128 clients: 74684.510668 vs. 120984.938371 (+62.0%)

Obviously, the high-client-count results on cthulhu are stellar.  I'm
*assuming* that the regressions are just random variation.  I am,
however, wondering whether it would be better to set the freelist
affinity based on something other than the hash value, like say the
PID, so that the same process doesn't keep switching to a different
freelist for every buffer eviction.  (A sketch of what I mean follows
below.)

It also strikes me that it's probably quite likely that slock_t
mutex[NUM_FREELISTS] is a poor way to lay out this data in memory.
For example, on a system where slock_t is just one byte, most likely
all of those mutexes are going to be in the same cache line, which
means you're going to get a LOT of false sharing.  It seems like it
would be sensible to define a new struct that contains an slock_t, a
long, and a HASHELEMENT *, and then make an array of those structs
(see the second sketch below).  That wouldn't completely eliminate
false sharing, but it would reduce it quite a bit.  My guess is that
if you did that, you could reduce the number of freelists to 8 or
fewer and get pretty much the same performance benefit that you're
getting right now with 32.  And that, too, seems likely to be good for
single-client performance.
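To illustrate the affinity idea: a hypothetical variant of the patch's
FREELIST_IDX macro, keyed off the backend's PID (MyProcPid, from
miscadmin.h) rather than the hash value, might look like this.  This
is just a sketch, not something the attached patch does:

    /* Hypothetical: choose a freelist per backend rather than per hash
     * value, so a given backend keeps coming back to the same freelist. */
    #define FREELIST_IDX(hctl, hashcode) \
        (IS_PARTITIONED(hctl) ? MyProcPid % NUM_FREELISTS : 0)

One wrinkle: nentries is tracked per freelist, so an entry inserted
under one backend's freelist but removed under another's would skew
the per-list counts (the Assert in the HASH_REMOVE path could even
fire), so that bookkeeping would need rethinking.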
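And a sketch of the layout idea; FreeListData is a name invented here
for illustration, and exact padding is platform-dependent:

    /* Hypothetical bundle of one freelist's fields.  Grouping them puts
     * the mutexes roughly 24 bytes apart on a 64-bit machine (vs. one
     * byte apart in slock_t mutex[NUM_FREELISTS]), so fewer freelists
     * share a cache line; padding each element out to the cache line
     * size would go further still, at some memory cost. */
    typedef struct
    {
        slock_t     mutex;          /* spinlock protecting this freelist */
        long        nentries;       /* number of entries in this freelist */
        HASHELEMENT *freeList;      /* chain of free elements */
    } FreeListData;

    /* ...and in HASHHDR, replacing the three parallel arrays: */
    FreeListData freeList[NUM_FREELISTS];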
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 24a53da..06c413c 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -15,7 +15,7 @@
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
  * of the hash header change after init except nentries and freeList.
- * A partitioned table uses a spinlock to guard changes of those two fields.
+ * A partitioned table uses spinlocks to guard changes of those fields.
  * This lets any subset of the hash buckets be treated as a separately
  * lockable partition.  We expect callers to use the low-order bits of a
  * lookup key's hash value as a partition number --- this will work because
@@ -111,6 +111,8 @@
 #define DEF_DIRSIZE			   256
 #define DEF_FFACTOR			   1	/* default fill factor */
 
+/* Number of freelists to be used for a partitioned hash table. */
+#define NUM_FREELISTS 32
 
 /* A hash bucket is a linked list of HASHELEMENTs */
 typedef HASHELEMENT *HASHBUCKET;
@@ -128,12 +130,17 @@ typedef HASHBUCKET *HASHSEGMENT;
  */
 struct HASHHDR
 {
-	/* In a partitioned table, take this lock to touch nentries or freeList */
-	slock_t		mutex;			/* unused if not partitioned table */
-
-	/* These fields change during entry addition/deletion */
-	long		nentries;		/* number of entries in hash table */
-	HASHELEMENT *freeList;		/* linked list of free elements */
+	/*
+	 * The freelist can become a point of contention on high-concurrency
+	 * hash tables, so we use an array of freelists, each with its own
+	 * mutex and nentries count, instead of just a single one.
+	 *
+	 * If the hash table is not partitioned, only nentries[0] and
+	 * freeList[0] are used and the spinlocks are not used at all.
+	 */
+	slock_t		mutex[NUM_FREELISTS];	/* array of spinlocks */
+	long		nentries[NUM_FREELISTS];	/* number of entries */
+	HASHELEMENT *freeList[NUM_FREELISTS];	/* lists of free elements */
 
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
@@ -166,6 +173,9 @@ struct HASHHDR
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
 
+#define FREELIST_IDX(hctl, hashcode) \
+	(IS_PARTITIONED(hctl) ? hashcode % NUM_FREELISTS : 0)
+
 /*
  * Top control structure for a hashtable --- in a shared table, each backend
  * has its own copy (OK since no fields change at runtime)
@@ -219,10 +229,10 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem);
+static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
-static HASHBUCKET get_hash_entry(HTAB *hashp);
+static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
 static void hdefault(HTAB *hashp);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
@@ -482,10 +492,40 @@ hash_create(const char *tabname, long nelem, HASHCTL *info, int flags)
 
 	if ((flags & HASH_SHARED_MEM) || nelem < hctl->nelem_alloc)
 	{
-		if (!element_alloc(hashp, (int) nelem))
-			ereport(ERROR,
-					(errcode(ERRCODE_OUT_OF_MEMORY),
-					 errmsg("out of memory")));
+		int			i,
+					freelist_partitions,
+					nelem_alloc,
+					nelem_alloc_first;
+
+		/*
+		 * If the hash table is partitioned, all freeLists have an equal
+		 * number of elements.  Otherwise only freeList[0] is used.
+		 */
+		if (IS_PARTITIONED(hashp->hctl))
+			freelist_partitions = NUM_FREELISTS;
+		else
+			freelist_partitions = 1;
+
+		nelem_alloc = nelem / freelist_partitions;
+		if (nelem_alloc == 0)
+			nelem_alloc = 1;
+
+		/* Make sure all memory will be used */
+		if (nelem_alloc * freelist_partitions < nelem)
+			nelem_alloc_first =
+				nelem - nelem_alloc * (freelist_partitions - 1);
+		else
+			nelem_alloc_first = nelem_alloc;
+
+		for (i = 0; i < freelist_partitions; i++)
+		{
+			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
+
+			if (!element_alloc(hashp, temp, i))
+				ereport(ERROR,
+						(errcode(ERRCODE_OUT_OF_MEMORY),
+						 errmsg("out of memory")));
+		}
 	}
 
 	if (flags & HASH_FIXED_SIZE)
@@ -503,9 +543,6 @@ hdefault(HTAB *hashp)
 
 	MemSet(hctl, 0, sizeof(HASHHDR));
 
-	hctl->nentries = 0;
-	hctl->freeList = NULL;
-
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
 
@@ -572,12 +609,14 @@ init_htab(HTAB *hashp, long nelem)
 	HASHSEGMENT *segp;
 	int			nbuckets;
 	int			nsegs;
+	int			i;
 
 	/*
	 * initialize mutex if it's a partitioned table
	 */
 	if (IS_PARTITIONED(hctl))
-		SpinLockInit(&hctl->mutex);
+		for (i = 0; i < NUM_FREELISTS; i++)
+			SpinLockInit(&(hctl->mutex[i]));
 
 	/*
	 * Divide number of elements by the fill factor to determine a desired
@@ -648,7 +687,7 @@ init_htab(HTAB *hashp, long nelem)
 			"HIGH MASK       ", hctl->high_mask,
 			"LOW  MASK       ", hctl->low_mask,
 			"NSEGS           ", hctl->nsegs,
-			"NENTRIES        ", hctl->nentries);
+			"NENTRIES        ", hash_get_num_entries(hashp));
 #endif
 	return true;
 }
@@ -769,7 +808,7 @@ hash_stats(const char *where, HTAB *hashp)
 			where, hashp->hctl->accesses, hashp->hctl->collisions);
 
 	fprintf(stderr, "hash_stats: entries %ld keysize %ld maxp %u segmentcount %ld\n",
-			hashp->hctl->nentries, (long) hashp->hctl->keysize,
+			hash_get_num_entries(hashp), (long) hashp->hctl->keysize,
 			hashp->hctl->max_bucket, hashp->hctl->nsegs);
 	fprintf(stderr, "%s: total accesses %ld total collisions %ld\n",
 			where, hash_accesses, hash_collisions);
@@ -863,6 +902,7 @@ hash_search_with_hash_value(HTAB *hashp,
 	HASHBUCKET	currBucket;
 	HASHBUCKET *prevBucketPtr;
 	HashCompareFunc match;
+	int			freelist_idx = FREELIST_IDX(hctl, hashvalue);
 
 #if HASH_STATISTICS
 	hash_accesses++;
@@ -885,7 +925,7 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * order of these tests is to try to check cheaper conditions first.
 		 */
 		if (!IS_PARTITIONED(hctl) && !hashp->frozen &&
-			hctl->nentries / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
+			hctl->nentries[0] / (long) (hctl->max_bucket + 1) >= hctl->ffactor &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
 	}
@@ -943,20 +983,20 @@ hash_search_with_hash_value(HTAB *hashp,
 			{
 				/* if partitioned, must lock to touch nentries and freeList */
 				if (IS_PARTITIONED(hctl))
-					SpinLockAcquire(&hctl->mutex);
+					SpinLockAcquire(&(hctl->mutex[freelist_idx]));
 
-				Assert(hctl->nentries > 0);
-				hctl->nentries--;
+				Assert(hctl->nentries[freelist_idx] > 0);
+				hctl->nentries[freelist_idx]--;
 
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the freelist for this table.  */
-				currBucket->link = hctl->freeList;
-				hctl->freeList = currBucket;
+				currBucket->link = hctl->freeList[freelist_idx];
+				hctl->freeList[freelist_idx] = currBucket;
 
 				if (IS_PARTITIONED(hctl))
-					SpinLockRelease(&hctl->mutex);
+					SpinLockRelease(&hctl->mutex[freelist_idx]);
 
 				/*
				 * better hope the caller is synchronizing access to this
@@ -982,7 +1022,7 @@ hash_search_with_hash_value(HTAB *hashp,
 				elog(ERROR, "cannot insert into frozen hashtable \"%s\"",
 					 hashp->tabname);
 
-			currBucket = get_hash_entry(hashp);
+			currBucket = get_hash_entry(hashp, freelist_idx);
 			if (currBucket == NULL)
 			{
 				/* out of memory */
@@ -1175,39 +1215,69 @@ hash_update_hash_key(HTAB *hashp,
  * create a new entry if possible
  */
 static HASHBUCKET
-get_hash_entry(HTAB *hashp)
+get_hash_entry(HTAB *hashp, int freelist_idx)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	HASHBUCKET	newElement;
+	int			borrow_from_idx;
 
 	for (;;)
 	{
 		/* if partitioned, must lock to touch nentries and freeList */
 		if (IS_PARTITIONED(hctl))
-			SpinLockAcquire(&hctl->mutex);
+			SpinLockAcquire(&hctl->mutex[freelist_idx]);
 
 		/* try to get an entry from the freelist */
-		newElement = hctl->freeList;
+		newElement = hctl->freeList[freelist_idx];
+
 		if (newElement != NULL)
 			break;
 
-		/* no free elements.  allocate another chunk of buckets */
 		if (IS_PARTITIONED(hctl))
-			SpinLockRelease(&hctl->mutex);
+			SpinLockRelease(&hctl->mutex[freelist_idx]);
 
-		if (!element_alloc(hashp, hctl->nelem_alloc))
+		/* no free elements.  allocate another chunk of buckets */
+		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
 		{
-			/* out of memory */
-			return NULL;
+			if (!IS_PARTITIONED(hctl))
+				return NULL;	/* out of memory */
+
+			/* try to borrow an element from another partition */
+			borrow_from_idx = freelist_idx;
+			for (;;)
+			{
+				borrow_from_idx = (borrow_from_idx + 1) % NUM_FREELISTS;
+				if (borrow_from_idx == freelist_idx)
+					break;
+
+				SpinLockAcquire(&(hctl->mutex[borrow_from_idx]));
+				newElement = hctl->freeList[borrow_from_idx];
+
+				if (newElement != NULL)
+				{
+					hctl->freeList[borrow_from_idx] = newElement->link;
+					SpinLockRelease(&(hctl->mutex[borrow_from_idx]));
+
+					SpinLockAcquire(&hctl->mutex[freelist_idx]);
+					hctl->nentries[freelist_idx]++;
+					SpinLockRelease(&hctl->mutex[freelist_idx]);
+
+					break;
+				}
+
+				SpinLockRelease(&(hctl->mutex[borrow_from_idx]));
+			}
+
+			return newElement;
 		}
 	}
 
 	/* remove entry from freelist, bump nentries */
-	hctl->freeList = newElement->link;
-	hctl->nentries++;
+	hctl->freeList[freelist_idx] = newElement->link;
+	hctl->nentries[freelist_idx]++;
 
 	if (IS_PARTITIONED(hctl))
-		SpinLockRelease(&hctl->mutex);
+		SpinLockRelease(&hctl->mutex[freelist_idx]);
 
 	return newElement;
 }
@@ -1218,11 +1288,21 @@ get_hash_entry(HTAB *hashp)
 long
 hash_get_num_entries(HTAB *hashp)
 {
+	int			i;
+	long		sum = hashp->hctl->nentries[0];
+
 	/*
	 * We currently don't bother with the mutex; it's only sensible to call
	 * this function if you've got lock on all partitions of the table.
	 */
-	return hashp->hctl->nentries;
+
+	if (!IS_PARTITIONED(hashp->hctl))
+		return sum;
+
+	for (i = 1; i < NUM_FREELISTS; i++)
+		sum += hashp->hctl->nentries[i];
+
+	return sum;
 }
 
 /*
@@ -1527,10 +1607,10 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the free list
+ * allocate some new elements and link them into the indicated free list
 */
 static bool
-element_alloc(HTAB *hashp, int nelem)
+element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
@@ -1563,14 +1643,14 @@ element_alloc(HTAB *hashp, int nelem)
 
 	/* if partitioned, must lock to touch freeList */
 	if (IS_PARTITIONED(hctl))
-		SpinLockAcquire(&hctl->mutex);
+		SpinLockAcquire(&hctl->mutex[freelist_idx]);
 
 	/* freelist could be nonempty if two backends did this concurrently */
-	firstElement->link = hctl->freeList;
-	hctl->freeList = prevElement;
+	firstElement->link = hctl->freeList[freelist_idx];
+	hctl->freeList[freelist_idx] = prevElement;
 
 	if (IS_PARTITIONED(hctl))
-		SpinLockRelease(&hctl->mutex);
+		SpinLockRelease(&hctl->mutex[freelist_idx]);
 
 	return true;
 }
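For context, only a table created as partitioned ever takes the
multi-freelist paths above: that means a shared-memory table created
with the HASH_PARTITION flag and num_partitions set.  A hypothetical
caller (MyEntry, the table name, and the sizes are invented for the
example) would look roughly like this:

    HASHCTL		info;
    HTAB	   *htab;

    MemSet(&info, 0, sizeof(info));
    info.keysize = sizeof(uint32);
    info.entrysize = sizeof(MyEntry);	/* MyEntry is a made-up struct */
    info.num_partitions = 16;	/* power of 2; makes IS_PARTITIONED() true */

    htab = ShmemInitHash("my hypothetical table",
                         init_size, max_size,	/* made-up sizes */
                         &info,
                         HASH_ELEM | HASH_BLOBS | HASH_PARTITION);

A private (non-shared, unpartitioned) table goes through the same code
but with freelist_idx always 0 and no spinlock traffic, which is why
the single-client numbers above are worth watching.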