optimize lookups in snapshot [sub]xip arrays

2022-07-13 Thread Nathan Bossart
Hi hackers,

A few years ago, there was a proposal to create hash tables for long
[sub]xip arrays in snapshots [0], but the thread seems to have fizzled out.
I was curious whether this idea still showed measurable benefits, so I
revamped the patch and ran the same test as before [1].  Here are the
results for 60-second runs on an r5d.24xlarge with the data directory on
the local NVMe storage:

 writers  HEAD  patch  diff
 -------  ----  -----  ----
 16       659   664    +1%
 32       645   663    +3%
 64       659   692    +5%
 128      641   716    +12%
 256      619   610    -1%
 512      530   702    +32%
 768      469   582    +24%
 1000     367   577    +57%

As before, the hash table approach seems to provide a decent benefit at
higher client counts, so I felt it was worth reviving the idea.

The attached patch has some key differences from the previous proposal.
For example, the new patch uses simplehash instead of open-coding a new
hash table.  Also, I've bumped up the threshold for creating hash tables to
128 based on the results of my testing.  The attached patch waits until a
lookup of [sub]xip before generating the hash table, so we only need to
allocate enough space for the current elements in the [sub]xip array, and
we avoid allocating extra memory for workloads that do not need the hash
tables.  I'm slightly worried about increasing the number of memory
allocations in this code path, but the results above seemed encouraging on
that front.
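
For reference, the lookup path works roughly like the following condensed
sketch.  It assumes a simplehash instantiation named xip_hash keyed by
TransactionId; the attached patch is authoritative on the details.

static inline bool
XidInXip(TransactionId xid, TransactionId *xip, uint32 xcnt,
         xip_hash_hash **xiph)
{
    /* small arrays: a plain linear search is cheapest */
    if (xcnt < XIP_HASH_MIN_ELEMENTS)
    {
        for (uint32 i = 0; i < xcnt; i++)
        {
            if (TransactionIdEquals(xid, xip[i]))
                return true;
        }
        return false;
    }

    /* large arrays: build the hash table on first lookup */
    if (*xiph == NULL)
    {
        *xiph = xip_hash_create(TopTransactionContext, xcnt, NULL);
        for (uint32 i = 0; i < xcnt; i++)
        {
            bool    found;

            xip_hash_insert(*xiph, xip[i], &found);
        }
    }

    return xip_hash_lookup(*xiph, xid) != NULL;
}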

Thoughts?

[0] https://postgr.es/m/35960b8af917e9268881cd8df3f88320%40postgrespro.ru
[1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From a7e63198030d8c77df1720a85f9eca6d1d5078b2 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Tue, 12 Jul 2022 11:39:41 -0700
Subject: [PATCH v1 1/1] Optimize lookups in snapshot "transactions in
 progress" arrays.

Presently, XidInMVCCSnapshot() performs linear searches through the
xip and subxip arrays.  This is ordinarily not too bad, but this
O(n) behavior may degrade performance for certain workloads at
higher client counts.  This change teaches XidInMVCCSnapshot() to
generate hash tables when the [sub]xip array is large, thereby
achieving O(1) lookup times at the expense of some extra memory.
These hash tables are regarded as ephemeral; they only live in
process-local memory and are never rewritten, copied, or
serialized.  This means that we only need to allocate enough room
for the current elements in [sub]xip, which should usually save
memory (but increase the number of allocations).  Another benefit
of this approach is that the hash tables are not allocated for
workloads that do not generate snapshots with long [sub]xip arrays.

A synthetic workload that generates snapshots with many transaction
IDs showed no regression in TPS at lower client counts, and it
provided over 50% improvement at higher client counts (i.e., 1000
connections).
---
 src/backend/storage/ipc/procarray.c |   7 +-
 src/backend/utils/time/snapmgr.c| 111 ++--
 src/include/utils/snapshot.h|  19 +
 3 files changed, 114 insertions(+), 23 deletions(-)

diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index dadaa958a8..e126eb72bb 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -2544,12 +2544,15 @@ GetSnapshotData(Snapshot snapshot)
 	snapshot->curcid = GetCurrentCommandId(false);
 
 	/*
-	 * This is a new snapshot, so set both refcounts are zero, and mark it as
-	 * not copied in persistent memory.
+	 * This is a new snapshot, so set both refcounts are zero, mark it as not
+	 * copied in persistent memory, and mark the xip and subxip hash tables as
+	 * not built.
 	 */
 	snapshot->active_count = 0;
 	snapshot->regd_count = 0;
 	snapshot->copied = false;
+	snapshot->xiph = NULL;
+	snapshot->subxiph = NULL;
 
 	GetSnapshotDataInitOldSnapshot(snapshot);
 
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5bc2a15160..c5fcfd254c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -53,6 +53,7 @@
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "common/hashfn.h"
 #include "datatype/timestamp.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
@@ -626,6 +627,8 @@ CopySnapshot(Snapshot snapshot)
 	newsnap->active_count = 0;
 	newsnap->copied = true;
 	newsnap->snapXactCompletionCount = 0;
+	newsnap->xiph = NULL;
+	newsnap->subxiph = NULL;
 
 	/* setup XID array */
 	if (snapshot->xcnt > 0)
@@ -667,6 +670,10 @@ FreeSnapshot(Snapshot snapshot)
 	Assert(snapshot->active_count == 0);
 	Assert(snapshot->copied);
 
+	if (snapshot->xiph)
+		xip_hash_destroy(snapshot->xiph);
+	if (snapshot->subxiph)
+		xip_hash_destroy(snapshot->subxiph);
 	pfree(snapshot);
 }
 
@@ -2233,6 +2240,8 @@ Rest

Re: optimize lookups in snapshot [sub]xip arrays

2022-07-14 Thread Bharath Rupireddy
On Wed, Jul 13, 2022 at 10:40 PM Nathan Bossart
 wrote:
>
> Hi hackers,
>
> A few years ago, there was a proposal to create hash tables for long
> [sub]xip arrays in snapshots [0], but the thread seems to have fizzled out.
> I was curious whether this idea still showed measurable benefits, so I
> revamped the patch and ran the same test as before [1].  Here are the
> results for 60-second runs on an r5d.24xlarge with the data directory on
> the local NVMe storage:
>
>  writers  HEAD  patch  diff
>  -------  ----  -----  ----
>  16       659   664    +1%
>  32       645   663    +3%
>  64       659   692    +5%
>  128      641   716    +12%
>  256      619   610    -1%
>  512      530   702    +32%
>  768      469   582    +24%
>  1000     367   577    +57%

Impressive.

> As before, the hash table approach seems to provide a decent benefit at
> higher client counts, so I felt it was worth reviving the idea.
>
> The attached patch has some key differences from the previous proposal.
> For example, the new patch uses simplehash instead of open-coding a new
> hash table.  Also, I've bumped up the threshold for creating hash tables to
> 128 based on the results of my testing.  The attached patch waits until a
> lookup of [sub]xip before generating the hash table, so we only need to
> allocate enough space for the current elements in the [sub]xip array, and
> we avoid allocating extra memory for workloads that do not need the hash
> tables.  I'm slightly worried about increasing the number of memory
> allocations in this code path, but the results above seemed encouraging on
> that front.
>
> Thoughts?
>
> [0] https://postgr.es/m/35960b8af917e9268881cd8df3f88320%40postgrespro.ru
> [1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com

Aren't these snapshot arrays always sorted? I see the following code:

/* sort so we can bsearch() */
qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);

/* sort so we can bsearch() later */
qsort(snap->subxip, snap->subxcnt, sizeof(TransactionId), xidComparator);

If the ordering isn't an invariant of these snapshot arrays, can we
also use the hash table mechanism for all of the snapshot arrays
infrastructure rather than qsort+bsearch in a few places and hash
table for others?

+ * The current value worked well in testing, but it's still mostly a guessed-at
+ * number that might need updating in the future.
+ */
+#define XIP_HASH_MIN_ELEMENTS (128)
+

Do you see a regression with a hash table for all the cases? Why can't
we just build a hash table irrespective of these limits and use it for
all the purposes instead of making it complex with different
approaches if we don't have measurable differences in the performance
or throughput?

+static inline bool
+XidInXip(TransactionId xid, TransactionId *xip, uint32 xcnt,
+ xip_hash_hash **xiph)

+ /* Make sure the hash table is built. */
+ if (*xiph == NULL)
+ {
+ *xiph = xip_hash_create(TopTransactionContext, xcnt, NULL);
+
+ for (int i = 0; i < xcnt; i++)

Why create a hash table on the first search? Why can't it be built
while inserting or creating these snapshots? Basically, instead of the
array, can these snapshot structures be hash tables by themselves? I
know this requires a good amount of code refactoring, but worth
considering IMO as it removes bsearch thus might improve the
performance further.

Regards,
Bharath Rupireddy.




Re: optimize lookups in snapshot [sub]xip arrays

2022-07-14 Thread Nathan Bossart
Hi Bharath,

Thanks for taking a look.

On Thu, Jul 14, 2022 at 03:10:56PM +0530, Bharath Rupireddy wrote:
> Aren't these snapshot arrays always sorted? I see the following code:
> 
> /* sort so we can bsearch() */
> qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
> 
> /* sort so we can bsearch() later */
> qsort(snap->subxip, snap->subxcnt, sizeof(TransactionId), xidComparator);

AFAICT these arrays are sorted in limited cases, such as
pg_current_snapshot() and logical replication.  GetSnapshotData() does not
appear to sort them, so I don't think we can always assume they are sorted.
In the previous thread, Tomas analyzed simply sorting the arrays [0] and
found that it provided much less improvement compared to the hash table
approach, so I have not seriously considered it here.

> If the ordering isn't an invariant of these snapshot arrays, can we
> also use the hash table mechanism for all of the snapshot arrays
> infrastructure rather than qsort+bsearch in a few places and hash
> table for others?

Unless there is demonstrable benefit in doing so for the few places that
sort the arrays, I'm skeptical it's worth the complexity.  This patch is
targeted to XidInMVCCSnapshot(), which we can demonstrate has clear impact
on TPS for some workloads.

> + * The current value worked well in testing, but it's still mostly a 
> guessed-at
> + * number that might need updating in the future.
> + */
> +#define XIP_HASH_MIN_ELEMENTS (128)
> +
> 
> Do you see a regression with a hash table for all the cases? Why can't
> we just build a hash table irrespective of these limits and use it for
> all the purposes instead of making it complex with different
> approaches if we don't have measurable differences in the performance
> or throughput?

I performed the same tests as before with a variety of values.  Here are
the results:

 writers  HEAD  1    16   32   64   128
 -------  ----  ---  ---  ---  ---  ---
 16       659   698  678  659  665  664
 32       645   661  688  657  649  663
 64       659   656  653  649  663  692
 128      641   636  639  679  643  716
 256      619   641  619  643  653  610
 512      530   609  582  602  605  702
 768      469   610  608  551  571  582
 1000     367   610  538  557  556  577

I was surprised to see that there really wasn't a regression at the low end,
but keep in mind that this is a rather large machine and a specialized
workload for generating snapshots with long [sub]xip arrays.  That being
said, there really wasn't any improvement at the low end, either.  If we
always built a hash table, we'd be introducing more overhead and memory
usage in return for approximately zero benefit.  My intent was to only take
on the overhead in cases where we believe it might have a positive impact,
which is why I picked the somewhat conservative value of 128.  If the
overhead isn't a concern, it might be feasible to always make [sub]xip a
hash table.

> +static inline bool
> +XidInXip(TransactionId xid, TransactionId *xip, uint32 xcnt,
> + xip_hash_hash **xiph)
> 
> + /* Make sure the hash table is built. */
> + if (*xiph == NULL)
> + {
> + *xiph = xip_hash_create(TopTransactionContext, xcnt, NULL);
> +
> + for (int i = 0; i < xcnt; i++)
> 
> Why create a hash table on the first search? Why can't it be built
> while inserting or creating these snapshots? Basically, instead of the
> array, can these snapshot structures be hash tables by themselves? I
> know this requires a good amount of code refactoring, but worth
> considering IMO as it removes bsearch thus might improve the
> performance further.

The idea is to avoid the overhead unless something actually needs to
inspect these arrays.

[0] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-07-15 Thread Andres Freund
Hi,

Sounds worth pursuing.

On 2022-07-13 10:09:50 -0700, Nathan Bossart wrote:
> The attached patch has some key differences from the previous proposal.
> For example, the new patch uses simplehash instead of open-coding a new
> hash table.

+1

> The attached patch waits until a lookup of [sub]xip before generating the
> hash table, so we only need to allocate enough space for the current
> elements in the [sub]xip array, and we avoid allocating extra memory for
> workloads that do not need the hash tables.

Hm. Are there any contexts where we'd not want the potential for failing due
to OOM?

I wonder if we additionally / alternatively could use a faster method of
searching the array linearly, e.g. using SIMD.


Another thing that might be worth looking into is to sort the xip/subxip
arrays into a binary-search optimized layout. That'd make the binary search
faster, wouldn't require additional memory (a boolean indicating whether
sorted somewhere, I guess), and would easily persist across copies of the
snapshot.
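
One such layout is "Eytzinger" (breadth-first) order, where node i's
children live at slots 2i and 2i + 1, keeping upcoming comparisons close
together in memory.  A minimal sketch of a lookup over such a layout,
assuming the array has already been rearranged into that order
(illustration only, not from any attached patch):

static inline bool
eytzinger_contains(const TransactionId *base, uint32 nelem,
                   TransactionId key)
{
    uint32      i = 1;          /* 1-based root of the implicit tree */

    while (i <= nelem)
    {
        if (base[i - 1] == key)
            return true;
        /* descend to left child (2i) or right child (2i + 1) */
        i = 2 * i + (base[i - 1] < key);
    }
    return false;
}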


> I'm slightly worried about increasing the number of memory
> allocations in this code path, but the results above seemed encouraging on
> that front.

ISTM that the test wouldn't be likely to show those issues.


> These hash tables are regarded as ephemeral; they only live in
> process-local memory and are never rewritten, copied, or
> serialized.

What does rewriting refer to here?

Not convinced that's the right idea in case of copying. I think we often end
up copying snapshots frequently, and building & allocating the hashed xids
separately every time seems not great.


> + snapshot->xiph = NULL;
> + snapshot->subxiph = NULL;

Do we need separate hashes for these? ISTM that if we're overflowed then we
don't need ->subxip[h], and if not, then the action for an xid being in ->xip
and ->subxiph is the same?

Greetings,

Andres Freund




Re: optimize lookups in snapshot [sub]xip arrays

2022-07-16 Thread Nathan Bossart
Hi Andres,

Thanks for taking a look.

On Fri, Jul 15, 2022 at 01:08:57PM -0700, Andres Freund wrote:
> Hm. Are there any contexts where we'd not want the potential for failing due
> to OOM?

I'm not sure about this one.

> I wonder if we additionally / alternatively could use a faster method of
> searching the array linearly, e.g. using SIMD.

I looked into using SIMD.  The patch is attached, but it is only intended
for benchmarking purposes and isn't anywhere close to being worth serious
review.  There may be a simpler/better way to implement the linear search,
but this seemed to work well.  Overall, SIMD provided a decent improvement.
I had to increase the number of writers quite a bit in order to demonstrate
where the hash tables began winning.  Here are the numbers:

writers  head  simd  hash
256      663   632   694
512      530   618   637
768      489   544   573
1024     364   508   562
2048     185   306   485
4096     146   197   441

While it is unsurprising that the hash tables perform the best, there are a
couple of advantages to SIMD that might make that approach worth
considering.  For one, there's really no overhead (i.e., you don't need to
sort the array or build a hash table), so we can avoid picking an arbitrary
threshold and just have one code path.  Also, a SIMD implementation for a
linear search through an array of integers could likely be easily reused
elsewhere.

> Another thing that might be worth looking into is to sort the xip/subxip
> arrays into a binary-search optimized layout. That'd make the binary search
> faster, wouldn't require additional memory (a boolean indicating whether
> sorted somewhere, I guess), and would easily persist across copies of the
> snapshot.

I spent some time looking into this, but I haven't attempted to implement
it.  IIUC the most difficult part of this is sorting the array in place to
the special layout.
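
To make that concrete: building the layout out of place from a sorted array
is just an in-order traversal of the implicit tree; it's the in-place
rearrangement that is hard.  A sketch of the out-of-place construction
(illustration only, not from any attached patch):

static uint32
eytzinger_build(const TransactionId *sorted, TransactionId *out,
                uint32 nelem, uint32 i, uint32 k)
{
    /*
     * k is the 1-based slot in the implicit tree; i is the next
     * unconsumed index in the sorted input.
     */
    if (k <= nelem)
    {
        i = eytzinger_build(sorted, out, nelem, i, 2 * k);      /* left */
        out[k - 1] = sorted[i++];                               /* node */
        i = eytzinger_build(sorted, out, nelem, i, 2 * k + 1);  /* right */
    }
    return i;
}

/* initial call: eytzinger_build(sorted, out, nelem, 0, 1) */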

>> These hash tables are regarded as ephemeral; they only live in
>> process-local memory and are never rewritten, copied, or
>> serialized.
> 
> What does rewriting refer to here?

I mean that a hash table created for one snapshot will not be cleared and
reused for another.

> Not convinced that's the right idea in case of copying. I think we often end
> up copying snapshots frequently, and building & allocating the hashed xids
> separately every time seems not great.

Right.  My concern with reusing the hash tables is that we'd need to
allocate much more space that would go largely unused in many cases.

>> +snapshot->xiph = NULL;
>> +snapshot->subxiph = NULL;
> 
> Do we need separate hashes for these? ISTM that if we're overflowed then we
> don't need ->subxip[h], and if not, then the action for an xid being in ->xip
> and ->subxiph is the same?

Do you mean that we can combine these into one hash table?  That might
work.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5bc2a15160..25d1a3564c 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -45,6 +45,7 @@
  */
 #include "postgres.h"
 
+#include 
 #include 
 #include 
 
@@ -2271,6 +2272,40 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
 	SetTransactionSnapshot(snapshot, NULL, InvalidPid, source_pgproc);
 }
 
+static inline bool
+XidInXip(TransactionId xid, TransactionId *xip, uint32 len)
+{
+	__m128i xids = _mm_set1_epi32(xid);
+	uint32  its = len & ~15;  /* round down to nearest multiple of 16 */
+	uint32	i;
+
+	for (i = 0; i < its; i += 16)
+	{
+		__m128i xips1 = _mm_loadu_si128((__m128i *) &xip[i]);
+		__m128i xips2 = _mm_loadu_si128((__m128i *) &xip[i + 4]);
+		__m128i xips3 = _mm_loadu_si128((__m128i *) &xip[i + 8]);
+		__m128i xips4 = _mm_loadu_si128((__m128i *) &xip[i + 12]);
+		__m128i result1 = _mm_cmpeq_epi32(xids, xips1);
+		__m128i result2 = _mm_cmpeq_epi32(xids, xips2);
+		__m128i result3 = _mm_cmpeq_epi32(xids, xips3);
+		__m128i result4 = _mm_cmpeq_epi32(xids, xips4);
+		__m128i tmp1 = _mm_packs_epi32(result1, result2);
+		__m128i tmp2 = _mm_packs_epi32(result3, result4);
+		__m128i result = _mm_packs_epi16(tmp1, tmp2);
+
+		if (_mm_movemask_epi8(result) != 0)
+			return true;
+	}
+
+	while (i < len)
+	{
+		if (TransactionIdEquals(xid, xip[i++]))
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * XidInMVCCSnapshot
  *		Is the given XID still-in-progress according to the snapshot?
@@ -2284,8 +2319,6 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
 bool
 XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 {
-	uint32		i;
-
 	/*
 	 * Make a quick range check to eliminate most XIDs without looking at the
 	 * xip arrays.  Note that this is OK even if we convert a subxact XID to
@@ -2317,13 +2350,8 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 		if (!snapshot->suboverflowed)
 		{
 			/* we have full data, so search subxip */
-			int32		j;
-
-			for (j = 0; j < snapsho

Re: optimize lookups in snapshot [sub]xip arrays

2022-07-23 Thread Zhang Mingli
Hi, all


> 
>   if (!snapshot->suboverflowed)
>   {
>   /* we have full data, so search subxip */
> - int32   j;
> -
> - for (j = 0; j < snapshot->subxcnt; j++)
> - {
> - if (TransactionIdEquals(xid, 
> snapshot->subxip[j]))
> - return true;
> - }
> + if (XidInXip(xid, snapshot->subxip, snapshot->subxcnt,
> +  &snapshot->subxiph))
> + return true;
>  
>   /* not there, fall through to search xip[] */
>   }


If snapshot->suboverflowed is false, then subxcnt must be less than
PGPROC_MAX_CACHED_SUBXIDS, which is currently 64.

And we won't use a hash table if xcnt is less than XIP_HASH_MIN_ELEMENTS,
which is currently 128 in this discussion.

So the subxid hash table will never be used, right?

Regards,

Zhang Mingli


> On Jul 14, 2022, at 01:09, Nathan Bossart  wrote:
> 
> Hi hackers,
> 
> A few years ago, there was a proposal to create hash tables for long
> [sub]xip arrays in snapshots [0], but the thread seems to have fizzled out.
> I was curious whether this idea still showed measurable benefits, so I
> revamped the patch and ran the same test as before [1].  Here are the
> results for 60-second runs on an r5d.24xlarge with the data directory on
> the local NVMe storage:
> 
> writers  HEAD  patch  diff
> -------  ----  -----  ----
> 16       659   664    +1%
> 32       645   663    +3%
> 64       659   692    +5%
> 128      641   716    +12%
> 256      619   610    -1%
> 512      530   702    +32%
> 768      469   582    +24%
> 1000     367   577    +57%
> 
> As before, the hash table approach seems to provide a decent benefit at
> higher client counts, so I felt it was worth reviving the idea.
> 
> The attached patch has some key differences from the previous proposal.
> For example, the new patch uses simplehash instead of open-coding a new
> hash table.  Also, I've bumped up the threshold for creating hash tables to
> 128 based on the results of my testing.  The attached patch waits until a
> lookup of [sub]xip before generating the hash table, so we only need to
> allocate enough space for the current elements in the [sub]xip array, and
> we avoid allocating extra memory for workloads that do not need the hash
> tables.  I'm slightly worried about increasing the number of memory
> allocations in this code path, but the results above seemed encouraging on
> that front.
> 
> Thoughts?
> 
> [0] https://postgr.es/m/35960b8af917e9268881cd8df3f88320%40postgrespro.ru
> [1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com
> 
> -- 
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
> 





Re: optimize lookups in snapshot [sub]xip arrays

2022-07-24 Thread Yura Sokolov
On Wed, 13/07/2022 at 10:09 -0700, Nathan Bossart wrote:
> Hi hackers,
> 
> A few years ago, there was a proposal to create hash tables for long
> [sub]xip arrays in snapshots [0], but the thread seems to have fizzled out.
> I was curious whether this idea still showed measurable benefits, so I
> revamped the patch and ran the same test as before [1].  Here are the
> results for 60-second runs on an r5d.24xlarge with the data directory on
> the local NVMe storage:
> 
>  writers  HEAD  patch  diff
>  -------  ----  -----  ----
>  16       659   664    +1%
>  32       645   663    +3%
>  64       659   692    +5%
>  128      641   716    +12%
>  256      619   610    -1%
>  512      530   702    +32%
>  768      469   582    +24%
>  1000     367   577    +57%
> 
> As before, the hash table approach seems to provide a decent benefit at
> higher client counts, so I felt it was worth reviving the idea.
> 
> The attached patch has some key differences from the previous proposal.
> For example, the new patch uses simplehash instead of open-coding a new
> hash table.  Also, I've bumped up the threshold for creating hash tables to
> 128 based on the results of my testing.  The attached patch waits until a
> lookup of [sub]xip before generating the hash table, so we only need to
> allocate enough space for the current elements in the [sub]xip array, and
> we avoid allocating extra memory for workloads that do not need the hash
> tables.  I'm slightly worried about increasing the number of memory
> allocations in this code path, but the results above seemed encouraging on
> that front.
> 
> Thoughts?
> 
> [0] https://postgr.es/m/35960b8af917e9268881cd8df3f88320%40postgrespro.ru
> [1] https://postgr.es/m/057a9a95-19d2-05f0-17e2-f46ff20e9b3e%402ndquadrant.com
> 

I'm glad my idea has been reborn.

Well, maybe simplehash is not a bad idea, though it certainly consumes
more memory and CPU instructions.

I'll try to review.

regards,

Yura Sokolov




Re: optimize lookups in snapshot [sub]xip arrays

2022-07-24 Thread Nathan Bossart
On Sun, Jul 24, 2022 at 12:48:25PM +0800, Zhang Mingli wrote:
> If snapshot->suboverflowed is false, then subxcnt must be less than
> PGPROC_MAX_CACHED_SUBXIDS, which is currently 64.
> 
> And we won't use a hash table if xcnt is less than XIP_HASH_MIN_ELEMENTS,
> which is currently 128 in this discussion.
> 
> So the subxid hash table will never be used, right?

This array will store up to TOTAL_MAX_CACHED_SUBXIDS transactions, which
will typically be much greater than 64.  When there isn't any overflow,
subxip stores all of the subxids for all of the entries in the procarray.
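
For reference, the array is sized in procarray.c roughly as follows (the +1
leaves room for each backend's top-level xid, since during recovery
top-level xids are stored in subxip as well):

/* from src/backend/storage/ipc/procarray.c */
#define TOTAL_MAX_CACHED_SUBXIDS \
	((PGPROC_MAX_CACHED_SUBXIDS + 1) * PROCARRAY_MAXPROCS)

With a default-sized procarray this is already several thousand entries.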

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-07-24 Thread Zhang Mingli
Got it, thanks.



Regards,
Zhang Mingli



> On Jul 25, 2022, at 12:08, Nathan Bossart  wrote:
> 
> On Sun, Jul 24, 2022 at 12:48:25PM +0800, Zhang Mingli wrote:
>> If snapshot->suboverflowed is false, then subxcnt must be less than
>> PGPROC_MAX_CACHED_SUBXIDS, which is currently 64.
>> 
>> And we won't use a hash table if xcnt is less than XIP_HASH_MIN_ELEMENTS,
>> which is currently 128 in this discussion.
>> 
>> So the subxid hash table will never be used, right?
> 
> This array will store up to TOTAL_MAX_CACHED_SUBXIDS transactions, which
> will typically be much greater than 64.  When there isn't any overflow,
> subxip stores all of the subxids for all of the entries in the procarray.
> 
> -- 
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com





Re: optimize lookups in snapshot [sub]xip arrays

2022-07-25 Thread Nathan Bossart
On Sat, Jul 16, 2022 at 08:59:57PM -0700, Nathan Bossart wrote:
> On Fri, Jul 15, 2022 at 01:08:57PM -0700, Andres Freund wrote:
>> I wonder if we additionally / alternatively could use a faster method of
>> searching the array linearly, e.g. using SIMD.
> 
> I looked into using SIMD.  The patch is attached, but it is only intended
> for benchmarking purposes and isn't anywhere close to being worth serious
> review.  There may be a simpler/better way to implement the linear search,
> but this seemed to work well.  Overall, SIMD provided a decent improvement.
> I had to increase the number of writers quite a bit in order to demonstrate
> where the hash tables began winning.  Here are the numbers:
> 
> writers  head  simd  hash
> 256      663   632   694
> 512      530   618   637
> 768      489   544   573
> 1024     364   508   562
> 2048     185   306   485
> 4096     146   197   441
> 
> While it is unsurprising that the hash tables perform the best, there are a
> couple of advantages to SIMD that might make that approach worth
> considering.  For one, there's really no overhead (i.e., you don't need to
> sort the array or build a hash table), so we can avoid picking an arbitrary
> threshold and just have one code path.  Also, a SIMD implementation for a
> linear search through an array of integers could likely be easily reused
> elsewhere.

From the discussion thus far, it seems there is interest in optimizing
[sub]xip lookups, so I'd like to spend some time moving it forward.  I
think the biggest open question is which approach to take.  Both the SIMD
and hash table approaches seem viable, but I think I prefer the SIMD
approach at the moment (see the last paragraph of quoted text for the
reasons).  What do folks think?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-07-25 Thread Zhang Mingli




> On Jul 26, 2022, at 03:04, Nathan Bossart  wrote:
>> 
> From the discussion thus far, it seems there is interest in optimizing
> [sub]xip lookups, so I'd like to spend some time moving it forward.  I
> think the biggest open question is which approach to take.  Both the SIMD
> and hash table approaches seem viable, but I think I prefer the SIMD
> approach at the moment (see the last paragraph of quoted text for the
> reasons).  What do folks think?
> 
> -- 
> Nathan Bossart
> Amazon Web Services: https://aws.amazon.com
> 
> 

+1, I’m not familiar with SIMD, will try to review this patch.


Regards,
Zhang Mingli






Re: optimize lookups in snapshot [sub]xip arrays

2022-07-26 Thread Andres Freund
On 2022-07-25 12:04:19 -0700, Nathan Bossart wrote:
> On Sat, Jul 16, 2022 at 08:59:57PM -0700, Nathan Bossart wrote:
> > On Fri, Jul 15, 2022 at 01:08:57PM -0700, Andres Freund wrote:
> >> I wonder if we additionally / alternatively could use a faster method of
> >> searching the array linearly, e.g. using SIMD.
> > 
> > I looked into using SIMD.  The patch is attached, but it is only intended
> > for benchmarking purposes and isn't anywhere close to being worth serious
> > review.  There may be a simpler/better way to implement the linear search,
> > but this seemed to work well.  Overall, SIMD provided a decent improvement.
> > I had to increase the number of writers quite a bit in order to demonstrate
> > where the hash tables began winning.  Here are the numbers:
> > 
> > writers  head  simd  hash
> > 256      663   632   694
> > 512      530   618   637
> > 768      489   544   573
> > 1024     364   508   562
> > 2048     185   306   485
> > 4096     146   197   441
> > 
> > While it is unsurprising that the hash tables perform the best, there are a
> > couple of advantages to SIMD that might make that approach worth
> > considering.  For one, there's really no overhead (i.e., you don't need to
> > sort the array or build a hash table), so we can avoid picking an arbitrary
> > threshold and just have one code path.  Also, a SIMD implementation for a
> > linear search through an array of integers could likely be easily reused
> > elsewhere.
> 
> From the discussion thus far, it seems there is interest in optimizing
> [sub]xip lookups, so I'd like to spend some time moving it forward.  I
> think the biggest open question is which approach to take.  Both the SIMD
> and hash table approaches seem viable, but I think I prefer the SIMD
> approach at the moment (see the last paragraph of quoted text for the
> reasons).  What do folks think?

Agreed on all points.




Re: optimize lookups in snapshot [sub]xip arrays

2022-07-28 Thread Nathan Bossart
On Tue, Jul 26, 2022 at 11:19:06AM -0700, Andres Freund wrote:
> On 2022-07-25 12:04:19 -0700, Nathan Bossart wrote:
>> From the discussion thus far, it seems there is interest in optimizing
>> [sub]xip lookups, so I'd like to spend some time moving it forward.  I
>> think the biggest open question is which approach to take.  Both the SIMD
>> and hash table approaches seem viable, but I think I prefer the SIMD
>> approach at the moment (see the last paragraph of quoted text for the
>> reasons).  What do folks think?
> 
> Agreed on all points.

Great!  Here is a new patch.  A couple notes:

 * I briefly looked into seeing whether auto-vectorization was viable and
   concluded it was not for these loops.

 * I borrowed USE_SSE2 from one of John Naylor's patches [0].  I'm not sure
   whether this is committable, so I would welcome thoughts on the proper
   form.  Given the comment says that SSE2 is supported by all x86-64
   hardware, I'm not seeing why we need the SSE 4.2 checks.  Is it not
   enough to check for __x86_64__ and _M_AMD64?

 * I haven't looked into adding an ARM implementation yet.

[0] 
https://postgr.es/m/CAFBsxsHko7yc8A-2PpjQ%3D2StomXF%2BT2jgKF%3DWaMFZWi8CvV7hA%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From 10a0369182be525dbe849d856b663aede10c4c16 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 28 Jul 2022 12:15:47 -0700
Subject: [PATCH v3 1/1] Use SSE2 intrinsics for XidInMVCCSnapshot().

This optimizes the linear searches through the [sub]xip arrays when
possible, which should improve performance significantly when the
arrays are large.
---
 src/backend/utils/time/snapmgr.c | 28 +++
 src/include/c.h  | 13 ++
 src/include/utils/linearsearch.h | 80 
 3 files changed, 100 insertions(+), 21 deletions(-)
 create mode 100644 src/include/utils/linearsearch.h

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5bc2a15160..834c8867d4 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -63,6 +63,7 @@
 #include "storage/sinvaladt.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/linearsearch.h"
 #include "utils/memutils.h"
 #include "utils/old_snapshot.h"
 #include "utils/rel.h"
@@ -2284,8 +2285,6 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
 bool
 XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 {
-	uint32		i;
-
 	/*
 	 * Make a quick range check to eliminate most XIDs without looking at the
 	 * xip arrays.  Note that this is OK even if we convert a subxact XID to
@@ -2317,13 +2316,8 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 		if (!snapshot->suboverflowed)
 		{
 			/* we have full data, so search subxip */
-			int32		j;
-
-			for (j = 0; j < snapshot->subxcnt; j++)
-			{
-if (TransactionIdEquals(xid, snapshot->subxip[j]))
-	return true;
-			}
+			if (pg_linearsearch_uint32(xid, snapshot->subxip, snapshot->subxcnt))
+return true;
 
 			/* not there, fall through to search xip[] */
 		}
@@ -2344,16 +2338,11 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 return false;
 		}
 
-		for (i = 0; i < snapshot->xcnt; i++)
-		{
-			if (TransactionIdEquals(xid, snapshot->xip[i]))
-return true;
-		}
+		if (pg_linearsearch_uint32(xid, snapshot->xip, snapshot->xcnt))
+			return true;
 	}
 	else
 	{
-		int32		j;
-
 		/*
 		 * In recovery we store all xids in the subxact array because it is by
 		 * far the bigger array, and we mostly don't know which xids are
@@ -2383,11 +2372,8 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 		 * indeterminate xid. We don't know whether it's top level or subxact
 		 * but it doesn't matter. If it's present, the xid is visible.
 		 */
-		for (j = 0; j < snapshot->subxcnt; j++)
-		{
-			if (TransactionIdEquals(xid, snapshot->subxip[j]))
-return true;
-		}
+		if (pg_linearsearch_uint32(xid, snapshot->subxip, snapshot->subxcnt))
+			return true;
 	}
 
 	return false;
diff --git a/src/include/c.h b/src/include/c.h
index d35405f191..8b7d844fc9 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -371,6 +371,19 @@ typedef void (*pg_funcptr_t) (void);
 #endif
 #endif
 
+/*
+ * Are SSE2 intrinsics available?
+ *
+ * Note: We piggy-back on the check for SSE 4.2 intrinsics but only need SSE2
+ * at runtime.  That's supported by all x86-64 hardware, so we don't need an
+ * indirect function call.
+ *
+ * XXX: Consider removing CRC from the names.
+ */
+#if (defined(__x86_64__) || defined(_M_AMD64)) && (defined(USE_SSE42_CRC32C) || defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK))
+#define USE_SSE2
+#endif
+
 
 /* 
  *Section 2:	bool, true, false
diff --git a/src/include/utils/linearsearch.h b/src/include/utils/linearsearch.h
new file mode 100644
index 00..c797fd18ca
--- /dev/null
+++ b/src/include/utils/linearsearch.

Re: optimize lookups in snapshot [sub]xip arrays

2022-07-29 Thread John Naylor
On Fri, Jul 29, 2022 at 4:34 AM Nathan Bossart 
wrote:
>  * I briefly looked into seeing whether auto-vectorization was viable and
>concluded it was not for these loops.
>
>  * I borrowed USE_SSE2 from one of John Naylor's patches [0].  I'm not
sure
>whether this is committable,

I'll be the first to say it's not committable and needs some thought. Since
there are several recently proposed patches that take advantage of SSE2, it
seems time for me to open a new thread and get that prerequisite settled.
I'll do that next week.

> so I would welcome thoughts on the proper
>form.  Given the comment says that SSE2 is supported by all x86-64
>hardware, I'm not seeing why we need the SSE 4.2 checks.  Is it not
>enough to check for __x86_64__ and _M_AMD64?

That's enough for emitting instructions that the target CPU can run, but
says nothing (I think) about the host compiler's ability to understand the
intrinsics and associated headers. The architecture is old enough that
maybe zero compilers in the buildfarm that target AMD64 fail to understand
SSE2 intrinsics, but I hadn't looked into it. The SSE 4.2 intrinsics check
is not necessary, but it was sufficient and already present, so I borrowed
it for the PoC.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: optimize lookups in snapshot [sub]xip arrays

2022-07-29 Thread Nathan Bossart
On Sat, Jul 30, 2022 at 12:02:02PM +0700, John Naylor wrote:
> On Fri, Jul 29, 2022 at 4:34 AM Nathan Bossart 
> wrote:
>>  * I borrowed USE_SSE2 from one of John Naylor's patches [0].  I'm not
> sure
>>whether this is committable,
> 
> I'll be the first to say it's not committable and needs some thought. Since
> there are several recently proposed patches that take advantage of SSE2, it
> seems time for me to open a new thread and get that prerequisite settled.
> I'll do that next week.

Awesome.  I will help test and review.

>> so I would welcome thoughts on the proper
>>form.  Given the comment says that SSE2 is supported by all x86-64
>>hardware, I'm not seeing why we need the SSE 4.2 checks.  Is it not
>>enough to check for __x86_64__ and _M_AMD64?
> 
> That's enough for emitting instructions that the target CPU can run, but
> says nothing (I think) about the host compiler's ability to understand the
> intrinsics and associated headers. The architecture is old enough that
> maybe zero compilers in the buildfarm that target AMD64 fail to understand
> SSE2 intrinsics, but I hadn't looked into it. The SSE 4.2 intrinsics check
> is not necessary, but it was sufficient and already present, so I borrowed
> it for the PoC.

Got it, makes sense.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-02 Thread Nathan Bossart
On Fri, Jul 29, 2022 at 10:38:11PM -0700, Nathan Bossart wrote:
> On Sat, Jul 30, 2022 at 12:02:02PM +0700, John Naylor wrote:
>> I'll be the first to say it's not committable and needs some thought. Since
>> there are several recently proposed patches that take advantage of SSE2, it
>> seems time for me to open a new thread and get that prerequisite settled.
>> I'll do that next week.
> 
> Awesome.  I will help test and review.

While this prerequisite is worked out [0], here is a new patch set.  I've
added an 0002 in which I've made use of the proposed SSE2 linear search
function in several other areas.  I haven't done any additional performance
analysis, and it's likely I'm missing some eligible code locations, but at
the very least, this demonstrates the reusability of the new function.

[0] 
https://postgr.es/m/CAFBsxsE2G_H_5Wbw%2BNOPm70-BK4xxKf86-mRzY%3DL2sLoQqM%2B-Q%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From a2b2ab3777f689775c2e731822aefe2ab500e8ee Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Thu, 28 Jul 2022 12:15:47 -0700
Subject: [PATCH v4 1/2] Use SSE2 intrinsics for XidInMVCCSnapshot().

This optimizes the linear searches through the [sub]xip arrays when
possible, which should improve performance significantly when the
arrays are large.
---
 src/backend/utils/time/snapmgr.c | 28 +++-
 src/include/c.h  |  8 
 src/include/utils/linearsearch.h | 78 
 3 files changed, 93 insertions(+), 21 deletions(-)
 create mode 100644 src/include/utils/linearsearch.h

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5bc2a15160..834c8867d4 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -63,6 +63,7 @@
 #include "storage/sinvaladt.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/linearsearch.h"
 #include "utils/memutils.h"
 #include "utils/old_snapshot.h"
 #include "utils/rel.h"
@@ -2284,8 +2285,6 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
 bool
 XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 {
-	uint32		i;
-
 	/*
 	 * Make a quick range check to eliminate most XIDs without looking at the
 	 * xip arrays.  Note that this is OK even if we convert a subxact XID to
@@ -2317,13 +2316,8 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 		if (!snapshot->suboverflowed)
 		{
 			/* we have full data, so search subxip */
-			int32		j;
-
-			for (j = 0; j < snapshot->subxcnt; j++)
-			{
-if (TransactionIdEquals(xid, snapshot->subxip[j]))
-	return true;
-			}
+			if (pg_linearsearch_uint32(xid, snapshot->subxip, snapshot->subxcnt))
+return true;
 
 			/* not there, fall through to search xip[] */
 		}
@@ -2344,16 +2338,11 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 return false;
 		}
 
-		for (i = 0; i < snapshot->xcnt; i++)
-		{
-			if (TransactionIdEquals(xid, snapshot->xip[i]))
-return true;
-		}
+		if (pg_linearsearch_uint32(xid, snapshot->xip, snapshot->xcnt))
+			return true;
 	}
 	else
 	{
-		int32		j;
-
 		/*
 		 * In recovery we store all xids in the subxact array because it is by
 		 * far the bigger array, and we mostly don't know which xids are
@@ -2383,11 +2372,8 @@ XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 		 * indeterminate xid. We don't know whether it's top level or subxact
 		 * but it doesn't matter. If it's present, the xid is visible.
 		 */
-		for (j = 0; j < snapshot->subxcnt; j++)
-		{
-			if (TransactionIdEquals(xid, snapshot->subxip[j]))
-return true;
-		}
+		if (pg_linearsearch_uint32(xid, snapshot->subxip, snapshot->subxcnt))
+			return true;
 	}
 
 	return false;
diff --git a/src/include/c.h b/src/include/c.h
index d35405f191..2c1a47bc28 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -371,6 +371,14 @@ typedef void (*pg_funcptr_t) (void);
 #endif
 #endif
 
+/*
+ * Are SSE2 intrinsics available?
+ */
+#if (defined(__x86_64__) || defined(_M_AMD64))
+#include 
+#define USE_SSE2
+#endif
+
 
 /* 
  *Section 2:	bool, true, false
diff --git a/src/include/utils/linearsearch.h b/src/include/utils/linearsearch.h
new file mode 100644
index 00..65b0092a65
--- /dev/null
+++ b/src/include/utils/linearsearch.h
@@ -0,0 +1,78 @@
+/*-
+ *
+ * linearsearch.h
+ *	  Optimized linear search routines.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/include/utils/linearsearch.h
+ *
+ *-
+ */
+#ifndef LINEARSEARCH_H
+#define LINEARSEARCH_H
+
+#include "c.h"
+
+#ifdef USE_SSE2
+#include "port/pg_bitutils.h"
+#endif
+
+/*
+ * pg_linearsearch_uint32
+ *
+ * Returns the address of the first element in 'base' that equals 'key', or
+ * NULL

Re: optimize lookups in snapshot [sub]xip arrays

2022-08-02 Thread Andres Freund
Hi,

FWIW, I'd split the introduction of the helper and the use of it in snapmgr
into separate patches.


On 2022-08-02 15:13:01 -0700, Nathan Bossart wrote:
> diff --git a/src/include/c.h b/src/include/c.h
> index d35405f191..2c1a47bc28 100644
> --- a/src/include/c.h
> +++ b/src/include/c.h
> @@ -371,6 +371,14 @@ typedef void (*pg_funcptr_t) (void);
>  #endif
>  #endif
>  
> +/*
> + * Are SSE2 intrinsics available?
> + */
> +#if (defined(__x86_64__) || defined(_M_AMD64))
> +#include 
> +#define USE_SSE2
> +#endif
> +

It doesn't strike me as a good idea to include this in every single
translation unit in pg. That header (+dependencies) isn't small.

I'm on board with normalizing defines for SSE availability somewhere central
though.


> +/*
> + * pg_linearsearch_uint32
> + *
> + * Returns the address of the first element in 'base' that equals 'key', or
> + * NULL if no match is found.
> + */
> +#ifdef USE_SSE2
> +pg_attribute_no_sanitize_alignment()
> +#endif

What's the deal with this annotation? Needs a comment.


> +static inline uint32 *
> +pg_linearsearch_uint32(uint32 key, uint32 *base, uint32 nelem)

Hm. I suspect this could be a bit faster if we didn't search for the offset,
but just for presence in the array? Most users don't need the offset.
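
(For context: returning an address means converting the comparison mask
back into an element index.  With the byte-packed scheme, each movemask bit
corresponds to one array element, so the tail would look roughly like this
hypothetical fragment, using pg_rightmost_one_pos32() from
port/pg_bitutils.h; a boolean result avoids that work entirely.)

/* hypothetical tail of an offset-returning SIMD search */
int     mask = _mm_movemask_epi8(result);

if (mask != 0)
    return &base[i + pg_rightmost_one_pos32(mask)];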

Greetings,

Andres Freund




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-02 Thread Nathan Bossart
On Tue, Aug 02, 2022 at 03:55:39PM -0700, Andres Freund wrote:
> FWIW, I'd split the introduction of the helper and the use of it in snapmgr
> into separate patches.

Will do.

> On 2022-08-02 15:13:01 -0700, Nathan Bossart wrote:
>> diff --git a/src/include/c.h b/src/include/c.h
>> index d35405f191..2c1a47bc28 100644
>> --- a/src/include/c.h
>> +++ b/src/include/c.h
>> @@ -371,6 +371,14 @@ typedef void (*pg_funcptr_t) (void);
>>  #endif
>>  #endif
>>  
>> +/*
>> + * Are SSE2 intrinsics available?
>> + */
>> +#if (defined(__x86_64__) || defined(_M_AMD64))
>> +#include 
>> +#define USE_SSE2
>> +#endif
>> +
> 
> It doesn't strike me as a good idea to include this in every single
> translation unit in pg. That header (+dependencies) isn't small.
> 
> I'm on board with normalizing defines for SSE availability somewhere central
> though.

Yeah, this is just a temporary hack for now.  It'll go away once the
defines for SSE2 availability are committed.

>> +/*
>> + * pg_linearsearch_uint32
>> + *
>> + * Returns the address of the first element in 'base' that equals 'key', or
>> + * NULL if no match is found.
>> + */
>> +#ifdef USE_SSE2
>> +pg_attribute_no_sanitize_alignment()
>> +#endif
> 
> What's the deal with this annotation? Needs a comment.

Will do.  c.h suggests that this should only be used for x86-specific code.

>> +static inline uint32 *
>> +pg_linearsearch_uint32(uint32 key, uint32 *base, uint32 nelem)
> 
> Hm. I suspect this could be a bit faster if we didn't search for the offset,
> but just for presence in the array? Most users don't need the offset.

Just under half of the callers in 0002 require the offset, but I don't know
if any of those are worth optimizing in the first place.  I'll change it
for now.  It's easy enough to add it back in the future if required.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-02 Thread John Naylor
On Wed, Aug 3, 2022 at 6:43 AM Nathan Bossart 
wrote:
> Just under half of the callers in 0002 require the offset, but I don't
know
> if any of those are worth optimizing in the first place.  I'll change it
> for now.  It's easy enough to add it back in the future if required.

Yeah, some of those callers will rarely have more than several elements to
search in the first place, or aren't performance-sensitive.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: optimize lookups in snapshot [sub]xip arrays

2022-08-03 Thread Nathan Bossart
Here is a new patch set.  0001 is the currently-proposed patch from the
other thread [0] for determining SSE2 support.  0002 introduces the
optimized linear search function.  And 0003 makes use of the new function
for the [sub]xip lookups in XidInMVCCSnapshot().

[0] 
https://postgr.es/m/CAFBsxsGktHL7%3DJXbgnKTi_uL0VRPcH4FSAqc6yK-3%2BJYfqPPjA%40mail.gmail.com

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From bd523948876f801b7f1b909f399b2cc41acf06cf Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 3 Aug 2022 11:07:40 +0700
Subject: [PATCH v5 1/3] Support SSE2 intrinsics where available

SSE2 vector instructions are part of the spec for the 64-bit x86
architecture. Until now we have relied on the compiler to autovectorize
in some limited situations, but some useful coding idioms can only be
expressed explicitly via compiler intrinsics. To this end, add a header
that defines USE_SSE2 when available. While x86-only for now, we can
add other architectures in the future. This will also be the intended
place for low-level helper functions that use vector operations.

Reviewed by Nathan Bossart

Discussion: https://www.postgresql.org/message-id/CAFBsxsE2G_H_5Wbw%2BNOPm70-BK4xxKf86-mRzY%3DL2sLoQqM%2B-Q%40mail.gmail.com
---
 src/include/port/simd.h | 30 ++
 1 file changed, 30 insertions(+)
 create mode 100644 src/include/port/simd.h

diff --git a/src/include/port/simd.h b/src/include/port/simd.h
new file mode 100644
index 00..a571e79f57
--- /dev/null
+++ b/src/include/port/simd.h
@@ -0,0 +1,30 @@
+/*-
+ *
+ * simd.h
+ *	  Support for platform-specific vector operations.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/port/simd.h
+ *
+ *-
+ */
+#ifndef SIMD_H
+#define SIMD_H
+
+/*
+ * SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume
+ * that compilers targeting this architecture understand SSE2 intrinsics.
+ *
+ * We use emmintrin.h rather than the comprehensive header immintrin.h in
+ * order to exclude extensions beyond SSE2. This is because MSVC, at least,
+ * will allow the use of intrinsics that haven't been enabled at compile
+ * time.
+ */
+#if (defined(__x86_64__) || defined(_M_AMD64))
+#include 
+#define USE_SSE2
+#endif
+
+#endif			/* SIMD_H */
-- 
2.25.1

From 89d17ba8a669b53814551284f8f8c82192eb1402 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:49:04 -0700
Subject: [PATCH v5 2/3] Introduce optimized routine for linear searches
 through an array of integers.

If SSE2 is available, this function uses it to speed up the search.  Otherwise,
it uses a simple 'for' loop.  This is a prerequisite for a follow-up commit
that will use this function to optimize [sub]xip lookups in
XidInMVCCSnapshot(), but it can be used anywhere that might benefit from such
an optimization.

It might be worthwhile to add an ARM-specific code path to this function in the
future.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/include/utils/linearsearch.h | 76 
 1 file changed, 76 insertions(+)
 create mode 100644 src/include/utils/linearsearch.h

diff --git a/src/include/utils/linearsearch.h b/src/include/utils/linearsearch.h
new file mode 100644
index 00..51298b4355
--- /dev/null
+++ b/src/include/utils/linearsearch.h
@@ -0,0 +1,76 @@
+/*-
+ *
+ * linearsearch.h
+ *	  Optimized linear search routines.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/include/utils/linearsearch.h
+ *
+ *-
+ */
+#ifndef LINEARSEARCH_H
+#define LINEARSEARCH_H
+
+#include "port/simd.h"
+
+/*
+ * pg_linearsearch_uint32
+ *
+ * Returns true if there is an element in 'base' that equals 'key'.  Otherwise,
+ * returns false.
+ *
+ * Since pg_attribute_no_sanitize_alignment() is only intended for x86-specific
+ * code, we surround it with an SSE2 check.
+ */
+#ifdef USE_SSE2
+pg_attribute_no_sanitize_alignment()
+#endif
+static inline bool
+pg_linearsearch_uint32(uint32 key, uint32 *base, uint32 nelem)
+{
+	uint32		i = 0;
+
+	/* If possible, use SSE2 intrinsics to speed up the search. */
+#ifdef USE_SSE2
+	__m128i		keys = _mm_set1_epi32(key);	/* load 4 copies of key */
+	uint32		its = nelem & ~0xF;			/* round down to multiple of 16 */
+
+	for (; i < its; i += 16)
+	{
+		/* load the next 16 values into __m128i variables */
+		__m128i vals1 = _mm_loadu_si128((__m128i *) &base[i]);
+		__m128i vals2 = _mm_loadu_si128((__m128i *) &base[i + 4]);
+		__m128i vals3 =

Re: optimize lookups in snapshot [sub]xip arrays

2022-08-03 Thread Andres Freund
Hi,

On 2022-08-02 16:43:57 -0700, Nathan Bossart wrote:
> >> +/*
> >> + * pg_linearsearch_uint32
> >> + *
> >> + * Returns the address of the first element in 'base' that equals 'key', 
> >> or
> >> + * NULL if no match is found.
> >> + */
> >> +#ifdef USE_SSE2
> >> +pg_attribute_no_sanitize_alignment()
> >> +#endif
> > 
> > What's the deal with this annotation? Needs a comment.
> 
> Will do.  c.h suggests that this should only be used for x86-specific code.

What I'm asking is why the annotation is needed at all?

Greetings,

Andres Freund




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-03 Thread Nathan Bossart
On Wed, Aug 03, 2022 at 11:06:58AM -0700, Andres Freund wrote:
> On 2022-08-02 16:43:57 -0700, Nathan Bossart wrote:
>> >> +#ifdef USE_SSE2
>> >> +pg_attribute_no_sanitize_alignment()
>> >> +#endif
>> > 
>> > What's the deal with this annotation? Needs a comment.
>> 
>> Will do.  c.h suggests that this should only be used for x86-specific code.
> 
> What I'm asking is why the annotation is needed at all?

Upon further inspection, I don't think this is needed.  I originally
borrowed it from the SSE version of the CRC code, but while it is trivial
to produce alignment failures with the CRC code, I haven't been able to
generate any with my patches.  Looking at the code, I'm not sure why I was
worried about this in the first place.  Please pardon the brain fade.

Here is a new patch set without the annotation.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From bd523948876f801b7f1b909f399b2cc41acf06cf Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Wed, 3 Aug 2022 11:07:40 +0700
Subject: [PATCH v6 1/3] Support SSE2 intrinsics where available

SSE2 vector instructions are part of the spec for the 64-bit x86
architecture. Until now we have relied on the compiler to autovectorize
in some limited situations, but some useful coding idioms can only be
expressed explicitly via compiler intrinsics. To this end, add a header
that defines USE_SSE2 when available. While x86-only for now, we can
add other architectures in the future. This will also be the intended
place for low-level helper functions that use vector operations.

Reviewed by Nathan Bossart

Discussion: https://www.postgresql.org/message-id/CAFBsxsE2G_H_5Wbw%2BNOPm70-BK4xxKf86-mRzY%3DL2sLoQqM%2B-Q%40mail.gmail.com
---
 src/include/port/simd.h | 30 ++
 1 file changed, 30 insertions(+)
 create mode 100644 src/include/port/simd.h

diff --git a/src/include/port/simd.h b/src/include/port/simd.h
new file mode 100644
index 00..a571e79f57
--- /dev/null
+++ b/src/include/port/simd.h
@@ -0,0 +1,30 @@
+/*-
+ *
+ * simd.h
+ *	  Support for platform-specific vector operations.
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/port/simd.h
+ *
+ *-
+ */
+#ifndef SIMD_H
+#define SIMD_H
+
+/*
+ * SSE2 instructions are part of the spec for the 64-bit x86 ISA. We assume
+ * that compilers targeting this architecture understand SSE2 intrinsics.
+ *
+ * We use emmintrin.h rather than the comprehensive header immintrin.h in
+ * order to exclude extensions beyond SSE2. This is because MSVC, at least,
+ * will allow the use of intrinsics that haven't been enabled at compile
+ * time.
+ */
+#if (defined(__x86_64__) || defined(_M_AMD64))
+#include 
+#define USE_SSE2
+#endif
+
+#endif			/* SIMD_H */
-- 
2.25.1

From 9b70c265fa7a254117436eed59c2d0effd07a00d Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:49:04 -0700
Subject: [PATCH v6 2/3] Introduce optimized routine for linear searches
 through an array of integers.

If SSE2 is available, this function uses it to speed up the search.  Otherwise,
it uses a simple 'for' loop.  This is a prerequisite for a follow-up commit
that will use this function to optimize [sub]xip lookups in
XidInMVCCSnapshot(), but it can be used anywhere that might benefit from such
an optimization.

It might be worthwhile to add an ARM-specific code path to this function in the
future.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/include/utils/linearsearch.h | 70 
 1 file changed, 70 insertions(+)
 create mode 100644 src/include/utils/linearsearch.h

diff --git a/src/include/utils/linearsearch.h b/src/include/utils/linearsearch.h
new file mode 100644
index 00..a23ad45d82
--- /dev/null
+++ b/src/include/utils/linearsearch.h
@@ -0,0 +1,70 @@
+/*-
+ *
+ * linearsearch.h
+ *	  Optimized linear search routines.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/include/utils/linearsearch.h
+ *
+ *-
+ */
+#ifndef LINEARSEARCH_H
+#define LINEARSEARCH_H
+
+#include "port/simd.h"
+
+/*
+ * pg_linearsearch_uint32
+ *
+ * Returns true if there is an element in 'base' that equals 'key'.  Otherwise,
+ * returns false.
+ */
+static inline bool
+pg_linearsearch_uint32(uint32 key, uint32 *base, uint32 nelem)
+{
+	uint32		i = 0;
+
+	/* If possible, use SSE2 intrinsics to speed up the search. */
+#ifdef USE_SSE2
+	__m128i		keys = _mm_set1_epi32(key);	/* load 4 copies of key */
+	uint32		its = nelem & ~0xF;		

Re: optimize lookups in snapshot [sub]xip arrays

2022-08-04 Thread John Naylor
On Thu, Aug 4, 2022 at 3:25 AM Nathan Bossart 
wrote:
> Here is a new patch set without the annotation.

Were you considering adding the new function to simd.h now that that's
committed? It's a bit up in the air what should go in there, but this new
function is low-level and generic enough to be a candidate...

I wonder if the "pg_" prefix is appropriate here, as that is most often
used for things that hide specific details *and* where the base name would
clash, like OS calls or C library functions. I'm not quite sure where the
line is drawn, but I mean that "linearsearch" is a generic algorithm and
not a specific API we are implementing, if that makes sense.

The suffix "_uint32" might be more succinct as "32" (cf pg_bswap32(),
pg_popcount32, etc). We'll likely want to search bytes sometime, so
something to keep in mind as far as naming ("_int" vs "_byte"?).

I'm not a fan of "its" as a variable name, and I'm curious what it's
intended to convey.

All the __m128i vars could technically be declared const, although I think
it doesn't matter -- it's just a hint to the reader.

Out of curiosity do we know how much we get by loading four registers
rather than two?
--
John Naylor
EDB: http://www.enterprisedb.com


Re: optimize lookups in snapshot [sub]xip arrays

2022-08-04 Thread Nathan Bossart
On Thu, Aug 04, 2022 at 02:58:14PM +0700, John Naylor wrote:
> Were you considering adding the new function to simd.h now that that's
> committed? It's a bit up in the air what should go in there, but this new
> function is low-level and generic enough to be a candidate...

I don't have a strong opinion.  I went with a separate file because I
envisioned a variety of possible linear search functions (e.g., char,
uint16, uint32), and some might not use SIMD instructions.  Furthermore, it
seemed less obvious to look in simd.h for linear search functions.  That
being said, it might make sense to just add it here for now.

> I wonder if the "pg_" prefix is appropriate here, as that is most often
> used for things that hide specific details *and* where the base name would
> clash, like OS calls or C library functions. I'm not quite sure where the
> line is drawn, but I mean that "linearsearch" is a generic algorithm and
> not a specific API we are implementing, if that makes sense.

Yeah, I was concerned about clashing with lsearch() and lfind().  I will
drop the prefix.

> The suffix "_uint32" might be more succinct as "32" (cf pg_bswap32(),
> pg_popcount32, etc). We'll likely want to search bytes sometime, so
> something to keep in mind as far as naming ("_int" vs "_byte"?).

How about something like lsearch32 or linearsearch32?

> I'm not a fan of "its" as a variable name, and I'm curious what it's
> intended to convey.

It's short for "iterations."  I'll spell it out completely to avoid this
kind of confusion.

> All the __m128i vars could technically be declared const, although I think
> it doesn't matter -- it's just a hint to the reader.

Will do.

> Out of curiosity do we know how much we get by loading four registers
> rather than two?

The small program I've been using for testing takes about 40% more time
with the two register approach.  The majority of this test involves
searching for elements that either don't exist in the array or that live
near the end of the array, so this is probably close to the worst case.
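
For reference, the two-register loop I compared against looked roughly like
this (a reconstructed sketch for illustration, not the exact code from my
test program):

	uint32		iterations = nelem & ~0x7;	/* round down to multiple of 8 */

	for (; i < iterations; i += 8)
	{
		/* load the next 8 values into two registers */
		__m128i		vals1 = _mm_loadu_si128((__m128i *) &base[i]);
		__m128i		vals2 = _mm_loadu_si128((__m128i *) &base[i + 4]);

		/* compare each value to the key */
		__m128i		result1 = _mm_cmpeq_epi32(keys, vals1);
		__m128i		result2 = _mm_cmpeq_epi32(keys, vals2);

		/* pack the two results into one register and check for a match */
		if (_mm_movemask_epi8(_mm_packs_epi32(result1, result2)) != 0)
			return true;
	}

With only two independent loads and compares per iteration, there is less
instruction-level parallelism available to hide the latencies, which
presumably accounts for the difference.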

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-04 Thread John Naylor
On Fri, Aug 5, 2022 at 5:15 AM Nathan Bossart 
wrote:
>
> On Thu, Aug 04, 2022 at 02:58:14PM +0700, John Naylor wrote:
> > Were you considering adding the new function to simd.h now that that's
> > committed? It's a bit up in the air what should go in there, but this
new
> > function is low-level and generic enough to be a candidate...
>
> I don't have a strong opinion.  I went with a separate file because I
> envisioned a variety of possible linear search functions (e.g., char,
> uint16, uint32), and some might not use SIMD instructions.  Furthermore, it
> seemed less obvious to look in simd.h for linear search functions.

That is a good point. Maybe potential helpers in simd.h should only deal
specifically with vector registers, with its users providing C fallbacks.
I don't have any good ideas of where to put the new function, though.

> > I wonder if the "pg_" prefix is appropriate here, as that is most often
> > used for things that hide specific details *and* where the base name
would
> > clash, like OS calls or C library functions. I'm not quite sure where
the
> > line is drawn, but I mean that "linearsearch" is a generic algorithm and
> > not a specific API we are implementing, if that makes sense.
>
> Yeah, I was concerned about clashing with lsearch() and lfind().  I will
> drop the prefix.

Hmm, I didn't know about those. lfind() is similar enough that it would
make sense to have pg_lfind32() etc in src/include/port/pg_lsearch.h, at
least for the v4 version that returns the pointer. We already declare
bsearch_arg() in src/include/port.h and that's another kind of array
search. Returning bool is different enough to have a different name.
pg_lfind32_ispresent()?  *_noptr()? Meh.

Having said all that, the man page under BUGS [1] says "The naming is
unfortunate."

> > Out of curiosity do we know how much we get by loading four registers
> > rather than two?
>
> The small program I've been using for testing takes about 40% more time
> with the two register approach.  The majority of this test involves
> searching for elements that either don't exist in the array or that live
> near the end of the array, so this is probably close to the worst case.

Ok, sounds good.

[1] https://man7.org/linux/man-pages/man3/lsearch.3.html#BUGS

--
John Naylor
EDB: http://www.enterprisedb.com


Re: optimize lookups in snapshot [sub]xip arrays

2022-08-05 Thread Nathan Bossart
On Fri, Aug 05, 2022 at 11:02:15AM +0700, John Naylor wrote:
> That is a good point. Maybe potential helpers in simd.h should only deal
> specifically with vector registers, with its users providing C fallbacks.
> I don't have any good ideas of where to put the new function, though.

I moved it to src/include/port for now since that's where files like
pg_bswap.h live.

> Hmm, I didn't know about those. lfind() is similar enough that it would
> make sense to have pg_lfind32() etc in src/include/port/pg_lsearch.h, at
> least for the v4 version that returns the pointer. We already declare
> bsearch_arg() in src/include/port.h and that's another kind of array
> search. Returning bool is different enough to have a different name.
> pg_lfind32_ispresent()?  *_noptr()? Meh.
> 
> Having said all that, the man page under BUGS [1] says "The naming is
> unfortunate."

I went ahead and renamed it to pg_lfind32() and switched it back to
returning the pointer.  That felt the cleanest from the naming perspective,
but as Andres noted, it might not be as fast as just looking for the
presence of the element.  I modified my small testing program to perform
many searches on small arrays, and I wasn't able to identify any impact, so
perhaps this is good enough.

Thoughts?

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 205737c56c7e49e8de25e6b4afca6a96abbb4e60 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:49:04 -0700
Subject: [PATCH v7 1/2] Introduce optimized routine for linear searches
 through an array of integers.

If SSE2 is available, this function uses it to speed up the search.  Otherwise,
it uses a simple 'for' loop.  This is a prerequisite for a follow-up commit
that will use this function to optimize [sub]xip lookups in
XidInMVCCSnapshot(), but it can be used anywhere that might benefit from such
an optimization.

It might be worthwhile to add an ARM-specific code path to this function in the
future.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/include/port/pg_lfind.h | 73 +
 1 file changed, 73 insertions(+)
 create mode 100644 src/include/port/pg_lfind.h

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
new file mode 100644
index 00..27721490a6
--- /dev/null
+++ b/src/include/port/pg_lfind.h
@@ -0,0 +1,73 @@
+/*-
+ *
+ * pg_lfind.h
+ *	  Optimized linear search routines.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_lfind.h
+ *
+ *-
+ */
+#ifndef PG_LFIND_H
+#define PG_LFIND_H
+
+#ifdef USE_SSE2
+#include "port/pg_bitutils.h"
+#endif
+
+/*
+ * pg_lfind32
+ *
+ * Returns the address of the first element in 'base' that equals 'key', or
+ * NULL if no match is found.
+ */
+static inline uint32 *
+pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
+{
+	uint32		i = 0;
+
+	/* If possible, use SSE2 intrinsics to speed up the search. */
+#ifdef USE_SSE2
+	__m128i		keys = _mm_set1_epi32(key);	/* load 4 copies of key */
+	uint32		iterations = nelem & ~0xF;	/* round down to multiple of 16 */
+
+	for (; i < iterations; i += 16)
+	{
+		/* load the next 16 values into __m128i variables */
+		__m128i vals1 = _mm_loadu_si128((__m128i *) &base[i]);
+		__m128i vals2 = _mm_loadu_si128((__m128i *) &base[i + 4]);
+		__m128i vals3 = _mm_loadu_si128((__m128i *) &base[i + 8]);
+		__m128i vals4 = _mm_loadu_si128((__m128i *) &base[i + 12]);
+
+		/* perform the comparisons */
+		__m128i result1 = _mm_cmpeq_epi32(keys, vals1);
+		__m128i result2 = _mm_cmpeq_epi32(keys, vals2);
+		__m128i result3 = _mm_cmpeq_epi32(keys, vals3);
+		__m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+
+		/* shrink the results into a single variable */
+		__m128i tmp1 = _mm_packs_epi32(result1, result2);
+		__m128i tmp2 = _mm_packs_epi32(result3, result4);
+		__m128i tmp3 = _mm_packs_epi16(tmp1, tmp2);
+		uint32 result = _mm_movemask_epi8(tmp3);
+
+		/* see if there was a match */
+		if (result != 0)
+			return &base[i + pg_rightmost_one_pos32(result)];
+	}
+#endif
+
+	/* Process the remaining elements the slow way. */
+	for (; i < nelem; i++)
+	{
+		if (key == base[i])
+			return &base[i];
+	}
+
+	return NULL;
+}
+
+#endif			/* PG_LFIND_H */
-- 
2.25.1

>From abbce6208a5463f4d5da177d05f167d08e7eee2d Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:59:28 -0700
Subject: [PATCH v7 2/2] Optimize linear searches in XidInMVCCSnapshot().

This change makes use of the recently-introduced optimized linear search
routine to speed up searches through the [sub]xip arrays when possible, which
should improve performance significantly when the arrays are large.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor
Discussion: https

Re: optimize lookups in snapshot [sub]xip arrays

2022-08-05 Thread Andres Freund
Hi,

On 2022-08-05 13:25:10 -0700, Nathan Bossart wrote:
> I went ahead and renamed it to pg_lfind32() and switched it back to
> returning the pointer.  That felt the cleanest from the naming perspective,
> but as Andres noted, it might not be as fast as just looking for the
> presence of the element.  I modified my small testing program to perform
> many searches on small arrays, and I wasn't able to identify any impact, so
> perhaps this is good enough.

Why on small arrays? I'd expect a difference mainly if there's at least a
few iterations.

But mainly I'd expect to find a difference if the SIMD code were optimized a
bit further on the basis of not needing to return the offset. E.g. by
replacing _mm_packs_epi32 with _mm_or_si128, that's cheaper.
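
That is, roughly this for the combining step (untested sketch):

	/* OR the comparison results together instead of packing them */
	__m128i		tmp1 = _mm_or_si128(result1, result2);
	__m128i		tmp2 = _mm_or_si128(result3, result4);
	__m128i		result = _mm_or_si128(tmp1, tmp2);

	/* a nonzero mask means some lane matched, but not which one */
	if (_mm_movemask_epi8(result) != 0)
		return true;

The OR variant obviously can't tell you the matching offset, but that's fine
once the function just returns a bool.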

- Andres




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-06 Thread Nathan Bossart
On Fri, Aug 05, 2022 at 03:04:34PM -0700, Andres Freund wrote:
> But mainly I'd expect to find a difference if the SIMD code were optimized a
> bit further on the basis of not needing to return the offset. E.g. by
> replacing _mm_packs_epi32 with _mm_or_si128, that's cheaper.

I haven't been able to find a significant difference between the two.  If
anything, the _mm_packs_epi* approach actually seems to be slightly faster
in some cases.  For something marginally more concrete, I compared the two
in perf-top and saw the following for the relevant instructions:

_mm_packs_epi*:
0.19 │   packssdw   %xmm1,%xmm0
0.62 │   packssdw   %xmm1,%xmm0
7.14 │   packsswb   %xmm1,%xmm0

_mm_or_si128:
1.52 │   por%xmm1,%xmm0
2.05 │   por%xmm1,%xmm0
5.66 │   por%xmm1,%xmm0

I also tried a combined approach where I replaced _mm_packs_epi16 with
_mm_or_si128:
1.16 │   packssdw   %xmm1,%xmm0
1.47 │   packssdw   %xmm1,%xmm0
8.17 │   por%xmm1,%xmm0

Of course, this simplistic analysis leaves out the impact of the
surrounding instructions, but it seems to support the idea that the
_mm_packs_epi* approach might have a slight edge.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-06 Thread Nathan Bossart
On Sat, Aug 06, 2022 at 11:13:26AM -0700, Nathan Bossart wrote:
> On Fri, Aug 05, 2022 at 03:04:34PM -0700, Andres Freund wrote:
>> But mainly I'd expect to find a difference if the SIMD code were optimized a
>> bit further on the basis of not needing to return the offset. E.g. by
>> replacing _mm_packs_epi32 with _mm_or_si128, that's cheaper.
> 
> I haven't been able to find a significant difference between the two.  If
> anything, the _mm_packs_epi* approach actually seems to be slightly faster
> in some cases.  For something marginally more concrete, I compared the two
> in perf-top and saw the following for the relevant instructions:

Nevermind, I'm wrong.  When compiled with -O2, it uses more than just the
xmm0 and xmm1 registers, and the _mm_or_si128 approach consistently shows a
speedup of slightly more than 5%.  Patches attached.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From bdfab91f1f2fa647d1b8a888dd6f8ad61ab80523 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:49:04 -0700
Subject: [PATCH v8 1/2] Introduce optimized routine for linear searches
 through an array of integers.

If SSE2 is available, this function uses it to speed up the search.  Otherwise,
it uses a simple 'for' loop.  This is a prerequisite for a follow-up commit
that will use this function to optimize [sub]xip lookups in
XidInMVCCSnapshot(), but it can be used anywhere that might benefit from such
an optimization.

It might be worthwhile to add an ARM-specific code path to this function in the
future.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/include/port/pg_lfind.h | 70 +
 1 file changed, 70 insertions(+)
 create mode 100644 src/include/port/pg_lfind.h

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
new file mode 100644
index 00..8a212cc06b
--- /dev/null
+++ b/src/include/port/pg_lfind.h
@@ -0,0 +1,70 @@
+/*-
+ *
+ * pg_lfind.h
+ *	  Optimized linear search routines.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_lfind.h
+ *
+ *-
+ */
+#ifndef PG_LFIND_H
+#define PG_LFIND_H
+
+#include "port/simd.h"
+
+/*
+ * pg_lfind32
+ *
+ * Returns true if there is an element in 'base' that equals 'key'.  Otherwise,
+ * returns false.
+ */
+static inline bool
+pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
+{
+	uint32		i = 0;
+
+	/* If possible, use SSE2 intrinsics to speed up the search. */
+#ifdef USE_SSE2
+	__m128i		keys = _mm_set1_epi32(key);	/* load 4 copies of key */
+	uint32		iterations = nelem & ~0xF;	/* round down to multiple of 16 */
+
+	for (; i < iterations; i += 16)
+	{
+		/* load the next 16 values into __m128i variables */
+		__m128i vals1 = _mm_loadu_si128((__m128i *) &base[i]);
+		__m128i vals2 = _mm_loadu_si128((__m128i *) &base[i + 4]);
+		__m128i vals3 = _mm_loadu_si128((__m128i *) &base[i + 8]);
+		__m128i vals4 = _mm_loadu_si128((__m128i *) &base[i + 12]);
+
+		/* perform the comparisons */
+		__m128i result1 = _mm_cmpeq_epi32(keys, vals1);
+		__m128i result2 = _mm_cmpeq_epi32(keys, vals2);
+		__m128i result3 = _mm_cmpeq_epi32(keys, vals3);
+		__m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+
+		/* shrink the results into a single variable */
+		__m128i tmp1 = _mm_or_si128(result1, result2);
+		__m128i tmp2 = _mm_or_si128(result3, result4);
+		__m128i result = _mm_or_si128(tmp1, tmp2);
+
+		/* see if there was a match */
+		if (_mm_movemask_epi8(result) != 0)
+			return true;
+	}
+#endif
+
+	/* Process the remaining elements the slow way. */
+	for (; i < nelem; i++)
+	{
+		if (key == base[i])
+			return true;
+	}
+
+	return false;
+}
+
+#endif			/* PG_LFIND_H */
-- 
2.25.1

>From 826cc190c9ad76558adc41f1cbee76fc47e78c15 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:59:28 -0700
Subject: [PATCH v8 2/2] Optimize linear searches in XidInMVCCSnapshot().

This change makes use of the recently-introduced optimized linear search
routine to speed up searches through the [sub]xip arrays when possible, which
should improve performance significantly when the arrays are large.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/backend/utils/time/snapmgr.c | 28 +++-
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5bc2a15160..9b504c9745 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -56,6 +56,7 @@
 #include "datatype/timestamp.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
+#include "port/pg_lfind.h"
 #include "storage/predicate.h"
 #incl

Re: optimize lookups in snapshot [sub]xip arrays

2022-08-07 Thread John Naylor
On Sun, Aug 7, 2022 at 4:25 AM Nathan Bossart  wrote:
>
> [v8]

Okay, I think it's basically in good shape. Since it should be a bit
faster than a couple versions ago, would you be up for retesting with
the original test having 8 to 512 writers? And also add the const
markers we discussed upthread? Aside from that, I plan to commit this
week unless there is further bikeshedding.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Bharath Rupireddy
On Mon, Aug 8, 2022 at 12:17 PM John Naylor
 wrote:
>
> On Sun, Aug 7, 2022 at 4:25 AM Nathan Bossart  
> wrote:
> >
> > [v8]
>
> Okay, I think it's basically in good shape. Since it should be a bit
> faster than a couple versions ago, would you be up for retesting with
> the original test having 8 to 512 writers? And also add the const
> markers we discussed upthread? Aside from that, I plan to commit this
> week unless there is further bikeshedding.

I quickly reviewed v8 patch set, few comments:

1) pg_lfind32 - why just uint32? If it's not possible to define
functions for char, unsigned char, int16, uint16, int32, int64, uint64
and so on, can we add a few comments around that? Also, the comments
could explain whether the element type of the array or the type of the
key matters when using pg_lfind32.

2) I think this is not just for the remaining elements but also for
non-USE_SSE2 cases. Also, please specify in which cases we reach this
loop when USE_SSE2 is defined.
+/* Process the remaining elements the slow way. */

3) Can pg_lfind32 return the index of the key found, for instance to
use it for setting/resetting the found element in the array?
+ * pg_lfind32
+ *
+ * Returns true if there is an element in 'base' that equals 'key'.  Otherwise,
+ * returns false.
+ */
+static inline bool
+pg_lfind32(uint32 key, uint32 *base, uint32 nelem)

4) Can we, right away, use this API to replace linear search, say, in
SimpleLruReadPage_ReadOnly(), ATExecAttachPartitionIdx(),
AfterTriggerSetState()? I'm sure I'm missing other places, but can we
replace the eligible call sites with the new function?

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread John Naylor
On Mon, Aug 8, 2022 at 2:26 PM Bharath Rupireddy
 wrote:
>
>
> 1) pg_lfind32 - why just uint32? If it's not possible to define
> functions for char, unsigned char, int16, uint16, int32, int64, uint64
> and so on, can we add a few comments around that? Also, the comments

Future work, as far as I'm concerned. I'm interested in using a char
version for json strings.
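
A byte-wise variant might look something like this (hypothetical, untested
sketch; pg_lfind8 is a made-up name):

static inline bool
pg_lfind8(uint8 key, uint8 *base, uint32 nelem)
{
	uint32		i = 0;

#ifdef USE_SSE2
	const __m128i keys = _mm_set1_epi8(key);	/* 16 copies of key */
	uint32		iterations = nelem & ~0xF;	/* round down to multiple of 16 */

	for (; i < iterations; i += 16)
	{
		const __m128i vals = _mm_loadu_si128((__m128i *) &base[i]);

		/* with 1-byte lanes, one register already holds 16 elements */
		if (_mm_movemask_epi8(_mm_cmpeq_epi8(keys, vals)) != 0)
			return true;
	}
#endif

	for (; i < nelem; i++)
	{
		if (key == base[i])
			return true;
	}

	return false;
}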

> > 3) Can pg_lfind32 return the index of the key found, for instance to
> use it for setting/resetting the found element in the array?

That was just discussed. It's slightly faster not to return an index.

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Bharath Rupireddy
On Mon, Aug 8, 2022 at 2:30 PM John Naylor  wrote:
>
> On Mon, Aug 8, 2022 at 2:26 PM Bharath Rupireddy
>  wrote:
>
> > 3) Can pg_lfind32 return the index of the key found, for instance to
> > use it for setting/resetting the found element in the array?
>
> That was just discussed. It's slightly faster not to return an index.

I haven't looked upthread, please share the difference. How about
another version of the function that returns the index too?

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Nathan Bossart
On Mon, Aug 08, 2022 at 01:46:48PM +0700, John Naylor wrote:
> Okay, I think it's basically in good shape. Since it should be a bit
> faster than a couple versions ago, would you be up for retesting with
> the original test having 8 to 512 writers?

Sure thing.  The results are similar.  As before, the improvements are most
visible when the arrays are large.

writers  head  patch
8    672   680
16   639   664
32   701   689
64   705   703
128  628   653
256  576   627
512  530   584
768  450   536
1024 350   494

> And also add the const
> markers we discussed upthread?

Oops, sorry about that.  This is done in v9.

> Aside from that, I plan to commit this
> week unless there is further bikeshedding.

Great, thanks.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 816c56bc8779a1e8ab85db8ca61ba8d3438957d7 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:49:04 -0700
Subject: [PATCH v9 1/2] Introduce optimized routine for linear searches
 through an array of integers.

If SSE2 is available, this function uses it to speed up the search.  Otherwise,
it uses a simple 'for' loop.  This is a prerequisite for a follow-up commit
that will use this function to optimize [sub]xip lookups in
XidInMVCCSnapshot(), but it can be used anywhere that might benefit from such
an optimization.

It might be worthwhile to add an ARM-specific code path to this function in the
future.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/include/port/pg_lfind.h | 69 +
 1 file changed, 69 insertions(+)
 create mode 100644 src/include/port/pg_lfind.h

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
new file mode 100644
index 00..4a9484a16d
--- /dev/null
+++ b/src/include/port/pg_lfind.h
@@ -0,0 +1,69 @@
+/*-
+ *
+ * pg_lfind.h
+ *	  Optimized linear search routines.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_lfind.h
+ *
+ *-
+ */
+#ifndef PG_LFIND_H
+#define PG_LFIND_H
+
+#include "port/simd.h"
+
+/*
+ * pg_lfind32
+ *
+ * Returns true if there is an element in 'base' that equals 'key'.  Otherwise,
+ * returns false.
+ */
+static inline bool
+pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
+{
+	uint32		i = 0;
+
+	/* If possible, use SSE2 intrinsics to speed up the search. */
+#ifdef USE_SSE2
+	const __m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
+	uint32		iterations = nelem & ~0xF;/* round down to multiple of 16 */
+
+	for (; i < iterations; i += 16)
+	{
+		/* load the next 16 values into __m128i variables */
+		const __m128i vals1 = _mm_loadu_si128((__m128i *) &base[i]);
+		const __m128i vals2 = _mm_loadu_si128((__m128i *) &base[i + 4]);
+		const __m128i vals3 = _mm_loadu_si128((__m128i *) &base[i + 8]);
+		const __m128i vals4 = _mm_loadu_si128((__m128i *) &base[i + 12]);
+
+		/* perform the comparisons */
+		const __m128i result1 = _mm_cmpeq_epi32(keys, vals1);
+		const __m128i result2 = _mm_cmpeq_epi32(keys, vals2);
+		const __m128i result3 = _mm_cmpeq_epi32(keys, vals3);
+		const __m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+
+		/* shrink the results into a single variable */
+		const __m128i tmp1 = _mm_or_si128(result1, result2);
+		const __m128i tmp2 = _mm_or_si128(result3, result4);
+		const __m128i result = _mm_or_si128(tmp1, tmp2);
+
+		/* see if there was a match */
+		if (_mm_movemask_epi8(result) != 0)
+			return true;
+	}
+#endif
+
+	/* Process the remaining elements the slow way. */
+	for (; i < nelem; i++)
+	{
+		if (key == base[i])
+			return true;
+	}
+
+	return false;
+}
+
+#endif			/* PG_LFIND_H */
-- 
2.25.1

>From 6942097a6406bca2c52851bbad40e5f679cc18ef Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:59:28 -0700
Subject: [PATCH v9 2/2] Optimize linear searches in XidInMVCCSnapshot().

This change makes use of the recently-introduced optimized linear search
routine to speed up searches through the [sub]xip arrays when possible, which
should improve performance significantly when the arrays are large.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/backend/utils/time/snapmgr.c | 28 +++-
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5bc2a15160..9b504c9745 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -56,6 +56,7 @@
 #include "datatype/timestamp.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"

Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Nathan Bossart
On Mon, Aug 08, 2022 at 12:56:28PM +0530, Bharath Rupireddy wrote:
> 1) pg_lfind32 - why just uint32? If it's not possible to define
> functions for char, unsigned char, int16, uint16, int32, int64, uint64
> and so on, can we add a few comments around that? Also, the comments
> > could explain whether the element type of the array or the type of the
> > key matters when using pg_lfind32.

I figured that we'd add functions for other types when needed.  I
considered making the new function generic by adding an argument for the
element size.  Then, we could branch to optimized routines based on the
element size (e.g., pg_lfind() would call pg_lfind32() if the element size
was 4 bytes).  However, that seemed like more complexity than is required,
and it's probably nice to avoid the extra branching.
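
For the record, the rejected generic version would have looked something
like this (hypothetical sketch, never part of any patch; pg_lfind is a
made-up name):

static inline bool
pg_lfind(const void *key, void *base, uint32 nelem, size_t elem_size)
{
	uint32		i;

	/* dispatch to an optimized routine for known element sizes */
	if (elem_size == sizeof(uint32))
		return pg_lfind32(*(const uint32 *) key, (uint32 *) base, nelem);

	/* fall back to byte-wise comparisons for other element sizes */
	for (i = 0; i < nelem; i++)
	{
		if (memcmp((char *) base + i * elem_size, key, elem_size) == 0)
			return true;
	}

	return false;
}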

> 2) I think this is not just for the remaining elements but also for
> non-USE_SSE2 cases. Also, please specify in which cases we reach this
> loop when USE_SSE2 is defined.
> +/* Process the remaining elements the slow way. */

Well, in the non-SSE2 case, all of the elements are remaining at this
point.  :)

> 3) Can pg_lfind32 return the index of the key found, for instance to
> use it for setting/resetting the found element in the array?

As discussed upthread, only returning whether the element is present in the
array is slightly faster.  If we ever needed a version that returned the
address of the matching element, we could reevaluate whether this small
boost was worth creating a separate function or if we should just modify
pg_lfind32() to be a tad slower.  I don't think we need to address that
now, though.

> 4) Can we, right away, use this API to replace linear search, say, in
> SimpleLruReadPage_ReadOnly(), ATExecAttachPartitionIdx(),
> AfterTriggerSetState()? I'm sure I'm missing other places, but can we
> replace the eligible call sites with the new function?

I had found a few eligible linear searches earlier [0], but I haven't done
any performance analysis that proved such changes were worthwhile.  While
substituting linear searches with pg_lfind32() is probably an improvement
in most cases, I think we ought to demonstrate the benefits for each one.
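
The mechanical part of such a change is trivial, e.g. (schematic):

	/* before: open-coded linear search */
	for (i = 0; i < nelem; i++)
	{
		if (arr[i] == key)
			return true;
	}
	return false;

	/* after */
	return pg_lfind32(key, arr, nelem);

The hard part is showing that the arrays involved are commonly long enough
for the SIMD path to pay for itself.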

[0] https://postgr.es/m/20220802221301.GA742739%40nathanxps13

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Masahiko Sawada
On Tue, Aug 9, 2022 at 7:33 AM Nathan Bossart  wrote:
>
> On Mon, Aug 08, 2022 at 01:46:48PM +0700, John Naylor wrote:
> > Okay, I think it's basically in good shape. Since it should be a bit
> > faster than a couple versions ago, would you be up for retesting with
> > the original test having 8 to 512 writers?
>
> Sure thing.  The results are similar.  As before, the improvements are most
> visible when the arrays are large.
>
> writers  head  patch
> 8    672   680
> 16   639   664
> 32   701   689
> 64   705   703
> 128  628   653
> 256  576   627
> 512  530   584
> 768  450   536
> 1024 350   494
>
> > And also add the const
> > markers we discussed upthread?
>
> Oops, sorry about that.  This is done in v9.
>
> > Aside from that, I plan to commit this
> > week unless there is further bikeshedding.
>
> Great, thanks.

The patch looks good to me. One minor point is:

+ * IDENTIFICATION
+ *   src/port/pg_lfind.h

The path doesn't match to the actual file path, src/include/port/pg_lfind.h.

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Nathan Bossart
On Tue, Aug 09, 2022 at 10:57:44AM +0900, Masahiko Sawada wrote:
> The patch looks good to me. One minor point is:

Thanks for taking a look.

> + * IDENTIFICATION
> + *   src/port/pg_lfind.h
> 
> The path doesn't match to the actual file path, src/include/port/pg_lfind.h.

Fixed in v10.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From ed80cfaf146f82b930ea09b8efc062cc4088d4b6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:49:04 -0700
Subject: [PATCH v10 1/2] Introduce optimized routine for linear searches
 through an array of integers.

If SSE2 is available, this function uses it to speed up the search.  Otherwise,
it uses a simple 'for' loop.  This is a prerequisite for a follow-up commit
that will use this function to optimize [sub]xip lookups in
XidInMVCCSnapshot(), but it can be used anywhere that might benefit from such
an optimization.

It might be worthwhile to add an ARM-specific code path to this function in the
future.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor, Bharath Rupireddy, Masahiko Sawada
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/include/port/pg_lfind.h | 69 +
 1 file changed, 69 insertions(+)
 create mode 100644 src/include/port/pg_lfind.h

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
new file mode 100644
index 00..24de544f63
--- /dev/null
+++ b/src/include/port/pg_lfind.h
@@ -0,0 +1,69 @@
+/*-
+ *
+ * pg_lfind.h
+ *	  Optimized linear search routines.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/include/port/pg_lfind.h
+ *
+ *-
+ */
+#ifndef PG_LFIND_H
+#define PG_LFIND_H
+
+#include "port/simd.h"
+
+/*
+ * pg_lfind32
+ *
+ * Returns true if there is an element in 'base' that equals 'key'.  Otherwise,
+ * returns false.
+ */
+static inline bool
+pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
+{
+	uint32		i = 0;
+
+	/* If possible, use SSE2 intrinsics to speed up the search. */
+#ifdef USE_SSE2
+	const __m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
+	uint32		iterations = nelem & ~0xF;/* round down to multiple of 16 */
+
+	for (; i < iterations; i += 16)
+	{
+		/* load the next 16 values into __m128i variables */
+		const __m128i vals1 = _mm_loadu_si128((__m128i *) &base[i]);
+		const __m128i vals2 = _mm_loadu_si128((__m128i *) &base[i + 4]);
+		const __m128i vals3 = _mm_loadu_si128((__m128i *) &base[i + 8]);
+		const __m128i vals4 = _mm_loadu_si128((__m128i *) &base[i + 12]);
+
+		/* perform the comparisons */
+		const __m128i result1 = _mm_cmpeq_epi32(keys, vals1);
+		const __m128i result2 = _mm_cmpeq_epi32(keys, vals2);
+		const __m128i result3 = _mm_cmpeq_epi32(keys, vals3);
+		const __m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+
+		/* shrink the results into a single variable */
+		const __m128i tmp1 = _mm_or_si128(result1, result2);
+		const __m128i tmp2 = _mm_or_si128(result3, result4);
+		const __m128i result = _mm_or_si128(tmp1, tmp2);
+
+		/* see if there was a match */
+		if (_mm_movemask_epi8(result) != 0)
+			return true;
+	}
+#endif
+
+	/* Process the remaining elements the slow way. */
+	for (; i < nelem; i++)
+	{
+		if (key == base[i])
+			return true;
+	}
+
+	return false;
+}
+
+#endif			/* PG_LFIND_H */
-- 
2.25.1

>From 28bc28d898afad0762c1966b21e9582294e8b101 Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:59:28 -0700
Subject: [PATCH v10 2/2] Optimize linear searches in XidInMVCCSnapshot().

This change makes use of the recently-introduced optimized linear search
routine to speed up searches through the [sub]xip arrays when possible, which
should improve performance significantly when the arrays are large.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor, Bharath Rupireddy, Masahiko Sawada
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/backend/utils/time/snapmgr.c | 28 +++-
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 5bc2a15160..9b504c9745 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -56,6 +56,7 @@
 #include "datatype/timestamp.h"
 #include "lib/pairingheap.h"
 #include "miscadmin.h"
+#include "port/pg_lfind.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -2284,8 +2285,6 @@ RestoreTransactionSnapshot(Snapshot snapshot, void *source_pgproc)
 bool
 XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
 {
-	uint32		i;
-
 	/*
 	 * Make a quick range check to eliminate most XIDs without looking at the
 	 * xip arrays.  Note that this is OK even if we convert a subxact XID to
@@ -2317,13 +2316,8 @@ Xid

Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Bharath Rupireddy
On Tue, Aug 9, 2022 at 4:37 AM Nathan Bossart  wrote:
>
> On Mon, Aug 08, 2022 at 12:56:28PM +0530, Bharath Rupireddy wrote:
> > 1) pg_lfind32 - why just uint32? If it's not possible to define
> > functions for char, unsigned char, int16, uint16, int32, int64, uint64
> > and so on, can we add a few comments around that? Also, the comments
> > could explain whether the element type of the array or the type of the
> > key matters when using pg_lfind32.
>
> I figured that we'd add functions for other types when needed.  I
> considered making the new function generic by adding an argument for the
> element size.  Then, we could branch to optimized routines based on the
> element size (e.g., pg_lfind() would call pg_lfind32() if the element size
> was 4 bytes).  However, that seemed like more complexity than is required,
> and it's probably nice to avoid the extra branching.
>
> > 3) Can pg_lfind32 return the index of the key found, for instance to
> > use it for setting/resetting the found element in the array?
>
> As discussed upthread, only returning whether the element is present in the
> array is slightly faster.  If we ever needed a version that returned the
> address of the matching element, we could reevaluate whether this small
> boost was worth creating a separate function or if we should just modify
> pg_lfind32() to be a tad slower.  I don't think we need to address that
> now, though.

Isn't it a good idea to capture the above two points as comments in
port/pg_lfind.h just to not lose track of them? I know these are present
in the hackers thread, but having them in the form of comments helps
developers who attempt to change or use the new function.

-- 
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Nathan Bossart
On Tue, Aug 09, 2022 at 09:40:15AM +0530, Bharath Rupireddy wrote:
> Isn't it a good idea to capture the above two points as comments in
> port/pg_lfind.h just to not lose track of them? I know these are present
> in the hackers thread, but having them in the form of comments helps
> developers who attempt to change or use the new function.

Hm.  My first impression is that this is exactly the sort of information
that is better captured on the lists.  I'm not sure that the lack of such
commentary really poses any threat for future changes, which would need to
be judged on their own merit, anyway.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread Tom Lane
Nathan Bossart  writes:
> On Tue, Aug 09, 2022 at 09:40:15AM +0530, Bharath Rupireddy wrote:
>> Isn't it a good idea to capture the above two points as comments in
>> port/pg_lfind.h just to not lose track of them? I know these are present
>> in the hackers thread, but having them in the form of comments helps
>> developers who attempt to change or use the new function.

> Hm.  My first impression is that this is exactly the sort of information
> that is better captured on the lists.  I'm not sure that the lack of such
> commentary really poses any threat for future changes, which would need to
> be judged on their own merit, anyway.

It's clearly unproductive (not to say impossible) to enumerate every
possible alternative design and say why you didn't choose it.  If
there's some particular "obvious" choice that you feel a need to
refute, then sure write a comment about that.  Notably, if we used
to do X and now do Y because X was found to be broken, then it's good
to have a comment trail discouraging future hackers from reinventing
X.  But that doesn't lead to needing comments about an unrelated
option Z.

regards, tom lane




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-08 Thread John Naylor
On Tue, Aug 9, 2022 at 10:51 AM Nathan Bossart  wrote:
> Fixed in v10.

I decided I wasn't quite comfortable changing snapshot handling
without further guarantees.  To this end, 0002 in the attached v11 is
an addendum that adds assert checking (also pgindent and some
comment-smithing). As I suspected, make check-world passes even with
purposefully screwed-up coding. 0003 uses pg_lfind32 in syscache.c and
I verified that sticking in the wrong answer will lead to a crash in
assert-enabled builds in short order. I'd kind of like to throw this
(or something else suitable) at the build farm first for that reason.
It's simpler than the qsort/qunique/binary search that was there
before, so that's nice, but I've not tried to test performance.
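
The shape of that change is roughly the following (schematic, not the
actual syscache.c hunk; oid_compare is a stand-in comparator name):

	/* before: keep the array sorted and deduplicated, then binary search */
	qsort(oid_array, n, sizeof(Oid), oid_compare);
	n = qunique(oid_array, n, sizeof(Oid), oid_compare);
	found = bsearch(&key, oid_array, n, sizeof(Oid), oid_compare) != NULL;

	/* after: leave the array alone and do an optimized linear search */
	found = pg_lfind32(key, oid_array, n);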

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 1c89b9b7d3c71bb1a703caaf7c01c93bc9e2515f Mon Sep 17 00:00:00 2001
From: John Naylor 
Date: Tue, 9 Aug 2022 11:51:52 +0700
Subject: [PATCH v11 2/4] Add assert checking, run pgindent, comment changes

---
 src/include/port/pg_lfind.h | 78 ++---
 1 file changed, 56 insertions(+), 22 deletions(-)

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index 24de544f63..fb125977b2 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -18,51 +18,85 @@
 /*
  * pg_lfind32
  *
- * Returns true if there is an element in 'base' that equals 'key'.  Otherwise,
- * returns false.
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
  */
 static inline bool
 pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
 {
 	uint32		i = 0;
 
-	/* If possible, use SSE2 intrinsics to speed up the search. */
+	/* Use SIMD intrinsics where available. */
 #ifdef USE_SSE2
-	const __m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
-	uint32		iterations = nelem & ~0xF;/* round down to multiple of 16 */
 
-	for (; i < iterations; i += 16)
+	/*
+	 * A 16-byte register only has four 4-byte lanes. For better
+	 * instruction-level parallelism, each loop iteration operates on a block
+	 * of four registers. Testing has showed this is ~40% faster than using a
+	 * block of two registers.
+	 */
+	const		__m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
+	uint32		iterations = nelem & ~0xF;	/* round down to multiple of 16 */
+
+#if defined(USE_ASSERT_CHECKING)
+	bool		assert_result = false;
+
+	/* pre-compute the result for assert checking */
+	for (i = 0; i < nelem; i++)
 	{
-		/* load the next 16 values into __m128i variables */
-		const __m128i vals1 = _mm_loadu_si128((__m128i *) &base[i]);
-		const __m128i vals2 = _mm_loadu_si128((__m128i *) &base[i + 4]);
-		const __m128i vals3 = _mm_loadu_si128((__m128i *) &base[i + 8]);
-		const __m128i vals4 = _mm_loadu_si128((__m128i *) &base[i + 12]);
+		if (key == base[i])
+		{
+			assert_result = true;
+			break;
+		}
+	}
+#endif
 
-		/* perform the comparisons */
-		const __m128i result1 = _mm_cmpeq_epi32(keys, vals1);
-		const __m128i result2 = _mm_cmpeq_epi32(keys, vals2);
-		const __m128i result3 = _mm_cmpeq_epi32(keys, vals3);
-		const __m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+	for (i = 0; i < iterations; i += 16)
+	{
+		/* load the next block into 4 registers holding 4 values each */
+		const		__m128i vals1 = _mm_loadu_si128((__m128i *) & base[i]);
+		const		__m128i vals2 = _mm_loadu_si128((__m128i *) & base[i + 4]);
+		const		__m128i vals3 = _mm_loadu_si128((__m128i *) & base[i + 8]);
+		const		__m128i vals4 = _mm_loadu_si128((__m128i *) & base[i + 12]);
 
-		/* shrink the results into a single variable */
-		const __m128i tmp1 = _mm_or_si128(result1, result2);
-		const __m128i tmp2 = _mm_or_si128(result3, result4);
-		const __m128i result = _mm_or_si128(tmp1, tmp2);
+		/* compare each value to the key */
+		const		__m128i result1 = _mm_cmpeq_epi32(keys, vals1);
+		const		__m128i result2 = _mm_cmpeq_epi32(keys, vals2);
+		const		__m128i result3 = _mm_cmpeq_epi32(keys, vals3);
+		const		__m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+
+		/* combine the results into a single variable */
+		const		__m128i tmp1 = _mm_or_si128(result1, result2);
+		const		__m128i tmp2 = _mm_or_si128(result3, result4);
+		const		__m128i result = _mm_or_si128(tmp1, tmp2);
 
 		/* see if there was a match */
 		if (_mm_movemask_epi8(result) != 0)
+		{
+#if defined(USE_ASSERT_CHECKING)
+			Assert(assert_result == true);
+#endif
 			return true;
+		}
 	}
-#endif
+#endif			/* USE_SSE2 */
 
-	/* Process the remaining elements the slow way. */
+	/* Process the remaining elements one at a time. */
 	for (; i < nelem; i++)
 	{
 		if (key == base[i])
+		{
+#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
+			Assert(assert_result == true);
+#endif
 			return true;
+		}
 	}
 
+#if defined(USE_SSE2) && defined(USE_ASSERT_CHECKING)
+	Assert(assert_result == false);
+#endif
 	return false;
 }
 
-- 
2.36.1

From ff77224f9227bcff88a68e63f39754296810351c Mon Sep 17 00:00:00 2001
From: Nathan Boss

Re: optimize lookups in snapshot [sub]xip arrays

2022-08-09 Thread Nathan Bossart
On Tue, Aug 09, 2022 at 01:21:41PM +0700, John Naylor wrote:
> I decided I wasn't quite comfortable changing snapshot handling
> without further guarantees.  To this end, 0002 in the attached v11 is
> an addendum that adds assert checking (also pgindent and some
> comment-smithing). As I suspected, make check-world passes even with
> purposefully screwed-up coding. 0003 uses pg_lfind32 in syscache.c and
> I verified that sticking in the wrong answer will lead to a crash in
> assert-enabled builds in short order. I'd kind of like to throw this
> (or something else suitable) at the build farm first for that reason.
> It's simpler than the qsort/qunique/binary search that was there
> before, so that's nice, but I've not tried to test performance.

Your adjustments in 0002 seem reasonable to me.  I think it makes sense to
ensure there is test coverage for pg_lfind32(), but I don't know if that
syscache code is the right choice.  For non-USE_SSE2 builds, it might make
these lookups more expensive.  I'll look around to see if there are any
other suitable candidates.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-09 Thread Nathan Bossart
On Tue, Aug 09, 2022 at 01:00:37PM -0700, Nathan Bossart wrote:
> Your adjustments in 0002 seem reasonable to me.  I think it makes sense to
> ensure there is test coverage for pg_lfind32(), but I don't know if that
> syscache code is the right choice.  For non-USE_SSE2 builds, it might make
> these lookups more expensive.  I'll look around to see if there are any
> other suitable candidates.

One option might be to create a small test module for pg_lfind32().  Here
is an example.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
>From 0b15a0569318cc846980b7771ff809ef5d2d505c Mon Sep 17 00:00:00 2001
From: Nathan Bossart 
Date: Wed, 3 Aug 2022 09:49:04 -0700
Subject: [PATCH v12 1/2] Introduce optimized routine for linear searches
 through an array of integers.

If SSE2 is available, this function uses it to speed up the search.  Otherwise,
it uses a simple 'for' loop.  This is a prerequisite for a follow-up commit
that will use this function to optimize [sub]xip lookups in
XidInMVCCSnapshot(), but it can be used anywhere that might benefit from such
an optimization.

It might be worthwhile to add an ARM-specific code path to this function in the
future.

Author: Nathan Bossart
Reviewed by: Andres Freund, John Naylor, Bharath Rupireddy, Masahiko Sawada
Discussion: https://postgr.es/m/20220713170950.GA3116318%40nathanxps13
---
 src/include/port/pg_lfind.h   | 103 ++
 src/test/modules/Makefile |   1 +
 src/test/modules/test_lfind/.gitignore|   4 +
 src/test/modules/test_lfind/Makefile  |  23 
 .../test_lfind/expected/test_lfind.out|  12 ++
 .../modules/test_lfind/sql/test_lfind.sql |   8 ++
 .../modules/test_lfind/test_lfind--1.0.sql|   8 ++
 src/test/modules/test_lfind/test_lfind.c  |  52 +
 .../modules/test_lfind/test_lfind.control |   4 +
 9 files changed, 215 insertions(+)
 create mode 100644 src/include/port/pg_lfind.h
 create mode 100644 src/test/modules/test_lfind/.gitignore
 create mode 100644 src/test/modules/test_lfind/Makefile
 create mode 100644 src/test/modules/test_lfind/expected/test_lfind.out
 create mode 100644 src/test/modules/test_lfind/sql/test_lfind.sql
 create mode 100644 src/test/modules/test_lfind/test_lfind--1.0.sql
 create mode 100644 src/test/modules/test_lfind/test_lfind.c
 create mode 100644 src/test/modules/test_lfind/test_lfind.control

diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
new file mode 100644
index 00..fb125977b2
--- /dev/null
+++ b/src/include/port/pg_lfind.h
@@ -0,0 +1,103 @@
+/*-
+ *
+ * pg_lfind.h
+ *	  Optimized linear search routines.
+ *
+ * Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/include/port/pg_lfind.h
+ *
+ *-
+ */
+#ifndef PG_LFIND_H
+#define PG_LFIND_H
+
+#include "port/simd.h"
+
+/*
+ * pg_lfind32
+ *
+ * Return true if there is an element in 'base' that equals 'key', otherwise
+ * return false.
+ */
+static inline bool
+pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
+{
+	uint32		i = 0;
+
+	/* Use SIMD intrinsics where available. */
+#ifdef USE_SSE2
+
+	/*
+	 * A 16-byte register only has four 4-byte lanes. For better
+	 * instruction-level parallelism, each loop iteration operates on a block
+	 * of four registers. Testing has shown this is ~40% faster than using a
+	 * block of two registers.
+	 */
+	const		__m128i keys = _mm_set1_epi32(key); /* load 4 copies of key */
+	uint32		iterations = nelem & ~0xF;	/* round down to multiple of 16 */
+
+#if defined(USE_ASSERT_CHECKING)
+	bool		assert_result = false;
+
+	/* pre-compute the result for assert checking */
+	for (i = 0; i < nelem; i++)
+	{
+		if (key == base[i])
+		{
+			assert_result = true;
+			break;
+		}
+	}
+#endif
+
+	for (i = 0; i < iterations; i += 16)
+	{
+		/* load the next block into 4 registers holding 4 values each */
+		const		__m128i vals1 = _mm_loadu_si128((__m128i *) & base[i]);
+		const		__m128i vals2 = _mm_loadu_si128((__m128i *) & base[i + 4]);
+		const		__m128i vals3 = _mm_loadu_si128((__m128i *) & base[i + 8]);
+		const		__m128i vals4 = _mm_loadu_si128((__m128i *) & base[i + 12]);
+
+		/* compare each value to the key */
+		const		__m128i result1 = _mm_cmpeq_epi32(keys, vals1);
+		const		__m128i result2 = _mm_cmpeq_epi32(keys, vals2);
+		const		__m128i result3 = _mm_cmpeq_epi32(keys, vals3);
+		const		__m128i result4 = _mm_cmpeq_epi32(keys, vals4);
+
+		/* combine the results into a single variable */
+		const		__m128i tmp1 = _mm_or_si128(result1, result2);
+		const		__m128i tmp2 = _mm_or_si128(result3, result4);
+		const		__m128i result = _mm_or_si128(tmp1, tmp2);
+
+		/* see if there was a match */
+		if (_mm_movemask_epi8(result) != 0)
+		{
+#if defined(USE_ASSERT_CHECKING)
+			Assert(assert_result == true);
+#endif
+			return true;
+		

Re: optimize lookups in snapshot [sub]xip arrays

2022-08-09 Thread Masahiko Sawada
On Wed, Aug 10, 2022 at 5:00 AM Nathan Bossart  wrote:
>
> On Tue, Aug 09, 2022 at 01:21:41PM +0700, John Naylor wrote:
> > I decided I wasn't quite comfortable changing snapshot handling
> > without further guarantees.  To this end, 0002 in the attached v11 is
> > an addendum that adds assert checking (also pgindent and some
> > comment-smithing). As I suspected, make check-world passes even with
> > purposefully screwed-up coding. 0003 uses pg_lfind32 in syscache.c and
> > I verified that sticking in the wrong answer will lead to a crash in
> > assert-enabled builds in short order. I'd kind of like to throw this
> > (or something else suitable) at the build farm first for that reason.
> > It's simpler than the qsort/qunique/binary search that was there
> > before, so that's nice, but I've not tried to test performance.
>
> Your adjustments in 0002 seem reasonable to me.  I think it makes sense to
> ensure there is test coverage for pg_lfind32(), but I don't know if that
> syscache code is the right choice.  For non-USE_SSE2 builds, it might make
> these lookups more expensive.

I think that for non-USE_SSE2 builds, there is no additional overhead
as all assertion-related code in pg_lfind32 depends on USE_SSE2.

> I'll look around to see if there are any
> other suitable candidates.

As you proposed, having a test module for that seems to be a good
idea. We can add test code for future optimizations that utilize SIMD
operations.

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-09 Thread John Naylor
On Wed, Aug 10, 2022 at 7:13 AM Nathan Bossart  wrote:
>
> On Tue, Aug 09, 2022 at 01:00:37PM -0700, Nathan Bossart wrote:
> > Your adjustments in 0002 seem reasonable to me.  I think it makes sense to
> > ensure there is test coverage for pg_lfind32(), but I don't know if that
> > syscache code is the right choice.  For non-USE_SSE2 builds, it might make
> > these lookups more expensive.

Yeah.

On Wed, Aug 10, 2022 at 9:25 AM Masahiko Sawada  wrote:
> I think that for non-USE_SSE2 builds, there is no additional overhead
> as all assertion-related code in pg_lfind32 depends on USE_SSE2.

Nathan is referring to RelationSupportsSysCache() and
RelationHasSysCache(). They currently use binary search and using
linear search on non-x86-64 platforms is probably slower.

[Nathan again]
> One option might be to create a small test module for pg_lfind32().  Here
> is an example.

LGTM, let's see what the buildfarm thinks of 0001.

--
John Naylor
EDB: http://www.enterprisedb.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-10 Thread Nathan Bossart
On Wed, Aug 10, 2022 at 10:50:02AM +0700, John Naylor wrote:
> LGTM, let's see what the buildfarm thinks of 0001.

Thanks!  I haven't noticed any related buildfarm failures yet.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-10 Thread John Naylor
On Thu, Aug 11, 2022 at 4:46 AM Nathan Bossart  wrote:
>
> On Wed, Aug 10, 2022 at 10:50:02AM +0700, John Naylor wrote:
> > LGTM, let's see what the buildfarm thinks of 0001.
>
> Thanks!  I haven't noticed any related buildfarm failures yet.

I was waiting for all the Windows animals to report in, and it looks
like they have, so I've pushed 0002. Thanks for picking this topic up
again!

-- 
John Naylor
EDB: http://www.enterprisedb.com




Re: optimize lookups in snapshot [sub]xip arrays

2022-08-10 Thread Nathan Bossart
On Thu, Aug 11, 2022 at 09:50:54AM +0700, John Naylor wrote:
> I was waiting for all the Windows animals to report in, and it looks
> like they have, so I've pushed 0002. Thanks for picking this topic up
> again!

Thanks for reviewing and committing.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com