On Tue, Mar 19, 2024 at 04:53:04PM +0700, John Naylor wrote:
> On Tue, Mar 19, 2024 at 10:16 AM Nathan Bossart
> <[email protected]> wrote:
>> 0002 does the opposite of this. That is, after we've completed as many
>> blocks as possible, we move the iterator variable back to "end -
>> block_size" and do one final iteration to cover all the remaining elements.
>
> Sounds similar in principle, but it looks really complicated. I don't
> think the additional loops and branches are a good way to go, either
> for readability or for branch prediction. My sketch has one branch for
> which loop to do, and then performs only one loop. Let's do the
> simplest thing that could work. (I think we might need a helper
> function to do the block, but the rest should be easy)
I tried to trim some of the branches, and came up with the attached patch.
I don't think this is exactly what you were suggesting, but I think it's
relatively close. My testing showed decent benefits from using 2 vectors
when there aren't enough elements for 4, so I've tried to keep that part
intact. This changes pg_lfind32() to something like:
if not many elements
process one by one
while enough elements for 4 registers remain
process with 4 registers
if no elements remain
return false
if more than 2-registers-worth of elements remain
do one iteration with 2 registers
do another iteration on last 2-registers-worth of elements
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
diff --git a/src/include/port/pg_lfind.h b/src/include/port/pg_lfind.h
index b8dfa66eef..d154b61555 100644
--- a/src/include/port/pg_lfind.h
+++ b/src/include/port/pg_lfind.h
@@ -80,6 +80,34 @@ pg_lfind8_le(uint8 key, uint8 *base, uint32 nelem)
return false;
}
+static inline bool
+lfind32_2reg_helper(const Vector32 keys, uint32 *base)
+{
+ const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
+ Vector32 vals1,
+ vals2,
+ result1,
+ result2,
+ result;
+
+ /* load the next block into 2 registers */
+ vector32_load(&vals1, base);
+ vector32_load(&vals2, &base[nelem_per_vector]);
+
+ /* compare each value to the key */
+ result1 = vector32_eq(keys, vals1);
+ result2 = vector32_eq(keys, vals2);
+
+ /* combine the results into a single variable */
+ result = vector32_or(result1, result2);
+
+ /* see if there was a match */
+ if (vector32_is_highbit_set(result))
+ return true;
+
+ return false;
+}
+
/*
* pg_lfind32
*
@@ -100,7 +128,8 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
*/
const Vector32 keys = vector32_broadcast(key); /* load copies of key */
const uint32 nelem_per_vector = sizeof(Vector32) / sizeof(uint32);
- const uint32 nelem_per_iteration = 4 * nelem_per_vector;
+ uint32 nelem_per_iteration = 4 * nelem_per_vector;
+ uint32 remaining;
/* round down to multiple of elements per iteration */
const uint32 tail_idx = nelem & ~(nelem_per_iteration - 1);
@@ -119,6 +148,9 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
}
#endif
+ if (nelem < nelem_per_vector * 2)
+ goto slow_path;
+
for (i = 0; i < tail_idx; i += nelem_per_iteration)
{
Vector32 vals1,
@@ -157,8 +189,21 @@ pg_lfind32(uint32 key, uint32 *base, uint32 nelem)
return true;
}
}
+
+ nelem_per_iteration = 2 * nelem_per_vector;
+ remaining = nelem - i;
+
+ if (remaining == 0)
+ return false;
+
+ if (remaining > nelem_per_iteration &&
+ lfind32_2reg_helper(keys, &base[i]))
+ return true;
+
+ return lfind32_2reg_helper(keys, &base[nelem - nelem_per_iteration]);
#endif /* ! USE_NO_SIMD */
+slow_path:
/* Process the remaining elements one at a time. */
for (; i < nelem; i++)
{