Re: Speeding up GIST index creation for tsvectors

2021-08-02 Thread John Naylor
On Sun, Aug 1, 2021 at 11:41 PM Amit Khandekar wrote:
>
> > FWIW, I anticipate some push back from the community because of the
> > fact that the optimization relies on statistical phenomena.
>
> I dug into this issue for tsvector type. Found out that it's the way
> in which the sign array elements are arranged that is causing the
> pointers to be misaligned:
[...]
> If siglen is not a multiple of 8 (say 700), cache[j].sign will in some
> cases point to non-8-byte-aligned addresses, as you can see in the
> above code snippet.
>
> Replacing siglen by MAXALIGN64(siglen) in the above snippet gets rid
> of the misalignment. This change applied over the 0001-v3 patch gives
> additional ~15% benefit. MAXALIGN64(siglen) will use a bit more
> space, but for not-so-small siglens, this looks worth doing. Haven't
> yet checked into types other than tsvector.

Sounds good.

> Will get back with your other review comments. I thought, meanwhile, I
> can post the above update first.

Thinking some more, my discomfort with inline functions that call a global
function doesn't make logical sense, so feel free to do it that way if you
like.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Speeding up GIST index creation for tsvectors

2021-08-01 Thread Amit Khandekar
On Sat, 20 Mar 2021 at 02:19, John Naylor  wrote:
> On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar  wrote:
> > Regarding the alignment changes... I have removed the code that
> > handled the leading identically unaligned bytes, for lack of evidence
> > that percentage of such cases is significant. Like I noted earlier,
> > for the tsearch data I used, identically unaligned cases were only 6%.
> > If I find scenarios where these cases can be significant after all and
> > if we cannot do anything in the gist index code, then we might have to
> > bring back the unaligned byte handling. I didn't get a chance to dig
> > deeper into the gist index implementation to see why they are not
> > always 8-byte aligned.
>
> I find it stranger that something equivalent to char* is not randomly 
> misaligned, but rather only seems to land on 4-byte boundaries.
>
> [thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.
>
> FWIW, I anticipate some push back from the community because of the fact that 
> the optimization relies on statistical phenomena.

I dug into this issue for tsvector type. Found out that it's the way
in which the sign array elements are arranged that is causing the pointers to
be misaligned:

Datum
gtsvector_picksplit(PG_FUNCTION_ARGS)
{
    ...
    cache = (CACHESIGN *) palloc(sizeof(CACHESIGN) * (maxoff + 2));
    cache_sign = palloc(siglen * (maxoff + 2));

    for (j = 0; j < maxoff + 2; j++)
        cache[j].sign = &cache_sign[siglen * j];
    ...
}

If siglen is not a multiple of 8 (say 700), cache[j].sign will in some
cases point to non-8-byte-aligned addresses, as you can see in the
above code snippet.

Replacing siglen by MAXALIGN64(siglen) in the above snippet gets rid
of the misalignment. This change applied over the 0001-v3 patch gives
additional ~15% benefit. MAXALIGN64(siglen) will use a bit more
space, but for not-so-small siglens, this looks worth doing. Haven't
yet checked into types other than tsvector.
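
To make the change concrete, here is roughly what the snippet looks
like with MAXALIGN64 applied (a sketch of the idea only, not the exact
final patch):

Datum
gtsvector_picksplit(PG_FUNCTION_ARGS)
{
    ...
    /* Pad each sign to the next 8-byte boundary, so that every
     * cache[j].sign starts on an 8-byte-aligned address. */
    cache = (CACHESIGN *) palloc(sizeof(CACHESIGN) * (maxoff + 2));
    cache_sign = palloc(MAXALIGN64(siglen) * (maxoff + 2));

    for (j = 0; j < maxoff + 2; j++)
        cache[j].sign = &cache_sign[MAXALIGN64(siglen) * j];
    ...
}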

Will get back with your other review comments. I thought, meanwhile, I
can post the above update first.




Re: Speeding up GIST index creation for tsvectors

2021-03-19 Thread John Naylor
On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar wrote:
>
> On Thu, 11 Mar 2021 at 04:25, John Naylor wrote:
> > Okay, so it's hard-coded in various functions in contrib modules. In that
> > case, let's just keep the existing coding for those. In fact, the comments
> > that got removed by your patch specifically say: /* Using the popcount
> > functions here isn't likely to win */ ...which your testing confirmed. The
> > new function should be used only for Gist and possibly Intarray, whose
> > default is 124. That means we can't get rid of hemdistsign(), but that's
> > fine.
>
> The comment is there for all types. Since I see better performance
> for all the types, I have kept the pg_xorcount() call for all these
> contrib modules.  I understand that since the default siglen is small
> for some types, we won't get a benefit there. But I think we should call
> pg_xorcount() for the benefit of the non-default siglen case.

The 1-7% degradation measured was from an earlier version, when
pg_xorcount_long had a lot of finicky branching and computation. Is it
still true in v3? We should answer that first. I'm interested in what
happens if you now use pg_xorcount_long in the call sites, at least in the
worst-case 7% test.

> Have replaced hemdistsign() by pg_xorcount() everywhere; but with
> that, the call looks a bit clumsy because of having to type-cast the
> parameters to const unsigned char *.  I noticed that the cast to
> "unsigned char *" is required only when we use the value in the
> pg_number_of_ones array. Elsewhere we can safely use 'a' and 'b' as
> "char *". So I changed the pg_xorcount() parameters from unsigned char
> * to char *.

That matches the style of that file, so +1.

> > I think the way to go is a simplified version of your 0001 (not 0002), with
> > only a single function, for gist and intarray only, and a style that better
> > matches the surrounding code. If you look at my xor functions in the attached
> > text file, you'll get an idea of what it should look like. Note that it got
> > the above performance without ever trying to massage the pointer alignment.
> > I'm a bit uncomfortable with the fact that we can't rely on alignment, but
> > maybe there's a simple fix somewhere in the gist code.
>
> In the attached 0001-v3 patch, I have updated the code to match it
> with surrounding code; specifically the usage of *a++ rather than
> a[i].
>
> Regarding the alignment changes... I have removed the code that
> handled the leading identically unaligned bytes, for lack of evidence
> that percentage of such cases is significant. Like I noted earlier,
> for the tsearch data I used, identically unaligned cases were only 6%.
> If I find scenarios where these cases can be significant after all and
> if we cannot do anything in the gist index code, then we might have to
> bring back the unaligned byte handling. I didn't get a chance to dig
> deeper into the gist index implementation to see why they are not
> always 8-byte aligned.

I find it stranger that something equivalent to char* is not randomly
misaligned, but rather only seems to land on 4-byte boundaries.

[thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.

FWIW, I anticipate some push back from the community because of the fact
that the optimization relies on statistical phenomena.

> I have kept the 0002 patch as-is. Due to significant *additional*
> speedup, over and above the 0001 improvement, I think the code
> re-arrangement done is worth it for non-x86 platforms.

For the amount of code uglification involved, we should be getting full asm
popcount support on Arm, not an attempt to kluge around the current
implementation. I'd be happy to review such an effort for PG15, by the way.

Readability counts, and as it stands I don't feel comfortable asking a
committer to read 0002.
--
John Naylor
EDB: http://www.enterprisedb.com


Re: Speeding up GIST index creation for tsvectors

2021-03-19 Thread Amit Khandekar
On Thu, 11 Mar 2021 at 04:25, John Naylor  wrote:
> Okay, so it's hard-coded in various functions in contrib modules. In that
> case, let's just keep the existing coding for those. In fact, the comments
> that got removed by your patch specifically say: /* Using the popcount
> functions here isn't likely to win */ ...which your testing confirmed. The
> new function should be used only for Gist and possibly Intarray, whose
> default is 124. That means we can't get rid of hemdistsign(), but that's
> fine.

The comment is there for all types. Since I see better performance
for all the types, I have kept the pg_xorcount() call for all these
contrib modules.  I understand that since the default siglen is small
for some types, we won't get a benefit there. But I think we should call
pg_xorcount() for the benefit of the non-default siglen case.

Have replaced hemdistsign() by pg_xorcount() everywhere; but with
that, the call looks a bit clumsy because of having to type-cast the
parameters to const unsigned char *.  I noticed that the cast to
"unsigned char *" is required only when we use the value in the
pg_number_of_ones array. Elsewhere we can safely use 'a' and 'b' as
"char *". So I changed the pg_xorcount() parameters from unsigned char
* to char *.
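
For illustration, the byte-by-byte part then reduces to something like
this (a sketch only; the cast marks the one place where unsigned char
is still needed):

static inline uint64
pg_xorcount(const char *a, const char *b, int bytes)
{
    uint64      dist = 0;

    while (bytes-- > 0)
    {
        /* cast only where pg_number_of_ones is indexed */
        dist += pg_number_of_ones[(unsigned char) (*a++ ^ *b++)];
    }
    return dist;
}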

> I think the way to go is a simplified version of your 0001 (not 0002), with
> only a single function, for gist and intarray only, and a style that better
> matches the surrounding code. If you look at my xor functions in the attached
> text file, you'll get an idea of what it should look like. Note that it got
> the above performance without ever trying to massage the pointer alignment.
> I'm a bit uncomfortable with the fact that we can't rely on alignment, but
> maybe there's a simple fix somewhere in the gist code.

In the attached 0001-v3 patch, I have updated the code to match it
with surrounding code; specifically the usage of *a++ rather than
a[i].

Regarding the alignment changes... I have removed the code that
handled the leading identically unaligned bytes, for lack of evidence
that percentage of such cases is significant. Like I noted earlier,
for the tsearch data I used, identically unaligned cases were only 6%.
If I find scenarios where these cases can be significant after all and
if we cannot do anything in the gist index code, then we might have to
bring back the unaligned byte handling. I didn't get a chance to dig
deeper into the gist index implementation to see why they are not
always 8-byte aligned.

I have kept the 0002 patch as-is. Due to significant *additional*
speedup, over and above the 0001 improvement, I think the code
re-arrangement done is worth it for non-x86 platforms.

-Amit Khandekar
From c6ae575dd483b85cec6748c2e014d0f32565b4eb Mon Sep 17 00:00:00 2001
From: Amit Khandekar 
Date: Fri, 19 Mar 2021 20:22:44 +0800
Subject: [PATCH 1/2] Speed up xor'ing of two gist index signatures for
 tsvectors

In hemdistsign(), rather than using the xor operator on char values, use
it on 64-bit chunks. And since the chunks are 64-bit, use popcount64()
on each of the chunks. I have checked that the two bitvector pointer
arguments of hemdistsign() are not always 64-bit aligned. So do the
64-bit chunks only if both pointers are 8-byte aligned.

This results in speed-up in Gist index creation for tsvectors. With
default siglen (124), the speed up is 12-20%. With siglen=700, it is
30-50%. So with longer signature lengths, we get higher percentage
speed-up.

Similar results are seen for other types using gist indexes, such as
intarray, hstore, and ltree, which are available in contrib.

With smaller siglens such as 10, 20 etc, there is a bit of a reduction
in speed by 1-7% if we use this optimization. It's probably because of
an extra function call for pg_xorcount(); and also might be due to
the extra logic in pg_xorcount(). So for siglen less than 32,
keep the existing method using byte-by-byte traversal.
---
 contrib/hstore/hstore_gist.c  | 17 ++--
 contrib/intarray/_intbig_gist.c   | 18 +---
 contrib/ltree/_ltree_gist.c   | 19 ++---
 src/backend/utils/adt/tsgistidx.c | 26 +--
 src/include/port/pg_bitutils.h| 17 
 src/port/pg_bitutils.c| 34 +++
 6 files changed, 61 insertions(+), 70 deletions(-)

diff --git a/contrib/hstore/hstore_gist.c b/contrib/hstore/hstore_gist.c
index 102c9cea72..4970a0a2f0 100644
--- a/contrib/hstore/hstore_gist.c
+++ b/contrib/hstore/hstore_gist.c
@@ -8,6 +8,7 @@
 #include "access/stratnum.h"
 #include "catalog/pg_type.h"
 #include "hstore.h"
+#include "port/pg_bitutils.h"
 #include "utils/pg_crc.h"
 
 /* gist_hstore_ops opclass options */
@@ -256,20 +257,6 @@ sizebitvec(BITVECP sign, int siglen)
 	return size;
 }
 
-static int
-hemdistsign(BITVECP a, BITVECP b, int siglen)
-{
-	int			i,
-				dist = 0;
-
-	LOOPBIT(siglen)
-	{
-		if (GETBIT(a, i) != GETBIT(b, i))
-			dist++;
-	}
-	return dist;
-}
-
 static int
 hemdist(GISTTYP

Re: Speeding up GIST index creation for tsvectors

2021-03-10 Thread John Naylor
I wrote:
> I'll post a patch soon that builds on that, so you can see what I mean.

I've attached where I was imagining this heading, as a text file to avoid
distracting the cfbot. Here are the numbers I got with your test on the
attached, as well as your 0001, on x86-64 Clang 10, default siglen:

master:
739ms

v3-0001
692ms

attached POC
665ms

The small additional speed up is not worth it, given the code churn and
complexity, so I don't want to go this route after all. I think the way to
go is a simplified version of your 0001 (not 0002), with only a single
function, for gist and intarray only, and a style that better matches the
surrounding code. If you look at my xor functions in the attached text
file, you'll get an idea of what it should look like. Note that it got the
above performance without ever trying to massage the pointer alignment. I'm
a bit uncomfortable with the fact that we can't rely on alignment, but
maybe there's a simple fix somewhere in the gist code.

--
John Naylor
EDB: http://www.enterprisedb.com
 src/backend/access/heap/visibilitymap.c |  13 +-
 src/backend/nodes/bitmapset.c   |  23 +--
 src/backend/utils/adt/tsgistidx.c   |  14 +-
 src/include/port/pg_bitutils.h  |  12 +-
 src/port/pg_bitutils.c  | 240 ++--
 5 files changed, 214 insertions(+), 88 deletions(-)

diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index e198df65d8..d910eb2875 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -388,7 +388,6 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
{
Buffer  mapBuffer;
uint64 *map;
-   int i;
 
/*
 * Read till we fall off the end of the map.  We assume that 
any extra
@@ -409,17 +408,11 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen)
StaticAssertStmt(MAPSIZE % sizeof(uint64) == 0,
 "unsupported MAPSIZE");
if (all_frozen == NULL)
-   {
-   for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
-   nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
-   }
+   nvisible = pg_popcount_mask64(map, MAPSIZE, VISIBLE_MASK64);
else
{
-   for (i = 0; i < MAPSIZE / sizeof(uint64); i++)
-   {
-   nvisible += pg_popcount64(map[i] & VISIBLE_MASK64);
-   nfrozen += pg_popcount64(map[i] & FROZEN_MASK64);
-   }
+   nvisible = pg_popcount_mask64(map, MAPSIZE, VISIBLE_MASK64);
+   nfrozen = pg_popcount_mask64(map, MAPSIZE, FROZEN_MASK64);
}
 
ReleaseBuffer(mapBuffer);
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 649478b0d4..5368c72255 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -452,7 +452,6 @@ bms_is_member(int x, const Bitmapset *a)
 int
 bms_member_index(Bitmapset *a, int x)
 {
-   int i;
int bitnum;
int wordnum;
int result = 0;
@@ -466,14 +465,8 @@ bms_member_index(Bitmapset *a, int x)
bitnum = BITNUM(x);
 
/* count bits in preceding words */
-   for (i = 0; i < wordnum; i++)
-   {
-   bitmapword  w = a->words[i];
-
-   /* No need to count the bits in a zero word */
-   if (w != 0)
-   result += bmw_popcount(w);
-   }
+   result = pg_popcount((const char *) a->words,
+                        wordnum * BITS_PER_BITMAPWORD / BITS_PER_BYTE);
 
/*
 * Now add bits of the last word, but only those before the item. We can
@@ -645,22 +638,14 @@ bms_get_singleton_member(const Bitmapset *a, int *member)
 int
 bms_num_members(const Bitmapset *a)
 {
-   int result = 0;
int nwords;
-   int wordnum;
 
if (a == NULL)
return 0;
nwords = a->nwords;
-   for (wordnum = 0; wordnum < nwords; wordnum++)
-   {
-   bitmapword  w = a->words[wordnum];
 
-   /* No need to count the bits in a zero word */
-   if (w != 0)
-   result += bmw_popcount(w);
-   }
-   return result;
+   return pg_popcount((const char *) a->words,
+  nwords * BITS_PER_BITMAPWORD / BITS_PER_BYTE);
 }
 
 /*
diff --git a/src/backend/utils/adt/tsgistidx.c b/src/backend/utils/adt/tsgistidx.c

Re: Speeding up GIST index creation for tsvectors

2021-03-10 Thread John Naylor
On Mon, Mar 8, 2021 at 8:43 AM Amit Khandekar wrote:
>
> On Wed, 3 Mar 2021 at 23:32, John Naylor wrote:

> > 0001:
> >
> > + /*
> > + * We can process 64-bit chunks only if both are mis-aligned by the same
> > + * number of bytes.
> > + */
> > + if (b_aligned - b == a_aligned - a)
> >
> > The obvious question here is: how often are they identically misaligned? You
> > don't indicate that your measurements differ in a bimodal fashion, so does
> > that mean they happen to be always (mis)aligned?
>
> I ran CREATE INDEX on tsvector columns using the tsearch.data that I
> had attached upthread, with some instrumentation; here are the
> proportions :
> 1. In 15% of the cases, only one among a and b was aligned. The other
> was offset from the 8-byte boundary by 4 bytes.
> 2. 6% of the cases, both were offset by 4 bytes, i.e. identically misaligned.
> 3. Rest of the cases, both were aligned.

That's somewhat strange. I would have assumed always aligned, or usually
not. It sounds like we could require them both to be aligned and not bother
with the byte-by-byte prologue. I also wonder if it's worth it to memcpy to
a new buffer if the passed pointer is not aligned.
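
Something like this, for illustration (hypothetical and untested; the
scratch buffer would have to come from somewhere sensible and be
aligned itself):

#include <stdint.h>
#include <string.h>

static const unsigned char *
get_aligned_input(const unsigned char *p, unsigned char *scratch, int bytes)
{
    if (((uintptr_t) p & 7) == 0)
        return p;               /* already 8-byte aligned */
    memcpy(scratch, p, bytes);  /* scratch assumed 8-byte aligned */
    return scratch;
}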

> > + /* For smaller lengths, do simple byte-by-byte traversal */
> > + if (bytes <= 32)
> >
> > You noted upthread:
> >
> > > Since for the gist index creation of some of these types the default
> > > value for siglen is small (8-20), I tested with small siglens. For
> > > siglens <= 20, particularly for values that are not multiples of 8
> > > (e.g. 10, 13, etc), I see a 1-7% reduction in speed of index
> > > creation. It's probably because of
> > > an extra function call for pg_xorcount(); and also might be due to the
> > > extra logic in pg_xorcount() which becomes prominent for shorter
> > > traversals. So for siglen less than 32, I kept the existing method
> > > using byte-by-byte traversal.
> >
> > I wonder if that can be addressed directly, while cleaning up the loops and
> > alignment checks in pg_xorcount_long() a little. For example, have a look at
> > pg_crc32c_armv8.c -- it might be worth testing a similar approach.
>
> Yeah, we can put the bytes <= 32 condition inside pg_xorcount_long().
> I avoided that to not hamper the <= 32 scenarios. Details explained
> below for "why inline pg_xorcount is calling global function"
>
> > Also, pardon my ignorance, but where can I find the default siglen for
> > various types?
> Check SIGLEN_DEFAULT.

Okay, so it's hard-coded in various functions in contrib modules. In that
case, let's just keep the existing coding for those. In fact, the comments
that got removed by your patch specifically say:

/* Using the popcount functions here isn't likely to win */

...which your testing confirmed. The new function should be used only for
Gist and possibly Intarray, whose default is 124. That means we can't get
rid of hemdistsign(), but that's fine. Alternatively, we could say that a
small regression is justifiable reason to refactor all call sites, but I'm
not proposing that.

> > (I did make sure to remove indirect calls from the retail functions
> > in [1], in case we want to go that route).
>
> Yeah, I quickly had a look at that. I am still going over that thread.
> Thanks for the exhaustive analysis there.

I'll post a patch soon that builds on that, so you can see what I mean.

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Speeding up GIST index creation for tsvectors

2021-03-08 Thread Amit Khandekar
On Wed, 3 Mar 2021 at 23:32, John Naylor  wrote:
> Your performance numbers look like this is a fruitful area to improve. I have 
> not yet tested performance, but I will do so at a later date.

Thanks for reviewing the patch !

> I did some
> microbenchmarking of our popcount implementation, since I wasn't quite sure
> it's optimal, and indeed, there is room for improvement there [1]. I'd be
> curious to hear your thoughts on those findings. I think it'd be worth it to
> test a version of this patch using those idioms here as well, so at some
> point I plan to post something.

I am not yet clear about the implications of that work on this patch
set here, but I am still going over it, and will reply to that.

>
> Now for the patches:
>
> 0001:
>
> + /*
> + * We can process 64-bit chunks only if both are mis-aligned by the same
> + * number of bytes.
> + */
> + if (b_aligned - b == a_aligned - a)
>
> The obvious question here is: how often are they identically misaligned? You
> don't indicate that your measurements differ in a bimodal fashion, so does
> that mean they happen to be always (mis)aligned?

I ran CREATE INDEX on tsvector columns using the tsearch.data that I
had attached upthread, with some instrumentation; here are the
proportions :
1. In 15% of the cases, only one among a and b was aligned. The other
was offset from the 8-byte boundary by 4 bytes.
2. 6% of the cases, both were offset by 4 bytes, i.e. identically misaligned.
3. Rest of the cases, both were aligned.

With other types, and with different sets of data, I believe I can get
totally different proportions.

> Is that an accident of the gist coding and could change unexpectedly?
> And how hard would it be to
> allocate those values upstream so that the pointers are always aligned on
> 8-byte boundaries? (I imagine pretty hard, since there are multiple callers,
> and such tight coupling is not good style)

That I am not sure about; I haven't clearly understood the structure of
gist indexes yet. I believe it would depend on the individual gist
implementation, and we can't make assumptions about that?


> + /* For smaller lengths, do simple byte-by-byte traversal */
> + if (bytes <= 32)
>
> You noted upthread:
>
> > Since for the gist index creation of some of these types the default
> > value for siglen is small (8-20), I tested with small siglens. For
> > siglens <= 20, particularly for values that are not multiples of 8
> > (e.g. 10, 13, etc), I see a 1-7% reduction in speed of index
> > creation. It's probably because of
> > an extra function call for pg_xorcount(); and also might be due to the
> > extra logic in pg_xorcount() which becomes prominent for shorter
> > traversals. So for siglen less than 32, I kept the existing method
> > using byte-by-byte traversal.
>
> I wonder if that can be addressed directly, while cleaning up the loops and
> alignment checks in pg_xorcount_long() a little. For example, have a look at
> pg_crc32c_armv8.c -- it might be worth testing a similar approach.

Yeah, we can put the bytes <= 32 condition inside pg_xorcount_long().
I avoided that to not hamper the <= 32 scenarios. Details explained
below for "why inline pg_xorcount is calling global function"

> Also, pardon my ignorance, but where can I find the default siglen for 
> various types?
Check SIGLEN_DEFAULT.

>
> +/* Count the number of 1-bits in the result of xor operation */
> +extern uint64 pg_xorcount_long(const unsigned char *a, const unsigned char *b,
> +   int bytes);
> +static inline uint64 pg_xorcount(const unsigned char *a, const unsigned char *b,
> + int bytes)
>
> I don't think it makes sense to have a static inline function call a global 
> function.

As you may note, the global function will be called only in the subset
of cases where siglen > 32. Yeah, we can put the bytes <= 32
condition inside pg_xorcount_long(). I avoided that to not hamper the
<= 32 scenarios. If I do this, it will add a function call for these
small siglen scenarios. The idea was: use the function call only for
cases where we know that the function call overhead will be masked by
the popcount() optimization.
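
In other words, the intended shape is like this (a sketch of the idea,
not the exact patch code):

static inline uint64
pg_xorcount(const unsigned char *a, const unsigned char *b, int bytes)
{
    /* short lengths: stay inline, byte by byte, no function call */
    if (bytes <= 32)
    {
        uint64      dist = 0;

        while (bytes-- > 0)
            dist += pg_number_of_ones[*a++ ^ *b++];
        return dist;
    }
    /* long lengths: the call overhead is masked by the 64-bit path */
    return pg_xorcount_long(a, b, bytes);
}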


>
> -static int
> +static inline int
>  hemdistsign(BITVECP a, BITVECP b, int siglen)
>
>  Not sure what the reason is for inlining all these callers.
> Come to think of it, I don't see a reason to have hemdistsign()
> at all anymore. All it does is call pg_xorcount(). I suspect that's
> what Andrey Borodin meant when he said upthread:
>
> > > > Meanwhile there are at least 4 incarnations of hemdistsign()
> > > > functions that are quite similar. I'd propose to refactor them
> > > > somehow...

I had something in mind when I finally decided to not remove
hemdistsign(). Now I think you are right, we can remove hemdistsign()
altogether. Let me check again.






> 0002:
>
> I'm not really happy with this patch. I like the idea of keeping indirect
> calls out of non-x86 platforms, but I believe it could be done more simply.

I am open for oth

Re: Speeding up GIST index creation for tsvectors

2021-03-03 Thread John Naylor
On Wed, Jan 27, 2021 at 11:07 AM Amit Khandekar wrote:

Hi Amit,

Your performance numbers look like this is a fruitful area to improve. I
have not yet tested performance, but I will do so at a later date. I did
some microbenchmarking of our popcount implementation, since I wasn't quite
sure it's optimal, and indeed, there is room for improvement there [1]. I'd
be curious to hear your thoughts on those findings. I think it'd be worth
it to test a version of this patch using those idioms here as well, so at
some point I plan to post something.

Now for the patches:

0001:

+ /*
+ * We can process 64-bit chunks only if both are mis-aligned by the same
+ * number of bytes.
+ */
+ if (b_aligned - b == a_aligned - a)

The obvious question here is: how often are they identically misaligned?
You don't indicate that your measurements differ in a bimodal fashion, so
does that mean they happen to be always (mis)aligned? Is that an accident
of the gist coding and could change unexpectedly? And how hard would it be
to allocate those values upstream so that the pointers are always aligned
on 8-byte boundaries? (I imagine pretty hard, since there are multiple
callers, and such tight coupling is not good style)

+ /* For smaller lengths, do simple byte-by-byte traversal */
+ if (bytes <= 32)

You noted upthread:

> Since for the gist index creation of some of these types the default
> value for siglen is small (8-20), I tested with small siglens. For
> siglens <= 20, particularly for values that are not multiples of 8
> (e.g. 10, 13, etc), I see a 1-7% reduction in speed of index
> creation. It's probably because of
> an extra function call for pg_xorcount(); and also might be due to the
> extra logic in pg_xorcount() which becomes prominent for shorter
> traversals. So for siglen less than 32, I kept the existing method
> using byte-by-byte traversal.

I wonder if that can be addressed directly, while cleaning up the loops and
alignment checks in pg_xorcount_long() a little. For example, have a look
at pg_crc32c_armv8.c -- it might be worth testing a similar approach.

Also, pardon my ignorance, but where can I find the default siglen for
various types?

+/* Count the number of 1-bits in the result of xor operation */
+extern uint64 pg_xorcount_long(const unsigned char *a, const unsigned char *b,
+   int bytes);
+static inline uint64 pg_xorcount(const unsigned char *a, const unsigned char *b,
+ int bytes)

I don't think it makes sense to have a static inline function call a global
function.

-static int
+static inline int
 hemdistsign(BITVECP a, BITVECP b, int siglen)

Not sure what the reason is for inlining all these callers. Come to think
of it, I don't see a reason to have hemdistsign() at all anymore. All it
does is call pg_xorcount(). I suspect that's what Andrey Borodin meant when
he said upthread:

> > > Meanwhile there are at least 4 incarnations of hemdistsign() functions
> > > that are quite similar. I'd propose to refactor them somehow...


0002:

I'm not really happy with this patch. I like the idea of keeping indirect
calls out of non-x86 platforms, but I believe it could be done more simply.
For one, I don't see a need to invent a third category of retail function.
Second, there's no reason to put "nonasm" in the header as a static inline,
and then call from there to the new now-global function "slow". Especially
since the supposed static inline is still needed as a possible value as a
function pointer on x86, so the compiler is not going to inline it on x86
anyway. That just confuses things. (I did make sure to remove indirect
calls from the retail functions in [1], in case we want to go that route).

[1]
https://www.postgresql.org/message-id/CAFBsxsFCWys_yfPe4PoF3%3D2_oxU5fFR2H%2BmtM6njUA8nBiCYug%40mail.gmail.com

--
John Naylor
EDB: http://www.enterprisedb.com


Re: Speeding up GIST index creation for tsvectors

2021-02-28 Thread Amit Khandekar
I have added this one in the March commitfest.
https://commitfest.postgresql.org/32/3023/




Re: Speeding up GIST index creation for tsvectors

2021-01-27 Thread Amit Khandekar
On Tue, 15 Dec 2020 at 20:34, Amit Khandekar  wrote:
>
> On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin  wrote:
> > +1
> > This will make all INSERTs and UPDATEs for tsvector's GiSTs faster.
>
> Oh, I didn't realize that this code is getting used in GIST index
> insertion and creation too. Will check there.

I ran some insert and update tests; they show only marginal
improvement. So it looks like the patch mainly improves index builds.

> > Meanwhile there are at least 4 incarnations of hemdistsign() functions
> > that are quite similar. I'd propose to refactor them somehow...
>
> Yes, I hope we get the benefit there also. Before that, I thought I
> should post the first use-case to get some early comments. Thanks for
> your encouraging comments :)

The attached v2 version of the 0001 patch extends the hemdistsign()
changes to the other use cases like intarray, ltree and hstore. I see
the same index build improvement for all these types.

Since for the gist index creation of some of these types the default
value for siglen is small (8-20), I tested with small siglens. For
siglens <= 20, particularly for values that are not multiples of 8
(e.g. 10, 13, etc), I see a 1-7% reduction in speed of index
creation. It's probably because of
an extra function call for pg_xorcount(); and also might be due to the
extra logic in pg_xorcount() which becomes prominent for shorter
traversals. So for siglen less than 32, I kept the existing method
using byte-by-byte traversal.
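
For reference, the core of the large-siglen path looks roughly like
this (a simplified sketch with an illustrative function name; the
actual patch additionally does the alignment checks):

static uint64
xorcount_chunked(const unsigned char *a, const unsigned char *b, int bytes)
{
    uint64      dist = 0;

    /* 64-bit chunks; assumes both pointers are 8-byte aligned */
    while (bytes >= 8)
    {
        dist += pg_popcount64(*(const uint64 *) a ^ *(const uint64 *) b);
        a += 8;
        b += 8;
        bytes -= 8;
    }
    /* leftover tail, byte by byte */
    while (bytes-- > 0)
        dist += pg_number_of_ones[*a++ ^ *b++];
    return dist;
}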

--
Thanks,
-Amit Khandekar
Huawei Technologies
From b8905b47f7e0af8d4558fdac73cc2283c263cbcc Mon Sep 17 00:00:00 2001
From: Amit Khandekar 
Date: Thu, 10 Dec 2020 21:45:49 +0800
Subject: [PATCH 2/2] Avoid function pointer dereferencing for
 pg_popcount32/64() call.

With USE_POPCNT_ASM set (which is true only on x86), we decide with
the help of __get_cpuid() at runtime whether the CPU supports the popcnt
instruction, and hence we dynamically choose the corresponding
function; thus pg_popcount32/64 is a function pointer. But for
platforms with USE_POPCNT_ASM not set, we don't have to use dynamic
assignment of functions, but we were still declaring pg_popcount64 as a
function pointer. So arrange for direct function call so as to get rid
of function pointer dereferencing each time pg_popcount32/64 is
called.

To do this, define pg_popcount64 to another function name
(pg_popcount64_nonasm) rather than a function pointer, whenever
USE_POPCNT_ASM is not defined.  And let pg_popcount64_nonasm() be a
static inline function so that whenever pg_popcount64() is called,
the __builtin_popcount() gets called directly.  For platforms not
supporting __builtin_popcount(), continue using the slow version as is
the current behaviour. pg_popcount64_slow() was actually a misnomer,
because it called __builtin_popcount(), which is not a slow
function. Now with this commit, pg_popcount64_slow() indeed contains
only the slow code.

Tested this on ARM64, and the gist index creation for tsvectors speeds
up by ~ 6% for default siglen (=124), and by 12% with siglen=700
---
 src/include/port/pg_bitutils.h | 59 ++
 src/port/pg_bitutils.c | 47 +--
 2 files changed, 67 insertions(+), 39 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 174df28e66..b039f94ee5 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -23,6 +23,19 @@ extern const uint8 pg_rightmost_one_pos[256];
 extern const uint8 pg_number_of_ones[256];
 #endif
 
+/*
+ * On x86_64, we can use the hardware popcount instruction, but only if
+ * we can verify that the CPU supports it via the cpuid instruction.
+ *
+ * Otherwise, we fall back to __builtin_popcount if the compiler has that,
+ * or a hand-rolled implementation if not.
+ */
+#ifdef HAVE_X86_64_POPCNTQ
+#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
+#define USE_POPCNT_ASM 1
+#endif
+#endif
+
 /*
  * pg_leftmost_one_pos32
  *		Returns the position of the most significant set bit in "word",
@@ -208,8 +221,54 @@ pg_ceil_log2_64(uint64 num)
 }
 
 /* Count the number of one-bits in a uint32 or uint64 */
+
+/*
+ * With USE_POPCNT_ASM set (which is true only on x86), we decide at runtime
+ * whether the CPU supports popcnt instruction, and hence we dynamically choose
+ * the corresponding function; thus pg_popcount32/64 is a function pointer. But
+ * for platforms with USE_POPCNT_ASM not set, we don't have to use dynamic
+ * assignment of functions, so we arrange for direct function call so as to get
+ * rid of function pointer dereferencing each time pg_popcount32/64 is called.
+ */
+#ifdef USE_POPCNT_ASM
 extern int	(*pg_popcount32) (uint32 word);
 extern int	(*pg_popcount64) (uint64 word);
+#else
+#define pg_popcount32 pg_popcount32_nonasm
+#define pg_popcount64 pg_popcount64_nonasm
+#endif
+
+/* Slow versions of pg_popcount */
+#ifndef HAVE__BUILTIN_POPCOUNT
+extern int pg_popcount32_slow(uint32 word);
+extern int pg_popcount

Re: Speeding up GIST index creation for tsvectors

2020-12-15 Thread Amit Khandekar
On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin  wrote:
> +1
> This will make all INSERTs and UPDATEs for tsvector's GiSTs faster.

Oh, I didn't realize that this code is getting used in GIST index
insertion and creation too. Will check there.

> Also I really like the idea of taking advantage of hardware capabilities
> like __builtin_* etc. wherever possible.

Yes. Also, __builtin_popcount() uses SIMD vectorization (on arm64:
"cnt v0.8b, v0.8b"), hence there's all the more reason to use it.
Over and above that, I had thought that if we could auto-vectorize the
byte-by-byte xor operation and the popcount() call using compiler
optimizations, we would benefit from this, but I didn't see any more
improvement. I hoped for the benefit because that would have allowed
us to process in 128-bit or 256-bit chunks, since the vector
registers are at least that long. Maybe gcc is not smart enough to
translate __builtin_popcount() to a 128/256-bit vectorized instruction.
But for the XOR operator, it does translate to 128-bit vectorized
instructions (on arm64: "eor v2.16b, v2.16b, v18.16b").

> Meanwhile there are at least 4 incarnations of hemdistsign() functions
> that are quite similar. I'd propose to refactor them somehow...

Yes, I hope we get the benefit there also. Before that, I thought I
should post the first use-case to get some early comments. Thanks for
your encouraging comments :)




Re: Speeding up GIST index creation for tsvectors

2020-12-13 Thread Andrey Borodin



> On 13 Dec 2020, at 17:46, Amit Khandekar wrote:
> 
> On Thu, 10 Dec 2020 at 20:43, Pavel Borisov  wrote:
>> 
>> Hi, Amit!
>> It's really cool to hear about another GiST improvement proposal. I'd like
>> to connect the recently committed GiST ordered build discussion here [1] and
>> the further improvement proposed in [2].
>> 
>> I've tested feature [1] and got a 2.5-3x speed improvement, which is much
>> better, I believe.
> 
> Yeah, I am completely new to the GIST stuff, but I had taken a quick
> look at the sortsupport feature for GIST, and found it very
> interesting. I believe it's an additional option for making the gist
> index builds much faster.
+1
This will make all INSERTs and UPDATEs for tsvector's GiSTs faster.
Also I really like the idea of taking advantage of hardware capabilities
like __builtin_* etc. wherever possible.

Meanwhile there are at least 4 incarnations of hemdistsign() functions
that are quite similar. I'd propose to refactor them somehow...

Thanks!

Best regards, Andrey Borodin.





Re: Speeding up GIST index creation for tsvectors

2020-12-13 Thread Amit Khandekar
On Thu, 10 Dec 2020 at 20:43, Pavel Borisov  wrote:
>
> Hi, Amit!
> It's really cool to hear about another GiST improvement proposal. I'd like to
> connect the recently committed GiST ordered build discussion here [1] and
> the further improvement proposed in [2].
>
> I've tested feature [1] and got a 2.5-3x speed improvement, which is much
> better, I believe.

Yeah, I am completely new to the GIST stuff, but I had taken a quick
look at the sortsupport feature for GIST, and found it very
interesting. I believe it's an additional option for making the gist
index builds much faster. But then I thought that my small patch would
still be worthwhile because for tsvector types the non-sort method for
index build would continue to be used by users, and in general we can
extend this small optimization for other gist types also.

> There is an ongoing activity [2] to build support for different data types 
> for GiST. Maybe you will consider it interesting to join.
>
> BTW you may have heard about Gin and Rum [3] indexes, which suit text search
> much, much better (and faster) than GiST. The idea to process data in bigger
> chunks is good. Still, optimizing the index structure, minimizing disk page
> accesses, etc. seems better in many cases.

Sure. Thanks for the pointers.

-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Re: Speeding up GIST index creation for tsvectors

2020-12-10 Thread Pavel Borisov
Hi, Amit!
It's really cool to hear about another GiST improvement proposal. I'd like
to connect the recently committed GiST ordered build discussion here [1] and
the further improvement proposed in [2].

I've tested feature [1] and got a 2.5-3x speed improvement, which is much
better, I believe. There is ongoing activity [2] to build support for
different data types for GiST. Maybe you will find it interesting to
join.

BTW you may have heard about Gin and Rum [3] indexes, which suit text search
much, much better (and faster) than GiST. The idea to process data in
bigger chunks is good. Still, optimizing the index structure, minimizing disk
page accesses, etc. seems better in many cases.

Thank you for your proposal!

[1] https://commitfest.postgresql.org/29/2276/
[2] https://commitfest.postgresql.org/31/2824/
[3] https://github.com/postgrespro/rum

-- 
Best regards,
Pavel Borisov

Postgres Professional: http://postgrespro.com