Re: Changing default -march landscape

2024-07-31 Thread Andres Freund
Hi,

On 2024-07-30 21:59:44 -0500, Nathan Bossart wrote:
> On Tue, Jul 30, 2024 at 07:39:18PM -0700, Andres Freund wrote:
> > We can hide most of the dispatch cost in a static inline function that
> > only does the runtime test if size is large enough - the size is a
> > compile time constant most of the time, which optimizes away the dispatch
> > cost most of the time.  And even if not, an inlined runtime branch is a
> > *lot* cheaper than an indirect function call.
> 
> I ended up doing precisely this for pg_popcount()/pg_popcount_masked(),
> although not quite as sophisticated as what you propose below.  I'll look
> into expanding on this strategy in v18.

I think you subsume that under "not quite as sophisticated", but just to make
clear:  The most important bits are to

a) do the dispatch in a header, without an indirect function call

b) implement the intrinsic-using stuff in a header if it takes a size
   argument or such, because that allows the compiler to optimize away size
   checks in the very common case of such an argument being a compile time
   constant.
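
To illustrate (b), a minimal sketch (the names here are illustrative, not
the actual pg_bitutils API): because the size check lives in an inline
function in a header, a call like pg_popcount(buf, 4) lets the compiler
prove the large-buffer branch dead and emit a direct call.

/* Illustrative only: header-based dispatch with an inlined size check. */
static inline uint64
pg_popcount(const char *buf, int bytes)
{
    /* with a compile-time constant bytes < 64, this branch folds away */
    if (bytes >= 64 && (pg_cpucap & PGCPUCAP_VPOPCNT))
        return pg_popcount_avx512(buf, bytes);

    return pg_popcount_portable(buf, bytes);
}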

Greetings,

Andres Freund




Re: Changing default -march landscape

2024-07-30 Thread Nathan Bossart
On Tue, Jul 30, 2024 at 07:39:18PM -0700, Andres Freund wrote:
> On 2024-06-12 20:09:45 -0500, Nathan Bossart wrote:
>> The idea that's been floating around recently is to build a bunch of
>> different versions of Postgres and to choose one on startup based on what
>> the CPU supports.  That seems like quite a lot of work, and it'll increase
>> the size of the builds quite a bit, but it at least doesn't have the
>> aforementioned problem.
> 
> I still think that's a good idea - but I don't think it's gonna deal well with
> the various avx512 "features". The landscape of what's supported on what CPU
> is so comically complicated that there's no way to build separate versions for
> everything.

Good point.

> We can hide most of the dispatch cost in a static inline function that only
> does the runtime test if size is large enough - the size is a compile time
> constant most of the time, which optimizes away the dispatch cost most of the
> time.  And even if not, an inlined runtime branch is a *lot* cheaper than an
> indirect function call.

I ended up doing precisely this for pg_popcount()/pg_popcount_masked(),
although not quite as sophisticated as what you propose below.  I'll look
into expanding on this strategy in v18.

-- 
nathan




Re: Changing default -march landscape

2024-07-30 Thread Andres Freund
Hi,

Reply triggered by https://postgr.es/m/ZqmRYh3iikm1Kh3D%40nathan


On 2024-06-12 20:09:45 -0500, Nathan Bossart wrote:
> This is perhaps only tangentially related, but I've found it really
> difficult to avoid painting ourselves into a corner with this stuff.  Let's
> use the SSE 4.2 CRC32C code as an example.  Right now, if your default
> compiler flags indicate support for SSE 4.2 (which I'll note can be assumed
> with x86-64-v2), we'll use it unconditionally, no function pointer
> required.  If additional compiler flags happen to convince the compiler to
> generate SSE 4.2 code, we'll instead build both a fallback version and the
> SSE version, and then we'll use a function pointer to direct to whatever we
> detect is available on the CPU when the server starts.

> Now, let's say we require x86-64-v2.  Once we have that, we can avoid the
> function pointer on many more x86 machines.  While that sounds great, now
> we have a different problem.  If someone wants to add, say, AVX-512 support
> [0], which is a much newer instruction set, we'll need to use the function
> pointer again.  And we're back where we started.  We could instead just ask
> folks to compile with -march=native, but then these optimizations are only
> available for a subset of users until we decide the instructions are
> standard enough 20 years from now.

I don't think this is that big an issue:

So much of the avx512 stuff is only available on an almost arbitrary-seeming
set of platforms, so we'll need runtime tests approximately forever. Which
also means we need a fairly standard pattern for deciding whether the runtime
dispatch is worth the cost.  That's pretty independent of whether we e.g. want
to implement pg_popcount64 without runtime dispatch; it'd not make sense to
use avx512 for that anyway.


> The idea that's been floating around recently is to build a bunch of
> different versions of Postgres and to choose one on startup based on what
> the CPU supports.  That seems like quite a lot of work, and it'll increase
> the size of the builds quite a bit, but it at least doesn't have the
> aforementioned problem.

I still think that's a good idea - but I don't think it's gonna deal well with
the various avx512 "features". The landscape of what's supported on what CPU
is so comically complicated that there's no way to build separate versions for
everything.



We can hide most of the dispatch cost in a static inline function that only
does the runtime test if size is large enough - the size is a compile time
constant most of the time, which optimizes away the dispatch cost most of the
time.  And even if not, an inlined runtime branch is a *lot* cheaper than an
indirect function call.


I think this should be something *roughly* like this:


/* shared between all cpu-type dependent code */

#define PGCPUCAP_INIT    (1 << 0)
#define PGCPUCAP_POPCNT  (1 << 1)
#define PGCPUCAP_VPOPCNT (1 << 2)
#define PGCPUCAP_CRC32C  (1 << 3)
...

extern uint32 pg_cpucap;  /* possibly a bool array would be better */
extern void pg_cpucap_initialize(void);


static inline int
pg_popcount64(uint64 word)
{
    Assert(pg_cpucap & PGCPUCAP_INIT);

#if defined(HAVE_POPCNT64_COMPILETIME) || defined(HAVE_POPCNT64_RUNTIME)
#if defined(HAVE_POPCNT64_RUNTIME)
    if (pg_cpucap & PGCPUCAP_POPCNT)
#endif
    {
        return pg_popcount64_fast(word);
    }
    /* fall through if not available at runtime */
#endif

    return pg_popcount64_slow(word);
}

/* badly guessed */
#define AVX512_THRESHOLD 64

static inline uint64
pg_popcount(const char *buf, int bytes)
{
    uint64 popcnt = 0;

    Assert(pg_cpucap & PGCPUCAP_INIT);

    /*
     * Most of the time `bytes` will be a compile time constant and the
     * branches below can be optimized out.  Even if they can't, a branch or
     * three here is a lot cheaper than an indirect function call.
     */

#if defined(HAVE_AVX512_VPOPCNT_COMPILETIME) || \
    defined(HAVE_AVX512_VPOPCNT_RUNTIME)
    if (unlikely(bytes >= AVX512_THRESHOLD))
    {
#if defined(HAVE_AVX512_VPOPCNT_RUNTIME)
        if (pg_cpucap & PGCPUCAP_VPOPCNT)
#endif
        {
            return pg_popcount_avx512(buf, bytes);
        }
        /* if not available we fall through to the unrolled implementation */
    }
#endif /* HAVE_AVX512_VPOPCNT_* */

    /* XXX: should probably be implemented in a separate function */
    if (bytes >= 8)
    {
        while (bytes >= 8)
        {
            uint64 word;

            /*
             * Address could be unaligned; the compiler will optimize this to
             * a plain [unaligned] load on most architectures.
             */
            memcpy(&word, buf, sizeof(uint64));

            /*
             * TODO: if compilers can't hoist the pg_cpucap check
             * automatically, we should do so "manually".
             */
            popcnt += pg_popcount64(word);

            buf += sizeof(uint64);
            bytes -= sizeof(uint64);
        }
    }

    /* process the remaining tail bytes one at a time */
    while (bytes-- > 0)
        popcnt += pg_popcount64((unsigned char) *buf++);

    return popcnt;
}
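
For completeness, a minimal sketch of what the initialization side could
look like on x86 (hypothetical: the CPUID bit positions are the documented
ones, everything else is illustrative):

/* Hypothetical initialization sketch, x86 only (GCC/Clang <cpuid.h>). */
#include <cpuid.h>

uint32 pg_cpucap = 0;

void
pg_cpucap_initialize(void)
{
    uint32 cap = PGCPUCAP_INIT;
    unsigned int eax, ebx, ecx, edx;

    /* leaf 1: POPCNT is ECX bit 23, SSE 4.2 (CRC32) is ECX bit 20 */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
    {
        if (ecx & (1 << 23))
            cap |= PGCPUCAP_POPCNT;
        if (ecx & (1 << 20))
            cap |= PGCPUCAP_CRC32C;
    }

    /*
     * Leaf 7, subleaf 0: AVX-512 VPOPCNTDQ is ECX bit 14.  A real
     * implementation would also need to verify OS ZMM state support via
     * XGETBV/XCR0 before trusting this bit.
     */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) && (ecx & (1 << 14)))
        cap |= PGCPUCAP_VPOPCNT;

    pg_cpucap = cap;
}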

Re: Changing default -march landscape

2024-06-24 Thread Christoph Berg
Re: To Thomas Munro
> Or Debian might just bump the baseline. PostgreSQL asking for it might
> just be the reason we wanted to hear to make it happen.

Which level would PostgreSQL specifically want? x86-64-v2 or even
x86-64-v3?

Christoph




Re: Changing default -march landscape

2024-06-24 Thread Christoph Berg
Hi,

sorry for the delayed reply, I suck at prioritizing things.

Re: Thomas Munro
> OK let me CC Christoph and ask the question this way: hypothetically,
> if RHEL users' PostgreSQL packages became automatically faster than
> Debian users' packages because of default -march choice in the
> toolchain, what would the Debian project think about that, and what
> should we do about it?  The options discussed so far were:
> 
> 1.  Consider a community apt repo that is identical to the one except
> targeting the higher microarch levels, that users can point a machine
> to if they want to.

There are sub-variations of that: Don't use -march in Debian for
strict baseline compatibility, but use -march=something in
apt.postgresql.org; bump to -march=x86-64-v2 for the server build (but
not libpq and psql) saying that PG servers need better hardware; ...

I'd rather avoid adding yet another axis to the matrix we target on
apt.postgresql.org; it's already complex enough. So ideally there should
be only one server build. Or, if we decide it's worth having several,
extension builds should still be compatible with either.

> 2.  Various ideas for shipping multiple versions of the code at a
> higher granularity than we're doing today (a callback for a single
> instruction!  the opposite extreme being the whole executable +
> libraries), probably using some of the techniques mentioned at
> https://wiki.debian.org/InstructionSelection.

I don't have enough experience with that to say anything about the
trade-offs, or if the online instruction selectors themselves add too
much overhead.

Not to mention that testing things against all instruction variants is
probably next to impossible in practice.

> I would guess that 1 is about 100x easier but I haven't looked into it.

Are there any numbers about what kind of speedup we could expect?

If the online selection isn't feasible/worthwhile, I'd be willing to
bump the baseline for the server packages. There are already packages
doing that, and there's even infrastructure in the "isa-support"
package that lets packages declare a dependency on CPU features.
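
For illustration (the virtual package name below is an assumption; the
isa-support documentation lists the real ones), a server package could
declare something like

  # debian/control sketch, hypothetical names
  Package: postgresql-17
  Depends: ${shlibs:Depends}, ${misc:Depends}, x86-64-v2-support

so installation fails cleanly on hardware below the chosen baseline.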

Or Debian might just bump the baseline. PostgreSQL asking for it might
just be the reason we wanted to hear to make it happen.

Christoph




Re: Changing default -march landscape

2024-06-13 Thread Thomas Munro
On Thu, Jun 13, 2024 at 8:15 PM Magnus Hagander wrote:
> Yeah, I think it's completely unreasonable to push this on packagers and
> just say "this is your problem now". If we do that, we can assume the only
> people to get any benefit from these optimizations are those that use a
> fully managed cloud service like azure or RDS.

It would also benefit distros that have decided to move their baseline
micro-arch level right now, probably without any additional action
from the maintainers, assuming that gcc defaults to -march=*-v2 etc.
The cloud vendors' internal distros aren't really special in that
regard, are they?

Hmm, among Linux distros, maybe it's really only Debian that isn't
moving or talking about moving the baseline yet?  (Are they?)

> They can do it, but we need to tell them how and when. And if we intend for 
> packagers to be part of the solution we need to explicitly bring them into 
> the discussion of how to do it, at a fairly early stage (and no, we can't 
> expect them to follow every thread on hackers).

OK let me CC Christoph and ask the question this way: hypothetically,
if RHEL users' PostgreSQL packages became automatically faster than
Debian users' packages because of default -march choice in the
toolchain, what would the Debian project think about that, and what
should we do about it?  The options discussed so far were:

1.  Consider a community apt repo that is identical to the one except
targeting the higher microarch levels, that users can point a machine
to if they want to.
2.  Various ideas for shipping multiple versions of the code at a
higher granularity than we're doing today (a callback for a single
instruction!  the opposite extreme being the whole executable +
libraries), probably using some of the techniques mentioned at
https://wiki.debian.org/InstructionSelection.

I would guess that 1 is about 100x easier but I haven't looked into it.




Re: Changing default -march landscape

2024-06-13 Thread Magnus Hagander
On Thu, Jun 13, 2024 at 9:41 AM Peter Eisentraut wrote:

> On 13.06.24 04:00, Nathan Bossart wrote:
> > That's true, but my point is that as soon as we start avoiding function
> > pointers more commonly, it becomes difficult to justify adding them back
> > in order to support new instruction sets.  Should we just compile in the
> > SSE 4.2 version, or should we take a chance on AVX-512 with the function
> > pointer?
> >
> >>> The idea that's been floating around recently is to build a bunch of
> >>> different versions of Postgres and to choose one on startup based on
> >>> what the CPU supports.  That seems like quite a lot of work, and it'll
> >>> increase the size of the builds quite a bit, but it at least doesn't
> >>> have the aforementioned problem.
> >>
> >> I guess another idea would be for the PGDG packagers or someone else
> >> interested in performance to create repos with binaries built for
> >> these microarch levels and users can research what they need.  The new
> >> -v2 etc levels are a lot more practical than the microarch names and
> >> individual features...
> >
> > Heartily agreed.
>
> One thing that is perhaps not clear (to me?) is how much this matters
> and how much of it matters.  Obviously, we know that it matters some,
> otherwise we'd not be working on it.  But does it, like, matter only
> with checksums, or with thousands of partitions, or with many CPUs, or
> certain types of indexes, etc.?
>
> If we're going to, say, create some recommendations for packagers around
> this, how are they supposed to determine the tradeoffs?  It might be
> easy for a packager to set some slightly-higher -march flag that is in
> line with their distro's policies, but it would probably be a lot more
> work to create separate binaries or a separate repository for, say,
> moving from SSE-something to AVX-something.  And how are they supposed
> to decide that, and how are they supposed to communicate that to their
> users?  (And how can we get different packagers to make somewhat
> consistent decisions around this?)
>
> We have in a somewhat similar case quite clearly documented that without
> native spinlock support everything will be terrible.  And there is
> probably some information out there that without certain CPU support
> checksum performance will be terrible.  But beyond that we probably
> don't have much.
>

Yeah, I think it's completely unreasonable to push this on packagers and
just say "this is your problem now". If we do that, we can assume the only
people to get any benefit from these optimizations are those that use a
fully managed cloud service like azure or RDS.

They can do it, but we need to tell them how and when. And if we intend for
packagers to be part of the solution we need to explicitly bring them into
the discussion of how to do it, at a fairly early stage (and no, we can't
expect them to follow every thread on hackers).

-- 
 Magnus Hagander
 Me: https://www.hagander.net/ 
 Work: https://www.redpill-linpro.com/ 


Re: Changing default -march landscape

2024-06-13 Thread Peter Eisentraut

On 13.06.24 04:00, Nathan Bossart wrote:
> That's true, but my point is that as soon as we start avoiding function
> pointers more commonly, it becomes difficult to justify adding them back in
> order to support new instruction sets.  Should we just compile in the SSE
> 4.2 version, or should we take a chance on AVX-512 with the function
> pointer?
>
>>> The idea that's been floating around recently is to build a bunch of
>>> different versions of Postgres and to choose one on startup based on what
>>> the CPU supports.  That seems like quite a lot of work, and it'll increase
>>> the size of the builds quite a bit, but it at least doesn't have the
>>> aforementioned problem.
>>
>> I guess another idea would be for the PGDG packagers or someone else
>> interested in performance to create repos with binaries built for
>> these microarch levels and users can research what they need.  The new
>> -v2 etc levels are a lot more practical than the microarch names and
>> individual features...
>
> Heartily agreed.


One thing that is perhaps not clear (to me?) is how much this matters 
and how much of it matters.  Obviously, we know that it matters some, 
otherwise we'd not be working on it.  But does it, like, matter only 
with checksums, or with thousands of partitions, or with many CPUs, or 
certain types of indexes, etc.?


If we're going to, say, create some recommendations for packagers around 
this, how are they supposed to determine the tradeoffs?  It might be 
easy for a packager to set some slightly-higher -march flag that is in 
line with their distro's policies, but it would probably be a lot more 
work to create separate binaries or a separate repository for, say, 
moving from SSE-something to AVX-something.  And how are they supposed 
to decide that, and how are they supposed to communicate that to their 
users?  (And how can we get different packagers to make somewhat 
consistent decisions around this?)


We have in a somewhat similar case quite clearly documented that without 
native spinlock support everything will be terrible.  And there is 
probably some information out there that without certain CPU support 
checksum performance will be terrible.  But beyond that we probably 
don't have much.






Re: Changing default -march landscape

2024-06-12 Thread Nathan Bossart
On Thu, Jun 13, 2024 at 01:20:17PM +1200, Thomas Munro wrote:
> The way I think about it, it's not our place to require anything (I
> mean, we can't literally put -march=XXX into our build files, or if we
> do the Debian et al maintainers will have to remove them by local
> patch because they are in charge of what the baseline is for their
> distro), but we should do the best thing possible when people DO build
> with modern -march.  I would assume for example that Amazon Linux is
> set up to use a default -march that targets the actual minimum
> microarch level on AWS hosts.  I guess what I'm pointing out here is
> that the baseline is (finally!) moving on common distributions, and
> yet we've coded things in a way that doesn't benefit...

That's true, but my point is that as soon as we start avoiding function
pointers more commonly, it becomes difficult to justify adding them back in
order to support new instruction sets.  Should we just compile in the SSE
4.2 version, or should we take a chance on AVX-512 with the function
pointer?

>> The idea that's been floating around recently is to build a bunch of
>> different versions of Postgres and to choose one on startup based on what
>> the CPU supports.  That seems like quite a lot of work, and it'll increase
>> the size of the builds quite a bit, but it at least doesn't have the
>> aforementioned problem.
> 
> I guess another idea would be for the PGDG packagers or someone else
> interested in performance to create repos with binaries built for
> these microarch levels and users can research what they need.  The new
> -v2 etc levels are a lot more practical than the microarch names and
> individual features...

Heartily agreed.

-- 
nathan




Re: Changing default -march landscape

2024-06-12 Thread Thomas Munro
On Thu, Jun 13, 2024 at 1:09 PM Nathan Bossart wrote:
> Now, let's say we require x86-64-v2.  Once we have that, we can avoid the
> function pointer on many more x86 machines.  While that sounds great, now
> we have a different problem.  If someone wants to add, say, AVX-512 support
> [0], which is a much newer instruction set, we'll need to use the function
> pointer again.  And we're back where we started.  We could instead just ask
> folks to compile with -march=native, but then these optimizations are only
> available for a subset of users until we decide the instructions are
> standard enough 20 years from now.

The way I think about it, it's not our place to require anything (I
mean, we can't literally put -march=XXX into our build files, or if we
do the Debian et al maintainers will have to remove them by local
patch because they are in charge of what the baseline is for their
distro), but we should do the best thing possible when people DO build
with modern -march.  I would assume for example that Amazon Linux is
set up to use a default -march that targets the actual minimum
microarch level on AWS hosts.  I guess what I'm pointing out here is
that the baseline is (finally!) moving on common distributions, and
yet we've coded things in a way that doesn't benefit...

> The idea that's been floating around recently is to build a bunch of
> different versions of Postgres and to choose one on startup based on what
> the CPU supports.  That seems like quite a lot of work, and it'll increase
> the size of the builds quite a bit, but it at least doesn't have the
> aforementioned problem.

I guess another idea would be for the PGDG packagers or someone else
interested in performance to create repos with binaries built for
these microarch levels and users can research what they need.  The new
-v2 etc levels are a lot more practical than the microarch names and
individual features...




Re: Changing default -march landscape

2024-06-12 Thread Nathan Bossart
On Thu, Jun 13, 2024 at 11:11:56AM +1200, Thomas Munro wrote:
> David R and I were discussing vectorisation and microarchitectures and
> what you can expect the target microarchitecture to be these days, and
> it seemed like some of our choices are not really very
> forward-looking.
> 
> Distros targeting x86-64 traditionally assumed the original AMD64 K8
> instruction set, so if we want to use newer instructions we use
> various configure or runtime checks to see if that's safe.
> 
> Recent GCC and Clang versions understand -march=x86-64-v{2,3,4}[1].
> RHEL9 and similar and SUSE tumbleweed now require x86-64-v2, and IIUC
> they changed the -march default to -v2 in their build of GCC, and I
> think Ubuntu has something in the works perhaps for -v3[2].
> 
> Some of our current tricks won't take proper advantage of that:
> we'll still access POPCNT through a function pointer!

This is perhaps only tangentially related, but I've found it really
difficult to avoid painting ourselves into a corner with this stuff.  Let's
use the SSE 4.2 CRC32C code as an example.  Right now, if your default
compiler flags indicate support for SSE 4.2 (which I'll note can be assumed
with x86-64-v2), we'll use it unconditionally, no function pointer
required.  If additional compiler flags happen to convince the compiler to
generate SSE 4.2 code, we'll instead build both a fallback version and the
SSE version, and then we'll use a function pointer to direct to whatever we
detect is available on the CPU when the server starts.
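
For reference, a rough sketch of that existing dispatch, simplified from
memory (see src/include/port/pg_crc32c.h for the real thing):

#if defined(USE_SSE42_CRC32C)
/* compiler baseline guarantees SSE 4.2: direct call, no pointer */
#define COMP_CRC32C(crc, data, len) \
    ((crc) = pg_comp_crc32c_sse42((crc), (data), (len)))
#elif defined(USE_SSE42_CRC32C_WITH_RUNTIME_CHECK)
/* chosen once at startup based on a CPUID check: indirect call every time */
extern pg_crc32c (*pg_comp_crc32c) (pg_crc32c crc, const void *data, size_t len);
#define COMP_CRC32C(crc, data, len) \
    ((crc) = pg_comp_crc32c((crc), (data), (len)))
#endif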

Now, let's say we require x86-64-v2.  Once we have that, we can avoid the
function pointer on many more x86 machines.  While that sounds great, now
we have a different problem.  If someone wants to add, say, AVX-512 support
[0], which is a much newer instruction set, we'll need to use the function
pointer again.  And we're back where we started.  We could instead just ask
folks to compile with -march=native, but then these optimizations are only
available for a subset of users until we decide the instructions are
standard enough 20 years from now.

The idea that's been floating around recently is to build a bunch of
different versions of Postgres and to choose one on startup based on what
the CPU supports.  That seems like quite a lot of work, and it'll increase
the size of the builds quite a bit, but it at least doesn't have the
aforementioned problem.

Sorry if I just rambled on about something unrelated, but your message had
enough keywords to get me thinking about this again.

[0] https://postgr.es/m/BL1PR11MB530401FA7E9B1CA432CF9DC3DC192%40BL1PR11MB5304.namprd11.prod.outlook.com

-- 
nathan