Hi,

On 2026-02-23 16:24:57 +0100, David Geier wrote:
> The code wasn't compiling properly on Windows because __x86_64__ is not
> defined in Visual C++. I've changed the code to use
>
>   #if defined(__x86_64__) || defined(_M_X64)

Independently of this patchset I wonder if it'd be worth introducing a
PG_ARCH_X64 or such, to avoid this kind of thing.


> I've tested v8 of the patch (= v7 plus aforementioned changes) on
> Windows. I'm reporting the best of 3 runs.
>
> lotsarows test with parallelism disabled:
>
> master: 2781 ms
> v7:     2776 ms (timing_clock_source = 'system')
> v7:     2091 ms (timing_clock_source = 'tsc')

Nice.

> pg_test_timing:
>
> master: 27.04 ns
> v7:     16.59 ns (QueryxPerformanceCounter)
> v7:     13.69 ns (RDTSCP)
> v7:      9.42 ns (RDTSC)

Very nice.


Unfortunately, on linux, applying up to 0002 causes a small regression in
pg_test_timing.

With cpuidle disabled, performance governor, pinned to one core.

pg_test_timing, turboboost disabled:

412f78c66ee     27.70 ns
0002            28.48 ns

pg_test_timing, turboboost enabled:

412f78c66ee     20.41 ns
0002            21.04 ns


However, I tried, but failed, to produce an actual EXPLAIN ANALYZE that shows
that difference. All the differences I see are well below the run-to-run noise.

Which makes sense - the increase in overhead here is probably visible because
it lengthens the dependency chain inside the loop, which wouldn't be visible
in a normal EXPLAIN (and of course, with more patches applied, a lot more is
gained).



> From 25b58d2890e65a95ce426a0b80fab41c1c99bd8f Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sat, 31 Jan 2026 08:49:46 -0800
> Subject: [PATCH v8 1/4] Check for HAVE__CPUIDEX and HAVE__GET_CPUID_COUNT
>  separately
>
> Previously we would only check for the availability of __cpuidex if
> the related __get_cpuid_count was not available on a platform. But there
> are cases where we want to be able to call __cpuidex as the only viable
> option, specifically, when accessing a high leaf like VM Hypervisor
> information (0x40000000), which __get_cpuid_count does not allow.
>
> This will be used in an future commit to access Hypervisor information
> about the TSC frequency of x86 CPUs, where available.
>
> Note that __cpuidex is defined in cpuid.h for GCC/clang, but in intrin.h
> for MSVC. Because we now set HAVE__CPUIDEX for GCC/clang when available,
> adjust existing code to check for _MSC_VER when including intrin.h.
>
> Author: Lukas Fittl <[email protected]>
> Reviewed-by:
> Discussion: 
> https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de



>  # Check for XSAVE intrinsics
> diff --git a/meson.build b/meson.build
> index ebfb85e93e5..312c919eaa4 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -2080,7 +2080,8 @@ elif cc.links('''
>  endif
>
>
> -# Check for __get_cpuid_count() and __cpuidex() in a similar fashion.
> +# Check for __get_cpuid_count() and __cpuidex() separately, since we 
> sometimes
> +# need __cpuidex() even if __get_cpuid_count() is available.
>  if cc.links('''
>      #include <cpuid.h>
>      int main(int arg, char **argv)
> @@ -2091,8 +2092,13 @@ if cc.links('''
>      ''', name: '__get_cpuid_count',
>      args: test_c_args)
>    cdata.set('HAVE__GET_CPUID_COUNT', 1)
> -elif cc.links('''
> +endif
> +if cc.links('''
> +    #ifdef _MSC_VER
>      #include <intrin.h>
> +    #else
> +    #include <cpuid.h>
> +    #endif
>      int main(int arg, char **argv)
>      {
>          unsigned int exx[4] = {0, 0, 0, 0};

FWIW, this seems to trigger a warning locally:

/srv/dev/build/postgres/m-dev-assert/meson-private/tmpw34r2pnc/testfile.c: In 
function 'main':
/srv/dev/build/postgres/m-dev-assert/meson-private/tmpw34r2pnc/testfile.c:10:19:
 warning: pointer targets in passing argument 1 of '__cpuidex' differ in signe>
   10 |         __cpuidex(exx, 7, 0);
      |                   ^~~
      |                   |
      |                   unsigned int *
In file included from 
/srv/dev/build/postgres/m-dev-assert/meson-private/tmpw34r2pnc/testfile.c:5:
/home/andres/build/gcc/master/install/lib/gcc/x86_64-pc-linux-gnu/16/include/cpuid.h:361:16:
 note: expected 'int *' but argument is of type 'unsigned int *'
  361 | __cpuidex (int __cpuid_info[4], int __leaf, int __subleaf)
      |            ~~~~^~~~~~~~~~~~~~~
-----------
Checking if "__cpuidex" links: YES 




> diff --git a/src/port/pg_crc32c_sse42_choose.c 
> b/src/port/pg_crc32c_sse42_choose.c
> index f586476964f..7a75380b483 100644
> --- a/src/port/pg_crc32c_sse42_choose.c
> +++ b/src/port/pg_crc32c_sse42_choose.c
> @@ -20,11 +20,11 @@
>
>  #include "c.h"
>
> -#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
> +#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || 
> (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
>  #include <cpuid.h>
>  #endif

Why would we want to include cpuid.h with msvc if one of the other variables
is defined?


> -#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
> +#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
>  #include <intrin.h>
>  #endif

And here, why would we want to include intrin.h if HAVE__CPUID is defined?


Seems like this should just be something like

#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || 
defined(HAVE__CPUIDEX)
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif /* defined(_MSC_VER) */
#endif



> From 2392d95626599a1b5562f9216eb8c334db99c932 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Fri, 25 Jul 2025 17:57:20 -0700
> Subject: [PATCH v8 2/4] Timing: Streamline ticks to nanosecond conversion
>  across platforms
>
> The timing infrastructure (INSTR_* macros) measures time elapsed using
> clock_gettime() on POSIX systems, which returns the time as nanoseconds,
> and QueryPerformanceCounter() on Windows, which is a specialized timing
> clock source that returns a tick counter that needs to be converted to
> nanoseconds using the result of QueryPerformanceFrequency().
>
> This conversion currently happens ad-hoc on Windows, e.g. when calling
> INSTR_TIME_GET_NANOSEC, which calls QueryPerformanceFrequency() on every
> invocation, despite the frequency being stable after program start,
> incurring unnecessary overhead. It also causes a fractured implementation
> where macros are defined differently between platforms.
>
> To ease code readability, and prepare for a future change that intends
> to use a ticks-to-nanosecond conversion on x86-64 for TSC use, introduce
> a new pg_ticks_to_ns() function that gets called on all platforms.
>
> This function relies on a separately initialized ticks_per_ns_scaled
> value, that represents the conversion ratio. This value is initialized
> from QueryPerformanceFrequency() on Windows, and set to zero on x86-64
> POSIX systems, which results in the ticks being treated as nanoseconds.
> Other architectures always directly return the original ticks.
>
> To support this, pg_initialize_timing() is introduced, and is now
> mandatory for both the backend and any frontend programs to call before
> utilizing INSTR_* macros.

I wonder if it's worth trying to transparently initialize in the overflow
codepath. Probably not, but worth explicitly considering.


> In passing modify pg_test_timing to reduce the per-loop overhead caused
> by repeated divisions in INSTR_TIME_GET_NANOSEC when the ticks variable
> has become very large. Instead diff first and then turn it into nanosecs.

I'd like to see this broken out into a separate change.


> diff --git a/src/bin/pg_test_timing/pg_test_timing.c 
> b/src/bin/pg_test_timing/pg_test_timing.c
> index a5621251afc..9fd630a490a 100644

> @@ -182,9 +184,8 @@ test_timing(unsigned int duration)
>                                       bits;
>
>               prev = cur;
> -             INSTR_TIME_SET_CURRENT(temp);
> -             cur = INSTR_TIME_GET_NANOSEC(temp);
> -             diff = cur - prev;
> +             INSTR_TIME_SET_CURRENT(cur);
> +             diff = INSTR_TIME_DIFF_NANOSEC(cur, prev);
>
>               /* Did time go backwards? */
>               if (unlikely(diff < 0))

FWIW, I don't think this needs a special INSTR_TIME macro, it could just use
INSTR_TIME_SUBTRACT() and INSTR_TIME_GET_NANOSEC().



> diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
> index cb4e986092e..c8b233be16c 100644
> --- a/src/bin/pgbench/pgbench.c
> +++ b/src/bin/pgbench/pgbench.c
> @@ -7334,6 +7334,9 @@ main(int argc, char **argv)
>               initRandomState(&state[i].cs_func_rs);
>       }
>
> +     /* initialize timing infrastructure (required for INSTR_* calls) */
> +     pg_initialize_timing();
> +
>       /* opening connection... */
>       con = doConnect();
>       if (con == NULL)

FWIW, I also verified that I am unable to measure overhead in pgbench due to
the more expensive conversion.  Not surprised, but it did seem like a
possibility, because pgbench unfortunately always converts the gathered time
to microseconds, rather than computing a difference between two timestamps.


> +
> +/*
> + * Stores what the number of ticks needs to be multiplied with to end up
> + * with nanoseconds using integer math.
> + *
> + * On certain platforms (currently Windows) the ticks to nanoseconds 
> conversion
> + * requires floating point math because:
> + *
> + * sec = ticks / frequency_hz
> + * ns  = ticks / frequency_hz * 1,000,000,000
> + * ns  = ticks * (1,000,000,000 / frequency_hz)
> + * ns  = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
> + *
> + * Here, 'ns' is usually a floating number. For example for a 2.5 GHz CPU
> + * the scaling factor becomes 1,000,000 / 2,500,000 = 1.2.
> + *
> + * To be able to use integer math we work around the lack of precision. We
> + * first scale the integer up and after the multiplication by the number
> + * of ticks in INSTR_TIME_GET_NANOSEC() we divide again by the same value.
> + * We picked the scaler such that it provides enough precision and is a
> + * power-of-two which allows for shifting instead of doing an integer
> + * division. We utilize unsigned integers even though ticks are stored as a
> + * signed value because that encourages compilers to generate better 
> assembly.


> + * On all other platforms we are using clock_gettime(), which uses 
> nanoseconds
> + * as ticks. Hence, we set the multiplier to zero, which causes 
> pg_ticks_to_ns
> + * to return the original value.
> + */
> +uint64               ticks_per_ns_scaled = 0;
> +uint64               max_ticks_no_overflow = 0;
> +
> +static void set_ticks_per_ns(void);
> +
> +void
> +pg_initialize_timing()
> +{
> +     set_ticks_per_ns();
> +}
> +
> +#ifndef WIN32
> +
> +static void
> +set_ticks_per_ns()
> +{
> +     ticks_per_ns_scaled = 0;
> +     max_ticks_no_overflow = 0;
> +}
> +
> +#else                                                        /* WIN32 */
> +
> +/* GetTimerFrequency returns counts per second */
> +static inline double
> +GetTimerFrequency(void)
> +{
> +     LARGE_INTEGER f;
> +
> +     QueryPerformanceFrequency(&f);
> +     return (double) f.QuadPart;
> +}
> +
> +static void
> +set_ticks_per_ns()
> +{
> +     ticks_per_ns_scaled = INT64CONST(1000000000) * TICKS_TO_NS_PRECISION / 
> GetTimerFrequency();


This should probably use NS_PER_S.

I wonder whether we should use an explicit shift here and in pg_ticks_to_ns(),
to avoid having to rely on the compiler to do so for us.


> +static inline int64
> +pg_ticks_to_ns(int64 ticks)
> +{
> +#if defined(__x86_64__) || defined(_M_X64)
> +     int64           ns = 0;
> +
> +     if (ticks_per_ns_scaled == 0)
> +             return ticks;

There should be comment explaining (or referencing another explanation) for
why this exists.


> +     /*
> +      * Would multiplication overflow? If so perform computation in two 
> parts.
> +      * Check overflow without actually overflowing via: a * b > max <=> a >
> +      * max / b
> +      */
> +     if (unlikely(ticks > (int64) max_ticks_no_overflow))

The "via" comment seems a bit misplaced, given that the transformation is not
really utilized here (but rather at the point where max_ticks_no_overflow is
computed).


> +     {
> +             /*
> +              * Compute how often the maximum number of ticks fits 
> completely into
> +              * the number of elapsed ticks and convert that number into
> +              * nanoseconds. Then multiply by the count to arrive at the 
> final
> +              * value. In a 2nd step we adjust the number of elapsed ticks 
> and
> +              * convert the remaining ticks.
> +              */
> +             int64           count = ticks / max_ticks_no_overflow;
> +             int64           max_ns = max_ticks_no_overflow * 
> ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
> +
> +             ns = max_ns * count;
> +
> +             /*
> +              * Subtract the ticks that we now already accounted for, so 
> that they
> +              * don't get counted twice.
> +              */
> +             ticks -= count * max_ticks_no_overflow;
> +             Assert(ticks >= 0);

I think we could perhaps make the overflow case a good bit cheaper, by
avoiding any divisions with a non-constant factor (assuming I haven't blown
the logic below).  Instead of doing a division we can "transform back" into
the non-scaled representation, I think?

ns = (ticks * ticks_per_ns_scaled) / TICKS_TO_NS_PRECISION

  equals, assuming arbitrary precision

ns = (ticks / TICKS_TO_NS_PRECISION) * ticks_per_ns_scaled

  and not assuming arbitrary precision:

count = ticks // TICKS_TO_NS_PRECISION
rem_ticks = ticks - (count * TICKS_TO_NS_PRECISION)
ns = count * ticks_per_ns_scaled + rem_ticks * ticks_per_ns_scaled // TICKS_TO_NS_PRECISION

None of which afaict would overflow?




> --- a/src/common/instr_time.c
> +++ b/src/common/instr_time.c
> @@ -20,8 +20,8 @@
>   * Stores what the number of ticks needs to be multiplied with to end up
>   * with nanoseconds using integer math.
>   *
> - * On certain platforms (currently Windows) the ticks to nanoseconds 
> conversion
> - * requires floating point math because:
> + * In certain cases (TSC on x86-64, and QueryPerformanceCounter on Windows)
> + * the ticks to nanoseconds conversion requires floating point math because:
>   *
>   * sec = ticks / frequency_hz
>   * ns  = ticks / frequency_hz * 1,000,000,000
> @@ -39,7 +39,7 @@
>   * division. We utilize unsigned integers even though ticks are stored as a
>   * signed value because that encourages compilers to generate better 
> assembly.
>   *
> - * On all other platforms we are using clock_gettime(), which uses 
> nanoseconds
> + * In all other cases we are using clock_gettime(), which uses nanoseconds
>   * as ticks. Hence, we set the multiplier to zero, which causes 
> pg_ticks_to_ns
>   * to return the original value.
>   */
> @@ -48,16 +48,57 @@ uint64            max_ticks_no_overflow = 0;
>
>  static void set_ticks_per_ns(void);
>
> +int                  timing_clock_source = TIMING_CLOCK_SOURCE_AUTO;
> +
> +#if defined(__x86_64__) || defined(_M_X64)
> +/* Indicates if TSC instructions (RDTSC and RDTSCP) are usable. */
> +extern bool has_usable_tsc;
> +
> +static void tsc_initialize(void);
> +static bool tsc_use_by_default(void);
> +static void set_ticks_per_ns_for_tsc(void);
> +static bool set_tsc_frequency_khz(void);
> +static bool is_rdtscp_available(void);
> +#endif
> +
>  void
>  pg_initialize_timing()
>  {
> +#if defined(__x86_64__) || defined(_M_X64)
> +     tsc_initialize();
> +#endif
> +
> +     set_ticks_per_ns();
> +}
> +
> +bool
> +pg_set_timing_clock_source(TimingClockSourceType source)
> +{
> +#if defined(__x86_64__) || defined(_M_X64)
> +     switch (source)
> +     {
> +             case TIMING_CLOCK_SOURCE_AUTO:
> +                     use_tsc = has_usable_tsc && tsc_use_by_default();
> +                     break;
> +             case TIMING_CLOCK_SOURCE_SYSTEM:
> +                     use_tsc = false;
> +                     break;
> +             case TIMING_CLOCK_SOURCE_TSC:
> +                     if (!has_usable_tsc)    /* Tell caller TSC is not 
> usable */
> +                             return false;
> +                     use_tsc = true;
> +                     break;
> +     }
> +#endif
>       set_ticks_per_ns();
> +     timing_clock_source = source;
> +     return true;
>  }

Perhaps this should ensure that pg_initialize_timing() has already been called?


> +bool
> +check_timing_clock_source(int *newval, void **extra, GucSource source)
> +{
> +#if defined(__x86_64__) || defined(_M_X64)
> +     pg_initialize_timing();
> +
> +     if (*newval == TIMING_CLOCK_SOURCE_TSC && !has_usable_tsc)
> +     {
> +             GUC_check_errdetail("TSC is not supported as fast clock 
> source");
> +             return false;

The GUC name doesn't refer to "fast", so probably this shouldn't either?


> +const char *
> +show_timing_clock_source()
> +{
> +#if defined(__x86_64__) || defined(_M_X64)
> +     TimingClockSourceType effective_source = 
> pg_current_timing_clock_source();
> +
> +     switch (timing_clock_source)
> +     {
> +             case TIMING_CLOCK_SOURCE_AUTO:
> +                     if (effective_source == TIMING_CLOCK_SOURCE_TSC)
> +                             return "auto (tsc)";
> +                     else
> +                             return "auto (system)";
> +             case TIMING_CLOCK_SOURCE_SYSTEM:
> +                     return "system";
> +             case TIMING_CLOCK_SOURCE_TSC:
> +                     return "tsc";
> +     }
> +#else
> +     switch (timing_clock_source)
> +     {
> +             case TIMING_CLOCK_SOURCE_AUTO:
> +                     return "auto (system)";
> +             case TIMING_CLOCK_SOURCE_SYSTEM:
> +                     return "system";
> +     }
> +#endif

Seems like it'd be nicer if we had one switch with the ifdef-ery inside the
TIMING_CLOCK_SOURCE_AUTO case?  If we add support for tsc based clock sources
on arm as well, this would get a bit unmanageable.



> +static uint32 tsc_frequency_khz = 0;
> +
> +/*
> + * Decide whether we use the RDTSC/RDTSCP instructions at runtime, for 
> Linux/x86-64,
> + * instead of incurring the overhead of a full clock_gettime() call.
> + *
> + * This can't be reliably determined at compile time, since the
> + * availability of an "invariant" TSC (that is not affected by CPU
> + * frequency changes) is dependent on the CPU architecture. Additionally,
> + * there are cases where TSC availability is impacted by virtualization,
> + * where a simple cpuid feature check would not be enough.
> + */
> +static void
> +tsc_initialize(void)
> +{
> +     /*
> +      * Compute baseline CPU peformance, determines speed at which the TSC
> +      * advances.
> +      */
> +     if (!set_tsc_frequency_khz())
> +             return;
> +
> +     has_usable_tsc = is_rdtscp_available();
> +}
> +
> +/*
> + * Decides whether to use TSC clock source if the user did not specify it
> + * one way or the other, and it is available (checked separately).
> + *
> + * Currently only enabled by default on Linux, since Linux already does a
> + * significant amount of work to determine whether TSC is a viable clock
> + * source.
> + */
> +static bool
> +tsc_use_by_default()

Postgres style is still funcname(void).


> +{
> +#if defined(__linux__)
> +     FILE       *fp = 
> fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", 
> "r");
> +     char            buf[128];
> +
> +     if (!fp)
> +             return false;
> +
> +     if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
> +     {
> +             fclose(fp);
> +             return true;
> +     }
> +
> +     fclose(fp);
> +#endif
> +
> +     return false;
> +}

I think this will often disable tsc on VMs, due to linux defaulting to
kvm-clock in KVM VMs.

Do we care about that?


If the tsc is not actually viable, is it still listed in
/sys/devices/system/clocksource/clocksource0/available_clocksource
?


> +
> +#define CPUID_HYPERVISOR_VMWARE(words) (words[1] == 0x61774d56 && words[2] 
> == 0x4d566572 && words[3] == 0x65726177) /* VMwareVMware */
> +#define CPUID_HYPERVISOR_KVM(words) (words[1] == 0x4b4d564b && words[2] == 
> 0x564b4d56 && words[3] == 0x0000004d)     /* KVMKVMKVM */
> +
> +static bool
> +set_tsc_frequency_khz()
> +{
> +     uint32          r[4] = {0, 0, 0, 0};
> +
> +#if defined(HAVE__GET_CPUID)
> +     __get_cpuid(0x15, &r[0] /* denominator */ , &r[1] /* numerator */ , 
> &r[2] /* hz */ , &r[3]);
> +#elif defined(HAVE__CPUID)
> +     __cpuid(r, 0x15);
> +#else
> +#error cpuid instruction not available
> +#endif
> +
> +     if (r[2] > 0)
> +     {
> +             if (r[0] == 0 || r[1] == 0)
> +                     return false;
> +
> +             tsc_frequency_khz = r[2] / 1000 * r[1] / r[0];
> +             return true;
> +     }

I think there should be some explanation about what this is testing.
Including perhaps a reference to the relevant documents.



> +     /* Some CPUs only report frequency in 16H */

Ditto.


> +#if defined(HAVE__GET_CPUID)
> +     __get_cpuid(0x16, &r[0] /* base_mhz */ , &r[1], &r[2], &r[3]);
> +#elif defined(HAVE__CPUID)
> +     __cpuid(r, 0x16);
> +#else
> +#error cpuid instruction not available
> +#endif

Perhaps we could package the __get_cpuid / __cpuid thing in a wrapper, instead
of repeating the ifdefery three times?



> index 985b6b5af88..e7191c5d6cd 100644
> --- a/src/include/portability/instr_time.h
> +++ b/src/include/portability/instr_time.h
> @@ -4,9 +4,10 @@
>   *     portable high-precision interval timing
>   *
>   * This file provides an abstraction layer to hide portability issues in
> - * interval timing.  On Unix we use clock_gettime(), and on Windows we use
> - * QueryPerformanceCounter().  These macros also give some breathing room to
> - * use other high-precision-timing APIs.
> + * interval timing. On x86 we use the RDTSC/RDTSCP instruction directly in
> + * certain cases, or alternatively clock_gettime() on Unix-like systems and
> + * QueryPerformanceCounter() on Windows. These macros also give some 
> breathing
> + * room to use other high-precision-timing APIs.
>   *
>   * The basic data type is instr_time, which all callers should treat as an
>   * opaque typedef.  instr_time can store either an absolute time (of
> @@ -17,10 +18,11 @@
>   *
>   * INSTR_TIME_SET_ZERO(t)                    set t to zero (memset is 
> acceptable too)
>   *
> - * INSTR_TIME_SET_CURRENT(t)         set t to current time
> + * INSTR_TIME_SET_CURRENT_FAST(t)    set t to current time without waiting
> + *                                                                   for 
> instructions in out-of-order window
>   *
> - * INSTR_TIME_SET_CURRENT_LAZY(t)    set t to current time if t is zero,
> - *                                                                   
> evaluates to whether t changed
> + * INSTR_TIME_SET_CURRENT(t)         set t to current time while waiting for
> + *                                                                   
> instructions in OOO to retire
>   *
>   * INSTR_TIME_ADD(x, y)                              x += y
>   *

I'd probably remove INSTR_TIME_SET_CURRENT_LAZY in a prep commit.


> @@ -93,13 +95,54 @@ typedef struct instr_time
>  extern PGDLLIMPORT uint64 ticks_per_ns_scaled;
>  extern PGDLLIMPORT uint64 max_ticks_no_overflow;
>
> +#if defined(__x86_64__) || defined(_M_X64)
> +#include <immintrin.h>

Why do we need to include immintrin.h in instr_time.h?  Including immintrin.h
makes compilation a lot slower:

$ echo '#include <immintrin.h>'|gcc -ftime-report -xc -o /dev/null -c -

Time variable                                  wall           GGC
 phase setup                        :   0.00 (  1%)  1905k (  6%)
 phase parsing                      :   0.50 ( 99%)    30M ( 94%)
 preprocessing                      :   0.09 ( 18%)  5375k ( 16%)
 lexical analysis                   :   0.02 (  5%)     0  (  0%)
 parser (global)                    :   0.31 ( 61%)    12M ( 39%)
 parser inl. func. body             :   0.08 ( 15%)    12M ( 39%)
 TOTAL                              :   0.51           32M



>  int
> @@ -46,10 +46,47 @@ main(int argc, char *argv[])
>       /* initialize timing infrastructure (required for INSTR_* calls) */
>       pg_initialize_timing();
>
> -     loop_count = test_timing(test_duration);
> -
> +     /*
> +      * First, test default (non-fast) timing code. A clock source for that 
> is
> +      * always available. Hence, we can unconditionally output the result.
> +      */
> +     loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_SYSTEM, 
> false);
>       output(loop_count);
>
> +#if defined(__x86_64__) || defined(_M_X64)

I don't love that now test_timing.c has architecture specific checks.  Could
we abstract this a bit more?


> +     /*
> +      * If on a supported architecture, test the RDTSC clock source. This 
> clock
> +      * source is not always available. In that case the loop count will be 0
> +      * and we don't print.
> +      *
> +      * We first emit RDTSCP timings, which is slower, and gets used for 
> higher
> +      * precision measurements when the TSC clock source is enabled. We emit
> +      * RDTSC second, which is used for faster timing measurements with lower
> +      * precision.
> +      */
> +     printf("\n");
> +     loop_count = test_timing(test_duration, TIMING_CLOCK_SOURCE_TSC, false);
> +     if (loop_count > 0)
> +     {
> +             output(loop_count);
> +             printf("\n");
> +
> +             /* Now, emit fast timing measurements */
> +             loop_count = test_timing(test_duration, 
> TIMING_CLOCK_SOURCE_TSC, true);
> +             output(loop_count);
> +             printf("\n");
> +
> +             pg_set_timing_clock_source(TIMING_CLOCK_SOURCE_AUTO);
> +             if (pg_current_timing_clock_source() == TIMING_CLOCK_SOURCE_TSC)
> +                     printf(_("TSC clock source will be used by default, 
> unless timing_clock_source is set to 'system'.\n"));
> +             else
> +                     printf(_("TSC clock source will not be used by default, 
> unless timing_clock_source is set to 'tsc'.\n"));
> +     }
> +     else
> +             printf(_("TSC clock source is not usable. Likely unable to 
> determine TSC frequency. are you running in an unsupported virtualized 
> environment?.\n"));
> +#endif
> +

A bit weird that most of the output stuff is handled in output(), but then
some of it is handled directly in main() now, some of it in test_timing().


Greetings,

Andres Freund

