Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Lukas Fittl Sun, 22 Mar 2026 15:23:12 -0700

Hi John,

On Sun, Mar 15, 2026 at 5:49 PM John Naylor <[email protected]> wrote:
>
> On Tue, Mar 10, 2026 at 5:04 PM Lukas Fittl <[email protected]> wrote:
>
> > > Elsewhere in the patch series there are comments that refer to EBX etc
> > > while the code looks like exx[1]. It hasn't been a problem up to now
> > > since there were only a couple places that used cpuid. But this
> > > patchset adds quite a few more, so now seems like a good time to make
> > > it more readable with register name symbols.
> >
> > I'm not a fan of using macros for this, but what if we define
> > ourselves a struct like this:
> >
> > typedef struct CPUIDResult
> > {
> >     unsigned int eax;
> >     unsigned int ebx;
> >     unsigned int ecx;
> >     unsigned int edx;
> > } CPUIDResult;
> >
> > Then the code reads like:
> >
> > pg_cpuid(0x80000001, &r);
> > X86Features[PG_RDTSCP] = r.edx >> 27 & 1;
>
> One objection is that __cpuid() takes an array of 4 integers as an
> argument. I think it would technically happen to work to pass a
> pointer to this struct, but it seems the wrong thing to do. If you're
> not a fan of macros, the other way would be an enum of indices (and
> named with all caps).


How about we deal with this in the pg_cpuid function by having
__cpuid() write into a separate array and then assign that to
CPUIDResult? See the attached 0001 patch.

I think the struct fields are the clearest approach here, but if you
disagree let me know and I can adjust further.

> > > + if (exx[2] & (1 << 27))
> > > + {
> > > + uint32 xcr0_val = 0;
> > > +
> > >
> > > By moving the call to leaf 7 up, the results of leaf 1 just got blown
> > > away, so isn't OXSAVE support now broken? Maybe we actually want 2
> > > separate arrays? (Like "regs" and "ext_regs") That'll also avoid
> > > having to memset.
> >
> > Yeah, I think using two result variables make sense here.
> > Alternatively we could save the result of the OXSAVE check in a
> > boolean.
>
> Actually I like the idea of a boolean better, since that would be less
> churn, and maybe simpler to reason about. 0005 overwrites one of the
> variables in subsequent calls, so it could be confusing for future
> additions as to which one to use.

Yeah, makes sense - switched it to a boolean in the 0005 patch that
adds the TSC logic.

> v11-0001:
>
> +static inline bool
> +pg_cpuid_subleaf(int leaf, int subleaf, CPUIDResult *r)
> +{
> +#if defined(HAVE__GET_CPUID_COUNT)
> +  return __get_cpuid_count(leaf, subleaf, &r->eax, &r->ebx, &r->ecx,
> &r->edx) == 1;
> +#elif defined(HAVE__CPUIDEX)
> +  __cpuidex((int *) r, leaf, subleaf);
> +  return true;
> +#else
> +  memset(r, 0, sizeof(CPUIDResult));
> +  return false;
> +#endif
> +}
>
> This needs a comment to explain the return value.

Good point, added a comment.

I think that should address all open feedback on the 0001 patch.

Attached v12, which also has these additional changes:

v12/0003 (pg_test_timing: Reduce per-loop overhead)

- Fixed the ns to ticks translation issue, per Zsolt's feedback
- Reworked this to use a INSTR_TIME_ADD_NANOSEC macro (instead of
INSTR_TIME_SET_NANOSEC) - I think that's best for the intent of "set a
target time to compare against using a time interval", and avoids
callers passing in fixed timestamps that'd be more expensive to
convert

v12/0004 (Streamline ticks to nanosecond conversion across platforms)

- Moved pg_initialize_timing to InitProcessGlobals instead of main. I
had previously placed this in main because it might have helped a
different TSC calibration technique that could have benefited from a
distance between two function calls, but the calibration function
appears fast enough that we can just run it all at once.
- Add assert that pg_initialize_timing was called before INSTR* macros
are called, and add a missing pg_initialize_timing in pg_regress, as
well as correct where we call it for pgbench

v12/0005 (Main TSC patch)

- Clarify that TSC frequency can technically change between instr_time
set and read (in code, next to the relevant variables), per Zsolt's
feedback
- Add pg_cpuidex helper that can be used to call _cpuidex (and
intentionally not __get_cpuid_count) for getting VM Hypervisor
information

TSC calibration improvements:
- Pass TSC frequency to child processes for EXEC_BACKEND, this makes
TSC calibration usable on Windows, per off-list suggestion from Andres
- To support the EXEC_BACKEND change, refactor tsc_frequency_khz to
have a sentinel value of -1 indicating the TSC was not initialized, vs
0 indicating it was initialized but isn't usable, and drop
has_usable_tsc variable
- Fix the TSC calibration logic to handle frequency decrease, per
Zsolt's feedback
- Run TSC calibration only once GUCs are being set, since we don't
need it earlier and we can simplify the client program logic that way
(if you don't call pg_set_timing_clock_source you won't get TSC)
- Based on further testing, measure the frequency only every 100
iterations, and require at least 10 stable cycles (i.e. effective min
iterations is 1000, max is 1 million as before) -- this improves the
accuracy on my two test machines (AMD bare metal + Azure HyperV)

v12/0006 (pg_test_timing patch)

- Show TSC frequency in pg_test_timing output

Thanks,
Lukas

--
Lukas Fittl

v12-0002-Check-for-HAVE__CPUIDEX-and-HAVE__GET_CPUID_COUN.patch
Description: Binary data

v12-0003-pg_test_timing-Reduce-per-loop-overhead.patch
Description: Binary data

v12-0001-Refactor-handling-of-x86-CPUID-instructions.patch
Description: Binary data

v12-0004-instrumentation-Streamline-ticks-to-nanosecond-c.patch
Description: Binary data

v12-0005-instrumentation-Use-Time-Stamp-Counter-TSC-on-x8.patch
Description: Binary data

v12-0006-pg_test_timing-Also-test-RDTSC-RDTSCP-timing-and.patch
Description: Binary data

v12-0007-instrumentation-ARM-support-for-fast-time-measur.patch
Description: Binary data

Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Reply via email to