On Thu, Feb 26, 2026 at 8:14 AM Andres Freund <[email protected]> wrote:
> > 0002 is the INSTR_TIME_SET_CURRENT_LAZY removal pulled forward
> > 0003 is the INSTR_TIME_LT => INSTR_TIME_GT naming fix from [0] because
> > I assume we'll push that shortly, and 0004 now uses that macro
>
> Pushed these two.

Thanks!

See attached rebased v10 with remaining feedback addressed (except
noted below), as well as:

- Moved the CPU feature bit detection and TSC frequency logic to the
new pg_cpu_x86.c (thanks John!)
- Explicitly check that the CPU has the TSC invariant bit set (this is
a safety measure, to go along with a change to the default handling as
described later in this email)
- Explicitly check if the CPU has the hypervisor bit set, and only do
the hypervisor TSC frequency logic in that case (I think that's
correct, based on my reading of how the kernel's TSC code handles
this when it's running in a VM itself)
- Added support for the Hyper-V hypervisor by reading the TSC
frequency MSR. This allows Azure Linux VMs to work as well, and in my
testing gives a similar speedup with RDTSC to what was reported on
AWS. The only annoyance is that to enable it you have to make
/dev/cpu/0/msr readable ("setcap cap_sys_rawio=ep" on the binary that
accesses it + give the user/group access to the device file)

FWIW, I have not incorporated John's patch re: architecture detection
- I think we could get that done independently of this patch set and
rebase as needed, or we can add it here if preferred.

Two outstanding questions this time around:

1) I have not added a better way to return the TSC error information
(i.e. telling the user why TSC can't be used, as requested by Jakub),
because it's not clear to me what the best paradigm is here. I agree
it'd be useful, but I think we should have a way to show that both
when setting the GUC to "tsc" explicitly (and failing), and in
pg_test_timing. Maybe there should be a global "tsc_error_reason"
string that is set by the frequency initialization logic, and read by
the GUC check / pg_test_timing?

2) Another thing that came up whilst I was looking for reference
materials to add missing comments: I wonder if our use of RDTSCP (i.e.
the "slow" timing when TSC is enabled) needs an LFENCE instruction
following it, to ensure accuracy, per the comments in the abseil
library that uses it for timing [0]:

"The newer RDTSCP is sometimes described as serializing, but it
actually only serves as a half-fence with release semantics. Although
all instructions in the region will complete before the final
timestamp is captured, subsequent instructions may leak into the
region and increase the elapsed time. Inserting another fence after
the final RDTSCP would prevent such reordering without affecting the
measured region."

I haven't done any actual testing of this, but the argument seems
sound, so maybe it's worth adding the extra fence for accuracy with
RDTSCP?

> > > I wonder if it's worth trying to transparently initialize in the overflow
> > > codepath. Probably not, but worth explicitly considering.
> >
> > If I follow, you're thinking of something like:
> >
> > - Initialize max_ticks_no_overflow to 0 by default
> > - In the overflow path (which we'd reach the first time around), do an
> > extra check if max_ticks_no_overflow == 0, and then call the
> > initialization function
> > - The initialization function sets max_ticks_no_overflow to a non-zero
> > value, so we don't get there the second time around
> >
> > Is that right? (I think it could work, since its "just" an extra jump
> > instruction in an unlikely edge case)
>
> Yes, that's what I was wondering about.

Having thought more about this, I think it's a bad idea. First, it
would make a subsequent call to pg_ticks_to_ns with the same
instr_time input return incorrect data (since ticks was set before we
set use_tsc=true). Second, we have no guarantee today that the
instr_time value pg_ticks_to_ns was called on is the only one that
had been set but not yet read.

Put differently: Any call to pg_get_ticks/pg_get_ticks_fast before the
TSC is initialized would store ticks in the system frame of reference,
and once pg_ticks_to_ns is called any subsequent pg_get_ticks* calls
would store in the TSC frame of reference. But we don't know at
pg_ticks_to_ns read time whether the instr_time value was stored
before the TSC was initialized, or after.

> > > I think we could perhaps make the overflow case a good bit cheaper, by
> > > avoiding any divisions with a non-constant factor (assuming I haven't 
> > > blown
> > > the logic below).  Instead of doing a division we can "transform back" 
> > > into
> > > the non-scaled representation, I think?
> > >
> > > ns = (ticks * ticks_per_ns_scaled) / TICKS_TO_NS_PRECISION
> > >
> > >   equals, assuming arbitrary precision
> > >
> > > ns = (ticks / TICKS_TO_NS_PRECISION) * ticks_per_ns_scaled
> > >
> > >   and not assuming arbitrary precision:
> > >
> > > count = ticks // TICKS_TO_NS_PRECISION
> > > rem_ticks = ticks - (count * TICKS_TO_NS_PRECISION)
> > > ns = count * ticks_per_ns_scaled + rem_ticks * ticks_per_ns_scaled // 
> > > TICKS_TO_NS_PRECISION
> > >
> > > None of which afaict would overflow?
> >
> > I've left this as is for now since I didn't write the original logic
> > here (I think it was you in a prior version?), and I need a good
> > night's sleep to think through this. Additional help welcome to review
> > your proposal.
>
> i don't remember writing the logic, but that doesn't say much :)

I've thought this through, and the transformation seems sound.
Adjusted, and switched over to using explicit bit-shifts as well.

This code could probably still use another thorough read, just to
double-check I didn't mess up the math, especially in the overflow
case.

> > > I think this will often disable tsc on VMs, due to linux defaulting to
> > > kvm-clock in KVM VMs.
> > >
> > > Do we care about that?
> > >
> > >
> > > If the tsc is not actually viable, is it still listed in
> > > /sys/devices/system/clocksource/clocksource0/available_clocksource
> > > ?
> >
> > I think unless we want to do additional checks ourselves (something
> > like in [2]), we need to be careful here, and can't rely on the
> > presence of "tsc" in available clock sources to mean its viable.
> >
> > Specifically, my understanding is that the Kernel lists "tsc" as
> > available in more cases, and then if chosen in the beginning, has a
> > watchdog logic that observes the TSC and modifies it as needed if its
> > not viable. I think in such cases "tsc" would continue to be listed as
> > available, but the Kernel would have notified the user in the kernel
> > log that TSC is unstable.
>
> We probably should verify that the kernel indeed behaves that way, otherwise
> far fewer people will benefit from this improvement.

So I've verified that the Linux kernel behaves this way, at least by
reading the code again, i.e. "tsc" won't be removed from the
available clocksource list if the watchdog disqualifies it. I lack a
non-virtualized x86 system to test this on in practice, but I've
found nothing that contradicts this.

But I also see your point that we want to broaden who benefits by
default. After further research, I think we can make a stronger case
for declaring the TSC safe to use ourselves, based on a 2021 Linux
kernel discussion that stated [1]:

"We're finally at a point where TSC seems to be halfways reliable and
less abused by BIOS tinkerers. TSC_ADJUST was really key as we can now
detect even small modifications reliably and the important point is
that we can cure them as well (not pretty but better than all other
options). ... There is still no architecural guarantee for TSCs being
synchronized on machines with more than 4 sockets."

That then led the kernel folks to turn off the TSC watchdog at boot
when: the TSC is invariant, TSC_ADJUST is present (Intel only), and
the system has four sockets or fewer. No luck for AMD unfortunately,
which the kernel doesn't trust to the same degree, so we probably
shouldn't either.

Based on this, I've adjusted our default choice logic to match the
Linux kernel logic (on any platform), but kept the previous Linux
clock source check as a fallback. That helps on AMD systems and on 8+
socket systems where the Linux TSC watchdog has already done its work
(presumably shortly after boot, before Postgres starts) and confirmed
the TSC is safe to use.

On the cloud platforms I checked (AWS and GCP), both the TscInvariant
and TSC_ADJUST bits are set in the VM, so that should also help in
virtualized scenarios where the clock source is kvm-clock but the TSC
is in good shape.

> > > > @@ -93,13 +95,54 @@ typedef struct instr_time
> > > >  extern PGDLLIMPORT uint64 ticks_per_ns_scaled;
> > > >  extern PGDLLIMPORT uint64 max_ticks_no_overflow;
> > > >
> > > > +#if defined(__x86_64__) || defined(_M_X64)
> > > > +#include <immintrin.h>
> > >
> > > Why do we need to include immintrin.h in instr_time.h?  Including 
> > > immintrin.h
> > > makes compilation a lot slower:
> >
> > We previously had x86intrin.h there, I think David changed that to
> > immintrin.h in v8. I've adjusted this to use intrin.h on MSVC instead,
> > as that's noted as the correct file to include from [3], and back to
> > x86intrin.h for other platforms.
>
> My concern is that instr_time.h is quite widely included, including through
> executor/instrument.h and pgstat.h.
>
> An -O0 build without including immintrin.h:
>   $ ninja clean && rm -f .ninja_* && CCACHE_DISABLE=1 time ninja
>   297.05user 53.54system 0:21.53elapsed 1628%CPU (0avgtext+0avgdata 
> 493044maxresident)k
>   584inputs+4215576outputs (36major+13819907minor)pagefaults 0swaps
>
> just adding #include <immintrin.h> to instr_time.h:
>   $ ninja clean && rm -f .ninja_* && CCACHE_DISABLE=1 time ninja
>   529.83user 81.39system 0:47.85elapsed 1277%CPU (0avgtext+0avgdata 
> 585492maxresident)k
>   3504inputs+5232544outputs (31major+23905659minor)pagefaults 0swaps
>
> I.e. the elapsed build time more than doubled.  That seems problematic to me.
>
>
> I think we could either:
>
> a) Avoid the expensive include, e.g. by including a narrower header or by just
>    using the underlying builtin directly.

After discussing this with Andres off-list (I was at first a bit
confused and incorrectly stated that using x86intrin.h helps here,
which it doesn't), I've adjusted this to use the built-ins for
RDTSC/RDTSCP as suggested. Those have been available in GCC for a
long time, and were added to clang in 3.5.0, released in September
2014. I think that means we can rely on them without an explicit
configure check. MSVC, on the other hand, requires intrin.h as far as
I can tell.

Thanks,
Lukas

[0]: 
https://github.com/abseil/abseil-cpp/blob/20240116.2/absl/random/internal/nanobenchmark.cc#L178
[1]: https://lore.kernel.org/lkml/[email protected]/

-- 
Lukas Fittl

Attachment: v10-0002-pg_test_timing-Reduce-per-loop-overhead.patch
Description: Binary data

Attachment: v10-0004-instrumentation-Use-Time-Stamp-Counter-TSC-on-x8.patch
Description: Binary data

Attachment: v10-0003-instrumentation-Streamline-ticks-to-nanosecond-c.patch
Description: Binary data

Attachment: v10-0001-Check-for-HAVE__CPUIDEX-and-HAVE__GET_CPUID_COUN.patch
Description: Binary data

Attachment: v10-0005-pg_test_timing-Also-test-RDTSC-RDTSCP-timing-and.patch
Description: Binary data
