Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, At some point this patch switched from rdtsc to rdtscp, which imo largely negates the point of it. What led to that? Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Sat, 20 Jan 2024 at 09:03, vignesh C wrote: > > On Mon, 20 Feb 2023 at 16:06, David Geier wrote: > > > > Hi! > > > > On 2/14/23 13:48, David Geier wrote: > > > > > > It still fails. > > > > > > I'll get Cirrus-CI working on my own Github fork so I can make sure it > > > really compiles on all platforms before I submit a new version. > > > > It took some time until Cirrus CI allowed me to run tests against my new > > GitHub account (there's a 3 days freeze to avoid people from getting > > Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch > > which passes builds, rebased on latest master. > > > > I also reviewed the first two patches a while ago in [1]. I hope we can > > progress with them to further reduce the size of this patch set. > > > > Beyond that: I could work on support for more OSs (e.g. starting with > > Windows). Is there appetite for that or do we rather want to instead > > start with a smaller patch? > > Are we planning to continue on this and take it further? > I'm seeing that there has been no activity in this thread for nearly 1 > year now, I'm planning to close this in the current commitfest unless > someone is planning to take it forward. Since neither the author nor anyone else has shown interest in taking it forward and the patch has had no activity for more than a year, I have changed the status to Returned with Feedback (RWF). Feel free to add a new CF entry when someone plans to resume work on this more actively. Regards, Vignesh
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Mon, 20 Feb 2023 at 16:06, David Geier wrote: > > Hi! > > On 2/14/23 13:48, David Geier wrote: > > > > It still fails. > > > > I'll get Cirrus-CI working on my own Github fork so I can make sure it > > really compiles on all platforms before I submit a new version. > > It took some time until Cirrus CI allowed me to run tests against my new > GitHub account (there's a 3 days freeze to avoid people from getting > Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch > which passes builds, rebased on latest master. > > I also reviewed the first two patches a while ago in [1]. I hope we can > progress with them to further reduce the size of this patch set. > > Beyond that: I could work on support for more OSs (e.g. starting with > Windows). Is there appetite for that or do we rather want to instead > start with a smaller patch? Are we planning to continue on this and take it further? I'm seeing that there has been no activity in this thread for nearly 1 year now, so I'm planning to close this in the current commitfest unless someone is planning to take it forward. Regards, Vignesh
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi! On 2/14/23 13:48, David Geier wrote: It still fails. I'll get Cirrus-CI working on my own Github fork so I can make sure it really compiles on all platforms before I submit a new version. It took some time until Cirrus CI allowed me to run tests against my new GitHub account (there's a 3 days freeze to avoid people from getting Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch which passes builds, rebased on latest master. I also reviewed the first two patches a while ago in [1]. I hope we can progress with them to further reduce the size of this patch set. Beyond that: I could work on support for more OSs (e.g. starting with Windows). Is there appetite for that or do we rather want to instead start with a smaller patch? [1] https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com -- David Geier (ServiceNow) From d03a9be2522b0ef22fd58cbcfc95eb19ca8b2bea Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 20 Jan 2023 15:31:54 -0800 Subject: [PATCH v9 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop without having to convert into time units (which is costly). INSTR_TIME_IS_LT() can be used to check the loop condition. Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/include/portability/instr_time.h | 12 1 file changed, 12 insertions(+) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index cc85138e21..aab80effb0 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -15,6 +15,8 @@ * * INSTR_TIME_IS_ZERO(t) is t equal to zero? * + * INSTR_TIME_IS_LT(x, y) x < y + * * INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too) * * INSTR_TIME_SET_CURRENT(t) set t to current time @@ -22,6 +24,8 @@ * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero, * evaluates to whether t changed * + * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds + * * INSTR_TIME_ADD(x, y)x += y * * INSTR_TIME_SUBTRACT(x, y) x -= y @@ -122,6 +126,9 @@ pg_clock_gettime_ns(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_clock_gettime_ns()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = NS_PER_S * (s)) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) (t).ticks) @@ -156,6 +163,9 @@ GetTimerFrequency(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_query_performance_counter()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = s * GetTimerFrequency()) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency( @@ -168,6 +178,8 @@ GetTimerFrequency(void) #define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0) +#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks) + #define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0) -- 2.34.1 From 639213b01102a7320a62ba5ed68a8e3d6a05514b Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 16 Jan 2023 11:19:11 -0800 Subject: [PATCH v9 2/3] wip: report nanoseconds in pg_test_timing This commit also updates pg_test_timing's documentation: - compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of of statement with/without EXPLAIN ANALYZE - explain the 2x overhead (due to two timestamp acquisitions per row) - remove old section about old versions of linux - I couldn't update the numbers, and it's old enough nobody would care --- doc/src/sgml/ref/pgtesttiming.sgml | 117 ++-- src/bin/pg_test_timing/pg_test_timing.c | 74 +-- 2 files changed, 95 insertions(+), 96 
deletions(-) diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml index a5eb3aa25e..7e0266cf58 100644 --- a/doc/src/sgml/ref/pgtesttiming.sgml +++ b/doc/src/sgml/ref/pgtesttiming.sgml @@ -93,28 +93,34 @@ PostgreSQL documentation Good results will show most (>90%) individual timing calls take less than - one microsecond. Average per loop overhead will be even lower, below 100 - nanoseconds. This example from an Intel i7-860 system using a TSC clock - source shows excellent performance: + one microsecond (1000 nanoseconds). Average per loop overhead will be even + lower, below 100 nanoseconds. This example from an Intel i9-9880H system + using a TSC clock source shows excellent performance: - Note that different units are used for the per loop time than the - histogram. The loop can have resolution within a few nanoseconds (ns), - while the individual timing calls can only resolve down to one microsecond - (us). + Note that the accuracy of the histogram entries may be lower than the + per loop time. @@ -125,24 +131,25 @@ Histogram of timing durations: When the query executor is running a statement using EXPLAIN ANALYZE, individual operations are timed as well
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi! On 2/14/23 12:11, David Geier wrote: Hi, I think I fixed the compilation errors. It was due to a few variables being declared under #if defined(__x86_64__) && defined(__linux__) while being used also under non x86 Linux. I also removed again the code to obtain the TSC frequency under hypervisors because the TSC is usually emulated and therefore no faster than clock_gettime() anyways. So we now simply fallback to clock_gettime() on hypervisors when we cannot obtain the frequency via leaf 0x16. Beyond that I reviewed the first two patches a while ago in [1]. I hope we can progress with them to further reduce the size of this patch set. [1] https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com It still fails. I'll get Cirrus-CI working on my own Github fork so I can make sure it really compiles on all platforms before I submit a new version. -- David Geier (ServiceNow)
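For illustration, the leaf-0x16 probe being discussed looks roughly like this (a minimal sketch assuming GCC/Clang's cpuid.h; the function name and error handling are made up, this is not the patch's code). CPUID leaf 0x16 reports the processor base frequency in MHz in EAX and returns 0 when the value is not enumerated, which is the common case under hypervisors:

    #include <cpuid.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Try to derive the TSC frequency from CPUID leaf 0x16; a false return
     * means the caller should fall back to clock_gettime()-based timing. */
    static bool
    tsc_freq_khz_from_cpuid(uint64_t *freq_khz)
    {
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

        if (__get_cpuid_max(0, NULL) < 0x16)
            return false;
        __cpuid(0x16, eax, ebx, ecx, edx);
        if (eax == 0)                       /* frequency not enumerated */
            return false;
        *freq_khz = (uint64_t) eax * 1000;  /* EAX is the base frequency in MHz */
        return true;
    }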
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2/7/23 19:12, Andres Freund wrote: This fails to build on several platforms: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3751 I think I fixed the compilation errors. It was due to a few variables being declared under #if defined(__x86_64__) && defined(__linux__) while being used also under non x86 Linux. I also removed again the code to obtain the TSC frequency under hypervisors because the TSC is usually emulated and therefore no faster than clock_gettime() anyways. So we now simply fallback to clock_gettime() on hypervisors when we cannot obtain the frequency via leaf 0x16. Beyond that I reviewed the first two patches a while ago in [1]. I hope we can progress with them to further reduce the size of this patch set. [1] https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com -- David Geier (ServiceNow) From 36ff7f7ee14bf42ef0fb775cec428180251c3ff9 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 20 Jan 2023 15:31:54 -0800 Subject: [PATCH v8 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop without having to convert into time units (which is costly). INSTR_TIME_IS_LT() can be used to check the loop condition. Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/include/portability/instr_time.h | 12 1 file changed, 12 insertions(+) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index cc85138e21..aab80effb0 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -15,6 +15,8 @@ * * INSTR_TIME_IS_ZERO(t) is t equal to zero? * + * INSTR_TIME_IS_LT(x, y) x < y + * * INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too) * * INSTR_TIME_SET_CURRENT(t) set t to current time @@ -22,6 +24,8 @@ * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero, * evaluates to whether t changed * + * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds + * * INSTR_TIME_ADD(x, y)x += y * * INSTR_TIME_SUBTRACT(x, y) x -= y @@ -122,6 +126,9 @@ pg_clock_gettime_ns(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_clock_gettime_ns()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = NS_PER_S * (s)) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) (t).ticks) @@ -156,6 +163,9 @@ GetTimerFrequency(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_query_performance_counter()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = s * GetTimerFrequency()) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency( @@ -168,6 +178,8 @@ GetTimerFrequency(void) #define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0) +#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks) + #define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0) -- 2.34.1 From 1200bfee5b8ebf7c68d8cefb3771a4c3523c2cea Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 16 Jan 2023 11:19:11 -0800 Subject: [PATCH v8 2/3] wip: report nanoseconds in pg_test_timing This commit also updates pg_test_timing's documentation: - compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of of statement with/without EXPLAIN ANALYZE - explain the 2x overhead (due to two timestamp acquisitions per row) - remove old section about old versions of linux - I couldn't update the numbers, and it's old enough nobody would care --- doc/src/sgml/ref/pgtesttiming.sgml | 117 ++-- src/bin/pg_test_timing/pg_test_timing.c | 74 +-- 2 files changed, 95 
insertions(+), 96 deletions(-) diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml index a5eb3aa25e..7e0266cf58 100644 --- a/doc/src/sgml/ref/pgtesttiming.sgml +++ b/doc/src/sgml/ref/pgtesttiming.sgml @@ -93,28 +93,34 @@ PostgreSQL documentation Good results will show most (>90%) individual timing calls take less than - one microsecond. Average per loop overhead will be even lower, below 100 - nanoseconds. This example from an Intel i7-860 system using a TSC clock - source shows excellent performance: + one microsecond (1000 nanoseconds). Average per loop overhead will be even + lower, below 100 nanoseconds. This example from an Intel i9-9880H system + using a TSC clock source shows excellent performance: - Note that different units are used for the per loop time than the - histogram. The loop can have resolution within a few nanoseconds (ns), - while the individual timing calls can only resolve down to one microsecond - (us). + Note that the accuracy of the histogram entries may be lower than the + per loop time. @@ -125,24 +131,25 @@ Histogram of timing durations: When the query executor is running a statement using EXPLAIN ANALYZE, individual operations are
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-24 14:30:34 +0100, David Geier wrote: > Attached is v7 of the patch: > > - Rebased on latest master (most importantly on top of the int64 instr_time > commits). - Includes two commits from Andres which introduce > INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() and WIP to report > pg_test_timing output in nanoseconds. - Converts ticks to nanoseconds only > with integer math, while accounting for overflow. - Supports RDTSCP via > INSTR_TIME_SET_CURRENT() and introduced INSTR_TIME_SET_CURRENT_FAST() which > uses RDTSC. > > I haven't gotten to the following: > > - Looking through all calls to INSTR_TIME_SET_CURRENT() and check if they > should be replaced by INSTR_TIME_SET_CURRENT_FAST(). - Reviewing Andres > commits. Potentially improving on pg_test_timing's output. - Looking at > enabling RDTSC on more platforms. Is there a minimum set of platforms we > would like support for? Windows should be easy. That would also allow to > unify the code a little more. - Add more documentation and do more testing > around the calls to CPUID. - Profiling and optimizing the code. A quick test > showed about 10% improvement over master with TIMING ON vs TIMING OFF, when > using the test-case from Andres' e-mail that started this thread. > > I hope I'll find time to work on these points during the next days. This fails to build on several platforms: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3751 Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/23/23 21:30, Andres Freund wrote: That's been the case since my first post in the thread :). Mainly, it seems easier to detect underflow cases during subtraction that way. And the factor of 2 in range doesn't change a whole lot. I just realized it the other day :). If you have time to look at the pg_test_timing part, it'd be appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit hesitant to push it. I haven't yet pushed the pg_test_timing (nor it's small prerequisite) patch. I've attached those two patches. Feel free to include them in your series if you want, then the CF entry (and thus cfbot) makes sense again... I'll include them in my new patch set and also have a careful look at them. I reviewed the prerequisite patch which introduces INSTR_TIME_SET_SECONDS(), as well as the pg_test_timing patch. Here are my comments:
- The prerequisite patch looks good to me.
- By default, the test query in the pg_test_timing doc runs serially. What about adding SET max_parallel_workers_per_gather = 0 to make sure it really always does (e.g. on a system with different settings for parallel_tuple_cost / parallel_setup_cost)? Otherwise, the numbers will be much more flaky.
- Why have you added a case distinction for diff == 0? Have you encountered this case? If so, how? Maybe add a comment.
- To further reduce overhead we could call INSTR_TIME_SET_CURRENT() multiple times. But then again: why do we actually care about the per-loop time? Why not instead sum up diff and divide by the number of iterations to exclude all the overhead in the first place?
- In the computation of the per-loop time in nanoseconds you can now use INSTR_TIME_GET_NANOSEC() instead of INSTR_TIME_GET_DOUBLE() * NS_PER_S.
The rest looks good to me. The rebased patches are part of the patch set I sent out yesterday in reply to another mail in this thread. -- David Geier (ServiceNow)
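One way to read the last two review points together is roughly the following (sketch only; variable names are assumed to match pg_test_timing's loop, not copied from the patch):

    /* after the timing loop has finished and counted loop_count iterations */
    INSTR_TIME_SET_CURRENT(end_time);
    INSTR_TIME_SUBTRACT(end_time, start_time);
    ns_per_loop = (double) INSTR_TIME_GET_NANOSEC(end_time) / loop_count;

This measures the total elapsed time once and divides by the iteration count, and it uses the nanosecond accessor directly instead of INSTR_TIME_GET_DOUBLE() * NS_PER_S.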
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi I think at least some should be converted to just accumulate in an instr_time... I think that's for a later patch though? Yep, at least quite similar. OK. I coded it up in the latest version of the patch. Depending on how low we want to keep the error, I don't think we can: If I set the allowed deviation to 10**-9, we end up requiring a shift by 29 for common ghz ranges. Clearly 33bits isn't an interesting range. But even if you accept a higher error - we don't have *that* much range available. Assuming an uint64, the range is ~584 years. If we want 10 years range, we end up math.log(((2**64)-1) / (10 * 365 * 60 * 60 * 24 * 10**9), 2) ~= 5.87 So 5 bits available that we could "use" for multiply/shift. For something like 2.5ghz, that'd be ~2% error, clearly not acceptable. And even just a year of range, ends up allowing a failure of 30796s = 8min over a year, still too high. Thanks for doing the math. Agreed. The latest patch detects overflow and correctly handles it. But I don't think it's really an issue - normally that branch will never be taken (at least within the memory of the branch predictor), which on modern CPUs means it'll just be predicted as not taken. So as long as we tell the compiler what's the likely branch, it should be fine. At least as long as the branch compares with a hardcoded number. Yeah. The overflow detection just compares two int64. The "overflow threshold" is pre-computed. - the restriction just to linux, that'll make testing harder for some, and ends up encoding too much OS dependency - I think we need both the barrier and non-barrier variant, otherwise I suspect we'll end up with inccuracies we don't want - needs lots more documentation about why certain cpuid registers are used - cpu microarch dependencies - isn't there, e.g., the case that the scale on nehalem has to be different than on later architectures? - lack of facility to evaluate how well the different time sources work Makes sense. I carried that list over to my latest e-mail which also includes the patch to have some sort of summary of where we are in a single place. -- David Geier (ServiceNow)
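To put rough numbers on the precomputed threshold (illustrative figures, not taken from the patch): with a 2.5 GHz TSC one tick is 0.4 ns; using a shift of 29 the precomputed multiplier is round(0.4 * 2^29) = 214748365, the fast path computes (ticks * 214748365) >> 29, and that multiplication only overflows a 64-bit intermediate once the measured interval exceeds somewhere around 15 to 35 seconds (depending on whether the intermediate is signed). That is why the slow-path branch is essentially never taken for per-node timing.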
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/23/23 18:41, Andres Freund wrote: If we add it, it probably shouldn't depend on TIMING, but on SUMMARY. Regression test queries showing EXPLAIN ANALYZE output all do something like EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) the SUMMARY OFF gets rid of the "top-level" "Planning Time" and "Execution Time", whereas the TIMING OFF gets rid of the per-node timing. Those are separate options because per-node timing is problematic performance-wise (right now), but whole-query timing rarely is. Makes sense. I wasn't aware of SUMMARY. Let's keep this option in mind, in case we'll revisit exposing the clock source in the future. Another, independent, thing worth thinking about: I think we might want to expose both rdtsc and rdtscp. For something like InstrStartNode()/InstrStopNode(), avoiding the "one-way barrier" of rdtscp is quite important to avoid changing the query performance. But for measuring whole-query time, we likely want to measure the actual time. It probably won't matter hugely for the whole query time - the out of order window of modern CPUs is large, but not *that* large - but I don't think we can generally assume that. That's what I thought as well. I added INSTR_TIME_SET_CURRENT_FAST() and for now call that variant from InstrStartNode(), InstrEndNode() and pg_test_timing. To do so in InstrEndNode(), I removed INSTR_TIME_SET_CURRENT_LAZY(). Otherwise, two variants of that macro would be needed. INSTR_TIME_SET_CURRENT_LAZY() was only used in a single place and the code is more readable that way. INSTR_TIME_SET_CURRENT() is called from a bunch of places. I still have to go through all of them and see which should be changed to call the _FAST() variant. Attached is v7 of the patch: - Rebased on latest master (most importantly on top of the int64 instr_time commits). - Includes two commits from Andres which introduce INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() and WIP to report pg_test_timing output in nanoseconds. - Converts ticks to nanoseconds only with integer math, while accounting for overflow. - Supports RDTSCP via INSTR_TIME_SET_CURRENT() and introduced INSTR_TIME_SET_CURRENT_FAST() which uses RDTSC. I haven't gotten to the following: - Looking through all calls to INSTR_TIME_SET_CURRENT() and check if they should be replaced by INSTR_TIME_SET_CURRENT_FAST(). - Reviewing Andres commits. Potentially improving on pg_test_timing's output. - Looking at enabling RDTSC on more platforms. Is there a minimum set of platforms we would like support for? Windows should be easy. That would also allow to unify the code a little more. - Add more documentation and do more testing around the calls to CPUID. - Profiling and optimizing the code. A quick test showed about 10% improvement over master with TIMING ON vs TIMING OFF, when using the test-case from Andres' e-mail that started this thread. I hope I'll find time to work on these points during the next days. -- David Geier (ServiceNow) From 0b5ce706bed13c0c242e6ace809d3c37a8064029 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 20 Jan 2023 15:31:54 -0800 Subject: [PATCH v7 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop without having to convert into time units (which is costly). INSTR_TIME_IS_LT() can be used to check the loop condition. 
Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/include/portability/instr_time.h | 12 1 file changed, 12 insertions(+) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index cc85138e21..aab80effb0 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -15,6 +15,8 @@ * * INSTR_TIME_IS_ZERO(t) is t equal to zero? * + * INSTR_TIME_IS_LT(x, y) x < y + * * INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too) * * INSTR_TIME_SET_CURRENT(t) set t to current time @@ -22,6 +24,8 @@ * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero, * evaluates to whether t changed * + * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds + * * INSTR_TIME_ADD(x, y)x += y * * INSTR_TIME_SUBTRACT(x, y) x -= y @@ -122,6 +126,9 @@ pg_clock_gettime_ns(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_clock_gettime_ns()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = NS_PER_S * (s)) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) (t).ticks) @@ -156,6 +163,9 @@ GetTimerFrequency(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_query_performance_counter()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = s * GetTimerFrequency()) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency( @@ -168,6 +178,8 @@ GetTimerFrequency(void) #define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0) +#define INSTR_TIME_IS_LT(x, y)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-23 18:52:44 +0100, David Geier wrote: > One thing I was wondering about: why did you chose to use a signed instead > of an unsigned 64-bit integer for the ticks? That's been the case since my first post in the thread :). Mainly, it seems easier to detect underflow cases during subtraction that way. And the factor of 2 in range doesn't change a whole lot. > > > If you have time to look at the pg_test_timing part, it'd be > > > appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit > > > hesitant to push it. > > I haven't yet pushed the pg_test_timing (nor it's small prerequisite) > > patch. > > > > I've attached those two patches. Feel free to include them in your series if > > you want, then the CF entry (and thus cfbot) makes sense again... > I'll include them in my new patch set and also have a careful look at them. Thanks. Greetings, Andres Freund
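A tiny illustration of the underflow point (a sketch assuming the int64 ticks representation that was just committed; not code from the tree):

    int64   diff = end.ticks - start.ticks;

    if (diff < 0)       /* swapped arguments or a non-monotonic reading */
        diff = 0;
    /* with an unsigned 64-bit tick count the same mistake would silently
     * wrap around to a huge positive value instead of being detectable */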
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-23 18:49:37 +0100, David Geier wrote: > On 1/21/23 05:12, Andres Freund wrote: > > We do currently do the conversion quite frequently. Admittedly I was > > partially motivated by trying to get the per-loop overhead in pg_test_timing > > down ;) > > > > But I think it's a real issue. Places where we do, but shouldn't, convert: > > > > - ExecReScan() - quite painful, we can end up with a lot of those > > - InstrStopNode() - adds a good bit of overhead to simple > InstrStopNode() doesn't convert in the general case but only for the first > tuple or when async. So it goes somewhat hand in hand with ExecReScan(). I think even the first-scan portion is likely noticable for quick queries - you can quickly end up with 5-10 nodes, even for queries processed in the < 0.1ms range. Of course it's way worse with rescans / loops. > > - PendingWalStats.wal_write_time - this is particularly bad because it > > happens > >within very contended code > > - calls to pgstat_count_buffer_read_time(), > > pgstat_count_buffer_write_time() - > >they can be very frequent > > - pgbench.c, as we already discussed > > - pg_stat_statements.c > > - ... > > > > These all will get a bit slower when moving to a "variable" frequency. > I wonder if we will be able to measure any of them easily. But given that > it's many more places than I had realized and given that the optimized code > is not too involved, let's give it a try. I think at least some should be converted to just accumulate in an instr_time... > > What was your approach for avoiding the costly operation? I ended up with a > > integer multiplication + shift approximation for the floating point > > multiplication (which in turn uses the inverse of the division by the > > frequency). To allow for sufficient precision while also avoiding > > overflows, I > > had to make that branch conditional, with a slow path for large numbers of > > nanoseconds. > > It seems like we ended up with the same. I do: > > sec = ticks / frequency_hz > ns = ticks / frequency_hz * 1,000,000,000 > ns = ticks * (1,000,000,000 / frequency_hz) > ns = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz > > Now, the constant scaling factor in parentheses is typically a floating > point number. For example for a frequency of 2.5 GHz it would be 2.5. To > work around that we can do something like: > > ns = ticks * (1,000,000 * scaler / frequency_khz) / scaler > > Where scaler is a power-of-2, big enough to maintain enough precision while > allowing for a shift to implement the division. Yep, at least quite similar. > The additional multiplication with scaler makes that the maximum range go > down, because we must ensure we never overflow. I'm wondering if we cannot > pick scaler in such a way that remaining range of cycles is large enough for > our use case and we can therefore live without bothering for the overflow > case. What would be "enough"? 1 year? 10 years? ... Depending on how low we want to keep the error, I don't think we can: If I set the allowed deviation to 10**-9, we end up requiring a shift by 29 for common ghz ranges. Clearly 33bits isn't an interesting range. But even if you accept a higher error - we don't have *that* much range available. Assuming an uint64, the range is ~584 years. If we want 10 years range, we end up math.log(((2**64)-1) / (10 * 365 * 60 * 60 * 24 * 10**9), 2) ~= 5.87 So 5 bits available that we could "use" for multiply/shift. For something like 2.5ghz, that'd be ~2% error, clearly not acceptable. 
And even just a year of range ends up allowing an error of 30796s ≈ 8.5 hours over a year, still too high. But I don't think it's really an issue - normally that branch will never be taken (at least within the memory of the branch predictor), which on modern CPUs means it'll just be predicted as not taken. So as long as we tell the compiler what's the likely branch, it should be fine. At least as long as the branch compares with a hardcoded number. > > I think it'd be great - but I'm not sure we're there yet, reliability and > > code-complexity wise. > Thanks to your commits, the diff of the new patch set will be already much > smaller and easier to review. What's your biggest concern in terms of > reliability?
- the restriction just to linux, that'll make testing harder for some, and ends up encoding too much OS dependency
- I think we need both the barrier and non-barrier variant, otherwise I suspect we'll end up with inaccuracies we don't want
- needs lots more documentation about why certain cpuid registers are used
- cpu microarch dependencies - isn't there, e.g., the case that the scale on nehalem has to be different than on later architectures?
- lack of facility to evaluate how well the different time sources work
> > I think it might be worth makign the rdts aspect somewhat measurable. E.g. allowing pg_test_timing to use both at the same time, and have it compare elapsed time with both sources of counters. I haven't yet looked into pg_test_timing. I'll do that while including your patches into the new patch set. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/21/23 06:31, Andres Freund wrote: I pushed the int64-ification commits. Great. I started rebasing. One thing I was wondering about: why did you choose to use a signed instead of an unsigned 64-bit integer for the ticks? If you have time to look at the pg_test_timing part, it'd be appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit hesitant to push it. I haven't yet pushed the pg_test_timing (nor it's small prerequisite) patch. I've attached those two patches. Feel free to include them in your series if you want, then the CF entry (and thus cfbot) makes sense again... I'll include them in my new patch set and also have a careful look at them. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/21/23 05:12, Andres Freund wrote: We do currently do the conversion quite frequently. Admittedly I was partially motivated by trying to get the per-loop overhead in pg_test_timing down ;) But I think it's a real issue. Places where we do, but shouldn't, convert: - ExecReScan() - quite painful, we can end up with a lot of those - InstrStopNode() - adds a good bit of overhead to simple InstrStopNode() doesn't convert in the general case but only for the first tuple or when async. So it goes somewhat hand in hand with ExecReScan(). - PendingWalStats.wal_write_time - this is particularly bad because it happens within very contended code - calls to pgstat_count_buffer_read_time(), pgstat_count_buffer_write_time() - they can be very frequent - pgbench.c, as we already discussed - pg_stat_statements.c - ... These all will get a bit slower when moving to a "variable" frequency. I wonder if we will be able to measure any of them easily. But given that it's many more places than I had realized and given that the optimized code is not too involved, let's give it a try. What was your approach for avoiding the costly operation? I ended up with a integer multiplication + shift approximation for the floating point multiplication (which in turn uses the inverse of the division by the frequency). To allow for sufficient precision while also avoiding overflows, I had to make that branch conditional, with a slow path for large numbers of nanoseconds. It seems like we ended up with the same. I do:
sec = ticks / frequency_hz
ns = ticks / frequency_hz * 1,000,000,000
ns = ticks * (1,000,000,000 / frequency_hz)
ns = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
Now, the constant scaling factor in parentheses is typically a floating point number. For example for a frequency of 2.5 GHz it would be 0.4. To work around that we can do something like:
ns = ticks * (1,000,000 * scaler / frequency_khz) / scaler
Where scaler is a power-of-2, big enough to maintain enough precision while allowing for a shift to implement the division. The additional multiplication with scaler makes the maximum range go down, because we must ensure we never overflow. I'm wondering if we cannot pick scaler in such a way that the remaining range of cycles is large enough for our use case and we can therefore live without bothering for the overflow case. What would be "enough"? 1 year? 10 years? ... Otherwise, we indeed need code that cares for the potential overflow. My hunch is that it can be done branchless, but it for sure adds dependent instructions. Maybe in that case a branch that almost certainly will never be taken is better? I'll include the code in the new patch set, which I'll submit tomorrow at the latest. I think it'd be great - but I'm not sure we're there yet, reliability and code-complexity wise. Thanks to your commits, the diff of the new patch set will be already much smaller and easier to review. What's your biggest concern in terms of reliability? I think it might be worth makign the rdts aspect somewhat measurable. E.g. allowing pg_test_timing to use both at the same time, and have it compare elapsed time with both sources of counters. I haven't yet looked into pg_test_timing. I'll do that while including your patches into the new patch set. -- David Geier (ServiceNow)
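As a concrete rendering of the scheme above (identifiers and the shift width are made up for illustration; this is neither patch version's actual code):

    #include <stdint.h>

    #define TICKS_SHIFT 29              /* "scaler" = 2^29 */

    static int64_t ticks_to_ns_mult;    /* precomputed once at startup */

    static void
    init_ticks_to_ns(int64_t tsc_freq_khz)
    {
        /* 1,000,000 * scaler / frequency_khz, i.e. nanoseconds per tick
         * scaled up by 2^29; about 0.4 * 2^29 for a 2.5 GHz TSC */
        ticks_to_ns_mult = (INT64_C(1000000) << TICKS_SHIFT) / tsc_freq_khz;
    }

    static inline int64_t
    ticks_to_ns(int64_t ticks)
    {
        /* fast path only; a real implementation needs a slow path once
         * ticks * ticks_to_ns_mult no longer fits into 64 bits */
        return (ticks * ticks_to_ns_mult) >> TICKS_SHIFT;
    }

The precomputed overflow threshold discussed elsewhere in the thread is then simply INT64_MAX / ticks_to_ns_mult, calculated alongside the multiplier.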
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-23 18:23:17 +0100, David Geier wrote: > On 1/21/23 05:14, Andres Freund wrote: > > The elapsed time is already inherently unstable, so we shouldn't have any > > test > > output showing the time. > > > > But I doubt showing it in every explain is a good idea - we use instr_time > > in > > plenty of other places. Why show it in explain, but not in all those other > > places? > > Yeah. I thought it would only be an issue if we showed it unconditionally in > EXPLAIN ANALYZE. If we only show it with TIMING ON, we're likely fine with > pretty much all regression tests. If we add it, it probably shouldn't depend on TIMING, but on SUMMARY. Regression test queries showing EXPLAIN ANALYZE output all do something like EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) the SUMMARY OFF gets rid of the "top-level" "Planning Time" and "Execution Time", whereas the TIMING OFF gets rid of the per-node timing. Those are separate options because per-node timing is problematic performance-wise (right now), but whole-query timing rarely is. > But given the different opinions, I'll leave it out in the new patch set for > the moment being. Makes sense. Another, independent, thing worth thinking about: I think we might want to expose both rdtsc and rdtscp. For something like InstrStartNode()/InstrStopNode(), avoiding the "one-way barrier" of rdtscp is quite important to avoid changing the query performance. But for measuring whole-query time, we likely want to measure the actual time. It probably won't matter hugely for the whole query time - the out of order window of modern CPUs is large, but not *that* large - but I don't think we can generally assume that. I'm thinking of something like INSTR_TIME_SET_CURRENT() and INSTR_TIME_SET_CURRENT_FAST() or _NOBARRIER(). Greetings, Andres Freund
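For concreteness, the difference between the two proposed variants could look like this on x86-64 (a sketch assuming GCC/Clang intrinsics; the function names are illustrative, only the macro names above are the actual proposal):

    #include <stdint.h>
    #include <x86intrin.h>

    /* INSTR_TIME_SET_CURRENT_FAST() / _NOBARRIER() candidate: plain RDTSC,
     * which the CPU may reorder relative to surrounding instructions */
    static inline uint64_t
    rdtsc_unordered(void)
    {
        return __rdtsc();
    }

    /* INSTR_TIME_SET_CURRENT() candidate: RDTSCP waits until all prior
     * instructions have executed before reading the counter */
    static inline uint64_t
    rdtsc_ordered(void)
    {
        unsigned int aux;

        return __rdtscp(&aux);      /* aux receives IA32_TSC_AUX */
    }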
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/21/23 05:14, Andres Freund wrote: The elapsed time is already inherently unstable, so we shouldn't have any test output showing the time. But I doubt showing it in every explain is a good idea - we use instr_time in plenty of other places. Why show it in explain, but not in all those other places? Yeah. I thought it would only be an issue if we showed it unconditionally in EXPLAIN ANALYZE. If we only show it with TIMING ON, we're likely fine with pretty much all regression tests. But given the different opinions, I'll leave it out in the new patch set for the moment being. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-20 21:31:57 -0800, Andres Freund wrote: > On 2023-01-20 20:16:13 -0800, Andres Freund wrote: > > I'm planning to push most of my changes soon, had hoped to get to it a bit > > sooner, but ... > > I pushed the int64-ification commits. There's an odd compilation failure on AIX. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly=2023-01-21%2007%3A01%3A42 /opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY -qnoansialias -g -O2 -qmaxmem=33554432 -qsuppress=1500-010:1506-995 -qsuppress=1506-010:1506-416:1506-450:1506-480:1506-481:1506-492:1506-944:1506-1264 -qinfo=all:nocnd:noeff:noext:nogot:noini:noord:nopar:noppc:norea:nouni:nouse -qinfo=nounset -qvisibility=hidden -I. -I. -I/opt/freeware/include/python3.5m -I../../../src/include -I/home/nm/sw/nopath/icu58.3-64/include -I/home/nm/sw/nopath/libxml2-64/include/libxml2 -I/home/nm/sw/nopath/uuid-64/include -I/home/nm/sw/nopath/openldap-64/include -I/home/nm/sw/nopath/icu58.3-64/include -I/home/nm/sw/nopath/libxml2-64/include -c -o plpy_cursorobject.o plpy_cursorobject.c "../../../src/include/portability/instr_time.h", line 116.9: 1506-304 (I) No function prototype given for "clock_gettime". "../../../src/include/portability/instr_time.h", line 116.23: 1506-045 (S) Undeclared identifier CLOCK_REALTIME. : recipe for target 'plpy_cursorobject.o' failed but files including instr_time.h *do* build successfully, e.g. instrument.c: /opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY -qnoansialias -g -O2 -qmaxmem=33554432 -qsuppress=1500-010:1506-995 -qsuppress=1506-010:1506-416:1506-450:1506-480:1506-481:1506-492:1506-944:1506-1264 -qinfo=all:nocnd:noeff:noext:nogot:noini:noord:nopar:noppc:norea:nouni:nouse -qinfo=nounset -I../../../src/include -I/home/nm/sw/nopath/icu58.3-64/include -I/home/nm/sw/nopath/libxml2-64/include/libxml2 -I/home/nm/sw/nopath/uuid-64/include -I/home/nm/sw/nopath/openldap-64/include -I/home/nm/sw/nopath/icu58.3-64/include -I/home/nm/sw/nopath/libxml2-64/include -c -o instrument.o instrument.c Before the change the clock_gettime() call was in a macro and thus could be referenced even without a prior declaration, as long as places using INSTR_TIME_SET_CURRENT() had all the necessary includes and defines. Argh: There's nice bit in plpython.h: /* * Include order should be: postgres.h, other postgres headers, plpython.h, * other plpython headers. (In practice, other plpython headers will also * include this file, so that they can compile standalone.) */ #ifndef POSTGRES_H #error postgres.h must be included before plpython.h #endif /* * Undefine some things that get (re)defined in the Python headers. They aren't * used by the PL/Python code, and all PostgreSQL headers should be included * earlier, so this should be pretty safe. */ #undef _POSIX_C_SOURCE #undef _XOPEN_SOURCE the relevant stuff in time.h is indeed guarded by #if _XOPEN_SOURCE>=500 I don't think the plpython actually code follows the rule about including all postgres headers earlier. 
plpy_typeio.h: #include "access/htup.h" #include "fmgr.h" #include "plpython.h" #include "utils/typcache.h" plpy_curserobject.c: #include "access/xact.h" #include "catalog/pg_type.h" #include "mb/pg_wchar.h" #include "plpy_cursorobject.h" #include "plpy_elog.h" #include "plpy_main.h" #include "plpy_planobject.h" #include "plpy_procedure.h" #include "plpy_resultobject.h" #include "plpy_spi.h" #include "plpython.h" #include "utils/memutils.h" It strikes me as a uh, not good idea to undefine _POSIX_C_SOURCE, _XOPEN_SOURCE. The include order aspect was perhaps feasible when there just was plpython.c, but with the split into many different C files and many headers, it seems hard to maintain. There's a lot of violations afaics. The undefines were added in a11cf433413, the split in 147c2482542. Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-20 20:16:13 -0800, Andres Freund wrote: > On 2023-01-18 14:05:35 +0100, David Geier wrote: > > @Andres: will you take care of these changes and provide me with an updated > > patch set so I can rebase the RDTSC changes? > > Otherwise, I can also apply Tom suggestions to your patch set and send out > > the complete patch set. > > I'm planning to push most of my changes soon, had hoped to get to it a bit > sooner, but ... I pushed the int64-ification commits. > If you have time to look at the pg_test_timing part, it'd be > appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit > hesitant to push it. I haven't yet pushed the pg_test_timing (nor it's small prerequisite) patch. Thanks to Justin I've polished the pg_test_timing docs some. I've attached those two patches. Feel free to include them in your series if you want, then the CF entry (and thus cfbot) makes sense again... Greetings, Andres Freund >From 2546f3000455a7086ea930986b294fd79024ea59 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 20 Jan 2023 15:31:54 -0800 Subject: [PATCH v9 1/2] instr_time: Add INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop without having to convert into time units (which is costly). INSTR_TIME_IS_LT() can be used to check the loop condition. Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/include/portability/instr_time.h | 12 1 file changed, 12 insertions(+) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index cc85138e21f..aab80effb00 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -15,6 +15,8 @@ * * INSTR_TIME_IS_ZERO(t) is t equal to zero? 
* + * INSTR_TIME_IS_LT(x, y) x < y + * * INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too) * * INSTR_TIME_SET_CURRENT(t) set t to current time @@ -22,6 +24,8 @@ * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero, * evaluates to whether t changed * + * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds + * * INSTR_TIME_ADD(x, y)x += y * * INSTR_TIME_SUBTRACT(x, y) x -= y @@ -122,6 +126,9 @@ pg_clock_gettime_ns(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_clock_gettime_ns()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = NS_PER_S * (s)) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) (t).ticks) @@ -156,6 +163,9 @@ GetTimerFrequency(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_query_performance_counter()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = s * GetTimerFrequency()) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency( @@ -168,6 +178,8 @@ GetTimerFrequency(void) #define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0) +#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks) + #define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0) -- 2.38.0 >From a3149e3430c9594ca2ceb3039c954e729bd7c46e Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 16 Jan 2023 11:19:11 -0800 Subject: [PATCH v9 2/2] wip: report nanoseconds in pg_test_timing This commit also updates pg_test_timing's documentation: - compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of of statement with/without EXPLAIN ANALYZE - explain the 2x overhead (due to two timestamp acquisitions per row) - remove old section about old versions of linux - I couldn't update the numbers, and it's old enough nobody would care --- src/bin/pg_test_timing/pg_test_timing.c | 74 +-- doc/src/sgml/ref/pgtesttiming.sgml | 117 ++-- 2 files changed, 95 insertions(+), 96 deletions(-) diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c index c29d6f87629..e20718669a5 100644 --- a/src/bin/pg_test_timing/pg_test_timing.c +++ b/src/bin/pg_test_timing/pg_test_timing.c @@ -19,8 +19,8 @@ static void handle_args(int argc, char *argv[]); static uint64 test_timing(unsigned int duration); static void output(uint64 loop_count); -/* record duration in powers of 2 microseconds */ -long long int histogram[32]; +/* record duration in powers of 2 nanoseconds */ +uint64 histogram[64]; int main(int argc, char *argv[]) @@ -121,35 +121,48 @@ handle_args(int argc, char *argv[]) static uint64 test_timing(unsigned int duration) { - uint64 total_time; - int64 time_elapsed = 0; uint64 loop_count = 0; - uint64 prev, -cur; + instr_time until_time, +total_time; instr_time start_time, -end_time, -temp; - - total_time = duration > 0 ? duration * INT64CONST(100) : 0; +end_time; + instr_time cur; INSTR_TIME_SET_CURRENT(start_time); - cur = INSTR_TIME_GET_MICROSEC(start_time); - while (time_elapsed < total_time) + /* + * To reduce loop overhead, check loop condition in instr_time domain. + */ +
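To illustrate what the 0001 patch above is for (a usage sketch only, the real loop is in the attached pg_test_timing patch; duration is an assumed variable holding the test duration in seconds): the new macros let a time-bound loop compare against a precomputed end point in the instr_time domain instead of converting the current reading into time units on every iteration.

    instr_time  start_time,
                until_time,
                cur;
    uint64      loop_count = 0;
    unsigned int duration = 3;      /* seconds to run */

    INSTR_TIME_SET_CURRENT(start_time);
    INSTR_TIME_SET_SECONDS(until_time, duration);
    INSTR_TIME_ADD(until_time, start_time);

    do
    {
        INSTR_TIME_SET_CURRENT(cur);
        loop_count++;
    } while (INSTR_TIME_IS_LT(cur, until_time));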
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On 2023-01-20 22:50:37 -0600, Justin Pryzby wrote: > On Fri, Jan 20, 2023 at 04:40:32PM -0800, Andres Freund wrote: > > From 5a458d4584961dedd3f80a07d8faea66e57c5d94 Mon Sep 17 00:00:00 2001 > > From: Andres Freund > > Date: Mon, 16 Jan 2023 11:19:11 -0800 > > Subject: [PATCH v8 4/5] wip: report nanoseconds in pg_test_timing > > > > > - The i7-860 system measured runs the count query in 9.8 ms while > > - the EXPLAIN ANALYZE version takes 16.6 ms, each > > - processing just over 100,000 rows. That 6.8 ms difference means the > > timing > > - overhead per row is 68 ns, about twice what pg_test_timing estimated it > > - would be. Even that relatively small amount of overhead is making the > > fully > > - timed count statement take almost 70% longer. On more substantial > > queries, > > - the timing overhead would be less problematic. > > + The i9-9880H system measured shows an execution time of 4.116 ms for the > > + TIMING OFF query, and 6.965 ms for the > > + TIMING ON, each processing 100,000 rows. > > + > > + That 2.849 ms difference means the timing overhead per row is 28 ns. As > > + TIMING ON measures timestamps twice per row returned > > by > > + an executor node, the overhead is very close to what pg_test_timing > > + estimated it would be. > > + > > + more than what pg_test_timing estimated it would be. Even that > > relatively > > + small amount of overhead is making the fully timed count statement take > > + about 60% longer. On more substantial queries, the timing overhead > > would > > + be less problematic. > > I guess you intend to merge these two paragraphs ? Oops. I was intending to drop the last paragraph. Looking at the docs again I noticed that I needed to rephrase the 'acpi_pm' section further, as I'd left the "a small multiple of what's measured directly by this utility" language in there. Do the changes otherwise make sense? The "small multiple" stuff was just due to a) comparing "raw statement" with explain analyze b) not accounting for two timestamps being taken per row. I think it makes sense to remove the "jiffies" section - the output shown is way outdated. And I don't think the jiffies time counter is something one still sees in the wild, outside of bringing up a new cpu architecture or such. Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Fri, Jan 20, 2023 at 04:40:32PM -0800, Andres Freund wrote: > From 5a458d4584961dedd3f80a07d8faea66e57c5d94 Mon Sep 17 00:00:00 2001 > From: Andres Freund > Date: Mon, 16 Jan 2023 11:19:11 -0800 > Subject: [PATCH v8 4/5] wip: report nanoseconds in pg_test_timing > > - The i7-860 system measured runs the count query in 9.8 ms while > - the EXPLAIN ANALYZE version takes 16.6 ms, each > - processing just over 100,000 rows. That 6.8 ms difference means the > timing > - overhead per row is 68 ns, about twice what pg_test_timing estimated it > - would be. Even that relatively small amount of overhead is making the > fully > - timed count statement take almost 70% longer. On more substantial > queries, > - the timing overhead would be less problematic. > + The i9-9880H system measured shows an execution time of 4.116 ms for the > + TIMING OFF query, and 6.965 ms for the > + TIMING ON, each processing 100,000 rows. > + > + That 2.849 ms difference means the timing overhead per row is 28 ns. As > + TIMING ON measures timestamps twice per row returned by > + an executor node, the overhead is very close to what pg_test_timing > + estimated it would be. > + > + more than what pg_test_timing estimated it would be. Even that relatively > + small amount of overhead is making the fully timed count statement take > + about 60% longer. On more substantial queries, the timing overhead would > + be less problematic. I guess you intend to merge these two paragraphs ?
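For anyone checking the arithmetic in the quoted paragraph: 6.965 ms - 4.116 ms = 2.849 ms over 100,000 rows is about 28 ns of overhead per row, and since TIMING ON takes two timestamps per row returned by an executor node that is roughly 14 ns per individual timing call, the same order as pg_test_timing's per-loop figure (the 14 ns per-call number is an inference, not something the doc patch states).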
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-18 14:02:48 +0100, David Geier wrote: > On 1/16/23 18:37, Andres Freund wrote: > > I'm doubtful this is worth the complexity it incurs. By the time we convert > > out of the instr_time format, the times shouldn't be small enough that the > > accuracy is affected much. > > I don't feel strong about it and you have a point that we most likely only > convert ones we've accumulated a fair amount of cycles. I think we can avoid the issue another way. The inaccuracy comes from the cycles_to_sec ending up very small, right? Right now your patch has (and probably my old version similarly had): cycles_to_sec = 1.0 / (tsc_freq * 1000); I think it's better if we have one multiplier to convert cycles to nanoseconds - that'll be a double comparatively close to 1. We can use that to implement INSTR_TIME_GET_NANOSECONDS(). The conversion to microseconds then is just a division by 1000 (which most compilers convert into a multiplication/shift combo), and the conversions to milliseconds and seconds will be similar. Because we'll never "wrongly" go into the "huge number" or "very small number" ranges, that should provide sufficient precision? We'll of course still end up with a very small number when converting a few nanoseconds to seconds, but that's ok because it's the precision being asked for, instead of losing precision in some intermediate representation. > > Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC() > > actually accumulate themselves, and should instead keep things in the > > instr_time format and convert later. We'd win more accuracy / speed that > > way. > > > > I don't think the introduction of pg_time_usec_t was a great idea, but oh > > well. > Fully agreed. Why not replacing pg_time_usec_t with instr_time in a separate > patch? pgbench used to use instr_time, but it was replaced by somebody thinking the API is too cumbersome. Which I can't quite deny, even though I think the specific change isn't great. But yes, this should definitely be a separate patch. Greetings, Andres Freund
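Spelled out, the accessor scheme being described might look as follows (illustrative macro bodies only; the actual instr_time.h definitions differ, and ns_per_cycle is an assumed name):

    static double ns_per_cycle;     /* e.g. 0.4 for a 2.5 GHz TSC, versus
                                     * cycles_to_sec on the order of 1e-10 */

    #define INSTR_TIME_GET_NANOSEC(t)   ((int64) ((t).ticks * ns_per_cycle))
    #define INSTR_TIME_GET_MICROSEC(t)  (INSTR_TIME_GET_NANOSEC(t) / 1000)
    #define INSTR_TIME_GET_MILLISEC(t)  (INSTR_TIME_GET_NANOSEC(t) / 1000000.0)
    #define INSTR_TIME_GET_DOUBLE(t)    (INSTR_TIME_GET_NANOSEC(t) / 1000000000.0)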
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-18 14:05:35 +0100, David Geier wrote: > @Andres: will you take care of these changes and provide me with an updated > patch set so I can rebase the RDTSC changes? > Otherwise, I can also apply Tom suggestions to your patch set and send out > the complete patch set. I'm planning to push most of my changes soon, had hoped to get to it a bit sooner, but ... If you have time to look at the pg_test_timing part, it'd be appreciated. That's a bit larger, and nobody looked at it yet. So I'm a bit hesitant to push it. Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-20 07:43:00 +0100, David Geier wrote: > On 1/18/23 13:52, David Geier wrote: > > On 1/16/23 21:39, Pavel Stehule wrote: > > > > > > po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra > > > napsal: > > > > > > Hi, > > > > > > there's minor bitrot in the Mkvcbuild.pm change, making cfbot > > > unhappy. > > > > > > As for the patch, I don't have much comments. I'm wondering if > > > it'd be > > > useful to indicate which timing source was actually used for EXPLAIN > > > ANALYZE, say something like: > > > > > > Planning time: 0.197 ms > > > Execution time: 0.225 ms > > > Timing source: clock_gettime (or tsc) > > > > > > +1 > > > > I like the idea of exposing the timing source in the EXPLAIN ANALYZE > > output. > > It's a good tradeoff between inspectability and effort, given that RDTSC > > should always be better to use. > > If there are no objections I go this way. > Thinking about this a little more made me realize that this will cause > different pg_regress output depending on the platform. So if we go this > route we would at least need an option for EXPLAIN ANALYZE to disable it. Or > rather have it disabled by default and allow for enabling it. Thoughts? The elapsed time is already inherently unstable, so we shouldn't have any test output showing the time. But I doubt showing it in every explain is a good idea - we use instr_time in plenty of other places. Why show it in explain, but not in all those other places? Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-19 11:47:49 +0100, David Geier wrote: > > I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out > > how to do the cycles->nanosecond conversion with integer shift and multiply > > in > > the common case, which does show a noticable speedup. But that's for another > > day. > I also have code for that here. I decided against integrating it because we > don't convert frequently enough to make it matter. Or am I missing > something? We do currently do the conversion quite frequently. Admittedly I was partially motivated by trying to get the per-loop overhead in pg_test_timing down ;) But I think it's a real issue. Places where we do, but shouldn't, convert:
- ExecReScan() - quite painful, we can end up with a lot of those
- InstrStopNode() - adds a good bit of overhead to simple
- PendingWalStats.wal_write_time - this is particularly bad because it happens within very contended code
- calls to pgstat_count_buffer_read_time(), pgstat_count_buffer_write_time() - they can be very frequent
- pgbench.c, as we already discussed
- pg_stat_statements.c
- ...
These all will get a bit slower when moving to a "variable" frequency. What was your approach for avoiding the costly operation? I ended up with an integer multiplication + shift approximation for the floating point multiplication (which in turn uses the inverse of the division by the frequency). To allow for sufficient precision while also avoiding overflows, I had to make that branch conditional, with a slow path for large numbers of nanoseconds. > > I fought a bit with myself about whether to send those patches in this > > thread, > > because it'll take over the CF entry. But decided that it's ok, given that > > David's patches should be rebased over these anyway? > That's alright. > Though, I would hope we attempt to bring your patch set as well as the RDTSC > patch set in. I think it'd be great - but I'm not sure we're there yet, reliability and code-complexity wise. I think it might be worth making the rdtsc aspect somewhat measurable. E.g. allowing pg_test_timing to use both at the same time, and have it compare elapsed time with both sources of counters. Greetings, Andres Freund
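The "keep it in instr_time and convert later" pattern suggested elsewhere in the thread would look roughly like this for a write path (hypothetical sketch; keeping the accumulator in instr_time is exactly the proposed change, not what the tree did at the time):

    /* accumulate I/O time in ticks ... */
    static instr_time total_write_time;    /* zero-initialized accumulator */

    static void
    timed_write(int fd, const void *buf, size_t len, off_t offset)
    {
        instr_time  start,
                    elapsed;

        INSTR_TIME_SET_CURRENT(start);
        (void) pg_pwrite(fd, buf, len, offset);
        INSTR_TIME_SET_CURRENT(elapsed);
        INSTR_TIME_SUBTRACT(elapsed, start);
        INSTR_TIME_ADD(total_write_time, elapsed);
    }

    /* ... and convert only when the stats are reported */
    static uint64
    report_write_time_usec(void)
    {
        return INSTR_TIME_GET_MICROSEC(total_write_time);
    }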
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-20 22:27:07 -0500, Tom Lane wrote: > Andres Freund writes: > >> Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions > >> could give us the best of both worlds? > > > I tried that in the attached 0005. I found that it reads better if I also > > add > > INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001 > > instead, but I wanted to get agreement on it first. > > -1 from here. This forecloses the possibility that it's best to use more > than one assignment to initialize the value, and the code doesn't read > any better than it did before. I think it does read a bit better, but it's a pretty small improvement. So I'll leave this aspect be for now. Thanks for checking. Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Andres Freund writes: >> Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions >> could give us the best of both worlds? > I tried that in the attached 0005. I found that it reads better if I also add > INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001 > instead, but I wanted to get agreement on it first. -1 from here. This forecloses the possibility that it's best to use more than one assignment to initialize the value, and the code doesn't read any better than it did before. regards, tom lane
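For reference, the shape of the INSTR_TIME_ZERO()/INSTR_TIME_CURRENT() idea under discussion (hypothetical definitions for the unix code path; the actual 0005 patch mentioned in the message below is not included in the quoted text, so these bodies are guesses):

    #define INSTR_TIME_ZERO()       ((instr_time) {.ticks = 0})
    #define INSTR_TIME_CURRENT()    pg_clock_gettime_ns()

    /* allowing initialization right at the definition, e.g.: */
    instr_time  io_start = track_io_timing ? INSTR_TIME_CURRENT() : INSTR_TIME_ZERO();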
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-17 10:50:53 -0800, Andres Freund wrote: > On 2023-01-17 12:26:57 -0500, Tom Lane wrote: > > > 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to > > > warn > > > Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and > > > just allow to assign 0. > > > > I think it's probably wise to keep the macro. If we ever rethink this > > again, we'll be glad we kept it. Similarly, IS_ZERO is a good idea > > even if it would work with just compare-to-zero. > > Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions > could give us the best of both worlds? I tried that in the attached 0005. I found that it reads better if I also add INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001 instead, but I wanted to get agreement on it first. Comments? > > I'm almost tempted to suggest you define instr_time as a struct with a > > uint64 field, just to help keep us honest about that. > > I can see that making sense. Unless somebody pipes up with opposition to that > plan soon, I'll see how it goes. Done in the attached. I think it looks good. Actually found a type confusion buglet in 0004, so the type safety benefit is noticable. It does require a new INSTR_TIME_IS_LT() for the loop exit condition in 0004, but that seems fine. Besides cosmetic stuff I also added back the cast to double in window's INSTR_TIME_GET_NANOSEC() - I think there's an overflow danger without it. We should make this faster by pre-computing (double) NS_PER_S / GetTimerFrequency() once, as that'd avoid doing the the slow division on every conversion. But that's an old issue and thus better tackled separately. Greetings, Andres Freund >From dfea7f159b79aaf1fb6f4c8617614f6fe5c1176f Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 16 Jan 2023 10:04:42 -0800 Subject: [PATCH v8 1/5] Zero initialize instr_time uses causing compiler warnings These are all not necessary from a correctness POV. However, in a subsequent patch instr_time will be simplified to an int64, at which point gcc would otherwise start to warn about the changed places. 
Author: Reviewed-by: Discussion: https://postgr.es/m/20230116023639.rn36vf6ajqmfc...@awork3.anarazel.de Backpatch: --- src/backend/access/transam/xlog.c | 4 src/backend/storage/buffer/bufmgr.c | 4 src/backend/storage/file/buffile.c | 4 src/backend/storage/ipc/latch.c | 2 ++ src/bin/psql/common.c | 6 ++ 5 files changed, 20 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index cc0d9a05d9f..fb4c860bdea 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2191,6 +2191,8 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) /* Measure I/O timing to write WAL data */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); +else + INSTR_TIME_SET_ZERO(start); pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE); written = pg_pwrite(openLogFile, from, nleft, startoffset); @@ -8151,6 +8153,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) /* Measure I/O timing to sync the WAL file */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); + else + INSTR_TIME_SET_ZERO(start); pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC); switch (sync_method) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 8075828e8a6..800a4248c95 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1017,6 +1017,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else +INSTR_TIME_SET_ZERO(io_start); smgrread(smgr, forkNum, blockNum, (char *) bufBlock); @@ -2902,6 +2904,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); /* * bufToWrite is either the shared buffer or a copy, as appropriate. diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c index c5464b6aa62..0a51624df3b 100644 --- a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -446,6 +446,8 @@ BufFileLoadBuffer(BufFile *file) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); /* * Read whatever we can get, up to a full bufferload. @@ -525,6 +527,8 @@ BufFileDumpBuffer(BufFile *file) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); bytestowrite = FileWrite(thisfile, file->buffer.data + wpos, diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index d79d71a8515..f4123e7de7e 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On 1/20/23 07:43, David Geier wrote: > On 1/18/23 13:52, David Geier wrote: >> On 1/16/23 21:39, Pavel Stehule wrote: >>> >>> po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra >>> napsal: >>> >>> Hi, >>> >>> there's minor bitrot in the Mkvcbuild.pm change, making cfbot >>> unhappy. >>> >>> As for the patch, I don't have much comments. I'm wondering if >>> it'd be >>> useful to indicate which timing source was actually used for EXPLAIN >>> ANALYZE, say something like: >>> >>> Planning time: 0.197 ms >>> Execution time: 0.225 ms >>> Timing source: clock_gettime (or tsc) >>> >>> +1 >> >> I like the idea of exposing the timing source in the EXPLAIN ANALYZE >> output. >> It's a good tradeoff between inspectability and effort, given that >> RDTSC should always be better to use. >> If there are no objections I go this way. > Thinking about this a little more made me realize that this will cause > different pg_regress output depending on the platform. So if we go this > route we would at least need an option for EXPLAIN ANALYZE to disable > it. Or rather have it disabled by default and allow for enabling it. > Thoughts? > What about only showing it for VERBOSE mode? I don't think there are very many tests doing EXPLAIN (ANALYZE, VERBOSE) - a quick grep found one such place in partition_prune.sql. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On 1/18/23 13:52, David Geier wrote: On 1/16/23 21:39, Pavel Stehule wrote: po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra napsal: Hi, there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy. As for the patch, I don't have much comments. I'm wondering if it'd be useful to indicate which timing source was actually used for EXPLAIN ANALYZE, say something like: Planning time: 0.197 ms Execution time: 0.225 ms Timing source: clock_gettime (or tsc) +1 I like the idea of exposing the timing source in the EXPLAIN ANALYZE output. It's a good tradeoff between inspectability and effort, given that RDTSC should always be better to use. If there are no objections I go this way. Thinking about this a little more made me realize that this will cause different pg_regress output depending on the platform. So if we go this route we would at least need an option for EXPLAIN ANALYZE to disable it. Or rather have it disabled by default and allow for enabling it. Thoughts? -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi Andres, I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out how to do the cycles->nanosecond conversion with integer shift and multiply in the common case, which does show a noticable speedup. But that's for another day. I also have code for that here. I decided against integrating it because we don't convert frequently enough to make it matter. Or am I missing something? I fought a bit with myself about whether to send those patches in this thread, because it'll take over the CF entry. But decided that it's ok, given that David's patches should be rebased over these anyway? That's alright. Though, I would hope we attempt to bring your patch set as well as the RDTSC patch set in. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, @Andres: will you take care of these changes and provide me with an updated patch set so I can rebase the RDTSC changes? Otherwise, I can also apply Tom's suggestions to your patch set and send out the complete patch set. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On 1/16/23 18:37, Andres Freund wrote: Hi, On 2023-01-02 14:28:20 +0100, David Geier wrote: I'm doubtful this is worth the complexity it incurs. By the time we convert out of the instr_time format, the times shouldn't be small enough that the accuracy is affected much. I don't feel strong about it and you have a point that we most likely only convert once we've accumulated a fair amount of cycles. Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC() actually accumulate themselves, and should instead keep things in the instr_time format and convert later. We'd win more accuracy / speed that way. I don't think the introduction of pg_time_usec_t was a great idea, but oh well. Fully agreed. Why not replace pg_time_usec_t with instr_time in a separate patch? I haven't looked carefully enough if all occurrences could be sanely replaced but at least the ones that accumulate time seem good starting points. Additionally, I initialized a few variables of type instr_time which otherwise resulted in warnings due to use of potentially uninitialized variables. Unless we decide, as I suggested downthread, that we deprecate INSTR_TIME_SET_ZERO(), that's unfortunately not the right fix. I've a similar patch that adds all the necessary INSTR_TIME_SET_ZERO() calls. I don't feel strong about it, but like Tom tend towards keeping the initialization macro. Thanks for improving on the first patch and fixing these issues in a better way. What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that it's consistent with the _MILLISEC() and _MICROSEC() variants? The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants return double. This seems error prone. What about renaming the function or also have the function return a double and cast where necessary at the call site? I think those should be a separate discussion / patch. OK. I'll propose follow-on patches once we're done with the ones at hand. I'll then rebase the RDTSC patches on your patch set. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On 1/16/23 21:39, Pavel Stehule wrote: po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra napsal: Hi, there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy. As for the patch, I don't have much comments. I'm wondering if it'd be useful to indicate which timing source was actually used for EXPLAIN ANALYZE, say something like: Planning time: 0.197 ms Execution time: 0.225 ms Timing source: clock_gettime (or tsc) There has been a proposal to expose this as a GUC (or perhaps as explain option), to allow users to pick what timing source to use. I wouldn't go that far - AFAICS is this is meant to be universally better when available. But knowing which source was used seems useful. +1 Thanks for looking at the patch. I'll fix the merge conflict. I like the idea of exposing the timing source in the EXPLAIN ANALYZE output. It's a good tradeoff between inspectability and effort, given that RDTSC should always be better to use. If there are no objections I go this way. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-17 12:26:57 -0500, Tom Lane wrote: > Andres Freund writes: > > Here's an updated version of the move to representing instr_time as > > nanoseconds. It's now split into a few patches: > > I took a quick look through this. Thanks! > > 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to > > warn > > Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and > > just allow to assign 0. > > I think it's probably wise to keep the macro. If we ever rethink this > again, we'll be glad we kept it. Similarly, IS_ZERO is a good idea > even if it would work with just compare-to-zero. Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions could give us the best of both worlds? > I'm almost tempted to suggest you define instr_time as a struct with a > uint64 field, just to help keep us honest about that. I can see that making sense. Unless somebody pipes up with opposition to that plan soon, I'll see how it goes. > > 0003) Add INSTR_TIME_SET_SECOND() > > This is used in 0004. Just allows setting an instr_time to a time in > > seconds, allowing for a cheaper loop exit condition in 0004. > > Code and comments are inconsistent about whether it's SET_SECOND or > SET_SECONDS. I think I prefer the latter, but don't care that much. That's probably because I couldn't decide... So I'll go with your preference. > > 0004) report nanoseconds in pg_test_timing > > Didn't examine 0004 in any detail, but the others look good to go > other than these nits. Thanks for looking! Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Andres Freund writes: > Here's an updated version of the move to representing instr_time as > nanoseconds. It's now split into a few patches: I took a quick look through this. > 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to > warn > Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and > just allow to assign 0. I think it's probably wise to keep the macro. If we ever rethink this again, we'll be glad we kept it. Similarly, IS_ZERO is a good idea even if it would work with just compare-to-zero. I'm almost tempted to suggest you define instr_time as a struct with a uint64 field, just to help keep us honest about that. > 0003) Add INSTR_TIME_SET_SECOND() > This is used in 0004. Just allows setting an instr_time to a time in > seconds, allowing for a cheaper loop exit condition in 0004. Code and comments are inconsistent about whether it's SET_SECOND or SET_SECONDS. I think I prefer the latter, but don't care that much. > 0004) report nanoseconds in pg_test_timing Didn't examine 0004 in any detail, but the others look good to go other than these nits. regards, tom lane
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-17 08:46:12 -0500, Robert Haas wrote: > On Fri, Jan 13, 2023 at 2:56 PM Andres Freund wrote: > > Does anybody see a reason to not move forward with this aspect? We do a fair > > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by > > just using nanoseconds. We'd also save memory in BufferUsage (144-122 > > bytes), > > Instrumentation (16 bytes saved in Instrumentation itself, 32 via > > BufferUsage). Here's an updated version of the move to representing instr_time as nanoseconds. It's now split into a few patches: 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to warn Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and just allow to assign 0. 0002) Convert instr_time to uint64 This is the cleaned up version of the prior patch. The main change is that it deduplicated a lot of the code between the architectures. 0003) Add INSTR_TIME_SET_SECOND() This is used in 0004. Just allows setting an instr_time to a time in seconds, allowing for a cheaper loop exit condition in 0004. 0004) report nanoseconds in pg_test_timing I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out how to do the cycles->nanosecond conversion with integer shift and multiply in the common case, which does show a noticable speedup. But that's for another day. I fought a bit with myself about whether to send those patches in this thread, because it'll take over the CF entry. But decided that it's ok, given that David's patches should be rebased over these anyway? > I read through 0001 and it seems basically fine to me. Comments: > > 1. pg_clock_gettime_ns() doesn't follow pgindent conventions. Fixed. > 2. I'm not entirely sure that the new .?S_PER_.?S macros are > worthwhile but maybe they are, and in any case I don't care very much. There's now fewer. But those I'd like to keep. I just end up counting digits manually way too many times. > 3. I've always found 'struct timespec' to be pretty annoying > notationally, so I like the fact that this patch would reduce use of > it. Same. Greetings, Andres Freund >From c1024a9dfa7f5645200b7fa68e8bce5561c9cee0 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 16 Jan 2023 10:04:42 -0800 Subject: [PATCH v7 1/4] Zero initialize instr_time uses causing compiler warnings These are all not necessary from a correctness POV. However, in a subsequent patch instr_time will be simplified to an int64, at which point gcc would otherwise start to warn about the changed places. 
Author: Reviewed-by: Discussion: https://postgr.es/m/20230116023639.rn36vf6ajqmfc...@awork3.anarazel.de Backpatch: --- src/backend/access/transam/xlog.c | 4 src/backend/storage/buffer/bufmgr.c | 4 src/backend/storage/file/buffile.c | 4 src/bin/psql/common.c | 6 ++ 4 files changed, 18 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8f47fb75700..7d65b1d4159 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2191,6 +2191,8 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) /* Measure I/O timing to write WAL data */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); +else + INSTR_TIME_SET_ZERO(start); pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE); written = pg_pwrite(openLogFile, from, nleft, startoffset); @@ -8150,6 +8152,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) /* Measure I/O timing to sync the WAL file */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); + else + INSTR_TIME_SET_ZERO(start); pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC); switch (sync_method) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 8075828e8a6..800a4248c95 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1017,6 +1017,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else +INSTR_TIME_SET_ZERO(io_start); smgrread(smgr, forkNum, blockNum, (char *) bufBlock); @@ -2902,6 +2904,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); /* * bufToWrite is either the shared buffer or a copy, as appropriate. diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c index c5464b6aa62..0a51624df3b 100644 --- a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -446,6 +446,8 @@ BufFileLoadBuffer(BufFile *file) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); /* * Read whatever we can get, up to a full bufferload. @@ -525,6 +527,8 @@ BufFileDumpBuffer(BufFile
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Fri, Jan 13, 2023 at 2:56 PM Andres Freund wrote: > Does anybody see a reason to not move forward with this aspect? We do a fair > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by > just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes), > Instrumentation (16 bytes saved in Instrumentation itself, 32 via > BufferUsage). I read through 0001 and it seems basically fine to me. Comments: 1. pg_clock_gettime_ns() doesn't follow pgindent conventions. 2. I'm not entirely sure that the new .?S_PER_.?S macros are worthwhile but maybe they are, and in any case I don't care very much. 3. I've always found 'struct timespec' to be pretty annoying notationally, so I like the fact that this patch would reduce use of it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra < tomas.von...@enterprisedb.com> napsal: > Hi, > > there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy. > > As for the patch, I don't have much comments. I'm wondering if it'd be > useful to indicate which timing source was actually used for EXPLAIN > ANALYZE, say something like: > > Planning time: 0.197 ms > Execution time: 0.225 ms > Timing source: clock_gettime (or tsc) > > There has been a proposal to expose this as a GUC (or perhaps as explain > option), to allow users to pick what timing source to use. I wouldn't go > that far - AFAICS is this is meant to be universally better when > available. But knowing which source was used seems useful. > +1 Pavel > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > >
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy. As for the patch, I don't have much comments. I'm wondering if it'd be useful to indicate which timing source was actually used for EXPLAIN ANALYZE, say something like: Planning time: 0.197 ms Execution time: 0.225 ms Timing source: clock_gettime (or tsc) There has been a proposal to expose this as a GUC (or perhaps as explain option), to allow users to pick what timing source to use. I wouldn't go that far - AFAICS is this is meant to be universally better when available. But knowing which source was used seems useful. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-02 14:28:20 +0100, David Geier wrote: > I also somewhat improved the accuracy of the cycles to milli- and > microseconds conversion functions by having two more multipliers with higher > precision. For microseconds we could also keep the computation integer-only. > I'm wondering what to best do for seconds and milliseconds. I'm currently > leaning towards just keeping it as is, because the durations measured and > converted are usually long enough that precision shouldn't be a problem. I'm doubtful this is worth the complexity it incurs. By the time we convert out of the instr_time format, the times shouldn't be small enough that the accuracy is affected much. Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC() actually accumulate themselves, and should instead keep things in the instr_time format and convert later. We'd win more accuracy / speed that way. I don't think the introduction of pg_time_usec_t was a great idea, but oh well. > Additionally, I initialized a few variables of type instr_time which > otherwise resulted in warnings due to use of potentially uninitialized > variables. Unless we decide, as I suggested downthread, that we deprecate INSTR_TIME_SET_ZERO(), that's unfortunately not the right fix. I've a similar patch that adds all the necesarry INSTR_TIME_SET_ZERO() calls. > What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that > it's consistent with the _MILLISEC() and _MICROSEC() variants? > The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants > return double. This seems error prone. What about renaming the function or > also have the function return a double and cast where necessary at the call > site? I think those should be a separate discussion / patch. Greetings, Andres Freund
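As a concrete illustration of the accumulate-then-convert pattern described above, a small sketch using the existing instr_time macros could look like this; the workload function and variable names are hypothetical, only the INSTR_TIME_* macros are the real ones from src/include/portability/instr_time.h:

#include "postgres.h"
#include "portability/instr_time.h"

extern void do_work(void);          /* hypothetical workload */

static uint64
total_work_time_us(int iterations)
{
    instr_time  total,
                start,
                stop;

    INSTR_TIME_SET_ZERO(total);

    for (int i = 0; i < iterations; i++)
    {
        INSTR_TIME_SET_CURRENT(start);
        do_work();
        INSTR_TIME_SET_CURRENT(stop);
        /* keep accumulating in instr_time form ... */
        INSTR_TIME_ACCUM_DIFF(total, stop, start);
    }

    /* ... and convert to a time unit only once, at the end */
    return INSTR_TIME_GET_MICROSEC(total);
}

Converting inside the loop instead would pay the conversion cost on every iteration and round each sample individually, which is the accuracy/speed loss mentioned above.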
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-13 11:55:47 -0800, Andres Freund wrote: > Does anybody see a reason to not move forward with this aspect? We do a fair > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by > just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes), > Instrumentation (16 bytes saved in Instrumentation itself, 32 via > BufferUsage). This actually under-counted the benefits, because we have two BufferUsage and two WalUsage in Instrumentation. Before: /* size: 448, cachelines: 7, members: 20 */ /* sum members: 445, holes: 1, sum holes: 3 */ After /* size: 368, cachelines: 6, members: 20 */ /* sum members: 365, holes: 1, sum holes: 3 */ The difference in the number of instructions in InstrStopNode is astounding: 1016 instructions with timespec, 96 instructions with nanoseconds. Some of that is the simpler data structure, some because the compiler now can auto-vectorize the four INSTR_TIME_ACCUM_DIFF in BufferUsageAccumDiff into one. We probably should convert Instrumentation->firsttuple to a instr_time now as well, no point in having the code for conversion to double in the hot routine, that can easily happen in explain. But that's for a later patch. I suggested downthread that we should convert the win32 implementation to be more similar to the unix-nanoseconds representation. A blind conversion looks good, and lets us share a number of macros. I wonder if we should deprecate INSTR_TIME_IS_ZERO()/INSTR_TIME_SET_ZERO() and allow 0 to be used instead. Not needing INSTR_TIME_SET_ZERO() allows variable definitions to initialize the value, which does avoid some unnecessarily awkward code. Alternatively we could introduce INSTR_TIME_ZERO() for that purpose? Greetings, Andres Freund
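To make the INSTR_TIME_ZERO() idea floated above concrete, a minimal sketch with the integer representation could look as follows; the macro name matches the proposal, everything else is illustrative:

#include <stdint.h>

typedef int64_t instr_time;

#define INSTR_TIME_ZERO()       ((instr_time) 0)
#define INSTR_TIME_IS_ZERO(t)   ((t) == 0)

static void
example(void)
{
    /* can be initialized right in the definition ... */
    instr_time  io_start = INSTR_TIME_ZERO();

    /* ... instead of needing a separate INSTR_TIME_SET_ZERO(io_start) statement */
    if (INSTR_TIME_IS_ZERO(io_start))
    {
        /* timing disabled, nothing to do */
    }
}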
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Wed, 4 Jan 2023 at 17:32, David Geier wrote: > > I fixed the compilation error on CFBot. > I missed adding instr_time.c to the Meson makefile. > New patch set attached. The patch does not apply on top of HEAD as in [1], please post a rebased patch: === Applying patches on top of PostgreSQL commit ID ff23b592ad6621563d3128b26860bcb41daf9542 === === applying patch ./0002-Use-CPU-reference-cycles-via-RDTSC-to-measure-time-v6.patch patching file src/tools/msvc/Mkvcbuild.pm Hunk #1 FAILED at 135. 1 out of 1 hunk FAILED -- saving rejects to file src/tools/msvc/Mkvcbuild.pm.rej [1] - http://cfbot.cputube.org/patch_41_3751.log Regards, Vignesh
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-13 15:25:16 -0500, Tom Lane wrote: > Andres Freund writes: > > Does anybody see a reason to not move forward with this aspect? We do a fair > > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by > > just using nanoseconds. > > Cheaper, and perhaps more accurate too? Don't recall if we have any code > paths where the input timestamps are likely to be better-than-microsecond, > but surely that's coming someday. instr_time on !WIN32 uses struct timespec, so we already should have nanosecond precision available. IOW, we could add an INSTR_TIME_GET_NANOSEC today. Or am I misunderstanding what you mean? > I'm unsure that we want to deal with rdtsc's vagaries in general, but > no objection to changing instr_time. Cool. Looking at the instr_time.h part of the change, I think it should go further, and basically do the same thing in the WIN32 path. The only part that needs to be win32 specific is INSTR_TIME_SET_CURRENT(). That'd reduce duplication a good bit. Greetings, Andres Freund
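A rough sketch of what "doing the same thing in the WIN32 path" could mean: only the acquisition of the counter differs per platform, while the arithmetic macros are shared. pg_clock_gettime_ns() is the helper from the posted patches; the Windows counterpart is a hypothetical name chosen here for illustration, and the macros are simplified compared to the real header:

typedef int64 instr_time;       /* nanoseconds on POSIX, performance-counter ticks on Windows */

#ifndef WIN32
#define INSTR_TIME_SET_CURRENT(t)   ((t) = pg_clock_gettime_ns())
#else
#define INSTR_TIME_SET_CURRENT(t)   ((t) = pg_query_performance_counter())
#endif

/* everything below no longer needs a platform-specific variant */
#define INSTR_TIME_ADD(x, y)            ((x) += (y))
#define INSTR_TIME_SUBTRACT(x, y)       ((x) -= (y))
#define INSTR_TIME_ACCUM_DIFF(x, y, z)  ((x) += (y) - (z))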
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Andres Freund writes: > On 2023-01-04 13:02:05 +0100, David Geier wrote: >> Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's >> cheaper. > Does anybody see a reason to not move forward with this aspect? We do a fair > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by > just using nanoseconds. Cheaper, and perhaps more accurate too? Don't recall if we have any code paths where the input timestamps are likely to be better-than-microsecond, but surely that's coming someday. I'm unsure that we want to deal with rdtsc's vagaries in general, but no objection to changing instr_time. regards, tom lane
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-04 13:02:05 +0100, David Geier wrote: > From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001 > From: David Geier > Date: Thu, 17 Nov 2022 10:22:01 +0100 > Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's > cheaper. Does anybody see a reason to not move forward with this aspect? We do a fair amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes), Instrumentation (16 bytes saved in Instrumentation itself, 32 via BufferUsage). While the range of instr_time storing nanoseconds wouldn't be good enough for a generic timestamp facility (hence using microsecs for Timestamp), the range seems plenty for its use of measuring runtime: ((2 ** 63) - 1) / ((10 ** 9) * 60 * 60 * 24 * 365) = ~292 years Of course, when using CLOCK_REALTIME, this is relative to 1970-01-01, so just 239 years. It could theoretically be a different story, if we stored instr_time's on disk. But we don't, they're ephemeral. This doesn't buy a whole lot of performance - the bottleneck is the actual timestamp computation. But in a query with not much else going on, it's visible and reproducible. It's, unsurprisingly, a lot easier to see when using BUFFERS. For both timespec and nanosecond, I measured three server starts, and for each started server three executions of pgbench -n -Mprepared -c1 -P5 -T15 -f <(echo "EXPLAIN (ANALYZE, BUFFERS) SELECT generate_series(1, 1000) OFFSET 1000;") the best result is: timespec: 1073.431 nanosec: 957.532 a ~10% difference Greetings, Andres Freund
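For anyone wanting to double-check the range arithmetic above, a trivial standalone program:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* nanoseconds representable in a signed 64 bit value, expressed in years */
    double  years = (double) INT64_MAX / (1e9 * 60 * 60 * 24 * 365);

    printf("%.1f years\n", years);      /* prints roughly 292.5 */
    return 0;
}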
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, CFBot shows some compilation errors as in [1], please post an updated version for the same: 09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning: relocation against `cycles_to_sec' in read-only section `.text' [09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in function `pg_clock_gettime_ref_cycles': [09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119: undefined reference to `use_rdtsc' [09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in function `test_timing': [09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135: undefined reference to `pg_clock_gettime_initialize_rdtsc' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137: undefined reference to `cycles_to_us' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146: undefined reference to `cycles_to_us' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169: undefined reference to `cycles_to_us' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176: undefined reference to `cycles_to_sec' [09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE [09:08:12.525] collect2: error: ld returned 1 exit status [1] - https://cirrus-ci.com/task/5375312565895168 Regards, Vignesh I fixed the compilation error on CFBot. I missed adding instr_time.c to the Meson makefile. New patch set attached. -- David Geier (ServiceNow) From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 10:22:01 +0100 Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's cheaper. 
--- src/include/portability/instr_time.h | 62 1 file changed, 26 insertions(+), 36 deletions(-) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index 9ea1a68bd9..c64f21eb53 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -80,63 +80,53 @@ #define PG_INSTR_CLOCK CLOCK_REALTIME #endif -typedef struct timespec instr_time; +typedef int64 instr_time; +#define NS_PER_S INT64CONST(10) +#define US_PER_S INT64CONST(100) +#define MS_PER_S INT64CONST(1000) -#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0) +#define NS_PER_MS INT64CONST(100) +#define NS_PER_US INT64CONST(1000) -#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0) +#define INSTR_TIME_IS_ZERO(t) ((t) == 0) -#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t))) +#define INSTR_TIME_SET_ZERO(t) ((t) = 0) + +static inline instr_time pg_clock_gettime_ns(void) +{ + struct timespec tmp; + + clock_gettime(PG_INSTR_CLOCK, ); + + return tmp.tv_sec * NS_PER_S + tmp.tv_nsec; +} + +#define INSTR_TIME_SET_CURRENT(t) \ + (t) = pg_clock_gettime_ns() #define INSTR_TIME_ADD(x,y) \ do { \ - (x).tv_sec += (y).tv_sec; \ - (x).tv_nsec += (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y); \ } while (0) #define INSTR_TIME_SUBTRACT(x,y) \ do { \ - (x).tv_sec -= (y).tv_sec; \ - (x).tv_nsec -= (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ + (x) -= (y); \ } while (0) #define INSTR_TIME_ACCUM_DIFF(x,y,z) \ do { \ - (x).tv_sec += (y).tv_sec - (z).tv_sec; \ - (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \ - /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y) - (z); \ } while (0) #define INSTR_TIME_GET_DOUBLE(t) \ - (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 10.0) + ((double) (t) / NS_PER_S) #define INSTR_TIME_GET_MILLISEC(t) \ - (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 100.0) + ((double) (t) / NS_PER_MS) #define INSTR_TIME_GET_MICROSEC(t) \ - (((uint64) (t).tv_sec * (uint64) 100) + (uint64) ((t).tv_nsec / 1000)) + ((double) (t) / NS_PER_US) #else /* WIN32 */ -- 2.34.1 From 190ca09566eabb017ed25b1512225173ca47fb88 Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 13:03:59 +0100 Subject: [PATCH 2/3] Use CPU reference cycles, via RDTSC, to measure time for instrumentation. For now this is only enabled on Linux/x86 when the system clocksource is marked tsc as well, as determined at runtime. This way we can rely on the Linux kernel to make a decision whether tsc is invariant and usable on the current CPU
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Tue, 3 Jan 2023 at 14:08, David Geier wrote: > > Hi Lukas, > > On 1/2/23 20:50, Lukas Fittl wrote: > > Thanks for continuing to work on this patch, and my apologies for > > silence on the patch. > > It would be great if you could review it. > Please also share your thoughts around exposing the used clock source as > GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS(). > > I rebased again on master because of [1]. Patches attached. > > > > > Its been hard to make time, and especially so because I typically > > develop on an ARM-based macOS system where I can't test this directly > > - hence my tests with virtualized EC2 instances, where I ran into the > > timing oddities. > That's good and bad. Bad to do the development and good to test the > implementation on more virtualized setups; given that I also encountered > "interesting" behavior on VMWare (see my previous mails). > > > > On Mon, Jan 2, 2023 at 5:28 AM David Geier wrote: > > > > The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other > > variants > > return double. This seems error prone. What about renaming the > > function > > or also have the function return a double and cast where necessary at > > the call site? > > > > > > Minor note, but in my understanding using a uint64 (where we can) is > > faster for any simple arithmetic we do with the values. > > That's true. So the argument could be that for seconds and milliseconds > we want the extra precision while microseconds are precise enough. > Still, we could also make the seconds and milliseconds conversion code > integer only and e.g. return two integers with the value before and > after the comma. FWICS, the functions are nowhere used in performance > critical code, so it doesn't really make a difference performance-wise. > > > > > +1, and feel free to carry this patch forward - I'll try to make an > > effort to review my earlier testing issues again, as well as your > > later improvements to the patch. > Moved to the current commit fest. Will you become reviewer? > > > > Also, FYI, I just posted an alternate idea for speeding up EXPLAIN > > ANALYZE with timing over in [0], using a sampling-based approach to > > reduce the timing overhead. > > Interesting idea. I'll reply with some thoughts on the corresponding thread. 
> > [1] > https://www.postgresql.org/message-id/flat/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw%40mail.gmail.com CFBot shows some compilation errors as in [1], please post an updated version for the same: 09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning: relocation against `cycles_to_sec' in read-only section `.text' [09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in function `pg_clock_gettime_ref_cycles': [09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119: undefined reference to `use_rdtsc' [09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in function `test_timing': [09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135: undefined reference to `pg_clock_gettime_initialize_rdtsc' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137: undefined reference to `cycles_to_us' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146: undefined reference to `cycles_to_us' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169: undefined reference to `cycles_to_us' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176: undefined reference to `cycles_to_sec' [09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE [09:08:12.525] collect2: error: ld returned 1 exit status [1] - https://cirrus-ci.com/task/5375312565895168 Regards, Vignesh
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi Lukas, On 1/2/23 20:50, Lukas Fittl wrote: Thanks for continuing to work on this patch, and my apologies for silence on the patch. It would be great if you could review it. Please also share your thoughts around exposing the used clock source as GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS(). I rebased again on master because of [1]. Patches attached. Its been hard to make time, and especially so because I typically develop on an ARM-based macOS system where I can't test this directly - hence my tests with virtualized EC2 instances, where I ran into the timing oddities. That's good and bad. Bad to do the development and good to test the implementation on more virtualized setups; given that I also encountered "interesting" behavior on VMWare (see my previous mails). On Mon, Jan 2, 2023 at 5:28 AM David Geier wrote: The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants return double. This seems error prone. What about renaming the function or also have the function return a double and cast where necessary at the call site? Minor note, but in my understanding using a uint64 (where we can) is faster for any simple arithmetic we do with the values. That's true. So the argument could be that for seconds and milliseconds we want the extra precision while microseconds are precise enough. Still, we could also make the seconds and milliseconds conversion code integer only and e.g. return two integers with the value before and after the comma. FWICS, the functions are nowhere used in performance critical code, so it doesn't really make a difference performance-wise. +1, and feel free to carry this patch forward - I'll try to make an effort to review my earlier testing issues again, as well as your later improvements to the patch. Moved to the current commit fest. Will you become reviewer? Also, FYI, I just posted an alternate idea for speeding up EXPLAIN ANALYZE with timing over in [0], using a sampling-based approach to reduce the timing overhead. Interesting idea. I'll reply with some thoughts on the corresponding thread. [1] https://www.postgresql.org/message-id/flat/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw%40mail.gmail.com -- David Geier (ServiceNow) From f63527c8e4b3b0b71ffacaa111dd93325d973432 Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 10:22:01 +0100 Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's cheaper. 
--- src/include/portability/instr_time.h | 62 1 file changed, 26 insertions(+), 36 deletions(-) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index 9ea1a68bd9..c64f21eb53 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -80,63 +80,53 @@ #define PG_INSTR_CLOCK CLOCK_REALTIME #endif -typedef struct timespec instr_time; +typedef int64 instr_time; +#define NS_PER_S INT64CONST(10) +#define US_PER_S INT64CONST(100) +#define MS_PER_S INT64CONST(1000) -#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0) +#define NS_PER_MS INT64CONST(100) +#define NS_PER_US INT64CONST(1000) -#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0) +#define INSTR_TIME_IS_ZERO(t) ((t) == 0) -#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t))) +#define INSTR_TIME_SET_ZERO(t) ((t) = 0) + +static inline instr_time pg_clock_gettime_ns(void) +{ + struct timespec tmp; + + clock_gettime(PG_INSTR_CLOCK, ); + + return tmp.tv_sec * NS_PER_S + tmp.tv_nsec; +} + +#define INSTR_TIME_SET_CURRENT(t) \ + (t) = pg_clock_gettime_ns() #define INSTR_TIME_ADD(x,y) \ do { \ - (x).tv_sec += (y).tv_sec; \ - (x).tv_nsec += (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y); \ } while (0) #define INSTR_TIME_SUBTRACT(x,y) \ do { \ - (x).tv_sec -= (y).tv_sec; \ - (x).tv_nsec -= (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ + (x) -= (y); \ } while (0) #define INSTR_TIME_ACCUM_DIFF(x,y,z) \ do { \ - (x).tv_sec += (y).tv_sec - (z).tv_sec; \ - (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \ - /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y) - (z); \ } while (0) #define INSTR_TIME_GET_DOUBLE(t) \ - (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 10.0) + ((double) (t) / NS_PER_S) #define INSTR_TIME_GET_MILLISEC(t) \ - (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 100.0) + ((double) (t) / NS_PER_MS) #define INSTR_TIME_GET_MICROSEC(t) \ -
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Fri, Jul 15, 2022 at 11:21 AM Maciek Sakrejda wrote: > On Fri, Jul 1, 2022 at 10:26 AM Andres Freund wrote: > > On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote: > >... > > > Known WIP problems with this patch version: > > > > > > * There appears to be a timing discrepancy I haven't yet worked out, where > > > the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is > > > reporting. With Andres' earlier test case, I'm seeing a consistent > > > ~700ms > > > higher for \timing than for the EXPLAIN ANALYZE time reported on the > > > server > > > side, only when rdtsc measurement is used -- its likely there is a > > > problem > > > somewhere with how we perform the cycles to time conversion > > > > Could you explain a bit more what you're seeing? I just tested your patches > > and didn't see that here. > > I did not see this either, but I did see that the execution time > reported by \timing is (for this test case) consistently 0.5-1ms > *lower* than the Execution Time reported by EXPLAIN. I did not see > that on master. Is that expected? For what it's worth, I can no longer reproduce this. In fact, I went back to master-as-of-around-then and applied Lukas' v2 patches again, and I still can't reproduce that. I do remember it happening consistently across several executions, but now \timing consistently shows 0.5-1ms slower, as expected. This does not explain the different timing issue Lukas was seeing in his tests, but I think we can assume what I reported originally here is not an issue.
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi David, Thanks for continuing to work on this patch, and my apologies for silence on the patch. Its been hard to make time, and especially so because I typically develop on an ARM-based macOS system where I can't test this directly - hence my tests with virtualized EC2 instances, where I ran into the timing oddities. On Mon, Jan 2, 2023 at 5:28 AM David Geier wrote: > The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants > return double. This seems error prone. What about renaming the function > or also have the function return a double and cast where necessary at > the call site? > Minor note, but in my understanding using a uint64 (where we can) is faster for any simple arithmetic we do with the values. > If no one objects I would also re-register this patch in the commit fest. > +1, and feel free to carry this patch forward - I'll try to make an effort to review my earlier testing issues again, as well as your later improvements to the patch. Also, FYI, I just posted an alternate idea for speeding up EXPLAIN ANALYZE with timing over in [0], using a sampling-based approach to reduce the timing overhead. [0]: https://www.postgresql.org/message-id/CAP53PkxXMk0j-%2B0%3DYwRti2pFR5UB2Gu4v2Oyk8hhZS0DRART6g%40mail.gmail.com Thanks, Lukas -- Lukas Fittl
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, I re-based again on master and applied the following changes: I removed the fallback for obtaining the TSC frequency from /proc/cpu as suggested by Andres. Worst-case we fall back to clock_gettime(). I added code to obtain the TSC frequency via CPUID when under a hypervisor. I had to use __cpuid() directly instead of __get_cpuid(), because __get_cpuid() returns an error if the leaf is > 0x8000 (probably the implementation pre-dates the hypervisor timing leafs). Unfortunately, while testing my implementation under VMWare, I found that RDTSC runs awfully slow there (like 30x slower). [1] indicates that we cannot generally rely on RDTSC being actually fast on VMs. However, the same applies to clock_gettime(). It runs as slow as RDTSC on my VMWare setup. Hence, using RDTSC is not at disadvantage. I'm not entirely sure if there aren't cases where e.g. clock_gettime() is actually faster than RDTSC and it would be advantageous to use clock_gettime(). We could add a GUC so that the user can decide which clock source to use. Any thoughts? I also somewhat improved the accuracy of the cycles to milli- and microseconds conversion functions by having two more multipliers with higher precision. For microseconds we could also keep the computation integer-only. I'm wondering what to best do for seconds and milliseconds. I'm currently leaning towards just keeping it as is, because the durations measured and converted are usually long enough that precision shouldn't be a problem. In vacuum_lazy.c we do if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000). I changed that to use INSTR_TIME_GET_MILLISECS() instead. Additionally, I initialized a few variables of type instr_time which otherwise resulted in warnings due to use of potentially uninitialized variables. I also couldn't reproduce the reported timing discrepancy. For me the runtime reported by \timing is just slightly higher than the time reported by EXPLAIN ANALYZE, which is expected. Beyond that: What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that it's consistent with the _MILLISEC() and _MICROSEC() variants? The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants return double. This seems error prone. What about renaming the function or also have the function return a double and cast where necessary at the call site? If no one objects I would also re-register this patch in the commit fest. [1] https://vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf (page 11 "Virtual TSC") -- David Geier (ServiceNow) From 321d00ae5dd1bcffc8fbdb39879b7f5c78e3930f Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 10:22:01 +0100 Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's cheaper. 
--- src/include/portability/instr_time.h | 62 1 file changed, 26 insertions(+), 36 deletions(-) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index 22bcf3d288..4bd555113b 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -80,63 +80,53 @@ #define PG_INSTR_CLOCK CLOCK_REALTIME #endif -typedef struct timespec instr_time; +typedef int64 instr_time; +#define NS_PER_S INT64CONST(10) +#define US_PER_S INT64CONST(100) +#define MS_PER_S INT64CONST(1000) -#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0) +#define NS_PER_MS INT64CONST(100) +#define NS_PER_US INT64CONST(1000) -#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0) +#define INSTR_TIME_IS_ZERO(t) ((t) == 0) -#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t))) +#define INSTR_TIME_SET_ZERO(t) ((t) = 0) + +static inline instr_time pg_clock_gettime_ns(void) +{ + struct timespec tmp; + + clock_gettime(PG_INSTR_CLOCK, ); + + return tmp.tv_sec * NS_PER_S + tmp.tv_nsec; +} + +#define INSTR_TIME_SET_CURRENT(t) \ + (t) = pg_clock_gettime_ns() #define INSTR_TIME_ADD(x,y) \ do { \ - (x).tv_sec += (y).tv_sec; \ - (x).tv_nsec += (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y); \ } while (0) #define INSTR_TIME_SUBTRACT(x,y) \ do { \ - (x).tv_sec -= (y).tv_sec; \ - (x).tv_nsec -= (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ + (x) -= (y); \ } while (0) #define INSTR_TIME_ACCUM_DIFF(x,y,z) \ do { \ - (x).tv_sec += (y).tv_sec - (z).tv_sec; \ - (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \ - /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
I missed attaching the patches. -- David Geier (ServiceNow) From f4e962729ca605498d0c8bfc97d0f42d68a0df06 Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 10:22:01 +0100 Subject: [PATCH 1/2] WIP: Change instr_time to just store nanoseconds, that's cheaper. --- src/include/portability/instr_time.h | 62 1 file changed, 26 insertions(+), 36 deletions(-) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index 22bcf3d288..4bd555113b 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -80,63 +80,53 @@ #define PG_INSTR_CLOCK CLOCK_REALTIME #endif -typedef struct timespec instr_time; +typedef int64 instr_time; +#define NS_PER_S INT64CONST(10) +#define US_PER_S INT64CONST(100) +#define MS_PER_S INT64CONST(1000) -#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0) +#define NS_PER_MS INT64CONST(100) +#define NS_PER_US INT64CONST(1000) -#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0) +#define INSTR_TIME_IS_ZERO(t) ((t) == 0) -#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t))) +#define INSTR_TIME_SET_ZERO(t) ((t) = 0) + +static inline instr_time pg_clock_gettime_ns(void) +{ + struct timespec tmp; + + clock_gettime(PG_INSTR_CLOCK, ); + + return tmp.tv_sec * NS_PER_S + tmp.tv_nsec; +} + +#define INSTR_TIME_SET_CURRENT(t) \ + (t) = pg_clock_gettime_ns() #define INSTR_TIME_ADD(x,y) \ do { \ - (x).tv_sec += (y).tv_sec; \ - (x).tv_nsec += (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y); \ } while (0) #define INSTR_TIME_SUBTRACT(x,y) \ do { \ - (x).tv_sec -= (y).tv_sec; \ - (x).tv_nsec -= (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ + (x) -= (y); \ } while (0) #define INSTR_TIME_ACCUM_DIFF(x,y,z) \ do { \ - (x).tv_sec += (y).tv_sec - (z).tv_sec; \ - (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \ - /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y) - (z); \ } while (0) #define INSTR_TIME_GET_DOUBLE(t) \ - (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 10.0) + ((double) (t) / NS_PER_S) #define INSTR_TIME_GET_MILLISEC(t) \ - (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 100.0) + ((double) (t) / NS_PER_MS) #define INSTR_TIME_GET_MICROSEC(t) \ - (((uint64) (t).tv_sec * (uint64) 100) + (uint64) ((t).tv_nsec / 1000)) + ((double) (t) / NS_PER_US) #else /* WIN32 */ -- 2.34.1 From 7a6317fdf5b1d82f120a4fd5535cfe40c8165153 Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 13:03:59 +0100 Subject: [PATCH 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure time for instrumentation. For now this is only enabled on Linux/x86 when the system clocksource is marked tsc as well, as determined at runtime. This way we can rely on the Linux kernel to make a decision whether tsc is invariant and usable on the current CPU architecture. In all other cases we continue to use the clock_gettime() implementation like before. Note that this intentionally uses rdtsc, not rdtscp, as rdtscp waits for currently running CPU instructions to have finished, and that adds up to noticable latency for little benefit in the typical InstrStartNode() / InstrStopNode() use case. 
--- src/backend/utils/init/postinit.c | 3 + src/bin/pg_test_timing/pg_test_timing.c | 1 + src/bin/pgbench/pgbench.c | 3 + src/bin/psql/startup.c | 4 + src/common/Makefile | 1 + src/common/instr_time.c | 103 src/include/portability/instr_time.h| 50 +--- src/tools/msvc/Mkvcbuild.pm | 2 +- 8 files changed, 157 insertions(+), 10 deletions(-) create mode 100644 src/common/instr_time.c diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index a990c833c5..c684725af3 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -804,6 +804,9 @@ InitPostgres(const char *in_dbname, Oid dboid, /* Initialize portal manager */ EnablePortalManager(); + /* initialize high-precision interval timing */ + INSTR_TIME_INITIALIZE(); + /* Initialize status reporting */ pgstat_beinit(); diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c index c29d6f8762..0d667ff5a7 100644 --- a/src/bin/pg_test_timing/pg_test_timing.c +++ b/src/bin/pg_test_timing/pg_test_timing.c @@ -132,6 +132,7 @@ test_timing(unsigned int duration)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
I think it would be great to get this patch committed. Beyond the reasons already mentioned, the significant overhead also tends to skew the reported runtimes in ways that make it difficult to compare them. For example, if two nodes are executed equally often but one needs twice the time to process the rows: in such a case EXPLAIN ANALYZE should report timings that are 2x apart. However, currently, the high overhead of clock_gettime() tends to skew the relative runtimes. On 10/12/22 10:33, Michael Paquier wrote: No rebased version has been sent since this update, so this patch has been marked as RwF. I've rebased the patch set on latest master and fixed a few compiler warnings. Beyond that some findings and thoughts: You're only using RDTSC if the clock source is 'tsc'. Great idea to not bother caring about a lot of hairy TSC details. Looking at the kernel code this seems to imply that the TSC is frequency invariant. I don't think though that this implies that Linux is not running under a hypervisor; which is good because I assume PostgreSQL is used a lot in VMs. However, when running under a hypervisor (at least with VMWare) CPUID leaf 0x16 is not available. In my tests __get_cpuid() indicated success but the returned values were garbage. Instead of using leaf 0x16, we should then use the hypervisor interface to obtain the TSC frequency. Checking if a hypervisor is active can be done via: bool IsHypervisorActive() { uint32 cpuinfo[4] = {0}; int res = __get_cpuid(0x1, &cpuinfo[0], &cpuinfo[1], &cpuinfo[2], &cpuinfo[3]); return res > 0 && (cpuinfo[2] & (1 << 30)); } Obtaining the TSC frequency via the hypervisor interface can be done with the following code. See https://lwn.net/Articles/301888/ for more details. // Under hypervisors (tested with VMWare) leaf 0x16 is not available, even though __get_cpuid() succeeds. // Hence, if running under a hypervisor, use the hypervisor interface to obtain TSC frequency. uint32 cpuinfo[4] = {0}; if (IsHypervisorActive() && __get_cpuid(0x4001, &cpuinfo[0], &cpuinfo[1], &cpuinfo[2], &cpuinfo[3]) > 0) cycles_to_sec = 1.0 / ((double)cpuinfo[0] * 1000 * 1000); Given that we anyways switch between RDTSC and clock_gettime() with a global variable, what about exposing the clock source as GUC? That way the user can switch back to a working clock source in case we miss a detail around activating or reading the TSC. I'm happy to update the patches accordingly. -- David Geier (ServiceNow)
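To make the "only use RDTSC if the clock source is 'tsc'" check above concrete, a minimal sketch of reading the kernel's active clocksource from sysfs could look like this; it is not the patch itself, and path and error handling are simplified:

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

static bool
kernel_clocksource_is_tsc(void)
{
    char    buf[64] = {0};
    FILE   *f;

    f = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
    if (f == NULL)
        return false;

    if (fgets(buf, sizeof(buf), f) == NULL)
    {
        fclose(f);
        return false;
    }
    fclose(f);

    /* the file contains the active clocksource name, e.g. "tsc\n" */
    return strncmp(buf, "tsc", 3) == 0;
}

If this returns false, the code keeps using the clock_gettime() implementation exactly as before.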
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Tue, Sep 06, 2022 at 11:32:18AM +0500, Ibrar Ahmed wrote: > Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines). > Hunk #6 FAILED at 170. > Hunk #7 succeeded at 165 (offset -69 lines). > 2 out of 7 hunks FAILED -- saving rejects to file > src/include/portability/instr_time.h.rej > patching file src/tools/msvc/Mkvcbuild.pm No rebased version has been sent since this update, so this patch has been marked as RwF. -- Michael signature.asc Description: PGP signature
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Fri, Jul 15, 2022 at 11:22 PM Maciek Sakrejda wrote: > I ran that original test case with and without the patch. Here are the > numbers I'm seeing: > > master (best of three): > > postgres=# SELECT count(*) FROM lotsarows; > Time: 582.423 ms > > postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows; > Time: 616.102 ms > > postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows; > Time: 1068.700 ms (00:01.069) > > patched (best of three): > > postgres=# SELECT count(*) FROM lotsarows; > Time: 550.822 ms > > postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows; > Time: 612.572 ms > > postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows; > Time: 690.875 ms > > On Fri, Jul 1, 2022 at 10:26 AM Andres Freund wrote: > > On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote: > >... > > > Known WIP problems with this patch version: > > > > > > * There appears to be a timing discrepancy I haven't yet worked out, > where > > > the \timing data reported by psql doesn't match what EXPLAIN ANALYZE > is > > > reporting. With Andres' earlier test case, I'm seeing a consistent > ~700ms > > > higher for \timing than for the EXPLAIN ANALYZE time reported on the > > > server > > > side, only when rdtsc measurement is used -- its likely there is a > problem > > > somewhere with how we perform the cycles to time conversion > > > > Could you explain a bit more what you're seeing? I just tested your > patches > > and didn't see that here. > > I did not see this either, but I did see that the execution time > reported by \timing is (for this test case) consistently 0.5-1ms > *lower* than the Execution Time reported by EXPLAIN. I did not see > that on master. Is that expected? > > Thanks, > Maciek > > > The patch requires a rebase; please rebase the patch with the latest code. Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines). Hunk #6 FAILED at 170. Hunk #7 succeeded at 165 (offset -69 lines). 2 out of 7 hunks FAILED -- saving rejects to file src/include/portability/instr_time.h.rej patching file src/tools/msvc/Mkvcbuild.pm -- Ibrar Ahmed
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
I ran that original test case with and without the patch. Here are the numbers I'm seeing: master (best of three): postgres=# SELECT count(*) FROM lotsarows; Time: 582.423 ms postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows; Time: 616.102 ms postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows; Time: 1068.700 ms (00:01.069) patched (best of three): postgres=# SELECT count(*) FROM lotsarows; Time: 550.822 ms postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows; Time: 612.572 ms postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows; Time: 690.875 ms On Fri, Jul 1, 2022 at 10:26 AM Andres Freund wrote: > On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote: >... > > Known WIP problems with this patch version: > > > > * There appears to be a timing discrepancy I haven't yet worked out, where > > the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is > > reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms > > higher for \timing than for the EXPLAIN ANALYZE time reported on the > > server > > side, only when rdtsc measurement is used -- its likely there is a problem > > somewhere with how we perform the cycles to time conversion > > Could you explain a bit more what you're seeing? I just tested your patches > and didn't see that here. I did not see this either, but I did see that the execution time reported by \timing is (for this test case) consistently 0.5-1ms *lower* than the Execution Time reported by EXPLAIN. I did not see that on master. Is that expected? Thanks, Maciek
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote: > On Fri, Jun 12, 2020 at 4:28 PM Andres Freund wrote: > > > The attached patches are really just a prototype. I'm also not really > > planning to work on getting this into a "production ready" patchset > > anytime soon. I developed it primarily because I found it the overhead > > made it too hard to nail down in which part of a query tree performance > > changed. If somebody else wants to continue from here... > > > > I do think it's be a pretty significant improvement if we could reduce > > the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a > > bunch of low-level code. > > > > Based on an off-list conversation with Andres, I decided to dust off this > old > patch for using rdtsc directly. The significant EXPLAIN ANALYZE performance > improvements (especially when using rdtsc instead of rdtsc*p*) seem to > warrant > giving this a more thorough look. > > See attached an updated patch (adding it to the July commitfest), with a few > changes: > > - Keep using clock_gettime() as a fallback if we decide to not use rdtsc Yep. > - Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work I suspect that this might not be needed anymore. Seems like it'd be ok to just fall back to clock_gettime() in that case. > - In an abundance of caution, for now I've decided to only enable this if we > are on Linux/x86, and the current kernel clocksource is TSC (the kernel > has > quite sophisticated logic around making this decision, see [1]) I think our requirements are a bit lower than the kernel's - we're not tracking wall clock over an extended period... > Note that if we implemented the decision logic ourselves (instead of relying > on the Linux kernel), I'd be most worried about older virtualization > technology. In my understanding getting this right is notably more > complicated > than just checking cpuid, see [2]. > Known WIP problems with this patch version: > > * There appears to be a timing discrepancy I haven't yet worked out, where > the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is > reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms > higher for \timing than for the EXPLAIN ANALYZE time reported on the > server > side, only when rdtsc measurement is used -- its likely there is a problem > somewhere with how we perform the cycles to time conversion Could you explain a bit more what you're seeing? I just tested your patches and didn't see that here. > * Possibly related, the floating point handling for the cycles_to_sec > variable > is problematic in terms of precision (see FIXME, taken over from Andres' > POC) And probably also performance... > Open questions from me: > > 1) Do we need to account for different TSC offsets on different CPUs in SMP >systems? (the Linux kernel certainly has logic to that extent, but [3] >suggests this is no longer a problem on Nehalem and newer chips, i.e. > those >having an invariant TSC) I don't think we should cater to systems where we need that. > 2) Should we have a setting "--with-tsc" for configure? (instead of always >enabling it when on Linux/x86 with a TSC clocksource) Probably not worth it. > 3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for >current instructions to finish -- the prior discussion seemed to suggest >we don't want it for node instruction measurements, but possibly we do > want >this in other cases?) I was wondering about that too... 
Perhaps we should add an INSTR_TIME_SET_CURRENT_BARRIER() or such?

> 4) Should we support using the "mrs" instruction on ARM? (which is similar to
>    rdtsc, see [4])

I'd leave that for later personally.

> #define NS_PER_S	INT64CONST(1000000000)
> #define US_PER_S	INT64CONST(1000000)
> #define MS_PER_S	INT64CONST(1000)
> @@ -95,17 +104,37 @@ typedef int64 instr_time;
>
>  #define INSTR_TIME_SET_ZERO(t)	((t) = 0)
>
> -static inline instr_time pg_clock_gettime_ns(void)
> +extern double cycles_to_sec;
> +
> +bool use_rdtsc;

This should be extern and inside the ifdef below.

> +#if defined(__x86_64__) && defined(__linux__)
> +extern void pg_clock_gettime_initialize_rdtsc(void);
> +#endif
> +
> +static inline instr_time pg_clock_gettime_ref_cycles(void)
>  {
> 	struct timespec tmp;
>
> +#if defined(__x86_64__) && defined(__linux__)
> +	if (use_rdtsc)
> +		return __rdtsc();
> +#endif
> +
> 	clock_gettime(PG_INSTR_CLOCK, &tmp);
>
> 	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
> }

Greetings,

Andres Freund
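To illustrate what such a barrier variant could look like, here is a
hypothetical sketch that is not part of the posted patches: it reuses the
use_rdtsc flag and pg_clock_gettime_ref_cycles() from the quoted patch, and
reads the counter with rdtscp, which waits for all prior instructions to
finish, so it would only be suitable for coarse measurements such as a whole
query rather than per-node instrumentation.

#if defined(__x86_64__) && defined(__linux__)
#include <x86intrin.h>

/*
 * Sketch: like pg_clock_gettime_ref_cycles(), but ordered after all
 * preceding instructions, at the cost of a pipeline stall.
 */
static inline instr_time
pg_clock_gettime_ref_cycles_barrier(void)
{
	if (use_rdtsc)
	{
		unsigned int aux;		/* receives IA32_TSC_AUX, unused here */

		return __rdtscp(&aux);
	}
	return pg_clock_gettime_ref_cycles();
}

#define INSTR_TIME_SET_CURRENT_BARRIER(t) \
	((t) = pg_clock_gettime_ref_cycles_barrier())
#endif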
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Fri, Jun 12, 2020 at 4:28 PM Andres Freund wrote:
> The attached patches are really just a prototype. I'm also not really
> planning to work on getting this into a "production ready" patchset
> anytime soon. I developed it primarily because I found the overhead
> made it too hard to nail down in which part of a query tree performance
> changed. If somebody else wants to continue from here...
>
> I do think it'd be a pretty significant improvement if we could reduce
> the timing overhead of EXPLAIN ANALYZE by this much. Even if it requires a
> bunch of low-level code.

Based on an off-list conversation with Andres, I decided to dust off this old
patch for using rdtsc directly. The significant EXPLAIN ANALYZE performance
improvements (especially when using rdtsc instead of rdtsc*p*) seem to warrant
giving this a more thorough look.

See attached an updated patch (adding it to the July commitfest), with a few
changes:

- Keep using clock_gettime() as a fallback if we decide to not use rdtsc
- Fall back to /proc/cpuinfo for the clock frequency, if cpuid(0x16) doesn't work
- The decision to use rdtsc (or not) is made at runtime, in the new
  INSTR_TIME_INITIALIZE() -- we can't make this decision at compile time
  because it depends on the specific CPU in use, amongst other things
- In an abundance of caution, for now I've decided to only enable this if we
  are on Linux/x86 and the current kernel clocksource is TSC (the kernel has
  quite sophisticated logic around making this decision, see [1])

Note that if we implemented the decision logic ourselves (instead of relying
on the Linux kernel), I'd be most worried about older virtualization
technology. In my understanding, getting this right is notably more
complicated than just checking cpuid, see [2].

Known WIP problems with this patch version:

* There appears to be a timing discrepancy I haven't yet worked out, where
  the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
  reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
  higher for \timing than for the EXPLAIN ANALYZE time reported on the server
  side, only when rdtsc measurement is used -- it's likely there is a problem
  somewhere with how we perform the cycles to time conversion

* Possibly related, the floating point handling for the cycles_to_sec variable
  is problematic in terms of precision (see FIXME, taken over from Andres' POC)

Open questions from me:

1) Do we need to account for different TSC offsets on different CPUs in SMP
   systems? (the Linux kernel certainly has logic to that extent, but [3]
   suggests this is no longer a problem on Nehalem and newer chips, i.e. those
   having an invariant TSC)

2) Should we have a setting "--with-tsc" for configure? (instead of always
   enabling it when on Linux/x86 with a TSC clocksource)

3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
   current instructions to finish -- the prior discussion seemed to suggest
   we don't want it for node instruction measurements, but possibly we do want
   this in other cases?)

4) Should we support using the "mrs" instruction on ARM?
   (which is similar to rdtsc, see [4])

Thanks,
Lukas

[1] https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c
[2] http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/
[3] https://stackoverflow.com/a/11060619/1652607
[4] https://cpufun.substack.com/p/fun-with-timers-and-cpuid

--
Lukas Fittl

v2-0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure.patch
Description: Binary data

v2-0001-WIP-Change-instr_time-to-just-store-nanoseconds-t.patch
Description: Binary data
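On question 4, the ARM counterpart is the ARMv8 generic timer, which exposes
a constant-frequency virtual counter and its frequency through system
registers, so no cpuid-style frequency lookup is needed. A hedged,
AArch64-only sketch follows; the helper names are invented for illustration,
while the register names are the architectural ones.

#if defined(__aarch64__)
#include <stdint.h>

/* read the virtual counter, analogous to rdtsc on x86 */
static inline uint64_t
read_virtual_counter(void)
{
	uint64_t	val;

	__asm__ __volatile__("mrs %0, cntvct_el0" : "=r" (val));
	return val;
}

/* counter frequency in Hz, as reported by the hardware itself */
static inline uint64_t
read_counter_frequency(void)
{
	uint64_t	freq;

	__asm__ __volatile__("mrs %0, cntfrq_el0" : "=r" (freq));
	return freq;
}
#endif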
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Sat, Jun 13, 2020 at 11:28 AM Andres Freund wrote:
> [PATCH v1 1/2] WIP: Change instr_time to just store nanoseconds, that's
> cheaper.

Makes a lot of sense. If we do this, I'll need to update pgbench, which just
did something similar locally. If I'd been paying attention to this thread I
might not have committed that piece of the recent pgbench changes, but it's
trivial stuff and I'll be happy to tidy that up when the time comes.

> [PATCH v1 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure time for
> instrumentation.
>
> Some of the time is spent doing function calls, dividing into struct
> timespec, etc. But most of it is just the rdtscp instruction:
> 65.30 │1  63:   rdtscp
>
> The reason for that is largely that rdtscp waits until all prior
> instructions have finished (but it allows later instructions to already
> start). Multiple times for each tuple.

Yeah, after reading a bit about this, I agree that there is no reason to think
that the stalling version makes the answer better in any way. It might make
sense if you use it once at the beginning of a large computation, but it makes
no sense if you sprinkle it around inside blocks that will run multiple times.
It destroys your instructions-per-cycle while it runs, turning your fancy
superscalar Pentium into a 486. It does raise some interesting questions about
what exactly you're measuring, though: I don't know enough to have a good grip
on how far out of order the TSC could be read!

> There's also other issues with using rdtsc directly: On older CPUs, in
> particular older multi-socket systems, the tsc will not be synchronized
> in detail across cores. There's bits that'd let us check whether tsc is
> suitable or not. The more current issue of that is that things like
> virtual machines being migrated can lead to rdtsc suddenly returning a
> different value / the frequency differing. But that is supposed to be
> solved these days, by having virtualization technologies set frequency
> multipliers and offsets which then cause rdtsc[p] to return something
> meaningful, even after migration.

Googling tells me that Nehalem (2008) introduced "invariant TSC" (clock rate
independent) and also socket synchronisation at the same time, so systems
without it are already pretty long in the tooth. A quick peek at an AMD
manual[1] tells me that a similar change happened in
15H/Bulldozer/Piledriver/Steamroller/Excavator (2011), identified with the
same CPUID test.

My first reaction is that it seems like TSC would be the least of your worries
if you're measuring a VM that's currently migrating between hosts, but maybe
the idea is just that you have to make sure you don't assume it can't ever go
backwards or something like that...

Google Benchmark[2] has some clues about how to spell this on MSVC, which
instructions might be worth researching on ARM, etc.

[1] https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf
    (page 373)
[2] https://github.com/google/benchmark/blob/master/src/cycleclock.h
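For reference, the CPUID test referred to above is extended leaf 0x80000007,
whose EDX bit 8 advertises an invariant (constant-rate, non-stop) TSC on both
Intel and AMD. A minimal sketch of such a check, with a function name invented
for illustration:

#include <cpuid.h>
#include <stdbool.h>

static bool
has_invariant_tsc(void)
{
	unsigned int eax = 0,
				ebx = 0,
				ecx = 0,
				edx = 0;

	/* extended leaf 0x80000007: Advanced Power Management Information */
	if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
		return false;

	/* EDX bit 8: "Invariant TSC" (Intel) / "TscInvariant" (AMD) */
	return (edx & (1U << 8)) != 0;
}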
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
so 13. 6. 2020 v 1:28 odesílatel Andres Freund napsal: > Hi, > > Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes > the resulting timing enough that the times aren't meaningful. E.g. > > CREATE TABLE lotsarows(key int not null); > INSERT INTO lotsarows SELECT generate_series(1, 5000); > VACUUM FREEZE lotsarows; > > > -- best of three: > SELECT count(*) FROM lotsarows; > Time: 1923.394 ms (00:01.923) > > -- best of three: > EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows; > Time: 2319.830 ms (00:02.320) > > -- best of three: > EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows; > Time: 4202.649 ms (00:04.203) > > That nearly *double* the execution time without TIMING. > > > Looking at a profile of this shows that we spend a good bit of cycles > "normalizing" timstamps etc. That seems pretty unnecessary, just forced > on us due to struct timespec. So the first attached patch just turns > instr_time to be a 64bit integer, counting nanoseconds. > > That helps, a tiny bit: > EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows; > Time: 4179.302 ms (00:04.179) > > but obviously doesn't move the needle. > > > Looking at a profile it's easy to confirm that we spend a lot of time > acquiring time: > - 95.49% 0.00% postgres postgres [.] > agg_retrieve_direct (inlined) >- agg_retrieve_direct (inlined) > - 79.27% fetch_input_tuple > - ExecProcNode (inlined) > - 75.72% ExecProcNodeInstr >+ 25.22% SeqNext >- 21.74% InstrStopNode > + 17.80% __GI___clock_gettime (inlined) >- 21.44% InstrStartNode > + 19.23% __GI___clock_gettime (inlined) >+ 4.06% ExecScan > + 13.09% advance_aggregates (inlined) > 1.06% MemoryContextReset > > And that's even though linux avoids a syscall (in most cases) etc to > acquire the time. Unless the kernel detects there's a reason not to do > so, linux does this by using 'rdtscp' and multiplying it by kernel > provided factors to turn the cycles into time. > > Some of the time is spent doing function calls, dividing into struct > timespec, etc. But most of it just the rdtscp instruction: > 65.30 │1 63: rdtscp > > > The reason for that is largely that rdtscp waits until all prior > instructions have finished (but it allows later instructions to already > start). Multiple times for each tuple. > > > In the second attached prototype patch I've change instr_time to count > in cpu cycles instead of nanoseconds. And then just turned the cycles > into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later). > > When using rdtsc that results in *vastly* lower overhead: > > ┌───┐ > │ QUERY PLAN > │ > > ├───┤ > │ Aggregate (cost=846239.20..846239.21 rows=1 width=8) (actual > time=2610.235..2610.235 rows=1 loops=1) │ > │ -> Seq Scan on lotsarows (cost=0.00..721239.16 rows=5016 > width=0) (actual time=0.006..1512.886 rows=5000 loops=1) │ > │ Planning Time: 0.028 ms > │ > │ Execution Time: 2610.256 ms > │ > > └───┘ > (4 rows) > > Time: 2610.589 ms (00:02.611) > > And there's still some smaller improvements that could be made ontop of > that. > > As a comparison, here's the time when using rdtscp directly in > instr_time, instead of going through clock_gettime: > Time: 3481.162 ms (00:03.481) > > That shows pretty well how big the cost of the added pipeline stalls > are, and how important out-of-order execution is for decent > performance... 
> > > In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we > do *not* want to wait for prior instructions to finish, since that > actually leads to the timing being less accurate, rather than > more. There are other cases where that'd be different, e.g. measuring > how long an entire query takes or such (but there it's probably > irrelevant which to use). > > > I've above skipped a bit over the details of how to turn the cycles > returned by rdtsc into time: > > On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that > have actually been run, but instead returns the number of 'reference > cycles'. That's important because otherwise things like turbo mode and > lower power modes would lead to completely bogus times. > > Thus, knowing the "base frequency" of the CPU allows to turn the >
Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi,

Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes the
resulting timing enough that the times aren't meaningful. E.g.

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 5000);
VACUUM FREEZE lotsarows;

-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)

-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)

-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)

That's nearly *double* the execution time without TIMING.

Looking at a profile of this shows that we spend a good bit of cycles
"normalizing" timestamps etc. That seems pretty unnecessary, just forced on us
due to struct timespec. So the first attached patch just turns instr_time into
a 64bit integer, counting nanoseconds.

That helps, a tiny bit:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)
but obviously doesn't move the needle.

Looking at a profile it's easy to confirm that we spend a lot of time
acquiring time:

-   95.49%     0.00%  postgres  postgres  [.] agg_retrieve_direct (inlined)
   - agg_retrieve_direct (inlined)
      - 79.27% fetch_input_tuple
         - ExecProcNode (inlined)
            - 75.72% ExecProcNodeInstr
               + 25.22% SeqNext
               - 21.74% InstrStopNode
                  + 17.80% __GI___clock_gettime (inlined)
               - 21.44% InstrStartNode
                  + 19.23% __GI___clock_gettime (inlined)
               + 4.06% ExecScan
      + 13.09% advance_aggregates (inlined)
        1.06% MemoryContextReset

And that's even though linux avoids a syscall (in most cases) etc to acquire
the time. Unless the kernel detects there's a reason not to do so, linux does
this by using 'rdtscp' and multiplying it by kernel provided factors to turn
the cycles into time.

Some of the time is spent doing function calls, dividing into struct timespec,
etc. But most of it is just the rdtscp instruction:
65.30 │1  63:   rdtscp

The reason for that is largely that rdtscp waits until all prior instructions
have finished (but it allows later instructions to already start). Multiple
times for each tuple.

In the second attached prototype patch I've changed instr_time to count in cpu
cycles instead of nanoseconds, and then just turned the cycles into seconds in
INSTR_TIME_GET_DOUBLE() (more about that part later).

When using rdtsc that results in *vastly* lower overhead:

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                        QUERY PLAN                                                       │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1)                   │
│   ->  Seq Scan on lotsarows  (cost=0.00..721239.16 rows=5016 width=0) (actual time=0.006..1512.886 rows=5000 loops=1)   │
│ Planning Time: 0.028 ms                                                                                                  │
│ Execution Time: 2610.256 ms                                                                                              │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)

Time: 2610.589 ms (00:02.611)

And there's still some smaller improvements that could be made on top of that.

As a comparison, here's the time when using rdtscp directly in instr_time,
instead of going through clock_gettime:
Time: 3481.162 ms (00:03.481)

That shows pretty well how big the cost of the added pipeline stalls is, and
how important out-of-order execution is for decent performance...

In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we do
*not* want to wait for prior instructions to finish, since that actually leads
to the timing being less accurate, rather than more. There are other cases
where that'd be different, e.g. measuring how long an entire query takes or
such (but there it's probably irrelevant which to use).
Above I've skipped a bit over the details of how to turn the cycles returned
by rdtsc into time:

On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that have
actually been run, but instead returns the number of 'reference cycles'.
That's important because otherwise things like turbo mode and lower power
modes would lead to completely bogus times.

Thus, knowing the "base frequency" of the CPU allows turning the difference
between two rdtsc return values into seconds.

In the attached prototype I just determined the number of cycles using
cpuid(0x16). That's only available since
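A minimal sketch of the cpuid(0x16)-based conversion described here might look
like the following. Leaf 0x16 ("Processor Frequency Information") reports the
base frequency in MHz in EAX; the cycles_to_sec variable follows the
prototype's naming, while the function itself is invented for illustration and
omits the hypervisor and /proc/cpuinfo fallbacks discussed elsewhere in the
thread.

#include <cpuid.h>
#include <stdbool.h>

/* seconds per reference cycle; 0 means "unknown, fall back to clock_gettime()" */
double		cycles_to_sec = 0;

static bool
init_cycles_to_sec(void)
{
	unsigned int eax = 0,
				ebx = 0,
				ecx = 0,
				edx = 0;

	/*
	 * Leaf 0x16 is not available on older CPUs and under many hypervisors;
	 * in that case we must not use rdtsc for timing.
	 */
	if (!__get_cpuid(0x16, &eax, &ebx, &ecx, &edx) || eax == 0)
		return false;

	/* EAX holds the base frequency in MHz */
	cycles_to_sec = 1.0 / ((double) eax * 1000.0 * 1000.0);
	return true;
}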