Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, At some point this patch switched from rdtsc to rdtscp, which imo largely negates the point of it. What led to that? Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Sat, 20 Jan 2024 at 09:03, vignesh C wrote: > > On Mon, 20 Feb 2023 at 16:06, David Geier wrote: > > > > Hi! > > > > On 2/14/23 13:48, David Geier wrote: > > > > > > It still fails. > > > > > > I'll get Cirrus-CI working on my own Github fork so I can make sure it > > > really compiles on all platforms before I submit a new version. > > > > It took some time until Cirrus CI allowed me to run tests against my new > > GitHub account (there's a 3 days freeze to avoid people from getting > > Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch > > which passes builds, rebased on latest master. > > > > I also reviewed the first two patches a while ago in [1]. I hope we can > > progress with them to further reduce the size of this patch set. > > > > Beyond that: I could work on support for more OSs (e.g. starting with > > Windows). Is there appetite for that or do we rather want to instead > > start with a smaller patch? > > Are we planning to continue on this and take it further? > I'm seeing that there has been no activity in this thread for nearly 1 > year now, I'm planning to close this in the current commitfest unless > someone is planning to take it forward. Since neither the author nor anyone else has shown interest in taking it forward and the patch has had no activity for more than a year, I have changed the status to Returned with Feedback (RWF). Feel free to add a new CF entry when someone plans to resume work on this more actively. Regards, Vignesh
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Mon, 20 Feb 2023 at 16:06, David Geier wrote: > > Hi! > > On 2/14/23 13:48, David Geier wrote: > > > > It still fails. > > > > I'll get Cirrus-CI working on my own Github fork so I can make sure it > > really compiles on all platforms before I submit a new version. > > It took some time until Cirrus CI allowed me to run tests against my new > GitHub account (there's a 3 days freeze to avoid people from getting > Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch > which passes builds, rebased on latest master. > > I also reviewed the first two patches a while ago in [1]. I hope we can > progress with them to further reduce the size of this patch set. > > Beyond that: I could work on support for more OSs (e.g. starting with > Windows). Is there appetite for that or do we rather want to instead > start with a smaller patch? Are we planning to continue on this and take it further? I'm seeing that there has been no activity in this thread for nearly 1 year now, so I'm planning to close this in the current commitfest unless someone is planning to take it forward. Regards, Vignesh
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi! On 2/14/23 13:48, David Geier wrote: It still fails. I'll get Cirrus-CI working on my own Github fork so I can make sure it really compiles on all platforms before I submit a new version. It took some time until Cirrus CI allowed me to run tests against my new GitHub account (there's a 3 days freeze to avoid people from getting Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch which passes builds, rebased on latest master. I also reviewed the first two patches a while ago in [1]. I hope we can progress with them to further reduce the size of this patch set. Beyond that: I could work on support for more OSs (e.g. starting with Windows). Is there appetite for that or do we rather want to instead start with a smaller patch? [1] https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com -- David Geier (ServiceNow) From d03a9be2522b0ef22fd58cbcfc95eb19ca8b2bea Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 20 Jan 2023 15:31:54 -0800 Subject: [PATCH v9 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop without having to convert into time units (which is costly). INSTR_TIME_IS_LT() can be used to check the loop condition. Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/include/portability/instr_time.h | 12 1 file changed, 12 insertions(+) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index cc85138e21..aab80effb0 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -15,6 +15,8 @@ * * INSTR_TIME_IS_ZERO(t) is t equal to zero? * + * INSTR_TIME_IS_LT(x, y) x < y + * * INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too) * * INSTR_TIME_SET_CURRENT(t) set t to current time @@ -22,6 +24,8 @@ * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero, * evaluates to whether t changed * + * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds + * * INSTR_TIME_ADD(x, y)x += y * * INSTR_TIME_SUBTRACT(x, y) x -= y @@ -122,6 +126,9 @@ pg_clock_gettime_ns(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_clock_gettime_ns()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = NS_PER_S * (s)) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) (t).ticks) @@ -156,6 +163,9 @@ GetTimerFrequency(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_query_performance_counter()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = s * GetTimerFrequency()) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency( @@ -168,6 +178,8 @@ GetTimerFrequency(void) #define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0) +#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks) + #define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0) -- 2.34.1 From 639213b01102a7320a62ba5ed68a8e3d6a05514b Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 16 Jan 2023 11:19:11 -0800 Subject: [PATCH v9 2/3] wip: report nanoseconds in pg_test_timing This commit also updates pg_test_timing's documentation: - compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of of statement with/without EXPLAIN ANALYZE - explain the 2x overhead (due to two timestamp acquisitions per row) - remove old section about old versions of linux - I couldn't update the numbers, and it's old enough nobody would care --- doc/src/sgml/ref/pgtesttiming.sgml | 117 ++-- src/bin/pg_test_timing/pg_test_timing.c | 74 +-- 2 files changed, 95 insertions(+), 96 
deletions(-) diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml index a5eb3aa25e..7e0266cf58 100644 --- a/doc/src/sgml/ref/pgtesttiming.sgml +++ b/doc/src/sgml/ref/pgtesttiming.sgml @@ -93,28 +93,34 @@ PostgreSQL documentation Good results will show most (>90%) individual timing calls take less than - one microsecond. Average per loop overhead will be even lower, below 100 - nanoseconds. This example from an Intel i7-860 system using a TSC clock - source shows excellent performance: + one microsecond (1000 nanoseconds). Average per loop overhead will be even + lower, below 100 nanoseconds. This example from an Intel i9-9880H system + using a TSC clock source shows excellent performance: - Note that different units are used for the per loop time than the - histogram. The loop can have resolution within a few nanoseconds (ns), - while the individual timing calls can only resolve down to one microsecond - (us). + Note that the accuracy of the histogram entries may be lower than the + per loop time. @@ -125,24 +131,25 @@ Histogram of timing durations: When the query executor is running a statement using EXPLAIN ANALYZE, individual operations are timed as well
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi! On 2/14/23 12:11, David Geier wrote: Hi, I think I fixed the compilation errors. It was due to a few variables being declared under #if defined(__x86_64__) && defined(__linux__) while being used also under non x86 Linux. I also removed again the code to obtain the TSC frequency under hypervisors because the TSC is usually emulated and therefore no faster than clock_gettime() anyways. So we now simply fallback to clock_gettime() on hypervisors when we cannot obtain the frequency via leaf 0x16. Beyond that I reviewed the first two patches a while ago in [1]. I hope we can progress with them to further reduce the size of this patch set. [1] https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com It still fails. I'll get Cirrus-CI working on my own Github fork so I can make sure it really compiles on all platforms before I submit a new version. -- David Geier (ServiceNow)
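For illustration, the leaf-0x16 probe being discussed looks roughly like this (a minimal sketch assuming GCC/Clang's cpuid.h; the function name and error handling are made up, this is not the patch's code). CPUID leaf 0x16 reports the processor base frequency in MHz in EAX and returns 0 when the value is not enumerated, which is the common case under hypervisors:

    #include <cpuid.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Try to derive the TSC frequency from CPUID leaf 0x16; a false return
     * means the caller should fall back to clock_gettime()-based timing. */
    static bool
    tsc_freq_khz_from_cpuid(uint64_t *freq_khz)
    {
        unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

        if (__get_cpuid_max(0, NULL) < 0x16)
            return false;
        __cpuid(0x16, eax, ebx, ecx, edx);
        if (eax == 0)                       /* frequency not enumerated */
            return false;
        *freq_khz = (uint64_t) eax * 1000;  /* EAX is the base frequency in MHz */
        return true;
    }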
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2/7/23 19:12, Andres Freund wrote: This fails to build on several platforms: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3751 I think I fixed the compilation errors. It was due to a few variables being declared under #if defined(__x86_64__) && defined(__linux__) while being used also under non x86 Linux. I also removed again the code to obtain the TSC frequency under hypervisors because the TSC is usually emulated and therefore no faster than clock_gettime() anyways. So we now simply fallback to clock_gettime() on hypervisors when we cannot obtain the frequency via leaf 0x16. Beyond that I reviewed the first two patches a while ago in [1]. I hope we can progress with them to further reduce the size of this patch set. [1] https://www.postgresql.org/message-id/3ac157f7-085d-e071-45fc-b87cd306360c%40gmail.com -- David Geier (ServiceNow) From 36ff7f7ee14bf42ef0fb775cec428180251c3ff9 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 20 Jan 2023 15:31:54 -0800 Subject: [PATCH v8 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop without having to convert into time units (which is costly). INSTR_TIME_IS_LT() can be used to check the loop condition. Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/include/portability/instr_time.h | 12 1 file changed, 12 insertions(+) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index cc85138e21..aab80effb0 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -15,6 +15,8 @@ * * INSTR_TIME_IS_ZERO(t) is t equal to zero? * + * INSTR_TIME_IS_LT(x, y) x < y + * * INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too) * * INSTR_TIME_SET_CURRENT(t) set t to current time @@ -22,6 +24,8 @@ * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero, * evaluates to whether t changed * + * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds + * * INSTR_TIME_ADD(x, y)x += y * * INSTR_TIME_SUBTRACT(x, y) x -= y @@ -122,6 +126,9 @@ pg_clock_gettime_ns(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_clock_gettime_ns()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = NS_PER_S * (s)) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) (t).ticks) @@ -156,6 +163,9 @@ GetTimerFrequency(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_query_performance_counter()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = s * GetTimerFrequency()) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency( @@ -168,6 +178,8 @@ GetTimerFrequency(void) #define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0) +#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks) + #define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0) -- 2.34.1 From 1200bfee5b8ebf7c68d8cefb3771a4c3523c2cea Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 16 Jan 2023 11:19:11 -0800 Subject: [PATCH v8 2/3] wip: report nanoseconds in pg_test_timing This commit also updates pg_test_timing's documentation: - compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of of statement with/without EXPLAIN ANALYZE - explain the 2x overhead (due to two timestamp acquisitions per row) - remove old section about old versions of linux - I couldn't update the numbers, and it's old enough nobody would care --- doc/src/sgml/ref/pgtesttiming.sgml | 117 ++-- src/bin/pg_test_timing/pg_test_timing.c | 74 +-- 2 files changed, 95 
insertions(+), 96 deletions(-) diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml index a5eb3aa25e..7e0266cf58 100644 --- a/doc/src/sgml/ref/pgtesttiming.sgml +++ b/doc/src/sgml/ref/pgtesttiming.sgml @@ -93,28 +93,34 @@ PostgreSQL documentation Good results will show most (>90%) individual timing calls take less than - one microsecond. Average per loop overhead will be even lower, below 100 - nanoseconds. This example from an Intel i7-860 system using a TSC clock - source shows excellent performance: + one microsecond (1000 nanoseconds). Average per loop overhead will be even + lower, below 100 nanoseconds. This example from an Intel i9-9880H system + using a TSC clock source shows excellent performance: - Note that different units are used for the per loop time than the - histogram. The loop can have resolution within a few nanoseconds (ns), - while the individual timing calls can only resolve down to one microsecond - (us). + Note that the accuracy of the histogram entries may be lower than the + per loop time. @@ -125,24 +131,25 @@ Histogram of timing durations: When the query executor is running a statement using EXPLAIN ANALYZE, individual operations are
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-24 14:30:34 +0100, David Geier wrote: > Attached is v7 of the patch: > > - Rebased on latest master (most importantly on top of the int64 instr_time > commits). - Includes two commits from Andres which introduce > INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() and WIP to report > pg_test_timing output in nanoseconds. - Converts ticks to nanoseconds only > with integer math, while accounting for overflow. - Supports RDTSCP via > INSTR_TIME_SET_CURRENT() and introduced INSTR_TIME_SET_CURRENT_FAST() which > uses RDTSC. > > I haven't gotten to the following: > > - Looking through all calls to INSTR_TIME_SET_CURRENT() and check if they > should be replaced by INSTR_TIME_SET_CURRENT_FAST(). - Reviewing Andres > commits. Potentially improving on pg_test_timing's output. - Looking at > enabling RDTSC on more platforms. Is there a minimum set of platforms we > would like support for? Windows should be easy. That would also allow to > unify the code a little more. - Add more documentation and do more testing > around the calls to CPUID. - Profiling and optimizing the code. A quick test > showed about 10% improvement over master with TIMING ON vs TIMING OFF, when > using the test-case from Andres' e-mail that started this thread. > > I hope I'll find time to work on these points during the next days. This fails to build on several platforms: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3751 Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/23/23 21:30, Andres Freund wrote: That's been the case since my first post in the thread :). Mainly, it seems easier to detect underflow cases during subtraction that way. And the factor of 2 in range doesn't change a whole lot. I just realized it the other day :). If you have time to look at the pg_test_timing part, it'd be appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit hesitant to push it. I haven't yet pushed the pg_test_timing (nor it's small prerequisite) patch. I've attached those two patches. Feel free to include them in your series if you want, then the CF entry (and thus cfbot) makes sense again... I'll include them in my new patch set and also have a careful look at them. I reviewed the prerequisite patch which introduces INSTR_TIME_SET_SECONDS(), as well as the pg_test_timing patch. Here are my comments:
- The prerequisite patch looks good to me.
- By default, the test query in the pg_test_timing doc runs serially. What about adding SET max_parallel_workers_per_gather = 0 to make sure it really always does (e.g. on a system with different settings for parallel_tuple_cost / parallel_setup_cost)? Otherwise, the numbers will be much more flaky.
- Why have you added a case distinction for diff == 0? Have you encountered this case? If so, how? Maybe add a comment.
- To further reduce overhead we could call INSTR_TIME_SET_CURRENT() multiple times. But then again: why do we actually care about the per-loop time? Why not instead sum up diff and divide by the number of iterations to exclude all the overhead in the first place?
- In the computation of the per-loop time in nanoseconds you can now use INSTR_TIME_GET_NANOSEC() instead of INSTR_TIME_GET_DOUBLE() * NS_PER_S.
The rest looks good to me. The rebased patches are part of the patch set I sent out yesterday in reply to another mail in this thread. -- David Geier (ServiceNow)
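One way to read the last two review points together is roughly the following (sketch only; variable names are assumed to match pg_test_timing's loop, not copied from the patch):

    /* after the timing loop has finished and counted loop_count iterations */
    INSTR_TIME_SET_CURRENT(end_time);
    INSTR_TIME_SUBTRACT(end_time, start_time);
    ns_per_loop = (double) INSTR_TIME_GET_NANOSEC(end_time) / loop_count;

This measures the total elapsed time once and divides by the iteration count, and it uses the nanosecond accessor directly instead of INSTR_TIME_GET_DOUBLE() * NS_PER_S.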
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi I think at least some should be converted to just accumulate in an instr_time... I think that's for a later patch though? Yep, at least quite similar. OK. I coded it up in the latest version of the patch. Depending on how low we want to keep the error, I don't think we can: If I set the allowed deviation to 10**-9, we end up requiring a shift by 29 for common ghz ranges. Clearly 33bits isn't an interesting range. But even if you accept a higher error - we don't have *that* much range available. Assuming an uint64, the range is ~584 years. If we want 10 years range, we end up math.log(((2**64)-1) / (10 * 365 * 60 * 60 * 24 * 10**9), 2) ~= 5.87 So 5 bits available that we could "use" for multiply/shift. For something like 2.5ghz, that'd be ~2% error, clearly not acceptable. And even just a year of range, ends up allowing a failure of 30796s = 8min over a year, still too high. Thanks for doing the math. Agreed. The latest patch detects overflow and correctly handles it. But I don't think it's really an issue - normally that branch will never be taken (at least within the memory of the branch predictor), which on modern CPUs means it'll just be predicted as not taken. So as long as we tell the compiler what's the likely branch, it should be fine. At least as long as the branch compares with a hardcoded number. Yeah. The overflow detection just compares two int64. The "overflow threshold" is pre-computed. - the restriction just to linux, that'll make testing harder for some, and ends up encoding too much OS dependency - I think we need both the barrier and non-barrier variant, otherwise I suspect we'll end up with inccuracies we don't want - needs lots more documentation about why certain cpuid registers are used - cpu microarch dependencies - isn't there, e.g., the case that the scale on nehalem has to be different than on later architectures? - lack of facility to evaluate how well the different time sources work Makes sense. I carried that list over to my latest e-mail which also includes the patch to have some sort of summary of where we are in a single place. -- David Geier (ServiceNow)
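To put rough numbers on the precomputed threshold (illustrative figures, not taken from the patch): with a 2.5 GHz TSC one tick is 0.4 ns; using a shift of 29 the precomputed multiplier is round(0.4 * 2^29) = 214748365, the fast path computes (ticks * 214748365) >> 29, and that multiplication only overflows a 64-bit intermediate once the measured interval exceeds somewhere around 15 to 35 seconds (depending on whether the intermediate is signed). That is why the slow-path branch is essentially never taken for per-node timing.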
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/23/23 18:41, Andres Freund wrote: If we add it, it probably shouldn't depend on TIMING, but on SUMMARY. Regression test queries showing EXPLAIN ANALYZE output all do something like EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) the SUMMARY OFF gets rid of the "top-level" "Planning Time" and "Execution Time", whereas the TIMING OFF gets rid of the per-node timing. Those are separate options because per-node timing is problematic performance-wise (right now), but whole-query timing rarely is. Makes sense. I wasn't aware of SUMMARY. Let's keep this option in mind, in case we'll revisit exposing the clock source in the future. Another, independent, thing worth thinking about: I think we might want to expose both rdtsc and rdtscp. For something like InstrStartNode()/InstrStopNode(), avoiding the "one-way barrier" of rdtscp is quite important to avoid changing the query performance. But for measuring whole-query time, we likely want to measure the actual time. It probably won't matter hugely for the whole query time - the out of order window of modern CPUs is large, but not *that* large - but I don't think we can generally assume that. That's what I thought as well. I added INSTR_TIME_SET_CURRENT_FAST() and for now call that variant from InstrStartNode(), InstrEndNode() and pg_test_timing. To do so in InstrEndNode(), I removed INSTR_TIME_SET_CURRENT_LAZY(). Otherwise, two variants of that macro would be needed. INSTR_TIME_SET_CURRENT_LAZY() was only used in a single place and the code is more readable that way. INSTR_TIME_SET_CURRENT() is called from a bunch of places. I still have to go through all of them and see which should be changed to call the _FAST() variant. Attached is v7 of the patch: - Rebased on latest master (most importantly on top of the int64 instr_time commits). - Includes two commits from Andres which introduce INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() and WIP to report pg_test_timing output in nanoseconds. - Converts ticks to nanoseconds only with integer math, while accounting for overflow. - Supports RDTSCP via INSTR_TIME_SET_CURRENT() and introduced INSTR_TIME_SET_CURRENT_FAST() which uses RDTSC. I haven't gotten to the following: - Looking through all calls to INSTR_TIME_SET_CURRENT() and check if they should be replaced by INSTR_TIME_SET_CURRENT_FAST(). - Reviewing Andres commits. Potentially improving on pg_test_timing's output. - Looking at enabling RDTSC on more platforms. Is there a minimum set of platforms we would like support for? Windows should be easy. That would also allow to unify the code a little more. - Add more documentation and do more testing around the calls to CPUID. - Profiling and optimizing the code. A quick test showed about 10% improvement over master with TIMING ON vs TIMING OFF, when using the test-case from Andres' e-mail that started this thread. I hope I'll find time to work on these points during the next days. -- David Geier (ServiceNow) From 0b5ce706bed13c0c242e6ace809d3c37a8064029 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 20 Jan 2023 15:31:54 -0800 Subject: [PATCH v7 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop without having to convert into time units (which is costly). INSTR_TIME_IS_LT() can be used to check the loop condition. 
Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/include/portability/instr_time.h | 12 1 file changed, 12 insertions(+) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index cc85138e21..aab80effb0 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -15,6 +15,8 @@ * * INSTR_TIME_IS_ZERO(t) is t equal to zero? * + * INSTR_TIME_IS_LT(x, y) x < y + * * INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too) * * INSTR_TIME_SET_CURRENT(t) set t to current time @@ -22,6 +24,8 @@ * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero, * evaluates to whether t changed * + * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds + * * INSTR_TIME_ADD(x, y)x += y * * INSTR_TIME_SUBTRACT(x, y) x -= y @@ -122,6 +126,9 @@ pg_clock_gettime_ns(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_clock_gettime_ns()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = NS_PER_S * (s)) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) (t).ticks) @@ -156,6 +163,9 @@ GetTimerFrequency(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_query_performance_counter()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = s * GetTimerFrequency()) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency( @@ -168,6 +178,8 @@ GetTimerFrequency(void) #define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0) +#define INSTR_TIME_IS_LT(x, y)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-23 18:52:44 +0100, David Geier wrote: > One thing I was wondering about: why did you chose to use a signed instead > of an unsigned 64-bit integer for the ticks? That's been the case since my first post in the thread :). Mainly, it seems easier to detect underflow cases during subtraction that way. And the factor of 2 in range doesn't change a whole lot. > > > If you have time to look at the pg_test_timing part, it'd be > > > appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit > > > hesitant to push it. > > I haven't yet pushed the pg_test_timing (nor it's small prerequisite) > > patch. > > > > I've attached those two patches. Feel free to include them in your series if > > you want, then the CF entry (and thus cfbot) makes sense again... > I'll include them in my new patch set and also have a careful look at them. Thanks. Greetings, Andres Freund
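A tiny illustration of the underflow point (a sketch assuming the int64 ticks representation that was just committed; not code from the tree):

    int64   diff = end.ticks - start.ticks;

    if (diff < 0)       /* swapped arguments or a non-monotonic reading */
        diff = 0;
    /* with an unsigned 64-bit tick count the same mistake would silently
     * wrap around to a huge positive value instead of being detectable */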
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-23 18:49:37 +0100, David Geier wrote: > On 1/21/23 05:12, Andres Freund wrote: > > We do currently do the conversion quite frequently. Admittedly I was > > partially motivated by trying to get the per-loop overhead in pg_test_timing > > down ;) > > > > But I think it's a real issue. Places where we do, but shouldn't, convert: > > > > - ExecReScan() - quite painful, we can end up with a lot of those > > - InstrStopNode() - adds a good bit of overhead to simple > InstrStopNode() doesn't convert in the general case but only for the first > tuple or when async. So it goes somewhat hand in hand with ExecReScan(). I think even the first-scan portion is likely noticable for quick queries - you can quickly end up with 5-10 nodes, even for queries processed in the < 0.1ms range. Of course it's way worse with rescans / loops. > > - PendingWalStats.wal_write_time - this is particularly bad because it > > happens > >within very contended code > > - calls to pgstat_count_buffer_read_time(), > > pgstat_count_buffer_write_time() - > >they can be very frequent > > - pgbench.c, as we already discussed > > - pg_stat_statements.c > > - ... > > > > These all will get a bit slower when moving to a "variable" frequency. > I wonder if we will be able to measure any of them easily. But given that > it's many more places than I had realized and given that the optimized code > is not too involved, let's give it a try. I think at least some should be converted to just accumulate in an instr_time... > > What was your approach for avoiding the costly operation? I ended up with a > > integer multiplication + shift approximation for the floating point > > multiplication (which in turn uses the inverse of the division by the > > frequency). To allow for sufficient precision while also avoiding > > overflows, I > > had to make that branch conditional, with a slow path for large numbers of > > nanoseconds. > > It seems like we ended up with the same. I do: > > sec = ticks / frequency_hz > ns = ticks / frequency_hz * 1,000,000,000 > ns = ticks * (1,000,000,000 / frequency_hz) > ns = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz > > Now, the constant scaling factor in parentheses is typically a floating > point number. For example for a frequency of 2.5 GHz it would be 2.5. To > work around that we can do something like: > > ns = ticks * (1,000,000 * scaler / frequency_khz) / scaler > > Where scaler is a power-of-2, big enough to maintain enough precision while > allowing for a shift to implement the division. Yep, at least quite similar. > The additional multiplication with scaler makes that the maximum range go > down, because we must ensure we never overflow. I'm wondering if we cannot > pick scaler in such a way that remaining range of cycles is large enough for > our use case and we can therefore live without bothering for the overflow > case. What would be "enough"? 1 year? 10 years? ... Depending on how low we want to keep the error, I don't think we can: If I set the allowed deviation to 10**-9, we end up requiring a shift by 29 for common ghz ranges. Clearly 33bits isn't an interesting range. But even if you accept a higher error - we don't have *that* much range available. Assuming an uint64, the range is ~584 years. If we want 10 years range, we end up math.log(((2**64)-1) / (10 * 365 * 60 * 60 * 24 * 10**9), 2) ~= 5.87 So 5 bits available that we could "use" for multiply/shift. For something like 2.5ghz, that'd be ~2% error, clearly not acceptable. 
And even just a year of range ends up allowing an error of 30796s ≈ 8.5 hours over a year, still too high. But I don't think it's really an issue - normally that branch will never be taken (at least within the memory of the branch predictor), which on modern CPUs means it'll just be predicted as not taken. So as long as we tell the compiler what's the likely branch, it should be fine. At least as long as the branch compares with a hardcoded number. > > I think it'd be great - but I'm not sure we're there yet, reliability and > > code-complexity wise. > Thanks to your commits, the diff of the new patch set will be already much > smaller and easier to review. What's your biggest concern in terms of > reliability?
- the restriction just to linux, that'll make testing harder for some, and ends up encoding too much OS dependency
- I think we need both the barrier and non-barrier variant, otherwise I suspect we'll end up with inaccuracies we don't want
- needs lots more documentation about why certain cpuid registers are used
- cpu microarch dependencies - isn't there, e.g., the case that the scale on nehalem has to be different than on later architectures?
- lack of facility to evaluate how well the different time sources work
> > I think it might be worth makign the rdts aspect somewhat measurable. E.g. allowing pg_test_timing to use both at the same time, and have it compare elapsed time with both sources of counters. I haven't yet looked into pg_test_timing. I'll do that while including your patches into the new patch set. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/21/23 06:31, Andres Freund wrote: I pushed the int64-ification commits. Great. I started rebasing. One thing I was wondering about: why did you choose to use a signed instead of an unsigned 64-bit integer for the ticks? If you have time to look at the pg_test_timing part, it'd be appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit hesitant to push it. I haven't yet pushed the pg_test_timing (nor it's small prerequisite) patch. I've attached those two patches. Feel free to include them in your series if you want, then the CF entry (and thus cfbot) makes sense again... I'll include them in my new patch set and also have a careful look at them. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/21/23 05:12, Andres Freund wrote: We do currently do the conversion quite frequently. Admittedly I was partially motivated by trying to get the per-loop overhead in pg_test_timing down ;) But I think it's a real issue. Places where we do, but shouldn't, convert: - ExecReScan() - quite painful, we can end up with a lot of those - InstrStopNode() - adds a good bit of overhead to simple InstrStopNode() doesn't convert in the general case but only for the first tuple or when async. So it goes somewhat hand in hand with ExecReScan(). - PendingWalStats.wal_write_time - this is particularly bad because it happens within very contended code - calls to pgstat_count_buffer_read_time(), pgstat_count_buffer_write_time() - they can be very frequent - pgbench.c, as we already discussed - pg_stat_statements.c - ... These all will get a bit slower when moving to a "variable" frequency. I wonder if we will be able to measure any of them easily. But given that it's many more places than I had realized and given that the optimized code is not too involved, let's give it a try. What was your approach for avoiding the costly operation? I ended up with a integer multiplication + shift approximation for the floating point multiplication (which in turn uses the inverse of the division by the frequency). To allow for sufficient precision while also avoiding overflows, I had to make that branch conditional, with a slow path for large numbers of nanoseconds. It seems like we ended up with the same. I do:
sec = ticks / frequency_hz
ns = ticks / frequency_hz * 1,000,000,000
ns = ticks * (1,000,000,000 / frequency_hz)
ns = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
Now, the constant scaling factor in parentheses is typically a floating point number. For example for a frequency of 2.5 GHz it would be 0.4. To work around that we can do something like:
ns = ticks * (1,000,000 * scaler / frequency_khz) / scaler
Where scaler is a power-of-2, big enough to maintain enough precision while allowing for a shift to implement the division. The additional multiplication with scaler makes the maximum range go down, because we must ensure we never overflow. I'm wondering if we cannot pick scaler in such a way that the remaining range of cycles is large enough for our use case and we can therefore live without bothering for the overflow case. What would be "enough"? 1 year? 10 years? ... Otherwise, we indeed need code that cares for the potential overflow. My hunch is that it can be done branchless, but it for sure adds dependent instructions. Maybe in that case a branch that almost certainly will never be taken is better? I'll include the code in the new patch set, which I'll submit tomorrow at the latest. I think it'd be great - but I'm not sure we're there yet, reliability and code-complexity wise. Thanks to your commits, the diff of the new patch set will be already much smaller and easier to review. What's your biggest concern in terms of reliability? I think it might be worth makign the rdts aspect somewhat measurable. E.g. allowing pg_test_timing to use both at the same time, and have it compare elapsed time with both sources of counters. I haven't yet looked into pg_test_timing. I'll do that while including your patches into the new patch set. -- David Geier (ServiceNow)
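As a concrete rendering of the scheme above (identifiers and the shift width are made up for illustration; this is neither patch version's actual code):

    #include <stdint.h>

    #define TICKS_SHIFT 29              /* "scaler" = 2^29 */

    static int64_t ticks_to_ns_mult;    /* precomputed once at startup */

    static void
    init_ticks_to_ns(int64_t tsc_freq_khz)
    {
        /* 1,000,000 * scaler / frequency_khz, i.e. nanoseconds per tick
         * scaled up by 2^29; about 0.4 * 2^29 for a 2.5 GHz TSC */
        ticks_to_ns_mult = (INT64_C(1000000) << TICKS_SHIFT) / tsc_freq_khz;
    }

    static inline int64_t
    ticks_to_ns(int64_t ticks)
    {
        /* fast path only; a real implementation needs a slow path once
         * ticks * ticks_to_ns_mult no longer fits into 64 bits */
        return (ticks * ticks_to_ns_mult) >> TICKS_SHIFT;
    }

The precomputed overflow threshold discussed elsewhere in the thread is then simply INT64_MAX / ticks_to_ns_mult, calculated alongside the multiplier.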
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-23 18:23:17 +0100, David Geier wrote: > On 1/21/23 05:14, Andres Freund wrote: > > The elapsed time is already inherently unstable, so we shouldn't have any > > test > > output showing the time. > > > > But I doubt showing it in every explain is a good idea - we use instr_time > > in > > plenty of other places. Why show it in explain, but not in all those other > > places? > > Yeah. I thought it would only be an issue if we showed it unconditionally in > EXPLAIN ANALYZE. If we only show it with TIMING ON, we're likely fine with > pretty much all regression tests. If we add it, it probably shouldn't depend on TIMING, but on SUMMARY. Regression test queries showing EXPLAIN ANALYZE output all do something like EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF) the SUMMARY OFF gets rid of the "top-level" "Planning Time" and "Execution Time", whereas the TIMING OFF gets rid of the per-node timing. Those are separate options because per-node timing is problematic performance-wise (right now), but whole-query timing rarely is. > But given the different opinions, I'll leave it out in the new patch set for > the moment being. Makes sense. Another, independent, thing worth thinking about: I think we might want to expose both rdtsc and rdtscp. For something like InstrStartNode()/InstrStopNode(), avoiding the "one-way barrier" of rdtscp is quite important to avoid changing the query performance. But for measuring whole-query time, we likely want to measure the actual time. It probably won't matter hugely for the whole query time - the out of order window of modern CPUs is large, but not *that* large - but I don't think we can generally assume that. I'm thinking of something like INSTR_TIME_SET_CURRENT() and INSTR_TIME_SET_CURRENT_FAST() or _NOBARRIER(). Greetings, Andres Freund
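For concreteness, the difference between the two proposed variants could look like this on x86-64 (a sketch assuming GCC/Clang intrinsics; the function names are illustrative, only the macro names above are the actual proposal):

    #include <stdint.h>
    #include <x86intrin.h>

    /* INSTR_TIME_SET_CURRENT_FAST() / _NOBARRIER() candidate: plain RDTSC,
     * which the CPU may reorder relative to surrounding instructions */
    static inline uint64_t
    rdtsc_unordered(void)
    {
        return __rdtsc();
    }

    /* INSTR_TIME_SET_CURRENT() candidate: RDTSCP waits until all prior
     * instructions have executed before reading the counter */
    static inline uint64_t
    rdtsc_ordered(void)
    {
        unsigned int aux;

        return __rdtscp(&aux);      /* aux receives IA32_TSC_AUX */
    }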
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 1/21/23 05:14, Andres Freund wrote: The elapsed time is already inherently unstable, so we shouldn't have any test output showing the time. But I doubt showing it in every explain is a good idea - we use instr_time in plenty of other places. Why show it in explain, but not in all those other places? Yeah. I thought it would only be an issue if we showed it unconditionally in EXPLAIN ANALYZE. If we only show it with TIMING ON, we're likely fine with pretty much all regression tests. But given the different opinions, I'll leave it out in the new patch set for the moment being. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-20 21:31:57 -0800, Andres Freund wrote: > On 2023-01-20 20:16:13 -0800, Andres Freund wrote: > > I'm planning to push most of my changes soon, had hoped to get to it a bit > > sooner, but ... > > I pushed the int64-ification commits. There's an odd compilation failure on AIX. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly=2023-01-21%2007%3A01%3A42 /opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY -qnoansialias -g -O2 -qmaxmem=33554432 -qsuppress=1500-010:1506-995 -qsuppress=1506-010:1506-416:1506-450:1506-480:1506-481:1506-492:1506-944:1506-1264 -qinfo=all:nocnd:noeff:noext:nogot:noini:noord:nopar:noppc:norea:nouni:nouse -qinfo=nounset -qvisibility=hidden -I. -I. -I/opt/freeware/include/python3.5m -I../../../src/include -I/home/nm/sw/nopath/icu58.3-64/include -I/home/nm/sw/nopath/libxml2-64/include/libxml2 -I/home/nm/sw/nopath/uuid-64/include -I/home/nm/sw/nopath/openldap-64/include -I/home/nm/sw/nopath/icu58.3-64/include -I/home/nm/sw/nopath/libxml2-64/include -c -o plpy_cursorobject.o plpy_cursorobject.c "../../../src/include/portability/instr_time.h", line 116.9: 1506-304 (I) No function prototype given for "clock_gettime". "../../../src/include/portability/instr_time.h", line 116.23: 1506-045 (S) Undeclared identifier CLOCK_REALTIME. : recipe for target 'plpy_cursorobject.o' failed but files including instr_time.h *do* build successfully, e.g. instrument.c: /opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY -qnoansialias -g -O2 -qmaxmem=33554432 -qsuppress=1500-010:1506-995 -qsuppress=1506-010:1506-416:1506-450:1506-480:1506-481:1506-492:1506-944:1506-1264 -qinfo=all:nocnd:noeff:noext:nogot:noini:noord:nopar:noppc:norea:nouni:nouse -qinfo=nounset -I../../../src/include -I/home/nm/sw/nopath/icu58.3-64/include -I/home/nm/sw/nopath/libxml2-64/include/libxml2 -I/home/nm/sw/nopath/uuid-64/include -I/home/nm/sw/nopath/openldap-64/include -I/home/nm/sw/nopath/icu58.3-64/include -I/home/nm/sw/nopath/libxml2-64/include -c -o instrument.o instrument.c Before the change the clock_gettime() call was in a macro and thus could be referenced even without a prior declaration, as long as places using INSTR_TIME_SET_CURRENT() had all the necessary includes and defines. Argh: There's nice bit in plpython.h: /* * Include order should be: postgres.h, other postgres headers, plpython.h, * other plpython headers. (In practice, other plpython headers will also * include this file, so that they can compile standalone.) */ #ifndef POSTGRES_H #error postgres.h must be included before plpython.h #endif /* * Undefine some things that get (re)defined in the Python headers. They aren't * used by the PL/Python code, and all PostgreSQL headers should be included * earlier, so this should be pretty safe. */ #undef _POSIX_C_SOURCE #undef _XOPEN_SOURCE the relevant stuff in time.h is indeed guarded by #if _XOPEN_SOURCE>=500 I don't think the plpython actually code follows the rule about including all postgres headers earlier. 
plpy_typeio.h: #include "access/htup.h" #include "fmgr.h" #include "plpython.h" #include "utils/typcache.h" plpy_curserobject.c: #include "access/xact.h" #include "catalog/pg_type.h" #include "mb/pg_wchar.h" #include "plpy_cursorobject.h" #include "plpy_elog.h" #include "plpy_main.h" #include "plpy_planobject.h" #include "plpy_procedure.h" #include "plpy_resultobject.h" #include "plpy_spi.h" #include "plpython.h" #include "utils/memutils.h" It strikes me as a uh, not good idea to undefine _POSIX_C_SOURCE, _XOPEN_SOURCE. The include order aspect was perhaps feasible when there just was plpython.c, but with the split into many different C files and many headers, it seems hard to maintain. There's a lot of violations afaics. The undefines were added in a11cf433413, the split in 147c2482542. Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-20 20:16:13 -0800, Andres Freund wrote: > On 2023-01-18 14:05:35 +0100, David Geier wrote: > > @Andres: will you take care of these changes and provide me with an updated > > patch set so I can rebase the RDTSC changes? > > Otherwise, I can also apply Tom suggestions to your patch set and send out > > the complete patch set. > > I'm planning to push most of my changes soon, had hoped to get to it a bit > sooner, but ... I pushed the int64-ification commits. > If you have time to look at the pg_test_timing part, it'd be > appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit > hesitant to push it. I haven't yet pushed the pg_test_timing (nor it's small prerequisite) patch. Thanks to Justin I've polished the pg_test_timing docs some. I've attached those two patches. Feel free to include them in your series if you want, then the CF entry (and thus cfbot) makes sense again... Greetings, Andres Freund >From 2546f3000455a7086ea930986b294fd79024ea59 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Fri, 20 Jan 2023 15:31:54 -0800 Subject: [PATCH v9 1/2] instr_time: Add INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop without having to convert into time units (which is costly). INSTR_TIME_IS_LT() can be used to check the loop condition. Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/include/portability/instr_time.h | 12 1 file changed, 12 insertions(+) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index cc85138e21f..aab80effb00 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -15,6 +15,8 @@ * * INSTR_TIME_IS_ZERO(t) is t equal to zero? 
* + * INSTR_TIME_IS_LT(x, y) x < y + * * INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too) * * INSTR_TIME_SET_CURRENT(t) set t to current time @@ -22,6 +24,8 @@ * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero, * evaluates to whether t changed * + * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds + * * INSTR_TIME_ADD(x, y)x += y * * INSTR_TIME_SUBTRACT(x, y) x -= y @@ -122,6 +126,9 @@ pg_clock_gettime_ns(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_clock_gettime_ns()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = NS_PER_S * (s)) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) (t).ticks) @@ -156,6 +163,9 @@ GetTimerFrequency(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_query_performance_counter()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = s * GetTimerFrequency()) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency( @@ -168,6 +178,8 @@ GetTimerFrequency(void) #define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0) +#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks) + #define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0) -- 2.38.0 >From a3149e3430c9594ca2ceb3039c954e729bd7c46e Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 16 Jan 2023 11:19:11 -0800 Subject: [PATCH v9 2/2] wip: report nanoseconds in pg_test_timing This commit also updates pg_test_timing's documentation: - compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of of statement with/without EXPLAIN ANALYZE - explain the 2x overhead (due to two timestamp acquisitions per row) - remove old section about old versions of linux - I couldn't update the numbers, and it's old enough nobody would care --- src/bin/pg_test_timing/pg_test_timing.c | 74 +-- doc/src/sgml/ref/pgtesttiming.sgml | 117 ++-- 2 files changed, 95 insertions(+), 96 deletions(-) diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c index c29d6f87629..e20718669a5 100644 --- a/src/bin/pg_test_timing/pg_test_timing.c +++ b/src/bin/pg_test_timing/pg_test_timing.c @@ -19,8 +19,8 @@ static void handle_args(int argc, char *argv[]); static uint64 test_timing(unsigned int duration); static void output(uint64 loop_count); -/* record duration in powers of 2 microseconds */ -long long int histogram[32]; +/* record duration in powers of 2 nanoseconds */ +uint64 histogram[64]; int main(int argc, char *argv[]) @@ -121,35 +121,48 @@ handle_args(int argc, char *argv[]) static uint64 test_timing(unsigned int duration) { - uint64 total_time; - int64 time_elapsed = 0; uint64 loop_count = 0; - uint64 prev, -cur; + instr_time until_time, +total_time; instr_time start_time, -end_time, -temp; - - total_time = duration > 0 ? duration * INT64CONST(100) : 0; +end_time; + instr_time cur; INSTR_TIME_SET_CURRENT(start_time); - cur = INSTR_TIME_GET_MICROSEC(start_time); - while (time_elapsed < total_time) + /* + * To reduce loop overhead, check loop condition in instr_time domain. + */ +
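To illustrate what the 0001 patch above is for (a usage sketch only, the real loop is in the attached pg_test_timing patch; duration is an assumed variable holding the test duration in seconds): the new macros let a time-bound loop compare against a precomputed end point in the instr_time domain instead of converting the current reading into time units on every iteration.

    instr_time  start_time,
                until_time,
                cur;
    uint64      loop_count = 0;
    unsigned int duration = 3;      /* seconds to run */

    INSTR_TIME_SET_CURRENT(start_time);
    INSTR_TIME_SET_SECONDS(until_time, duration);
    INSTR_TIME_ADD(until_time, start_time);

    do
    {
        INSTR_TIME_SET_CURRENT(cur);
        loop_count++;
    } while (INSTR_TIME_IS_LT(cur, until_time));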
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On 2023-01-20 22:50:37 -0600, Justin Pryzby wrote: > On Fri, Jan 20, 2023 at 04:40:32PM -0800, Andres Freund wrote: > > From 5a458d4584961dedd3f80a07d8faea66e57c5d94 Mon Sep 17 00:00:00 2001 > > From: Andres Freund > > Date: Mon, 16 Jan 2023 11:19:11 -0800 > > Subject: [PATCH v8 4/5] wip: report nanoseconds in pg_test_timing > > > > > - The i7-860 system measured runs the count query in 9.8 ms while > > - the EXPLAIN ANALYZE version takes 16.6 ms, each > > - processing just over 100,000 rows. That 6.8 ms difference means the > > timing > > - overhead per row is 68 ns, about twice what pg_test_timing estimated it > > - would be. Even that relatively small amount of overhead is making the > > fully > > - timed count statement take almost 70% longer. On more substantial > > queries, > > - the timing overhead would be less problematic. > > + The i9-9880H system measured shows an execution time of 4.116 ms for the > > + TIMING OFF query, and 6.965 ms for the > > + TIMING ON, each processing 100,000 rows. > > + > > + That 2.849 ms difference means the timing overhead per row is 28 ns. As > > + TIMING ON measures timestamps twice per row returned > > by > > + an executor node, the overhead is very close to what pg_test_timing > > + estimated it would be. > > + > > + more than what pg_test_timing estimated it would be. Even that > > relatively > > + small amount of overhead is making the fully timed count statement take > > + about 60% longer. On more substantial queries, the timing overhead > > would > > + be less problematic. > > I guess you intend to merge these two paragraphs ? Oops. I was intending to drop the last paragraph. Looking at the docs again I noticed that I needed to rephrase the 'acpi_pm' section further, as I'd left the "a small multiple of what's measured directly by this utility" language in there. Do the changes otherwise make sense? The "small multiple" stuff was just due to a) comparing "raw statement" with explain analyze b) not accounting for two timestamps being taken per row. I think it makes sense to remove the "jiffies" section - the output shown is way outdated. And I don't think the jiffies time counter is something one still sees in the wild, outside of bringing up a new cpu architecture or such. Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Fri, Jan 20, 2023 at 04:40:32PM -0800, Andres Freund wrote: > From 5a458d4584961dedd3f80a07d8faea66e57c5d94 Mon Sep 17 00:00:00 2001 > From: Andres Freund > Date: Mon, 16 Jan 2023 11:19:11 -0800 > Subject: [PATCH v8 4/5] wip: report nanoseconds in pg_test_timing > > - The i7-860 system measured runs the count query in 9.8 ms while > - the EXPLAIN ANALYZE version takes 16.6 ms, each > - processing just over 100,000 rows. That 6.8 ms difference means the > timing > - overhead per row is 68 ns, about twice what pg_test_timing estimated it > - would be. Even that relatively small amount of overhead is making the > fully > - timed count statement take almost 70% longer. On more substantial > queries, > - the timing overhead would be less problematic. > + The i9-9880H system measured shows an execution time of 4.116 ms for the > + TIMING OFF query, and 6.965 ms for the > + TIMING ON, each processing 100,000 rows. > + > + That 2.849 ms difference means the timing overhead per row is 28 ns. As > + TIMING ON measures timestamps twice per row returned by > + an executor node, the overhead is very close to what pg_test_timing > + estimated it would be. > + > + more than what pg_test_timing estimated it would be. Even that relatively > + small amount of overhead is making the fully timed count statement take > + about 60% longer. On more substantial queries, the timing overhead would > + be less problematic. I guess you intend to merge these two paragraphs ?
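For anyone checking the arithmetic in the quoted paragraph: 6.965 ms - 4.116 ms = 2.849 ms over 100,000 rows is about 28 ns of overhead per row, and since TIMING ON takes two timestamps per row returned by an executor node that is roughly 14 ns per individual timing call, the same order as pg_test_timing's per-loop figure (the 14 ns per-call number is an inference, not something the doc patch states).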
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-18 14:02:48 +0100, David Geier wrote: > On 1/16/23 18:37, Andres Freund wrote: > > I'm doubtful this is worth the complexity it incurs. By the time we convert > > out of the instr_time format, the times shouldn't be small enough that the > > accuracy is affected much. > > I don't feel strong about it and you have a point that we most likely only > convert ones we've accumulated a fair amount of cycles. I think we can avoid the issue another way. The inaccuracy comes from the cycles_to_sec ending up very small, right? Right now your patch has (and probably my old version similarly had): cycles_to_sec = 1.0 / (tsc_freq * 1000); I think it's better if we have one multiplier to convert cycles to nanoseconds - that'll be a double comparatively close to 1. We can use that to implement INSTR_TIME_GET_NANOSECONDS(). The conversion to microseconds then is just a division by 1000 (which most compilers convert into a multiplication/shift combo), and the conversions to milliseconds and seconds will be similar. Because we'll never "wrongly" go into the "huge number" or "very small number" ranges, that should provide sufficient precision? We'll of course still end up with a very small number when converting a few nanoseconds to seconds, but that's ok because it's the precision being asked for, instead of losing precision in some intermediate representation. > > Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC() > > actually accumulate themselves, and should instead keep things in the > > instr_time format and convert later. We'd win more accuracy / speed that > > way. > > > > I don't think the introduction of pg_time_usec_t was a great idea, but oh > > well. > Fully agreed. Why not replacing pg_time_usec_t with instr_time in a separate > patch? pgbench used to use instr_time, but it was replaced by somebody thinking the API is too cumbersome. Which I can't quite deny, even though I think the specific change isn't great. But yes, this should definitely be a separate patch. Greetings, Andres Freund
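Spelled out, the accessor scheme being described might look as follows (illustrative macro bodies only; the actual instr_time.h definitions differ, and ns_per_cycle is an assumed name):

    static double ns_per_cycle;     /* e.g. 0.4 for a 2.5 GHz TSC, versus
                                     * cycles_to_sec on the order of 1e-10 */

    #define INSTR_TIME_GET_NANOSEC(t)   ((int64) ((t).ticks * ns_per_cycle))
    #define INSTR_TIME_GET_MICROSEC(t)  (INSTR_TIME_GET_NANOSEC(t) / 1000)
    #define INSTR_TIME_GET_MILLISEC(t)  (INSTR_TIME_GET_NANOSEC(t) / 1000000.0)
    #define INSTR_TIME_GET_DOUBLE(t)    (INSTR_TIME_GET_NANOSEC(t) / 1000000000.0)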
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-18 14:05:35 +0100, David Geier wrote: > @Andres: will you take care of these changes and provide me with an updated > patch set so I can rebase the RDTSC changes? > Otherwise, I can also apply Tom suggestions to your patch set and send out > the complete patch set. I'm planning to push most of my changes soon, had hoped to get to it a bit sooner, but ... If you have time to look at the pg_test_timing part, it'd be appreciated. That's a bit larger, and nobody looked at it yet. So I'm a bit hesitant to push it. Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-20 07:43:00 +0100, David Geier wrote: > On 1/18/23 13:52, David Geier wrote: > > On 1/16/23 21:39, Pavel Stehule wrote: > > > > > > po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra > > > napsal: > > > > > > Hi, > > > > > > there's minor bitrot in the Mkvcbuild.pm change, making cfbot > > > unhappy. > > > > > > As for the patch, I don't have much comments. I'm wondering if > > > it'd be > > > useful to indicate which timing source was actually used for EXPLAIN > > > ANALYZE, say something like: > > > > > > Planning time: 0.197 ms > > > Execution time: 0.225 ms > > > Timing source: clock_gettime (or tsc) > > > > > > +1 > > > > I like the idea of exposing the timing source in the EXPLAIN ANALYZE > > output. > > It's a good tradeoff between inspectability and effort, given that RDTSC > > should always be better to use. > > If there are no objections I go this way. > Thinking about this a little more made me realize that this will cause > different pg_regress output depending on the platform. So if we go this > route we would at least need an option for EXPLAIN ANALYZE to disable it. Or > rather have it disabled by default and allow for enabling it. Thoughts? The elapsed time is already inherently unstable, so we shouldn't have any test output showing the time. But I doubt showing it in every explain is a good idea - we use instr_time in plenty of other places. Why show it in explain, but not in all those other places? Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-19 11:47:49 +0100, David Geier wrote: > > I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out > > how to do the cycles->nanosecond conversion with integer shift and multiply > > in > > the common case, which does show a noticable speedup. But that's for another > > day. > I also have code for that here. I decided against integrating it because we > don't convert frequently enough to make it matter. Or am I missing > something? We do currently do the conversion quite frequently. Admittedly I was partially motivated by trying to get the per-loop overhead in pg_test_timing down ;) But I think it's a real issue. Places where we do, but shouldn't, convert:
- ExecReScan() - quite painful, we can end up with a lot of those
- InstrStopNode() - adds a good bit of overhead to simple
- PendingWalStats.wal_write_time - this is particularly bad because it happens within very contended code
- calls to pgstat_count_buffer_read_time(), pgstat_count_buffer_write_time() - they can be very frequent
- pgbench.c, as we already discussed
- pg_stat_statements.c
- ...
These all will get a bit slower when moving to a "variable" frequency. What was your approach for avoiding the costly operation? I ended up with an integer multiplication + shift approximation for the floating point multiplication (which in turn uses the inverse of the division by the frequency). To allow for sufficient precision while also avoiding overflows, I had to make that branch conditional, with a slow path for large numbers of nanoseconds. > > I fought a bit with myself about whether to send those patches in this > > thread, > > because it'll take over the CF entry. But decided that it's ok, given that > > David's patches should be rebased over these anyway? > That's alright. > Though, I would hope we attempt to bring your patch set as well as the RDTSC > patch set in. I think it'd be great - but I'm not sure we're there yet, reliability and code-complexity wise. I think it might be worth making the rdtsc aspect somewhat measurable. E.g. allowing pg_test_timing to use both at the same time, and have it compare elapsed time with both sources of counters. Greetings, Andres Freund
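The "keep it in instr_time and convert later" pattern suggested elsewhere in the thread would look roughly like this for a write path (hypothetical sketch; keeping the accumulator in instr_time is exactly the proposed change, not what the tree did at the time):

    /* accumulate I/O time in ticks ... */
    static instr_time total_write_time;    /* zero-initialized accumulator */

    static void
    timed_write(int fd, const void *buf, size_t len, off_t offset)
    {
        instr_time  start,
                    elapsed;

        INSTR_TIME_SET_CURRENT(start);
        (void) pg_pwrite(fd, buf, len, offset);
        INSTR_TIME_SET_CURRENT(elapsed);
        INSTR_TIME_SUBTRACT(elapsed, start);
        INSTR_TIME_ADD(total_write_time, elapsed);
    }

    /* ... and convert only when the stats are reported */
    static uint64
    report_write_time_usec(void)
    {
        return INSTR_TIME_GET_MICROSEC(total_write_time);
    }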
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-20 22:27:07 -0500, Tom Lane wrote: > Andres Freund writes: > >> Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions > >> could give us the best of both worlds? > > > I tried that in the attached 0005. I found that it reads better if I also > > add > > INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001 > > instead, but I wanted to get agreement on it first. > > -1 from here. This forecloses the possibility that it's best to use more > than one assignment to initialize the value, and the code doesn't read > any better than it did before. I think it does read a bit better, but it's a pretty small improvement. So I'll leave this aspect be for now. Thanks for checking. Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Andres Freund writes: >> Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions >> could give us the best of both worlds? > I tried that in the attached 0005. I found that it reads better if I also add > INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001 > instead, but I wanted to get agreement on it first. -1 from here. This forecloses the possibility that it's best to use more than one assignment to initialize the value, and the code doesn't read any better than it did before. regards, tom lane
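For reference, the shape of the INSTR_TIME_ZERO()/INSTR_TIME_CURRENT() idea under discussion (hypothetical definitions for the unix code path; the actual 0005 patch mentioned in the message below is not included in the quoted text, so these bodies are guesses):

    #define INSTR_TIME_ZERO()       ((instr_time) {.ticks = 0})
    #define INSTR_TIME_CURRENT()    pg_clock_gettime_ns()

    /* allowing initialization right at the definition, e.g.: */
    instr_time  io_start = track_io_timing ? INSTR_TIME_CURRENT() : INSTR_TIME_ZERO();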
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-17 10:50:53 -0800, Andres Freund wrote: > On 2023-01-17 12:26:57 -0500, Tom Lane wrote: > > > 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to > > > warn > > > Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and > > > just allow to assign 0. > > > > I think it's probably wise to keep the macro. If we ever rethink this > > again, we'll be glad we kept it. Similarly, IS_ZERO is a good idea > > even if it would work with just compare-to-zero. > > Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions > could give us the best of both worlds? I tried that in the attached 0005. I found that it reads better if I also add INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001 instead, but I wanted to get agreement on it first. Comments? > > I'm almost tempted to suggest you define instr_time as a struct with a > > uint64 field, just to help keep us honest about that. > > I can see that making sense. Unless somebody pipes up with opposition to that > plan soon, I'll see how it goes. Done in the attached. I think it looks good. Actually found a type confusion buglet in 0004, so the type safety benefit is noticable. It does require a new INSTR_TIME_IS_LT() for the loop exit condition in 0004, but that seems fine. Besides cosmetic stuff I also added back the cast to double in window's INSTR_TIME_GET_NANOSEC() - I think there's an overflow danger without it. We should make this faster by pre-computing (double) NS_PER_S / GetTimerFrequency() once, as that'd avoid doing the the slow division on every conversion. But that's an old issue and thus better tackled separately. Greetings, Andres Freund >From dfea7f159b79aaf1fb6f4c8617614f6fe5c1176f Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 16 Jan 2023 10:04:42 -0800 Subject: [PATCH v8 1/5] Zero initialize instr_time uses causing compiler warnings These are all not necessary from a correctness POV. However, in a subsequent patch instr_time will be simplified to an int64, at which point gcc would otherwise start to warn about the changed places. 
Author: Reviewed-by: Discussion: https://postgr.es/m/20230116023639.rn36vf6ajqmfc...@awork3.anarazel.de Backpatch: --- src/backend/access/transam/xlog.c | 4 src/backend/storage/buffer/bufmgr.c | 4 src/backend/storage/file/buffile.c | 4 src/backend/storage/ipc/latch.c | 2 ++ src/bin/psql/common.c | 6 ++ 5 files changed, 20 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index cc0d9a05d9f..fb4c860bdea 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2191,6 +2191,8 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) /* Measure I/O timing to write WAL data */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); +else + INSTR_TIME_SET_ZERO(start); pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE); written = pg_pwrite(openLogFile, from, nleft, startoffset); @@ -8151,6 +8153,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) /* Measure I/O timing to sync the WAL file */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); + else + INSTR_TIME_SET_ZERO(start); pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC); switch (sync_method) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 8075828e8a6..800a4248c95 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1017,6 +1017,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else +INSTR_TIME_SET_ZERO(io_start); smgrread(smgr, forkNum, blockNum, (char *) bufBlock); @@ -2902,6 +2904,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); /* * bufToWrite is either the shared buffer or a copy, as appropriate. diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c index c5464b6aa62..0a51624df3b 100644 --- a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -446,6 +446,8 @@ BufFileLoadBuffer(BufFile *file) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); /* * Read whatever we can get, up to a full bufferload. @@ -525,6 +527,8 @@ BufFileDumpBuffer(BufFile *file) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); bytestowrite = FileWrite(thisfile, file->buffer.data + wpos, diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index d79d71a8515..f4123e7de7e 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On 1/20/23 07:43, David Geier wrote: > On 1/18/23 13:52, David Geier wrote: >> On 1/16/23 21:39, Pavel Stehule wrote: >>> >>> po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra >>> napsal: >>> >>> Hi, >>> >>> there's minor bitrot in the Mkvcbuild.pm change, making cfbot >>> unhappy. >>> >>> As for the patch, I don't have much comments. I'm wondering if >>> it'd be >>> useful to indicate which timing source was actually used for EXPLAIN >>> ANALYZE, say something like: >>> >>> Planning time: 0.197 ms >>> Execution time: 0.225 ms >>> Timing source: clock_gettime (or tsc) >>> >>> +1 >> >> I like the idea of exposing the timing source in the EXPLAIN ANALYZE >> output. >> It's a good tradeoff between inspectability and effort, given that >> RDTSC should always be better to use. >> If there are no objections I go this way. > Thinking about this a little more made me realize that this will cause > different pg_regress output depending on the platform. So if we go this > route we would at least need an option for EXPLAIN ANALYZE to disable > it. Or rather have it disabled by default and allow for enabling it. > Thoughts? > What about only showing it for VERBOSE mode? I don't think there are very many tests doing EXPLAIN (ANALYZE, VERBOSE) - a quick grep found one such place in partition_prune.sql. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On 1/18/23 13:52, David Geier wrote: On 1/16/23 21:39, Pavel Stehule wrote: po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra napsal: Hi, there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy. As for the patch, I don't have much comments. I'm wondering if it'd be useful to indicate which timing source was actually used for EXPLAIN ANALYZE, say something like: Planning time: 0.197 ms Execution time: 0.225 ms Timing source: clock_gettime (or tsc) +1 I like the idea of exposing the timing source in the EXPLAIN ANALYZE output. It's a good tradeoff between inspectability and effort, given that RDTSC should always be better to use. If there are no objections I go this way. Thinking about this a little more made me realize that this will cause different pg_regress output depending on the platform. So if we go this route we would at least need an option for EXPLAIN ANALYZE to disable it. Or rather have it disabled by default and allow for enabling it. Thoughts? -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi Andres, I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out how to do the cycles->nanosecond conversion with integer shift and multiply in the common case, which does show a noticable speedup. But that's for another day. I also have code for that here. I decided against integrating it because we don't convert frequently enough to make it matter. Or am I missing something? I fought a bit with myself about whether to send those patches in this thread, because it'll take over the CF entry. But decided that it's ok, given that David's patches should be rebased over these anyway? That's alright. Though, I would hope we attempt to bring your patch set as well as the RDTSC patch set in. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, @Andres: will you take care of these changes and provide me with an updated patch set so I can rebase the RDTSC changes? Otherwise, I can also apply Tom's suggestions to your patch set and send out the complete patch set. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On 1/16/23 18:37, Andres Freund wrote: Hi, On 2023-01-02 14:28:20 +0100, David Geier wrote: I'm doubtful this is worth the complexity it incurs. By the time we convert out of the instr_time format, the times shouldn't be small enough that the accuracy is affected much. I don't feel strong about it and you have a point that we most likely only convert once we've accumulated a fair amount of cycles. Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC() actually accumulate themselves, and should instead keep things in the instr_time format and convert later. We'd win more accuracy / speed that way. I don't think the introduction of pg_time_usec_t was a great idea, but oh well. Fully agreed. Why not replace pg_time_usec_t with instr_time in a separate patch? I haven't looked carefully enough if all occurrences could be sanely replaced but at least the ones that accumulate time seem good starting points. Additionally, I initialized a few variables of type instr_time which otherwise resulted in warnings due to use of potentially uninitialized variables. Unless we decide, as I suggested downthread, that we deprecate INSTR_TIME_SET_ZERO(), that's unfortunately not the right fix. I've a similar patch that adds all the necessary INSTR_TIME_SET_ZERO() calls. I don't feel strong about it, but like Tom tend towards keeping the initialization macro. Thanks for improving on the first patch and fixing these issues in a better way. What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that it's consistent with the _MILLISEC() and _MICROSEC() variants? The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants return double. This seems error prone. What about renaming the function or also have the function return a double and cast where necessary at the call site? I think those should be a separate discussion / patch. OK. I'll propose follow-on patches once we're done with the ones at hand. I'll then rebase the RDTSC patches on your patch set. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On 1/16/23 21:39, Pavel Stehule wrote: po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra napsal: Hi, there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy. As for the patch, I don't have much comments. I'm wondering if it'd be useful to indicate which timing source was actually used for EXPLAIN ANALYZE, say something like: Planning time: 0.197 ms Execution time: 0.225 ms Timing source: clock_gettime (or tsc) There has been a proposal to expose this as a GUC (or perhaps as explain option), to allow users to pick what timing source to use. I wouldn't go that far - AFAICS is this is meant to be universally better when available. But knowing which source was used seems useful. +1 Thanks for looking at the patch. I'll fix the merge conflict. I like the idea of exposing the timing source in the EXPLAIN ANALYZE output. It's a good tradeoff between inspectability and effort, given that RDTSC should always be better to use. If there are no objections I go this way. -- David Geier (ServiceNow)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-17 12:26:57 -0500, Tom Lane wrote: > Andres Freund writes: > > Here's an updated version of the move to representing instr_time as > > nanoseconds. It's now split into a few patches: > > I took a quick look through this. Thanks! > > 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to > > warn > > Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and > > just allow to assign 0. > > I think it's probably wise to keep the macro. If we ever rethink this > again, we'll be glad we kept it. Similarly, IS_ZERO is a good idea > even if it would work with just compare-to-zero. Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions could give us the best of both worlds? > I'm almost tempted to suggest you define instr_time as a struct with a > uint64 field, just to help keep us honest about that. I can see that making sense. Unless somebody pipes up with opposition to that plan soon, I'll see how it goes. > > 0003) Add INSTR_TIME_SET_SECOND() > > This is used in 0004. Just allows setting an instr_time to a time in > > seconds, allowing for a cheaper loop exit condition in 0004. > > Code and comments are inconsistent about whether it's SET_SECOND or > SET_SECONDS. I think I prefer the latter, but don't care that much. That's probably because I couldn't decide... So I'll go with your preference. > > 0004) report nanoseconds in pg_test_timing > > Didn't examine 0004 in any detail, but the others look good to go > other than these nits. Thanks for looking! Greetings, Andres Freund
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Andres Freund writes: > Here's an updated version of the move to representing instr_time as > nanoseconds. It's now split into a few patches: I took a quick look through this. > 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to > warn > Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and > just allow to assign 0. I think it's probably wise to keep the macro. If we ever rethink this again, we'll be glad we kept it. Similarly, IS_ZERO is a good idea even if it would work with just compare-to-zero. I'm almost tempted to suggest you define instr_time as a struct with a uint64 field, just to help keep us honest about that. > 0003) Add INSTR_TIME_SET_SECOND() > This is used in 0004. Just allows setting an instr_time to a time in > seconds, allowing for a cheaper loop exit condition in 0004. Code and comments are inconsistent about whether it's SET_SECOND or SET_SECONDS. I think I prefer the latter, but don't care that much. > 0004) report nanoseconds in pg_test_timing Didn't examine 0004 in any detail, but the others look good to go other than these nits. regards, tom lane
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-17 08:46:12 -0500, Robert Haas wrote: > On Fri, Jan 13, 2023 at 2:56 PM Andres Freund wrote: > > Does anybody see a reason to not move forward with this aspect? We do a fair > > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by > > just using nanoseconds. We'd also save memory in BufferUsage (144-122 > > bytes), > > Instrumentation (16 bytes saved in Instrumentation itself, 32 via > > BufferUsage). Here's an updated version of the move to representing instr_time as nanoseconds. It's now split into a few patches: 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to warn Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and just allow to assign 0. 0002) Convert instr_time to uint64 This is the cleaned up version of the prior patch. The main change is that it deduplicated a lot of the code between the architectures. 0003) Add INSTR_TIME_SET_SECOND() This is used in 0004. Just allows setting an instr_time to a time in seconds, allowing for a cheaper loop exit condition in 0004. 0004) report nanoseconds in pg_test_timing I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out how to do the cycles->nanosecond conversion with integer shift and multiply in the common case, which does show a noticable speedup. But that's for another day. I fought a bit with myself about whether to send those patches in this thread, because it'll take over the CF entry. But decided that it's ok, given that David's patches should be rebased over these anyway? > I read through 0001 and it seems basically fine to me. Comments: > > 1. pg_clock_gettime_ns() doesn't follow pgindent conventions. Fixed. > 2. I'm not entirely sure that the new .?S_PER_.?S macros are > worthwhile but maybe they are, and in any case I don't care very much. There's now fewer. But those I'd like to keep. I just end up counting digits manually way too many times. > 3. I've always found 'struct timespec' to be pretty annoying > notationally, so I like the fact that this patch would reduce use of > it. Same. Greetings, Andres Freund >From c1024a9dfa7f5645200b7fa68e8bce5561c9cee0 Mon Sep 17 00:00:00 2001 From: Andres Freund Date: Mon, 16 Jan 2023 10:04:42 -0800 Subject: [PATCH v7 1/4] Zero initialize instr_time uses causing compiler warnings These are all not necessary from a correctness POV. However, in a subsequent patch instr_time will be simplified to an int64, at which point gcc would otherwise start to warn about the changed places. 
Author: Reviewed-by: Discussion: https://postgr.es/m/20230116023639.rn36vf6ajqmfc...@awork3.anarazel.de Backpatch: --- src/backend/access/transam/xlog.c | 4 src/backend/storage/buffer/bufmgr.c | 4 src/backend/storage/file/buffile.c | 4 src/bin/psql/common.c | 6 ++ 4 files changed, 18 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 8f47fb75700..7d65b1d4159 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2191,6 +2191,8 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) /* Measure I/O timing to write WAL data */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); +else + INSTR_TIME_SET_ZERO(start); pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE); written = pg_pwrite(openLogFile, from, nleft, startoffset); @@ -8150,6 +8152,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) /* Measure I/O timing to sync the WAL file */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); + else + INSTR_TIME_SET_ZERO(start); pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC); switch (sync_method) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 8075828e8a6..800a4248c95 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1017,6 +1017,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else +INSTR_TIME_SET_ZERO(io_start); smgrread(smgr, forkNum, blockNum, (char *) bufBlock); @@ -2902,6 +2904,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); /* * bufToWrite is either the shared buffer or a copy, as appropriate. diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c index c5464b6aa62..0a51624df3b 100644 --- a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -446,6 +446,8 @@ BufFileLoadBuffer(BufFile *file) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); /* * Read whatever we can get, up to a full bufferload. @@ -525,6 +527,8 @@ BufFileDumpBuffer(BufFile
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Fri, Jan 13, 2023 at 2:56 PM Andres Freund wrote: > Does anybody see a reason to not move forward with this aspect? We do a fair > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by > just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes), > Instrumentation (16 bytes saved in Instrumentation itself, 32 via > BufferUsage). I read through 0001 and it seems basically fine to me. Comments: 1. pg_clock_gettime_ns() doesn't follow pgindent conventions. 2. I'm not entirely sure that the new .?S_PER_.?S macros are worthwhile but maybe they are, and in any case I don't care very much. 3. I've always found 'struct timespec' to be pretty annoying notationally, so I like the fact that this patch would reduce use of it. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra < tomas.von...@enterprisedb.com> napsal: > Hi, > > there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy. > > As for the patch, I don't have much comments. I'm wondering if it'd be > useful to indicate which timing source was actually used for EXPLAIN > ANALYZE, say something like: > > Planning time: 0.197 ms > Execution time: 0.225 ms > Timing source: clock_gettime (or tsc) > > There has been a proposal to expose this as a GUC (or perhaps as explain > option), to allow users to pick what timing source to use. I wouldn't go > that far - AFAICS is this is meant to be universally better when > available. But knowing which source was used seems useful. > +1 Pavel > > regards > > -- > Tomas Vondra > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > >
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy. As for the patch, I don't have much comments. I'm wondering if it'd be useful to indicate which timing source was actually used for EXPLAIN ANALYZE, say something like: Planning time: 0.197 ms Execution time: 0.225 ms Timing source: clock_gettime (or tsc) There has been a proposal to expose this as a GUC (or perhaps as explain option), to allow users to pick what timing source to use. I wouldn't go that far - AFAICS is this is meant to be universally better when available. But knowing which source was used seems useful. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-02 14:28:20 +0100, David Geier wrote: > I also somewhat improved the accuracy of the cycles to milli- and > microseconds conversion functions by having two more multipliers with higher > precision. For microseconds we could also keep the computation integer-only. > I'm wondering what to best do for seconds and milliseconds. I'm currently > leaning towards just keeping it as is, because the durations measured and > converted are usually long enough that precision shouldn't be a problem. I'm doubtful this is worth the complexity it incurs. By the time we convert out of the instr_time format, the times shouldn't be small enough that the accuracy is affected much. Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC() actually accumulate themselves, and should instead keep things in the instr_time format and convert later. We'd win more accuracy / speed that way. I don't think the introduction of pg_time_usec_t was a great idea, but oh well. > Additionally, I initialized a few variables of type instr_time which > otherwise resulted in warnings due to use of potentially uninitialized > variables. Unless we decide, as I suggested downthread, that we deprecate INSTR_TIME_SET_ZERO(), that's unfortunately not the right fix. I've a similar patch that adds all the necesarry INSTR_TIME_SET_ZERO() calls. > What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that > it's consistent with the _MILLISEC() and _MICROSEC() variants? > The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants > return double. This seems error prone. What about renaming the function or > also have the function return a double and cast where necessary at the call > site? I think those should be a separate discussion / patch. Greetings, Andres Freund
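As a concrete illustration of the accumulate-then-convert pattern described above, a small sketch using the existing instr_time macros could look like this; the workload function and variable names are hypothetical, only the INSTR_TIME_* macros are the real ones from src/include/portability/instr_time.h:

#include "postgres.h"
#include "portability/instr_time.h"

extern void do_work(void);          /* hypothetical workload */

static uint64
total_work_time_us(int iterations)
{
    instr_time  total,
                start,
                stop;

    INSTR_TIME_SET_ZERO(total);

    for (int i = 0; i < iterations; i++)
    {
        INSTR_TIME_SET_CURRENT(start);
        do_work();
        INSTR_TIME_SET_CURRENT(stop);
        /* keep accumulating in instr_time form ... */
        INSTR_TIME_ACCUM_DIFF(total, stop, start);
    }

    /* ... and convert to a time unit only once, at the end */
    return INSTR_TIME_GET_MICROSEC(total);
}

Converting inside the loop instead would pay the conversion cost on every iteration and round each sample individually, which is the accuracy/speed loss mentioned above.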
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-13 11:55:47 -0800, Andres Freund wrote: > Does anybody see a reason to not move forward with this aspect? We do a fair > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by > just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes), > Instrumentation (16 bytes saved in Instrumentation itself, 32 via > BufferUsage). This actually under-counted the benefits, because we have two BufferUsage and two WalUsage in Instrumentation. Before: /* size: 448, cachelines: 7, members: 20 */ /* sum members: 445, holes: 1, sum holes: 3 */ After /* size: 368, cachelines: 6, members: 20 */ /* sum members: 365, holes: 1, sum holes: 3 */ The difference in the number of instructions in InstrStopNode is astounding: 1016 instructions with timespec, 96 instructions with nanoseconds. Some of that is the simpler data structure, some because the compiler now can auto-vectorize the four INSTR_TIME_ACCUM_DIFF in BufferUsageAccumDiff into one. We probably should convert Instrumentation->firsttuple to a instr_time now as well, no point in having the code for conversion to double in the hot routine, that can easily happen in explain. But that's for a later patch. I suggested downthread that we should convert the win32 implementation to be more similar to the unix-nanoseconds representation. A blind conversion looks good, and lets us share a number of macros. I wonder if we should deprecate INSTR_TIME_IS_ZERO()/INSTR_TIME_SET_ZERO() and allow 0 to be used instead. Not needing INSTR_TIME_SET_ZERO() allows variable definitions to initialize the value, which does avoid some unnecessarily awkward code. Alternatively we could introduce INSTR_TIME_ZERO() for that purpose? Greetings, Andres Freund
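To make the INSTR_TIME_ZERO() idea floated above concrete, a minimal sketch with the integer representation could look as follows; the macro name matches the proposal, everything else is illustrative:

#include <stdint.h>

typedef int64_t instr_time;

#define INSTR_TIME_ZERO()       ((instr_time) 0)
#define INSTR_TIME_IS_ZERO(t)   ((t) == 0)

static void
example(void)
{
    /* can be initialized right in the definition ... */
    instr_time  io_start = INSTR_TIME_ZERO();

    /* ... instead of needing a separate INSTR_TIME_SET_ZERO(io_start) statement */
    if (INSTR_TIME_IS_ZERO(io_start))
    {
        /* timing disabled, nothing to do */
    }
}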
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Wed, 4 Jan 2023 at 17:32, David Geier wrote: > > I fixed the compilation error on CFBot. > I missed adding instr_time.c to the Meson makefile. > New patch set attached. The patch does not apply on top of HEAD as in [1], please post a rebased patch: === Applying patches on top of PostgreSQL commit ID ff23b592ad6621563d3128b26860bcb41daf9542 === === applying patch ./0002-Use-CPU-reference-cycles-via-RDTSC-to-measure-time-v6.patch patching file src/tools/msvc/Mkvcbuild.pm Hunk #1 FAILED at 135. 1 out of 1 hunk FAILED -- saving rejects to file src/tools/msvc/Mkvcbuild.pm.rej [1] - http://cfbot.cputube.org/patch_41_3751.log Regards, Vignesh
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-13 15:25:16 -0500, Tom Lane wrote: > Andres Freund writes: > > Does anybody see a reason to not move forward with this aspect? We do a fair > > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by > > just using nanoseconds. > > Cheaper, and perhaps more accurate too? Don't recall if we have any code > paths where the input timestamps are likely to be better-than-microsecond, > but surely that's coming someday. instr_time on !WIN32 uses struct timespec, so we already should have nanosecond precision available. IOW, we could add an INSTR_TIME_GET_NANOSEC today. Or am I misunderstanding what you mean? > I'm unsure that we want to deal with rdtsc's vagaries in general, but > no objection to changing instr_time. Cool. Looking at the instr_time.h part of the change, I think it should go further, and basically do the same thing in the WIN32 path. The only part that needs to be win32 specific is INSTR_TIME_SET_CURRENT(). That'd reduce duplication a good bit. Greetings, Andres Freund
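A rough sketch of what "doing the same thing in the WIN32 path" could mean: only the acquisition of the counter differs per platform, while the arithmetic macros are shared. pg_clock_gettime_ns() is the helper from the posted patches; the Windows counterpart is a hypothetical name chosen here for illustration, and the macros are simplified compared to the real header:

typedef int64 instr_time;       /* nanoseconds on POSIX, performance-counter ticks on Windows */

#ifndef WIN32
#define INSTR_TIME_SET_CURRENT(t)   ((t) = pg_clock_gettime_ns())
#else
#define INSTR_TIME_SET_CURRENT(t)   ((t) = pg_query_performance_counter())
#endif

/* everything below no longer needs a platform-specific variant */
#define INSTR_TIME_ADD(x, y)            ((x) += (y))
#define INSTR_TIME_SUBTRACT(x, y)       ((x) -= (y))
#define INSTR_TIME_ACCUM_DIFF(x, y, z)  ((x) += (y) - (z))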
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Andres Freund writes: > On 2023-01-04 13:02:05 +0100, David Geier wrote: >> Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's >> cheaper. > Does anybody see a reason to not move forward with this aspect? We do a fair > amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by > just using nanoseconds. Cheaper, and perhaps more accurate too? Don't recall if we have any code paths where the input timestamps are likely to be better-than-microsecond, but surely that's coming someday. I'm unsure that we want to deal with rdtsc's vagaries in general, but no objection to changing instr_time. regards, tom lane
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2023-01-04 13:02:05 +0100, David Geier wrote: > From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001 > From: David Geier > Date: Thu, 17 Nov 2022 10:22:01 +0100 > Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's > cheaper. Does anybody see a reason to not move forward with this aspect? We do a fair amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes), Instrumentation (16 bytes saved in Instrumentation itself, 32 via BufferUsage). While the range of instr_time storing nanoseconds wouldn't be good enough for a generic timestamp facility (hence using microsecs for Timestamp), the range seems plenty for its use of measuring runtime: ((2 ** 63) - 1) / ((10 ** 9) * 60 * 60 * 24 * 365) = ~292 years Of course, when using CLOCK_REALTIME, this is relative to 1970-01-01, so just 239 years. It could theoretically be a different story, if we stored instr_time's on disk. But we don't, they're ephemeral. This doesn't buy a whole lot of performance - the bottleneck is the actual timestamp computation. But in a query with not much else going on, it's visible and reproducible. It's, unsurprisingly, a lot easier to see when using BUFFERS. For both timespec and nanosecond, I measured three server starts, and for each started server three executions of pgbench -n -Mprepared -c1 -P5 -T15 -f <(echo "EXPLAIN (ANALYZE, BUFFERS) SELECT generate_series(1, 1000) OFFSET 1000;") the best result is: timespec: 1073.431 nanosec: 957.532 a ~10% difference Greetings, Andres Freund
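For anyone wanting to double-check the range arithmetic above, a trivial standalone program:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* nanoseconds representable in a signed 64 bit value, expressed in years */
    double  years = (double) INT64_MAX / (1e9 * 60 * 60 * 24 * 365);

    printf("%.1f years\n", years);      /* prints roughly 292.5 */
    return 0;
}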
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, CFBot shows some compilation errors as in [1], please post an updated version for the same: 09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning: relocation against `cycles_to_sec' in read-only section `.text' [09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in function `pg_clock_gettime_ref_cycles': [09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119: undefined reference to `use_rdtsc' [09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in function `test_timing': [09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135: undefined reference to `pg_clock_gettime_initialize_rdtsc' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137: undefined reference to `cycles_to_us' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146: undefined reference to `cycles_to_us' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169: undefined reference to `cycles_to_us' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176: undefined reference to `cycles_to_sec' [09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE [09:08:12.525] collect2: error: ld returned 1 exit status [1] - https://cirrus-ci.com/task/5375312565895168 Regards, Vignesh I fixed the compilation error on CFBot. I missed adding instr_time.c to the Meson makefile. New patch set attached. -- David Geier (ServiceNow) From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 10:22:01 +0100 Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's cheaper. 
--- src/include/portability/instr_time.h | 62 1 file changed, 26 insertions(+), 36 deletions(-) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index 9ea1a68bd9..c64f21eb53 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -80,63 +80,53 @@ #define PG_INSTR_CLOCK CLOCK_REALTIME #endif -typedef struct timespec instr_time; +typedef int64 instr_time; +#define NS_PER_S INT64CONST(10) +#define US_PER_S INT64CONST(100) +#define MS_PER_S INT64CONST(1000) -#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0) +#define NS_PER_MS INT64CONST(100) +#define NS_PER_US INT64CONST(1000) -#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0) +#define INSTR_TIME_IS_ZERO(t) ((t) == 0) -#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t))) +#define INSTR_TIME_SET_ZERO(t) ((t) = 0) + +static inline instr_time pg_clock_gettime_ns(void) +{ + struct timespec tmp; + + clock_gettime(PG_INSTR_CLOCK, ); + + return tmp.tv_sec * NS_PER_S + tmp.tv_nsec; +} + +#define INSTR_TIME_SET_CURRENT(t) \ + (t) = pg_clock_gettime_ns() #define INSTR_TIME_ADD(x,y) \ do { \ - (x).tv_sec += (y).tv_sec; \ - (x).tv_nsec += (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y); \ } while (0) #define INSTR_TIME_SUBTRACT(x,y) \ do { \ - (x).tv_sec -= (y).tv_sec; \ - (x).tv_nsec -= (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ + (x) -= (y); \ } while (0) #define INSTR_TIME_ACCUM_DIFF(x,y,z) \ do { \ - (x).tv_sec += (y).tv_sec - (z).tv_sec; \ - (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \ - /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y) - (z); \ } while (0) #define INSTR_TIME_GET_DOUBLE(t) \ - (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 10.0) + ((double) (t) / NS_PER_S) #define INSTR_TIME_GET_MILLISEC(t) \ - (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 100.0) + ((double) (t) / NS_PER_MS) #define INSTR_TIME_GET_MICROSEC(t) \ - (((uint64) (t).tv_sec * (uint64) 100) + (uint64) ((t).tv_nsec / 1000)) + ((double) (t) / NS_PER_US) #else /* WIN32 */ -- 2.34.1 From 190ca09566eabb017ed25b1512225173ca47fb88 Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 13:03:59 +0100 Subject: [PATCH 2/3] Use CPU reference cycles, via RDTSC, to measure time for instrumentation. For now this is only enabled on Linux/x86 when the system clocksource is marked tsc as well, as determined at runtime. This way we can rely on the Linux kernel to make a decision whether tsc is invariant and usable on the current CPU
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Tue, 3 Jan 2023 at 14:08, David Geier wrote: > > Hi Lukas, > > On 1/2/23 20:50, Lukas Fittl wrote: > > Thanks for continuing to work on this patch, and my apologies for > > silence on the patch. > > It would be great if you could review it. > Please also share your thoughts around exposing the used clock source as > GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS(). > > I rebased again on master because of [1]. Patches attached. > > > > > Its been hard to make time, and especially so because I typically > > develop on an ARM-based macOS system where I can't test this directly > > - hence my tests with virtualized EC2 instances, where I ran into the > > timing oddities. > That's good and bad. Bad to do the development and good to test the > implementation on more virtualized setups; given that I also encountered > "interesting" behavior on VMWare (see my previous mails). > > > > On Mon, Jan 2, 2023 at 5:28 AM David Geier wrote: > > > > The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other > > variants > > return double. This seems error prone. What about renaming the > > function > > or also have the function return a double and cast where necessary at > > the call site? > > > > > > Minor note, but in my understanding using a uint64 (where we can) is > > faster for any simple arithmetic we do with the values. > > That's true. So the argument could be that for seconds and milliseconds > we want the extra precision while microseconds are precise enough. > Still, we could also make the seconds and milliseconds conversion code > integer only and e.g. return two integers with the value before and > after the comma. FWICS, the functions are nowhere used in performance > critical code, so it doesn't really make a difference performance-wise. > > > > > +1, and feel free to carry this patch forward - I'll try to make an > > effort to review my earlier testing issues again, as well as your > > later improvements to the patch. > Moved to the current commit fest. Will you become reviewer? > > > > Also, FYI, I just posted an alternate idea for speeding up EXPLAIN > > ANALYZE with timing over in [0], using a sampling-based approach to > > reduce the timing overhead. > > Interesting idea. I'll reply with some thoughts on the corresponding thread. 
> > [1] > https://www.postgresql.org/message-id/flat/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw%40mail.gmail.com CFBot shows some compilation errors as in [1], please post an updated version for the same: 09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning: relocation against `cycles_to_sec' in read-only section `.text' [09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in function `pg_clock_gettime_ref_cycles': [09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119: undefined reference to `use_rdtsc' [09:08:12.525] /usr/bin/ld: src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in function `test_timing': [09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135: undefined reference to `pg_clock_gettime_initialize_rdtsc' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137: undefined reference to `cycles_to_us' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146: undefined reference to `cycles_to_us' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169: undefined reference to `cycles_to_us' [09:08:12.525] /usr/bin/ld: /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176: undefined reference to `cycles_to_sec' [09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE [09:08:12.525] collect2: error: ld returned 1 exit status [1] - https://cirrus-ci.com/task/5375312565895168 Regards, Vignesh
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi Lukas, On 1/2/23 20:50, Lukas Fittl wrote: Thanks for continuing to work on this patch, and my apologies for silence on the patch. It would be great if you could review it. Please also share your thoughts around exposing the used clock source as GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS(). I rebased again on master because of [1]. Patches attached. Its been hard to make time, and especially so because I typically develop on an ARM-based macOS system where I can't test this directly - hence my tests with virtualized EC2 instances, where I ran into the timing oddities. That's good and bad. Bad to do the development and good to test the implementation on more virtualized setups; given that I also encountered "interesting" behavior on VMWare (see my previous mails). On Mon, Jan 2, 2023 at 5:28 AM David Geier wrote: The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants return double. This seems error prone. What about renaming the function or also have the function return a double and cast where necessary at the call site? Minor note, but in my understanding using a uint64 (where we can) is faster for any simple arithmetic we do with the values. That's true. So the argument could be that for seconds and milliseconds we want the extra precision while microseconds are precise enough. Still, we could also make the seconds and milliseconds conversion code integer only and e.g. return two integers with the value before and after the comma. FWICS, the functions are nowhere used in performance critical code, so it doesn't really make a difference performance-wise. +1, and feel free to carry this patch forward - I'll try to make an effort to review my earlier testing issues again, as well as your later improvements to the patch. Moved to the current commit fest. Will you become reviewer? Also, FYI, I just posted an alternate idea for speeding up EXPLAIN ANALYZE with timing over in [0], using a sampling-based approach to reduce the timing overhead. Interesting idea. I'll reply with some thoughts on the corresponding thread. [1] https://www.postgresql.org/message-id/flat/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw%40mail.gmail.com -- David Geier (ServiceNow) From f63527c8e4b3b0b71ffacaa111dd93325d973432 Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 10:22:01 +0100 Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's cheaper. 
--- src/include/portability/instr_time.h | 62 1 file changed, 26 insertions(+), 36 deletions(-) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index 9ea1a68bd9..c64f21eb53 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -80,63 +80,53 @@ #define PG_INSTR_CLOCK CLOCK_REALTIME #endif -typedef struct timespec instr_time; +typedef int64 instr_time; +#define NS_PER_S INT64CONST(10) +#define US_PER_S INT64CONST(100) +#define MS_PER_S INT64CONST(1000) -#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0) +#define NS_PER_MS INT64CONST(100) +#define NS_PER_US INT64CONST(1000) -#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0) +#define INSTR_TIME_IS_ZERO(t) ((t) == 0) -#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t))) +#define INSTR_TIME_SET_ZERO(t) ((t) = 0) + +static inline instr_time pg_clock_gettime_ns(void) +{ + struct timespec tmp; + + clock_gettime(PG_INSTR_CLOCK, ); + + return tmp.tv_sec * NS_PER_S + tmp.tv_nsec; +} + +#define INSTR_TIME_SET_CURRENT(t) \ + (t) = pg_clock_gettime_ns() #define INSTR_TIME_ADD(x,y) \ do { \ - (x).tv_sec += (y).tv_sec; \ - (x).tv_nsec += (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y); \ } while (0) #define INSTR_TIME_SUBTRACT(x,y) \ do { \ - (x).tv_sec -= (y).tv_sec; \ - (x).tv_nsec -= (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ + (x) -= (y); \ } while (0) #define INSTR_TIME_ACCUM_DIFF(x,y,z) \ do { \ - (x).tv_sec += (y).tv_sec - (z).tv_sec; \ - (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \ - /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y) - (z); \ } while (0) #define INSTR_TIME_GET_DOUBLE(t) \ - (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 10.0) + ((double) (t) / NS_PER_S) #define INSTR_TIME_GET_MILLISEC(t) \ - (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 100.0) + ((double) (t) / NS_PER_MS) #define INSTR_TIME_GET_MICROSEC(t) \ -
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Fri, Jul 15, 2022 at 11:21 AM Maciek Sakrejda wrote: > On Fri, Jul 1, 2022 at 10:26 AM Andres Freund wrote: > > On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote: > >... > > > Known WIP problems with this patch version: > > > > > > * There appears to be a timing discrepancy I haven't yet worked out, where > > > the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is > > > reporting. With Andres' earlier test case, I'm seeing a consistent > > > ~700ms > > > higher for \timing than for the EXPLAIN ANALYZE time reported on the > > > server > > > side, only when rdtsc measurement is used -- its likely there is a > > > problem > > > somewhere with how we perform the cycles to time conversion > > > > Could you explain a bit more what you're seeing? I just tested your patches > > and didn't see that here. > > I did not see this either, but I did see that the execution time > reported by \timing is (for this test case) consistently 0.5-1ms > *lower* than the Execution Time reported by EXPLAIN. I did not see > that on master. Is that expected? For what it's worth, I can no longer reproduce this. In fact, I went back to master-as-of-around-then and applied Lukas' v2 patches again, and I still can't reproduce that. I do remember it happening consistently across several executions, but now \timing consistently shows 0.5-1ms slower, as expected. This does not explain the different timing issue Lukas was seeing in his tests, but I think we can assume what I reported originally here is not an issue.
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi David, Thanks for continuing to work on this patch, and my apologies for silence on the patch. Its been hard to make time, and especially so because I typically develop on an ARM-based macOS system where I can't test this directly - hence my tests with virtualized EC2 instances, where I ran into the timing oddities. On Mon, Jan 2, 2023 at 5:28 AM David Geier wrote: > The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants > return double. This seems error prone. What about renaming the function > or also have the function return a double and cast where necessary at > the call site? > Minor note, but in my understanding using a uint64 (where we can) is faster for any simple arithmetic we do with the values. > If no one objects I would also re-register this patch in the commit fest. > +1, and feel free to carry this patch forward - I'll try to make an effort to review my earlier testing issues again, as well as your later improvements to the patch. Also, FYI, I just posted an alternate idea for speeding up EXPLAIN ANALYZE with timing over in [0], using a sampling-based approach to reduce the timing overhead. [0]: https://www.postgresql.org/message-id/CAP53PkxXMk0j-%2B0%3DYwRti2pFR5UB2Gu4v2Oyk8hhZS0DRART6g%40mail.gmail.com Thanks, Lukas -- Lukas Fittl
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, I re-based again on master and applied the following changes: I removed the fallback for obtaining the TSC frequency from /proc/cpu as suggested by Andres. Worst-case we fall back to clock_gettime(). I added code to obtain the TSC frequency via CPUID when under a hypervisor. I had to use __cpuid() directly instead of __get_cpuid(), because __get_cpuid() returns an error if the leaf is > 0x8000 (probably the implementation pre-dates the hypervisor timing leafs). Unfortunately, while testing my implementation under VMWare, I found that RDTSC runs awfully slow there (like 30x slower). [1] indicates that we cannot generally rely on RDTSC being actually fast on VMs. However, the same applies to clock_gettime(). It runs as slow as RDTSC on my VMWare setup. Hence, using RDTSC is not at disadvantage. I'm not entirely sure if there aren't cases where e.g. clock_gettime() is actually faster than RDTSC and it would be advantageous to use clock_gettime(). We could add a GUC so that the user can decide which clock source to use. Any thoughts? I also somewhat improved the accuracy of the cycles to milli- and microseconds conversion functions by having two more multipliers with higher precision. For microseconds we could also keep the computation integer-only. I'm wondering what to best do for seconds and milliseconds. I'm currently leaning towards just keeping it as is, because the durations measured and converted are usually long enough that precision shouldn't be a problem. In vacuum_lazy.c we do if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000). I changed that to use INSTR_TIME_GET_MILLISECS() instead. Additionally, I initialized a few variables of type instr_time which otherwise resulted in warnings due to use of potentially uninitialized variables. I also couldn't reproduce the reported timing discrepancy. For me the runtime reported by \timing is just slightly higher than the time reported by EXPLAIN ANALYZE, which is expected. Beyond that: What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that it's consistent with the _MILLISEC() and _MICROSEC() variants? The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants return double. This seems error prone. What about renaming the function or also have the function return a double and cast where necessary at the call site? If no one objects I would also re-register this patch in the commit fest. [1] https://vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf (page 11 "Virtual TSC") -- David Geier (ServiceNow) From 321d00ae5dd1bcffc8fbdb39879b7f5c78e3930f Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 10:22:01 +0100 Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's cheaper. 
--- src/include/portability/instr_time.h | 62 1 file changed, 26 insertions(+), 36 deletions(-) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index 22bcf3d288..4bd555113b 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -80,63 +80,53 @@ #define PG_INSTR_CLOCK CLOCK_REALTIME #endif -typedef struct timespec instr_time; +typedef int64 instr_time; +#define NS_PER_S INT64CONST(10) +#define US_PER_S INT64CONST(100) +#define MS_PER_S INT64CONST(1000) -#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0) +#define NS_PER_MS INT64CONST(100) +#define NS_PER_US INT64CONST(1000) -#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0) +#define INSTR_TIME_IS_ZERO(t) ((t) == 0) -#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t))) +#define INSTR_TIME_SET_ZERO(t) ((t) = 0) + +static inline instr_time pg_clock_gettime_ns(void) +{ + struct timespec tmp; + + clock_gettime(PG_INSTR_CLOCK, ); + + return tmp.tv_sec * NS_PER_S + tmp.tv_nsec; +} + +#define INSTR_TIME_SET_CURRENT(t) \ + (t) = pg_clock_gettime_ns() #define INSTR_TIME_ADD(x,y) \ do { \ - (x).tv_sec += (y).tv_sec; \ - (x).tv_nsec += (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y); \ } while (0) #define INSTR_TIME_SUBTRACT(x,y) \ do { \ - (x).tv_sec -= (y).tv_sec; \ - (x).tv_nsec -= (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ + (x) -= (y); \ } while (0) #define INSTR_TIME_ACCUM_DIFF(x,y,z) \ do { \ - (x).tv_sec += (y).tv_sec - (z).tv_sec; \ - (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \ - /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
I missed attaching the patches. -- David Geier (ServiceNow) From f4e962729ca605498d0c8bfc97d0f42d68a0df06 Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 10:22:01 +0100 Subject: [PATCH 1/2] WIP: Change instr_time to just store nanoseconds, that's cheaper. --- src/include/portability/instr_time.h | 62 1 file changed, 26 insertions(+), 36 deletions(-) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index 22bcf3d288..4bd555113b 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -80,63 +80,53 @@ #define PG_INSTR_CLOCK CLOCK_REALTIME #endif -typedef struct timespec instr_time; +typedef int64 instr_time; +#define NS_PER_S INT64CONST(10) +#define US_PER_S INT64CONST(100) +#define MS_PER_S INT64CONST(1000) -#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0) +#define NS_PER_MS INT64CONST(100) +#define NS_PER_US INT64CONST(1000) -#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0) +#define INSTR_TIME_IS_ZERO(t) ((t) == 0) -#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t))) +#define INSTR_TIME_SET_ZERO(t) ((t) = 0) + +static inline instr_time pg_clock_gettime_ns(void) +{ + struct timespec tmp; + + clock_gettime(PG_INSTR_CLOCK, ); + + return tmp.tv_sec * NS_PER_S + tmp.tv_nsec; +} + +#define INSTR_TIME_SET_CURRENT(t) \ + (t) = pg_clock_gettime_ns() #define INSTR_TIME_ADD(x,y) \ do { \ - (x).tv_sec += (y).tv_sec; \ - (x).tv_nsec += (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y); \ } while (0) #define INSTR_TIME_SUBTRACT(x,y) \ do { \ - (x).tv_sec -= (y).tv_sec; \ - (x).tv_nsec -= (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ + (x) -= (y); \ } while (0) #define INSTR_TIME_ACCUM_DIFF(x,y,z) \ do { \ - (x).tv_sec += (y).tv_sec - (z).tv_sec; \ - (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \ - /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 10; \ - (x).tv_sec--; \ - } \ - while ((x).tv_nsec >= 10) \ - { \ - (x).tv_nsec -= 10; \ - (x).tv_sec++; \ - } \ + (x) += (y) - (z); \ } while (0) #define INSTR_TIME_GET_DOUBLE(t) \ - (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 10.0) + ((double) (t) / NS_PER_S) #define INSTR_TIME_GET_MILLISEC(t) \ - (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 100.0) + ((double) (t) / NS_PER_MS) #define INSTR_TIME_GET_MICROSEC(t) \ - (((uint64) (t).tv_sec * (uint64) 100) + (uint64) ((t).tv_nsec / 1000)) + ((double) (t) / NS_PER_US) #else /* WIN32 */ -- 2.34.1 From 7a6317fdf5b1d82f120a4fd5535cfe40c8165153 Mon Sep 17 00:00:00 2001 From: David Geier Date: Thu, 17 Nov 2022 13:03:59 +0100 Subject: [PATCH 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure time for instrumentation. For now this is only enabled on Linux/x86 when the system clocksource is marked tsc as well, as determined at runtime. This way we can rely on the Linux kernel to make a decision whether tsc is invariant and usable on the current CPU architecture. In all other cases we continue to use the clock_gettime() implementation like before. Note that this intentionally uses rdtsc, not rdtscp, as rdtscp waits for currently running CPU instructions to have finished, and that adds up to noticable latency for little benefit in the typical InstrStartNode() / InstrStopNode() use case. 
--- src/backend/utils/init/postinit.c | 3 + src/bin/pg_test_timing/pg_test_timing.c | 1 + src/bin/pgbench/pgbench.c | 3 + src/bin/psql/startup.c | 4 + src/common/Makefile | 1 + src/common/instr_time.c | 103 src/include/portability/instr_time.h| 50 +--- src/tools/msvc/Mkvcbuild.pm | 2 +- 8 files changed, 157 insertions(+), 10 deletions(-) create mode 100644 src/common/instr_time.c diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c index a990c833c5..c684725af3 100644 --- a/src/backend/utils/init/postinit.c +++ b/src/backend/utils/init/postinit.c @@ -804,6 +804,9 @@ InitPostgres(const char *in_dbname, Oid dboid, /* Initialize portal manager */ EnablePortalManager(); + /* initialize high-precision interval timing */ + INSTR_TIME_INITIALIZE(); + /* Initialize status reporting */ pgstat_beinit(); diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c index c29d6f8762..0d667ff5a7 100644 --- a/src/bin/pg_test_timing/pg_test_timing.c +++ b/src/bin/pg_test_timing/pg_test_timing.c @@ -132,6 +132,7 @@ test_timing(unsigned int duration)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
I think it would be great to get this patch committed. Beyond the reasons already mentioned, the significant overhead also tends to skew the reported runtimes in ways that make it difficult to compare them. For example, if two nodes are executed equally often but one needs twice the time to process the rows: in such a case EXPLAIN ANALYZE should report timings that are 2x apart. However, currently, the high overhead of clock_gettime() tends to skew the relative runtimes. On 10/12/22 10:33, Michael Paquier wrote: No rebased version has been sent since this update, so this patch has been marked as RwF. I've rebased the patch set on latest master and fixed a few compiler warnings. Beyond that some findings and thoughts: You're only using RDTSC if the clock source is 'tsc'. Great idea to not bother caring about a lot of hairy TSC details. Looking at the kernel code this seems to imply that the TSC is frequency invariant. I don't think though that this implies that Linux is not running under a hypervisor; which is good because I assume PostgreSQL is used a lot in VMs. However, when running under a hypervisor (at least with VMWare) CPUID leaf 0x16 is not available. In my tests __get_cpuid() indicated success but the returned values were garbage. Instead of using leaf 0x16, we should then use the hypervisor interface to obtain the TSC frequency. Checking if a hypervisor is active can be done via: bool IsHypervisorActive() { uint32 cpuinfo[4] = {0}; int res = __get_cpuid(0x1, &cpuinfo[0], &cpuinfo[1], &cpuinfo[2], &cpuinfo[3]); return res > 0 && (cpuinfo[2] & (1 << 30)); } Obtaining the TSC frequency via the hypervisor interface can be done with the following code. See https://lwn.net/Articles/301888/ for more details. // Under hypervisors (tested with VMWare) leaf 0x16 is not available, even though __get_cpuid() succeeds. // Hence, if running under a hypervisor, use the hypervisor interface to obtain TSC frequency. uint32 cpuinfo[4] = {0}; if (IsHypervisorActive() && __get_cpuid(0x4001, &cpuinfo[0], &cpuinfo[1], &cpuinfo[2], &cpuinfo[3]) > 0) cycles_to_sec = 1.0 / ((double)cpuinfo[0] * 1000 * 1000); Given that we anyways switch between RDTSC and clock_gettime() with a global variable, what about exposing the clock source as GUC? That way the user can switch back to a working clock source in case we miss a detail around activating or reading the TSC. I'm happy to update the patches accordingly. -- David Geier (ServiceNow)
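To make the "only use RDTSC if the clock source is 'tsc'" check above concrete, a minimal sketch of reading the kernel's active clocksource from sysfs could look like this; it is not the patch itself, and path and error handling are simplified:

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

static bool
kernel_clocksource_is_tsc(void)
{
    char    buf[64] = {0};
    FILE   *f;

    f = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
    if (f == NULL)
        return false;

    if (fgets(buf, sizeof(buf), f) == NULL)
    {
        fclose(f);
        return false;
    }
    fclose(f);

    /* the file contains the active clocksource name, e.g. "tsc\n" */
    return strncmp(buf, "tsc", 3) == 0;
}

If this returns false, the code keeps using the clock_gettime() implementation exactly as before.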
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Tue, Sep 06, 2022 at 11:32:18AM +0500, Ibrar Ahmed wrote: > Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines). > Hunk #6 FAILED at 170. > Hunk #7 succeeded at 165 (offset -69 lines). > 2 out of 7 hunks FAILED -- saving rejects to file > src/include/portability/instr_time.h.rej > patching file src/tools/msvc/Mkvcbuild.pm No rebased version has been sent since this update, so this patch has been marked as RwF. -- Michael signature.asc Description: PGP signature
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Fri, Jul 15, 2022 at 11:22 PM Maciek Sakrejda wrote: > I ran that original test case with and without the patch. Here are the > numbers I'm seeing: > > master (best of three): > > postgres=# SELECT count(*) FROM lotsarows; > Time: 582.423 ms > > postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows; > Time: 616.102 ms > > postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows; > Time: 1068.700 ms (00:01.069) > > patched (best of three): > > postgres=# SELECT count(*) FROM lotsarows; > Time: 550.822 ms > > postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows; > Time: 612.572 ms > > postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows; > Time: 690.875 ms > > On Fri, Jul 1, 2022 at 10:26 AM Andres Freund wrote: > > On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote: > >... > > > Known WIP problems with this patch version: > > > > > > * There appears to be a timing discrepancy I haven't yet worked out, > where > > > the \timing data reported by psql doesn't match what EXPLAIN ANALYZE > is > > > reporting. With Andres' earlier test case, I'm seeing a consistent > ~700ms > > > higher for \timing than for the EXPLAIN ANALYZE time reported on the > > > server > > > side, only when rdtsc measurement is used -- its likely there is a > problem > > > somewhere with how we perform the cycles to time conversion > > > > Could you explain a bit more what you're seeing? I just tested your > patches > > and didn't see that here. > > I did not see this either, but I did see that the execution time > reported by \timing is (for this test case) consistently 0.5-1ms > *lower* than the Execution Time reported by EXPLAIN. I did not see > that on master. Is that expected? > > Thanks, > Maciek > > > The patch requires a rebase; please rebase the patch with the latest code. Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines). Hunk #6 FAILED at 170. Hunk #7 succeeded at 165 (offset -69 lines). 2 out of 7 hunks FAILED -- saving rejects to file src/include/portability/instr_time.h.rej patching file src/tools/msvc/Mkvcbuild.pm -- Ibrar Ahmed
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
I ran that original test case with and without the patch. Here are the numbers I'm seeing: master (best of three): postgres=# SELECT count(*) FROM lotsarows; Time: 582.423 ms postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows; Time: 616.102 ms postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows; Time: 1068.700 ms (00:01.069) patched (best of three): postgres=# SELECT count(*) FROM lotsarows; Time: 550.822 ms postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows; Time: 612.572 ms postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows; Time: 690.875 ms On Fri, Jul 1, 2022 at 10:26 AM Andres Freund wrote: > On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote: >... > > Known WIP problems with this patch version: > > > > * There appears to be a timing discrepancy I haven't yet worked out, where > > the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is > > reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms > > higher for \timing than for the EXPLAIN ANALYZE time reported on the > > server > > side, only when rdtsc measurement is used -- its likely there is a problem > > somewhere with how we perform the cycles to time conversion > > Could you explain a bit more what you're seeing? I just tested your patches > and didn't see that here. I did not see this either, but I did see that the execution time reported by \timing is (for this test case) consistently 0.5-1ms *lower* than the Execution Time reported by EXPLAIN. I did not see that on master. Is that expected? Thanks, Maciek
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi, On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote: > On Fri, Jun 12, 2020 at 4:28 PM Andres Freund wrote: > > > The attached patches are really just a prototype. I'm also not really > > planning to work on getting this into a "production ready" patchset > > anytime soon. I developed it primarily because I found it the overhead > > made it too hard to nail down in which part of a query tree performance > > changed. If somebody else wants to continue from here... > > > > I do think it's be a pretty significant improvement if we could reduce > > the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a > > bunch of low-level code. > > > > Based on an off-list conversation with Andres, I decided to dust off this > old > patch for using rdtsc directly. The significant EXPLAIN ANALYZE performance > improvements (especially when using rdtsc instead of rdtsc*p*) seem to > warrant > giving this a more thorough look. > > See attached an updated patch (adding it to the July commitfest), with a few > changes: > > - Keep using clock_gettime() as a fallback if we decide to not use rdtsc Yep. > - Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work I suspect that this might not be needed anymore. Seems like it'd be ok to just fall back to clock_gettime() in that case. > - In an abundance of caution, for now I've decided to only enable this if we > are on Linux/x86, and the current kernel clocksource is TSC (the kernel > has > quite sophisticated logic around making this decision, see [1]) I think our requirements are a bit lower than the kernel's - we're not tracking wall clock over an extended period... > Note that if we implemented the decision logic ourselves (instead of relying > on the Linux kernel), I'd be most worried about older virtualization > technology. In my understanding getting this right is notably more > complicated > than just checking cpuid, see [2]. > Known WIP problems with this patch version: > > * There appears to be a timing discrepancy I haven't yet worked out, where > the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is > reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms > higher for \timing than for the EXPLAIN ANALYZE time reported on the > server > side, only when rdtsc measurement is used -- its likely there is a problem > somewhere with how we perform the cycles to time conversion Could you explain a bit more what you're seeing? I just tested your patches and didn't see that here. > * Possibly related, the floating point handling for the cycles_to_sec > variable > is problematic in terms of precision (see FIXME, taken over from Andres' > POC) And probably also performance... > Open questions from me: > > 1) Do we need to account for different TSC offsets on different CPUs in SMP >systems? (the Linux kernel certainly has logic to that extent, but [3] >suggests this is no longer a problem on Nehalem and newer chips, i.e. > those >having an invariant TSC) I don't think we should cater to systems where we need that. > 2) Should we have a setting "--with-tsc" for configure? (instead of always >enabling it when on Linux/x86 with a TSC clocksource) Probably not worth it. > 3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for >current instructions to finish -- the prior discussion seemed to suggest >we don't want it for node instruction measurements, but possibly we do > want >this in other cases?) I was wondering about that too... 
Perhaps we should add an INSTR_TIME_SET_CURRENT_BARRIER() or such?

> 4) Should we support using the "mrs" instruction on ARM? (which is similar to
>    rdtsc, see [4])

I'd leave that for later personally.

> #define NS_PER_S	INT64CONST(1000000000)
> #define US_PER_S	INT64CONST(1000000)
> #define MS_PER_S	INT64CONST(1000)
> @@ -95,17 +104,37 @@ typedef int64 instr_time;
>
>  #define INSTR_TIME_SET_ZERO(t)	((t) = 0)
>
> -static inline instr_time pg_clock_gettime_ns(void)
> +extern double cycles_to_sec;
> +
> +bool use_rdtsc;

This should be extern and inside the ifdef below.

> +#if defined(__x86_64__) && defined(__linux__)
> +extern void pg_clock_gettime_initialize_rdtsc(void);
> +#endif
> +
> +static inline instr_time pg_clock_gettime_ref_cycles(void)
>  {
> 	struct timespec tmp;
>
> +#if defined(__x86_64__) && defined(__linux__)
> +	if (use_rdtsc)
> +		return __rdtsc();
> +#endif
> +
> 	clock_gettime(PG_INSTR_CLOCK, &tmp);
>
> 	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
> }

Greetings,

Andres Freund
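To illustrate what such a barrier variant could look like, here is a
hypothetical sketch that is not part of the posted patches: it reuses the
use_rdtsc flag and pg_clock_gettime_ref_cycles() from the quoted patch, and
reads the counter with rdtscp, which waits for all prior instructions to
finish, so it would only be suitable for coarse measurements such as a whole
query rather than per-node instrumentation.

#if defined(__x86_64__) && defined(__linux__)
#include <x86intrin.h>

/*
 * Sketch: like pg_clock_gettime_ref_cycles(), but ordered after all
 * preceding instructions, at the cost of a pipeline stall.
 */
static inline instr_time
pg_clock_gettime_ref_cycles_barrier(void)
{
	if (use_rdtsc)
	{
		unsigned int aux;		/* receives IA32_TSC_AUX, unused here */

		return __rdtscp(&aux);
	}
	return pg_clock_gettime_ref_cycles();
}

#define INSTR_TIME_SET_CURRENT_BARRIER(t) \
	((t) = pg_clock_gettime_ref_cycles_barrier())
#endif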
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Fri, Jun 12, 2020 at 4:28 PM Andres Freund wrote:
> The attached patches are really just a prototype. I'm also not really
> planning to work on getting this into a "production ready" patchset
> anytime soon. I developed it primarily because I found the overhead
> made it too hard to nail down in which part of a query tree performance
> changed. If somebody else wants to continue from here...
>
> I do think it'd be a pretty significant improvement if we could reduce
> the timing overhead of EXPLAIN ANALYZE by this much. Even if it requires a
> bunch of low-level code.

Based on an off-list conversation with Andres, I decided to dust off this old
patch for using rdtsc directly. The significant EXPLAIN ANALYZE performance
improvements (especially when using rdtsc instead of rdtsc*p*) seem to warrant
giving this a more thorough look.

See attached an updated patch (adding it to the July commitfest), with a few
changes:

- Keep using clock_gettime() as a fallback if we decide to not use rdtsc
- Fall back to /proc/cpuinfo for the clock frequency, if cpuid(0x16) doesn't work
- The decision to use rdtsc (or not) is made at runtime, in the new
  INSTR_TIME_INITIALIZE() -- we can't make this decision at compile time
  because it depends on the specific CPU in use, amongst other things
- In an abundance of caution, for now I've decided to only enable this if we
  are on Linux/x86 and the current kernel clocksource is TSC (the kernel has
  quite sophisticated logic around making this decision, see [1])

Note that if we implemented the decision logic ourselves (instead of relying
on the Linux kernel), I'd be most worried about older virtualization
technology. In my understanding, getting this right is notably more
complicated than just checking cpuid, see [2].

Known WIP problems with this patch version:

* There appears to be a timing discrepancy I haven't yet worked out, where
  the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
  reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
  higher for \timing than for the EXPLAIN ANALYZE time reported on the server
  side, only when rdtsc measurement is used -- it's likely there is a problem
  somewhere with how we perform the cycles to time conversion

* Possibly related, the floating point handling for the cycles_to_sec variable
  is problematic in terms of precision (see FIXME, taken over from Andres' POC)

Open questions from me:

1) Do we need to account for different TSC offsets on different CPUs in SMP
   systems? (the Linux kernel certainly has logic to that extent, but [3]
   suggests this is no longer a problem on Nehalem and newer chips, i.e. those
   having an invariant TSC)

2) Should we have a setting "--with-tsc" for configure? (instead of always
   enabling it when on Linux/x86 with a TSC clocksource)

3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
   current instructions to finish -- the prior discussion seemed to suggest
   we don't want it for node instruction measurements, but possibly we do want
   this in other cases?)

4) Should we support using the "mrs" instruction on ARM?
   (which is similar to rdtsc, see [4])

Thanks,
Lukas

[1] https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c
[2] http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/
[3] https://stackoverflow.com/a/11060619/1652607
[4] https://cpufun.substack.com/p/fun-with-timers-and-cpuid

--
Lukas Fittl

v2-0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure.patch
Description: Binary data

v2-0001-WIP-Change-instr_time-to-just-store-nanoseconds-t.patch
Description: Binary data
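On question 4, the ARM counterpart is the ARMv8 generic timer, which exposes
a constant-frequency virtual counter and its frequency through system
registers, so no cpuid-style frequency lookup is needed. A hedged,
AArch64-only sketch follows; the helper names are invented for illustration,
while the register names are the architectural ones.

#if defined(__aarch64__)
#include <stdint.h>

/* read the virtual counter, analogous to rdtsc on x86 */
static inline uint64_t
read_virtual_counter(void)
{
	uint64_t	val;

	__asm__ __volatile__("mrs %0, cntvct_el0" : "=r" (val));
	return val;
}

/* counter frequency in Hz, as reported by the hardware itself */
static inline uint64_t
read_counter_frequency(void)
{
	uint64_t	freq;

	__asm__ __volatile__("mrs %0, cntfrq_el0" : "=r" (freq));
	return freq;
}
#endif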
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
On Sat, Jun 13, 2020 at 11:28 AM Andres Freund wrote:
> [PATCH v1 1/2] WIP: Change instr_time to just store nanoseconds, that's
> cheaper.

Makes a lot of sense. If we do this, I'll need to update pgbench, which just
did something similar locally. If I'd been paying attention to this thread I
might not have committed that piece of the recent pgbench changes, but it's
trivial stuff and I'll be happy to tidy that up when the time comes.

> [PATCH v1 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure time for
> instrumentation.
>
> Some of the time is spent doing function calls, dividing into struct
> timespec, etc. But most of it is just the rdtscp instruction:
> 65.30 │1  63:   rdtscp
>
> The reason for that is largely that rdtscp waits until all prior
> instructions have finished (but it allows later instructions to already
> start). Multiple times for each tuple.

Yeah, after reading a bit about this, I agree that there is no reason to think
that the stalling version makes the answer better in any way. It might make
sense if you use it once at the beginning of a large computation, but it makes
no sense if you sprinkle it around inside blocks that will run multiple times.
It destroys your instructions-per-cycle while it runs, turning your fancy
superscalar Pentium into a 486. It does raise some interesting questions about
what exactly you're measuring, though: I don't know enough to have a good grip
on how far out of order the TSC could be read!

> There's also other issues with using rdtsc directly: On older CPUs, in
> particular older multi-socket systems, the tsc will not be synchronized
> in detail across cores. There's bits that'd let us check whether tsc is
> suitable or not. The more current issue of that is that things like
> virtual machines being migrated can lead to rdtsc suddenly returning a
> different value / the frequency differing. But that is supposed to be
> solved these days, by having virtualization technologies set frequency
> multipliers and offsets which then cause rdtsc[p] to return something
> meaningful, even after migration.

Googling tells me that Nehalem (2008) introduced "invariant TSC" (clock rate
independent) and also socket synchronisation at the same time, so systems
without it are already pretty long in the tooth. A quick peek at an AMD
manual[1] tells me that a similar change happened in
15H/Bulldozer/Piledriver/Steamroller/Excavator (2011), identified with the
same CPUID test.

My first reaction is that it seems like TSC would be the least of your worries
if you're measuring a VM that's currently migrating between hosts, but maybe
the idea is just that you have to make sure you don't assume it can't ever go
backwards or something like that...

Google Benchmark[2] has some clues about how to spell this on MSVC, which
instructions might be worth researching on ARM, etc.

[1] https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf
    (page 373)
[2] https://github.com/google/benchmark/blob/master/src/cycleclock.h
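For reference, the CPUID test referred to above is extended leaf 0x80000007,
whose EDX bit 8 advertises an invariant (constant-rate, non-stop) TSC on both
Intel and AMD. A minimal sketch of such a check, with a function name invented
for illustration:

#include <cpuid.h>
#include <stdbool.h>

static bool
has_invariant_tsc(void)
{
	unsigned int eax = 0,
				ebx = 0,
				ecx = 0,
				edx = 0;

	/* extended leaf 0x80000007: Advanced Power Management Information */
	if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
		return false;

	/* EDX bit 8: "Invariant TSC" (Intel) / "TscInvariant" (AMD) */
	return (edx & (1U << 8)) != 0;
}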
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
so 13. 6. 2020 v 1:28 odesílatel Andres Freund napsal: > Hi, > > Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes > the resulting timing enough that the times aren't meaningful. E.g. > > CREATE TABLE lotsarows(key int not null); > INSERT INTO lotsarows SELECT generate_series(1, 5000); > VACUUM FREEZE lotsarows; > > > -- best of three: > SELECT count(*) FROM lotsarows; > Time: 1923.394 ms (00:01.923) > > -- best of three: > EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows; > Time: 2319.830 ms (00:02.320) > > -- best of three: > EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows; > Time: 4202.649 ms (00:04.203) > > That nearly *double* the execution time without TIMING. > > > Looking at a profile of this shows that we spend a good bit of cycles > "normalizing" timstamps etc. That seems pretty unnecessary, just forced > on us due to struct timespec. So the first attached patch just turns > instr_time to be a 64bit integer, counting nanoseconds. > > That helps, a tiny bit: > EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows; > Time: 4179.302 ms (00:04.179) > > but obviously doesn't move the needle. > > > Looking at a profile it's easy to confirm that we spend a lot of time > acquiring time: > - 95.49% 0.00% postgres postgres [.] > agg_retrieve_direct (inlined) >- agg_retrieve_direct (inlined) > - 79.27% fetch_input_tuple > - ExecProcNode (inlined) > - 75.72% ExecProcNodeInstr >+ 25.22% SeqNext >- 21.74% InstrStopNode > + 17.80% __GI___clock_gettime (inlined) >- 21.44% InstrStartNode > + 19.23% __GI___clock_gettime (inlined) >+ 4.06% ExecScan > + 13.09% advance_aggregates (inlined) > 1.06% MemoryContextReset > > And that's even though linux avoids a syscall (in most cases) etc to > acquire the time. Unless the kernel detects there's a reason not to do > so, linux does this by using 'rdtscp' and multiplying it by kernel > provided factors to turn the cycles into time. > > Some of the time is spent doing function calls, dividing into struct > timespec, etc. But most of it just the rdtscp instruction: > 65.30 │1 63: rdtscp > > > The reason for that is largely that rdtscp waits until all prior > instructions have finished (but it allows later instructions to already > start). Multiple times for each tuple. > > > In the second attached prototype patch I've change instr_time to count > in cpu cycles instead of nanoseconds. And then just turned the cycles > into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later). > > When using rdtsc that results in *vastly* lower overhead: > > ┌───┐ > │ QUERY PLAN > │ > > ├───┤ > │ Aggregate (cost=846239.20..846239.21 rows=1 width=8) (actual > time=2610.235..2610.235 rows=1 loops=1) │ > │ -> Seq Scan on lotsarows (cost=0.00..721239.16 rows=5016 > width=0) (actual time=0.006..1512.886 rows=5000 loops=1) │ > │ Planning Time: 0.028 ms > │ > │ Execution Time: 2610.256 ms > │ > > └───┘ > (4 rows) > > Time: 2610.589 ms (00:02.611) > > And there's still some smaller improvements that could be made ontop of > that. > > As a comparison, here's the time when using rdtscp directly in > instr_time, instead of going through clock_gettime: > Time: 3481.162 ms (00:03.481) > > That shows pretty well how big the cost of the added pipeline stalls > are, and how important out-of-order execution is for decent > performance... 
> > > In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we > do *not* want to wait for prior instructions to finish, since that > actually leads to the timing being less accurate, rather than > more. There are other cases where that'd be different, e.g. measuring > how long an entire query takes or such (but there it's probably > irrelevant which to use). > > > I've above skipped a bit over the details of how to turn the cycles > returned by rdtsc into time: > > On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that > have actually been run, but instead returns the number of 'reference > cycles'. That's important because otherwise things like turbo mode and > lower power modes would lead to completely bogus times. > > Thus, knowing the "base frequency" of the CPU allows to turn the >
Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi,

Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes the
resulting timing enough that the times aren't meaningful. E.g.

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 5000);
VACUUM FREEZE lotsarows;

-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)

-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)

-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)

That's nearly *double* the execution time without TIMING.

Looking at a profile of this shows that we spend a good bit of cycles
"normalizing" timestamps etc. That seems pretty unnecessary, just forced on us
due to struct timespec. So the first attached patch just turns instr_time into
a 64bit integer, counting nanoseconds.

That helps, a tiny bit:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)
but obviously doesn't move the needle.

Looking at a profile it's easy to confirm that we spend a lot of time
acquiring time:

-   95.49%     0.00%  postgres  postgres  [.] agg_retrieve_direct (inlined)
   - agg_retrieve_direct (inlined)
      - 79.27% fetch_input_tuple
         - ExecProcNode (inlined)
            - 75.72% ExecProcNodeInstr
               + 25.22% SeqNext
               - 21.74% InstrStopNode
                  + 17.80% __GI___clock_gettime (inlined)
               - 21.44% InstrStartNode
                  + 19.23% __GI___clock_gettime (inlined)
               + 4.06% ExecScan
      + 13.09% advance_aggregates (inlined)
        1.06% MemoryContextReset

And that's even though linux avoids a syscall (in most cases) etc to acquire
the time. Unless the kernel detects there's a reason not to do so, linux does
this by using 'rdtscp' and multiplying it by kernel provided factors to turn
the cycles into time.

Some of the time is spent doing function calls, dividing into struct timespec,
etc. But most of it is just the rdtscp instruction:
65.30 │1  63:   rdtscp

The reason for that is largely that rdtscp waits until all prior instructions
have finished (but it allows later instructions to already start). Multiple
times for each tuple.

In the second attached prototype patch I've changed instr_time to count in cpu
cycles instead of nanoseconds, and then just turned the cycles into seconds in
INSTR_TIME_GET_DOUBLE() (more about that part later).

When using rdtsc that results in *vastly* lower overhead:

┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                        QUERY PLAN                                                       │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1)                   │
│   ->  Seq Scan on lotsarows  (cost=0.00..721239.16 rows=5016 width=0) (actual time=0.006..1512.886 rows=5000 loops=1)   │
│ Planning Time: 0.028 ms                                                                                                  │
│ Execution Time: 2610.256 ms                                                                                              │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)

Time: 2610.589 ms (00:02.611)

And there's still some smaller improvements that could be made on top of that.

As a comparison, here's the time when using rdtscp directly in instr_time,
instead of going through clock_gettime:
Time: 3481.162 ms (00:03.481)

That shows pretty well how big the cost of the added pipeline stalls is, and
how important out-of-order execution is for decent performance...

In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we do
*not* want to wait for prior instructions to finish, since that actually leads
to the timing being less accurate, rather than more. There are other cases
where that'd be different, e.g. measuring how long an entire query takes or
such (but there it's probably irrelevant which to use).
Above I've skipped a bit over the details of how to turn the cycles returned
by rdtsc into time:

On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that have
actually been run, but instead returns the number of 'reference cycles'.
That's important because otherwise things like turbo mode and lower power
modes would lead to completely bogus times.

Thus, knowing the "base frequency" of the CPU allows turning the difference
between two rdtsc return values into seconds.

In the attached prototype I just determined the number of cycles using
cpuid(0x16). That's only available since
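A minimal sketch of the cpuid(0x16)-based conversion described here might look
like the following. Leaf 0x16 ("Processor Frequency Information") reports the
base frequency in MHz in EAX; the cycles_to_sec variable follows the
prototype's naming, while the function itself is invented for illustration and
omits the hypervisor and /proc/cpuinfo fallbacks discussed elsewhere in the
thread.

#include <cpuid.h>
#include <stdbool.h>

/* seconds per reference cycle; 0 means "unknown, fall back to clock_gettime()" */
double		cycles_to_sec = 0;

static bool
init_cycles_to_sec(void)
{
	unsigned int eax = 0,
				ebx = 0,
				ecx = 0,
				edx = 0;

	/*
	 * Leaf 0x16 is not available on older CPUs and under many hypervisors;
	 * in that case we must not use rdtsc for timing.
	 */
	if (!__get_cpuid(0x16, &eax, &ebx, &ecx, &edx) || eax == 0)
		return false;

	/* EAX holds the base frequency in MHz */
	cycles_to_sec = 1.0 / ((double) eax * 1000.0 * 1000.0);
	return true;
}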