Hi, On 2023-01-17 10:50:53 -0800, Andres Freund wrote: > On 2023-01-17 12:26:57 -0500, Tom Lane wrote: > > > 0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to > > > warn > > > Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and > > > just allow to assign 0. > > > > I think it's probably wise to keep the macro. If we ever rethink this > > again, we'll be glad we kept it. Similarly, IS_ZERO is a good idea > > even if it would work with just compare-to-zero. > > Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions > could give us the best of both worlds?
I tried that in the attached 0005. I found that it reads better if I also add INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001 instead, but I wanted to get agreement on it first. Comments? > > I'm almost tempted to suggest you define instr_time as a struct with a > > uint64 field, just to help keep us honest about that. > > I can see that making sense. Unless somebody pipes up with opposition to that > plan soon, I'll see how it goes. Done in the attached. I think it looks good. Actually found a type confusion buglet in 0004, so the type safety benefit is noticable. It does require a new INSTR_TIME_IS_LT() for the loop exit condition in 0004, but that seems fine. Besides cosmetic stuff I also added back the cast to double in window's INSTR_TIME_GET_NANOSEC() - I think there's an overflow danger without it. We should make this faster by pre-computing (double) NS_PER_S / GetTimerFrequency() once, as that'd avoid doing the the slow division on every conversion. But that's an old issue and thus better tackled separately. Greetings, Andres Freund
>From dfea7f159b79aaf1fb6f4c8617614f6fe5c1176f Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Mon, 16 Jan 2023 10:04:42 -0800 Subject: [PATCH v8 1/5] Zero initialize instr_time uses causing compiler warnings These are all not necessary from a correctness POV. However, in a subsequent patch instr_time will be simplified to an int64, at which point gcc would otherwise start to warn about the changed places. Author: Reviewed-by: Discussion: https://postgr.es/m/20230116023639.rn36vf6ajqmfc...@awork3.anarazel.de Backpatch: --- src/backend/access/transam/xlog.c | 4 ++++ src/backend/storage/buffer/bufmgr.c | 4 ++++ src/backend/storage/file/buffile.c | 4 ++++ src/backend/storage/ipc/latch.c | 2 ++ src/bin/psql/common.c | 6 ++++++ 5 files changed, 20 insertions(+) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index cc0d9a05d9f..fb4c860bdea 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2191,6 +2191,8 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) /* Measure I/O timing to write WAL data */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); + else + INSTR_TIME_SET_ZERO(start); pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE); written = pg_pwrite(openLogFile, from, nleft, startoffset); @@ -8151,6 +8153,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) /* Measure I/O timing to sync the WAL file */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); + else + INSTR_TIME_SET_ZERO(start); pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC); switch (sync_method) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 8075828e8a6..800a4248c95 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1017,6 +1017,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); smgrread(smgr, forkNum, blockNum, (char *) bufBlock); @@ -2902,6 +2904,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); /* * bufToWrite is either the shared buffer or a copy, as appropriate. diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c index c5464b6aa62..0a51624df3b 100644 --- a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -446,6 +446,8 @@ BufFileLoadBuffer(BufFile *file) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); /* * Read whatever we can get, up to a full bufferload. @@ -525,6 +527,8 @@ BufFileDumpBuffer(BufFile *file) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); + else + INSTR_TIME_SET_ZERO(io_start); bytestowrite = FileWrite(thisfile, file->buffer.data + wpos, diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index d79d71a8515..f4123e7de7e 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -1401,6 +1401,8 @@ WaitEventSetWait(WaitEventSet *set, long timeout, Assert(timeout >= 0 && timeout <= INT_MAX); cur_timeout = timeout; } + else + INSTR_TIME_SET_ZERO(start_time); pgstat_report_wait_start(wait_event_info); diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c index 00627830c47..f907f5d4e8d 100644 --- a/src/bin/psql/common.c +++ b/src/bin/psql/common.c @@ -1276,6 +1276,8 @@ DescribeQuery(const char *query, double *elapsed_msec) if (timing) INSTR_TIME_SET_CURRENT(before); + else + INSTR_TIME_SET_ZERO(before); /* * To parse the query but not execute it, we prepare it, using the unnamed @@ -1406,6 +1408,8 @@ ExecQueryAndProcessResults(const char *query, if (timing) INSTR_TIME_SET_CURRENT(before); + else + INSTR_TIME_SET_ZERO(before); if (pset.bind_flag) success = PQsendQueryParams(pset.db, query, pset.bind_nparams, NULL, (const char * const *) pset.bind_params, NULL, NULL, 0); @@ -1702,6 +1706,8 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec) if (timing) INSTR_TIME_SET_CURRENT(before); + else + INSTR_TIME_SET_ZERO(before); /* if we're not in a transaction, start one */ if (PQtransactionStatus(pset.db) == PQTRANS_IDLE) -- 2.38.0
>From 8c6dc714d2854c279f73399ac11fafa2182d0012 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Mon, 16 Jan 2023 20:38:12 -0800 Subject: [PATCH v8 2/5] instr_time: Represent time as an int64 on all platforms Until now we used struct timespec on all platforms but windows. Using struct timespe causes a fair bit of memory (struct timeval is 16 bytes) and runtime overhead (much more complicated additions). Instead we can convert the time to nanoseconds in INSTR_TIME_SET_CURRENT(), making the remaining operations cheaper. Representing time as int64 nanoseconds provides sufficient range, ~292 years relative to a starting point (depending on clock source, relative to the unix epoch or the system's boot time). That'd not be sufficient for calendar time stored on disk, but is plenty for runtime interval time measurement. On windows instr_time already is represented as cycles. It might make sense to represent time as cycles on other platforms as well, as using cycle acquisition instructions like rdtsc directly can reduce the overhead of time acquisition substantially. This could be done in a fairly localized manner as the code stands after this commit. Because the windows and non-windows paths are now more similar, use a common set of macros. To make that possible, most of the use of LARGE_INTEGER had to be removed, which looks nicer anyway. To avoid users of the API relying on the integer representation, we wrap the 64bit integer inside struct struct instr_time. Author: Andres Freund <and...@anarazel.de> Author: Lukas Fittl <lu...@fittl.com> Author: David Geier <geidav...@gmail.com> Reviewed-by: Tom Lane <t...@sss.pgh.pa.us> Discussion: https://postgr.es/m/20230113195547.k4nlrmawpijqw...@awork3.anarazel.de --- src/include/portability/instr_time.h | 162 ++++++++++++++------------- 1 file changed, 86 insertions(+), 76 deletions(-) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index 9ea1a68bd94..c0ed491395d 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -34,6 +34,8 @@ * * INSTR_TIME_GET_MICROSEC(t) convert t to uint64 (in microseconds) * + * INSTR_TIME_GET_NANOSEC(t) convert t to uint64 (in nanoseconds) + * * Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert * absolute times to intervals. The INSTR_TIME_GET_xxx operations are * only useful on intervals. @@ -54,8 +56,32 @@ #ifndef INSTR_TIME_H #define INSTR_TIME_H + +/* + * We store interval times as an int64 integer on all platforms, as int64 is + * cheap to add/subtract, the most common operation for instr_time. The + * acquisition of time and converting to specific units of time is platform + * specific. + * + * To avoid users of the API relying on the integer representation, we wrap + * the 64bit integer in a struct. + */ +typedef struct instr_time +{ + int64 ticks; /* in platforms specific unit */ +} instr_time; + + +/* helpers macros used in platform specific code below */ + +#define NS_PER_S INT64CONST(1000000000) +#define NS_PER_MS INT64CONST(1000000) +#define NS_PER_US INT64CONST(1000) + + #ifndef WIN32 + /* Use clock_gettime() */ #include <time.h> @@ -80,93 +106,43 @@ #define PG_INSTR_CLOCK CLOCK_REALTIME #endif -typedef struct timespec instr_time; +/* helper for INSTR_TIME_SET_CURRENT */ +static inline instr_time +pg_clock_gettime_ns(void) +{ + instr_time now; + struct timespec tmp; -#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0) + clock_gettime(PG_INSTR_CLOCK, &tmp); + now.ticks = tmp.tv_sec * NS_PER_S + tmp.tv_nsec; -#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0) + return now; +} -#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t))) +#define INSTR_TIME_SET_CURRENT(t) \ + ((t) = pg_clock_gettime_ns()) -#define INSTR_TIME_ADD(x,y) \ - do { \ - (x).tv_sec += (y).tv_sec; \ - (x).tv_nsec += (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec >= 1000000000) \ - { \ - (x).tv_nsec -= 1000000000; \ - (x).tv_sec++; \ - } \ - } while (0) +#define INSTR_TIME_GET_NANOSEC(t) \ + ((int64) (t).ticks) -#define INSTR_TIME_SUBTRACT(x,y) \ - do { \ - (x).tv_sec -= (y).tv_sec; \ - (x).tv_nsec -= (y).tv_nsec; \ - /* Normalize */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 1000000000; \ - (x).tv_sec--; \ - } \ - } while (0) - -#define INSTR_TIME_ACCUM_DIFF(x,y,z) \ - do { \ - (x).tv_sec += (y).tv_sec - (z).tv_sec; \ - (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \ - /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \ - while ((x).tv_nsec < 0) \ - { \ - (x).tv_nsec += 1000000000; \ - (x).tv_sec--; \ - } \ - while ((x).tv_nsec >= 1000000000) \ - { \ - (x).tv_nsec -= 1000000000; \ - (x).tv_sec++; \ - } \ - } while (0) - -#define INSTR_TIME_GET_DOUBLE(t) \ - (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0) - -#define INSTR_TIME_GET_MILLISEC(t) \ - (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0) - -#define INSTR_TIME_GET_MICROSEC(t) \ - (((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000)) #else /* WIN32 */ + /* Use QueryPerformanceCounter() */ -typedef LARGE_INTEGER instr_time; +/* helper for INSTR_TIME_SET_CURRENT */ +static inline instr_time +pg_query_performance_counter(void) +{ + instr_time now; + LARGE_INTEGER tmp; -#define INSTR_TIME_IS_ZERO(t) ((t).QuadPart == 0) + QueryPerformanceCounter(&tmp); + now.ticks = tmp.QuadPart; -#define INSTR_TIME_SET_ZERO(t) ((t).QuadPart = 0) - -#define INSTR_TIME_SET_CURRENT(t) QueryPerformanceCounter(&(t)) - -#define INSTR_TIME_ADD(x,y) \ - ((x).QuadPart += (y).QuadPart) - -#define INSTR_TIME_SUBTRACT(x,y) \ - ((x).QuadPart -= (y).QuadPart) - -#define INSTR_TIME_ACCUM_DIFF(x,y,z) \ - ((x).QuadPart += (y).QuadPart - (z).QuadPart) - -#define INSTR_TIME_GET_DOUBLE(t) \ - (((double) (t).QuadPart) / GetTimerFrequency()) - -#define INSTR_TIME_GET_MILLISEC(t) \ - (((double) (t).QuadPart * 1000.0) / GetTimerFrequency()) - -#define INSTR_TIME_GET_MICROSEC(t) \ - ((uint64) (((double) (t).QuadPart * 1000000.0) / GetTimerFrequency())) + return now; +} static inline double GetTimerFrequency(void) @@ -177,11 +153,45 @@ GetTimerFrequency(void) return (double) f.QuadPart; } +#define INSTR_TIME_SET_CURRENT(t) \ + ((t) = pg_query_performance_counter()) + +#define INSTR_TIME_GET_NANOSEC(t) \ + ((int64) (((double) (t).ticks * NS_PER_S) / GetTimerFrequency())) + #endif /* WIN32 */ -/* same macro on all platforms */ + +/* + * Common macros + */ + +#define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0) + + +#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0) #define INSTR_TIME_SET_CURRENT_LAZY(t) \ (INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false) + +#define INSTR_TIME_ADD(x,y) \ + ((x).ticks += (y).ticks) + +#define INSTR_TIME_SUBTRACT(x,y) \ + ((x).ticks -= (y).ticks) + +#define INSTR_TIME_ACCUM_DIFF(x,y,z) \ + ((x).ticks += (y).ticks - (z).ticks) + + +#define INSTR_TIME_GET_DOUBLE(t) \ + ((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S) + +#define INSTR_TIME_GET_MILLISEC(t) \ + ((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_MS) + +#define INSTR_TIME_GET_MICROSEC(t) \ + (INSTR_TIME_GET_NANOSEC(t) / NS_PER_US) + #endif /* INSTR_TIME_H */ -- 2.38.0
>From 0dcdbb6b1ad1121d2bdb2ba370060ef0519640b8 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Fri, 20 Jan 2023 15:31:54 -0800 Subject: [PATCH v8 3/5] instr_time: Add INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop without having to convert into time units (which is costly). INSTR_TIME_IS_LT() can be used to check the loop condition. Author: Reviewed-by: Discussion: https://postgr.es/m/ Backpatch: --- src/include/portability/instr_time.h | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index c0ed491395d..af2ab6ec887 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -15,6 +15,8 @@ * * INSTR_TIME_IS_ZERO(t) is t equal to zero? * + * INSTR_TIME_IS_LT(x, y) x < y + * * INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too) * * INSTR_TIME_SET_CURRENT(t) set t to current time @@ -22,6 +24,8 @@ * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero, * evaluates to whether t changed * + * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds + * * INSTR_TIME_ADD(x, y) x += y * * INSTR_TIME_SUBTRACT(x, y) x -= y @@ -122,6 +126,9 @@ pg_clock_gettime_ns(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_clock_gettime_ns()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = NS_PER_S * (s)) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) (t).ticks) @@ -156,6 +163,9 @@ GetTimerFrequency(void) #define INSTR_TIME_SET_CURRENT(t) \ ((t) = pg_query_performance_counter()) +#define INSTR_TIME_SET_SECONDS(t, s) \ + ((t).ticks = s * GetTimerFrequency()) + #define INSTR_TIME_GET_NANOSEC(t) \ ((int64) (((double) (t).ticks * NS_PER_S) / GetTimerFrequency())) @@ -168,6 +178,8 @@ GetTimerFrequency(void) #define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0) +#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks) + #define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0) -- 2.38.0
>From 5a458d4584961dedd3f80a07d8faea66e57c5d94 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Mon, 16 Jan 2023 11:19:11 -0800 Subject: [PATCH v8 4/5] wip: report nanoseconds in pg_test_timing This commit also updates pg_test_timing's documentation: - compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of of statement with/without EXPLAIN ANALYZE - explain the 2x overhead (due to two timestamp acquisitions per row) - remove old section about old versions of linux - I couldn't update the numbers, and it's old enough nobody would care --- src/bin/pg_test_timing/pg_test_timing.c | 74 ++++++++------ doc/src/sgml/ref/pgtesttiming.sgml | 122 +++++++++++------------- 2 files changed, 100 insertions(+), 96 deletions(-) diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c index c29d6f87629..e20718669a5 100644 --- a/src/bin/pg_test_timing/pg_test_timing.c +++ b/src/bin/pg_test_timing/pg_test_timing.c @@ -19,8 +19,8 @@ static void handle_args(int argc, char *argv[]); static uint64 test_timing(unsigned int duration); static void output(uint64 loop_count); -/* record duration in powers of 2 microseconds */ -long long int histogram[32]; +/* record duration in powers of 2 nanoseconds */ +uint64 histogram[64]; int main(int argc, char *argv[]) @@ -121,35 +121,48 @@ handle_args(int argc, char *argv[]) static uint64 test_timing(unsigned int duration) { - uint64 total_time; - int64 time_elapsed = 0; uint64 loop_count = 0; - uint64 prev, - cur; + instr_time until_time, + total_time; instr_time start_time, - end_time, - temp; - - total_time = duration > 0 ? duration * INT64CONST(1000000) : 0; + end_time; + instr_time cur; INSTR_TIME_SET_CURRENT(start_time); - cur = INSTR_TIME_GET_MICROSEC(start_time); - while (time_elapsed < total_time) + /* + * To reduce loop overhead, check loop condition in instr_time domain. + */ + INSTR_TIME_SET_SECONDS(total_time, duration); + until_time = start_time; + INSTR_TIME_ADD(until_time, total_time); + + cur = start_time; + + while (INSTR_TIME_IS_LT(cur, until_time)) { - int32 diff, - bits = 0; + instr_time temp; + instr_time prev; + int64 diff; + int32 bits = 0; prev = cur; - INSTR_TIME_SET_CURRENT(temp); - cur = INSTR_TIME_GET_MICROSEC(temp); - diff = cur - prev; + INSTR_TIME_SET_CURRENT(cur); + temp = cur; + INSTR_TIME_SUBTRACT(temp, prev); + diff = INSTR_TIME_GET_NANOSEC(temp); /* Did time go backwards? */ - if (diff < 0) + if (unlikely(diff <= 0)) { + /* can't do anything with that measurement */ + if (diff == 0) + { + loop_count++; + continue; + } fprintf(stderr, _("Detected clock going backwards in time.\n")); - fprintf(stderr, _("Time warp: %d ms\n"), diff); + fprintf(stderr, _("Time warp: %lld ns\n"), (long long) diff); exit(1); } @@ -164,8 +177,6 @@ test_timing(unsigned int duration) histogram[bits]++; loop_count++; - INSTR_TIME_SUBTRACT(temp, start_time); - time_elapsed = INSTR_TIME_GET_MICROSEC(temp); } INSTR_TIME_SET_CURRENT(end_time); @@ -173,7 +184,7 @@ test_timing(unsigned int duration) INSTR_TIME_SUBTRACT(end_time, start_time); printf(_("Per loop time including overhead: %0.2f ns\n"), - INSTR_TIME_GET_DOUBLE(end_time) * 1e9 / loop_count); + (INSTR_TIME_GET_DOUBLE(end_time) * NS_PER_S) / loop_count); return loop_count; } @@ -181,9 +192,10 @@ test_timing(unsigned int duration) static void output(uint64 loop_count) { - int64 max_bit = 31, + int64 low_bit = 0, + max_bit = 63, i; - char *header1 = _("< us"); + char *header1 = _("< ns"); char *header2 = /* xgettext:no-c-format */ _("% of total"); char *header3 = _("count"); int len1 = strlen(header1); @@ -194,15 +206,19 @@ output(uint64 loop_count) while (max_bit > 0 && histogram[max_bit] == 0) max_bit--; + /* find lowest bit value */ + while (low_bit < max_bit && histogram[low_bit] == 0) + low_bit++; + printf(_("Histogram of timing durations:\n")); printf("%*s %*s %*s\n", - Max(6, len1), header1, + Max(9, len1), header1, Max(10, len2), header2, Max(10, len3), header3); - for (i = 0; i <= max_bit; i++) - printf("%*ld %*.5f %*lld\n", - Max(6, len1), 1l << i, + for (i = low_bit; i <= max_bit; i++) + printf("%*ld %*.5f %*llu\n", + Max(9, len1), 1l << i, Max(10, len2) - 1, (double) histogram[i] * 100 / loop_count, - Max(10, len3), histogram[i]); + Max(10, len3), (long long unsigned) histogram[i]); } diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml index a5eb3aa25e0..82c625e3a2a 100644 --- a/doc/src/sgml/ref/pgtesttiming.sgml +++ b/doc/src/sgml/ref/pgtesttiming.sgml @@ -93,28 +93,34 @@ PostgreSQL documentation <para> Good results will show most (>90%) individual timing calls take less than - one microsecond. Average per loop overhead will be even lower, below 100 - nanoseconds. This example from an Intel i7-860 system using a TSC clock - source shows excellent performance: + one microsecond (1000 nanoseconds). Average per loop overhead will be even + lower, below 100 nanoseconds. This example from an Intel i9-9880H system + using a TSC clock source shows excellent performance: <screen><![CDATA[ Testing timing overhead for 3 seconds. -Per loop time including overhead: 35.96 ns +Per loop time including overhead: 13.74 ns Histogram of timing durations: - < us % of total count - 1 96.40465 80435604 - 2 3.59518 2999652 - 4 0.00015 126 - 8 0.00002 13 - 16 0.00000 2 + < ns % of total count + 16 97.00221 211857215 + 32 2.99555 6542412 + 64 0.00115 2505 + 128 0.00035 759 + 256 0.00004 78 + 512 0.00000 3 + 1024 0.00000 4 + 2048 0.00034 732 + 4096 0.00000 6 + 8192 0.00000 8 + 16384 0.00019 409 + 32768 0.00018 403 + 65536 0.00000 1 ]]></screen> </para> <para> - Note that different units are used for the per loop time than the - histogram. The loop can have resolution within a few nanoseconds (ns), - while the individual timing calls can only resolve down to one microsecond - (us). + Note that the accuracy of the histogram entries may be lower than the + per loop time. </para> </refsect2> @@ -125,24 +131,30 @@ Histogram of timing durations: When the query executor is running a statement using <command>EXPLAIN ANALYZE</command>, individual operations are timed as well as showing a summary. The overhead of your system can be checked by - counting rows with the <application>psql</application> program: + disabling the per-row timing, using the <literal>TIMING OFF</literal> + option: <screen> -CREATE TABLE t AS SELECT * FROM generate_series(1,100000); -\timing -SELECT COUNT(*) FROM t; -EXPLAIN ANALYZE SELECT COUNT(*) FROM t; +CREATE TABLE t AS SELECT * FROM generate_series(1, 100000); +EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t; +EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t; </screen> </para> <para> - The i7-860 system measured runs the count query in 9.8 ms while - the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each - processing just over 100,000 rows. That 6.8 ms difference means the timing - overhead per row is 68 ns, about twice what pg_test_timing estimated it - would be. Even that relatively small amount of overhead is making the fully - timed count statement take almost 70% longer. On more substantial queries, - the timing overhead would be less problematic. + The i9-9880H system measured shows an execution time of 4.116 ms for the + <literal>TIMING OFF</literal> query, and 6.965 ms for the + <literal>TIMING ON</literal>, each processing 100,000 rows. + + That 2.849 ms difference means the timing overhead per row is 28 ns. As + <literal>TIMING ON</literal> measures timestamps twice per row returned by + an executor node, the overhead is very close to what pg_test_timing + estimated it would be. + + more than what pg_test_timing estimated it would be. Even that relatively + small amount of overhead is making the fully timed count statement take + about 60% longer. On more substantial queries, the timing overhead would + be less problematic. </para> </refsect2> @@ -157,28 +169,31 @@ EXPLAIN ANALYZE SELECT COUNT(*) FROM t; <screen><![CDATA[ # cat /sys/devices/system/clocksource/clocksource0/available_clocksource -tsc hpet acpi_pm +tsc acpi_pm # echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource # pg_test_timing -Per loop time including overhead: 722.92 ns +Testing timing overhead for 3 seconds. +Per loop time including overhead: 708.58 ns Histogram of timing durations: - < us % of total count - 1 27.84870 1155682 - 2 72.05956 2990371 - 4 0.07810 3241 - 8 0.01357 563 - 16 0.00007 3 + < ns % of total count + 1024 99.79796 4225270 + 2048 0.15560 6588 + 4096 0.00035 15 + 8192 0.01738 736 + 16384 0.01679 711 + 32768 0.01190 504 ]]></screen> </para> <para> In this configuration, the sample <command>EXPLAIN ANALYZE</command> above - takes 115.9 ms. That's 1061 ns of timing overhead, again a small multiple - of what's measured directly by this utility. That much timing overhead - means the actual query itself is only taking a tiny fraction of the - accounted for time, most of it is being consumed in overhead instead. In - this configuration, any <command>EXPLAIN ANALYZE</command> totals involving - many timed operations would be inflated significantly by timing overhead. + shows an execution time of 148.7 ms. That's 1392 ns of timing overhead, + again a small multiple of what's measured directly by this utility. That + much timing overhead means the actual query itself is only taking a tiny + fraction of the accounted for time, most of it is being consumed in + overhead instead. In this configuration, any <command>EXPLAIN + ANALYZE</command> totals involving many timed operations would be inflated + significantly by timing overhead. </para> <para> @@ -196,33 +211,6 @@ kern.timecounter.hardware: ACPI-fast -> TSC </screen> </para> - <para> - Other systems may only allow setting the time source on boot. On older - Linux systems the "clock" kernel setting is the only way to make this sort - of change. And even on some more recent ones, the only option you'll see - for a clock source is "jiffies". Jiffies are the older Linux software clock - implementation, which can have good resolution when it's backed by fast - enough timing hardware, as in this example: - -<screen><![CDATA[ -$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource -jiffies -$ dmesg | grep time.c -time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer. -time.c: Detected 2400.153 MHz processor. -$ pg_test_timing -Testing timing overhead for 3 seconds. -Per timing duration including loop overhead: 97.75 ns -Histogram of timing durations: - < us % of total count - 1 90.23734 27694571 - 2 9.75277 2993204 - 4 0.00981 3010 - 8 0.00007 22 - 16 0.00000 1 - 32 0.00000 1 -]]></screen></para> - </refsect2> <refsect2> -- 2.38.0
>From ceca2ab6b2df034fde9be61b05ae6ce9ccf668b3 Mon Sep 17 00:00:00 2001 From: Andres Freund <and...@anarazel.de> Date: Fri, 20 Jan 2023 16:09:37 -0800 Subject: [PATCH v8 5/5] wip: instr_time: Add and use INSTR_TIME_ZERO(), INSTR_TIME_CURRENT() This just updates the places that "Zero initialize instr_time uses causing compiler warnings" changed, to see whether this is a nicer approach. --- src/include/portability/instr_time.h | 24 ++++++++++++++------ src/backend/access/transam/xlog.c | 11 +++------- src/backend/storage/buffer/bufmgr.c | 16 +++++--------- src/backend/storage/file/buffile.c | 16 +++++--------- src/backend/storage/ipc/latch.c | 8 +++---- src/bin/psql/common.c | 33 +++++++++++++--------------- 6 files changed, 50 insertions(+), 58 deletions(-) diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h index af2ab6ec887..2d1ff4f7f82 100644 --- a/src/include/portability/instr_time.h +++ b/src/include/portability/instr_time.h @@ -17,6 +17,10 @@ * * INSTR_TIME_IS_LT(x, y) x < y * + * INSTR_TIME_ZERO() an instr_time set to 0 + * + * INSTR_TIME_CURRENT() an instr_time set to current time + * * INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too) * * INSTR_TIME_SET_CURRENT(t) set t to current time @@ -110,7 +114,7 @@ typedef struct instr_time #define PG_INSTR_CLOCK CLOCK_REALTIME #endif -/* helper for INSTR_TIME_SET_CURRENT */ +/* helper for INSTR_TIME_CURRENT */ static inline instr_time pg_clock_gettime_ns(void) { @@ -123,8 +127,8 @@ pg_clock_gettime_ns(void) return now; } -#define INSTR_TIME_SET_CURRENT(t) \ - ((t) = pg_clock_gettime_ns()) +#define INSTR_TIME_CURRENT(t) \ + pg_clock_gettime_ns() #define INSTR_TIME_SET_SECONDS(t, s) \ ((t).ticks = NS_PER_S * (s)) @@ -138,7 +142,7 @@ pg_clock_gettime_ns(void) /* Use QueryPerformanceCounter() */ -/* helper for INSTR_TIME_SET_CURRENT */ +/* helper for INSTR_TIME_CURRENT */ static inline instr_time pg_query_performance_counter(void) { @@ -160,8 +164,8 @@ GetTimerFrequency(void) return (double) f.QuadPart; } -#define INSTR_TIME_SET_CURRENT(t) \ - ((t) = pg_query_performance_counter()) +#define INSTR_TIME_CURRENT(t) \ + pg_query_performance_counter() #define INSTR_TIME_SET_SECONDS(t, s) \ ((t).ticks = s * GetTimerFrequency()) @@ -181,7 +185,13 @@ GetTimerFrequency(void) #define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks) -#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0) +#define INSTR_TIME_ZERO(t) (instr_time){0} + +#define INSTR_TIME_SET_CURRENT(t) \ + (t) = INSTR_TIME_CURRENT() + +#define INSTR_TIME_SET_ZERO(t) \ + ((t) = INSTR_TIME_ZERO()) #define INSTR_TIME_SET_CURRENT_LAZY(t) \ (INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index fb4c860bdea..f563800c8ab 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -2178,7 +2178,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) Size nbytes; Size nleft; int written; - instr_time start; + instr_time start = INSTR_TIME_ZERO(); /* OK to write the page(s) */ from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ; @@ -2191,8 +2191,6 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) /* Measure I/O timing to write WAL data */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); - else - INSTR_TIME_SET_ZERO(start); pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE); written = pg_pwrite(openLogFile, from, nleft, startoffset); @@ -2204,9 +2202,8 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible) */ if (track_wal_io_timing) { - instr_time duration; + instr_time duration = INSTR_TIME_CURRENT(); - INSTR_TIME_SET_CURRENT(duration); INSTR_TIME_SUBTRACT(duration, start); PendingWalStats.wal_write_time += INSTR_TIME_GET_MICROSEC(duration); } @@ -8137,7 +8134,7 @@ void issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) { char *msg = NULL; - instr_time start; + instr_time start = INSTR_TIME_ZERO(); Assert(tli != 0); @@ -8153,8 +8150,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli) /* Measure I/O timing to sync the WAL file */ if (track_wal_io_timing) INSTR_TIME_SET_CURRENT(start); - else - INSTR_TIME_SET_ZERO(start); pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC); switch (sync_method) diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c index 800a4248c95..d8baf80e650 100644 --- a/src/backend/storage/buffer/bufmgr.c +++ b/src/backend/storage/buffer/bufmgr.c @@ -1012,19 +1012,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, MemSet((char *) bufBlock, 0, BLCKSZ); else { - instr_time io_start, - io_time; + instr_time io_start = INSTR_TIME_ZERO(); if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); - else - INSTR_TIME_SET_ZERO(io_start); smgrread(smgr, forkNum, blockNum, (char *) bufBlock); if (track_io_timing) { - INSTR_TIME_SET_CURRENT(io_time); + instr_time io_time = INSTR_TIME_CURRENT(); + INSTR_TIME_SUBTRACT(io_time, io_start); pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time)); INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time); @@ -2826,8 +2824,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) { XLogRecPtr recptr; ErrorContextCallback errcallback; - instr_time io_start, - io_time; + instr_time io_start = INSTR_TIME_ZERO(); Block bufBlock; char *bufToWrite; uint32 buf_state; @@ -2904,8 +2901,6 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); - else - INSTR_TIME_SET_ZERO(io_start); /* * bufToWrite is either the shared buffer or a copy, as appropriate. @@ -2918,7 +2913,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) if (track_io_timing) { - INSTR_TIME_SET_CURRENT(io_time); + instr_time io_time = INSTR_TIME_CURRENT(); + INSTR_TIME_SUBTRACT(io_time, io_start); pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time)); INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time); diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c index 0a51624df3b..6f813279690 100644 --- a/src/backend/storage/file/buffile.c +++ b/src/backend/storage/file/buffile.c @@ -429,8 +429,7 @@ static void BufFileLoadBuffer(BufFile *file) { File thisfile; - instr_time io_start; - instr_time io_time; + instr_time io_start = INSTR_TIME_ZERO(); /* * Advance to next component file if necessary and possible. @@ -446,8 +445,6 @@ BufFileLoadBuffer(BufFile *file) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); - else - INSTR_TIME_SET_ZERO(io_start); /* * Read whatever we can get, up to a full bufferload. @@ -468,7 +465,8 @@ BufFileLoadBuffer(BufFile *file) if (track_io_timing) { - INSTR_TIME_SET_CURRENT(io_time); + instr_time io_time = INSTR_TIME_CURRENT(); + INSTR_TIME_SUBTRACT(io_time, io_start); INSTR_TIME_ADD(pgBufferUsage.temp_blk_read_time, io_time); } @@ -500,8 +498,7 @@ BufFileDumpBuffer(BufFile *file) while (wpos < file->nbytes) { off_t availbytes; - instr_time io_start; - instr_time io_time; + instr_time io_start = INSTR_TIME_ZERO(); /* * Advance to next component file if necessary and possible. @@ -527,8 +524,6 @@ BufFileDumpBuffer(BufFile *file) if (track_io_timing) INSTR_TIME_SET_CURRENT(io_start); - else - INSTR_TIME_SET_ZERO(io_start); bytestowrite = FileWrite(thisfile, file->buffer.data + wpos, @@ -543,7 +538,8 @@ BufFileDumpBuffer(BufFile *file) if (track_io_timing) { - INSTR_TIME_SET_CURRENT(io_time); + instr_time io_time = INSTR_TIME_CURRENT(); + INSTR_TIME_SUBTRACT(io_time, io_start); INSTR_TIME_ADD(pgBufferUsage.temp_blk_write_time, io_time); } diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c index f4123e7de7e..8092ff4a984 100644 --- a/src/backend/storage/ipc/latch.c +++ b/src/backend/storage/ipc/latch.c @@ -1385,8 +1385,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout, uint32 wait_event_info) { int returned_events = 0; - instr_time start_time; - instr_time cur_time; + instr_time start_time = INSTR_TIME_ZERO(); long cur_timeout = -1; Assert(nevents > 0); @@ -1401,8 +1400,6 @@ WaitEventSetWait(WaitEventSet *set, long timeout, Assert(timeout >= 0 && timeout <= INT_MAX); cur_timeout = timeout; } - else - INSTR_TIME_SET_ZERO(start_time); pgstat_report_wait_start(wait_event_info); @@ -1489,7 +1486,8 @@ WaitEventSetWait(WaitEventSet *set, long timeout, /* If we're not done, update cur_timeout for next iteration */ if (returned_events == 0 && timeout >= 0) { - INSTR_TIME_SET_CURRENT(cur_time); + instr_time cur_time = INSTR_TIME_CURRENT(); + INSTR_TIME_SUBTRACT(cur_time, start_time); cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time); if (cur_timeout <= 0) diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c index f907f5d4e8d..5badb029e83 100644 --- a/src/bin/psql/common.c +++ b/src/bin/psql/common.c @@ -1269,15 +1269,12 @@ DescribeQuery(const char *query, double *elapsed_msec) bool timing = pset.timing; PGresult *result; bool OK; - instr_time before, - after; + instr_time before = INSTR_TIME_ZERO(); *elapsed_msec = 0; if (timing) INSTR_TIME_SET_CURRENT(before); - else - INSTR_TIME_SET_ZERO(before); /* * To parse the query but not execute it, we prepare it, using the unnamed @@ -1350,7 +1347,8 @@ DescribeQuery(const char *query, double *elapsed_msec) if (timing) { - INSTR_TIME_SET_CURRENT(after); + instr_time after = INSTR_TIME_CURRENT(); + INSTR_TIME_SUBTRACT(after, before); *elapsed_msec += INSTR_TIME_GET_MILLISEC(after); } @@ -1400,16 +1398,13 @@ ExecQueryAndProcessResults(const char *query, { bool timing = pset.timing; bool success; - instr_time before, - after; + instr_time before = INSTR_TIME_ZERO(); PGresult *result; FILE *gfile_fout = NULL; bool gfile_is_pipe = false; if (timing) INSTR_TIME_SET_CURRENT(before); - else - INSTR_TIME_SET_ZERO(before); if (pset.bind_flag) success = PQsendQueryParams(pset.db, query, pset.bind_nparams, NULL, (const char * const *) pset.bind_params, NULL, NULL, 0); @@ -1490,7 +1485,8 @@ ExecQueryAndProcessResults(const char *query, */ if (timing) { - INSTR_TIME_SET_CURRENT(after); + instr_time after = INSTR_TIME_CURRENT(); + INSTR_TIME_SUBTRACT(after, before); *elapsed_msec = INSTR_TIME_GET_MILLISEC(after); } @@ -1595,7 +1591,8 @@ ExecQueryAndProcessResults(const char *query, */ if (timing) { - INSTR_TIME_SET_CURRENT(after); + instr_time after = INSTR_TIME_CURRENT(); + INSTR_TIME_SUBTRACT(after, before); *elapsed_msec = INSTR_TIME_GET_MILLISEC(after); } @@ -1693,8 +1690,7 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec) int ntuples; int fetch_count; char fetch_cmd[64]; - instr_time before, - after; + instr_time before = INSTR_TIME_ZERO(); int flush_error; *elapsed_msec = 0; @@ -1706,8 +1702,6 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec) if (timing) INSTR_TIME_SET_CURRENT(before); - else - INSTR_TIME_SET_ZERO(before); /* if we're not in a transaction, start one */ if (PQtransactionStatus(pset.db) == PQTRANS_IDLE) @@ -1738,7 +1732,8 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec) if (timing) { - INSTR_TIME_SET_CURRENT(after); + instr_time after = INSTR_TIME_CURRENT(); + INSTR_TIME_SUBTRACT(after, before); *elapsed_msec += INSTR_TIME_GET_MILLISEC(after); } @@ -1786,7 +1781,8 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec) if (timing) { - INSTR_TIME_SET_CURRENT(after); + instr_time after = INSTR_TIME_CURRENT(); + INSTR_TIME_SUBTRACT(after, before); *elapsed_msec += INSTR_TIME_GET_MILLISEC(after); } @@ -1926,7 +1922,8 @@ cleanup: if (timing) { - INSTR_TIME_SET_CURRENT(after); + instr_time after = INSTR_TIME_CURRENT(); + INSTR_TIME_SUBTRACT(after, before); *elapsed_msec += INSTR_TIME_GET_MILLISEC(after); } -- 2.38.0