On Wed, Oct 30, 2019 at 5:51 AM, imai.yoshik...@fujitsu.com wrote:
> The overhead induced by getting wait event info has been discussed for a
> long time, but I couldn't find actual measurement results, so I want to
> measure the overhead.

Here is a patch which counts wait events and measures wait event times. It is 
currently a POC and has several things to be improved.

You can get wait event info by executing "select * from pg_stat_waitaccum;" and 
reset its counters by "select pg_stat_reset_shared('waitaccum');".

I tried to reduce the overhead of counting wait events. Since it is difficult 
to reduce the overhead of measuring wait event times, I made the time 
measurement configurable, like track_io_timing.
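
To illustrate the idea, here is only a rough sketch, not the patch itself: 
track_wait_timing, WaitAccumEntry, and the two helper functions are names made 
up for this example, while the instr_time macros are PostgreSQL's 
portability/instr_time.h. When the setting is off, the clock is never read and 
the only remaining cost is the counter increment.

    #include "postgres.h"
    #include "portability/instr_time.h"

    /* hypothetical GUC, analogous to track_io_timing */
    bool        track_wait_timing = false;

    /* hypothetical per-wait-event accumulator */
    typedef struct WaitAccumEntry
    {
        uint64      calls;
        uint64      times;      /* accumulated wait time, in microseconds here */
    } WaitAccumEntry;

    static inline void
    waitaccum_start(WaitAccumEntry *entry, instr_time *start)
    {
        entry->calls++;
        if (track_wait_timing)
            INSTR_TIME_SET_CURRENT(*start);  /* clock_gettime()/gettimeofday() */
        else
            INSTR_TIME_SET_ZERO(*start);
    }

    static inline void
    waitaccum_end(WaitAccumEntry *entry, instr_time start)
    {
        if (track_wait_timing)
        {
            instr_time  end;

            INSTR_TIME_SET_CURRENT(end);
            INSTR_TIME_SUBTRACT(end, start);
            entry->times += INSTR_TIME_GET_MICROSEC(end);
        }
    }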

Another DBMS uses a lighter-weight function than gettimeofday or clock_gettime: 
the CPU cycle counter, rdtsc.
So I also created a patch which uses rdtsc for measuring wait event times.
There have been previous investigations of rdtsc [1]. If we want to use rdtsc, 
we need more investigation into which platforms can use it, how to prevent the 
time from going backwards with rdtsc, and so on.
Here, I just wanted to see its overhead, so I did not address those issues in 
this patch.
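
For reference, reading the TSC itself takes only a couple of instructions on 
x86-64. The snippet below is just a minimal illustration (not the 0002 patch's 
code) and deliberately ignores the open issues mentioned above, such as TSC 
synchronization across cores and converting ticks into wall-clock time.

    #include <stdint.h>

    /* Read the x86-64 time stamp counter (GCC/Clang inline assembly). */
    static inline uint64_t
    rdtsc(void)
    {
        uint32_t    lo;
        uint32_t    hi;

        __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t) hi << 32) | lo;
    }

    /*
     * Usage: elapsed ticks around a waited section.
     *
     *     uint64_t start = rdtsc();
     *     ... wait ...
     *     uint64_t ticks = rdtsc() - start;
     *
     * Converting ticks to time requires the TSC frequency, and the value can
     * appear to go backwards if the process migrates between unsynchronized
     * CPUs; those are exactly the open questions noted above.
     */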

I measured its performance with pgbench on a VM with a 4-core CPU and 8 GB of 
memory.
I tested the following four cases.

    master:             master (2fc2a88)

    only counts:        0001 patched, only counts wait events
                        (track_wait_timing off)
    counts/time:        0001 patched, counts wait events and measures their time
                        by gettimeofday or clock_gettime (track_wait_timing on)
    counts/time(rdtsc): 0001 + 0002 patched, counts wait events and measures
                        their time by rdtsc (track_wait_timing on)

I executed the pgbench scripts below, initializing the database with scale 1 or 
100 and running pgbench in standard mode or select-only mode.

    [standard mode]
    for i in `seq 1 15`
    do
        pgbench -i -s (1 or 100) -q
        pgbench -c 8 -j 8 -n -T 60
    done

    [select-only mode]
    pgbench -i -s (1 or 100) -q
    for i in `seq 1 10`
    do
        pgbench -c 8 -j 8 -n -S -T 60
    done

The results are below. diff(%) is the TPS decrease relative to master, so a 
positive value means slower than master.

[standard, scale 1]
      version      |   TPS  | diff(%)
            master | 813.82 |       0
       only counts | 797.48 |    2.01
       counts/time | 833.16 |   -2.38
counts/time(rdtsc) | 876.29 |   -7.68

[standard, scale 100]
      version      |   TPS   | diff(%)
            master | 2170.34 |       0
       only counts | 2125.37 |    2.07
       counts/time | 2147.8  |    1.04
counts/time(rdtsc) | 2187.37 |  -0.785

[select-only, scale 1]
      version      |   TPS   | diff(%)
            master | 28487.6 |       0
       only counts | 28481.1 |   0.023
       counts/time | 28364.7 |   0.431
counts/time(rdtsc) | 28462.6 |   0.088

[select-only, scale 100]
      version      |   TPS    | diff(%)
            master | 25767.89 |       0
       only counts | 26068.65 |  -1.167
       counts/time | 25567.69 |   0.777
counts/time(rdtsc) | 25525.26 |   0.942


Examples of the wait event info after executing pgbench are below.

[standard, scale 100]
number of transactions actually processed: 129844
latency average = 3.697 ms
tps = 2163.667427 (including connections establishing)
tps = 2163.918398 (excluding connections establishing)

 wait_event_type |      wait_event      | calls  |   times   
-----------------+----------------------+--------+-----------
 Client          | ClientRead           | 908807 | 114473878
 IO              | DataFileRead         | 216025 |   2867211
 LWLock          | WALWriteLock         | 191977 | 195192237
 IO              | DataFileWrite        | 154540 |   3406232
 IO              | WALWrite             |  49932 |   2728543
 IO              | WALSync              |  49737 |  49649308
 Lock            | transactionid        |   6209 |   8999545
 LWLock          | buffer_content       |   5337 |    288951
 IO              | DataFileExtend       |   2346 |     90375
 LWLock          | wal_insert           |   2013 |     25141
 LWLock          | WALBufMappingLock    |    630 |     14680
 LWLock          | CLogControlLock      |    454 |      2414
 LWLock          | buffer_mapping       |    170 |       852
 LWLock          | XidGenLock           |    146 |      3268
 LWLock          | lock_manager         |    141 |      5209
 Lock            | tuple                |    112 |    120163
 LWLock          | ProcArrayLock        |     97 |       495
 Lock            | extend               |     42 |     26875
 IO              | RelationMapRead      |     22 |        57
 LWLock          | CheckpointerCommLock |     18 |      1217
 IO              | DataFilePrefetch     |     18 |        24
 IPC             | ClogGroupUpdate      |      9 |      1867
 LWLock          | SyncRepLock          |      3 |       116
 IO              | DataFileTruncate     |      1 |       116


[select-only, scale 1]
number of transactions actually processed: 1682642
latency average = 0.285 ms
tps = 28043.407989 (including connections establishing)
tps = 28048.158085 (excluding connections establishing)
 wait_event_type |   wait_event    |  calls  |   times   
-----------------+-----------------+---------+-----------
 Client          | ClientRead      | 1682661 | 287999638
 IO              | RelationMapRead |      22 |        54
 LWLock          | lock_manager    |       2 |      2087


[select-only, scale 100]
number of transactions actually processed: 1513536
latency average = 0.317 ms
tps = 25223.558569 (including connections establishing)
tps = 25228.820644 (excluding connections establishing)
 wait_event_type |   wait_event    |  calls  |   times   
-----------------+-----------------+---------+-----------
 IO              | DataFileRead    | 2524682 |  14579531
 Client          | ClientRead      | 1513558 | 283968554
 LWLock          | buffer_mapping  |    1225 |      6392
 IO              | RelationMapRead |      22 |        46
 LWLock          | buffer_io       |      11 |       876
 LWLock          | lock_manager    |       6 |       507


I wanted to measure and compare performance in workloads where wait events 
occur many times, but unfortunately, in the standard tests where wait events 
occur frequently, TPS was not stable and varied between runs, so I could not 
get a consistent performance tendency across the versions.

I need to find a more suitable test for clarifying the performance impact; if 
there are any good tests, please let me know. Also, any additional tests are 
very welcome.


--
Yoshikazu Imai

Attachment: 0001-Adding-the-pg_stat_waitaccum-view-which-shows-counts-v1.patch
Description: 0001-Adding-the-pg_stat_waitaccum-view-which-shows-counts-v1.patch

Attachment: 0002-POC-Changed-measuring-method-of-wait-event-timed-v2.patch
Description: 0002-POC-Changed-measuring-method-of-wait-event-timed-v2.patch
