On 12/06/2019 09:23, Fabien COELHO wrote:
>> But that was just because I was showing a simplistic example. E.g.
>> here's a log of a vacuum finishing, and then another starting a few
>> seconds later (both vacuums lasting a fair while):
>>
>> progress: 139.0 s, 2438.4 tps, txtime 13.033 ms stddev 3.830, lat 17530.784 ms stddev 590.153, lag 17517.751 ms
>> progress: 140.0 s, 2489.0 tps, txtime 12.911 ms stddev 3.642, lat 17752.862 ms stddev 600.661, lag 17739.952 ms
>> progress: 141.0 s, 2270.0 tps, txtime 14.021 ms stddev 4.965, lat 17973.805 ms stddev 594.784, lag 17959.784 ms
>> progress: 142.0 s, 1408.0 tps, txtime 22.848 ms stddev 5.365, lat 18417.808 ms stddev 632.729, lag 18394.960 ms
>> progress: 143.0 s, 3001.0 tps, txtime 10.724 ms stddev 4.318, lat 18796.971 ms stddev 617.462, lag 18786.247 ms
>> progress: 144.0 s, 4678.0 tps, txtime 6.823 ms stddev 2.136, lat 18503.253 ms stddev 669.072, lag 18496.431 ms
>> progress: 145.0 s, 4577.0 tps, txtime 7.001 ms stddev 1.526, lat 18108.596 ms stddev 689.843, lag 18101.596 ms
>> progress: 146.0 s, 2596.0 tps, txtime 12.261 ms stddev 3.060, lat 17961.623 ms stddev 683.498, lag 17949.363 ms
>> progress: 147.0 s, 2654.0 tps, txtime 12.072 ms stddev 3.282, lat 18120.009 ms stddev 685.074, lag 18107.938 ms
>> progress: 148.0 s, 3471.0 tps, txtime 9.240 ms stddev 3.702, lat 18251.712 ms stddev 676.572, lag 18242.472 ms
>> progress: 149.0 s, 3056.0 tps, txtime 10.468 ms stddev 5.131, lat 18058.950 ms stddev 675.334, lag 18048.482 ms
>> progress: 150.0 s, 2319.0 tps, txtime 13.778 ms stddev 3.762, lat 18305.101 ms stddev 688.186, lag 18291.323 ms
>> progress: 151.0 s, 2355.0 tps, txtime 13.567 ms stddev 3.891, lat 18586.073 ms stddev 691.656, lag 18572.506 ms
>> progress: 152.0 s, 2321.0 tps, txtime 13.742 ms stddev 3.708, lat 18835.985 ms stddev 709.580, lag 18822.244 ms
>> progress: 153.0 s, 2360.0 tps, txtime 13.604 ms stddev 3.533, lat 19121.166 ms stddev 709.682, lag 19107.562 ms
>>
>> The period in between where no vacuum was running is imo considerably
>> harder to spot looking at 'lat'.

> ISTM that the signal is pretty clear in whether the lag increases or
> decreases. Basically the database is 18 seconds behind its load, which
> is very bad if a user is waiting.

That was my thought too when looking at this example. When there is already a long queue of transactions, the txtime of individual transactions doesn't matter much. What matters most in that situation is how fast the system can drain the queue (or how fast the queue builds up further). So the derivative of the lag or lat seems like the most important figure. We don't print exactly that, but it's roughly the same information as the TPS. The jitter the user experiences matters too, i.e. the stddev of 'lat'.
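To make "the derivative of the lag" concrete: if clients keep arriving at a fixed rate L (the -R setting, which isn't shown in the quoted log), the queue grows at (L - tps) transactions per second, so the lag of a just-arrived transaction should change at roughly (L - tps) / tps seconds per second. A quick sketch of pulling that slope out of the progress lines (just an illustration, not something I'd propose for pgbench itself):

import re

# Matches the patched progress lines quoted above, e.g.
# "progress: 139.0 s, 2438.4 tps, ..., lag 17517.751 ms"
PROGRESS = re.compile(r'progress: ([\d.]+) s, ([\d.]+) tps,.* lag ([\d.]+) ms')

def lag_slope(lines):
    """Yield (time, tps, lag slope in ms/s) for consecutive progress lines."""
    prev = None
    for line in lines:
        m = PROGRESS.search(line)
        if not m:
            continue
        t, tps, lag = map(float, m.groups())
        if prev is not None:
            yield t, tps, (lag - prev[1]) / (t - prev[0])
        prev = (t, lag)

Run over the excerpt above, that slope is consistently positive (about +220 to +435 ms/s) while a vacuum runs, flips sharply negative (about -290 and -395 ms/s) right after one finishes, and hovers around zero in the gap, so the signal Fabien describes really is just the sign of that number.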

To illustrate this, imagine that the server magically detected that there's a long queue of transactions. It would be beneficial to go into "batch mode", where it collects incoming transactions into larger batches. The effect of this imaginary batch mode is that the TPS rate increases by 50%, but the txtime also increases by 1000%, and becomes highly variable. Would that be a good tradeoff? I would say yes. The user is experiencing an 18 s delay anyway, and the increase in txtime would be insignificant compared to that, but the queue would be busted more quickly.
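Toy numbers to put that tradeoff in perspective (all invented for illustration, nothing measured):

# Batch-mode thought experiment with made-up numbers: clients keep
# issuing 2500 tps, and an 18 s backlog has already built up.
arrival_rate = 2500.0               # demanded rate (-R), invented
backlog = 18.0 * arrival_rate       # 45000 transactions already queued

normal_tps = 3000.0                 # plain-mode throughput, invented
batch_tps = 1.5 * normal_tps        # imaginary batch mode: +50% TPS

for label, tps in (("normal", normal_tps), ("batch", batch_tps)):
    # the queue drains at (tps - arrival_rate) transactions per second
    print(f"{label}: backlog cleared in {backlog / (tps - arrival_rate):.0f} s")
# normal: backlog cleared in 90 s
# batch: backlog cleared in 22 s

Even if batch mode pushed the ~13 ms txtime past 100 ms, a user already waiting 18 seconds would never notice that, but they would get out of the queue four times sooner.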

Of course, there is no such batch mode in PostgreSQL, and I wouldn't suggest trying to implement anything like that. In a different kind of application, you would rather maintain a steady txtime when the server is at full load, even if it means a lower overall TPS rate. That feels like a more important goal than raw TPS, and I think we all agree on that. To simulate that kind of application, though, you probably wouldn't use -R, or you would use it together with --latency-limit. Except that Andres is clearly trying to do just that, which is why I'm still a bit confused :-).
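For the record, the kind of invocation I mean is something like this (client counts, duration, rate, limit and database name are all placeholders):

pgbench -c 16 -j 4 -T 300 -P 1 --rate=3000 --latency-limit=100 bench

With --latency-limit on top of --rate, transactions that have fallen so far behind schedule that they cannot meet the limit are skipped and reported separately, instead of piling up in an ever-growing queue, which is closer to how an interactive application with impatient users behaves.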

- Heikki

