On 12/06/2019 09:23, Fabien COELHO wrote:
>> But that was just because I was showing a simplistic example. E.g.
>> here's a log of a vacuum finishing, and then another starting a few
>> seconds later (both vacuums lasting a fair while):
>>
>> progress: 139.0 s, 2438.4 tps, txtime 13.033 ms stddev 3.830, lat 17530.784 ms stddev 590.153, lag 17517.751 ms
>> progress: 140.0 s, 2489.0 tps, txtime 12.911 ms stddev 3.642, lat 17752.862 ms stddev 600.661, lag 17739.952 ms
>> progress: 141.0 s, 2270.0 tps, txtime 14.021 ms stddev 4.965, lat 17973.805 ms stddev 594.784, lag 17959.784 ms
>> progress: 142.0 s, 1408.0 tps, txtime 22.848 ms stddev 5.365, lat 18417.808 ms stddev 632.729, lag 18394.960 ms
>> progress: 143.0 s, 3001.0 tps, txtime 10.724 ms stddev 4.318, lat 18796.971 ms stddev 617.462, lag 18786.247 ms
>> progress: 144.0 s, 4678.0 tps, txtime 6.823 ms stddev 2.136, lat 18503.253 ms stddev 669.072, lag 18496.431 ms
>> progress: 145.0 s, 4577.0 tps, txtime 7.001 ms stddev 1.526, lat 18108.596 ms stddev 689.843, lag 18101.596 ms
>> progress: 146.0 s, 2596.0 tps, txtime 12.261 ms stddev 3.060, lat 17961.623 ms stddev 683.498, lag 17949.363 ms
>> progress: 147.0 s, 2654.0 tps, txtime 12.072 ms stddev 3.282, lat 18120.009 ms stddev 685.074, lag 18107.938 ms
>> progress: 148.0 s, 3471.0 tps, txtime 9.240 ms stddev 3.702, lat 18251.712 ms stddev 676.572, lag 18242.472 ms
>> progress: 149.0 s, 3056.0 tps, txtime 10.468 ms stddev 5.131, lat 18058.950 ms stddev 675.334, lag 18048.482 ms
>> progress: 150.0 s, 2319.0 tps, txtime 13.778 ms stddev 3.762, lat 18305.101 ms stddev 688.186, lag 18291.323 ms
>> progress: 151.0 s, 2355.0 tps, txtime 13.567 ms stddev 3.891, lat 18586.073 ms stddev 691.656, lag 18572.506 ms
>> progress: 152.0 s, 2321.0 tps, txtime 13.742 ms stddev 3.708, lat 18835.985 ms stddev 709.580, lag 18822.244 ms
>> progress: 153.0 s, 2360.0 tps, txtime 13.604 ms stddev 3.533, lat 19121.166 ms stddev 709.682, lag 19107.562 ms
>>
>> The period in between where no vacuum was running is imo considerably
>> harder to spot looking at 'lat'.

> ISTM that the signal is pretty clear in whether the lag increases or
> decreases. Basically the database is 18 seconds behind its load, which
> is very bad if a user is waiting.

That was my thought too when looking at this example. When there is already a long queue of transactions, the txtime of individual transactions doesn't matter much. What matters most in that situation is how fast the system can drain the queue (or how fast the queue builds up further). So the derivative of the lag or lat seems like the most important figure. We don't print exactly that, but it's roughly the same information as the TPS. The jitter the user experiences matters too, i.e. the stddev of 'lat'.
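To make "the derivative of the lag" concrete: if clients keep arriving at a fixed rate L (the -R setting, which isn't shown in the quoted log), the queue grows at (L - tps) transactions per second, so the lag of a just-arrived transaction should change at roughly (L - tps) / tps seconds per second. A quick sketch of pulling that slope out of the progress lines (just an illustration, not something I'd propose for pgbench itself):

import re

# Matches the patched progress lines quoted above, e.g.
# "progress: 139.0 s, 2438.4 tps, ..., lag 17517.751 ms"
PROGRESS = re.compile(r'progress: ([\d.]+) s, ([\d.]+) tps,.* lag ([\d.]+) ms')

def lag_slope(lines):
    """Yield (time, tps, lag slope in ms/s) for consecutive progress lines."""
    prev = None
    for line in lines:
        m = PROGRESS.search(line)
        if not m:
            continue
        t, tps, lag = map(float, m.groups())
        if prev is not None:
            yield t, tps, (lag - prev[1]) / (t - prev[0])
        prev = (t, lag)

Run over the excerpt above, that slope is consistently positive (about +220 to +435 ms/s) while a vacuum runs, flips sharply negative (about -290 and -395 ms/s) right after one finishes, and hovers around zero in the gap, so the signal Fabien describes really is just the sign of that number.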

To illustrate this, imagine that the server magically detected that there's a long queue of transactions. It would be beneficial to go into "batch mode", where it collects incoming transactions into larger batches. The effect of this imaginary batch mode is that the TPS rate increases by 50%, but the txtime also increases by 1000%, and becomes highly variable. Would that be a good tradeoff? I would say yes. The user is experiencing an 18 s delay anyway, and the increase in txtime would be insignificant compared to that, but the queue would be busted more quickly.
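Toy numbers to put that tradeoff in perspective (all invented for illustration, nothing measured):

# Batch-mode thought experiment with made-up numbers: clients keep
# issuing 2500 tps, and an 18 s backlog has already built up.
arrival_rate = 2500.0               # demanded rate (-R), invented
backlog = 18.0 * arrival_rate       # 45000 transactions already queued

normal_tps = 3000.0                 # plain-mode throughput, invented
batch_tps = 1.5 * normal_tps        # imaginary batch mode: +50% TPS

for label, tps in (("normal", normal_tps), ("batch", batch_tps)):
    # the queue drains at (tps - arrival_rate) transactions per second
    print(f"{label}: backlog cleared in {backlog / (tps - arrival_rate):.0f} s")
# normal: backlog cleared in 90 s
# batch: backlog cleared in 22 s

Even if batch mode pushed the ~13 ms txtime past 100 ms, a user already waiting 18 seconds would never notice that, but they would get out of the queue four times sooner.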

Of course, there is no such batch mode in PostgreSQL, and I wouldn't suggest trying to implement anything like that. In a different kind of application, you would rather maintain a steady txtime when the server is at full load, even if it means a lower overall TPS rate. That feels like a more important goal than raw TPS, and I think we all agree on that. To simulate that kind of application, though, you probably wouldn't use -R, or you would use it together with --latency-limit. Except that Andres is clearly trying to do just that, which is why I'm still a bit confused :-).
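For the record, the kind of invocation I mean is something like this (client counts, duration, rate, limit and database name are all placeholders):

pgbench -c 16 -j 4 -T 300 -P 1 --rate=3000 --latency-limit=100 bench

With --latency-limit on top of --rate, transactions that have fallen so far behind schedule that they cannot meet the limit are skipped and reported separately, instead of piling up in an ever-growing queue, which is closer to how an interactive application with impatient users behaves.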

- Heikki

