Hi,

On 2023-11-08 13:59:55 +0100, Tomas Vondra wrote:
> > I used netperf's tcp_rr between my workstation and my laptop on a local
> > 10Gbit network (albeit with a crappy external card for my laptop), to put
> > some numbers to this. I used -r $s,100 to test sending variable sized data
> > to the other side, with the other side always responding with 100 bytes
> > (assuming that'd more than fit a feedback response).
> >
> > Command:
> > fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields;
> > for s in 10 100 1000 10000 100000 1000000; do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields"; done
> >
> > 10gbe:
> >
> > request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
> > 10              100             43              64.30           390             96              15526.084
> > 100             100             57              75.12           428             122             13286.602
> > 1000            100             47              74.41           270             108             13412.125
> > 10000           100             89              114.63          712             152             8700.643
> > 100000          100             167             255.90          584             312             3903.516
> > 1000000         100             891             1015.99         2470            1143            983.708
> >
> >
> > Same hosts, but with my workstation forced to use a 1gbit connection:
> >
> > request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
> > 10              100             78              131.18          2425            257             7613.416
> > 100             100             81              129.25          425             255             7727.473
> > 1000            100             100             162.12          1444            266             6161.388
> > 10000           100             310             686.19          1797            927             1456.204
> > 100000          100             1006            1114.20         1472            1199            896.770
> > 1000000         100             8338            8420.96         8827            8498            118.410

Looks like the 1gbit numbers were somewhat bogus-ified due to having
configured jumbo frames and some network component doing something odd with
that (handling them in software, maybe?).

10gbe:
request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
10              100             56              68.56           483             87              14562.476
100             100             57              75.68           353             123             13185.485
1000            100             60              71.97           391             94              13870.659
10000           100             58              92.42           489             140             10798.444
100000          100             184             260.48          1141            338             3834.504
1000000         100             926             1071.46         2012            1466            933.009

1gbe:
request_size    response_size   min_latency     mean_latency    max_latency     p99_latency     transaction_rate
10              100             77              132.19          1097            257             7555.420
100             100             79              127.85          534             249             7810.862
1000            100             98              155.91          966             265             6406.818
10000           100             176             235.37          1451            314             4245.304
100000          100             944             1022.00         1380            1148            977.930
1000000         100             8649            8768.42         9018            8895            113.703


> > I haven't checked, but I'd assume that 100 bytes back and forth should
> > easily fit a new message to update LSNs and the existing feedback
> > response. Even just the difference between sending 100 bytes and sending
> > 10k (a bit more than a single WAL page) is pretty significant on a 1gbit
> > network.
> >
>
> I'm on decaf so I may be a bit slow, but it's not very clear to me what
> conclusion to draw from these numbers. What is the takeaway?
>
> My understanding is that in both cases the latency is initially fairly
> stable, independent of the request size. This applies to requests up to
> ~1000B. And then the latency starts increasing fairly quickly, even
> though it shouldn't hit the bandwidth (except maybe the 1MB requests).

Except for the smallest end, these are bandwidth related, I think. Converting
1gbit/s to bytes/us gives 125 bytes/us - before TCP/IP overhead. Even leaving
the overhead aside, 10kB/100kB of outstanding data take ~80us/800us to send on
1gbit. If you subtract the minimum latency of about 130us, that's nearly all
of the latency.
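
Spelled out (just shell arithmetic, using the raw 125 bytes/us figure and
ignoring protocol overhead):

for s in 10000 100000 1000000; do echo "$s bytes -> ~$((s / 125)) us on the wire at 1gbit"; done

i.e. ~80us / ~800us / ~8000us, which roughly lines up with the 1gbit mean
latencies above once the ~130us floor is subtracted.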

The reason this matters is that the numbers show that the latency of sending a
small message with updated positions is far lower than that of sending all the
outstanding data. Even having to send a single WAL page over the network
~doubles the latency of the response on 1gbit!  Of course the impact is
smaller on 10gbit, but even there latency increases substantially around 100kB
of outstanding data.

In a local pgbench with 32 clients I see WAL write sizes between 8kB and
~220kB. Being able to stream those out before the local flush completes
therefore seems likely to reduce synchronous_commit overhead substantially.
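
(Not how I measured those write sizes, but a crude sanity check one could do:
pg_stat_wal (available since PG 14) exposes wal_write and wal_bytes, so the
average - though not the distribution - can be eyeballed with something like

psql -c "SELECT wal_write, wal_bytes, wal_bytes / nullif(wal_write, 0) AS avg_bytes_per_write FROM pg_stat_wal;"

keeping in mind that wal_bytes counts WAL generated rather than written, so
it's only a rough proxy.)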


> I don't think it says we should be replicating WAL in tiny chunks,
> because if you need to send a chunk of data it's always more efficient
> to send it at once (compared to sending multiple smaller pieces).

I don't think that's a very large factor for network data, once your minimal
data size is ~8kB (or ~4kB if we lower wal_block_size). TCP messages will get
chunked into something smaller anyway, and small messages don't need to be
acknowledged individually. Sending more data at once is good for CPU
efficiency (reducing syscall and network device overhead), but doesn't do much
for throughput.

Sending 4kB of data in each send() in a bandwidth-oriented test already gets
to ~9.3 gbit/s on my network. That's close to the maximum attainable with
normal framing. If I change the MTU back to 9000 I get 9.89 gbit/s, again
very close to the theoretical max.
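
For reference, the kind of command for that bandwidth test would be along
these lines (a sketch reusing the host from the TCP_RR runs above, not the
exact invocation; with -P0 netperf just prints the result line, with
throughput in 10^6 bits/sec):

for m in 4096 8192 65536; do netperf -P0 -t TCP_STREAM -l 10 -H alap5 -- -m $m; done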

Greetings,

Andres Freund

