Re: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

2017-01-09 Thread Jonathon Nelson
On Sun, Jan 8, 2017 at 11:36 AM, Greg Stark  wrote:

> On 8 January 2017 at 17:26, Greg Stark  wrote:
> > On 5 January 2017 at 19:01, Andres Freund  wrote:
> >> That's a bit odd - shouldn't the OS network stack take care of this in
> >> both cases?  I mean either is too big for TCP packets (including jumbo
> >> frames).  What type of OS and network is involved here?
> >
> > 2x may be plausible. The first 128kB goes out, then the rest queues up
> > until the first ack comes back. Then the next 128kB goes out again
> > without waiting... I think this is what Nagle is actually supposed to
> > address, but either it may be off by default these days or our usage
> > pattern may be defeating it in some way.
>
> Hm. That wasn't very clear.  And the more I think about it, it's not right.
>
> The first block of data -- one byte in the worst case, 128kB in our
> case -- gets put in the output buffers and, since there's nothing
> stopping it, it immediately gets sent out. Then all the subsequent data
> gets put in output buffers but backs up due to Nagle until there's
> a full packet of data buffered, the ack arrives, or the timeout
> expires, at which point the buffered data drains efficiently in full
> packets. Eventually it all drains away and the next 128kB arrives and
> is sent out immediately.
>
> So most packets are full size with the occasional 128kB packet thrown
> in whenever the buffer empties. And I think even when the 128kB packet
> is pending Nagle only stops small packets, not full packets, and the
> window should allow more than one packet of data to be pending.
>
> So, uh, forget what I said. Nagle should be our friend here.
>
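For what it's worth, PostgreSQL already requests TCP_NODELAY on its
sockets, so Nagle shouldn't be coalescing anything here anyway. A minimal
sketch of that call, assuming a connected TCP socket fd (this is not the
actual pqcomm.c code):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int
    disable_nagle(int fd)
    {
        int on = 1;

        /* TCP_NODELAY disables Nagle: small writes are sent
         * immediately rather than held while an ack is outstanding. */
        return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
                          (char *) &on, sizeof(on));
    }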

[I have not done a rigorous analysis here, but...]

I *think* libpq is the culprit here.

walsender says "Hey, libpq - please send (up to) 128KB of data!" and
doesn't "return" until it's "sent". Then it sends more.  Regardless of the
underlying cause (Nagle, TCP congestion control algorithms, umpteen
different combos of hardware and settings, etc.), in almost every test I
saw an improvement (usually quite a bit). This was most easily observable
with high bandwidth-delay product links, but my time in the lab is
somewhat limited.
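
To illustrate the shape of that pattern, here is a rough sketch. This is
hypothetical code, not the actual walsender/libpq path; plain write()
stands in for the libpq buffer-and-flush machinery:

    #include <stddef.h>
    #include <unistd.h>

    static void
    send_wal(int fd, const char *wal, size_t total, size_t chunk)
    {
        size_t  off = 0;

        while (off < total)
        {
            size_t  n = (total - off < chunk) ? total - off : chunk;
            size_t  sent = 0;

            /* Block until this whole chunk has been handed to the
             * kernel before fetching any more WAL.  With a small chunk
             * (128kB) on a high bandwidth-delay product link, the pipe
             * can drain between iterations. */
            while (sent < n)
            {
                ssize_t rc = write(fd, wal + off + sent, n - sent);

                if (rc < 0)
                    return;     /* error handling elided */
                sent += (size_t) rc;
            }
            off += n;
        }
    }

Raising the chunk size (the point of the patch) simply means each of
those blocking rounds hands the kernel more data to keep in flight.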

I calculated "performance" with the simplest measurement possible: how
long it took for a volume Y of data to get transferred, measured over a
long-enough interval (typically 1800 seconds) for TCP windows to open up,
etc.
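
As a back-of-the-envelope illustration (the link speed is an assumed
number, not one of our measurements): on a 1 Gbit/s path at 80ms RTT the
bandwidth-delay product is roughly

    0.080 s * 125 MB/s = 10 MB

so if the sender effectively stalls for an RTT between 128kB chunks,
throughput is capped near 128 kB / 0.080 s ~= 1.6 MB/s, while a 16MiB
chunk covers the whole BDP.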

-- 
Jon Nelson
Dyn / Principal Software Engineer


Re: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

2017-01-06 Thread Jonathon Nelson
On Fri, Jan 6, 2017 at 8:52 AM, Kevin Grittner  wrote:

> On Thu, Jan 5, 2017 at 7:32 PM, Jonathon Nelson  wrote:
> > On Thu, Jan 5, 2017 at 1:01 PM, Andres Freund  wrote:
> >> On 2017-01-05 12:55:44 -0600, Jonathon Nelson wrote:
>
> >>> In our lab environment and with a 16MiB setting, we saw substantially
> >>> better network utilization (almost 2x!), primarily over high bandwidth
> >>> delay product links.
> >>
> >> That's a bit odd - shouldn't the OS network stack take care of this in
> >> both cases?  I mean either is too big for TCP packets (including jumbo
> >> frames).  What type of OS and network is involved here?
> >
> > In our test lab, we make use of multiple flavors of Linux. No jumbo
> > frames. We simulated anything from 0 to 160ms RTT (with varying degrees
> > of jitter, packet loss, etc.) using tc. Even with everything fairly
> > clean, at 80ms RTT there was a 2x improvement in performance.
>
> Is there compression and/or encryption being performed by the
> network layers?  My experience with both is that they run faster on
> bigger chunks of data, and that might happen before the data is
> broken into packets.
>

There is no compression or encryption. The testing was with and without
various forms of hardware offload, etc., but otherwise there is no magic
up these sleeves.
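
For reference, that kind of offload is typically toggled with ethtool;
an illustrative invocation (interface name and feature list here are
examples only):

    ethtool -K eth0 tso off gso off gro off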

-- 
Jon Nelson
Dyn / Principal Software Engineer


Re: [HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

2017-01-05 Thread Jonathon Nelson
On Thu, Jan 5, 2017 at 1:01 PM, Andres Freund  wrote:

> Hi,
>
> On 2017-01-05 12:55:44 -0600, Jonathon Nelson wrote:
> > Attached please find a patch for PostgreSQL 9.4 which changes the
> > maximum amount of data that the wal sender will send at any point in
> > time from the hard-coded value of 128KiB to a user-controllable value
> > up to 16MiB. It has been primarily tested under 9.4 but there has been
> > some testing with 9.5.
> >
> > In our lab environment and with a 16MiB setting, we saw substantially
> > better network utilization (almost 2x!), primarily over high bandwidth
> > delay product links.
>
> That's a bit odd - shouldn't the OS network stack take care of this in
> both cases?  I mean either is too big for TCP packets (including jumbo
> frames).  What type of OS and network is involved here?
>

In our test lab, we make use of multiple flavors of Linux. No jumbo frames.
We simulated anything from 0 to 160ms RTT (with varying degrees of jitter,
packet loss, etc.) using tc. Even with everything fairly clean, at 80ms RTT
there was a 2x improvement in performance.
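
For anyone wanting to reproduce the setup: the impairments were injected
with tc/netem. An illustrative invocation (interface name and exact
numbers are examples, not our actual settings):

    tc qdisc add dev eth0 root netem delay 80ms 10ms loss 0.1%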

-- 
Jon Nelson
Dyn / Principal Software Engineer


[HACKERS] [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

2017-01-05 Thread Jonathon Nelson
Attached please find a patch for PostgreSQL 9.4 which changes the maximum
amount of data that the wal sender will send at any point in time from the
hard-coded value of 128KiB to a user-controllable value up to 16MiB. It has
been primarily tested under 9.4 but there has been some testing with 9.5.

In our lab environment and with a 16MiB setting, we saw substantially
better network utilization (almost 2x!), primarily over high bandwidth
delay product links.
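
With the patch applied, the knob is an ordinary kB-denominated GUC,
reloadable via SIGHUP (see the definition below), so the 16MiB setting
used in our tests looks like this in postgresql.conf:

    max_wal_send = 16384    # kB; default 128, allowed range 4..16384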

-- 
Jon Nelson
Dyn / Principal Software Engineer
From 5ba24d84d880d756bec538e35c499811d88e2fc3 Mon Sep 17 00:00:00 2001
From: Jon Nelson 
Date: Wed, 7 Sep 2016 07:23:53 -0500
Subject: [PATCH] guc-ify the formerly hard-coded MAX_SEND_SIZE to max_wal_send

---
 src/backend/replication/walsender.c | 14 ++++++++------
 src/backend/utils/misc/guc.c        | 12 ++++++++++++
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index b671c43..743d6c8 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -89,7 +89,7 @@
  * because signals are checked only between messages.  128kB (with
  * default 8k blocks) seems like a reasonable guess for now.
  */
-#define MAX_SEND_SIZE (XLOG_BLCKSZ * 16)
+int	max_wal_send_guc = 0;
 
 /* Array of WalSnds in shared memory */
 WalSndCtlData *WalSndCtl = NULL;
@@ -2181,7 +2181,7 @@ retry:
 /*
  * Send out the WAL in its normal physical/stored form.
  *
- * Read up to MAX_SEND_SIZE bytes of WAL that's been flushed to disk,
+ * Read up to max_wal_send bytes of WAL that's been flushed to disk,
  * but not yet sent to the client, and buffer it in the libpq output
  * buffer.
  *
@@ -2195,6 +2195,7 @@ XLogSendPhysical(void)
 	XLogRecPtr	startptr;
 	XLogRecPtr	endptr;
 	Size		nbytes;
+	int		max_wal_send = max_wal_send_guc * 1024;
 
 	if (streamingDoneSending)
 	{
@@ -2333,8 +2334,8 @@ XLogSendPhysical(void)
 
 	/*
 	 * Figure out how much to send in one message. If there's no more than
-	 * MAX_SEND_SIZE bytes to send, send everything. Otherwise send
-	 * MAX_SEND_SIZE bytes, but round back to logfile or page boundary.
+	 * max_wal_send bytes to send, send everything. Otherwise send
+	 * max_wal_send bytes, but round back to logfile or page boundary.
 	 *
 	 * The rounding is not only for performance reasons. Walreceiver relies on
 	 * the fact that we never split a WAL record across two messages. Since a
@@ -2344,7 +2345,7 @@ XLogSendPhysical(void)
 	 */
 	startptr = sentPtr;
 	endptr = startptr;
-	endptr += MAX_SEND_SIZE;
+	endptr += max_wal_send;
 
 	/* if we went beyond SendRqstPtr, back off */
 	if (SendRqstPtr <= endptr)
@@ -2363,7 +2364,8 @@ XLogSendPhysical(void)
 	}
 
 	nbytes = endptr - startptr;
-	Assert(nbytes <= MAX_SEND_SIZE);
+	Assert(nbytes <= max_wal_send);
+	elog(DEBUG2, "walsender sending WAL payload of %d bytes", (int) nbytes);
 
 	/*
 	 * OK to read and send the slice.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a9f31ef..3a5018d 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -128,6 +128,7 @@ extern bool synchronize_seqscans;
 extern char *SSLCipherSuites;
 extern char *SSLECDHCurve;
 extern bool SSLPreferServerCiphers;
+extern int max_wal_send_guc;
 
 #ifdef TRACE_SORT
 extern bool trace_sort;
@@ -2145,6 +2146,17 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
+		{"max_wal_send", PGC_SIGHUP, REPLICATION_SENDING,
+			gettext_noop("Sets the maximum WAL payload size for WAL replication."),
+			NULL,
+			GUC_UNIT_KB
+		},
+		&max_wal_send_guc,
+		128, 4, 16384,
+		NULL, NULL, NULL
+	},
+
+	{
 		{"commit_delay", PGC_SUSET, WAL_SETTINGS,
 			gettext_noop("Sets the delay in microseconds between transaction commit and "
 		 "flushing WAL to disk."),
-- 
2.10.2

