Hi,

On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
> > Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
> > this afaics would allow wal writer to go into sleep mode with half a
> > page filled, and it'd not be woken up again until the page is filled.
> > No?
> >
> 
> If it is taking the big sleep, then we always wake it up, since
> commit acd4c7d58baf09f.
> 
> I didn't change that part.  I only changed what we do if it is not hibernating.

Right, I hadn't read enough of the context.


> It looks like only limited consolidation was happening, the number of kills
> invoked was more than half of the number of SetLatch calls.  I think the
> reason is that when kill(owner_pid, SIGUSR1) is called, the kernel
> immediately suspends the calling process and gives the signalled process a
> chance to run in that time slice.  The WAL Writer gets woken up, sees that
> it only has a partial page to write and decides not to do anything, but
> resets the latch.  It does this fast enough that the subscription worker
> hasn't migrated CPUs yet.

I think part of the problem here is that latches signal the owning
process even if the owning process isn't actually sleeping - that's
obviously quite pointless.  In cases where walwriter is busy, that
causes a lot of superfluous interrupted syscalls, because walwriter
never actually ends up waiting on the latch.  There are also a lot of
superfluous context switches.  I think we can avoid doing so quite
easily, as e.g. with the attached patch.  Could you check how much that
helps your benchmark?
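
In miniature, the idea of the patch is the following (a sketch only;
handle_latch_event() is made up here as shorthand, the real code is in
the diff below):

	/* Waiter (WaitEventSetWait), before blocking: announce that we're
	 * about to wait, then recheck to close the race against a
	 * concurrent SetLatch(). */
	set->latch->owner_waiting = true;
	pg_memory_barrier();		/* pairs with the setter's read barrier */
	if (set->latch->is_set)
		handle_latch_event();	/* made up: report WL_LATCH_SET, don't block */

	/* Setter (SetLatch), when the latch belongs to another process: */
	pg_read_barrier();
	if (latch->owner_waiting)	/* skip the kill() if nobody is waiting */
		kill(owner_pid, SIGUSR1);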


> The first change made the WALWriter wake up and do a write and flush
> whenever an async commit occurred and there was a completed WAL page to be
> written.  This way the hint bits could be set while the heap page was still
> hot, because they couldn't get set until the WAL covering the hinted-at
> transaction commit was flushed.
> 
> The second change said we can set hint bits before the WAL covering the
> hinted-at transaction is flushed, as long as the page LSN is newer than that
> (so the WAL covering the hinted-at transaction commit must be flushed
> before the page containing the hint bit can be written).
> 
> Then the third change makes the WAL writer only write the completed pages
> most of the times it is woken up, not flush them.  It only flushes them once
> every wal_writer_delay or wal_writer_flush_after, regardless of how often
> it is woken up.
> 
> It seems like the third change rendered the first one useless.

I don't think so.  Isn't the walwriter writing out the contents of the
WAL quite important, because it makes room in wal_buffers for new WAL?
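
For illustration, the interlock is roughly this (loosely modeled on
AdvanceXLInsertBuffer(); a backend that wants to recycle a wal_buffers
page has to get its current contents written out first):

	if (LogwrtResult.Write < OldPageRqstPtr)
	{
		XLogwrtRqst WriteRqst;

		/* the page's contents aren't on disk yet - write (but don't
		 * flush) up to there, in the inserting backend itself */
		WriteRqst.Write = OldPageRqstPtr;
		WriteRqst.Flush = 0;
		XLogWrite(WriteRqst, false);
	}

If walwriter keeps up, backends never hit this path.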

I suspect we actually should wake up walwriter *more* aggressively, to
offload WAL fsyncs from individual backends, even when they're not
committing.  Right now we fsync whenever a segment is finished - we
really don't want to do that in backends that could do other useful
work.  I suspect it'd be a good idea to remove that logic from
XLogWrite() and replace it with
        if (ProcGlobal->walwriterLatch)
                SetLatch(ProcGlobal->walwriterLatch);
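
I.e. roughly this, at the segment boundary in XLogWrite() (a sketch of
the proposal, not a tested patch; finishing_seg is the existing local
variable there):

	if (finishing_seg)
	{
		/*
		 * Instead of calling issue_xlog_fsync() here - stalling a
		 * backend that could do other useful work - just poke
		 * walwriter and let it absorb the fsync:
		 */
		if (ProcGlobal->walwriterLatch)
			SetLatch(ProcGlobal->walwriterLatch);
	}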


> Wouldn't it
> better, and much simpler, just to have reverted the first change once the
> second one was done?

I think both can actually happen independently, no?  It's pretty common
for the page LSN to be *older* than the corresponding commit record's
LSN.  In fact you'll always hit that case unless there are concurrent
writes also touching said page.
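
For reference, the interlock from the second change, roughly as it
looks in SetHintBits() (trimmed):

	XLogRecPtr	commitLSN = TransactionIdGetCommitLSN(xid);

	if (BufferIsPermanent(buffer) && XLogNeedsFlush(commitLSN) &&
		BufferGetLSNAtomic(buffer) < commitLSN)
		return;				/* commit record not flushed, and the page
							 * LSN doesn't cover it - can't hint yet */

	tuple->t_infomask |= infomask;
	MarkBufferDirtyHint(buffer, true);

So the first change still matters whenever the page LSN lags the commit
record.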


> If that were done, would the third change still be
> needed?  (It did seem to add some other features as well, so I'm going to
> assume we still want those...).

It's still useful, yes.  It avoids flushing the WAL too often
(page-by-page when there's a lot of writes), which can eat up a lot of
IOPS on fast storage.
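
With the defaults, i.e.

	wal_writer_delay = 200ms
	wal_writer_flush_after = 1MB

walwriter flushes once per wal_writer_delay cycle, or as soon as 1MB of
unflushed WAL has accumulated - rather than page-by-page.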



> But now the first change is even worse than useless, it is positively
> harmful.  The only thing to stop it from waking the WALWriter for every
> async commit is this line:
> 
>         /* if we have already flushed that far, we're done */
>         if (WriteRqstPtr <= LogwrtResult.Flush)
>             return;
> 
> Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
> become true due to us waking walwriter, it only becomes true once the
> timeout expires (which it would have done anyway with no help from us), or
> once wal_writer_flush_after is exceeded.  So now every async commit is
> waking the walwriter.  Most of those wake-ups do nothing (there is not a
> completely new patch to write), some write one completed page but don't
> flush anything, and very few do a flush, and that one would have been done
> anyway.

s/completely new patch/completely new page/?

In my opinion we actually *do* want to write out (but not flush!) such
pages, so I'm not sure I agree with that logic.  Have to think about
this some more...
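
Sketch of what I mean, loosely after XLogBackgroundFlush() (the
flush_trigger_fired flag is made up - it stands for the
wal_writer_delay / wal_writer_flush_after logic):

	/* write out all completed pages whenever there are any, ... */
	WriteRqst.Write = WriteRqstPtr;
	/* ... but only fsync once one of the flush triggers has fired */
	WriteRqst.Flush = flush_trigger_fired ? WriteRqstPtr : 0;
	XLogWrite(WriteRqst, flexible);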


Greetings,

Andres Freund
From 50a3e62b16b752837f5a388e3259fee2bca6c2ab Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Sat, 24 Jun 2017 16:46:11 -0700
Subject: [PATCH 1/3] Don't signal latch owners if owner not waiting on latch.

Doing so can generate a lot of useless syscalls and context switches
in cases where the owner of the latch is busy doing actual work.

Discussion: https://postgr.es/m/CAMkU=1z6_k4me12vnirgv5qnu58+czqdnxd+pb5bzxnvovv...@mail.gmail.com
---
 src/backend/storage/ipc/latch.c | 62 ++++++++++++++++++++++++++++++++++-------
 src/include/storage/latch.h     |  1 +
 2 files changed, 53 insertions(+), 10 deletions(-)

diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 07b1364de8..c1a1a76d0a 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -221,6 +221,7 @@ InitLatch(volatile Latch *latch)
 {
 	latch->is_set = false;
 	latch->owner_pid = MyProcPid;
+	latch->owner_waiting = false;
 	latch->is_shared = false;
 
 #ifndef WIN32
@@ -268,6 +269,7 @@ InitSharedLatch(volatile Latch *latch)
 
 	latch->is_set = false;
 	latch->owner_pid = 0;
+	latch->owner_waiting = false;
 	latch->is_shared = true;
 }
 
@@ -311,6 +313,7 @@ DisownLatch(volatile Latch *latch)
 	Assert(latch->owner_pid == MyProcPid);
 
 	latch->owner_pid = 0;
+	latch->owner_waiting = false; /* paranoia */
 }
 
 /*
@@ -466,7 +469,17 @@ SetLatch(volatile Latch *latch)
 			sendSelfPipeByte();
 	}
 	else
-		kill(owner_pid, SIGUSR1);
+	{
+		/*
+		 * If the owner of the latch is currently waiting on it, send signal
+		 * to wake up that process.  The read-barrier is necessary so we see
+		 * an up-to-date value, and it pairs with the memory barrier in the
+		 * path setting owner_waiting in WaitEventSetWait.
+		 */
+		pg_read_barrier();
+		if (latch->owner_waiting)
+			kill(owner_pid, SIGUSR1);
+	}
 #else
 
 	/*
@@ -954,6 +967,12 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 		 * immediately, avoid blocking again. We don't attempt to report any
 		 * other events that might also be satisfied.
 		 *
+		 * To avoid other processes signalling us if we're not waiting,
+		 * wasting context switches, we first check whether the latch is
+		 * already set, and only enable signalling from other processes after
+		 * that.  To avoid a race, we have to recheck whether the latch is set
+		 * after asking to be woken up.
+		 *
 		 * If someone sets the latch between this and the
 		 * WaitEventSetWaitBlock() below, the setter will write a byte to the
 		 * pipe (or signal us and the signal handler will do that), and the
@@ -976,17 +995,37 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 		 * ordering, so that we cannot miss seeing is_set if a notification
 		 * has already been queued.
 		 */
-		if (set->latch && set->latch->is_set)
+		if (set->latch)
 		{
-			occurred_events->fd = PGINVALID_SOCKET;
-			occurred_events->pos = set->latch_pos;
-			occurred_events->user_data =
-				set->events[set->latch_pos].user_data;
-			occurred_events->events = WL_LATCH_SET;
-			occurred_events++;
-			returned_events++;
+			bool latch_set = false;
+
+			if (set->latch->is_set)
+			{
+				latch_set = true;
+			}
+			else
+			{
+				set->latch->owner_waiting = true;
+				pg_memory_barrier();
+				if (set->latch->is_set)
+				{
+					latch_set = true;
+				}
+			}
+
+			if (latch_set)
+			{
+				occurred_events->fd = PGINVALID_SOCKET;
+				occurred_events->pos = set->latch_pos;
+				occurred_events->user_data =
+					set->events[set->latch_pos].user_data;
+				occurred_events->events = WL_LATCH_SET;
+				occurred_events++;
+				returned_events++;
+
+				break;
+			}
 
-			break;
 		}
 
 		/*
@@ -1016,6 +1055,9 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
 	waiting = false;
 #endif
 
+	if (set->latch)
+		set->latch->owner_waiting = false;
+
 	pgstat_report_wait_end();
 
 	return returned_events;
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 73abfafec5..442344d914 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -112,6 +112,7 @@ typedef struct Latch
 	sig_atomic_t is_set;
 	bool		is_shared;
 	int			owner_pid;
+	int			owner_waiting; /* int, so all platforms can set atomically */
 #ifdef WIN32
 	HANDLE		event;
 #endif
-- 
2.13.1.392.g8d1b10321b.dirty
