On Thu, Jul 23, 2020 at 4:11 PM Hugh Dickins <hu...@google.com> wrote:
>
> On Thu, 23 Jul 2020, Linus Torvalds wrote:
> >
> > I'll send a new version after I actually test it.
>
> I'll give it a try when you're happy with it.

Ok, what I described is what I've been running for a while now. But I
don't put much stress on my system with my normal workload, so..

> I did try yesterday's
> with my swapping loads on home machines (3 of 4 survived 16 hours),
> and with some google stresstests on work machines (0 of 10 survived).
>
> I've not spent long analyzing the crashes, all of them in or below
> __wake_up_common() called from __wake_up_locked_key_bookmark():
> sometimes gets to run the curr->func() and crashes on something
> inside there (often list_del's lib/list_debug.c:53!), sometimes
> cannot get that far. Looks like the wait queue entries on the list
> were not entirely safe with that patch.

Hmm. The bug Oleg pointed out should be pretty theoretical. But I
think the new approach with WQ_FLAG_WOKEN was much better anyway,
despite me missing that one spot in the first version of the patch.

So here's two patches - the first one does that wake_page_function()
conversion, and the second one just does the memory ordering cleanup I
mentioned.

I don't think the second one shouldn't matter on x86, but who knows.

I don't enable list debugging, but I find list corruption surprising.
All of _that_ should be inside the page waiqueue lock, the only
unlocked part was the "list_empty_careful()" part.

But I'll walk over my patch mentally one more time. Here's the current
version, anyway.

                Linus
From cf6db0b8554723f0308fd9299e642898e36c9c8c Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Thu, 23 Jul 2020 10:16:49 -0700
Subject: [PATCH 1/2] mm: rewrite wait_on_page_bit_common() logic

It turns out that wait_on_page_bit_common() had several problems,
ranging from just unfair behavioe due to re-queueing at the end of the
wait queue when re-trying, and an outright bug that could result in
missed wakeups (but probably never happened in practice).

This rewrites the whole logic to avoid both issues, by simply moving the
logic to check (and possibly take) the bit lock into the wakeup path
instead.

That makes everything much more straightforward, and means that we never
need to re-queue the wait entry: if we get woken up, we'll be notified
through WQ_FLAG_WOKEN, and the wait queue entry will have been removed,
and everything will have been done for us.

Link: https://lore.kernel.org/lkml/CAHk-=wjJA2Z3kUFb-5s=6+n0qbts8elqkft9b3ph85a8fgd...@mail.gmail.com/
Link: https://lore.kernel.org/lkml/alpine.LSU.2.11.2007221359450.1017@eggly.anvils/
Reported-by: Oleg Nesterov <o...@redhat.com>
Reported-by: Hugh Dickins <hu...@google.com>
Cc: Michal Hocko <mho...@suse.com>
Signed-off-by: Linus Torvalds <torva...@linux-foundation.org>
---
 mm/filemap.c | 121 +++++++++++++++++++++++++++++++--------------------
 1 file changed, 73 insertions(+), 48 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 385759c4ce4b..1143c0652d81 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1002,6 +1002,7 @@ struct wait_page_queue {
 
 static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync, void *arg)
 {
+	int ret;
 	struct wait_page_key *key = arg;
 	struct wait_page_queue *wait_page
 		= container_of(wait, struct wait_page_queue, wait);
@@ -1013,18 +1014,40 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
 	if (wait_page->bit_nr != key->bit_nr)
 		return 0;
 
+	/* Stop walking if it's locked */
+	if (wait->flags & WQ_FLAG_EXCLUSIVE) {
+		if (test_and_set_bit(key->bit_nr, &key->page->flags))
+			return -1;
+	} else {
+		if (test_bit(key->bit_nr, &key->page->flags))
+			return -1;
+	}
+
 	/*
-	 * Stop walking if it's locked.
-	 * Is this safe if put_and_wait_on_page_locked() is in use?
-	 * Yes: the waker must hold a reference to this page, and if PG_locked
-	 * has now already been set by another task, that task must also hold
-	 * a reference to the *same usage* of this page; so there is no need
-	 * to walk on to wake even the put_and_wait_on_page_locked() callers.
+	 * Let the waiter know we have done the page flag
+	 * handling for it (and the return value lets the
+	 * wakeup logic count exclusive wakeup events).
 	 */
-	if (test_bit(key->bit_nr, &key->page->flags))
-		return -1;
+	ret = (wait->flags & WQ_FLAG_EXCLUSIVE) != 0;
+	wait->flags |= WQ_FLAG_WOKEN;
+	wake_up_state(wait->private, mode);
 
-	return autoremove_wake_function(wait, mode, sync, key);
+	/*
+	 * Ok, we have successfully done what we're waiting for,
+	 * and we can unconditionally remove the wait entry.
+	 *
+	 * Note that this has to be the absolute last thing we do,
+	 * since after list_del_init(&wait->entry) the wait entry
+	 * might be de-allocated and the process might even have
+	 * exited.
+	 *
+	 * We _really_ should have a "list_del_init_careful()" to
+	 * properly pair with the unlocked "list_empty_careful()"
+	 * in finish_wait().
+	 */
+	smp_mb();
+	list_del_init(&wait->entry);
+	return ret;
 }
 
 static void wake_up_page_bit(struct page *page, int bit_nr)
@@ -1103,16 +1126,22 @@ enum behavior {
 			 */
 };
 
+static inline int trylock_page_bit_common(struct page *page, int bit_nr,
+	enum behavior behavior)
+{
+	return behavior == EXCLUSIVE ?
+		!test_and_set_bit(bit_nr, &page->flags) :
+		!test_bit(bit_nr, &page->flags);
+}
+
 static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	struct page *page, int bit_nr, int state, enum behavior behavior)
 {
 	struct wait_page_queue wait_page;
 	wait_queue_entry_t *wait = &wait_page.wait;
-	bool bit_is_set;
 	bool thrashing = false;
 	bool delayacct = false;
 	unsigned long pflags;
-	int ret = 0;
 
 	if (bit_nr == PG_locked &&
 	    !PageUptodate(page) && PageWorkingset(page)) {
@@ -1130,48 +1159,44 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	wait_page.page = page;
 	wait_page.bit_nr = bit_nr;
 
+	/*
+	 * Add ourselves to the wait queue.
+	 *
+	 * NOTE! This is where we also check the page
+	 * state synchronously the last time to see that
+	 * somebody didn't just clear the bit. Do the
+	 * SetPageWaiters() before that to let anybody
+	 * we just miss know they need to wake us up.
+	 */
+	spin_lock_irq(&q->lock);
+	SetPageWaiters(page);
+	if (!trylock_page_bit_common(page, bit_nr, behavior))
+		__add_wait_queue_entry_tail(q, wait);
+	else
+		wait->flags |= WQ_FLAG_WOKEN;
+	spin_unlock_irq(&q->lock);
+
+	/*
+	 * From now on, all the logic will be based on
+	 * whether the wait entry is on the queue or not,
+	 * and the page bit testing (and setting) will be
+	 * done by the wake function, not us.
+	 *
+	 * We can drop our reference to the page.
+	 */
+	if (behavior == DROP)
+		put_page(page);
+
 	for (;;) {
-		spin_lock_irq(&q->lock);
-
-		if (likely(list_empty(&wait->entry))) {
-			__add_wait_queue_entry_tail(q, wait);
-			SetPageWaiters(page);
-		}
-
 		set_current_state(state);
 
-		spin_unlock_irq(&q->lock);
-
-		bit_is_set = test_bit(bit_nr, &page->flags);
-		if (behavior == DROP)
-			put_page(page);
-
-		if (likely(bit_is_set))
-			io_schedule();
-
-		if (behavior == EXCLUSIVE) {
-			if (!test_and_set_bit_lock(bit_nr, &page->flags))
-				break;
-		} else if (behavior == SHARED) {
-			if (!test_bit(bit_nr, &page->flags))
-				break;
-		}
-
-		if (signal_pending_state(state, current)) {
-			ret = -EINTR;
+		if (signal_pending_state(state, current))
 			break;
-		}
 
-		if (behavior == DROP) {
-			/*
-			 * We can no longer safely access page->flags:
-			 * even if CONFIG_MEMORY_HOTREMOVE is not enabled,
-			 * there is a risk of waiting forever on a page reused
-			 * for something that keeps it locked indefinitely.
-			 * But best check for -EINTR above before breaking.
-			 */
+		if (wait->flags & WQ_FLAG_WOKEN)
 			break;
-		}
+
+		io_schedule();
 	}
 
 	finish_wait(q, wait);
@@ -1190,7 +1215,7 @@ static inline int wait_on_page_bit_common(wait_queue_head_t *q,
 	 * bother with signals either.
 	 */
 
-	return ret;
+	return wait->flags & WQ_FLAG_WOKEN ? 0 : -EINTR;
 }
 
 void wait_on_page_bit(struct page *page, int bit_nr)
-- 
2.28.0.rc0.3.g1e25d3a62f

From ddc00aaf8e020bab630ec641b37564634454634c Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torva...@linux-foundation.org>
Date: Thu, 23 Jul 2020 12:33:41 -0700
Subject: [PATCH 2/2] list: add "list_del_init_careful()" to go with
 "list_empty_careful()"

That gives us ordering guarantees around the pair.

Signed-off-by: Linus Torvalds <torva...@linux-foundation.org>
---
 include/linux/list.h | 20 +++++++++++++++++++-
 kernel/sched/wait.c  |  2 +-
 mm/filemap.c         |  7 +------
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/include/linux/list.h b/include/linux/list.h
index aff44d34f4e4..0d0d17a10d25 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -282,6 +282,24 @@ static inline int list_empty(const struct list_head *head)
 	return READ_ONCE(head->next) == head;
 }
 
+/**
+ * list_del_init_careful - deletes entry from list and reinitialize it.
+ * @entry: the element to delete from the list.
+ *
+ * This is the same as list_del_init(), except designed to be used
+ * together with list_empty_careful() in a way to guarantee ordering
+ * of other memory operations.
+ *
+ * Any memory operations done before a list_del_init_careful() are
+ * guaranteed to be visible after a list_empty_careful() test.
+ */
+static inline void list_del_init_careful(struct list_head *entry)
+{
+	__list_del_entry(entry);
+	entry->prev = entry;
+	smp_store_release(&entry->next, entry);
+}
+
 /**
  * list_empty_careful - tests whether a list is empty and not being modified
  * @head: the list to test
@@ -297,7 +315,7 @@ static inline int list_empty(const struct list_head *head)
  */
 static inline int list_empty_careful(const struct list_head *head)
 {
-	struct list_head *next = head->next;
+	struct list_head *next = smp_load_acquire(&head->next);
 	return (next == head) && (next == head->prev);
 }
 
diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index ba059fbfc53a..01f5d3020589 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -389,7 +389,7 @@ int autoremove_wake_function(struct wait_queue_entry *wq_entry, unsigned mode, i
 	int ret = default_wake_function(wq_entry, mode, sync, key);
 
 	if (ret)
-		list_del_init(&wq_entry->entry);
+		list_del_init_careful(&wq_entry->entry);
 
 	return ret;
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index 1143c0652d81..239d156a38ea 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1040,13 +1040,8 @@ static int wake_page_function(wait_queue_entry_t *wait, unsigned mode, int sync,
 	 * since after list_del_init(&wait->entry) the wait entry
 	 * might be de-allocated and the process might even have
 	 * exited.
-	 *
-	 * We _really_ should have a "list_del_init_careful()" to
-	 * properly pair with the unlocked "list_empty_careful()"
-	 * in finish_wait().
 	 */
-	smp_mb();
-	list_del_init(&wait->entry);
+	list_del_init_careful(&wait->entry);
 	return ret;
 }
 
-- 
2.28.0.rc0.3.g1e25d3a62f

Reply via email to