When a FUSE process is reclaiming memory, it must not wait
on page writeback. Otherwise it may encounter a page whose
writeback it is itself performing, and the process will stall.

For this reason, our kernel has not waited on writeback in reclaim
since commit a9707947010d
("mm: vmscan: never wait on writeback pages").

But with a huge number of pages under writeback and
memory pressure, this leads to a busy loop: many processes
in the system try to reclaim memory without success,
and the node shows high time spent in the kernel.

This patch reduces the number of processes that may
busy-loop in reclaim. Only one userspace process,
vstorage, is allowed not to sleep on writeback.
All other processes sleep for up to 5 seconds on every
page, waiting for writeback to complete.

The detection of vstorage is very simple: it is based
on the process name. There seems to be no way to detect
all FUSE processes, especially from !ve0, because FUSE
mounting is tricky, and the process doing the mount may
not be the FUSE daemon. So for all other processes we keep
the vanilla kernel behaviour, except that we do not wait
forever, only 5 seconds. This saves us from lockup messages
from the kernel and makes it possible to kill the FUSE daemon
if necessary.

https://jira.sw.ru/browse/PSBM-69296

Signed-off-by: Kirill Tkhai <ktk...@virtuozzo.com>
---
 mm/vmscan.c |   19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)
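
Note for reviewers: the decision made in the first hunk can be modelled
in userspace as below. This is only a sketch under stated assumptions,
not kernel code. The timed wait is replaced by a hypothetical callback
that returns 0 when writeback finished and nonzero otherwise, mirroring
the "!= 0" test in the patch; the real return values of
wait_on_page_bit_killable_timeout() are not modelled. The names
comm_is_vstorage(), skip_writeback_page(), demo() and the process name
"vstorage-mount" are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the timed writeback wait:
 * returns 0 if writeback finished, nonzero otherwise. */
typedef int (*wait_fn_t)(void);

static int wait_completes(void) { return 0; }
static int wait_times_out(void) { return 1; }

/* Same test as the patch: compares only the first 8 bytes of comm,
 * so any name that starts with "vstorage" matches. */
bool comm_is_vstorage(const char *comm)
{
	return strncmp(comm, "vstorage", 8) == 0;
}

/* true  -> skip the page (nr_immediate++; goto keep_locked in the patch)
 * false -> writeback finished, reclaim may go on with this page */
bool skip_writeback_page(const char *comm, wait_fn_t wait)
{
	/* Short-circuit: for vstorage, wait() is never called. */
	return comm_is_vstorage(comm) || wait() != 0;
}

void demo(void)
{
	assert(skip_writeback_page("vstorage", wait_completes));
	/* Prefix match: a hypothetical "vstorage-mount" is exempt too. */
	assert(skip_writeback_page("vstorage-mount", wait_completes));
	assert(!skip_writeback_page("bash", wait_completes));
	assert(skip_writeback_page("bash", wait_times_out));
}
```

Since strncmp() is limited to 8 bytes, the check is a prefix match: any
task whose comm begins with "vstorage" skips the sleep, not only a task
named exactly "vstorage".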

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a5db5940bb1..e72d515c111 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -959,8 +959,16 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
                        /* Case 3 above */
                        } else {
-                               nr_immediate++;
-                               goto keep_locked;
+                               /*
+                                * Currently, vstorage is the only fuse process,
+                                * exercising writeback; it mustn't sleep to avoid
+                                * deadlocks.
+                                */
+                               if (!strncmp(current->comm, "vstorage", 8) ||
+                                   wait_on_page_bit_killable_timeout(page, PG_writeback, 5 * HZ) != 0) {
+                                       nr_immediate++;
+                                       goto keep_locked;
+                               }
                        }
                }
 
@@ -1592,9 +1600,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
        if (nr_writeback && nr_writeback == nr_taken)
                zone_set_flag(zone, ZONE_WRITEBACK);
 
-       if (!global_reclaim(sc) && nr_immediate)
-               congestion_wait(BLK_RW_ASYNC, HZ/10);
-
+       /*
+        * memcg will stall in page writeback so only consider forcibly
+        * stalling for global reclaim
+        */
        if (global_reclaim(sc)) {
                /*
                 * Tag a zone as congested if all the dirty pages scanned were
