Re: [PATCH 1/3] loop: Use worker per cgroup instead of kworker

2021-04-07 Thread Dan Schatzberg
On Wed, Apr 07, 2021 at 02:53:00PM +0800, Hillf Danton wrote:
> On Tue, 6 Apr 2021 Dan Schatzberg wrote:
> >On Sat, Apr 03, 2021 at 10:09:02AM +0800, Hillf Danton wrote:
> >> On Fri,  2 Apr 2021 12:16:32 Dan Schatzberg wrote:
> >> > +queue_work:
> >> > +if (worker) {
> >> > +/*
> >> > + * We need to remove from the idle list here while
> >> > + * holding the lock so that the idle timer doesn't
> >> > + * free the worker
> >> > + */
> >> > +if (!list_empty(&worker->idle_list))
> >> > +list_del_init(&worker->idle_list);
> >> 
> >> Nit, only queue work if the worker is inactive - otherwise it is taking
> >> care of the cmd_list.
> >
> >By worker is inactive, you mean worker is on the idle_list? Yes, I
> >think you're right that queue_work() is unnecessary in that case since
> >each worker checks empty cmd_list then adds itself to idle_list under
> >the lock.

A couple of other corner cases: when the worker has just been
allocated it needs a queue_work(), and the rootcg always needs a
queue_work() since it never sits on the idle_list. Handling these adds
a bit of logic compared to just unconditionally invoking queue_work().

> >
> >> 
> >> > +work = &worker->work;
> >> > +cmd_list = &worker->cmd_list;
> >> > +} else {
> >> > +work = &lo->rootcg_work;
> >> > +cmd_list = &lo->rootcg_cmd_list;
> >> > +}
> >> > +list_add_tail(&cmd->list_entry, cmd_list);
> >> > +queue_work(lo->workqueue, work);
> >> > +spin_unlock_irq(&lo->lo_work_lock);
> >> >  }
> >> [...]
> >> > +/*
> >> > + * We only add to the idle list if there are no pending cmds
> >> > + * *and* the worker will not run again which ensures that it
> >> > + * is safe to free any worker on the idle list
> >> > + */
> >> > +if (worker && !work_pending(&worker->work)) {
> >> 
> >> The empty cmd_list is a good enough reason for worker to become idle.
> >
> >This is only true with the above change to avoid a gratuitous
> >queue_work(), right?
> 
> It is always true because of the empty cmd_list - the idle_list is the only
> place for the worker to go at this point.
> 
> >Otherwise we run the risk of freeing a worker
> >concurrently with loop_process_work() being invoked.
> 
> My suggestion is a minor optimization at most without any change to removing
> worker off the idle_list on queuing work - that cuts the risk for you.

If I just change this line from

if (worker && !work_pending(&worker->work)) {

to

if (worker) {

then the following sequence of events is possible:

1) loop_queue_work runs, adds a command to the worker list
2) loop_process_work runs, processes a single command and then drops
the lock and reschedules
3) loop_queue_work runs again, acquires the lock, adds to the list and
invokes queue_work() again
4) loop_process_work resumes, acquires lock, processes work, notices
list is empty and adds itself to the idle_list
5) idle timer fires and frees the worker
6) loop_process_work runs again (because of the queue_work in 3) and
accesses freed memory

The !work_pending() check prevents step 4) from adding the worker to
the idle_list, so this sequence is not possible. I believe we can only
make this change if we also make the other change you suggested to
avoid the gratuitous queue_work().
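
To make the combined change concrete, the queueing side would end up
looking something like this (untested sketch; "was_new" and
"need_queue" are local flags I'd have to add, with "was_new" set when
the worker was just allocated):

	/* after the rb-tree lookup/allocation, still under lo_work_lock */
	if (worker) {
		/*
		 * Remove from the idle list under the lock so the idle
		 * timer cannot free the worker. A worker that is neither
		 * idle nor freshly allocated is currently running
		 * loop_process_work() and will pick up anything appended
		 * to its cmd_list, so it needs no further queue_work().
		 */
		need_queue = was_new || !list_empty(&worker->idle_list);
		list_del_init(&worker->idle_list);
		work = &worker->work;
		cmd_list = &worker->cmd_list;
	} else {
		/* the rootcg work never sits on the idle list, always kick it */
		need_queue = true;
		work = &lo->rootcg_work;
		cmd_list = &lo->rootcg_cmd_list;
	}
	list_add_tail(&cmd->list_entry, cmd_list);
	if (need_queue)
		queue_work(lo->workqueue, work);
	spin_unlock_irq(&lo->lo_work_lock);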


Re: [PATCH 1/3] loop: Use worker per cgroup instead of kworker

2021-04-06 Thread Dan Schatzberg
Hi Hillf, thanks for the review

On Sat, Apr 03, 2021 at 10:09:02AM +0800, Hillf Danton wrote:
> On Fri,  2 Apr 2021 12:16:32 Dan Schatzberg wrote:
> > +queue_work:
> > +   if (worker) {
> > +   /*
> > +* We need to remove from the idle list here while
> > +* holding the lock so that the idle timer doesn't
> > +* free the worker
> > +*/
> > +   if (!list_empty(&worker->idle_list))
> > +   list_del_init(&worker->idle_list);
> 
> Nit, only queue work if the worker is inactive - otherwise it is taking
> care of the cmd_list.

By worker is inactive, you mean worker is on the idle_list? Yes, I
think you're right that queue_work() is unnecessary in that case since
each worker checks empty cmd_list then adds itself to idle_list under
the lock.

> 
> > +   work = &worker->work;
> > +   cmd_list = &worker->cmd_list;
> > +   } else {
> > +   work = &lo->rootcg_work;
> > +   cmd_list = &lo->rootcg_cmd_list;
> > +   }
> > +   list_add_tail(&cmd->list_entry, cmd_list);
> > +   queue_work(lo->workqueue, work);
> > +   spin_unlock_irq(&lo->lo_work_lock);
> >  }
> [...]
> > +   /*
> > +* We only add to the idle list if there are no pending cmds
> > +* *and* the worker will not run again which ensures that it
> > +* is safe to free any worker on the idle list
> > +*/
> > +   if (worker && !work_pending(&worker->work)) {
> 
> The empty cmd_list is a good enough reason for worker to become idle.

This is only true with the above change to avoid a gratuitous
queue_work(), right? Otherwise we run the risk of freeing a worker
concurrently with loop_process_work() being invoked.
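
For reference, the worker-side pattern described above is roughly the
following (sketch; loop_handle_cmd() is the existing per-command path,
and the idle_worker_list / timer re-arm names are illustrative):

	spin_lock_irq(&lo->lo_work_lock);
	while (!list_empty(cmd_list)) {
		cmd = list_first_entry(cmd_list, struct loop_cmd, list_entry);
		list_del(&cmd->list_entry);
		spin_unlock_irq(&lo->lo_work_lock);

		loop_handle_cmd(cmd);	/* may sleep */
		cond_resched();

		spin_lock_irq(&lo->lo_work_lock);
	}
	/* cmd_list was seen empty under the lock, so the worker may park
	 * itself on the idle list before dropping the lock */
	if (worker && !work_pending(&worker->work)) {
		worker->last_ran_at = jiffies;
		list_add_tail(&worker->idle_list, &lo->idle_worker_list);
		/* arm/refresh the idle timer here */
	}
	spin_unlock_irq(&lo->lo_work_lock);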


[PATCH 3/3] loop: Charge i/o to mem and blk cg

2021-04-02 Thread Dan Schatzberg
The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.

This patch also exports cgroup_get_e_css and int_active_memcg so it
can be used by the loop module.
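
Roughly, the issuing side pairs each command's css with the worker
context like this (illustrative sketch, not a verbatim hunk;
kthread_associate_blkcg(), set_active_memcg() and mem_cgroup_from_css()
are existing kernel helpers, and do_req_filebacked() is the existing
loop submission path):

	struct mem_cgroup *old_memcg = NULL;

	if (cmd->blkcg_css)
		kthread_associate_blkcg(cmd->blkcg_css);
	if (cmd->memcg_css)
		old_memcg = set_active_memcg(
				mem_cgroup_from_css(cmd->memcg_css));

	ret = do_req_filebacked(lo, rq);

	if (cmd->blkcg_css)
		kthread_associate_blkcg(NULL);
	if (cmd->memcg_css) {
		set_active_memcg(old_memcg);
		css_put(cmd->memcg_css);
	}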

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
---
 drivers/block/loop.c   | 61 +-
 drivers/block/loop.h   |  3 +-
 include/linux/memcontrol.h |  6 
 kernel/cgroup/cgroup.c |  1 +
 mm/memcontrol.c|  1 +
 5 files changed, 51 insertions(+), 21 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 4750b373d4bb..d2759f8a7c2a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -78,6 +78,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "loop.h"
 
@@ -516,8 +517,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long 
ret, long ret2)
 {
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
-   if (cmd->css)
-   css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
 }
@@ -578,8 +577,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
-   if (cmd->css)
-   kthread_associate_blkcg(cmd->css);
 
if (rw == WRITE)
ret = call_write_iter(file, >iocb, );
@@ -587,7 +584,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
ret = call_read_iter(file, >iocb, );
 
lo_rw_aio_do_completion(cmd);
-   kthread_associate_blkcg(NULL);
 
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(>iocb, ret, 0);
@@ -928,7 +924,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
-   struct cgroup_subsys_state *css;
+   struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
 };
 
@@ -945,7 +941,7 @@ static void loop_queue_work(struct loop_device *lo, struct 
loop_cmd *cmd)
 
spin_lock_irq(>lo_work_lock);
 
-   if (!cmd->css)
+   if (!cmd->blkcg_css)
goto queue_work;
 
node = >worker_tree.rb_node;
@@ -953,10 +949,10 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
-   if (cur_worker->css == cmd->css) {
+   if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
-   } else if ((long)cur_worker->css < (long)cmd->css) {
+   } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -968,13 +964,18 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
 * In the event we cannot allocate a worker, just queue on the
-* rootcg worker
+* rootcg worker and issue the I/O as the rootcg
 */
-   if (!worker)
+   if (!worker) {
+   cmd->blkcg_css = NULL;
+   if (cmd->memcg_css)
+   css_put(cmd->memcg_css);
+   cmd->memcg_css = NULL;
goto queue_work;
+   }
 
-   worker->css = cmd->css;
-   css_get(worker->css);
+   worker->blkcg_css = cmd->blkcg_css;
+   css_get(worker->blkcg_css);
INIT_WORK(>work, loop_workfn);
INIT_LIST_HEAD(>cmd_list);
INIT_LIST_HEAD(>idle_list);
@@ -1294,7 +1295,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
release)
idle_list) {
list_del(>idle_list);
rb_erase(>rb_node, >worker_tree);
-   css_put(worker->css);
+   css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(>lo_work_lock);
@@ -2099,13 +2100,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx 
*hctx,
}
 
/* always use the first bio's css */
+   cmd->blkcg_css = NULL;
+   cmd->memcg_css = NULL;
 #ifdef CONFIG_BLK_CGROUP
-   if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
-   cmd->css = &bio_blkcg(rq->bio)->css;
-   css_get(cmd->css);
- 

[PATCH 2/3] mm: Charge active memcg when no mm is set

2021-04-02 Thread Dan Schatzberg
set_active_memcg() worked for kernel allocations but was silently
ignored for user pages.

This patch establishes a precedence order for who gets charged:

1. If there is a memcg associated with the page already, that memcg is
   charged. This happens during swapin.

2. If an explicit mm is passed, mm->memcg is charged. This happens
   during page faults, which can be triggered in remote VMs (eg gup).

3. Otherwise consult the current process context. If there is an
   active_memcg, use that. Otherwise, current->mm->memcg.

Previously, if a NULL mm was passed to mem_cgroup_charge (case 3) it
would always charge the root cgroup. Now it looks up the active_memcg
first (falling back to charging the root cgroup if not set).
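
As a usage sketch of case 3 (illustrative only; "page", "memcg" and
"err" are assumed to be set up by the caller, and the caller must hold
a reference on memcg):

	struct mem_cgroup *old_memcg;

	old_memcg = set_active_memcg(memcg);
	/*
	 * With this patch a NULL mm means "charge the active memcg";
	 * previously it meant "charge the root cgroup".
	 */
	err = mem_cgroup_charge(page, NULL, GFP_KERNEL);
	set_active_memcg(old_memcg);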

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
Acked-by: Tejun Heo 
Acked-by: Chris Down 
Reviewed-by: Shakeel Butt 
---
 mm/filemap.c|  2 +-
 mm/memcontrol.c | 48 +++-
 mm/shmem.c  |  4 ++--
 3 files changed, 34 insertions(+), 20 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c03463cb72d6..38648f7d2106 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -872,7 +872,7 @@ noinline int __add_to_page_cache_locked(struct page *page,
page->index = offset;
 
if (!huge) {
-   error = mem_cgroup_charge(page, current->mm, gfp);
+   error = mem_cgroup_charge(page, NULL, gfp);
if (error)
goto error;
charged = true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c0b83a396299..d2939d6602b3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -886,13 +886,24 @@ struct mem_cgroup *mem_cgroup_from_task(struct 
task_struct *p)
 }
 EXPORT_SYMBOL(mem_cgroup_from_task);
 
+static __always_inline struct mem_cgroup *active_memcg(void)
+{
+   if (in_interrupt())
+   return this_cpu_read(int_active_memcg);
+   else
+   return current->active_memcg;
+}
+
 /**
  * get_mem_cgroup_from_mm: Obtain a reference on given mm_struct's memcg.
  * @mm: mm from which memcg should be extracted. It can be NULL.
  *
- * Obtain a reference on mm->memcg and returns it if successful. Otherwise
- * root_mem_cgroup is returned. However if mem_cgroup is disabled, NULL is
- * returned.
+ * Obtain a reference on mm->memcg and returns it if successful. If mm
+ * is NULL, then the memcg is chosen as follows:
+ * 1) The active memcg, if set.
+ * 2) current->mm->memcg, if available
+ * 3) root memcg
+ * If mem_cgroup is disabled, NULL is returned.
  */
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 {
@@ -901,13 +912,23 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct 
mm_struct *mm)
if (mem_cgroup_disabled())
return NULL;
 
+   /*
+* Page cache insertions can happen without an
+* actual mm context, e.g. during disk probing
+* on boot, loopback IO, acct() writes etc.
+*/
+   if (unlikely(!mm)) {
+   memcg = active_memcg();
+   if (unlikely(memcg)) {
+   /* remote memcg must hold a ref */
+   css_get(&memcg->css);
+   return memcg;
+   }
+   mm = current->mm;
+   }
+
rcu_read_lock();
do {
-   /*
-* Page cache insertions can happen without an
-* actual mm context, e.g. during disk probing
-* on boot, loopback IO, acct() writes etc.
-*/
if (unlikely(!mm))
memcg = root_mem_cgroup;
else {
@@ -921,14 +942,6 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct 
*mm)
 }
 EXPORT_SYMBOL(get_mem_cgroup_from_mm);
 
-static __always_inline struct mem_cgroup *active_memcg(void)
-{
-   if (in_interrupt())
-   return this_cpu_read(int_active_memcg);
-   else
-   return current->active_memcg;
-}
-
 static __always_inline bool memcg_kmem_bypass(void)
 {
/* Allow remote memcg charging from any context. */
@@ -6537,7 +6550,8 @@ static int __mem_cgroup_charge(struct page *page, struct 
mem_cgroup *memcg,
  * @gfp_mask: reclaim mode
  *
  * Try to charge @page to the memcg that @mm belongs to, reclaiming
- * pages according to @gfp_mask if necessary.
+ * pages according to @gfp_mask if necessary. if @mm is NULL, try to
+ * charge to the active memcg.
  *
  * Do not use this for pages allocated for swapin.
  *
diff --git a/mm/shmem.c b/mm/shmem.c
index 5cfd2fb6e52b..524fa5aa0459 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1694,7 +1694,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t 
index,
 {
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
-   struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
+   struct mm_struct *charge_mm = vm

[PATCH 1/3] loop: Use worker per cgroup instead of kworker

2021-04-02 Thread Dan Schatzberg
Existing uses of loop device may have multiple cgroups reading/writing
to the same device. Simply charging resources for I/O to the backing
file could result in priority inversion where one cgroup gets
synchronously blocked, holding up all other I/O to the loop device.

In order to avoid this priority inversion, we use a single workqueue
where each work item is a "struct loop_worker" which contains a queue of
struct loop_cmds to issue. The loop device maintains a tree mapping blk
css_id -> loop_worker. This allows each cgroup to independently make
forward progress issuing I/O to the backing file.

There is also a single queue for I/O associated with the rootcg which
can be used in cases of extreme memory shortage where we cannot allocate
a loop_worker.

The locking for the tree and queues is fairly heavy handed - we acquire
a per-loop-device spinlock any time either is accessed. The existing
implementation serializes all I/O through a single thread anyways, so I
don't believe this is any worse.
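
A sketch of the deferred teardown driven by loop_free_idle_workers()
(field and helper names such as lo->timer, idle_worker_list and
loop_set_timer() are illustrative here, not quoted from the patch):

	static void loop_free_idle_workers(struct timer_list *timer)
	{
		struct loop_device *lo = container_of(timer, struct loop_device, timer);
		struct loop_worker *pos, *worker;

		spin_lock_irq(&lo->lo_work_lock);
		list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
					 idle_list) {
			/* list is ordered oldest first, stop at the first
			 * worker that has not been idle long enough */
			if (time_is_after_jiffies(worker->last_ran_at +
						  LOOP_IDLE_WORKER_TIMEOUT))
				break;
			list_del(&worker->idle_list);
			rb_erase(&worker->rb_node, &lo->worker_tree);
			css_put(worker->css);
			kfree(worker);
		}
		if (!list_empty(&lo->idle_worker_list))
			loop_set_timer(lo);
		spin_unlock_irq(&lo->lo_work_lock);
	}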

Fixes-from: Colin Ian King 
Signed-off-by: Dan Schatzberg 
---
 drivers/block/loop.c | 203 ---
 drivers/block/loop.h |  12 ++-
 2 files changed, 178 insertions(+), 37 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d58d68f3c7cd..4750b373d4bb 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -71,7 +71,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -84,6 +83,8 @@
 
 #include 
 
+#define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
+
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 
@@ -921,27 +922,83 @@ static void loop_config_discard(struct loop_device *lo)
q->limits.discard_alignment = 0;
 }
 
-static void loop_unprepare_queue(struct loop_device *lo)
-{
-   kthread_flush_worker(>worker);
-   kthread_stop(lo->worker_task);
-}
+struct loop_worker {
+   struct rb_node rb_node;
+   struct work_struct work;
+   struct list_head cmd_list;
+   struct list_head idle_list;
+   struct loop_device *lo;
+   struct cgroup_subsys_state *css;
+   unsigned long last_ran_at;
+};
 
-static int loop_kthread_worker_fn(void *worker_ptr)
-{
-   current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
-   return kthread_worker_fn(worker_ptr);
-}
+static void loop_workfn(struct work_struct *work);
+static void loop_rootcg_workfn(struct work_struct *work);
+static void loop_free_idle_workers(struct timer_list *timer);
 
-static int loop_prepare_queue(struct loop_device *lo)
+static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 {
-   kthread_init_worker(>worker);
-   lo->worker_task = kthread_run(loop_kthread_worker_fn,
-   >worker, "loop%d", lo->lo_number);
-   if (IS_ERR(lo->worker_task))
-   return -ENOMEM;
-   set_user_nice(lo->worker_task, MIN_NICE);
-   return 0;
+   struct rb_node **node = &(lo->worker_tree.rb_node), *parent = NULL;
+   struct loop_worker *cur_worker, *worker = NULL;
+   struct work_struct *work;
+   struct list_head *cmd_list;
+
+   spin_lock_irq(>lo_work_lock);
+
+   if (!cmd->css)
+   goto queue_work;
+
+   node = >worker_tree.rb_node;
+
+   while (*node) {
+   parent = *node;
+   cur_worker = container_of(*node, struct loop_worker, rb_node);
+   if (cur_worker->css == cmd->css) {
+   worker = cur_worker;
+   break;
+   } else if ((long)cur_worker->css < (long)cmd->css) {
+   node = &(*node)->rb_left;
+   } else {
+   node = &(*node)->rb_right;
+   }
+   }
+   if (worker)
+   goto queue_work;
+
+   worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
+   /*
+* In the event we cannot allocate a worker, just queue on the
+* rootcg worker
+*/
+   if (!worker)
+   goto queue_work;
+
+   worker->css = cmd->css;
+   css_get(worker->css);
+   INIT_WORK(>work, loop_workfn);
+   INIT_LIST_HEAD(>cmd_list);
+   INIT_LIST_HEAD(>idle_list);
+   worker->lo = lo;
+   rb_link_node(>rb_node, parent, node);
+   rb_insert_color(>rb_node, >worker_tree);
+queue_work:
+   if (worker) {
+   /*
+* We need to remove from the idle list here while
+* holding the lock so that the idle timer doesn't
+* free the worker
+*/
+   if (!list_empty(>idle_list))
+   list_del_init(>idle_list);
+   work = >work;
+   cmd_list = >cmd_list;
+   } else {
+   work = >rootcg_wo

[PATCH V12 0/3] Charge loop device i/o to issuing cgroup

2021-04-02 Thread Dan Schatzberg
No major changes, rebased on top of latest mm tree

Changes since V12:

* Small change to get_mem_cgroup_from_mm to avoid needing
  get_active_memcg

Changes since V11:

* Removed WQ_MEM_RECLAIM flag from loop workqueue. Technically, this
  can be driven by writeback, but this was causing a warning in xfs
  and likely other filesystems aren't equipped to be driven by reclaim
  at the VFS layer.
* Included a small fix from Colin Ian King.
* reworked get_mem_cgroup_from_mm to institute the necessary charge
  priority.

Changes since V10:

* Added page-cache charging to mm: Charge active memcg when no mm is set

Changes since V9:

* Rebased against linus's branch which now includes Roman Gushchin's
  patch this series is based off of

Changes since V8:

* Rebased on top of Roman Gushchin's patch
  (https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
  support for setting active memcg. Dropped the patch from this series
  that did the same thing.

Changes since V7:

* Rebased against linus's branch

Changes since V6:

* Added separate spinlock for worker synchronization
* Minor style changes

Changes since V5:

* Fixed a missing css_put when failing to allocate a worker
* Minor style changes

Changes since V4:

Only patches 1 and 2 have changed.

* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg

Changes since V3:

* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes

Changes since V2:

* Deferred destruction of workqueue items so in the common case there
  is no allocation needed

Changes since V1:

* Split out and reordered patches so cgroup charging changes are
  separate from kworker -> workqueue change

* Add mem_css to struct loop_cmd to simplify logic

The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on cgroupv2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.

With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.

Dan Schatzberg (3):
  loop: Use worker per cgroup instead of kworker
  mm: Charge active memcg when no mm is set
  loop: Charge i/o to mem and blk cg

 drivers/block/loop.c   | 244 ++---
 drivers/block/loop.h   |  15 ++-
 include/linux/memcontrol.h |   6 +
 kernel/cgroup/cgroup.c |   1 +
 mm/filemap.c   |   2 +-
 mm/memcontrol.c|  49 +---
 mm/shmem.c |   4 +-
 7 files changed, 253 insertions(+), 68 deletions(-)

-- 
2.30.2



[PATCH 3/3] loop: Charge i/o to mem and blk cg

2021-03-29 Thread Dan Schatzberg
The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.

This patch also exports cgroup_get_e_css and int_active_memcg so it
can be used by the loop module.

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
---
 drivers/block/loop.c   | 61 +-
 drivers/block/loop.h   |  3 +-
 include/linux/memcontrol.h |  6 
 kernel/cgroup/cgroup.c |  1 +
 mm/memcontrol.c|  1 +
 5 files changed, 51 insertions(+), 21 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5c18e6b856c2..96ade57c9f7c 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -78,6 +78,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "loop.h"
 
@@ -516,8 +517,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long 
ret, long ret2)
 {
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
-   if (cmd->css)
-   css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
 }
@@ -578,8 +577,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
-   if (cmd->css)
-   kthread_associate_blkcg(cmd->css);
 
if (rw == WRITE)
ret = call_write_iter(file, >iocb, );
@@ -587,7 +584,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
ret = call_read_iter(file, >iocb, );
 
lo_rw_aio_do_completion(cmd);
-   kthread_associate_blkcg(NULL);
 
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(>iocb, ret, 0);
@@ -928,7 +924,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
-   struct cgroup_subsys_state *css;
+   struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
 };
 
@@ -945,7 +941,7 @@ static void loop_queue_work(struct loop_device *lo, struct 
loop_cmd *cmd)
 
spin_lock_irq(>lo_work_lock);
 
-   if (!cmd->css)
+   if (!cmd->blkcg_css)
goto queue_work;
 
node = >worker_tree.rb_node;
@@ -953,10 +949,10 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
-   if (cur_worker->css == cmd->css) {
+   if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
-   } else if ((long)cur_worker->css < (long)cmd->css) {
+   } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -968,13 +964,18 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
 * In the event we cannot allocate a worker, just queue on the
-* rootcg worker
+* rootcg worker and issue the I/O as the rootcg
 */
-   if (!worker)
+   if (!worker) {
+   cmd->blkcg_css = NULL;
+   if (cmd->memcg_css)
+   css_put(cmd->memcg_css);
+   cmd->memcg_css = NULL;
goto queue_work;
+   }
 
-   worker->css = cmd->css;
-   css_get(worker->css);
+   worker->blkcg_css = cmd->blkcg_css;
+   css_get(worker->blkcg_css);
INIT_WORK(>work, loop_workfn);
INIT_LIST_HEAD(>cmd_list);
INIT_LIST_HEAD(>idle_list);
@@ -1298,7 +1299,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
release)
idle_list) {
list_del(>idle_list);
rb_erase(>rb_node, >worker_tree);
-   css_put(worker->css);
+   css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(>lo_work_lock);
@@ -2103,13 +2104,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx 
*hctx,
}
 
/* always use the first bio's css */
+   cmd->blkcg_css = NULL;
+   cmd->memcg_css = NULL;
 #ifdef CONFIG_BLK_CGROUP
-   if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
-   cmd->css = &bio_blkcg(rq->bio)->css;
-   css_get(cmd->css);
- 

[PATCH 2/3] mm: Charge active memcg when no mm is set

2021-03-29 Thread Dan Schatzberg
set_active_memcg() worked for kernel allocations but was silently
ignored for user pages.

This patch establishes a precedence order for who gets charged:

1. If there is a memcg associated with the page already, that memcg is
   charged. This happens during swapin.

2. If an explicit mm is passed, mm->memcg is charged. This happens
   during page faults, which can be triggered in remote VMs (eg gup).

3. Otherwise consult the current process context. If there is an
   active_memcg, use that. Otherwise, current->mm->memcg.

Previously, if a NULL mm was passed to mem_cgroup_charge (case 3) it
would always charge the root cgroup. Now it looks up the active_memcg
first (falling back to charging the root cgroup if not set).

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
Acked-by: Tejun Heo 
Acked-by: Chris Down 
Reviewed-by: Shakeel Butt 
---
 mm/filemap.c|  2 +-
 mm/memcontrol.c | 72 -
 mm/shmem.c  |  4 +--
 3 files changed, 44 insertions(+), 34 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index eeeb8e2cc36a..63fd980e863a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -872,7 +872,7 @@ noinline int __add_to_page_cache_locked(struct page *page,
page->index = offset;
 
if (!huge) {
-   error = mem_cgroup_charge(page, current->mm, gfp);
+   error = mem_cgroup_charge(page, NULL, gfp);
if (error)
goto error;
charged = true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 668d1d7c2645..adc618814fd2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -884,13 +884,38 @@ struct mem_cgroup *mem_cgroup_from_task(struct 
task_struct *p)
 }
 EXPORT_SYMBOL(mem_cgroup_from_task);
 
+static __always_inline struct mem_cgroup *active_memcg(void)
+{
+   if (in_interrupt())
+   return this_cpu_read(int_active_memcg);
+   else
+   return current->active_memcg;
+}
+
+static __always_inline struct mem_cgroup *get_active_memcg(void)
+{
+   struct mem_cgroup *memcg;
+
+   rcu_read_lock();
+   memcg = active_memcg();
+   /* remote memcg must hold a ref. */
+   if (memcg && WARN_ON_ONCE(!css_tryget(&memcg->css)))
+   memcg = root_mem_cgroup;
+   rcu_read_unlock();
+
+   return memcg;
+}
+
 /**
  * get_mem_cgroup_from_mm: Obtain a reference on given mm_struct's memcg.
  * @mm: mm from which memcg should be extracted. It can be NULL.
  *
- * Obtain a reference on mm->memcg and returns it if successful. Otherwise
- * root_mem_cgroup is returned. However if mem_cgroup is disabled, NULL is
- * returned.
+ * Obtain a reference on mm->memcg and returns it if successful. If mm
+ * is NULL, then the memcg is chosen as follows:
+ * 1) The active memcg, if set.
+ * 2) current->mm->memcg, if available
+ * 3) root memcg
+ * If mem_cgroup is disabled, NULL is returned.
  */
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
 {
@@ -899,13 +924,19 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct 
mm_struct *mm)
if (mem_cgroup_disabled())
return NULL;
 
+   /*
+* Page cache insertions can happen without an
+* actual mm context, e.g. during disk probing
+* on boot, loopback IO, acct() writes etc.
+*/
+   if (unlikely(!mm)) {
+   if (unlikely(active_memcg()))
+   return get_active_memcg();
+   mm = current->mm;
+   }
+
rcu_read_lock();
do {
-   /*
-* Page cache insertions can happen without an
-* actual mm context, e.g. during disk probing
-* on boot, loopback IO, acct() writes etc.
-*/
if (unlikely(!mm))
memcg = root_mem_cgroup;
else {
@@ -919,28 +950,6 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct 
*mm)
 }
 EXPORT_SYMBOL(get_mem_cgroup_from_mm);
 
-static __always_inline struct mem_cgroup *active_memcg(void)
-{
-   if (in_interrupt())
-   return this_cpu_read(int_active_memcg);
-   else
-   return current->active_memcg;
-}
-
-static __always_inline struct mem_cgroup *get_active_memcg(void)
-{
-   struct mem_cgroup *memcg;
-
-   rcu_read_lock();
-   memcg = active_memcg();
-   /* remote memcg must hold a ref. */
-   if (memcg && WARN_ON_ONCE(!css_tryget(&memcg->css)))
-   memcg = root_mem_cgroup;
-   rcu_read_unlock();
-
-   return memcg;
-}
-
 static __always_inline bool memcg_kmem_bypass(void)
 {
/* Allow remote memcg charging from any context. */
@@ -6549,7 +6558,8 @@ static int __mem_cgroup_charge(struct page *page, struct 
mem_cgroup *memcg,
  * @gfp_mask: reclaim mode
  *
  * Try to charge @page to the memcg that @mm belongs to, reclaiming
- * pages according to @gfp_mask if necessary.
+ *

[PATCH 1/3] loop: Use worker per cgroup instead of kworker

2021-03-29 Thread Dan Schatzberg
Existing uses of loop device may have multiple cgroups reading/writing
to the same device. Simply charging resources for I/O to the backing
file could result in priority inversion where one cgroup gets
synchronously blocked, holding up all other I/O to the loop device.

In order to avoid this priority inversion, we use a single workqueue
where each work item is a "struct loop_worker" which contains a queue of
struct loop_cmds to issue. The loop device maintains a tree mapping blk
css_id -> loop_worker. This allows each cgroup to independently make
forward progress issuing I/O to the backing file.

There is also a single queue for I/O associated with the rootcg which
can be used in cases of extreme memory shortage where we cannot allocate
a loop_worker.

The locking for the tree and queues is fairly heavy handed - we acquire
a per-loop-device spinlock any time either is accessed. The existing
implementation serializes all I/O through a single thread anyways, so I
don't believe this is any worse.

Fixes-from: Colin Ian King 
Signed-off-by: Dan Schatzberg 
---
 drivers/block/loop.c | 207 ---
 drivers/block/loop.h |  12 ++-
 2 files changed, 182 insertions(+), 37 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d58d68f3c7cd..5c18e6b856c2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -71,7 +71,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -84,6 +83,8 @@
 
 #include 
 
+#define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
+
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 
@@ -921,27 +922,83 @@ static void loop_config_discard(struct loop_device *lo)
q->limits.discard_alignment = 0;
 }
 
-static void loop_unprepare_queue(struct loop_device *lo)
-{
-   kthread_flush_worker(>worker);
-   kthread_stop(lo->worker_task);
-}
+struct loop_worker {
+   struct rb_node rb_node;
+   struct work_struct work;
+   struct list_head cmd_list;
+   struct list_head idle_list;
+   struct loop_device *lo;
+   struct cgroup_subsys_state *css;
+   unsigned long last_ran_at;
+};
 
-static int loop_kthread_worker_fn(void *worker_ptr)
-{
-   current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
-   return kthread_worker_fn(worker_ptr);
-}
+static void loop_workfn(struct work_struct *work);
+static void loop_rootcg_workfn(struct work_struct *work);
+static void loop_free_idle_workers(struct timer_list *timer);
 
-static int loop_prepare_queue(struct loop_device *lo)
+static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 {
-   kthread_init_worker(>worker);
-   lo->worker_task = kthread_run(loop_kthread_worker_fn,
-   >worker, "loop%d", lo->lo_number);
-   if (IS_ERR(lo->worker_task))
-   return -ENOMEM;
-   set_user_nice(lo->worker_task, MIN_NICE);
-   return 0;
+   struct rb_node **node = &(lo->worker_tree.rb_node), *parent = NULL;
+   struct loop_worker *cur_worker, *worker = NULL;
+   struct work_struct *work;
+   struct list_head *cmd_list;
+
+   spin_lock_irq(>lo_work_lock);
+
+   if (!cmd->css)
+   goto queue_work;
+
+   node = >worker_tree.rb_node;
+
+   while (*node) {
+   parent = *node;
+   cur_worker = container_of(*node, struct loop_worker, rb_node);
+   if (cur_worker->css == cmd->css) {
+   worker = cur_worker;
+   break;
+   } else if ((long)cur_worker->css < (long)cmd->css) {
+   node = &(*node)->rb_left;
+   } else {
+   node = &(*node)->rb_right;
+   }
+   }
+   if (worker)
+   goto queue_work;
+
+   worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
+   /*
+* In the event we cannot allocate a worker, just queue on the
+* rootcg worker
+*/
+   if (!worker)
+   goto queue_work;
+
+   worker->css = cmd->css;
+   css_get(worker->css);
+   INIT_WORK(>work, loop_workfn);
+   INIT_LIST_HEAD(>cmd_list);
+   INIT_LIST_HEAD(>idle_list);
+   worker->lo = lo;
+   rb_link_node(>rb_node, parent, node);
+   rb_insert_color(>rb_node, >worker_tree);
+queue_work:
+   if (worker) {
+   /*
+* We need to remove from the idle list here while
+* holding the lock so that the idle timer doesn't
+* free the worker
+*/
+   if (!list_empty(>idle_list))
+   list_del_init(>idle_list);
+   work = >work;
+   cmd_list = >cmd_list;
+   } else {
+   work = >rootcg_wo

[PATCH V11 0/3] Charge loop device i/o to issuing cgroup

2021-03-29 Thread Dan Schatzberg
No major changes, rebased on top of latest mm tree

Changes since V11:

* Removed WQ_MEM_RECLAIM flag from loop workqueue. Technically, this
  can be driven by writeback, but this was causing a warning in xfs
  and likely other filesystems aren't equipped to be driven by reclaim
  at the VFS layer.
* Included a small fix from Colin Ian King.
* reworked get_mem_cgroup_from_mm to institute the necessary charge
  priority.

Changes since V10:

* Added page-cache charging to mm: Charge active memcg when no mm is set

Changes since V9:

* Rebased against linus's branch which now includes Roman Gushchin's
  patch this series is based off of

Changes since V8:

* Rebased on top of Roman Gushchin's patch
  (https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
  support for setting active memcg. Dropped the patch from this series
  that did the same thing.

Changes since V7:

* Rebased against linus's branch

Changes since V6:

* Added separate spinlock for worker synchronization
* Minor style changes

Changes since V5:

* Fixed a missing css_put when failing to allocate a worker
* Minor style changes

Changes since V4:

Only patches 1 and 2 have changed.

* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg

Changes since V3:

* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes

Changes since V2:

* Deferred destruction of workqueue items so in the common case there
  is no allocation needed

Changes since V1:

* Split out and reordered patches so cgroup charging changes are
  separate from kworker -> workqueue change

* Add mem_css to struct loop_cmd to simplify logic

The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on cgroupv2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.

With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.

Dan Schatzberg (3):
  loop: Use worker per cgroup instead of kworker
  mm: Charge active memcg when no mm is set
  loop: Charge i/o to mem and blk cg

 drivers/block/loop.c   | 248 ++---
 drivers/block/loop.h   |  15 ++-
 include/linux/memcontrol.h |   6 +
 kernel/cgroup/cgroup.c |   1 +
 mm/filemap.c   |   2 +-
 mm/memcontrol.c|  73 ++-
 mm/shmem.c |   4 +-
 7 files changed, 267 insertions(+), 82 deletions(-)

-- 
2.30.2



Re: [loop] eaba742710: WARNING:at_kernel/workqueue.c:#check_flush_dependency

2021-03-22 Thread Dan Schatzberg
On Mon, Mar 22, 2021 at 02:03:34PM +0800, kernel test robot wrote:
> 
> 
> Greeting,
> 
> FYI, we noticed the following commit (built with gcc-9):
> 
> commit: eaba7427107045752f7454f94a40839c0880cf02 ("[PATCH 1/3] loop: Use 
> worker per cgroup instead of kworker")
> url: 
> https://github.com/0day-ci/linux/commits/Dan-Schatzberg/Charge-loop-device-i-o-to-issuing-cgroup/20210316-233842
> base: https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git 
> for-next
> 
> in testcase: xfstests
> version: xfstests-x86_64-73c0871-1_20210318
> with following parameters:
> 
>   disk: 4HDD
>   fs: xfs
>   test: generic-group-18
>   ucode: 0xe2
> 
> test-description: xfstests is a regression test suite for xfs and other files 
> ystems.
> test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
> 
> 
> on test machine: 4 threads Intel(R) Xeon(R) CPU E3-1225 v5 @ 3.30GHz with 16G 
> memory
> 
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> ... 
> [   50.428387] WARNING: CPU: 0 PID: 35 at kernel/workqueue.c:2613 
> check_flush_dependency (kbuild/src/consumer/kernel/workqueue.c:2613 
> (discriminator 9)) 
> [   50.450013] Modules linked in: loop xfs dm_mod btrfs blake2b_generic xor 
> zstd_compress raid6_pq libcrc32c sd_mod t10_pi sg ipmi_devintf 
> ipmi_msghandler intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal i915 
> intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul hp_wmi sparse_keymap 
> intel_gtt crc32c_intel ghash_clmulni_intel mei_wdt rfkill wmi_bmof rapl 
> drm_kms_helper ahci intel_cstate syscopyarea mei_me libahci sysfillrect 
> sysimgblt fb_sys_fops intel_uncore serio_raw mei drm libata intel_pch_thermal 
> ie31200_edac wmi video tpm_infineon intel_pmc_core acpi_pad ip_tables
> [   50.500731] CPU: 0 PID: 35 Comm: kworker/u8:3 Not tainted 
> 5.12.0-rc2-00093-geaba74271070 #1
> [   50.509081] Hardware name: HP HP Z238 Microtower Workstation/8183, BIOS 
> N51 Ver. 01.63 10/05/2017
> [   50.517963] Workqueue: loop0 loop_rootcg_workfn [loop]
> [   50.523109] RIP: 0010:check_flush_dependency 
> (kbuild/src/consumer/kernel/workqueue.c:2613 (discriminator 9))
> ...
> [   50.625837] __flush_work (kbuild/src/consumer/kernel/workqueue.c:2669 
> kbuild/src/consumer/kernel/workqueue.c:3011 
> kbuild/src/consumer/kernel/workqueue.c:3051) 
> [   50.629418] ? __queue_work 
> (kbuild/src/consumer/arch/x86/include/asm/paravirt.h:559 
> kbuild/src/consumer/arch/x86/include/asm/qspinlock.h:56 
> kbuild/src/consumer/include/linux/spinlock.h:212 
> kbuild/src/consumer/include/linux/spinlock_api_smp.h:151 
> kbuild/src/consumer/kernel/workqueue.c:1500) 
> [   50.633261] xfs_file_buffered_write 
> (kbuild/src/consumer/fs/xfs/xfs_file.c:761) xfs
> [   50.638468] do_iter_readv_writev (kbuild/src/consumer/fs/read_write.c:741) 
> [   50.642833] do_iter_write (kbuild/src/consumer/fs/read_write.c:866 
> kbuild/src/consumer/fs/read_write.c:847) 
> [   50.646513] lo_write_bvec (kbuild/src/consumer/include/linux/fs.h:2903 
> kbuild/src/consumer/drivers/block/loop.c:286) loop
> [   50.650804] loop_process_work 
> (kbuild/src/consumer/drivers/block/loop.c:307 
> kbuild/src/consumer/drivers/block/loop.c:630 
> kbuild/src/consumer/drivers/block/loop.c:2129 
> kbuild/src/consumer/drivers/block/loop.c:2161) loop
> [   50.655543] ? newidle_balance 
> (kbuild/src/consumer/kernel/sched/fair.c:10635) 
> [   50.659647] process_one_work 
> (kbuild/src/consumer/arch/x86/include/asm/jump_label.h:25 
> kbuild/src/consumer/include/linux/jump_label.h:200 
> kbuild/src/consumer/include/trace/events/workqueue.h:108 
> kbuild/src/consumer/kernel/workqueue.c:2280) 
> [   50.663696] worker_thread (kbuild/src/consumer/include/linux/list.h:282 
> kbuild/src/consumer/kernel/workqueue.c:2422) 
> [   50.667365] ? process_one_work 
> (kbuild/src/consumer/kernel/workqueue.c:2364) 
> [   50.671568] kthread (kbuild/src/consumer/kernel/kthread.c:292) 
> [   50.674813] ? kthread_park (kbuild/src/consumer/kernel/kthread.c:245) 
> [   50.678476] ret_from_fork 
> (kbuild/src/consumer/arch/x86/entry/entry_64.S:300) 

My understanding is that this warning is firing because the loop
workqueue sets WQ_MEM_RECLAIM but the XFS workqueue (m_sync_workqueue)
does not. I believe that the WQ_MEM_RECLAIM on the loop device is
sensible because reclaim may flush dirty writes through the loop
device. I'm not familiar with xfs and it's not clear why
m_sync_workqueue (flushed from xfs_flush_inodes) wouldn't have the
same reclaim dependency. I'll keep digging, but if anyone has
insights, please let me know.
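
For anyone following along, the dependency the warning is about has
this shape (illustrative sketch, all names made up; in the trace above
the two queues are loop0 and xfs's m_sync_workqueue):

	static struct workqueue_struct *reclaim_wq;	/* like loop0: WQ_MEM_RECLAIM */
	static struct workqueue_struct *plain_wq;	/* like m_sync_workqueue: no WQ_MEM_RECLAIM */

	static void reclaim_side_work(struct work_struct *w)
	{
		/*
		 * Flushing a !WQ_MEM_RECLAIM queue from a WQ_MEM_RECLAIM
		 * worker trips check_flush_dependency(): under memory
		 * pressure plain_wq has no rescuer and may never make
		 * progress, breaking reclaim_wq's forward-progress
		 * guarantee.
		 */
		flush_workqueue(plain_wq);
	}
	static DECLARE_WORK(demo_work, reclaim_side_work);

	static int __init demo_init(void)
	{
		reclaim_wq = alloc_workqueue("reclaim_demo", WQ_MEM_RECLAIM, 0);
		plain_wq = alloc_workqueue("plain_demo", 0, 0);
		queue_work(reclaim_wq, &demo_work);
		return 0;
	}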


Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

2021-03-19 Thread Dan Schatzberg
On Fri, Mar 19, 2021 at 09:20:16AM -0700, Shakeel Butt wrote:
> One suggestion would be to make get_mem_cgroup_from_mm() more generic
> (i.e. handle !mm && active_memcg() case) and avoid
> get_mem_cgroup_from_current() as it might go away.

Yeah, that occurred to me as well. I'll take a stab at doing that.


Re: [PATCH][next] loop: Fix missing max_active argument in alloc_workqueue call

2021-03-19 Thread Dan Schatzberg
On Thu, Mar 18, 2021 at 03:16:26PM +, Colin King wrote:
> From: Colin Ian King 
> 
> The 3rd argument to alloc_workqueue should be the max_active count,
> however currently it is the lo->lo_number that is intended for the
> loop%d number. Fix this by adding in the missing max_active count.
> 

Thanks for catching this, Colin. I'm fairly new to kernel development.
Is there some tool I could have run locally to catch this?
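
For context, the 3rd alloc_workqueue() argument is max_active and the
"loop%d" argument rides in the varargs after it, so the corrected call
has this shape (flags shown are illustrative, not necessarily the
exact hunk):

	/* alloc_workqueue(fmt, flags, max_active, fmt args...) */
	lo->workqueue = alloc_workqueue("loop%d",
					WQ_UNBOUND | WQ_FREEZABLE,
					0,		/* max_active (was missing) */
					lo->lo_number);	/* consumed by "%d" */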


Re: [PATCH][next] loop: Fix missing max_active argument in alloc_workqueue call

2021-03-19 Thread Dan Schatzberg
On Thu, Mar 18, 2021 at 02:12:10PM -0600, Jens Axboe wrote:
> On 3/18/21 9:16 AM, Colin King wrote:
> > From: Colin Ian King 
> > 
> > The 3rd argument to alloc_workqueue should be the max_active count,
> > however currently it is the lo->lo_number that is intended for the
> > loop%d number. Fix this by adding in the missing max_active count.
> 
> Dan, please fold this (or something similar) in when you're redoing the
> series.
> 
> -- 
> Jens Axboe
> 

Will do.


Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

2021-03-19 Thread Dan Schatzberg
On Thu, Mar 18, 2021 at 05:56:28PM -0700, Shakeel Butt wrote:
> 
> We need something similar for mem_cgroup_swapin_charge_page() as well.
> 
> It is better to take this series in mm tree and Jens is ok with that [1].
> 
> [1] 
> https://lore.kernel.org/linux-next/4fea89a5-0e18-0791-18a8-4c5907b0d...@kernel.dk/

It sounds like there are no concerns about the loop-related work in
the patch series. I'll rebase on the mm tree and resubmit.


Re: [PATCH 2/3] mm: Charge active memcg when no mm is set

2021-03-16 Thread Dan Schatzberg
On Tue, Mar 16, 2021 at 08:50:16AM -0700, Shakeel Butt wrote:
> You will need to rebase to the latest mm tree. This code has changed.

Thanks for the feedback, I will address these comments in another
rebase. I'll wait and see if there are any comments concerning the
loop-related patches, but it sounds like this will need to go through
the mm tree.


[PATCH 1/1] perf: display raw event for mem report

2021-03-16 Thread Dan Schatzberg
`perf mem record` can be used to capture memory loads/stores using
PEBS counters. `perf mem report -D` can be used to dump the raw
samples (e.g. each sampled memory access). However, it does not output
the raw event (load vs store).

This patch supplements the output of `perf mem report -D` with the
event name.

Signed-off-by: Dan Schatzberg 
---
 tools/perf/builtin-mem.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/tools/perf/builtin-mem.c b/tools/perf/builtin-mem.c
index cdd2b9f643f6..1f4ec350bca3 100644
--- a/tools/perf/builtin-mem.c
+++ b/tools/perf/builtin-mem.c
@@ -172,6 +172,7 @@ static int
 dump_raw_samples(struct perf_tool *tool,
 union perf_event *event,
 struct perf_sample *sample,
+struct evsel *evsel,
 struct machine *machine)
 {
struct perf_mem *mem = container_of(tool, struct perf_mem, tool);
@@ -220,6 +221,11 @@ dump_raw_samples(struct perf_tool *tool,
symbol_conf.field_sep);
}
 
+   if (evsel->name)
+   printf("%s%s", evsel->name, symbol_conf.field_sep);
+   else
+   printf("%s", symbol_conf.field_sep);
+
if (field_sep)
fmt = "%"PRIu64"%s0x%"PRIx64"%s%s:%s\n";
else
@@ -240,10 +246,10 @@ dump_raw_samples(struct perf_tool *tool,
 static int process_sample_event(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
-   struct evsel *evsel __maybe_unused,
+   struct evsel *evsel,
struct machine *machine)
 {
-   return dump_raw_samples(tool, event, sample, machine);
+   return dump_raw_samples(tool, event, sample, evsel, machine);
 }
 
 static int report_raw_events(struct perf_mem *mem)
@@ -287,7 +293,7 @@ static int report_raw_events(struct perf_mem *mem)
if (mem->data_page_size)
printf("DATA PAGE SIZE, ");
 
-   printf("LOCAL WEIGHT, DSRC, SYMBOL\n");
+   printf("EVENT, LOCAL WEIGHT, DSRC, SYMBOL\n");
 
ret = perf_session__process_events(session);
 
-- 
2.30.2



[PATCH 0/1] Display raw event in perf mem report

2021-03-16 Thread Dan Schatzberg
In doing some memory trace analysis I've noticed a lack of a way to
disambiguate loads from stores in perf mem report. In order to address
this I've added the raw event name to the output.

Please let me know if there's some better way to achieve this instead.

Dan Schatzberg (1):
  perf: display raw event for mem report

 tools/perf/builtin-mem.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

-- 
2.30.2



[PATCH 3/3] loop: Charge i/o to mem and blk cg

2021-03-16 Thread Dan Schatzberg
The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.

This patch also exports cgroup_get_e_css and int_active_memcg so it
can be used by the loop module.

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
---
 drivers/block/loop.c   | 61 +-
 drivers/block/loop.h   |  3 +-
 include/linux/memcontrol.h | 11 +++
 kernel/cgroup/cgroup.c |  1 +
 mm/memcontrol.c|  1 +
 5 files changed, 56 insertions(+), 21 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5c080af73caa..6cf3086a2e75 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -77,6 +77,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "loop.h"
 
@@ -515,8 +516,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long 
ret, long ret2)
 {
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
-   if (cmd->css)
-   css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
 }
@@ -577,8 +576,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
-   if (cmd->css)
-   kthread_associate_blkcg(cmd->css);
 
if (rw == WRITE)
ret = call_write_iter(file, >iocb, );
@@ -586,7 +583,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
ret = call_read_iter(file, >iocb, );
 
lo_rw_aio_do_completion(cmd);
-   kthread_associate_blkcg(NULL);
 
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(>iocb, ret, 0);
@@ -927,7 +923,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
-   struct cgroup_subsys_state *css;
+   struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
 };
 
@@ -944,7 +940,7 @@ static void loop_queue_work(struct loop_device *lo, struct 
loop_cmd *cmd)
 
spin_lock_irq(>lo_work_lock);
 
-   if (!cmd->css)
+   if (!cmd->blkcg_css)
goto queue_work;
 
node = >worker_tree.rb_node;
@@ -952,10 +948,10 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
-   if (cur_worker->css == cmd->css) {
+   if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
-   } else if ((long)cur_worker->css < (long)cmd->css) {
+   } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -967,13 +963,18 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
 * In the event we cannot allocate a worker, just queue on the
-* rootcg worker
+* rootcg worker and issue the I/O as the rootcg
 */
-   if (!worker)
+   if (!worker) {
+   cmd->blkcg_css = NULL;
+   if (cmd->memcg_css)
+   css_put(cmd->memcg_css);
+   cmd->memcg_css = NULL;
goto queue_work;
+   }
 
-   worker->css = cmd->css;
-   css_get(worker->css);
+   worker->blkcg_css = cmd->blkcg_css;
+   css_get(worker->blkcg_css);
INIT_WORK(>work, loop_workfn);
INIT_LIST_HEAD(>cmd_list);
INIT_LIST_HEAD(>idle_list);
@@ -1297,7 +1298,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
release)
idle_list) {
list_del(>idle_list);
rb_erase(>rb_node, >worker_tree);
-   css_put(worker->css);
+   css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(>lo_work_lock);
@@ -2102,13 +2103,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx 
*hctx,
}
 
/* always use the first bio's css */
+   cmd->blkcg_css = NULL;
+   cmd->memcg_css = NULL;
 #ifdef CONFIG_BLK_CGROUP
-   if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
-   cmd->css = &bio_blkcg(rq->bio)->css;
-   css_get(cmd->css);
- 

[PATCH 1/3] loop: Use worker per cgroup instead of kworker

2021-03-16 Thread Dan Schatzberg
Existing uses of loop device may have multiple cgroups reading/writing
to the same device. Simply charging resources for I/O to the backing
file could result in priority inversion where one cgroup gets
synchronously blocked, holding up all other I/O to the loop device.

In order to avoid this priority inversion, we use a single workqueue
where each work item is a "struct loop_worker" which contains a queue of
struct loop_cmds to issue. The loop device maintains a tree mapping blk
css_id -> loop_worker. This allows each cgroup to independently make
forward progress issuing I/O to the backing file.

There is also a single queue for I/O associated with the rootcg which
can be used in cases of extreme memory shortage where we cannot allocate
a loop_worker.

The locking for the tree and queues is fairly heavy handed - we acquire
a per-loop-device spinlock any time either is accessed. The existing
implementation serializes all I/O through a single thread anyways, so I
don't believe this is any worse.

Signed-off-by: Dan Schatzberg 
---
 drivers/block/loop.c | 207 ---
 drivers/block/loop.h |  12 ++-
 2 files changed, 182 insertions(+), 37 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a370cde3ddd4..5c080af73caa 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -70,7 +70,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -83,6 +82,8 @@
 
 #include 
 
+#define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
+
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 
@@ -920,27 +921,83 @@ static void loop_config_discard(struct loop_device *lo)
q->limits.discard_alignment = 0;
 }
 
-static void loop_unprepare_queue(struct loop_device *lo)
-{
-   kthread_flush_worker(>worker);
-   kthread_stop(lo->worker_task);
-}
+struct loop_worker {
+   struct rb_node rb_node;
+   struct work_struct work;
+   struct list_head cmd_list;
+   struct list_head idle_list;
+   struct loop_device *lo;
+   struct cgroup_subsys_state *css;
+   unsigned long last_ran_at;
+};
 
-static int loop_kthread_worker_fn(void *worker_ptr)
-{
-   current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
-   return kthread_worker_fn(worker_ptr);
-}
+static void loop_workfn(struct work_struct *work);
+static void loop_rootcg_workfn(struct work_struct *work);
+static void loop_free_idle_workers(struct timer_list *timer);
 
-static int loop_prepare_queue(struct loop_device *lo)
+static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 {
-   kthread_init_worker(>worker);
-   lo->worker_task = kthread_run(loop_kthread_worker_fn,
-   >worker, "loop%d", lo->lo_number);
-   if (IS_ERR(lo->worker_task))
-   return -ENOMEM;
-   set_user_nice(lo->worker_task, MIN_NICE);
-   return 0;
+   struct rb_node **node = &(lo->worker_tree.rb_node), *parent = NULL;
+   struct loop_worker *cur_worker, *worker = NULL;
+   struct work_struct *work;
+   struct list_head *cmd_list;
+
+   spin_lock_irq(>lo_work_lock);
+
+   if (!cmd->css)
+   goto queue_work;
+
+   node = >worker_tree.rb_node;
+
+   while (*node) {
+   parent = *node;
+   cur_worker = container_of(*node, struct loop_worker, rb_node);
+   if (cur_worker->css == cmd->css) {
+   worker = cur_worker;
+   break;
+   } else if ((long)cur_worker->css < (long)cmd->css) {
+   node = &(*node)->rb_left;
+   } else {
+   node = &(*node)->rb_right;
+   }
+   }
+   if (worker)
+   goto queue_work;
+
+   worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
+   /*
+* In the event we cannot allocate a worker, just queue on the
+* rootcg worker
+*/
+   if (!worker)
+   goto queue_work;
+
+   worker->css = cmd->css;
+   css_get(worker->css);
+   INIT_WORK(>work, loop_workfn);
+   INIT_LIST_HEAD(>cmd_list);
+   INIT_LIST_HEAD(>idle_list);
+   worker->lo = lo;
+   rb_link_node(>rb_node, parent, node);
+   rb_insert_color(>rb_node, >worker_tree);
+queue_work:
+   if (worker) {
+   /*
+* We need to remove from the idle list here while
+* holding the lock so that the idle timer doesn't
+* free the worker
+*/
+   if (!list_empty(>idle_list))
+   list_del_init(>idle_list);
+   work = >work;
+   cmd_list = >cmd_list;
+   } else {
+   work = >rootcg_work;
+   cmd_list = >rootc

[PATCH 2/3] mm: Charge active memcg when no mm is set

2021-03-16 Thread Dan Schatzberg
memalloc_use_memcg() worked for kernel allocations but was silently
ignored for user pages.

This patch establishes a precedence order for who gets charged:

1. If there is a memcg associated with the page already, that memcg is
   charged. This happens during swapin.

2. If an explicit mm is passed, mm->memcg is charged. This happens
   during page faults, which can be triggered in remote VMs (eg gup).

3. Otherwise consult the current process context. If it has configured
   a current->active_memcg, use that. Otherwise, current->mm->memcg.

Previously, if a NULL mm was passed to mem_cgroup_try_charge (case 3) it
would always charge the root cgroup. Now it looks up the current
active_memcg first (falling back to charging the root cgroup if not
set).
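
For illustration only -- not part of the patch -- a minimal sketch of the
intended caller pattern, assuming the set_active_memcg() helper from Roman
Gushchin's prerequisite patch; issue_io_for_css() and do_backing_file_io()
are hypothetical names:

static void issue_io_for_css(struct cgroup_subsys_state *memcg_css)
{
        struct mem_cgroup *old_memcg;

        old_memcg = set_active_memcg(mem_cgroup_from_css(memcg_css));

        /*
         * Any user-page charge below that passes a NULL mm (page cache
         * insertion, shmem swapin/getpage) now lands in memcg_css's memcg
         * rather than the root cgroup.
         */
        do_backing_file_io();

        set_active_memcg(old_memcg);
}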

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
Acked-by: Tejun Heo 
Acked-by: Chris Down 
Reviewed-by: Shakeel Butt 
---
 mm/filemap.c|  2 +-
 mm/memcontrol.c | 14 +++---
 mm/shmem.c  |  4 ++--
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 43700480d897..5135f330f05c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -843,7 +843,7 @@ noinline int __add_to_page_cache_locked(struct page *page,
page->index = offset;
 
if (!huge) {
-   error = mem_cgroup_charge(page, current->mm, gfp);
+   error = mem_cgroup_charge(page, NULL, gfp);
if (error)
goto error;
charged = true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e064ac0d850a..9a1b23ed3412 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6690,7 +6690,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup 
*root,
  * @gfp_mask: reclaim mode
  *
  * Try to charge @page to the memcg that @mm belongs to, reclaiming
- * pages according to @gfp_mask if necessary.
+ * pages according to @gfp_mask if necessary. If @mm is NULL, try to
+ * charge to the active memcg.
  *
  * Returns 0 on success. Otherwise, an error code is returned.
  */
@@ -6726,8 +6727,15 @@ int mem_cgroup_charge(struct page *page, struct 
mm_struct *mm, gfp_t gfp_mask)
rcu_read_unlock();
}
 
-   if (!memcg)
-   memcg = get_mem_cgroup_from_mm(mm);
+   if (!memcg) {
+   if (!mm) {
+   memcg = get_mem_cgroup_from_current();
+   if (!memcg)
+   memcg = get_mem_cgroup_from_mm(current->mm);
+   } else {
+   memcg = get_mem_cgroup_from_mm(mm);
+   }
+   }
 
ret = try_charge(memcg, gfp_mask, nr_pages);
if (ret)
diff --git a/mm/shmem.c b/mm/shmem.c
index b2db4ed0fbc7..353b362c370e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1695,7 +1695,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t 
index,
 {
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
-   struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
+   struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL;
struct page *page;
swp_entry_t swap;
int error;
@@ -1816,7 +1816,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t 
index,
}
 
sbinfo = SHMEM_SB(inode->i_sb);
-   charge_mm = vma ? vma->vm_mm : current->mm;
+   charge_mm = vma ? vma->vm_mm : NULL;
 
page = pagecache_get_page(mapping, index,
FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0);
-- 
2.30.2



[PATCH v10 0/3] Charge loop device i/o to issuing cgroup

2021-03-16 Thread Dan Schatzberg
No major changes, just rebasing and resubmitting

Changes since V10:

* Added page-cache charging to mm: Charge active memcg when no mm is set

Changes since V9:

* Rebased against linus's branch which now includes Roman Gushchin's
  patch this series is based off of

Changes since V8:

* Rebased on top of Roman Gushchin's patch
  (https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
  support for setting active memcg. Dropped the patch from this series
  that did the same thing.

Changes since V7:

* Rebased against linus's branch

Changes since V6:

* Added separate spinlock for worker synchronization
* Minor style changes

Changes since V5:

* Fixed a missing css_put when failing to allocate a worker
* Minor style changes

Changes since V4:

Only patches 1 and 2 have changed.

* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg

Changes since V3:

* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes

Changes since V2:

* Deferred destruction of workqueue items so in the common case there
  is no allocation needed

Changes since V1:

* Split out and reordered patches so cgroup charging changes are
  separate from kworker -> workqueue change

* Add mem_css to struct loop_cmd to simplify logic

The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on cgroupv2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.

With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.

Dan Schatzberg (3):
  loop: Use worker per cgroup instead of kworker
  mm: Charge active memcg when no mm is set
  loop: Charge i/o to mem and blk cg

 drivers/block/loop.c   | 248 ++---
 drivers/block/loop.h   |  15 ++-
 include/linux/memcontrol.h |  11 ++
 kernel/cgroup/cgroup.c |   1 +
 mm/filemap.c   |   2 +-
 mm/memcontrol.c|  15 ++-
 mm/shmem.c |   4 +-
 7 files changed, 242 insertions(+), 54 deletions(-)

-- 
2.30.2



Re: [PATCH v2 0/2] cgroup: fix psi monitor for root cgroup

2021-01-19 Thread Dan Schatzberg
> This patchset fixes PSI monitors on the root cgroup, as they currently
> only works on the non-root cgroups. Reading works for all, since that
> was fixed when adding support for the root PSI files. It also contains
> a doc update to reflect the current implementation.
> 
> Changes since v1:
> - Added Reviewed-by tag on the original patch (Suren)
> - Updated patch title
> - Added a separate patch to update doc
>
>
> Odin Ugedal (2):
>  cgroup: fix psi monitor for root cgroup
>  cgroup: update PSI file description in docs
>
> Documentation/admin-guide/cgroup-v2.rst | 6 +++---
> kernel/cgroup/cgroup.c  | 4 +++-
> 2 files changed, 6 insertions(+), 4 deletions(-)

Both patches look good.

Acked-by: Dan Schatzberg 

Re: [PATCH v8 0/3] Charge loop device i/o to issuing cgroup

2020-09-15 Thread Dan Schatzberg
Jens,

How would you like to resolve this patch series? Roman's patch that
this is based off of just made it into linux-next:

https://kernel.googlesource.com/pub/scm/linux/kernel/git/next/linux-next/+/928776957ea4e1c35b1c4c35b8fe7203354fbae3

I suppose you can pull that into the block tree and have git merge
resolve it when it all goes to linus. Another option would be routing
this through Andrew's -mm tree (which already has that commit). Up to
you which you prefer.


[PATCH 3/3] loop: Charge i/o to mem and blk cg

2020-08-31 Thread Dan Schatzberg
The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.

This patch also exports cgroup_get_e_css so it can be used by the loop
module.
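
The quoted hunk below is cut short before the worker-side changes, so as a
rough sketch (not the verbatim patch): each queued command's stored css
pointers are applied around the actual I/O via kthread_associate_blkcg()
and the active-memcg helper from Roman Gushchin's patch this series is
rebased on; loop_issue_cmd_in_cgroup() is a hypothetical name and
loop_handle_cmd() is the driver's existing per-command helper.

static void loop_issue_cmd_in_cgroup(struct loop_cmd *cmd)
{
        struct mem_cgroup *old_memcg = NULL;

        if (cmd->blkcg_css)
                kthread_associate_blkcg(cmd->blkcg_css);
        if (cmd->memcg_css)
                old_memcg = set_active_memcg(
                                mem_cgroup_from_css(cmd->memcg_css));

        loop_handle_cmd(cmd);           /* issue the I/O to the backing file */

        if (cmd->blkcg_css)
                kthread_associate_blkcg(NULL);
        if (cmd->memcg_css) {
                set_active_memcg(old_memcg);
                css_put(cmd->memcg_css);
        }
}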

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
---
 drivers/block/loop.c   | 61 +-
 drivers/block/loop.h   |  3 +-
 include/linux/memcontrol.h |  6 
 kernel/cgroup/cgroup.c |  1 +
 4 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 771685a6c259..3da34d454287 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -77,6 +77,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "loop.h"
 
@@ -518,8 +519,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long 
ret, long ret2)
 {
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
-   if (cmd->css)
-   css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
 }
@@ -580,8 +579,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
-   if (cmd->css)
-   kthread_associate_blkcg(cmd->css);
 
if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -589,7 +586,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);
 
lo_rw_aio_do_completion(cmd);
-   kthread_associate_blkcg(NULL);
 
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -932,7 +928,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
-   struct cgroup_subsys_state *css;
+   struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
 };
 
@@ -949,7 +945,7 @@ static void loop_queue_work(struct loop_device *lo, struct 
loop_cmd *cmd)
 
spin_lock_irq(&lo->lo_work_lock);
 
-   if (!cmd->css)
+   if (!cmd->blkcg_css)
goto queue_work;
 
node = &lo->worker_tree.rb_node;
@@ -957,10 +953,10 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
-   if (cur_worker->css == cmd->css) {
+   if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
-   } else if ((long)cur_worker->css < (long)cmd->css) {
+   } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -972,13 +968,18 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
 * In the event we cannot allocate a worker, just queue on the
-* rootcg worker
+* rootcg worker and issue the I/O as the rootcg
 */
-   if (!worker)
+   if (!worker) {
+   cmd->blkcg_css = NULL;
+   if (cmd->memcg_css)
+   css_put(cmd->memcg_css);
+   cmd->memcg_css = NULL;
goto queue_work;
+   }
 
-   worker->css = cmd->css;
-   css_get(worker->css);
+   worker->blkcg_css = cmd->blkcg_css;
+   css_get(worker->blkcg_css);
INIT_WORK(&worker->work, loop_workfn);
INIT_LIST_HEAD(&worker->cmd_list);
INIT_LIST_HEAD(&worker->idle_list);
@@ -1304,7 +1305,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
release)
idle_list) {
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
-   css_put(worker->css);
+   css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(&lo->lo_work_lock);
@@ -2105,13 +2106,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx 
*hctx,
}
 
/* always use the first bio's css */
+   cmd->blkcg_css = NULL;
+   cmd->memcg_css = NULL;
 #ifdef CONFIG_BLK_CGROUP
-   if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
-   cmd->css = &bio_blkcg(rq->bio)->css;
-   css_get(cmd->css);
-   } else
+   if

[PATCH 1/3] loop: Use worker per cgroup instead of kworker

2020-08-31 Thread Dan Schatzberg
Existing uses of loop device may have multiple cgroups reading/writing
to the same device. Simply charging resources for I/O to the backing
file could result in priority inversion where one cgroup gets
synchronously blocked, holding up all other I/O to the loop device.

In order to avoid this priority inversion, we use a single workqueue
where each work item is a "struct loop_worker" which contains a queue of
struct loop_cmds to issue. The loop device maintains a tree mapping blk
css_id -> loop_worker. This allows each cgroup to independently make
forward progress issuing I/O to the backing file.

There is also a single queue for I/O associated with the rootcg which
can be used in cases of extreme memory shortage where we cannot allocate
a loop_worker.

The locking for the tree and queues is fairly heavy handed - we acquire
a per-loop-device spinlock any time either is accessed. The existing
implementation serializes all I/O through a single thread anyways, so I
don't believe this is any worse.

Signed-off-by: Dan Schatzberg 
---
 drivers/block/loop.c | 207 ---
 drivers/block/loop.h |  12 ++-
 2 files changed, 182 insertions(+), 37 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d3394191e168..771685a6c259 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -70,7 +70,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -83,6 +82,8 @@
 
 #include 
 
+#define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
+
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 
@@ -925,27 +926,83 @@ static void loop_config_discard(struct loop_device *lo)
q->limits.discard_alignment = 0;
 }
 
-static void loop_unprepare_queue(struct loop_device *lo)
-{
-   kthread_flush_worker(&lo->worker);
-   kthread_stop(lo->worker_task);
-}
+struct loop_worker {
+   struct rb_node rb_node;
+   struct work_struct work;
+   struct list_head cmd_list;
+   struct list_head idle_list;
+   struct loop_device *lo;
+   struct cgroup_subsys_state *css;
+   unsigned long last_ran_at;
+};
 
-static int loop_kthread_worker_fn(void *worker_ptr)
-{
-   current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
-   return kthread_worker_fn(worker_ptr);
-}
+static void loop_workfn(struct work_struct *work);
+static void loop_rootcg_workfn(struct work_struct *work);
+static void loop_free_idle_workers(struct timer_list *timer);
 
-static int loop_prepare_queue(struct loop_device *lo)
+static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 {
-   kthread_init_worker(&lo->worker);
-   lo->worker_task = kthread_run(loop_kthread_worker_fn,
-   &lo->worker, "loop%d", lo->lo_number);
-   if (IS_ERR(lo->worker_task))
-   return -ENOMEM;
-   set_user_nice(lo->worker_task, MIN_NICE);
-   return 0;
+   struct rb_node **node = &(lo->worker_tree.rb_node), *parent = NULL;
+   struct loop_worker *cur_worker, *worker = NULL;
+   struct work_struct *work;
+   struct list_head *cmd_list;
+
+   spin_lock_irq(&lo->lo_work_lock);
+
+   if (!cmd->css)
+   goto queue_work;
+
+   node = &lo->worker_tree.rb_node;
+
+   while (*node) {
+   parent = *node;
+   cur_worker = container_of(*node, struct loop_worker, rb_node);
+   if (cur_worker->css == cmd->css) {
+   worker = cur_worker;
+   break;
+   } else if ((long)cur_worker->css < (long)cmd->css) {
+   node = &(*node)->rb_left;
+   } else {
+   node = &(*node)->rb_right;
+   }
+   }
+   if (worker)
+   goto queue_work;
+
+   worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
+   /*
+* In the event we cannot allocate a worker, just queue on the
+* rootcg worker
+*/
+   if (!worker)
+   goto queue_work;
+
+   worker->css = cmd->css;
+   css_get(worker->css);
+   INIT_WORK(&worker->work, loop_workfn);
+   INIT_LIST_HEAD(&worker->cmd_list);
+   INIT_LIST_HEAD(&worker->idle_list);
+   worker->lo = lo;
+   rb_link_node(&worker->rb_node, parent, node);
+   rb_insert_color(&worker->rb_node, &lo->worker_tree);
+queue_work:
+   if (worker) {
+   /*
+* We need to remove from the idle list here while
+* holding the lock so that the idle timer doesn't
+* free the worker
+*/
+   if (!list_empty(&worker->idle_list))
+   list_del_init(&worker->idle_list);
+   work = &worker->work;
+   cmd_list = &worker->cmd_list;
+   } else {
+   work = &lo->rootcg_work;
+   cmd_list = >rootc
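
(The archive cuts this message off before the drivers/block/loop.h hunk that
the diffstat advertises. A hedged sketch of the state that hunk implies --
field names are guessed from the code above, not taken from the patch:)

/* Additions to drivers/block/loop.h implied by this patch (sketch only). */
struct loop_device {
        /* ... existing fields ... */
        struct workqueue_struct *workqueue;     /* replaces worker_task */
        struct work_struct      rootcg_work;    /* fallback work item */
        struct list_head        rootcg_cmd_list;
        struct list_head        idle_worker_list;
        struct rb_root          worker_tree;    /* blkcg css -> loop_worker */
        struct timer_list       timer;          /* reaps idle workers */
        spinlock_t              lo_work_lock;   /* protects tree and lists */
};

struct loop_cmd {
        /* ... existing fields ... */
        struct list_head        list_entry;     /* link on a worker's cmd_list */
        struct cgroup_subsys_state *css;        /* issuing blkcg css, if any */
};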

[PATCH v8 0/3] Charge loop device i/o to issuing cgroup

2020-08-31 Thread Dan Schatzberg
Much of the discussion about this has died down. There's been a
concern raised that we could generalize infrastructure across loop,
md, etc. This may be possible, in the future, but it isn't clear to me
what this would look like. I'm inclined to fix the existing issue with
loop devices now (this is a problem we hit at FB) and address
consolidation with other cases if and when those need to be addressed.

Note that this series needs to be based off of Roman Gushchin's patch
(https://lkml.org/lkml/2020/8/21/1464) to compile.

Changes since V8:

* Rebased on top of Roman Gushchin's patch
  (https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
  support for setting active memcg. Dropped the patch from this series
  that did the same thing.

Changes since V7:

* Rebased against linus's branch

Changes since V6:

* Added separate spinlock for worker synchronization
* Minor style changes

Changes since V5:

* Fixed a missing css_put when failing to allocate a worker
* Minor style changes

Changes since V4:

Only patches 1 and 2 have changed.

* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg

Changes since V3:

* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes

Changes since V2:

* Deferred destruction of workqueue items so in the common case there
  is no allocation needed

Changes since V1:

* Split out and reordered patches so cgroup charging changes are
  separate from kworker -> workqueue change

* Add mem_css to struct loop_cmd to simplify logic

The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on cgroupv2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.

With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.

Dan Schatzberg (3):
  loop: Use worker per cgroup instead of kworker
  mm: Charge active memcg when no mm is set
  loop: Charge i/o to mem and blk cg

 drivers/block/loop.c   | 248 ++---
 drivers/block/loop.h   |  15 ++-
 include/linux/memcontrol.h |   6 +
 kernel/cgroup/cgroup.c |   1 +
 mm/memcontrol.c|  11 +-
 mm/shmem.c |   4 +-
 6 files changed, 232 insertions(+), 53 deletions(-)

-- 
2.24.1



[PATCH 2/3] mm: Charge active memcg when no mm is set

2020-08-31 Thread Dan Schatzberg
memalloc_use_memcg() worked for kernel allocations but was silently
ignored for user pages.

This patch establishes a precedence order for who gets charged:

1. If there is a memcg associated with the page already, that memcg is
   charged. This happens during swapin.

2. If an explicit mm is passed, mm->memcg is charged. This happens
   during page faults, which can be triggered in remote VMs (eg gup).

3. Otherwise consult the current process context. If it has configured
   a current->active_memcg, use that. Otherwise, current->mm->memcg.

Previously, if a NULL mm was passed to mem_cgroup_try_charge (case 3) it
would always charge the root cgroup. Now it looks up the current
active_memcg first (falling back to charging the root cgroup if not
set).

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
Acked-by: Tejun Heo 
Acked-by: Chris Down 
Reviewed-by: Shakeel Butt 
---
 mm/memcontrol.c | 11 ---
 mm/shmem.c  |  4 ++--
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e27ac6d79a32..88b792da 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6676,7 +6676,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup 
*root,
  * @gfp_mask: reclaim mode
  *
  * Try to charge @page to the memcg that @mm belongs to, reclaiming
- * pages according to @gfp_mask if necessary.
+ * pages according to @gfp_mask if necessary. If @mm is NULL, try to
+ * charge to the active memcg.
  *
  * Returns 0 on success. Otherwise, an error code is returned.
  */
@@ -6712,8 +6713,12 @@ int mem_cgroup_charge(struct page *page, struct 
mm_struct *mm, gfp_t gfp_mask)
rcu_read_unlock();
}
 
-   if (!memcg)
-   memcg = get_mem_cgroup_from_mm(mm);
+   if (!memcg) {
+   if (!mm)
+   memcg = get_mem_cgroup_from_current();
+   else
+   memcg = get_mem_cgroup_from_mm(mm);
+   }
 
ret = try_charge(memcg, gfp_mask, nr_pages);
if (ret)
diff --git a/mm/shmem.c b/mm/shmem.c
index 271548ca20f3..1139f52ac4ee 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1695,7 +1695,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t 
index,
 {
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
-   struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
+   struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL;
struct page *page;
swp_entry_t swap;
int error;
@@ -1809,7 +1809,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t 
index,
}
 
sbinfo = SHMEM_SB(inode->i_sb);
-   charge_mm = vma ? vma->vm_mm : current->mm;
+   charge_mm = vma ? vma->vm_mm : NULL;
 
page = find_lock_entry(mapping, index);
if (xa_is_value(page)) {
-- 
2.24.1



Re: [PATCH 2/4] mm: support nesting memalloc_use_memcg()

2020-08-24 Thread Dan Schatzberg
On Mon, Aug 24, 2020 at 09:19:01AM -0700, Roman Gushchin wrote:
> Hi Dan,
> 
> JFYI: I need a similar patch for the bpf memory accounting rework,
> so I ended up sending it separately (with some modifications including
> different naming): https://lkml.org/lkml/2020/8/21/1464 .
> 
> Can you please, rebase your patchset using this patch?
> 
> I hope Andrew can pull this standalone patch into 5.9-rc*,
> as Shakeel suggested. It will help us to avoid merge conflicts
> during the 5.10 merge window.
> 
> Thanks!

Yeah I mentioned it in the cover letter and linked to your patch. I
had not realized the naming change, so I can rebase on top of that -
I'll wait to see if there's other feedback first.


[PATCH 2/4] mm: support nesting memalloc_use_memcg()

2020-08-24 Thread Dan Schatzberg
From: Johannes Weiner 

The memalloc_use_memcg() function to override the default memcg
accounting context currently doesn't nest. But the patches to make the
loop driver cgroup-aware will end up nesting:

[   98.137605]  alloc_page_buffers+0x210/0x288
[   98.141799]  __getblk_gfp+0x1d4/0x400
[   98.145475]  ext4_read_block_bitmap_nowait+0x148/0xbc8
[   98.150628]  ext4_mb_init_cache+0x25c/0x9b0
[   98.154821]  ext4_mb_init_group+0x270/0x390
[   98.159014]  ext4_mb_good_group+0x264/0x270
[   98.163208]  ext4_mb_regular_allocator+0x480/0x798
[   98.168011]  ext4_mb_new_blocks+0x958/0x10f8
[   98.172294]  ext4_ext_map_blocks+0xec8/0x1618
[   98.176660]  ext4_map_blocks+0x1b8/0x8a0
[   98.180592]  ext4_writepages+0x830/0xf10
[   98.184523]  do_writepages+0xb4/0x198
[   98.188195]  __filemap_fdatawrite_range+0x170/0x1c8
[   98.193086]  filemap_write_and_wait_range+0x40/0xb0
[   98.197974]  ext4_punch_hole+0x4a4/0x660
[   98.201907]  ext4_fallocate+0x294/0x1190
[   98.205839]  loop_process_work+0x690/0x1100
[   98.210032]  loop_workfn+0x2c/0x110
[   98.213529]  process_one_work+0x3e0/0x648
[   98.217546]  worker_thread+0x70/0x670
[   98.221217]  kthread+0x1b8/0x1c0
[   98.224452]  ret_from_fork+0x10/0x18

where loop_process_work() sets the memcg override to the memcg that
submitted the IO request, and alloc_page_buffers() sets the override
to the memcg that instantiated the cache page, which may differ.

Make memalloc_use_memcg() return the old memcg and convert existing
users to a stacking model. Delete the unused memalloc_unuse_memcg().
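
A minimal sketch of the stacking pattern callers are converted to (save the
returned old memcg and restore it on the way out, so overrides nest
cleanly); the function and variable names here are illustrative only:

static void nested_override_example(struct mem_cgroup *outer,
                                    struct mem_cgroup *inner)
{
        struct mem_cgroup *old, *old2;

        old = memalloc_use_memcg(outer);        /* charge to "outer" */
        /* ... allocations charged to outer ... */

        old2 = memalloc_use_memcg(inner);       /* nested override */
        /* ... allocations charged to inner ... */
        memalloc_use_memcg(old2);               /* back to "outer" */

        memalloc_use_memcg(old);                /* back to the original context */
}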

Signed-off-by: Johannes Weiner 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Reported-by: Naresh Kamboju 
---
 fs/buffer.c  |  6 +++---
 fs/notify/fanotify/fanotify.c|  5 +++--
 fs/notify/inotify/inotify_fsnotify.c |  5 +++--
 include/linux/sched/mm.h | 28 +---
 mm/memcontrol.c  |  6 +++---
 5 files changed, 21 insertions(+), 29 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index d468ed9981e0..804170cb59fe 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -842,13 +842,13 @@ struct buffer_head *alloc_page_buffers(struct page *page, 
unsigned long size,
struct buffer_head *bh, *head;
gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
long offset;
-   struct mem_cgroup *memcg;
+   struct mem_cgroup *memcg, *old_memcg;
 
if (retry)
gfp |= __GFP_NOFAIL;
 
memcg = get_mem_cgroup_from_page(page);
-   memalloc_use_memcg(memcg);
+   old_memcg = memalloc_use_memcg(memcg);
 
head = NULL;
offset = PAGE_SIZE;
@@ -867,7 +867,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, 
unsigned long size,
set_bh_page(bh, page, offset);
}
 out:
-   memalloc_unuse_memcg();
+   memalloc_use_memcg(old_memcg);
mem_cgroup_put(memcg);
return head;
 /*
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index c942910a8649..c8fd563e02a3 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -533,6 +533,7 @@ static struct fanotify_event *fanotify_alloc_event(struct 
fsnotify_group *group,
unsigned int fid_mode = FAN_GROUP_FLAG(group, FANOTIFY_FID_BITS);
struct inode *child = NULL;
bool name_event = false;
+   struct mem_cgroup *old_memcg;
 
if ((fid_mode & FAN_REPORT_DIR_FID) && dirid) {
/*
@@ -580,7 +581,7 @@ static struct fanotify_event *fanotify_alloc_event(struct 
fsnotify_group *group,
gfp |= __GFP_RETRY_MAYFAIL;
 
/* Whoever is interested in the event, pays for the allocation. */
-   memalloc_use_memcg(group->memcg);
+   old_memcg = memalloc_use_memcg(group->memcg);
 
if (fanotify_is_perm_event(mask)) {
event = fanotify_alloc_perm_event(path, gfp);
@@ -608,7 +609,7 @@ static struct fanotify_event *fanotify_alloc_event(struct 
fsnotify_group *group,
event->pid = get_pid(task_tgid(current));
 
 out:
-   memalloc_unuse_memcg();
+   memalloc_use_memcg(old_memcg);
return event;
 }
 
diff --git a/fs/notify/inotify/inotify_fsnotify.c 
b/fs/notify/inotify/inotify_fsnotify.c
index a65cf8c9f600..8017a51561c4 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -66,6 +66,7 @@ static int inotify_one_event(struct fsnotify_group *group, 
u32 mask,
int ret;
int len = 0;
int alloc_len = sizeof(struct inotify_event_info);
+   struct mem_cgroup *old_memcg;
 
if ((inode_mark->mask & FS_EXCL_UNLINK) &&
path && d_unlinked(path->dentry))
@@ -87,9 +88,9 @@ static int inotify_one_event(struct fsnotify_group *group, 
u32 mask,
 * trigger OOM killer in the target monitoring memcg as it may have
 * security repercussion.
 */
-   memalloc_use_memcg(group->memcg);
+   old_memcg = 

[PATCH 3/4] mm: Charge active memcg when no mm is set

2020-08-24 Thread Dan Schatzberg
memalloc_use_memcg() worked for kernel allocations but was silently
ignored for user pages.

This patch establishes a precedence order for who gets charged:

1. If there is a memcg associated with the page already, that memcg is
   charged. This happens during swapin.

2. If an explicit mm is passed, mm->memcg is charged. This happens
   during page faults, which can be triggered in remote VMs (eg gup).

3. Otherwise consult the current process context. If it has configured
   a current->active_memcg, use that. Otherwise, current->mm->memcg.

Previously, if a NULL mm was passed to mem_cgroup_try_charge (case 3) it
would always charge the root cgroup. Now it looks up the current
active_memcg first (falling back to charging the root cgroup if not
set).

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
Acked-by: Tejun Heo 
Acked-by: Chris Down 
Reviewed-by: Shakeel Butt 
---
 mm/memcontrol.c | 11 ---
 mm/shmem.c  |  5 +++--
 2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b2468c80085d..79c70eef3ec3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6676,7 +6676,8 @@ void mem_cgroup_calculate_protection(struct mem_cgroup 
*root,
  * @gfp_mask: reclaim mode
  *
  * Try to charge @page to the memcg that @mm belongs to, reclaiming
- * pages according to @gfp_mask if necessary.
+ * pages according to @gfp_mask if necessary. If @mm is NULL, try to
+ * charge to the active memcg.
  *
  * Returns 0 on success. Otherwise, an error code is returned.
  */
@@ -6712,8 +6713,12 @@ int mem_cgroup_charge(struct page *page, struct 
mm_struct *mm, gfp_t gfp_mask)
rcu_read_unlock();
}
 
-   if (!memcg)
-   memcg = get_mem_cgroup_from_mm(mm);
+   if (!memcg) {
+   if (!mm)
+   memcg = get_mem_cgroup_from_current();
+   else
+   memcg = get_mem_cgroup_from_mm(mm);
+   }
 
ret = try_charge(memcg, gfp_mask, nr_pages);
if (ret)
diff --git a/mm/shmem.c b/mm/shmem.c
index 271548ca20f3..77c908730be4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1695,7 +1695,8 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t 
index,
 {
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
-   struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
+   struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL;
+   struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
int error;
@@ -1809,7 +1810,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t 
index,
}
 
sbinfo = SHMEM_SB(inode->i_sb);
-   charge_mm = vma ? vma->vm_mm : current->mm;
+   charge_mm = vma ? vma->vm_mm : NULL;
 
page = find_lock_entry(mapping, index);
if (xa_is_value(page)) {
-- 
2.24.1



[PATCH 4/4] loop: Charge i/o to mem and blk cg

2020-08-24 Thread Dan Schatzberg
The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.

This patch also exports cgroup_get_e_css so it can be used by the loop
module.

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
---
 drivers/block/loop.c   | 61 +-
 drivers/block/loop.h   |  3 +-
 include/linux/memcontrol.h |  6 
 kernel/cgroup/cgroup.c |  1 +
 mm/shmem.c |  1 -
 5 files changed, 50 insertions(+), 22 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index c8dae53f3914..102d236d1540 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -77,6 +77,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "loop.h"
 
@@ -518,8 +519,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long 
ret, long ret2)
 {
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
-   if (cmd->css)
-   css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
 }
@@ -580,8 +579,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
-   if (cmd->css)
-   kthread_associate_blkcg(cmd->css);
 
if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -589,7 +586,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);
 
lo_rw_aio_do_completion(cmd);
-   kthread_associate_blkcg(NULL);
 
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -929,7 +925,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
-   struct cgroup_subsys_state *css;
+   struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
 };
 
@@ -946,7 +942,7 @@ static void loop_queue_work(struct loop_device *lo, struct 
loop_cmd *cmd)
 
spin_lock_irq(&lo->lo_work_lock);
 
-   if (!cmd->css)
+   if (!cmd->blkcg_css)
goto queue_work;
 
node = &lo->worker_tree.rb_node;
@@ -954,10 +950,10 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
-   if (cur_worker->css == cmd->css) {
+   if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
-   } else if ((long)cur_worker->css < (long)cmd->css) {
+   } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -969,13 +965,18 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
 * In the event we cannot allocate a worker, just queue on the
-* rootcg worker
+* rootcg worker and issue the I/O as the rootcg
 */
-   if (!worker)
+   if (!worker) {
+   cmd->blkcg_css = NULL;
+   if (cmd->memcg_css)
+   css_put(cmd->memcg_css);
+   cmd->memcg_css = NULL;
goto queue_work;
+   }
 
-   worker->css = cmd->css;
-   css_get(worker->css);
+   worker->blkcg_css = cmd->blkcg_css;
+   css_get(worker->blkcg_css);
INIT_WORK(&worker->work, loop_workfn);
INIT_LIST_HEAD(&worker->cmd_list);
INIT_LIST_HEAD(&worker->idle_list);
@@ -1301,7 +1302,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
release)
idle_list) {
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
-   css_put(worker->css);
+   css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(&lo->lo_work_lock);
@@ -2102,13 +2103,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx 
*hctx,
}
 
/* always use the first bio's css */
+   cmd->blkcg_css = NULL;
+   cmd->memcg_css = NULL;
 #ifdef CONFIG_BLK_CGROUP
-   if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
-   cmd->css = &bio_blkcg(rq->bio)->css;
-   css_get(cmd->css);
-   } else
+   if

[PATCH 1/4] loop: Use worker per cgroup instead of kworker

2020-08-24 Thread Dan Schatzberg
Existing uses of loop device may have multiple cgroups reading/writing
to the same device. Simply charging resources for I/O to the backing
file could result in priority inversion where one cgroup gets
synchronously blocked, holding up all other I/O to the loop device.

In order to avoid this priority inversion, we use a single workqueue
where each work item is a "struct loop_worker" which contains a queue of
struct loop_cmds to issue. The loop device maintains a tree mapping blk
css_id -> loop_worker. This allows each cgroup to independently make
forward progress issuing I/O to the backing file.

There is also a single queue for I/O associated with the rootcg which
can be used in cases of extreme memory shortage where we cannot allocate
a loop_worker.

The locking for the tree and queues is fairly heavy handed - we acquire
a per-loop-device spinlock any time either is accessed. The existing
implementation serializes all I/O through a single thread anyways, so I
don't believe this is any worse.

Signed-off-by: Dan Schatzberg 
---
 drivers/block/loop.c | 207 ---
 drivers/block/loop.h |  12 ++-
 2 files changed, 182 insertions(+), 37 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 2f137d6ce169..c8dae53f3914 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -70,7 +70,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -83,6 +82,8 @@
 
 #include 
 
+#define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
+
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 
@@ -922,27 +923,83 @@ static void loop_config_discard(struct loop_device *lo)
blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
 }
 
-static void loop_unprepare_queue(struct loop_device *lo)
-{
-   kthread_flush_worker(&lo->worker);
-   kthread_stop(lo->worker_task);
-}
+struct loop_worker {
+   struct rb_node rb_node;
+   struct work_struct work;
+   struct list_head cmd_list;
+   struct list_head idle_list;
+   struct loop_device *lo;
+   struct cgroup_subsys_state *css;
+   unsigned long last_ran_at;
+};
 
-static int loop_kthread_worker_fn(void *worker_ptr)
-{
-   current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
-   return kthread_worker_fn(worker_ptr);
-}
+static void loop_workfn(struct work_struct *work);
+static void loop_rootcg_workfn(struct work_struct *work);
+static void loop_free_idle_workers(struct timer_list *timer);
 
-static int loop_prepare_queue(struct loop_device *lo)
+static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 {
-   kthread_init_worker(&lo->worker);
-   lo->worker_task = kthread_run(loop_kthread_worker_fn,
-   &lo->worker, "loop%d", lo->lo_number);
-   if (IS_ERR(lo->worker_task))
-   return -ENOMEM;
-   set_user_nice(lo->worker_task, MIN_NICE);
-   return 0;
+   struct rb_node **node = &(lo->worker_tree.rb_node), *parent = NULL;
+   struct loop_worker *cur_worker, *worker = NULL;
+   struct work_struct *work;
+   struct list_head *cmd_list;
+
+   spin_lock_irq(&lo->lo_work_lock);
+
+   if (!cmd->css)
+   goto queue_work;
+
+   node = &lo->worker_tree.rb_node;
+
+   while (*node) {
+   parent = *node;
+   cur_worker = container_of(*node, struct loop_worker, rb_node);
+   if (cur_worker->css == cmd->css) {
+   worker = cur_worker;
+   break;
+   } else if ((long)cur_worker->css < (long)cmd->css) {
+   node = &(*node)->rb_left;
+   } else {
+   node = &(*node)->rb_right;
+   }
+   }
+   if (worker)
+   goto queue_work;
+
+   worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
+   /*
+* In the event we cannot allocate a worker, just queue on the
+* rootcg worker
+*/
+   if (!worker)
+   goto queue_work;
+
+   worker->css = cmd->css;
+   css_get(worker->css);
+   INIT_WORK(&worker->work, loop_workfn);
+   INIT_LIST_HEAD(&worker->cmd_list);
+   INIT_LIST_HEAD(&worker->idle_list);
+   worker->lo = lo;
+   rb_link_node(&worker->rb_node, parent, node);
+   rb_insert_color(&worker->rb_node, &lo->worker_tree);
+queue_work:
+   if (worker) {
+   /*
+* We need to remove from the idle list here while
+* holding the lock so that the idle timer doesn't
+* free the worker
+*/
+   if (!list_empty(&worker->idle_list))
+   list_del_init(&worker->idle_list);
+   work = &worker->work;
+   cmd_list = &worker->cmd_list;
+   } else {
+   work = &lo->rootcg_work;
+   cmd_list 

[PATCH v7 0/4] Charge loop device i/o to issuing cgroup

2020-08-24 Thread Dan Schatzberg
Much of the discussion about this has died down. There's been a
concern raised that we could generalize infrastructure across loop,
md, etc. This may be possible, in the future, but it isn't clear to me
what this would look like. I'm inclined to fix the existing issue with
loop devices now (this is a problem we hit at FB) and address
consolidation with other cases if and when those need to be addressed.

Note that patch 2 in this series (mm: support nesting
memalloc_use_memcg()) has also been submitted by Roman Guschchin to
the mm tree and can be seen here:
http://lkml.iu.edu/hypermail/linux/kernel/2008.2/09214.html - I
include it here for completeness, but this series works with either
version of the patch.

Changes since V7:

* Rebased against linus's branch

Changes since V6:

* Added separate spinlock for worker synchronization
* Minor style changes

Changes since V5:

* Fixed a missing css_put when failing to allocate a worker
* Minor style changes

Changes since V4:

Only patches 1 and 2 have changed.

* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg

Changes since V3:

* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes

Changes since V2:

* Deferred destruction of workqueue items so in the common case there
  is no allocation needed

Changes since V1:

* Split out and reordered patches so cgroup charging changes are
  separate from kworker -> workqueue change

* Add mem_css to struct loop_cmd to simplify logic

The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on cgroupv2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.

With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.

Dan Schatzberg (3):
  loop: Use worker per cgroup instead of kworker
  mm: Charge active memcg when no mm is set
  loop: Charge i/o to mem and blk cg

Johannes Weiner (1):
  mm: support nesting memalloc_use_memcg()

 drivers/block/loop.c | 248 ++-
 drivers/block/loop.h |  15 +-
 fs/buffer.c  |   6 +-
 fs/notify/fanotify/fanotify.c|   5 +-
 fs/notify/inotify/inotify_fsnotify.c |   5 +-
 include/linux/memcontrol.h   |   6 +
 include/linux/sched/mm.h |  28 +--
 kernel/cgroup/cgroup.c   |   1 +
 mm/memcontrol.c  |  17 +-
 mm/shmem.c   |   4 +-
 10 files changed, 253 insertions(+), 82 deletions(-)

-- 
2.24.1



Re: [PATCH v6 0/4] Charge loop device i/o to issuing cgroup

2020-08-21 Thread Dan Schatzberg
On Thu, Aug 20, 2020 at 10:06:44AM -0700, Shakeel Butt wrote:
> On Thu, May 28, 2020 at 6:55 AM Dan Schatzberg  
> wrote:
> >
> > Much of the discussion about this has died down. There's been a
> > concern raised that we could generalize infrastructure across loop,
> > md, etc. This may be possible, in the future, but it isn't clear to me
> > how this would look like. I'm inclined to fix the existing issue with
> > loop devices now (this is a problem we hit at FB) and address
> > consolidation with other cases if and when those need to be addressed.
> >
> 
> What's the status of this series?

Thanks for reminding me about this. I haven't got any further
feedback. I'll bug Jens to take a look and see if he has any concerns
and if not send a rebased version.


[PATCH 2/4] mm: support nesting memalloc_use_memcg()

2020-05-28 Thread Dan Schatzberg
From: Johannes Weiner 

The memalloc_use_memcg() function to override the default memcg
accounting context currently doesn't nest. But the patches to make the
loop driver cgroup-aware will end up nesting:

[   98.137605]  alloc_page_buffers+0x210/0x288
[   98.141799]  __getblk_gfp+0x1d4/0x400
[   98.145475]  ext4_read_block_bitmap_nowait+0x148/0xbc8
[   98.150628]  ext4_mb_init_cache+0x25c/0x9b0
[   98.154821]  ext4_mb_init_group+0x270/0x390
[   98.159014]  ext4_mb_good_group+0x264/0x270
[   98.163208]  ext4_mb_regular_allocator+0x480/0x798
[   98.168011]  ext4_mb_new_blocks+0x958/0x10f8
[   98.172294]  ext4_ext_map_blocks+0xec8/0x1618
[   98.176660]  ext4_map_blocks+0x1b8/0x8a0
[   98.180592]  ext4_writepages+0x830/0xf10
[   98.184523]  do_writepages+0xb4/0x198
[   98.188195]  __filemap_fdatawrite_range+0x170/0x1c8
[   98.193086]  filemap_write_and_wait_range+0x40/0xb0
[   98.197974]  ext4_punch_hole+0x4a4/0x660
[   98.201907]  ext4_fallocate+0x294/0x1190
[   98.205839]  loop_process_work+0x690/0x1100
[   98.210032]  loop_workfn+0x2c/0x110
[   98.213529]  process_one_work+0x3e0/0x648
[   98.217546]  worker_thread+0x70/0x670
[   98.221217]  kthread+0x1b8/0x1c0
[   98.224452]  ret_from_fork+0x10/0x18

where loop_process_work() sets the memcg override to the memcg that
submitted the IO request, and alloc_page_buffers() sets the override
to the memcg that instantiated the cache page, which may differ.

Make memalloc_use_memcg() return the old memcg and convert existing
users to a stacking model. Delete the unused memalloc_unuse_memcg().

Signed-off-by: Johannes Weiner 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Reported-by: Naresh Kamboju 
---
 fs/buffer.c  |  6 +++---
 fs/notify/fanotify/fanotify.c|  5 +++--
 fs/notify/inotify/inotify_fsnotify.c |  5 +++--
 include/linux/sched/mm.h | 28 +---
 4 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index a60f60396cfa..585416dec6a2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -851,13 +851,13 @@ struct buffer_head *alloc_page_buffers(struct page *page, 
unsigned long size,
struct buffer_head *bh, *head;
gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
long offset;
-   struct mem_cgroup *memcg;
+   struct mem_cgroup *memcg, *old_memcg;
 
if (retry)
gfp |= __GFP_NOFAIL;
 
memcg = get_mem_cgroup_from_page(page);
-   memalloc_use_memcg(memcg);
+   old_memcg = memalloc_use_memcg(memcg);
 
head = NULL;
offset = PAGE_SIZE;
@@ -876,7 +876,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, 
unsigned long size,
set_bh_page(bh, page, offset);
}
 out:
-   memalloc_unuse_memcg();
+   memalloc_use_memcg(old_memcg);
mem_cgroup_put(memcg);
return head;
 /*
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 5435a40f82be..6b869d95bfb6 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -353,6 +353,7 @@ struct fanotify_event *fanotify_alloc_event(struct 
fsnotify_group *group,
gfp_t gfp = GFP_KERNEL_ACCOUNT;
struct inode *id = fanotify_fid_inode(inode, mask, data, data_type);
const struct path *path = fsnotify_data_path(data, data_type);
+   struct mem_cgroup *old_memcg;
 
/*
 * For queues with unlimited length lost events are not expected and
@@ -366,7 +367,7 @@ struct fanotify_event *fanotify_alloc_event(struct 
fsnotify_group *group,
gfp |= __GFP_RETRY_MAYFAIL;
 
/* Whoever is interested in the event, pays for the allocation. */
-   memalloc_use_memcg(group->memcg);
+   old_memcg = memalloc_use_memcg(group->memcg);
 
if (fanotify_is_perm_event(mask)) {
struct fanotify_perm_event *pevent;
@@ -451,7 +452,7 @@ struct fanotify_event *fanotify_alloc_event(struct 
fsnotify_group *group,
}
}
 out:
-   memalloc_unuse_memcg();
+   memalloc_use_memcg(old_memcg);
return event;
 }
 
diff --git a/fs/notify/inotify/inotify_fsnotify.c 
b/fs/notify/inotify/inotify_fsnotify.c
index 2ebc89047153..52f38e6e81b7 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -69,6 +69,7 @@ int inotify_handle_event(struct fsnotify_group *group,
int ret;
int len = 0;
int alloc_len = sizeof(struct inotify_event_info);
+   struct mem_cgroup *old_memcg;
 
if (WARN_ON(fsnotify_iter_vfsmount_mark(iter_info)))
return 0;
@@ -93,9 +94,9 @@ int inotify_handle_event(struct fsnotify_group *group,
 * trigger OOM killer in the target monitoring memcg as it may have
 * security repercussion.
 */
-   memalloc_use_memcg(group->memcg);
+   old_memcg = memalloc_use_memcg(group->memcg);
event = kmalloc(alloc_len, GFP_KERNEL_ACCOUNT | 

[PATCH 4/4] loop: Charge i/o to mem and blk cg

2020-05-28 Thread Dan Schatzberg
The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.

This patch also exports cgroup_get_e_css so it can be used by the loop
module.

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
---
 drivers/block/loop.c   | 61 +-
 drivers/block/loop.h   |  3 +-
 include/linux/memcontrol.h |  6 
 kernel/cgroup/cgroup.c |  1 +
 4 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 378a68e5ccf3..c238e274788e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -77,6 +77,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "loop.h"
 
@@ -507,8 +508,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long 
ret, long ret2)
 {
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
-   if (cmd->css)
-   css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
 }
@@ -569,8 +568,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
-   if (cmd->css)
-   kthread_associate_blkcg(cmd->css);
 
if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -578,7 +575,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);
 
lo_rw_aio_do_completion(cmd);
-   kthread_associate_blkcg(NULL);
 
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -918,7 +914,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
-   struct cgroup_subsys_state *css;
+   struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
 };
 
@@ -935,7 +931,7 @@ static void loop_queue_work(struct loop_device *lo, struct 
loop_cmd *cmd)
 
spin_lock_irq(&lo->lo_work_lock);
 
-   if (!cmd->css)
+   if (!cmd->blkcg_css)
goto queue_work;
 
node = &lo->worker_tree.rb_node;
@@ -943,10 +939,10 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
-   if (cur_worker->css == cmd->css) {
+   if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
-   } else if ((long)cur_worker->css < (long)cmd->css) {
+   } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -958,13 +954,18 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
 * In the event we cannot allocate a worker, just queue on the
-* rootcg worker
+* rootcg worker and issue the I/O as the rootcg
 */
-   if (!worker)
+   if (!worker) {
+   cmd->blkcg_css = NULL;
+   if (cmd->memcg_css)
+   css_put(cmd->memcg_css);
+   cmd->memcg_css = NULL;
goto queue_work;
+   }
 
-   worker->css = cmd->css;
-   css_get(worker->css);
+   worker->blkcg_css = cmd->blkcg_css;
+   css_get(worker->blkcg_css);
INIT_WORK(&worker->work, loop_workfn);
INIT_LIST_HEAD(&worker->cmd_list);
INIT_LIST_HEAD(&worker->idle_list);
@@ -1214,7 +1215,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
release)
idle_list) {
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
-   css_put(worker->css);
+   css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(&lo->lo_work_lock);
@@ -2027,13 +2028,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx 
*hctx,
}
 
/* always use the first bio's css */
+   cmd->blkcg_css = NULL;
+   cmd->memcg_css = NULL;
 #ifdef CONFIG_BLK_CGROUP
-   if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
-   cmd->css = &bio_blkcg(rq->bio)->css;
-   css_get(cmd->css);
-   } else
+   if

[PATCH 1/4] loop: Use worker per cgroup instead of kworker

2020-05-28 Thread Dan Schatzberg
Existing uses of loop device may have multiple cgroups reading/writing
to the same device. Simply charging resources for I/O to the backing
file could result in priority inversion where one cgroup gets
synchronously blocked, holding up all other I/O to the loop device.

In order to avoid this priority inversion, we use a single workqueue
where each work item is a "struct loop_worker" which contains a queue of
struct loop_cmds to issue. The loop device maintains a tree mapping blk
css_id -> loop_worker. This allows each cgroup to independently make
forward progress issuing I/O to the backing file.

There is also a single queue for I/O associated with the rootcg which
can be used in cases of extreme memory shortage where we cannot allocate
a loop_worker.

The locking for the tree and queues is fairly heavy handed - we acquire
a per-loop-device spinlock any time either is accessed. The existing
implementation serializes all I/O through a single thread anyways, so I
don't believe this is any worse.

Signed-off-by: Dan Schatzberg 
---
 drivers/block/loop.c | 203 ---
 drivers/block/loop.h |  12 ++-
 2 files changed, 178 insertions(+), 37 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e..378a68e5ccf3 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -70,7 +70,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -83,6 +82,8 @@
 
 #include 
 
+#define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
+
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 
@@ -911,27 +912,83 @@ static void loop_config_discard(struct loop_device *lo)
blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
 }
 
-static void loop_unprepare_queue(struct loop_device *lo)
-{
-   kthread_flush_worker(&lo->worker);
-   kthread_stop(lo->worker_task);
-}
+struct loop_worker {
+   struct rb_node rb_node;
+   struct work_struct work;
+   struct list_head cmd_list;
+   struct list_head idle_list;
+   struct loop_device *lo;
+   struct cgroup_subsys_state *css;
+   unsigned long last_ran_at;
+};
 
-static int loop_kthread_worker_fn(void *worker_ptr)
-{
-   current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
-   return kthread_worker_fn(worker_ptr);
-}
+static void loop_workfn(struct work_struct *work);
+static void loop_rootcg_workfn(struct work_struct *work);
+static void loop_free_idle_workers(struct timer_list *timer);
 
-static int loop_prepare_queue(struct loop_device *lo)
+static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 {
-   kthread_init_worker(&lo->worker);
-   lo->worker_task = kthread_run(loop_kthread_worker_fn,
-   &lo->worker, "loop%d", lo->lo_number);
-   if (IS_ERR(lo->worker_task))
-   return -ENOMEM;
-   set_user_nice(lo->worker_task, MIN_NICE);
-   return 0;
+   struct rb_node **node = &(lo->worker_tree.rb_node), *parent = NULL;
+   struct loop_worker *cur_worker, *worker = NULL;
+   struct work_struct *work;
+   struct list_head *cmd_list;
+
+   spin_lock_irq(&lo->lo_work_lock);
+
+   if (!cmd->css)
+   goto queue_work;
+
+   node = &lo->worker_tree.rb_node;
+
+   while (*node) {
+   parent = *node;
+   cur_worker = container_of(*node, struct loop_worker, rb_node);
+   if (cur_worker->css == cmd->css) {
+   worker = cur_worker;
+   break;
+   } else if ((long)cur_worker->css < (long)cmd->css) {
+   node = &(*node)->rb_left;
+   } else {
+   node = &(*node)->rb_right;
+   }
+   }
+   if (worker)
+   goto queue_work;
+
+   worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
+   /*
+* In the event we cannot allocate a worker, just queue on the
+* rootcg worker
+*/
+   if (!worker)
+   goto queue_work;
+
+   worker->css = cmd->css;
+   css_get(worker->css);
+   INIT_WORK(&worker->work, loop_workfn);
+   INIT_LIST_HEAD(&worker->cmd_list);
+   INIT_LIST_HEAD(&worker->idle_list);
+   worker->lo = lo;
+   rb_link_node(&worker->rb_node, parent, node);
+   rb_insert_color(&worker->rb_node, &lo->worker_tree);
+queue_work:
+   if (worker) {
+   /*
+* We need to remove from the idle list here while
+* holding the lock so that the idle timer doesn't
+* free the worker
+*/
+   if (!list_empty(&worker->idle_list))
+   list_del_init(&worker->idle_list);
+   work = &worker->work;
+   cmd_list = &worker->cmd_list;
+   } else {
+   work = &lo->rootcg_work;
+   cmd_list 

[PATCH 3/4] mm: Charge active memcg when no mm is set

2020-05-28 Thread Dan Schatzberg
memalloc_use_memcg() worked for kernel allocations but was silently
ignored for user pages.

This patch establishes a precedence order for who gets charged:

1. If there is a memcg associated with the page already, that memcg is
   charged. This happens during swapin.

2. If an explicit mm is passed, mm->memcg is charged. This happens
   during page faults, which can be triggered in remote VMs (eg gup).

3. Otherwise consult the current process context. If it has configured
   a current->active_memcg, use that. Otherwise, current->mm->memcg.

Previously, if a NULL mm was passed to mem_cgroup_try_charge (case 3) it
would always charge the root cgroup. Now it looks up the current
active_memcg first (falling back to charging the root cgroup if not
set).
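
Condensed, the selection in mem_cgroup_try_charge() ends up as the
sketch below (the actual hunk follows):

	if (!memcg) {			/* no memcg bound to the page (non-swapin) */
		if (!mm)
			/* remote charge: current->active_memcg, else current->mm */
			memcg = get_mem_cgroup_from_current();
		else
			/* explicit mm (page fault, gup) takes precedence */
			memcg = get_mem_cgroup_from_mm(mm);
	}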

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
Acked-by: Tejun Heo 
Acked-by: Chris Down 
Reviewed-by: Shakeel Butt 
---
 mm/memcontrol.c | 11 ---
 mm/shmem.c  |  4 ++--
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a3b97f103966..383d88c1c105 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6438,7 +6438,8 @@ enum mem_cgroup_protection mem_cgroup_protected(struct 
mem_cgroup *root,
  * @compound: charge the page as compound or small page
  *
  * Try to charge @page to the memcg that @mm belongs to, reclaiming
- * pages according to @gfp_mask if necessary.
+ * pages according to @gfp_mask if necessary. if @mm is NULL, try to
+ * charge to the active memcg.
  *
  * Returns 0 on success, with *@memcgp pointing to the charged memcg.
  * Otherwise, an error code is returned.
@@ -6482,8 +6483,12 @@ int mem_cgroup_try_charge(struct page *page, struct 
mm_struct *mm,
}
}
 
-   if (!memcg)
-   memcg = get_mem_cgroup_from_mm(mm);
+   if (!memcg) {
+   if (!mm)
+   memcg = get_mem_cgroup_from_current();
+   else
+   memcg = get_mem_cgroup_from_mm(mm);
+   }
 
ret = try_charge(memcg, gfp_mask, nr_pages);
 
diff --git a/mm/shmem.c b/mm/shmem.c
index bd8840082c94..d2efa1a44311 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1618,7 +1618,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t 
index,
 {
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
-   struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
+   struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL;
struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
@@ -1753,7 +1753,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t 
index,
}
 
sbinfo = SHMEM_SB(inode->i_sb);
-   charge_mm = vma ? vma->vm_mm : current->mm;
+   charge_mm = vma ? vma->vm_mm : NULL;
 
page = find_lock_entry(mapping, index);
if (xa_is_value(page)) {
-- 
2.24.1



[PATCH v6 0/4] Charge loop device i/o to issuing cgroup

2020-05-28 Thread Dan Schatzberg
Much of the discussion about this has died down. There's been a
concern raised that we could generalize infrastructure across loop,
md, etc. This may be possible in the future, but it isn't clear to me
what this would look like. I'm inclined to fix the existing issue with
loop devices now (this is a problem we hit at FB) and address
consolidation with other cases if and when those need to be addressed.

Changes since V6:

* Added separate spinlock for worker synchronization
* Minor style changes

Changes since V5:

* Fixed a missing css_put when failing to allocate a worker
* Minor style changes

Changes since V4:

Only patches 1 and 2 have changed.

* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg

Changes since V3:

* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes

Changes since V2:

* Deferred destruction of workqueue items so in the common case there
  is no allocation needed

Changes since V1:

* Split out and reordered patches so cgroup charging changes are
  separate from kworker -> workqueue change

* Add mem_css to struct loop_cmd to simplify logic

The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on cgroupv2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.

With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.

Dan Schatzberg (3):
  loop: Use worker per cgroup instead of kworker
  mm: Charge active memcg when no mm is set
  loop: Charge i/o to mem and blk cg

Johannes Weiner (1):
  mm: support nesting memalloc_use_memcg()

 drivers/block/loop.c | 244 ++-
 drivers/block/loop.h |  15 +-
 fs/buffer.c  |   6 +-
 fs/notify/fanotify/fanotify.c|   5 +-
 fs/notify/inotify/inotify_fsnotify.c |   5 +-
 include/linux/memcontrol.h   |   6 +
 include/linux/sched/mm.h |  28 +--
 kernel/cgroup/cgroup.c   |   1 +
 mm/memcontrol.c  |  11 +-
 mm/shmem.c   |   4 +-
 10 files changed, 246 insertions(+), 79 deletions(-)

-- 
2.24.1



Re: [PATCH v5 0/4] Charge loop device i/o to issuing cgroup

2020-05-26 Thread Dan Schatzberg
On Tue, May 12, 2020 at 06:35:45AM -0700, Christoph Hellwig wrote:
> On Tue, May 12, 2020 at 09:25:21AM -0400, Dan Schatzberg wrote:
> > Seems like discussion on this patch series has died down. There's been
> > a concern raised that we could generalize infrastructure across loop,
> > md, etc. This may be possible in the future, but it isn't clear to me
> > what this would look like. I'm inclined to fix the existing issue with
> > loop devices now (this is a problem we hit at FB) and address
> > consolidation with other cases if and when those are addressed.
> > 
> > Jens, you've expressed interest in seeing this series go through the
> > block tree so I'm interested in your perspective here. Barring any
> > concrete implementation bugs, would you be okay merging this version?
> 
> Independ of any higher level issues you need to sort out the spinlock
> mess I pointed out.

Will do - I'll split out the lock-use refactor into a separate
patch. Do you have particular concerns about re-using the existing
spinlock? Its existing use is not contended so I didn't see any harm
in extending its use. I'll add this justification to the commit
message as well, but I'm tempted to leave the re-use as is instead of
creating a new lock.


Re: [PATCH v5 0/4] Charge loop device i/o to issuing cgroup

2020-05-12 Thread Dan Schatzberg
Seems like discussion on this patch series has died down. There's been
a concern raised that we could generalize infrastructure across loop,
md, etc. This may be possible in the future, but it isn't clear to me
what this would look like. I'm inclined to fix the existing issue with
loop devices now (this is a problem we hit at FB) and address
consolidation with other cases if and when those are addressed.

Jens, you've expressed interest in seeing this series go through the
block tree so I'm interested in your perspective here. Barring any
concrete implementation bugs, would you be okay merging this version?


Re: [PATCH v5 0/4] Charge loop device i/o to issuing cgroup

2020-04-29 Thread Dan Schatzberg
On Wed, Apr 29, 2020 at 07:47:34AM +1000, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 12:13:46PM -0400, Dan Schatzberg wrote:
> > The loop device runs all i/o to the backing file on a separate kworker
> > thread which results in all i/o being charged to the root cgroup. This
> > allows a loop device to be used to trivially bypass resource limits
> > and other policy. This patch series fixes this gap in accounting.
> 
> How is this specific to the loop device? Isn't every block device
> that offloads work to a kthread or single worker thread susceptible
> to the same "exploit"?

I believe this is fairly loop device specific. The issue is that the
loop driver issues I/O by re-entering the VFS layer (hitting tmpfs as
in my example, or re-entering the block layer). Normally, I/O
through the VFS layer is accounted for and controlled (e.g. you can
OOM if writing to tmpfs, or get throttled by the I/O controller) but
the loop device completely side-steps the accounting.

> 
> Or is the problem simply that the loop worker thread is simply not
> taking the IO's associated cgroup and submitting the IO with that
> cgroup associated with it? That seems kinda simple to fix
> 
> > Naively charging cgroups could result in priority inversions through
> > the single kworker thread in the case where multiple cgroups are
> > reading/writing to the same loop device.
> 
> And that's where all the complexity and serialisation comes from,
> right?
> 
> So, again: how is this unique to the loop device? Other block
> devices also offload IO to kthreads to do blocking work and IO
> submission to lower layers. Hence this seems to me like a generic
> "block device does IO submission from different task" issue that
> should be handled by generic infrastructure and not need to be
> reimplemented multiple times in every block device driver that
> offloads work to other threads...

I'm not familiar with other block device drivers that behave like
this. Could you point me at a few?

> 
> > This patch series does some
> > minor modification to the loop driver so that each cgroup can make
> > forward progress independently to avoid this inversion.
> > 
> > With this patch series applied, the above script triggers OOM kills
> > when writing through the loop device as expected.
> 
> NACK!
> 
> The IO that is disallowed should fail with ENOMEM or some similar
> error, not trigger an OOM kill that shoots some innocent bystander
> in the head. That's worse than using BUG() to report errors...

The OOM behavior is due to the cgroup memory limit. It mirrors the
behavior one sees when writing to tmpfs in excess of that limit.


[PATCH v5 3/4] mm: Charge active memcg when no mm is set

2020-04-28 Thread Dan Schatzberg
memalloc_use_memcg() worked for kernel allocations but was silently
ignored for user pages.

This patch establishes a precedence order for who gets charged:

1. If there is a memcg associated with the page already, that memcg is
   charged. This happens during swapin.

2. If an explicit mm is passed, mm->memcg is charged. This happens
   during page faults, which can be triggered in remote VMs (eg gup).

3. Otherwise consult the current process context. If it has configured
   a current->active_memcg, use that. Otherwise, current->mm->memcg.

Previously, if a NULL mm was passed to mem_cgroup_try_charge (case 3) it
would always charge the root cgroup. Now it looks up the current
active_memcg first (falling back to charging the root cgroup if not
set).

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
Acked-by: Tejun Heo 
Acked-by: Chris Down 
Reviewed-by: Shakeel Butt 
---
 mm/memcontrol.c | 11 ---
 mm/shmem.c  |  4 ++--
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5beea03dd58a..af68d1d7b456 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6435,7 +6435,8 @@ enum mem_cgroup_protection mem_cgroup_protected(struct 
mem_cgroup *root,
  * @compound: charge the page as compound or small page
  *
  * Try to charge @page to the memcg that @mm belongs to, reclaiming
- * pages according to @gfp_mask if necessary.
+ * pages according to @gfp_mask if necessary. if @mm is NULL, try to
+ * charge to the active memcg.
  *
  * Returns 0 on success, with *@memcgp pointing to the charged memcg.
  * Otherwise, an error code is returned.
@@ -6479,8 +6480,12 @@ int mem_cgroup_try_charge(struct page *page, struct 
mm_struct *mm,
}
}
 
-   if (!memcg)
-   memcg = get_mem_cgroup_from_mm(mm);
+   if (!memcg) {
+   if (!mm)
+   memcg = get_mem_cgroup_from_current();
+   else
+   memcg = get_mem_cgroup_from_mm(mm);
+   }
 
ret = try_charge(memcg, gfp_mask, nr_pages);
 
diff --git a/mm/shmem.c b/mm/shmem.c
index d722eb830317..8c8ffc35a957 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1618,7 +1618,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t 
index,
 {
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
-   struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
+   struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL;
struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
@@ -1753,7 +1753,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t 
index,
}
 
sbinfo = SHMEM_SB(inode->i_sb);
-   charge_mm = vma ? vma->vm_mm : current->mm;
+   charge_mm = vma ? vma->vm_mm : NULL;
 
page = find_lock_entry(mapping, index);
if (xa_is_value(page)) {
-- 
2.24.1



[PATCH v5 0/4] Charge loop device i/o to issuing cgroup

2020-04-28 Thread Dan Schatzberg
Changes since V5:

* Fixed a missing css_put when failing to allocate a worker
* Minor style changes

Changes since V4:

Only patches 1 and 2 have changed.

* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg

Changes since V3:

* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes

Changes since V2:

* Deferred destruction of workqueue items so in the common case there
  is no allocation needed

Changes since V1:

* Split out and reordered patches so cgroup charging changes are
  separate from kworker -> workqueue change

* Add mem_css to struct loop_cmd to simplify logic

The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on cgroupv2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.

With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.

Dan Schatzberg (3):
  loop: Use worker per cgroup instead of kworker
  mm: Charge active memcg when no mm is set
  loop: Charge i/o to mem and blk cg

Johannes Weiner (1):
  mm: support nesting memalloc_use_memcg()

 drivers/block/loop.c | 248 ++-
 drivers/block/loop.h |  14 +-
 fs/buffer.c  |   6 +-
 fs/notify/fanotify/fanotify.c|   5 +-
 fs/notify/inotify/inotify_fsnotify.c |   5 +-
 include/linux/memcontrol.h   |   6 +
 include/linux/sched/mm.h |  28 +--
 kernel/cgroup/cgroup.c   |   1 +
 mm/memcontrol.c  |  11 +-
 mm/shmem.c   |   4 +-
 10 files changed, 248 insertions(+), 80 deletions(-)

-- 
2.24.1



[PATCH v5 4/4] loop: Charge i/o to mem and blk cg

2020-04-28 Thread Dan Schatzberg
The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.

This patch also exports cgroup_get_e_css so it can be used by the loop
module.
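
The interesting part is in loop_process_work() further down in this
patch (the hunk got cut off in this archive copy): before issuing a
command to the backing file, the worker adopts the submitter's cgroups
using the helper from patch 2/4. Roughly - this is a sketch of the
shape, not the literal hunk, and NULL checks are omitted:

	kthread_associate_blkcg(cmd->blkcg_css);
	old_memcg = memalloc_use_memcg(mem_cgroup_from_css(cmd->memcg_css));

	/* ... issue the request against lo->lo_backing_file ... */

	memalloc_use_memcg(old_memcg);
	kthread_associate_blkcg(NULL);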

Signed-off-by: Dan Schatzberg 
Acked-by: Johannes Weiner 
---
 drivers/block/loop.c   | 61 +-
 drivers/block/loop.h   |  3 +-
 include/linux/memcontrol.h |  6 
 kernel/cgroup/cgroup.c |  1 +
 4 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 49d7d1f62d88..4da0836f58be 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -77,6 +77,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "loop.h"
 
@@ -507,8 +508,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long 
ret, long ret2)
 {
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
 
-   if (cmd->css)
-   css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
 }
@@ -569,8 +568,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
-   if (cmd->css)
-   kthread_associate_blkcg(cmd->css);
 
if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -578,7 +575,6 @@ static int lo_rw_aio(struct loop_device *lo, struct 
loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);
 
lo_rw_aio_do_completion(cmd);
-   kthread_associate_blkcg(NULL);
 
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -924,7 +920,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
-   struct cgroup_subsys_state *css;
+   struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
 };
 
@@ -941,7 +937,7 @@ static void loop_queue_work(struct loop_device *lo, struct 
loop_cmd *cmd)
 
spin_lock_irq(&lo->lo_lock);
 
-   if (!cmd->css)
+   if (!cmd->blkcg_css)
goto queue_work;
 
node = &lo->worker_tree.rb_node;
@@ -949,10 +945,10 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
-   if (cur_worker->css == cmd->css) {
+   if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
-   } else if ((long)cur_worker->css < (long)cmd->css) {
+   } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -964,13 +960,18 @@ static void loop_queue_work(struct loop_device *lo, 
struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
 * In the event we cannot allocate a worker, just queue on the
-* rootcg worker
+* rootcg worker and issue the I/O as the rootcg
 */
-   if (!worker)
+   if (!worker) {
+   cmd->blkcg_css = NULL;
+   if (cmd->memcg_css)
+   css_put(cmd->memcg_css);
+   cmd->memcg_css = NULL;
goto queue_work;
+   }
 
-   worker->css = cmd->css;
-   css_get(worker->css);
+   worker->blkcg_css = cmd->blkcg_css;
+   css_get(worker->blkcg_css);
INIT_WORK(&worker->work, loop_workfn);
INIT_LIST_HEAD(&worker->cmd_list);
INIT_LIST_HEAD(&worker->idle_list);
@@ -1221,7 +1222,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
release)
idle_list) {
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
-   css_put(worker->css);
+   css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(&lo->lo_lock);
@@ -2030,13 +2031,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx 
*hctx,
}
 
/* always use the first bio's css */
+   cmd->blkcg_css = NULL;
+   cmd->memcg_css = NULL;
 #ifdef CONFIG_BLK_CGROUP
-   if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
-   cmd->css = &bio_blkcg(rq->bio)->css;
-   css_get(cmd->css);
-   } else
+   if

[PATCH v5 1/4] loop: Use worker per cgroup instead of kworker

2020-04-28 Thread Dan Schatzberg
Existing uses of loop device may have multiple cgroups reading/writing
to the same device. Simply charging resources for I/O to the backing
file could result in priority inversion where one cgroup gets
synchronously blocked, holding up all other I/O to the loop device.

In order to avoid this priority inversion, we use a single workqueue
where each work item is a "struct loop_worker" which contains a queue of
struct loop_cmds to issue. The loop device maintains a tree mapping blk
css_id -> loop_worker. This allows each cgroup to independently make
forward progress issuing I/O to the backing file.

There is also a single queue for I/O associated with the rootcg which
can be used in cases of extreme memory shortage where we cannot allocate
a loop_worker.

The locking for the tree and queues is fairly heavy handed - we acquire
the per-loop-device spinlock any time either is accessed. The existing
implementation serializes all I/O through a single thread anyway, so I
don't believe this is any worse.

Signed-off-by: Dan Schatzberg 
---
 drivers/block/loop.c | 207 ---
 drivers/block/loop.h |  11 ++-
 2 files changed, 180 insertions(+), 38 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e..49d7d1f62d88 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -70,7 +70,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -83,6 +82,8 @@
 
 #include 
 
+#define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
+
 static DEFINE_IDR(loop_index_idr);
 static DEFINE_MUTEX(loop_ctl_mutex);
 
@@ -778,12 +779,18 @@ static ssize_t loop_attr_backing_file_show(struct 
loop_device *lo, char *buf)
 {
ssize_t ret;
char *p = NULL;
+   struct file *filp = NULL;
 
spin_lock_irq(&lo->lo_lock);
if (lo->lo_backing_file)
-   p = file_path(lo->lo_backing_file, buf, PAGE_SIZE - 1);
+   filp = get_file(lo->lo_backing_file);
spin_unlock_irq(&lo->lo_lock);
 
+   if (filp) {
+   p = file_path(filp, buf, PAGE_SIZE - 1);
+   fput(filp);
+   }
+
if (IS_ERR_OR_NULL(p))
ret = PTR_ERR(p);
else {
@@ -911,27 +918,83 @@ static void loop_config_discard(struct loop_device *lo)
blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
 }
 
-static void loop_unprepare_queue(struct loop_device *lo)
-{
-   kthread_flush_worker(&lo->worker);
-   kthread_stop(lo->worker_task);
-}
+struct loop_worker {
+   struct rb_node rb_node;
+   struct work_struct work;
+   struct list_head cmd_list;
+   struct list_head idle_list;
+   struct loop_device *lo;
+   struct cgroup_subsys_state *css;
+   unsigned long last_ran_at;
+};
 
-static int loop_kthread_worker_fn(void *worker_ptr)
-{
-   current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
-   return kthread_worker_fn(worker_ptr);
-}
+static void loop_workfn(struct work_struct *work);
+static void loop_rootcg_workfn(struct work_struct *work);
+static void loop_free_idle_workers(struct timer_list *timer);
 
-static int loop_prepare_queue(struct loop_device *lo)
+static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
 {
-   kthread_init_worker(&lo->worker);
-   lo->worker_task = kthread_run(loop_kthread_worker_fn,
-   &lo->worker, "loop%d", lo->lo_number);
-   if (IS_ERR(lo->worker_task))
-   return -ENOMEM;
-   set_user_nice(lo->worker_task, MIN_NICE);
-   return 0;
+   struct rb_node **node = &(lo->worker_tree.rb_node), *parent = NULL;
+   struct loop_worker *cur_worker, *worker = NULL;
+   struct work_struct *work;
+   struct list_head *cmd_list;
+
+   spin_lock_irq(&lo->lo_lock);
+
+   if (!cmd->css)
+   goto queue_work;
+
+   node = &lo->worker_tree.rb_node;
+
+   while (*node) {
+   parent = *node;
+   cur_worker = container_of(*node, struct loop_worker, rb_node);
+   if (cur_worker->css == cmd->css) {
+   worker = cur_worker;
+   break;
+   } else if ((long)cur_worker->css < (long)cmd->css) {
+   node = &(*node)->rb_left;
+   } else {
+   node = &(*node)->rb_right;
+   }
+   }
+   if (worker)
+   goto queue_work;
+
+   worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
+   /*
+* In the event we cannot allocate a worker, just queue on the
+* rootcg worker
+*/
+   if (!worker)
+   goto queue_work;
+
+   worker->css = cmd->css;
+   css_get(worker->css);
+   INIT_WORK(&worker->work, loop_workfn);
+   INIT_LIST_HEAD(&worker->cmd_list);
+   INIT_LIST_HEAD(&worker->idle_list);
+  

[PATCH v5 2/4] mm: support nesting memalloc_use_memcg()

2020-04-28 Thread Dan Schatzberg
From: Johannes Weiner 

The memalloc_use_memcg() function to override the default memcg
accounting context currently doesn't nest. But the patches to make the
loop driver cgroup-aware will end up nesting:

[   98.137605]  alloc_page_buffers+0x210/0x288
[   98.141799]  __getblk_gfp+0x1d4/0x400
[   98.145475]  ext4_read_block_bitmap_nowait+0x148/0xbc8
[   98.150628]  ext4_mb_init_cache+0x25c/0x9b0
[   98.154821]  ext4_mb_init_group+0x270/0x390
[   98.159014]  ext4_mb_good_group+0x264/0x270
[   98.163208]  ext4_mb_regular_allocator+0x480/0x798
[   98.168011]  ext4_mb_new_blocks+0x958/0x10f8
[   98.172294]  ext4_ext_map_blocks+0xec8/0x1618
[   98.176660]  ext4_map_blocks+0x1b8/0x8a0
[   98.180592]  ext4_writepages+0x830/0xf10
[   98.184523]  do_writepages+0xb4/0x198
[   98.188195]  __filemap_fdatawrite_range+0x170/0x1c8
[   98.193086]  filemap_write_and_wait_range+0x40/0xb0
[   98.197974]  ext4_punch_hole+0x4a4/0x660
[   98.201907]  ext4_fallocate+0x294/0x1190
[   98.205839]  loop_process_work+0x690/0x1100
[   98.210032]  loop_workfn+0x2c/0x110
[   98.213529]  process_one_work+0x3e0/0x648
[   98.217546]  worker_thread+0x70/0x670
[   98.221217]  kthread+0x1b8/0x1c0
[   98.224452]  ret_from_fork+0x10/0x18

where loop_process_work() sets the memcg override to the memcg that
submitted the IO request, and alloc_page_buffers() sets the override
to the memcg that instantiated the cache page, which may differ.

Make memalloc_use_memcg() return the old memcg and convert existing
users to a stacking model. Delete the unused memalloc_unuse_memcg().
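
In other words, callers move from use/unuse pairs to a save/restore
pattern, which nests naturally:

	struct mem_cgroup *old_memcg;

	old_memcg = memalloc_use_memcg(memcg);	/* override, remember previous */
	/* ... allocations here are charged to memcg ... */
	memalloc_use_memcg(old_memcg);		/* restore whatever was active */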

Signed-off-by: Johannes Weiner 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
---
 fs/buffer.c  |  6 +++---
 fs/notify/fanotify/fanotify.c|  5 +++--
 fs/notify/inotify/inotify_fsnotify.c |  5 +++--
 include/linux/sched/mm.h | 28 +---
 4 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 599a0bf7257b..b4e99c6b52ec 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -851,13 +851,13 @@ struct buffer_head *alloc_page_buffers(struct page *page, 
unsigned long size,
struct buffer_head *bh, *head;
gfp_t gfp = GFP_NOFS | __GFP_ACCOUNT;
long offset;
-   struct mem_cgroup *memcg;
+   struct mem_cgroup *memcg, *old_memcg;
 
if (retry)
gfp |= __GFP_NOFAIL;
 
memcg = get_mem_cgroup_from_page(page);
-   memalloc_use_memcg(memcg);
+   old_memcg = memalloc_use_memcg(memcg);
 
head = NULL;
offset = PAGE_SIZE;
@@ -876,7 +876,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, 
unsigned long size,
set_bh_page(bh, page, offset);
}
 out:
-   memalloc_unuse_memcg();
+   memalloc_use_memcg(old_memcg);
mem_cgroup_put(memcg);
return head;
 /*
diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
index 5435a40f82be..54c787cd6efb 100644
--- a/fs/notify/fanotify/fanotify.c
+++ b/fs/notify/fanotify/fanotify.c
@@ -353,6 +353,7 @@ struct fanotify_event *fanotify_alloc_event(struct 
fsnotify_group *group,
gfp_t gfp = GFP_KERNEL_ACCOUNT;
struct inode *id = fanotify_fid_inode(inode, mask, data, data_type);
const struct path *path = fsnotify_data_path(data, data_type);
+   struct mem_cgroup *oldmemcg;
 
/*
 * For queues with unlimited length lost events are not expected and
@@ -366,7 +367,7 @@ struct fanotify_event *fanotify_alloc_event(struct 
fsnotify_group *group,
gfp |= __GFP_RETRY_MAYFAIL;
 
/* Whoever is interested in the event, pays for the allocation. */
-   memalloc_use_memcg(group->memcg);
+   oldmemcg = memalloc_use_memcg(group->memcg);
 
if (fanotify_is_perm_event(mask)) {
struct fanotify_perm_event *pevent;
@@ -451,7 +452,7 @@ struct fanotify_event *fanotify_alloc_event(struct 
fsnotify_group *group,
}
}
 out:
-   memalloc_unuse_memcg();
+   memalloc_use_memcg(oldmemcg);
return event;
 }
 
diff --git a/fs/notify/inotify/inotify_fsnotify.c 
b/fs/notify/inotify/inotify_fsnotify.c
index 2ebc89047153..d27c6e83cea6 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -69,6 +69,7 @@ int inotify_handle_event(struct fsnotify_group *group,
int ret;
int len = 0;
int alloc_len = sizeof(struct inotify_event_info);
+   struct mem_cgroup *oldmemcg;
 
if (WARN_ON(fsnotify_iter_vfsmount_mark(iter_info)))
return 0;
@@ -93,9 +94,9 @@ int inotify_handle_event(struct fsnotify_group *group,
 * trigger OOM killer in the target monitoring memcg as it may have
 * security repercussion.
 */
-   memalloc_use_memcg(group->memcg);
+   oldmemcg = memalloc_use_memcg(group->memcg);
event = kmalloc(alloc_len, GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
-   memalloc_unuse_memcg();

[PATCH] psi: Expose pressure metrics on root cgroup

2019-05-10 Thread Dan Schatzberg
Pressure metrics are already recorded and exposed in procfs for the
entire system, but any tool which monitors cgroup pressure has to
special-case the root cgroup and read from procfs instead. This patch exposes
the already recorded pressure metrics on the root cgroup.
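
The change boils down to the pattern below, applied to the cpu, memory
and io pressure files, plus dropping CFTYPE_NOT_ON_ROOT and exporting
psi_system:

static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
{
	struct cgroup *cgroup = seq_css(seq)->cgroup;
	/* the root cgroup has no psi group of its own; reuse the system one */
	struct psi_group *psi = cgroup->id == 1 ? &psi_system : &cgroup->psi;

	return psi_show(seq, psi, PSI_IO);
}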

Signed-off-by: Dan Schatzberg 
---
 include/linux/psi.h|  1 +
 kernel/cgroup/cgroup.c | 18 --
 kernel/sched/psi.c |  2 +-
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index fc08da2bcc0a..64740dc42aa2 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -10,6 +10,7 @@ struct css_set;
 #ifdef CONFIG_PSI
 
 extern bool psi_disabled;
+extern struct psi_group psi_system;
 
 void psi_init(void);
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index dc8adb124874..3a748c746324 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3392,15 +3392,24 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
 #ifdef CONFIG_PSI
 static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
 {
-   return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU);
+   struct cgroup *cgroup = seq_css(seq)->cgroup;
+   struct psi_group *psi = cgroup->id == 1 ? &psi_system : &cgroup->psi;
+
+   return psi_show(seq, psi, PSI_CPU);
 }
 static int cgroup_memory_pressure_show(struct seq_file *seq, void *v)
 {
-   return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_MEM);
+   struct cgroup *cgroup = seq_css(seq)->cgroup;
+   struct psi_group *psi = cgroup->id == 1 ? &psi_system : &cgroup->psi;
+
+   return psi_show(seq, psi, PSI_MEM);
 }
 static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
 {
-   return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_IO);
+   struct cgroup *cgroup = seq_css(seq)->cgroup;
+   struct psi_group *psi = cgroup->id == 1 ? &psi_system : &cgroup->psi;
+
+   return psi_show(seq, psi, PSI_IO);
 }
 #endif
 
@@ -4518,17 +4527,14 @@ static struct cftype cgroup_base_files[] = {
 #ifdef CONFIG_PSI
{
.name = "cpu.pressure",
-   .flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cgroup_cpu_pressure_show,
},
{
.name = "memory.pressure",
-   .flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cgroup_memory_pressure_show,
},
{
.name = "io.pressure",
-   .flags = CFTYPE_NOT_ON_ROOT,
.seq_show = cgroup_io_pressure_show,
},
 #endif
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index cc3faf56b614..59e67b3c3bc4 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -150,7 +150,7 @@ static u64 psi_period __read_mostly;
 
 /* System-level pressure and stall tracking */
 static DEFINE_PER_CPU(struct psi_group_cpu, system_group_pcpu);
-static struct psi_group psi_system = {
+struct psi_group psi_system = {
.pcpu = &system_group_pcpu,
 };
 
-- 
2.17.1



Re: linux-next: Signed-off-by missing for commit in the fuse tree

2018-10-16 Thread Dan Schatzberg
On 10/16/18, 4:20 AM, "Miklos Szeredi"  wrote:
>
> On Mon, Oct 15, 2018 at 8:47 PM, Stephen Rothwell  
> wrote:
>> Hi Miklos,
>>
>> Commit
>>
>>   5571f1e65486 ("fuse: enable caching of symlinks")
>>
>> is missing a Signed-off-by from its author.
>
> Dan,
>
> Maybe you mistakenly left off the SOB tag?  Can I add it for you?
>
> Thanks,
> Miklos

Oops, please add it for me. Thank you for all your help!



[PATCH v2] fuse: enable caching of symlinks

2018-10-11 Thread Dan Schatzberg
FUSE file reads are cached in the page cache, but symlink reads are
not. This patch enables FUSE READLINK operations to be cached which
can improve performance of some FUSE workloads.

In particular, I'm working on a FUSE filesystem for access to source
code and discovered that about a 10% improvement to build times is
achieved with this patch (there are a lot of symlinks in the source
tree).
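
In short, caching is opt-in per connection; the dispatch added below
boils down to:

static const char *fuse_get_link(struct dentry *dentry, struct inode *inode,
				 struct delayed_call *done)
{
	struct fuse_conn *fc = get_fuse_conn(inode);

	/* server advertised FUSE_CACHE_SYMLINKS in its INIT reply */
	if (fc->cache_symlinks)
		return page_get_link(dentry, inode, done);	/* page-cache backed */

	return fuse_get_link_nocache(dentry, inode, done, fc);
}
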
---
Changes in v2:
  - Gated behind INIT flag for backwards compatibility

 fs/fuse/dir.c | 67 +--
 fs/fuse/fuse_i.h  |  3 ++
 fs/fuse/inode.c   |  4 ++-
 include/uapi/linux/fuse.h |  7 +++-
 4 files changed, 76 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 0979609d6eba..ce23fb6ac5f4 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1398,11 +1398,53 @@ static int fuse_readdir(struct file *file, struct 
dir_context *ctx)
return err;
 }
 
-static const char *fuse_get_link(struct dentry *dentry,
-struct inode *inode,
-struct delayed_call *done)
+static int fuse_symlink_readpage(struct file *file, struct page *page)
 {
+   struct inode *inode = page->mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
+   struct fuse_req *req;
+   int err;
+   size_t num_read;
+
+   err = -EIO;
+   if (is_bad_inode(inode))
+   goto out;
+
+   req = fuse_get_req(fc, 1);
+   if (IS_ERR(req)) {
+   err = PTR_ERR(req);
+   goto out;
+   }
+
+   req->out.page_zeroing = 1;
+   req->out.argpages = 1;
+   req->num_pages = 1;
+   req->pages[0] = page;
+   req->page_descs[0].length = PAGE_SIZE - 1;
+   req->in.h.opcode = FUSE_READLINK;
+   req->in.h.nodeid = get_node_id(inode);
+   req->out.argvar = 1;
+   req->out.numargs = 1;
+   req->out.args[0].size = PAGE_SIZE - 1;
+   fuse_request_send(fc, req);
+   num_read = req->out.args[0].size;
+   err = req->out.h.error;
+
+   if (!err)
+   SetPageUptodate(page);
+
+   fuse_put_request(fc, req);
+   fuse_invalidate_atime(inode);
+out:
+   unlock_page(page);
+   return err;
+}
+
+static const char *fuse_get_link_nocache(struct dentry *dentry,
+struct inode *inode,
+struct delayed_call *done,
+struct fuse_conn *fc)
+{
FUSE_ARGS(args);
char *link;
ssize_t ret;
@@ -1432,6 +1474,17 @@ static const char *fuse_get_link(struct dentry *dentry,
return link;
 }
 
+static const char *fuse_get_link(struct dentry *dentry,
+struct inode *inode,
+struct delayed_call *done)
+{
+   struct fuse_conn *fc = get_fuse_conn(inode);
+   if (fc->cache_symlinks)
+   return page_get_link(dentry, inode, done);
+   else
+   return fuse_get_link_nocache(dentry, inode, done, fc);
+}
+
 static int fuse_dir_open(struct inode *inode, struct file *file)
 {
return fuse_open_common(inode, file, true);
@@ -1871,7 +1924,15 @@ void fuse_init_dir(struct inode *inode)
inode->i_fop = &fuse_dir_operations;
 }
 
+static const struct address_space_operations fuse_symlink_aops = {
+   .readpage   = fuse_symlink_readpage,
+};
+
 void fuse_init_symlink(struct inode *inode)
 {
inode->i_op = &fuse_symlink_inode_operations;
+   inode->i_data.a_ops = &fuse_symlink_aops;
+   mapping_set_gfp_mask(inode->i_mapping,
+   mapping_gfp_constraint(inode->i_mapping,
+   ~(__GFP_FS | __GFP_HIGHMEM)));
 }
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f78e9614bb5f..50f1d1e96011 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -551,6 +551,9 @@ struct fuse_conn {
/** handle fs handles killing suid/sgid/cap on write/chown/trunc */
unsigned handle_killpriv:1;
 
+   /** cache READLINK responses in page cache */
+   unsigned cache_symlinks:1;
+
/*
 * The following bitfields are only for optimization purposes
 * and hence races in setting them will not cause malfunction
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index db9e60b7eb69..0d8c2ee01308 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -926,6 +926,8 @@ static void process_init_reply(struct fuse_conn *fc, struct 
fuse_req *req)
}
if (arg->flags & FUSE_ABORT_ERROR)
fc->abort_err = 1;
+   if (arg->flags & FUSE_CACHE_SYMLINKS)
+   fc->cache_symlinks = 1;
} else {
ra_pages = fc->max_read / PAGE_SIZE;
fc->no_lock = 1;
@@ -957,7 +959,7 @@ static void fuse_send_init(struct fuse_conn *fc, struct 

Re: [PATCH] fuse: enable caching of symlinks

2018-10-09 Thread Dan Schatzberg
On 9/27/18, 9:59 AM, "Miklos Szeredi"  wrote:

>  On Thu, Sep 27, 2018 at 3:52 PM, Dan Schatzberg  wrote:
>> FUSE file reads are cached in the page cache, but symlink reads are
>> not. This patch enables FUSE READLINK operations to be cached which
>> can improve performance of some FUSE workloads.
>>
>> In particular, I'm working on a FUSE filesystem for access to source
>> code and discovered that about a 10% improvement to build times is
>> achieved with this patch (there are a lot of symlinks in the source
>> tree).
>>
>> Please provide feedback, I'm looking to flesh this out more.
> 
> Need to think about how/when to invalidate the cached symlink
> contents.  I think treating it like metadata (i.e. look at
> attr_timeout for validity) would be the simplest.
> 
> Thanks,
> Miklos
> 

Any further thoughts on this? Just so I can confirm my understanding,
the attached patch will unconditionally cache READLINK responses.
However, based on the entry timeout provided by the LOOKUP response,
userspace can decide to invalidate this by providing a different inode on
a subsequent LOOKUP (e.g. during fuse_dentry_revalidate()).

Would you like there to be a way to invalidate the cached symlink without
changing the inode via LOOKUP?

>> Signed-off-by: Dan Schatzberg 
>> ---
>>  fs/fuse/dir.c | 70 +++
>>  1 file changed, 43 insertions(+), 27 deletions(-)
>>
>> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
>> index 0979609d6eba..3c0a81ef5bca 100644
>> --- a/fs/fuse/dir.c
>> +++ b/fs/fuse/dir.c
>> @@ -1398,38 +1398,46 @@ static int fuse_readdir(struct file *file, struct 
>> dir_context *ctx)
>> return err;
>>  }
>>
>> -static const char *fuse_get_link(struct dentry *dentry,
>> -struct inode *inode,
>> -struct delayed_call *done)
>> +static int fuse_symlink_readpage(struct file *file, struct page *page)
>>  {
>> +   struct inode *inode = page->mapping->host;
>> struct fuse_conn *fc = get_fuse_conn(inode);
>> -   FUSE_ARGS(args);
>> -   char *link;
>> -   ssize_t ret;
>> -
>> -   if (!dentry)
>> -   return ERR_PTR(-ECHILD);
>> +   struct fuse_req *req;
>> +   int err;
>> +   size_t num_read;
>>
>> -   link = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> -   if (!link)
>> -   return ERR_PTR(-ENOMEM);
>> +   err = -EIO;
>> +   if (is_bad_inode(inode))
>> +   goto out;
>>
>> -   args.in.h.opcode = FUSE_READLINK;
>> -   args.in.h.nodeid = get_node_id(inode);
>> -   args.out.argvar = 1;
>> -   args.out.numargs = 1;
>> -   args.out.args[0].size = PAGE_SIZE - 1;
>> -   args.out.args[0].value = link;
>> -   ret = fuse_simple_request(fc, &args);
>> -   if (ret < 0) {
>> -   kfree(link);
>> -   link = ERR_PTR(ret);
>> -   } else {
>> -   link[ret] = '\0';
>> -   set_delayed_call(done, kfree_link, link);
>> +   req = fuse_get_req(fc, 1);
>> +   if (IS_ERR(req)) {
>> +   err = PTR_ERR(req);
>> +   goto out;
>> }
>> +
>> +   req->out.page_zeroing = 1;
>> +   req->out.argpages = 1;
>> +   req->num_pages = 1;
>> +   req->pages[0] = page;
>> +   req->page_descs[0].length = PAGE_SIZE - 1;
>> +   req->in.h.opcode = FUSE_READLINK;
>> +   req->in.h.nodeid = get_node_id(inode);
>> +   req->out.argvar = 1;
>> +   req->out.numargs = 1;
>> +   req->out.args[0].size = PAGE_SIZE - 1;
>> +   fuse_request_send(fc, req);
>> +   num_read = req->out.args[0].size;
>> +   err = req->out.h.error;
>> +
>> +   if (!err)
>> +   SetPageUptodate(page);
>> +
>> +   fuse_put_request(fc, req);
>> fuse_invalidate_atime(inode);
>> -   return link;
>> +out:
>> +   unlock_page(page);
>> +   return err;
>>  }
>>
>>  static int fuse_dir_open(struct inode *inode, struct file *file)
>> @@ -1855,7 +1863,7 @@ static const struct inode_operations 
>> fuse_common_inode_operations = {
>>
>>  static const struct inode_operations fuse_symlink_inode_operations = {
>> .setattr= fuse_setattr,
>> -   .get_link   = fuse_get_link,
>> +   .get_link   = page_get_link,
>>

Re: [PATCH] fuse: enable caching of symlinks

2018-10-02 Thread Dan Schatzberg
On 9/27/18, 9:59 AM, "Miklos Szeredi"  wrote:

>  On Thu, Sep 27, 2018 at 3:52 PM, Dan Schatzberg  wrote:
>> FUSE file reads are cached in the page cache, but symlink reads are
>> not. This patch enables FUSE READLINK operations to be cached which
>> can improve performance of some FUSE workloads.
>>
>> In particular, I'm working on a FUSE filesystem for access to source
>> code and discovered that about a 10% improvement to build times is
>> achieved with this patch (there are a lot of symlinks in the source
>> tree).
>>
>> Please provide feedback, I'm looking to flesh this out more.
> 
> Need to think about how/when to invalidate the cached symlink
> contents.  I think treating it like metadata (i.e. look at
> attr_timeout for validity) would be the simplest.
> 
> Thanks,
> Miklos
> 

Any further thoughts on this? Just so I can confirm my understanding,
the attached patch will unconditionally cache READLINK responses.
However, based on the entry timeout provided by the LOOKUP response,
userspace can decide to invalidate this by providing a different inode on
a subsequent LOOKUP (e.g. during fuse_dentry_revalidate()).

Would you like there to be a way to invalidate the cached symlink without
changing the inode via LOOKUP?
>>
>> Signed-off-by: Dan Schatzberg 
>> ---
>>  fs/fuse/dir.c | 70 +++
>>  1 file changed, 43 insertions(+), 27 deletions(-)
>>
>> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
>> index 0979609d6eba..3c0a81ef5bca 100644
>> --- a/fs/fuse/dir.c
>> +++ b/fs/fuse/dir.c
>> @@ -1398,38 +1398,46 @@ static int fuse_readdir(struct file *file, struct 
>> dir_context *ctx)
>> return err;
>>  }
>>
>> -static const char *fuse_get_link(struct dentry *dentry,
>> -struct inode *inode,
>> -struct delayed_call *done)
>> +static int fuse_symlink_readpage(struct file *file, struct page *page)
>>  {
>> +   struct inode *inode = page->mapping->host;
>> struct fuse_conn *fc = get_fuse_conn(inode);
>> -   FUSE_ARGS(args);
>> -   char *link;
>> -   ssize_t ret;
>> -
>> -   if (!dentry)
>> -   return ERR_PTR(-ECHILD);
>> +   struct fuse_req *req;
>> +   int err;
>> +   size_t num_read;
>>
>> -   link = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> -   if (!link)
>> -   return ERR_PTR(-ENOMEM);
>> +   err = -EIO;
>> +   if (is_bad_inode(inode))
>> +   goto out;
>>
>> -   args.in.h.opcode = FUSE_READLINK;
>> -   args.in.h.nodeid = get_node_id(inode);
>> -   args.out.argvar = 1;
>> -   args.out.numargs = 1;
>> -   args.out.args[0].size = PAGE_SIZE - 1;
>> -   args.out.args[0].value = link;
>> -   ret = fuse_simple_request(fc, &args);
>> -   if (ret < 0) {
>> -   kfree(link);
>> -   link = ERR_PTR(ret);
>> -   } else {
>> -   link[ret] = '\0';
>> -   set_delayed_call(done, kfree_link, link);
>> +   req = fuse_get_req(fc, 1);
>> +   if (IS_ERR(req)) {
>> +   err = PTR_ERR(req);
>> +   goto out;
>> }
>> +
>> +   req->out.page_zeroing = 1;
>> +   req->out.argpages = 1;
>> +   req->num_pages = 1;
>> +   req->pages[0] = page;
>> +   req->page_descs[0].length = PAGE_SIZE - 1;
>> +   req->in.h.opcode = FUSE_READLINK;
>> +   req->in.h.nodeid = get_node_id(inode);
>> +   req->out.argvar = 1;
>> +   req->out.numargs = 1;
>> +   req->out.args[0].size = PAGE_SIZE - 1;
>> +   fuse_request_send(fc, req);
>> +   num_read = req->out.args[0].size;
>> +   err = req->out.h.error;
>> +
>> +   if (!err)
>> +   SetPageUptodate(page);
>> +
>> +   fuse_put_request(fc, req);
>> fuse_invalidate_atime(inode);
>> -   return link;
>> +out:
>> +   unlock_page(page);
>> +   return err;
>>  }
>>
>>  static int fuse_dir_open(struct inode *inode, struct file *file)
>> @@ -1855,7 +1863,7 @@ static const struct inode_operations 
>> fuse_common_inode_operations = {
>>
>>  static const struct inode_operations fuse_symlink_inode_operations = {
>> .setattr= fuse_setattr,
>> -   .get_link   = fuse_get_link,
>> +   .get_link   = page_get_link,
&

Re: [PATCH] fuse: enable caching of symlinks

2018-10-02 Thread Dan Schatzberg
On 9/27/18, 9:59 AM, "Miklos Szeredi"  wrote:

>  On Thu, Sep 27, 2018 at 3:52 PM, Dan Schatzberg  wrote:
>> FUSE file reads are cached in the page cache, but symlink reads are
>> not. This patch enables FUSE READLINK operations to be cached which
>> can improve performance of some FUSE workloads.
>>
>> In particular, I'm working on a FUSE filesystem for access to source
>> code and discovered that about a 10% improvement to build times is
>> achieved with this patch (there are a lot of symlinks in the source
>> tree).
>>
>> Please provide feedback, I'm looking to flesh this out more.
> 
> Need to think about how/when to invalidate the cached symlink
> contents.  I think treating it like metadata (i.e. look at
> attr_timeout for validity) would be the simplest.
> 
> Thanks,
> Miklos
> 

Any further thoughts on this? Just so I can confirm my understanding,
the attached patch will unconditionally cache READLINK responses
However, based on the entry timeout provided by the LOOKUP response,
Userspace can decide to invalidate this by providing a different inode on
a subsequent LOOKUP (e.g. during fuse_dentry_revalidate())

Would you like there to be a way to invalidate the cached symlink without
 changing the inode via LOOKUP?
>>
>> Signed-off-by: Dan Schatzberg 
>> ---
>>  fs/fuse/dir.c | 70 +++
>>  1 file changed, 43 insertions(+), 27 deletions(-)
>>
>> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
>> index 0979609d6eba..3c0a81ef5bca 100644
>> --- a/fs/fuse/dir.c
>> +++ b/fs/fuse/dir.c
>> @@ -1398,38 +1398,46 @@ static int fuse_readdir(struct file *file, struct 
>> dir_context *ctx)
>> return err;
>>  }
>>
>> -static const char *fuse_get_link(struct dentry *dentry,
>> -struct inode *inode,
>> -struct delayed_call *done)
>> +static int fuse_symlink_readpage(struct file *file, struct page *page)
>>  {
>> +   struct inode *inode = page->mapping->host;
>> struct fuse_conn *fc = get_fuse_conn(inode);
>> -   FUSE_ARGS(args);
>> -   char *link;
>> -   ssize_t ret;
>> -
>> -   if (!dentry)
>> -   return ERR_PTR(-ECHILD);
>> +   struct fuse_req *req;
>> +   int err;
>> +   size_t num_read;
>>
>> -   link = kmalloc(PAGE_SIZE, GFP_KERNEL);
>> -   if (!link)
>> -   return ERR_PTR(-ENOMEM);
>> +   err = -EIO;
>> +   if (is_bad_inode(inode))
>> +   goto out;
>>
>> -   args.in.h.opcode = FUSE_READLINK;
>> -   args.in.h.nodeid = get_node_id(inode);
>> -   args.out.argvar = 1;
>> -   args.out.numargs = 1;
>> -   args.out.args[0].size = PAGE_SIZE - 1;
>> -   args.out.args[0].value = link;
>> -   ret = fuse_simple_request(fc, &args);
>> -   if (ret < 0) {
>> -   kfree(link);
>> -   link = ERR_PTR(ret);
>> -   } else {
>> -   link[ret] = '\0';
>> -   set_delayed_call(done, kfree_link, link);
>> +   req = fuse_get_req(fc, 1);
>> +   if (IS_ERR(req)) {
>> +   err = PTR_ERR(req);
>> +   goto out;
>> }
>> +
>> +   req->out.page_zeroing = 1;
>> +   req->out.argpages = 1;
>> +   req->num_pages = 1;
>> +   req->pages[0] = page;
>> +   req->page_descs[0].length = PAGE_SIZE - 1;
>> +   req->in.h.opcode = FUSE_READLINK;
>> +   req->in.h.nodeid = get_node_id(inode);
>> +   req->out.argvar = 1;
>> +   req->out.numargs = 1;
>> +   req->out.args[0].size = PAGE_SIZE - 1;
>> +   fuse_request_send(fc, req);
>> +   num_read = req->out.args[0].size;
>> +   err = req->out.h.error;
>> +
>> +   if (!err)
>> +   SetPageUptodate(page);
>> +
>> +   fuse_put_request(fc, req);
>> fuse_invalidate_atime(inode);
>> -   return link;
>> +out:
>> +   unlock_page(page);
>> +   return err;
>>  }
>>
>>  static int fuse_dir_open(struct inode *inode, struct file *file)
>> @@ -1855,7 +1863,7 @@ static const struct inode_operations fuse_common_inode_operations = {
>>
>>  static const struct inode_operations fuse_symlink_inode_operations = {
>> .setattr= fuse_setattr,
>> -   .get_link   = fuse_get_link,
>> +   .get_link   = page_get_link,

[PATCH] fuse: enable caching of symlinks

2018-09-27 Thread Dan Schatzberg
FUSE file reads are cached in the page cache, but symlink reads are
not. This patch enables FUSE READLINK operations to be cached which
can improve performance of some FUSE workloads.

In particular, I'm working on a FUSE filesystem for access to source
code and discovered that about a 10% improvement to build times is
achieved with this patch (there are a lot of symlinks in the source
tree).
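
To make the workload concrete, below is a minimal userspace
illustration of the pattern that benefits (the path is made up; any
symlink on a FUSE mount will do).  With this patch only the first
readlink() should reach the userspace daemon; the rest are served from
the page cache for as long as the cached attributes remain valid.

#include <stdio.h>
#include <limits.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical symlink on a FUSE mount. */
	const char *link = "/mnt/src/lib";
	char target[PATH_MAX];
	ssize_t n = 0;

	/*
	 * Without caching this is ~1000 FUSE_READLINK round trips; with
	 * caching it is one round trip plus page-cache hits.
	 */
	for (int i = 0; i < 1000; i++) {
		n = readlink(link, target, sizeof(target) - 1);
		if (n < 0) {
			perror("readlink");
			return 1;
		}
	}
	target[n] = '\0';
	printf("target: %s\n", target);
	return 0;
}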

Please provide feedback; I'm looking to flesh this out more.

Signed-off-by: Dan Schatzberg 
---
 fs/fuse/dir.c | 70 +++
 1 file changed, 43 insertions(+), 27 deletions(-)
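
A note on the fuse_init_symlink() hunk below; the reasoning is my own
reading, it is not spelled out in the patch. page_get_link() accesses
the cached page through page_address() and refuses highmem mappings, so
__GFP_HIGHMEM has to be cleared for this mapping, and clearing __GFP_FS
keeps allocation of the symlink page from recursing into filesystem
reclaim, which could otherwise end up waiting on the very daemon that
has to answer the READLINK.  The same code as in the hunk, with those
points as comments:

void fuse_init_symlink(struct inode *inode)
{
	inode->i_op = &fuse_symlink_inode_operations;
	/* Serve ->get_link from the page cache via fuse_symlink_readpage(). */
	inode->i_data.a_ops = &fuse_symlink_aops;
	/*
	 * page_get_link() uses page_address(), so the mapping must not hand
	 * out highmem pages; and no __GFP_FS, so allocating the symlink page
	 * never waits on filesystem reclaim (and, indirectly, on the daemon).
	 */
	mapping_set_gfp_mask(inode->i_mapping,
			     mapping_gfp_constraint(inode->i_mapping,
						    ~(__GFP_FS | __GFP_HIGHMEM)));
}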

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 0979609d6eba..3c0a81ef5bca 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1398,38 +1398,46 @@ static int fuse_readdir(struct file *file, struct dir_context *ctx)
return err;
 }
 
-static const char *fuse_get_link(struct dentry *dentry,
-struct inode *inode,
-struct delayed_call *done)
+static int fuse_symlink_readpage(struct file *file, struct page *page)
 {
+   struct inode *inode = page->mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
-   FUSE_ARGS(args);
-   char *link;
-   ssize_t ret;
-
-   if (!dentry)
-   return ERR_PTR(-ECHILD);
+   struct fuse_req *req;
+   int err;
+   size_t num_read;
 
-   link = kmalloc(PAGE_SIZE, GFP_KERNEL);
-   if (!link)
-   return ERR_PTR(-ENOMEM);
+   err = -EIO;
+   if (is_bad_inode(inode))
+   goto out;
 
-   args.in.h.opcode = FUSE_READLINK;
-   args.in.h.nodeid = get_node_id(inode);
-   args.out.argvar = 1;
-   args.out.numargs = 1;
-   args.out.args[0].size = PAGE_SIZE - 1;
-   args.out.args[0].value = link;
-   ret = fuse_simple_request(fc, &args);
-   if (ret < 0) {
-   kfree(link);
-   link = ERR_PTR(ret);
-   } else {
-   link[ret] = '\0';
-   set_delayed_call(done, kfree_link, link);
+   req = fuse_get_req(fc, 1);
+   if (IS_ERR(req)) {
+   err = PTR_ERR(req);
+   goto out;
}
+
+   req->out.page_zeroing = 1;
+   req->out.argpages = 1;
+   req->num_pages = 1;
+   req->pages[0] = page;
+   req->page_descs[0].length = PAGE_SIZE - 1;
+   req->in.h.opcode = FUSE_READLINK;
+   req->in.h.nodeid = get_node_id(inode);
+   req->out.argvar = 1;
+   req->out.numargs = 1;
+   req->out.args[0].size = PAGE_SIZE - 1;
+   fuse_request_send(fc, req);
+   num_read = req->out.args[0].size;
+   err = req->out.h.error;
+
+   if (!err)
+   SetPageUptodate(page);
+
+   fuse_put_request(fc, req);
fuse_invalidate_atime(inode);
-   return link;
+out:
+   unlock_page(page);
+   return err;
 }
 
 static int fuse_dir_open(struct inode *inode, struct file *file)
@@ -1855,7 +1863,7 @@ static const struct inode_operations fuse_common_inode_operations = {
 
 static const struct inode_operations fuse_symlink_inode_operations = {
.setattr= fuse_setattr,
-   .get_link   = fuse_get_link,
+   .get_link   = page_get_link,
.getattr= fuse_getattr,
.listxattr  = fuse_listxattr,
 };
@@ -1871,7 +1879,15 @@ void fuse_init_dir(struct inode *inode)
inode->i_fop = &fuse_dir_operations;
 }
 
+static const struct address_space_operations fuse_symlink_aops = {
+   .readpage   = fuse_symlink_readpage,
+};
+
 void fuse_init_symlink(struct inode *inode)
 {
inode->i_op = &fuse_symlink_inode_operations;
+   inode->i_data.a_ops = &fuse_symlink_aops;
+   mapping_set_gfp_mask(inode->i_mapping,
+   mapping_gfp_constraint(inode->i_mapping,
+   ~(__GFP_FS | __GFP_HIGHMEM)));
 }
-- 
2.17.1


