Re: [PATCH net-next] page_pool: add a test module for page_pool

2024-09-09 Thread Mina Almasry
On Mon, Sep 9, 2024 at 2:25 AM Yunsheng Lin  wrote:
>
> The test works by pushing pages allocated from the page_pool instance
> into a ptr_ring from a kthread/napi bound to a specified cpu, while a
> kthread/napi bound to another specified cpu pops the pages from the
> ptr_ring and frees them back to the page_pool.
>
> Signed-off-by: Yunsheng Lin 

It seems this test has a correctness part and a performance part.
For the performance part, Jesper has out-of-tree tests for the
page_pool:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/bench_page_pool_simple.c

I have these rebased on top of net-next and use them to verify devmem
& memory-provider performance:
https://github.com/mina/linux/commit/07fd1c04591395d15d83c07298b4d37f6b56157f

My preference here (for the performance part) is to upstream the
out-of-tree tests that Jesper (and probably others) are using, rather
than adding a new performance test that is not as battle-hardened.

--
Thanks,
Mina



Re: [PATCH net-next v3 2/3] net: introduce abstraction for network memory

2024-01-04 Thread Mina Almasry
On Thu, Jan 4, 2024 at 1:44 PM Jakub Kicinski  wrote:
>
> On Thu, 21 Dec 2023 15:44:22 -0800 Mina Almasry wrote:
> > The warning is like so:
> >
> > ./include/net/page_pool/helpers.h: In function ‘page_pool_alloc’:
> > ./include/linux/stddef.h:8:14: warning: returning ‘void *’ from a
> > function with return type ‘netmem_ref’ {aka ‘long unsigned int’} makes
> > integer from pointer without a cast [-Wint-conversion]
> > 8 | #define NULL ((void *)0)
> >   |  ^
> > ./include/net/page_pool/helpers.h:132:24: note: in expansion of macro
> > ‘NULL’
> >   132 | return NULL;
> >   |^~~~
> >
> > And happens in all the code where:
> >
> > netmem_ref func()
> > {
> > return NULL;
> > }
> >
> > It's fixable by changing the return to `return (netmem_ref)NULL;` or
> > `return 0;`, but I feel like netmem_ref should be a type that allows
> > an implicit conversion from NULL.
>
> Why do you think we should be able to cast NULL implicitly?
> netmem_ref is a handle, it could possibly be some form of
> an ID in the future, rather than a pointer. Or have more low
> bits stolen for specific use cases.
>
> unsigned long, and returning 0 as "no handle" makes perfect sense to me.
>
> Note that 0 is a special case, bitwise types are allowed to convert
> to 0/bool and 0 is implicitly allowed to become a bitwise type.
> This will pass without a warning:
>
> typedef unsigned long __bitwise netmem_ref;
>
> netmem_ref some_code(netmem_ref ref)
> {
> // direct test is fine
> if (!ref)
> // 0 "upgrades" without casts
> return 0;
> // 1 does not, we need __force
> return (__force netmem_ref)1 | ref;
> }
>
> The __bitwise annotation will make catching people trying
> to cast to struct page * trivial.
>
> You seem to be trying hard to make struct netmem a thing.
> Perhaps you have a reason I'm not getting?

There are a number of functions that return struct page* today that I
convert to return struct netmem* later in the child devmem series; one
example is something like:

struct page *page_pool_alloc(...); // returns NULL on failure.

becomes:

struct netmem *page_pool_alloc(...); // also returns NULL on failure.

rather than,

netmem_ref page_pool_alloc(...); // returns 0 on failure.

In my mind, having NULL be castable to the new type avoids the
additional code churn of converting a bunch of `return NULL;` statements
to `return 0;`, and the transition from page pointers to netmem pointers
may be easier to do if they're both compatible pointer types.

But that is not a huge blocker or critical point in my mind; I just
thought this approach was preferred. If converting to unsigned long
makes more sense to you, I'll respin it like that and do the `NULL
-> 0` conversion everywhere as needed.
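
For illustration, the conversion in a helper would look roughly like
this (page_pool_alloc_netmem() is a made-up name for the sketch, not an
existing helper):

/* Illustrative sketch only: with typedef unsigned long __bitwise
 * netmem_ref, failure paths return 0 rather than NULL.
 */
netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp)
{
	struct page *page = page_pool_alloc_pages(pool, gfp);

	if (!page)
		return 0;	/* was: return NULL; */

	return page_to_netmem(page);
}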

-- 
Thanks,
Mina



[PATCH net-next v3] vsock/virtio: use skb_frag_*() helpers

2024-01-02 Thread Mina Almasry
Minor fix for virtio: code wanting to access the fields inside an skb
frag should use the skb_frag_*() helpers, instead of accessing the
fields directly. This allows for extensions where the underlying
memory is not a page.

Acked-by: Stefano Garzarella 
Signed-off-by: Mina Almasry 

---

v3:
- Applied Stefano's Acked-by.
- Forked this patch from 'Abstract page from net stack'.

v2:

- Also fix skb_frag_off() + skb_frag_size() (David)
- Did not apply the reviewed-by from Stefano since the patch changed
relatively much.

---
 net/vmw_vsock/virtio_transport.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index f495b9e5186b..1748268e0694 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -153,10 +153,10 @@ virtio_transport_send_pkt_work(struct work_struct *work)
 * 'virt_to_phys()' later to fill the buffer descriptor.
 * We don't touch memory at "virtual" address of this page.
 */
-   va = page_to_virt(skb_frag->bv_page);
+   va = page_to_virt(skb_frag_page(skb_frag));
sg_init_one(sgs[out_sg],
-   va + skb_frag->bv_offset,
-   skb_frag->bv_len);
+   va + skb_frag_off(skb_frag),
+   skb_frag_size(skb_frag));
out_sg++;
}
}
-- 
2.43.0.472.g3155946c3a-goog




Re: [PATCH net-next v3 2/3] net: introduce abstraction for network memory

2023-12-21 Thread Mina Almasry
On Thu, Dec 21, 2023 at 3:23 PM Shakeel Butt  wrote:
>
> On Wed, Dec 20, 2023 at 01:45:01PM -0800, Mina Almasry wrote:
> > Add the netmem_ref type, an abstraction for network memory.
> >
> > To add support for new memory types to the net stack, we must first
> > abstract the current memory type. Currently parts of the net stack
> > use struct page directly:
> >
> > - page_pool
> > - drivers
> > - skb_frag_t
> >
> > Originally the plan was to reuse struct page* for the new memory types,
> > and to set the LSB on the page* to indicate it's not really a page.
> > However, for compiler type checking we need to introduce a new type.
> >
> > netmem_ref is introduced to abstract the underlying memory type. Currently
> > it's a no-op abstraction that is always a struct page underneath. In
> > parallel there is an ongoing effort to add support for devmem to the
> > net stack:
> >
> > https://lore.kernel.org/netdev/20231208005250.2910004-1-almasrym...@google.com/
> >
> > Signed-off-by: Mina Almasry 
> >
> > ---
> >
> > v3:
> >
> > - Modify struct netmem from a union of struct page + new types to an opaque
> >   netmem_ref type.  I went with:
> >
> >   +typedef void *__bitwise netmem_ref;
> >
> >   rather than this that Jakub recommended:
> >
> >   +typedef unsigned long __bitwise netmem_ref;
> >
  Because with the latter the compiler warns about converting NULL to
  netmem_ref. I hope that's ok.
> >
>
> Can you share what the warning was? You might just need __force
> attribute. However you might need this __force a lot. I wonder if you
> can just follow struct encoded_page example verbatim here.
>

The warning is like so:

./include/net/page_pool/helpers.h: In function ‘page_pool_alloc’:
./include/linux/stddef.h:8:14: warning: returning ‘void *’ from a
function with return type ‘netmem_ref’ {aka ‘long unsigned int’} makes
integer from pointer without a cast [-Wint-conversion]
8 | #define NULL ((void *)0)
  |  ^
./include/net/page_pool/helpers.h:132:24: note: in expansion of macro
‘NULL’
  132 | return NULL;
  |^~~~

And happens in all the code where:

netmem_ref func()
{
return NULL;
}

It's fixable by changing the return to `return (netmem_ref)NULL;` or
`return 0;`, but I feel like netmem_ref should be a type that allows
an implicit conversion from NULL.

Also as you (and patchwork) noticed, __bitwise should not be used with
void*; it's only meant for integer types. Sorry I missed that in the
docs and was not running make C=2.
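
For the respin, something along these lines is what I have in mind (a
sketch only, matching Jakub's suggestion; the __force casts are my
assumption and untested):

typedef unsigned long __bitwise netmem_ref;

/* Convert back to the underlying page; always valid while netmem is
 * page-only.
 */
static inline struct page *netmem_to_page(netmem_ref netmem)
{
	return (__force struct page *)netmem;
}

/* A page can always be wrapped as a netmem_ref. */
static inline netmem_ref page_to_netmem(struct page *page)
{
	return (__force netmem_ref)page;
}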

-- 
Thanks,
Mina



[PATCH net-next v3 3/3] net: add netmem_ref to skb_frag_t

2023-12-20 Thread Mina Almasry
Use netmem_ref instead of page in skb_frag_t. Currently netmem_ref
is always a struct page underneath, but the abstraction allows efforts
to add support for skb frags not backed by pages.

There is unfortunately one instance, in kcm, where the skb_frag_t is
assumed to be a bio_vec. For this case, add a debug assert that the skb
frag is indeed backed by a page, and do a cast.

Add skb[_frag]_fill_netmem_*() and skb_add_rx_frag_netmem() helpers so
that the API can be used to create netmem skbs.

Signed-off-by: Mina Almasry 

---

v3:
- Renamed the fields in skb_frag_t.

v2:
- Add skb frag filling helpers.

---
 include/linux/skbuff.h | 92 +-
 net/core/skbuff.c  | 22 +++---
 net/kcm/kcmsock.c  | 10 -
 3 files changed, 89 insertions(+), 35 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7ce38874dbd1..729c95e97be1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -37,6 +37,7 @@
 #endif
 #include 
 #include 
+#include 
 
 /**
  * DOC: skb checksums
@@ -359,7 +360,11 @@ extern int sysctl_max_skb_frags;
  */
 #define GSO_BY_FRAGS   0xFFFF
 
-typedef struct bio_vec skb_frag_t;
+typedef struct skb_frag {
+   netmem_ref netmem;
+   unsigned int len;
+   unsigned int offset;
+} skb_frag_t;
 
 /**
  * skb_frag_size() - Returns the size of a skb fragment
@@ -367,7 +372,7 @@ typedef struct bio_vec skb_frag_t;
  */
 static inline unsigned int skb_frag_size(const skb_frag_t *frag)
 {
-   return frag->bv_len;
+   return frag->len;
 }
 
 /**
@@ -377,7 +382,7 @@ static inline unsigned int skb_frag_size(const skb_frag_t 
*frag)
  */
 static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
 {
-   frag->bv_len = size;
+   frag->len = size;
 }
 
 /**
@@ -387,7 +392,7 @@ static inline void skb_frag_size_set(skb_frag_t *frag, 
unsigned int size)
  */
 static inline void skb_frag_size_add(skb_frag_t *frag, int delta)
 {
-   frag->bv_len += delta;
+   frag->len += delta;
 }
 
 /**
@@ -397,7 +402,7 @@ static inline void skb_frag_size_add(skb_frag_t *frag, int 
delta)
  */
 static inline void skb_frag_size_sub(skb_frag_t *frag, int delta)
 {
-   frag->bv_len -= delta;
+   frag->len -= delta;
 }
 
 /**
@@ -417,7 +422,7 @@ static inline bool skb_frag_must_loop(struct page *p)
  * skb_frag_foreach_page - loop over pages in a fragment
  *
  * @f: skb frag to operate on
- * @f_off: offset from start of f->bv_page
+ * @f_off: offset from start of f->netmem
  * @f_len: length from f_off to loop over
  * @p: (temp var) current page
  * @p_off: (temp var) offset from start of current page,
@@ -2431,22 +2436,37 @@ static inline unsigned int skb_pagelen(const struct 
sk_buff *skb)
return skb_headlen(skb) + __skb_pagelen(skb);
 }
 
+static inline void skb_frag_fill_netmem_desc(skb_frag_t *frag,
+netmem_ref netmem, int off,
+int size)
+{
+   frag->netmem = netmem;
+   frag->offset = off;
+   skb_frag_size_set(frag, size);
+}
+
 static inline void skb_frag_fill_page_desc(skb_frag_t *frag,
   struct page *page,
   int off, int size)
 {
-   frag->bv_page = page;
-   frag->bv_offset = off;
-   skb_frag_size_set(frag, size);
+   skb_frag_fill_netmem_desc(frag, page_to_netmem(page), off, size);
+}
+
+static inline void __skb_fill_netmem_desc_noacc(struct skb_shared_info *shinfo,
+   int i, netmem_ref netmem,
+   int off, int size)
+{
+   skb_frag_t *frag = &shinfo->frags[i];
+
+   skb_frag_fill_netmem_desc(frag, netmem, off, size);
 }
 
 static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
  int i, struct page *page,
  int off, int size)
 {
-   skb_frag_t *frag = &shinfo->frags[i];
-
-   skb_frag_fill_page_desc(frag, page, off, size);
+   __skb_fill_netmem_desc_noacc(shinfo, i, page_to_netmem(page), off,
+size);
 }
 
 /**
@@ -2462,10 +2482,10 @@ static inline void skb_len_add(struct sk_buff *skb, int 
delta)
 }
 
 /**
- * __skb_fill_page_desc - initialise a paged fragment in an skb
+ * __skb_fill_netmem_desc - initialise a fragment in an skb
  * @skb: buffer containing fragment to be initialised
- * @i: paged fragment index to initialise
- * @page: the page to use for this fragment
+ * @i: fragment index to initialise
+ * @netmem: the netmem to use for this fragment
  * @off: the offset to the data with @page
  * @size: the length of the data
  *
@@ -2474,10 +2494,13 @@ static inline void skb_len_ad

[PATCH net-next v3 2/3] net: introduce abstraction for network memory

2023-12-20 Thread Mina Almasry
Add the netmem_ref type, an abstraction for network memory.

To add support for new memory types to the net stack, we must first
abstract the current memory type. Currently parts of the net stack
use struct page directly:

- page_pool
- drivers
- skb_frag_t

Originally the plan was to reuse struct page* for the new memory types,
and to set the LSB on the page* to indicate it's not really a page.
However, for compiler type checking we need to introduce a new type.

netmem_ref is introduced to abstract the underlying memory type. Currently
it's a no-op abstraction that is always a struct page underneath. In
parallel there is an ongoing effort to add support for devmem to the
net stack:

https://lore.kernel.org/netdev/20231208005250.2910004-1-almasrym...@google.com/

Signed-off-by: Mina Almasry 

---

v3:

- Modify struct netmem from a union of struct page + new types to an opaque
  netmem_ref type.  I went with:

  +typedef void *__bitwise netmem_ref;

  rather than this that Jakub recommended:

  +typedef unsigned long __bitwise netmem_ref;

  Because with the latter the compiler warns about converting NULL to
  netmem_ref. I hope that's ok.

- Add some function docs.

v2:

- Use container_of instead of a type cast (David).
---
 include/net/netmem.h | 41 +
 1 file changed, 41 insertions(+)
 create mode 100644 include/net/netmem.h

diff --git a/include/net/netmem.h b/include/net/netmem.h
new file mode 100644
index ..edd977326203
--- /dev/null
+++ b/include/net/netmem.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Network memory
+ *
+ *     Author: Mina Almasry 
+ */
+
+#ifndef _NET_NETMEM_H
+#define _NET_NETMEM_H
+
+/**
+ * typedef netmem_ref - a nonexistent type marking a reference to generic
+ * network memory.
+ *
+ * A netmem_ref currently is always a reference to a struct page. This
+ * abstraction is introduced so support for new memory types can be added.
+ *
+ * Use the supplied helpers to obtain the underlying memory pointer and fields.
+ */
+typedef void *__bitwise netmem_ref;
+
+/* This conversion fails (returns NULL) if the netmem_ref is not struct page
+ * backed.
+ *
+ * Currently struct page is the only possible netmem, and this helper never
+ * fails.
+ */
+static inline struct page *netmem_to_page(netmem_ref netmem)
+{
+   return (struct page *)netmem;
+}
+
+/* Converting from page to netmem is always safe, because a page can always be
+ * a netmem.
+ */
+static inline netmem_ref page_to_netmem(struct page *page)
+{
+   return (netmem_ref)page;
+}
+
+#endif /* _NET_NETMEM_H */
-- 
2.43.0.472.g3155946c3a-goog




[PATCH net-next v3 1/3] vsock/virtio: use skb_frag_*() helpers

2023-12-20 Thread Mina Almasry
Minor fix for virtio: code wanting to access the fields inside an skb
frag should use the skb_frag_*() helpers, instead of accessing the
fields directly. This allows for extensions where the underlying
memory is not a page.

Signed-off-by: Mina Almasry 

---

v2:

- Also fix skb_frag_off() + skb_frag_size() (David)
- Did not apply the reviewed-by from Stefano since the patch changed
relatively much.

---
 net/vmw_vsock/virtio_transport.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index f495b9e5186b..1748268e0694 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -153,10 +153,10 @@ virtio_transport_send_pkt_work(struct work_struct *work)
 * 'virt_to_phys()' later to fill the buffer descriptor.
 * We don't touch memory at "virtual" address of this page.
 */
-   va = page_to_virt(skb_frag->bv_page);
+   va = page_to_virt(skb_frag_page(skb_frag));
sg_init_one(sgs[out_sg],
-   va + skb_frag->bv_offset,
-   skb_frag->bv_len);
+   va + skb_frag_off(skb_frag),
+   skb_frag_size(skb_frag));
out_sg++;
}
}
-- 
2.43.0.472.g3155946c3a-goog




[PATCH net-next v3 0/3] Abstract page from net stack

2023-12-20 Thread Mina Almasry
Changes in v3:

- Replaced the struct netmem union with an opaque netmem_ref type.
- Added func docs to the netmem helpers and type.
- Renamed the skb_frag_t fields since it's no longer a bio_vec

---

Changes in v2:
- Reverted changes to the page_pool. The page pool now retains the same
  API, so that we don't have to touch many existing drivers. The devmem
  TCP series will include the changes to the page pool.

- Addressed comments.

This series is a prerequisite to the devmem TCP series. For a full
snapshot of the code which includes these changes, feel free to check:

https://github.com/mina/linux/commits/tcpdevmem-rfcv5/

---

Currently these components in the net stack use the struct page
directly:

1. Drivers.
2. Page pool.
3. skb_frag_t.

To add support for new (non struct page) memory types to the net stack, we
must first abstract the current memory type.

Originally the plan was to reuse struct page* for the new memory types,
and to set the LSB on the page* to indicate it's not really a page.
However, for safe compiler type checking we need to introduce a new type.

struct netmem is introduced to abstract the underlying memory type.
Currently it's a no-op abstraction that is always a struct page underneath.
In parallel there is an ongoing effort to add support for devmem to the
net stack:

https://lore.kernel.org/netdev/20231208005250.2910004-1-almasrym...@google.com/

Cc: Jason Gunthorpe 
Cc: Christian König 
Cc: Shakeel Butt 
Cc: Yunsheng Lin 
Cc: Willem de Bruijn 

Mina Almasry (3):
  vsock/virtio: use skb_frag_*() helpers
  net: introduce abstraction for network memory
  net: add netmem_ref to skb_frag_t

 include/linux/skbuff.h   | 92 ++--
 include/net/netmem.h | 41 ++
 net/core/skbuff.c| 22 +---
 net/kcm/kcmsock.c| 10 +++-
 net/vmw_vsock/virtio_transport.c |  6 +--
 5 files changed, 133 insertions(+), 38 deletions(-)
 create mode 100644 include/net/netmem.h

-- 
2.43.0.472.g3155946c3a-goog




Re: [PATCH net-next v2 3/3] net: add netmem_t to skb_frag_t

2023-12-18 Thread Mina Almasry
On Mon, Dec 18, 2023 at 4:39 AM Yunsheng Lin  wrote:
>
> On 2023/12/17 16:09, Mina Almasry wrote:
> > Use netmem_t instead of page directly in skb_frag_t. Currently netmem_t
> > is always a struct page underneath, but the abstraction allows efforts
> > to add support for skb frags not backed by pages.
> >
> > There is unfortunately 1 instance where the skb_frag_t is assumed to be
> > a bio_vec in kcm. For this case, add a debug assert that the skb frag is
> > indeed backed by a page, and do a cast.
> >
> > Add skb[_frag]_fill_netmem_*() and skb_add_rx_frag_netmem() helpers so
> > that the API can be used to create netmem skbs.
> >
> > Signed-off-by: Mina Almasry 
> >
>
> ...
>
> >
> > -typedef struct bio_vec skb_frag_t;
> > +typedef struct skb_frag {
> > + struct netmem *bv_page;
>
> bv_page -> bv_netmem?
>

bv_page, bv_len & bv_offset are indeed all misnomers after this change,
because bv_ refers to bio_vec and skb_frag_t is no longer a bio_vec.
However, I'm hoping the full renaming can be done in a separate series.
Maybe I'll just apply the bv_page -> bv_netmem change; that doesn't seem
like much code churn and it makes things much less confusing.

> > + unsigned int bv_len;
> > + unsigned int bv_offset;
> > +} skb_frag_t;
> >
> >  /**
> >   * skb_frag_size() - Returns the size of a skb fragment
> > @@ -2431,22 +2436,37 @@ static inline unsigned int skb_pagelen(const struct 
> > sk_buff *skb)
> >   return skb_headlen(skb) + __skb_pagelen(skb);
> >  }
> >
>
> ...
>
> >  /**
> > @@ -2462,10 +2482,10 @@ static inline void skb_len_add(struct sk_buff *skb, 
> > int delta)
> >  }
> >
> >  /**
> > - * __skb_fill_page_desc - initialise a paged fragment in an skb
> > + * __skb_fill_netmem_desc - initialise a paged fragment in an skb
> >   * @skb: buffer containing fragment to be initialised
> >   * @i: paged fragment index to initialise
> > - * @page: the page to use for this fragment
> > + * @netmem: the netmem to use for this fragment
> >   * @off: the offset to the data with @page
> >   * @size: the length of the data
> >   *
> > @@ -2474,10 +2494,13 @@ static inline void skb_len_add(struct sk_buff *skb, 
> > int delta)
> >   *
> >   * Does not take any additional reference on the fragment.
> >   */
> > -static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
> > - struct page *page, int off, int size)
> > +static inline void __skb_fill_netmem_desc(struct sk_buff *skb, int i,
> > +   struct netmem *netmem, int off,
> > +   int size)
> >  {
> > - __skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
> > + struct page *page = netmem_to_page(netmem);
> > +
> > + __skb_fill_netmem_desc_noacc(skb_shinfo(skb), i, netmem, off, size);
> >
> >   /* Propagate page pfmemalloc to the skb if we can. The problem is
> >* that not all callers have unique ownership of the page but rely
> > @@ -2485,7 +2508,21 @@ static inline void __skb_fill_page_desc(struct 
> > sk_buff *skb, int i,
> >*/
> >   page = compound_head(page);
> >   if (page_is_pfmemalloc(page))
> > - skb->pfmemalloc = true;
> > + skb->pfmemalloc = true;
>
> Is it possible to introduce netmem_is_pfmemalloc() and netmem_compound_head()
> for netmem,

That is exactly the plan, and I added these helpers in the follow-up
series, which introduces devmem support:

https://patchwork.kernel.org/project/netdevbpf/patch/20231218024024.3516870-8-almasrym...@google.com/

> and have some built-time testing to ensure the implementation
> is the same between page_is_pfmemalloc()/compound_head() and
> netmem_is_pfmemalloc()/netmem_compound_head()?

That doesn't seem desirable to me. It's too hacky IMO to duplicate the
implementation details of the MM stack in the net stack and that is
not the implementation you see in the patch that adds these helpers
above.

> So that we can avoid the
> netmem_to_page() as much as possible, especially in the driver.
>

Agreed.
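
For reference, the page-backed versions of those helpers in that series
are thin wrappers, roughly like this (a sketch of the page-only case,
not the exact devmem-series code):

/* True if the backing page was allocated from pfmemalloc reserves. */
static inline bool netmem_is_pfmemalloc(struct netmem *netmem)
{
	return page_is_pfmemalloc(netmem_to_page(netmem));
}

/* Return the head of the compound page backing this netmem. */
static inline struct netmem *netmem_compound_head(struct netmem *netmem)
{
	return page_to_netmem(compound_head(netmem_to_page(netmem)));
}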

>
> > +}
> > +
> > +static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
> > + struct page *page, int off, int size)
> > +{
> > + __skb_fill_netmem_desc(skb, i, page_to_netmem(page), off, size);
> > +}
> > +
>
> ...
>
> >   */
> >  static inline struct page *skb_frag_page(const skb_frag_t 

[PATCH net-next v2 3/3] net: add netmem_t to skb_frag_t

2023-12-17 Thread Mina Almasry
Use netmem_t instead of page directly in skb_frag_t. Currently netmem_t
is always a struct page underneath, but the abstraction allows efforts
to add support for skb frags not backed by pages.

There is unfortunately one instance, in kcm, where the skb_frag_t is
assumed to be a bio_vec. For this case, add a debug assert that the skb
frag is indeed backed by a page, and do a cast.

Add skb[_frag]_fill_netmem_*() and skb_add_rx_frag_netmem() helpers so
that the API can be used to create netmem skbs.

Signed-off-by: Mina Almasry 

---

v2:
- Add skb frag filling helpers.
---
 include/linux/skbuff.h | 70 --
 net/core/skbuff.c  | 22 +
 net/kcm/kcmsock.c  | 10 --
 3 files changed, 78 insertions(+), 24 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 7ce38874dbd1..03ab13072962 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -37,6 +37,7 @@
 #endif
 #include 
 #include 
+#include 
 
 /**
  * DOC: skb checksums
@@ -359,7 +360,11 @@ extern int sysctl_max_skb_frags;
  */
 #define GSO_BY_FRAGS   0xFFFF
 
-typedef struct bio_vec skb_frag_t;
+typedef struct skb_frag {
+   struct netmem *bv_page;
+   unsigned int bv_len;
+   unsigned int bv_offset;
+} skb_frag_t;
 
 /**
  * skb_frag_size() - Returns the size of a skb fragment
@@ -2431,22 +2436,37 @@ static inline unsigned int skb_pagelen(const struct 
sk_buff *skb)
return skb_headlen(skb) + __skb_pagelen(skb);
 }
 
+static inline void skb_frag_fill_netmem_desc(skb_frag_t *frag,
+struct netmem *netmem, int off,
+int size)
+{
+   frag->bv_page = netmem;
+   frag->bv_offset = off;
+   skb_frag_size_set(frag, size);
+}
+
 static inline void skb_frag_fill_page_desc(skb_frag_t *frag,
   struct page *page,
   int off, int size)
 {
-   frag->bv_page = page;
-   frag->bv_offset = off;
-   skb_frag_size_set(frag, size);
+   skb_frag_fill_netmem_desc(frag, page_to_netmem(page), off, size);
+}
+
+static inline void __skb_fill_netmem_desc_noacc(struct skb_shared_info *shinfo,
+   int i, struct netmem *netmem,
+   int off, int size)
+{
+   skb_frag_t *frag = &shinfo->frags[i];
+
+   skb_frag_fill_netmem_desc(frag, netmem, off, size);
 }
 
 static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
  int i, struct page *page,
  int off, int size)
 {
-   skb_frag_t *frag = &shinfo->frags[i];
-
-   skb_frag_fill_page_desc(frag, page, off, size);
+   __skb_fill_netmem_desc_noacc(shinfo, i, page_to_netmem(page), off,
+size);
 }
 
 /**
@@ -2462,10 +2482,10 @@ static inline void skb_len_add(struct sk_buff *skb, int 
delta)
 }
 
 /**
- * __skb_fill_page_desc - initialise a paged fragment in an skb
+ * __skb_fill_netmem_desc - initialise a paged fragment in an skb
  * @skb: buffer containing fragment to be initialised
  * @i: paged fragment index to initialise
- * @page: the page to use for this fragment
+ * @netmem: the netmem to use for this fragment
  * @off: the offset to the data with @page
  * @size: the length of the data
  *
@@ -2474,10 +2494,13 @@ static inline void skb_len_add(struct sk_buff *skb, int 
delta)
  *
  * Does not take any additional reference on the fragment.
  */
-static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
-   struct page *page, int off, int size)
+static inline void __skb_fill_netmem_desc(struct sk_buff *skb, int i,
+ struct netmem *netmem, int off,
+ int size)
 {
-   __skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, off, size);
+   struct page *page = netmem_to_page(netmem);
+
+   __skb_fill_netmem_desc_noacc(skb_shinfo(skb), i, netmem, off, size);
 
/* Propagate page pfmemalloc to the skb if we can. The problem is
 * that not all callers have unique ownership of the page but rely
@@ -2485,7 +2508,21 @@ static inline void __skb_fill_page_desc(struct sk_buff 
*skb, int i,
 */
page = compound_head(page);
if (page_is_pfmemalloc(page))
-   skb->pfmemalloc = true;
+   skb->pfmemalloc = true;
+}
+
+static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
+   struct page *page, int off, int size)
+{
+   __skb_fill_netmem_desc(skb, i, page_to_netmem(page), off, size);
+}
+
+static inline void skb_fill_netmem_desc(struct sk_buff *skb, int i,
+ 

[PATCH net-next v2 2/3] net: introduce abstraction for network memory

2023-12-17 Thread Mina Almasry
Add the netmem_t type, an abstraction for network memory.

To add support for new memory types to the net stack, we must first
abstract the current memory type from the net stack. Currently parts of
the net stack use struct page directly:

- page_pool
- drivers
- skb_frag_t

Originally the plan was to reuse struct page* for the new memory types,
and to set the LSB on the page* to indicate it's not really a page.
However, for compiler type checking we need to introduce a new type.

netmem_t is introduced to abstract the underlying memory type. Currently
it's a no-op abstraction that is always a struct page underneath. In
parallel there is an ongoing effort to add support for devmem to the
net stack:

https://lore.kernel.org/netdev/20231208005250.2910004-1-almasrym...@google.com/

Signed-off-by: Mina Almasry 

---

v2:

- Use container_of instead of a type cast (David).
---
 include/net/netmem.h | 35 +++
 1 file changed, 35 insertions(+)
 create mode 100644 include/net/netmem.h

diff --git a/include/net/netmem.h b/include/net/netmem.h
new file mode 100644
index ..b60b00216704
--- /dev/null
+++ b/include/net/netmem.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * netmem.h
+ *     Author: Mina Almasry 
+ * Copyright (C) 2023 Google LLC
+ */
+
+#ifndef _NET_NETMEM_H
+#define _NET_NETMEM_H
+
+struct netmem {
+   union {
+   struct page page;
+
+   /* Stub to prevent compiler implicitly converting from page*
+* to netmem_t* and vice versa.
+*
+* Other memory type(s) net stack would like to support
+* can be added to this union.
+*/
+   void *addr;
+   };
+};
+
+static inline struct page *netmem_to_page(struct netmem *netmem)
+{
+   return &netmem->page;
+}
+
+static inline struct netmem *page_to_netmem(struct page *page)
+{
+   return container_of(page, struct netmem, page);
+}
+
+#endif /* _NET_NETMEM_H */
-- 
2.43.0.472.g3155946c3a-goog




[PATCH net-next v2 1/3] vsock/virtio: use skb_frag_*() helpers

2023-12-17 Thread Mina Almasry
Minor fix for virtio: code wanting to access the fields inside an skb
frag should use the skb_frag_*() helpers, instead of accessing the
fields directly. This allows for extensions where the underlying
memory is not a page.

Signed-off-by: Mina Almasry 

---

v2:

- Also fix skb_frag_off() + skb_frag_size() (David)
- Did not apply the reviewed-by from Stefano since the patch changed
relatively much.

---
 net/vmw_vsock/virtio_transport.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index f495b9e5186b..1748268e0694 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -153,10 +153,10 @@ virtio_transport_send_pkt_work(struct work_struct *work)
 * 'virt_to_phys()' later to fill the buffer descriptor.
 * We don't touch memory at "virtual" address of this page.
 */
-   va = page_to_virt(skb_frag->bv_page);
+   va = page_to_virt(skb_frag_page(skb_frag));
sg_init_one(sgs[out_sg],
-   va + skb_frag->bv_offset,
-   skb_frag->bv_len);
+   va + skb_frag_off(skb_frag),
+   skb_frag_size(skb_frag));
out_sg++;
}
}
-- 
2.43.0.472.g3155946c3a-goog




[PATCH net-next v2 0/3] Abstract page from net stack

2023-12-17 Thread Mina Almasry
Changes in v2:
- Reverted changes to the page_pool. The page pool now retains the same
  API, so that we don't have to touch many existing drivers. The devmem
  TCP series will include the changes to the page pool.

- Addressed comments.

This series is a prerequisite to the devmem TCP series. For a full
snapshot of the code which includes these changes, feel free to check:

https://github.com/mina/linux/commits/tcpdevmem-rfcv5/

---

Currently these components in the net stack use the struct page
directly:

1. Drivers.
2. Page pool.
3. skb_frag_t.

To add support for new (non struct page) memory types to the net stack, we
must first abstract the current memory type.

Originally the plan was to reuse struct page* for the new memory types,
and to set the LSB on the page* to indicate it's not really a page.
However, for safe compiler type checking we need to introduce a new type.

struct netmem is introduced to abstract the underlying memory type.
Currently it's a no-op abstraction that is always a struct page underneath.
In parallel there is an ongoing effort to add support for devmem to the
net stack:

https://lore.kernel.org/netdev/20231208005250.2910004-1-almasrym...@google.com/

Cc: Jason Gunthorpe 
Cc: Christian König 
Cc: Shakeel Butt 
Cc: Yunsheng Lin 
Cc: Willem de Bruijn 

Mina Almasry (3):
  vsock/virtio: use skb_frag_*() helpers
  net: introduce abstraction for network memory
  net: add netmem_t to skb_frag_t

 include/linux/skbuff.h   | 70 
 include/net/netmem.h | 35 
 net/core/skbuff.c| 22 +++---
 net/kcm/kcmsock.c| 10 -
 net/vmw_vsock/virtio_transport.c |  6 +--
 5 files changed, 116 insertions(+), 27 deletions(-)
 create mode 100644 include/net/netmem.h

-- 
2.43.0.472.g3155946c3a-goog




Re: [PATCH v1] virtio_pmem: populate numa information

2022-11-14 Thread Mina Almasry
On Sun, Nov 13, 2022 at 9:44 AM Pankaj Gupta
 wrote:
>
> > > Pankaj Gupta wrote:
> > > > > > > Compute the numa information for a virtio_pmem device from the 
> > > > > > > memory
> > > > > > > range of the device. Previously, the target_node was always 0 
> > > > > > > since
> > > > > > > the ndr_desc.target_node field was never explicitly set. The code 
> > > > > > > for
> > > > > > > computing the numa node is taken from cxl_pmem_region_probe in
> > > > > > > drivers/cxl/pmem.c.
> > > > > > >
> > > > > > > Signed-off-by: Michael Sammler 
> >
> > Tested-by: Mina Almasry 
> >
> > I don't have much expertise on this driver, but with the help of this
> > patch I was able to get memory tiering [1] emulation going on qemu. As
> > far as I know there is no alternative to this emulation, and so I
> > would love to see this or equivalent merged, if possible.
> >
> > This is what I have going to get memory tiering emulation:
> >
> > In qemu, added these configs:
> >   -object 
> > memory-backend-file,id=m4,share=on,mem-path="$path_to_virtio_pmem_file",size=2G
> > \
> >   -smp 2,sockets=2,maxcpus=2  \
> >   -numa node,nodeid=0,memdev=m0 \
> >   -numa node,nodeid=1,memdev=m1 \
> >   -numa node,nodeid=2,memdev=m2,initiator=0 \
> >   -numa node,nodeid=3,initiator=0 \
> >   -device virtio-pmem-pci,memdev=m4,id=nvdimm1 \
> >
> > On boot, ran these commands:
> > ndctl_static create-namespace -e namespace0.0 -m devdax -f 1&> /dev/null
> > echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
> > echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
> > for i in `ls /sys/devices/system/memory/`; do
> >   state=$(cat "/sys/devices/system/memory/$i/state" 2&>/dev/null)
> >   if [ "$state" == "offline" ]; then
> > echo online_movable > "/sys/devices/system/memory/$i/state"
> >   fi
> > done
>
> Nice to see the way to handle the virtio-pmem device memory through kmem 
> driver
> and online the corresponding memory blocks to 'zone_movable'.
>
> This also opens way to use this memory range directly irrespective of attached
> block device. Of course there won't be any persistent data guarantee. But good
> way to simulate memory tiering inside guest as demonstrated below.
> >
> > Without this CL, I see the memory onlined in node 0 always, and is not
> > a separate memory tier. With this CL and qemu configs, the memory is
> > onlined in node 3 and is set as a separate memory tier, which enables
> > qemu-based development:
> >
> > ==> /sys/devices/virtual/memory_tiering/memory_tier22/nodelist <==
> > 3
> > ==> /sys/devices/virtual/memory_tiering/memory_tier4/nodelist <==
> > 0-2
> >
> > AFAIK there is no alternative to enabling memory tiering emulation in
> > qemu, and would love to see this or equivalent merged, if possible.
>
> Just wondering if Qemu vNVDIMM device can also achieve this?
>

I spent a few minutes on this. Please note I'm really not familiar
with these drivers, but as far as I can tell the qemu vNVDIMM device
has the same problem and needs a fix similar to what Michael did
here. What I did with the qemu vNVDIMM device:

- Added these qemu configs:
  -object memory-backend-file,id=m4,share=on,mem-path=./hello,size=2G,readonly=off \
  -device nvdimm,id=nvdimm1,memdev=m4,unarmed=off \

- Ran the same commands in my previous email (they seem to apply to
the vNVDIMM device without modification):
ndctl_static create-namespace -e namespace0.0 -m devdax -f 1&> /dev/null
echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
for i in `ls /sys/devices/system/memory/`; do
  state=$(cat "/sys/devices/system/memory/$i/state" 2&>/dev/null)
  if [ "$state" == "offline" ]; then
echo online_movable > "/sys/devices/system/memory/$i/state"
  fi
done

I see the memory from the vNVDIMM device get onlined on node 0, and it
is not detected as a separate memory tier. I suspect that driver needs
a similar fix to this one.

> In any case, this patch is useful, So,
> Reviewed-by: Pankaj Gupta 
> >
> >
> >
> > [1] 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers
> >
> >

Re: [PATCH v1] virtio_pmem: populate numa information

2022-11-11 Thread Mina Almasry
On Wed, Oct 26, 2022 at 2:50 PM Dan Williams  wrote:
>
> Pankaj Gupta wrote:
> > > > > Compute the numa information for a virtio_pmem device from the memory
> > > > > range of the device. Previously, the target_node was always 0 since
> > > > > the ndr_desc.target_node field was never explicitly set. The code for
> > > > > computing the numa node is taken from cxl_pmem_region_probe in
> > > > > drivers/cxl/pmem.c.
> > > > >
> > > > > Signed-off-by: Michael Sammler 

Tested-by: Mina Almasry 

I don't have much expertise on this driver, but with the help of this
patch I was able to get memory tiering [1] emulation going on qemu. As
far as I know there is no alternative to this emulation, and so I
would love to see this or equivalent merged, if possible.

This is what I have going to get memory tiering emulation:

In qemu, added these configs:
  -object memory-backend-file,id=m4,share=on,mem-path="$path_to_virtio_pmem_file",size=2G \
  -smp 2,sockets=2,maxcpus=2  \
  -numa node,nodeid=0,memdev=m0 \
  -numa node,nodeid=1,memdev=m1 \
  -numa node,nodeid=2,memdev=m2,initiator=0 \
  -numa node,nodeid=3,initiator=0 \
  -device virtio-pmem-pci,memdev=m4,id=nvdimm1 \

On boot, ran these commands:
ndctl_static create-namespace -e namespace0.0 -m devdax -f 1&> /dev/null
echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
for i in `ls /sys/devices/system/memory/`; do
  state=$(cat "/sys/devices/system/memory/$i/state" 2&>/dev/null)
  if [ "$state" == "offline" ]; then
echo online_movable > "/sys/devices/system/memory/$i/state"
  fi
done

Without this CL, I see the memory always onlined in node 0, and it is
not a separate memory tier. With this CL and the qemu configs, the
memory is onlined in node 3 and is set as a separate memory tier, which
enables
qemu-based development:

==> /sys/devices/virtual/memory_tiering/memory_tier22/nodelist <==
3
==> /sys/devices/virtual/memory_tiering/memory_tier4/nodelist <==
0-2

AFAIK there is no alternative to enabling memory tiering emulation in
qemu, and would love to see this or equivalent merged, if possible.


[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/ABI/testing/sysfs-kernel-mm-memory-tiers

> > > > > ---
> > > > >  drivers/nvdimm/virtio_pmem.c | 11 +--
> > > > >  1 file changed, 9 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/nvdimm/virtio_pmem.c 
> > > > > b/drivers/nvdimm/virtio_pmem.c
> > > > > index 20da455d2ef6..a92eb172f0e7 100644
> > > > > --- a/drivers/nvdimm/virtio_pmem.c
> > > > > +++ b/drivers/nvdimm/virtio_pmem.c
> > > > > @@ -32,7 +32,6 @@ static int init_vq(struct virtio_pmem *vpmem)
> > > > >  static int virtio_pmem_probe(struct virtio_device *vdev)
> > > > >  {
> > > > > struct nd_region_desc ndr_desc = {};
> > > > > -   int nid = dev_to_node(&vdev->dev);
> > > > > struct nd_region *nd_region;
> > > > > struct virtio_pmem *vpmem;
> > > > > struct resource res;
> > > > > @@ -79,7 +78,15 @@ static int virtio_pmem_probe(struct virtio_device 
> > > > > *vdev)
> > > > > dev_set_drvdata(&vdev->dev, vpmem->nvdimm_bus);
> > > > >
> > > > > ndr_desc.res = &res;
> > > > > -   ndr_desc.numa_node = nid;
> > > > > +
> > > > > +   ndr_desc.numa_node = memory_add_physaddr_to_nid(res.start);
> > > > > +   ndr_desc.target_node = phys_to_target_node(res.start);
> > > > > +   if (ndr_desc.target_node == NUMA_NO_NODE) {
> > > > > +   ndr_desc.target_node = ndr_desc.numa_node;
> > > > > +   dev_dbg(&vdev->dev, "changing target node from %d to 
> > > > > %d",
> > > > > +   NUMA_NO_NODE, ndr_desc.target_node);
> > > > > +   }
> > > >
> > > > As this memory later gets hotplugged using "devm_memremap_pages". I 
> > > > don't
> > > > see if 'target_node' is used for fsdax case?
> > > >
> > > > It seems to me "target_node" is used mainly for volatile range above
> > > > persistent memory ( e.g kmem driver?).
> > > >
> > > I am not sure if 'target_node' is used in the fsdax case, but it is
> > > indeed used by the devdax/kmem driver when hotplugging the memory (see
> > > 'dev_dax_kmem_probe' and '__dax_pmem_probe').
> >
> > Yes, but not currently for FS_DAX iiuc.
>
> The target_node is only used by the dax_kmem driver. In the FSDAX case
> the memory (persistent or otherwise) is mapped behind a block-device.
> That block-device has affinity to a CPU initiator, but that memory does
> not itself have any NUMA affinity or identity as a target.
>
> So:
>
> block-device NUMA node == closest CPU initiator node to the device
>
> dax-device target node == memory only NUMA node target, after onlining



Re: [PATCH] hugetlb_cgroup: fix reservation accounting

2020-10-28 Thread Mina Almasry
On Thu, Oct 22, 2020 at 5:21 AM Michael S. Tsirkin  wrote:
>
> On Wed, Oct 21, 2020 at 01:44:26PM -0700, Mike Kravetz wrote:
> > Michal Privoznik was using "free page reporting" in QEMU/virtio-balloon
> > with hugetlbfs and hit the warning below.  QEMU with free page hinting
> > uses fallocate(FALLOC_FL_PUNCH_HOLE) to discard pages that are reported
> > as free by a VM. The reporting granularity is in pageblock granularity.
> > So when the guest reports 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE)
> > one huge page in QEMU.
> >
> > [  315.251417] [ cut here ]
> > [  315.251424] WARNING: CPU: 7 PID: 6636 at mm/page_counter.c:57 
> > page_counter_uncharge+0x4b/0x50
> > [  315.251425] Modules linked in: ...
> > [  315.251466] CPU: 7 PID: 6636 Comm: qemu-system-x86 Not tainted 5.9.0 #137
> > [  315.251467] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS 
> > PRO/X570 AORUS PRO, BIOS F21 07/31/2020
> > [  315.251469] RIP: 0010:page_counter_uncharge+0x4b/0x50
> > ...
> > [  315.251479] Call Trace:
> > [  315.251485]  hugetlb_cgroup_uncharge_file_region+0x4b/0x80
> > [  315.251487]  region_del+0x1d3/0x300
> > [  315.251489]  hugetlb_unreserve_pages+0x39/0xb0
> > [  315.251492]  remove_inode_hugepages+0x1a8/0x3d0
> > [  315.251495]  ? tlb_finish_mmu+0x7a/0x1d0
> > [  315.251497]  hugetlbfs_fallocate+0x3c4/0x5c0
> > [  315.251519]  ? kvm_arch_vcpu_ioctl_run+0x614/0x1700 [kvm]
> > [  315.251522]  ? file_has_perm+0xa2/0xb0
> > [  315.251524]  ? inode_security+0xc/0x60
> > [  315.251525]  ? selinux_file_permission+0x4e/0x120
> > [  315.251527]  vfs_fallocate+0x146/0x290
> > [  315.251529]  __x64_sys_fallocate+0x3e/0x70
> > [  315.251531]  do_syscall_64+0x33/0x40
> > [  315.251533]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > ...
> > [  315.251542] ---[ end trace 4c88c62ccb1349c9 ]---
> >
> > Investigation of the issue uncovered bugs in hugetlb cgroup reservation
> > accounting.  This patch addresses the found issues.
> >
> > Fixes: 075a61d07a8e ("hugetlb_cgroup: add accounting for shared mappings")
> > Cc: 
> > Reported-by: Michal Privoznik 
> > Co-developed-by: David Hildenbrand 
> > Signed-off-by: David Hildenbrand 
> > Signed-off-by: Mike Kravetz 
>
> Acked-by: Michael S. Tsirkin 
>
> > ---
> >  mm/hugetlb.c | 20 +++-
> >  1 file changed, 11 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 67fc6383995b..b853a11de14f 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -655,6 +655,8 @@ static long region_del(struct resv_map *resv, long f, 
> > long t)
> >   }
> >
> >   del += t - f;
> > + hugetlb_cgroup_uncharge_file_region(
> > + resv, rg, t - f);
> >
> >   /* New entry for end of split region */
> >   nrg->from = t;
> > @@ -667,9 +669,6 @@ static long region_del(struct resv_map *resv, long f, 
> > long t)
> >   /* Original entry is trimmed */
> >   rg->to = f;
> >
> > - hugetlb_cgroup_uncharge_file_region(
> > - resv, rg, nrg->to - nrg->from);
> > -
> >   list_add(&nrg->link, &rg->link);
> >   nrg = NULL;
> >   break;
> > @@ -685,17 +684,17 @@ static long region_del(struct resv_map *resv, long f, 
> > long t)
> >   }
> >
> >   if (f <= rg->from) {/* Trim beginning of region */
> > - del += t - rg->from;
> > - rg->from = t;
> > -
> >   hugetlb_cgroup_uncharge_file_region(resv, rg,
> >   t - rg->from);
> > - } else {/* Trim end of region */
> > - del += rg->to - f;
> > - rg->to = f;
> >
> > + del += t - rg->from;
> > + rg->from = t;
> > + } else {/* Trim end of region */
> >   hugetlb_cgroup_uncharge_file_region(resv, rg,
> >   rg->to - f);
> > +
> > +     del += rg->to - f;
> > + rg->to = f;
> >   }
> >   }
> >
> > @@ -2454,6 +2453,9 @@ struct page *alloc_huge_page(struct vm_area_struct 
> > *vma,
> >
> >   rsv_adjust = hugepage_subpool_put_pages(spool, 1);
> >   hugetlb_acct_memory(h, -rsv_adjust);
> > + if (deferred_reserve)
> > + hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
> > + pages_per_huge_page(h), page);
> >   }
> >   return page;
> >
> > --
> > 2.25.4
>

Sorry for the late review. Looks good to me.

Reviewed-by: Mina Almasry 


Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2020-10-14 Thread Mina Almasry
On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand  wrote:
>
> On 14.10.20 17:22, David Hildenbrand wrote:
> > Hi everybody,
> >
> > Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
> > with hugetlbfs and reported that this results in [1]
> >
> > 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 
> > page_counter_uncharge+0x4b/0x5
> >
> > 2. Any hugetlbfs allocations failing. (I assume because some accounting is 
> > wrong)
> >
> >
> > QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
> > to discard pages that are reported as free by a VM. The reporting
> > granularity is in pageblock granularity. So when the guest reports
> > 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
> >
> > I was also able to reproduce (also with virtio-mem, which similarly
> > uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
> > (and on v5.7.X from F32).
> >
> > Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
> > is broken with cgroups. I did *not* try without cgroups yet.
> >
> > Any ideas?

Hi David,

I may be able to dig in and take a look. How do I reproduce this,
though? Do I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
hugetlb region?
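
Concretely, is a minimal reproducer along these lines what you have in
mind? (Sketch only; it assumes a 2MB-hugepage hugetlbfs mount at
/mnt/huge, which is a placeholder path, and running inside a cgroup
with the hugetlb controller enabled.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL << 20)

int main(void)
{
	char *p;
	int fd;

	fd = open("/mnt/huge/test", O_CREAT | O_RDWR, 0644);
	if (fd < 0) { perror("open"); return 1; }

	/* Reserve and fault in one huge page. */
	if (ftruncate(fd, HPAGE_SIZE)) { perror("ftruncate"); return 1; }
	p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }
	memset(p, 0, HPAGE_SIZE);
	munmap(p, HPAGE_SIZE);

	/* Punch the huge page back out, as QEMU's free page reporting does. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, HPAGE_SIZE))
		perror("fallocate");

	close(fd);
	return 0;
}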

>
> Just tried without the hugetlb controller, seems to work just fine.
>
> I'd like to note that
> - The controller was not activated
> - I had to compile the hugetlb controller out to make it work.
>
> --
> Thanks,
>
> David / dhildenb
>


Re: [PATCH 1/2] selftests/vm/write_to_hugetlbfs.c: fix unused variable warning

2020-05-18 Thread Mina Almasry
On Sat, May 16, 2020 at 5:12 PM John Hubbard  wrote:
>
> Remove unused variable "i", which was triggering a compiler warning.
>
> Fixes: 29750f71a9b4 ("hugetlb_cgroup: add hugetlb_cgroup reservation tests")
> Cc: Mina Almasry 
> Signed-off-by: John Hubbard 
> ---
>  tools/testing/selftests/vm/write_to_hugetlbfs.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/tools/testing/selftests/vm/write_to_hugetlbfs.c 
> b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> index 110bc4e4015d..6a2caba19ee1 100644
> --- a/tools/testing/selftests/vm/write_to_hugetlbfs.c
> +++ b/tools/testing/selftests/vm/write_to_hugetlbfs.c
> @@ -74,8 +74,6 @@ int main(int argc, char **argv)
> int write = 0;
> int reserve = 1;
>
> -   unsigned long i;
> -
> if (signal(SIGINT, sig_handler) == SIG_ERR)
> err(1, "\ncan't catch SIGINT\n");
>
> --
> 2.26.2
>

Thanks John!

Reviewed-By: Mina Almasry 


Re: [PATCH v6 5/9] hugetlb: disable region_add file_region coalescing

2019-10-21 Thread Mina Almasry
On Mon, Oct 21, 2019 at 12:02 PM Mike Kravetz  wrote:
>
> On 10/12/19 5:30 PM, Mina Almasry wrote:
> > A follow up patch in this series adds hugetlb cgroup uncharge info to the
> > file_region entries in resv->regions. The cgroup uncharge info may
> > differ for different regions, so they can no longer be coalesced at
> > region_add time. So, disable region coalescing in region_add in this
> > patch.
> >
> > Behavior change:
> >
> > Say a resv_map exists like this [0->1], [2->3], and [5->6].
> >
> > Then a region_chg/add call comes in region_chg/add(f=0, t=5).
> >
> > Old code would generate resv->regions: [0->5], [5->6].
> > New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
> > [5->6].
> >
> > Special care needs to be taken to handle the resv->adds_in_progress
> > variable correctly. In the past, only 1 region would be added for every
> > region_chg and region_add call. But now, each call may add multiple
> > regions, so we can no longer increment adds_in_progress by 1 in region_chg,
> > or decrement adds_in_progress by 1 after region_add or region_abort. 
> > Instead,
> > region_chg calls add_reservation_in_range() to count the number of regions
> > needed and allocates those, and that info is passed to region_add and
> > region_abort to decrement adds_in_progress correctly.
> >
> > Signed-off-by: Mina Almasry 
> >
> > ---
> >
> > Changes in v6:
> > - Fix bug in number of region_caches allocated by region_chg
> >
> > ---
> >  mm/hugetlb.c | 256 +--
> >  1 file changed, 147 insertions(+), 109 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 4a60d7d44b4c3..f9c1947925bb9 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> 
> > -static long region_chg(struct resv_map *resv, long f, long t)
> > +static long region_chg(struct resv_map *resv, long f, long t,
> > +long *out_regions_needed)
> >  {
> > + struct file_region *trg = NULL;
> >   long chg = 0;
> >
> > + /* Allocate the maximum number of regions we're going to need for this
> > +  * reservation. The maximum number of regions we're going to need is
> > +  * (t - f) / 2 + 1, which corresponds to a region with alternating
> > +  * reserved and unreserved pages.
> > +  */
> > + *out_regions_needed = (t - f) / 2 + 1;
> > +
> >   spin_lock(&resv->lock);
> > -retry_locked:
> > - resv->adds_in_progress++;
> > +
> > + resv->adds_in_progress += *out_regions_needed;
> >
> >   /*
> >* Check for sufficient descriptors in the cache to accommodate
> >* the number of in progress add operations.
> >*/
> > - if (resv->adds_in_progress > resv->region_cache_count) {
> > - struct file_region *trg;
> > -
> > - VM_BUG_ON(resv->adds_in_progress - resv->region_cache_count > 
> > 1);
> > + while (resv->region_cache_count < resv->adds_in_progress) {
> >   /* Must drop lock to allocate a new descriptor. */
> > - resv->adds_in_progress--;
> >   spin_unlock(&resv->lock);
> > -
> >   trg = kmalloc(sizeof(*trg), GFP_KERNEL);
> >   if (!trg)
> >   return -ENOMEM;
> > @@ -393,9 +395,9 @@ static long region_chg(struct resv_map *resv, long f, 
> > long t)
> >   spin_lock(&resv->lock);
> >   list_add(&trg->link, &resv->region_cache);
> >   resv->region_cache_count++;
> > - goto retry_locked;
> >   }
>
>
> I know that I suggested allocating the worst case number of entries, but this
> is going to be too much of a hit for existing hugetlbfs users.  It is not
> uncommon for DBs to have a shared areas in excess of 1TB mapped by hugetlbfs.
> With this new scheme, the above while loop will allocate over a half million
> file region entries and end up only using one.
>
> I think we need to step back and come up with a different approach.  Let me
> give it some more thought before throwing out ideas that may waste more of
> your time.  Sorry.

No problem at all. The other, more reasonable option is to have
region_add allocate its own cache entries if it needs to; the effect is
that region_add may fail, so the callers must handle that possibility.
That doesn't seem too difficult to handle.

> --
> Mike Kravetz


Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-10-14 Thread Mina Almasry
On Mon, Oct 14, 2019 at 10:33 AM Mike Kravetz  wrote:
>
> On 10/11/19 1:41 PM, Mina Almasry wrote:
> > On Fri, Oct 11, 2019 at 12:10 PM Mina Almasry  
> > wrote:
> >>
> >> On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz  
> >> wrote:
> >>>
> >>> On 9/19/19 3:24 PM, Mina Almasry wrote:
> >>
> >> Mike, note your suggestion above to check if the page hugetlb_cgroup
> >> is null doesn't work if we want to keep the current counter working
> >> the same: the page will always have a hugetlb_cgroup that points to
> >> the cgroup containing the old counter. Any ideas how to apply this new
> >> counter behavior to private NORESERVE mappings? Is there maybe a flag I can
> >> set on the pages at allocation time that I can read on free time to
> >> know whether to uncharge the hugetlb_cgroup or not?
> >
> > Reading the code and asking around a bit, it seems the pointer to the
> > hugetlb_cgroup is in page[2].private. Is it reasonable to use
> > page[3].private to store the hugetlb_cgroup to uncharge for the new
> > counter and increment HUGETLB_CGROUP_MIN_ORDER to 3? I think that
> > would solve my problem. When allocating a private NORESERVE page, set
> > page[3].private to the hugetlb_cgroup to uncharge, then on
> > free_huge_page, check page[3].private, if it is non-NULL, uncharge the
> > new counter on it.
>
> Sorry for not responding sooner.  This approach should work, and it looks like
> you have a v6 of the series.  I'll take a look.
>

Great! Thanks! That's the approach I went with in v6.

> --
> Mike Kravetz


[PATCH v6 9/9] hugetlb_cgroup: Add hugetlb_cgroup reservation docs

2019-10-12 Thread Mina Almasry
Add docs for how to use hugetlb_cgroup reservations, and their behavior.

Signed-off-by: Mina Almasry 
Acked-by: Hillf Danton 

---

Changes in v6:
- Updated docs to reflect the new design based on a new counter that
tracks both reservations and faults.

---
 .../admin-guide/cgroup-v1/hugetlb.rst | 64 +++
 1 file changed, 53 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst 
b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index a3902aa253a96..efb94e4db9d5a 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -2,13 +2,6 @@
 HugeTLB Controller
 ==================

-The HugeTLB controller allows to limit the HugeTLB usage per control group and
-enforces the controller limit during page fault. Since HugeTLB doesn't
-support page reclaim, enforcing the limit at page fault time implies that,
-the application will get SIGBUS signal if it tries to access HugeTLB pages
-beyond its limit. This requires the application to know beforehand how much
-HugeTLB pages it would require for its use.
-
 HugeTLB controller can be created by first mounting the cgroup filesystem.

 # mount -t cgroup -o hugetlb none /sys/fs/cgroup
@@ -28,10 +21,14 @@ process (bash) into it.

 Brief summary of control files::

- hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage
- hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
- hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
- hugetlb.<hugepagesize>.failcnt   # show the number of allocation failure due to HugeTLB limit
+ hugetlb.<hugepagesize>.reservation_limit_in_bytes # set/show limit of "hugepagesize" hugetlb reservations
+ hugetlb.<hugepagesize>.reservation_max_usage_in_bytes # show max "hugepagesize" hugetlb reservations and no-reserve faults.
+ hugetlb.<hugepagesize>.reservation_usage_in_bytes # show current reservations and no-reserve faults for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.reservation_failcnt # show the number of allocation failure due to HugeTLB reservation limit
+ hugetlb.<hugepagesize>.limit_in_bytes # set/show limit of "hugepagesize" hugetlb faults
+ hugetlb.<hugepagesize>.max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded
+ hugetlb.<hugepagesize>.usage_in_bytes # show current usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.failcnt # show the number of allocation failure due to HugeTLB usage limit

 For a system supporting three hugepage sizes (64k, 32M and 1G), the control
 files include::
@@ -40,11 +37,56 @@ files include::
   hugetlb.1GB.max_usage_in_bytes
   hugetlb.1GB.usage_in_bytes
   hugetlb.1GB.failcnt
+  hugetlb.1GB.reservation_limit_in_bytes
+  hugetlb.1GB.reservation_max_usage_in_bytes
+  hugetlb.1GB.reservation_usage_in_bytes
+  hugetlb.1GB.reservation_failcnt
   hugetlb.64KB.limit_in_bytes
   hugetlb.64KB.max_usage_in_bytes
   hugetlb.64KB.usage_in_bytes
   hugetlb.64KB.failcnt
+  hugetlb.64KB.reservation_limit_in_bytes
+  hugetlb.64KB.reservation_max_usage_in_bytes
+  hugetlb.64KB.reservation_usage_in_bytes
+  hugetlb.64KB.reservation_failcnt
   hugetlb.32MB.limit_in_bytes
   hugetlb.32MB.max_usage_in_bytes
   hugetlb.32MB.usage_in_bytes
   hugetlb.32MB.failcnt
+  hugetlb.32MB.reservation_limit_in_bytes
+  hugetlb.32MB.reservation_max_usage_in_bytes
+  hugetlb.32MB.reservation_usage_in_bytes
+  hugetlb.32MB.reservation_failcnt
+
+
+1. Reservation limits
+
+The HugeTLB controller allows to limit the HugeTLB reservations per control
+group and enforces the controller limit at reservation time and at the fault of
+hugetlb memory for which no reservation exists. Reservation limits
+are superior to Page fault limits (see section 2), since Reservation limits are
+enforced at reservation time (on mmap or shmget), and never cause the
+application to get a SIGBUS signal if the memory was reserved beforehand. For
+MAP_NORESERVE allocations, the reservation limit behaves the same as the fault
+limit, enforcing memory usage at fault time and causing the application to
+receive a SIGBUS if it's crossing its limit.
+
+2. Page fault limits
+
+The HugeTLB controller allows to limit the HugeTLB usage (page fault) per
+control group and enforces the controller limit during page fault. Since 
HugeTLB
+doesn't support page reclaim, enforcing the limit at page fault time implies
+that, the application will get SIGBUS signal if it tries to access HugeTLB
+pages beyond its limit. This requires the application to know beforehand how
+much HugeTLB pages it would require for its use.
+
+
+3. Caveats with shared memory
+
+For shared hugetlb memory, both hugetlb reservation and page faults are charged
+to the first task that causes the memory to be reserved or faulted, and all
+subsequent uses of this reserved or faulted memory is done without ch
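
A minimal end-to-end illustration of the files documented above (a sketch, not
part of the patch; the mount point, the "g1" group name, the 2MB hugepage size
and the 10-page limit are all assumptions):

  # mount -t cgroup -o hugetlb none /sys/fs/cgroup
  # mkdir /sys/fs/cgroup/g1
  # echo $((10 * 2 * 1024 * 1024)) > /sys/fs/cgroup/g1/hugetlb.2MB.reservation_limit_in_bytes
  # echo $((10 * 2 * 1024 * 1024)) > /sys/fs/cgroup/g1/hugetlb.2MB.limit_in_bytes
  # echo $$ > /sys/fs/cgroup/g1/tasks
  # cat /sys/fs/cgroup/g1/hugetlb.2MB.reservation_usage_in_bytes
  0

Setting both limits to the same value caps reservations (taken at mmap/shmget
time) and no-reserve faults together, which is one common way to combine the
two limits described in sections 1 and 2 above.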

[PATCH v6 8/9] hugetlb_cgroup: Add hugetlb_cgroup reservation tests

2019-10-12 Thread Mina Almasry
The tests use both shared and private mapped hugetlb memory, and
monitor the hugetlb usage counter as well as the hugetlb reservation
counter. They test different configurations such as hugetlb memory usage
via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
MAP_POPULATE.

Signed-off-by: Mina Almasry 

---

Changes in v6:
- Updates tests for cgroups-v2 and NORESERVE allocations.

---
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 527 ++
 .../selftests/vm/write_hugetlb_memory.sh  |  23 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 261 +
 5 files changed, 813 insertions(+)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

diff --git a/tools/testing/selftests/vm/.gitignore 
b/tools/testing/selftests/vm/.gitignore
index 31b3c98b6d34d..d3bed9407773c 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -14,3 +14,4 @@ virtual_address_range
 gup_benchmark
 va_128TBswitch
 map_fixed_noreplace
+write_to_hugetlbfs
diff --git a/tools/testing/selftests/vm/Makefile 
b/tools/testing/selftests/vm/Makefile
index 9534dc2bc9295..31c2cc5cf30b5 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
 TEST_GEN_FILES += va_128TBswitch
 TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += write_to_hugetlbfs

 TEST_PROGS := run_vmtests

diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh 
b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
new file mode 100755
index 0..278dd6475cd0f
--- /dev/null
+++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
@@ -0,0 +1,527 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+if [[ $(id -u) -ne 0 ]]; then
+   echo "This test must be run as root. Skipping..."
+   exit 0
+fi
+
+cgroup_path=/dev/cgroup/memory
+if [[ ! -e $cgroup_path ]]; then
+  mkdir -p $cgroup_path
+  mount -t cgroup2 none $cgroup_path
+fi
+
+echo "+hugetlb" > /dev/cgroup/memory/cgroup.subtree_control
+
+
+cleanup () {
+   echo $$ > $cgroup_path/cgroup.procs
+
+   if [[ -e /mnt/huge ]]; then
+ rm -rf /mnt/huge/*
+ umount /mnt/huge || echo error
+ rmdir /mnt/huge
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test1
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test2
+   fi
+   echo 0 > /proc/sys/vm/nr_hugepages
+   echo CLEANUP DONE
+}
+
+function expect_equal() {
+  local expected="$1"
+  local actual="$2"
+  local error="$3"
+
+  if [[ "$expected" != "$actual" ]]; then
+   echo "expected ($expected) != actual ($actual): $3"
+   cleanup
+   exit 1
+  fi
+}
+
+function setup_cgroup() {
+  local name="$1"
+  local cgroup_limit="$2"
+  local reservation_limit="$3"
+
+  mkdir $cgroup_path/$name
+
+  echo writing cgroup limit: "$cgroup_limit"
+  echo "$cgroup_limit" > $cgroup_path/$name/hugetlb.2MB.limit_in_bytes
+
+  echo writing reservation limit: "$reservation_limit"
+  echo "$reservation_limit" > \
+   $cgroup_path/$name/hugetlb.2MB.reservation_limit_in_bytes
+
+  if [ -e "$cgroup_path/$name/cpuset.cpus" ]; then
+echo 0 > $cgroup_path/$name/cpuset.cpus
+  fi
+  if [ -e "$cgroup_path/$name/cpuset.mems" ]; then
+echo 0 > $cgroup_path/$name/cpuset.mems
+  fi
+}
+
+function wait_for_hugetlb_memory_to_get_depleted {
+   local cgroup="$1"
+   local 
path="/dev/cgroup/memory/$cgroup/hugetlb.2MB.reservation_usage_in_bytes"
+   # Wait for hugetlbfs memory to get depleted.
+   while [ $(cat $path) != 0 ]; do
+  echo Waiting for hugetlb memory to get depleted.
+  cat $path
+  sleep 0.5
+   done
+}
+
+function wait_for_hugetlb_memory_to_get_reserved {
+   local cgroup="$1"
+   local size="$2"
+
+   local 
path="/dev/cgroup/memory/$cgroup/hugetlb.2MB.reservation_usage_in_bytes"
+   # Wait for hugetlbfs memory to get written.
+   while [ $(cat $path) != $size ]; do
+  echo Waiting for hugetlb memory to reach size $size.
+  cat $path
+  sleep 0.5
+   done
+}
+
+function wait_for_hugetlb_memory_to_get_wri
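
For reference, one way to build and run the new test from a kernel tree (a
sketch; the exact invocation is an assumption, the script requires root, and
it mounts cgroup-v2 at /dev/cgroup/memory itself as shown in its preamble):

  $ make -C tools/testing/selftests/vm
  $ cd tools/testing/selftests/vm
  $ sudo ./charge_reserved_hugetlb.sh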

[PATCH v6 6/9] hugetlb_cgroup: add accounting for shared mappings

2019-10-12 Thread Mina Almasry
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.

After a call to region_chg, we charge the appropriate hugetlb_cgroup, and if
successful, we pass on the hugetlb_cgroup info to a follow up region_add call.
When a file_region entry is added to the resv_map via region_add, we put the
pointer to that cgroup in file_region->reservation_counter. If charging doesn't
succeed, we report the error to the caller, so that the kernel fails the
reservation.

On region_del, which is when the hugetlb memory is unreserved, we also uncharge
the file_region->reservation_counter.

Signed-off-by: Mina Almasry 

---
 mm/hugetlb.c | 147 ---
 1 file changed, 116 insertions(+), 31 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f9c1947925bb9..af336bf227fb6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -242,6 +242,15 @@ struct file_region {
struct list_head link;
long from;
long to;
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* On shared mappings, each reserved region appears as a struct
+* file_region in resv_map. These fields hold the info needed to
+* uncharge each reservation.
+*/
+   struct page_counter *reservation_counter;
+   unsigned long pages_per_hpage;
+#endif
 };

 /* Helper that removes a struct file_region from the resv_map cache and returns
@@ -250,12 +259,30 @@ struct file_region {
 static struct file_region *
 get_file_region_entry_from_cache(struct resv_map *resv, long from, long to);

+/* Helper that records hugetlb_cgroup uncharge info. */
+static void record_hugetlb_cgroup_uncharge_info(struct hugetlb_cgroup *h_cg,
+   struct file_region *nrg,
+   struct hstate *h)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+   if (h_cg) {
+   nrg->reservation_counter =
+   &h_cg->reserved_hugepage[hstate_index(h)];
+   nrg->pages_per_hpage = pages_per_huge_page(h);
+   } else {
+   nrg->reservation_counter = NULL;
+   nrg->pages_per_hpage = 0;
+   }
+#endif
+}
+
 /* Must be called with resv->lock held. Calling this with count_only == true
  * will count the number of pages to be added but will not modify the linked
  * list.
  */
 static long add_reservation_in_range(struct resv_map *resv, long f, long t,
-bool count_only)
+struct hugetlb_cgroup *h_cg,
+struct hstate *h, bool count_only)
 {
long add = 0;
struct list_head *head = &resv->regions;
@@ -291,6 +318,8 @@ static long add_reservation_in_range(struct resv_map *resv, 
long f, long t,
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, rg->from);
+   record_hugetlb_cgroup_uncharge_info(h_cg, nrg,
+   h);
list_add(&nrg->link, rg->link.prev);
}
}
@@ -306,11 +335,13 @@ static long add_reservation_in_range(struct resv_map 
*resv, long f, long t,
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, t);
+   record_hugetlb_cgroup_uncharge_info(h_cg, nrg, h);
list_add(&nrg->link, rg->link.prev);
}
last_accounted_offset = t;
}

+   VM_BUG_ON(add < 0);
return add;
 }

@@ -327,7 +358,8 @@ static long add_reservation_in_range(struct resv_map *resv, 
long f, long t,
  * Return the number of new huge pages added to the map.  This
  * number is greater than or equal to zero.
  */
-static long region_add(struct resv_map *resv, long f, long t,
+static long region_add(struct hstate *h, struct hugetlb_cgroup *h_cg,
+  struct resv_map *resv, long f, long t,
   long regions_needed)
 {
long add = 0;
@@ -336,7 +368,7 @@ static long region_add(struct resv_map *resv, long f, long 
t,

VM_BUG_ON(resv->region_cache_count < regions_needed);

-   add = add_reservation_in_range(resv, f, t, false);
+   add = add_reservation_in_range(resv, f, t, h_cg, h, false);
resv->adds_in_progress -= regions_needed;

spin_unlock(&resv->lock);
@@ -398,7 +430,7 @@ static long region_chg(struct resv_map *resv, long f, long 
t,
}

/* Count how many hugepages in this range are NOT respresented. */
-   chg = add_reservation_in_range(resv, f, t, true);
+   chg

[PATCH v6 7/9] hugetlb_cgroup: support noreserve mappings

2019-10-12 Thread Mina Almasry
Support MAP_NORESERVE accounting as part of the new counter.

For each hugepage allocation, at allocation time we check if there is
a reservation for this allocation or not. If there is a reservation for
this allocation, then this allocation was charged at reservation time,
and we don't re-account it. If there is no reservation for this
allocation, we charge the appropriate hugetlb_cgroup.

The hugetlb_cgroup to uncharge for this allocation is stored in
page[3].private. We use new APIs added in an earlier patch to set this
pointer.

---
 mm/hugetlb.c | 25 -
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index af336bf227fb6..79b99878ce6f9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1217,6 +1217,7 @@ static void update_and_free_page(struct hstate *h, struct 
page *page)
1 << PG_writeback);
}
VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page, false), page);
+   VM_BUG_ON_PAGE(hugetlb_cgroup_from_page(page, true), page);
set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
set_page_refcounted(page);
if (hstate_is_gigantic(h)) {
@@ -1328,6 +1329,9 @@ void free_huge_page(struct page *page)
clear_page_huge_active(page);
hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h),
 page, false);
+   hugetlb_cgroup_uncharge_page(hstate_index(h), pages_per_huge_page(h),
+page, true);
+
if (restore_reserve)
h->resv_huge_pages++;

@@ -1354,6 +1358,7 @@ static void prep_new_huge_page(struct hstate *h, struct 
page *page, int nid)
set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
spin_lock(&hugetlb_lock);
set_hugetlb_cgroup(page, NULL, false);
+   set_hugetlb_cgroup(page, NULL, true);
h->nr_huge_pages++;
h->nr_huge_pages_node[nid]++;
spin_unlock(&hugetlb_lock);
@@ -2155,10 +2160,19 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
gbl_chg = 1;
}

+   /* If this allocation is not consuming a reservation, charge it now.
+*/
+   if (map_chg || avoid_reserve || !vma_resv_map(vma)) {
+   ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h),
+  &h_cg, true);
+   if (ret)
+   goto out_subpool_put;
+   }
+
ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg,
   false);
if (ret)
-   goto out_subpool_put;
+   goto out_uncharge_cgroup_reservation;

spin_lock(&hugetlb_lock);
/*
@@ -2182,6 +2196,11 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
}
hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page,
 false);
+   if (!vma_resv_map(vma) || map_chg || avoid_reserve) {
+   hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg,
+page, true);
+   }
+
spin_unlock(&hugetlb_lock);

set_page_private(page, (unsigned long)spool);
@@ -2207,6 +2226,10 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 out_uncharge_cgroup:
hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg,
   false);
+out_uncharge_cgroup_reservation:
+   if (map_chg || avoid_reserve || !vma_resv_map(vma))
+   hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h),
+  h_cg, true);
 out_subpool_put:
if (map_chg || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
--
2.23.0.700.g56cf767bdb-goog
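
To see this no-reserve charging from userspace, a rough sketch (the hugetlbfs
and cgroup paths, the pool size and the 2MB hugepage size are assumptions;
fallocate is used here because, as noted in the v5 review discussion, it
allocates hugetlbfs pages without taking reservations):

  # echo 4 > /proc/sys/vm/nr_hugepages
  # mkdir /sys/fs/cgroup/g1
  # echo $$ > /sys/fs/cgroup/g1/tasks
  # mount -t hugetlbfs none /mnt/huge
  # fallocate -l 4M /mnt/huge/file
  # cat /sys/fs/cgroup/g1/hugetlb.2MB.reservation_usage_in_bytes

The counter should read 4194304 here, since the two no-reserve pages are
charged to the new counter at allocation time rather than at reservation time.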


[PATCH v6 4/9] hugetlb_cgroup: add reservation accounting for private mappings

2019-10-12 Thread Mina Almasry
Normally the pointer to the cgroup to uncharge hangs off the struct
page, and gets queried when it's time to free the page. With
hugetlb_cgroup reservations, this is not possible. Because it's possible
for a page to be reserved by one task and actually faulted in by another
task.

The best place to put the hugetlb_cgroup pointer to uncharge for
reservations is in the resv_map. But, because the resv_map has different
semantics for private and shared mappings, the code path to
charge/uncharge shared and private mappings is different. This patch
implements charging and uncharging for private mappings.

For private mappings, the counter to uncharge is in
resv_map->reservation_counter. On initializing the resv_map this is set
to NULL. On reservation of a region in private mapping, the task's
hugetlb_cgroup is charged and the hugetlb_cgroup is placed in
resv_map->reservation_counter.

On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.

Signed-off-by: Mina Almasry 
Acked-by: Hillf Danton 

---
 include/linux/hugetlb.h|  8 +++
 include/linux/hugetlb_cgroup.h | 11 +
 mm/hugetlb.c   | 44 +-
 mm/hugetlb_cgroup.c| 12 --
 4 files changed, 62 insertions(+), 13 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 9c49a0ba894d3..36dcda7be4b0e 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -46,6 +46,14 @@ struct resv_map {
long adds_in_progress;
struct list_head region_cache;
long region_cache_count;
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* On private mappings, the counter to uncharge reservations is stored
+* here. If these fields are 0, then the mapping is shared.
+*/
+   struct page_counter *reservation_counter;
+   unsigned long pages_per_hpage;
+#endif
 };
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 1bb58a63af586..f6e3d74a02536 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -25,6 +25,17 @@ struct hugetlb_cgroup;
 #define HUGETLB_CGROUP_MIN_ORDER 3

 #ifdef CONFIG_CGROUP_HUGETLB
+struct hugetlb_cgroup {
+   struct cgroup_subsys_state css;
+   /*
+* the counter to account for hugepages from hugetlb.
+*/
+   struct page_counter hugepage[HUGE_MAX_HSTATE];
+   /*
+* the counter to account for hugepage reservations from hugetlb.
+*/
+   struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
+};

 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page 
*page,
  bool reserved)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 324859170463b..4a60d7d44b4c3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -665,6 +665,16 @@ struct resv_map *resv_map_alloc(void)
INIT_LIST_HEAD(&resv_map->regions);

resv_map->adds_in_progress = 0;
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* Initialize these to 0. On shared mappings, 0's here indicate these
+* fields don't do cgroup accounting. On private mappings, these will be
+* re-initialized to the proper values, to indicate that hugetlb cgroup
+* reservations are to be un-charged from here.
+*/
+   resv_map->reservation_counter = NULL;
+   resv_map->pages_per_hpage = 0;
+#endif

INIT_LIST_HEAD(&resv_map->region_cache);
list_add(&rg->link, &resv_map->region_cache);
@@ -3217,7 +3227,18 @@ static void hugetlb_vm_op_close(struct vm_area_struct 
*vma)

reserve = (end - start) - region_count(resv, start, end);

-   kref_put(&resv->refs, resv_map_release);
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* Since we check for HPAGE_RESV_OWNER above, this must be a private
+* mapping, and these values should be non-zero, and should point to
+* the hugetlb_cgroup counter to uncharge for this reservation.
+*/
+   WARN_ON(!resv->reservation_counter);
+   WARN_ON(!resv->pages_per_hpage);
+
+   hugetlb_cgroup_uncharge_counter(resv->reservation_counter,
+   (end - start) * resv->pages_per_hpage);
+#endif

if (reserve) {
/*
@@ -3227,6 +3248,8 @@ static void hugetlb_vm_op_close(struct vm_area_struct 
*vma)
gbl_reserve = hugepage_subpool_put_pages(spool, reserve);
hugetlb_acct_memory(h, -gbl_reserve);
}
+
+   kref_put(&resv->refs, resv_map_release);
 }

 static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
@@ -4560,6 +4583,7 @@ int hugetlb_reserve_pages(struct inode *inode,
struct hstate *h = hstate_inode(inode);
struct hugepage_subpool *spool = 

[PATCH v6 2/9] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations

2019-10-12 Thread Mina Almasry
Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb
usage or hugetlb reservation counter.

Adds a new interface to uncharge a hugetlb_cgroup counter via
hugetlb_cgroup_uncharge_counter.

Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.

Signed-off-by: Mina Almasry 

---
 include/linux/hugetlb_cgroup.h |  67 +-
 mm/hugetlb.c   |  17 +++---
 mm/hugetlb_cgroup.c| 100 +
 3 files changed, 130 insertions(+), 54 deletions(-)

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 063962f6dfc6a..1bb58a63af586 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -22,27 +22,35 @@ struct hugetlb_cgroup;
  * Minimum page order trackable by hugetlb cgroup.
  * At least 3 pages are necessary for all the tracking information.
  */
-#define HUGETLB_CGROUP_MIN_ORDER   2
+#define HUGETLB_CGROUP_MIN_ORDER 3

 #ifdef CONFIG_CGROUP_HUGETLB

-static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page 
*page)
+static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page 
*page,
+ bool reserved)
 {
VM_BUG_ON_PAGE(!PageHuge(page), page);

if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
return NULL;
-   return (struct hugetlb_cgroup *)page[2].private;
+   if (reserved)
+   return (struct hugetlb_cgroup *)page[3].private;
+   else
+   return (struct hugetlb_cgroup *)page[2].private;
 }

-static inline
-int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
+static inline int set_hugetlb_cgroup(struct page *page,
+struct hugetlb_cgroup *h_cg,
+bool reservation)
 {
VM_BUG_ON_PAGE(!PageHuge(page), page);

if (compound_order(page) < HUGETLB_CGROUP_MIN_ORDER)
return -1;
-   page[2].private = (unsigned long)h_cg;
+   if (reservation)
+   page[3].private = (unsigned long)h_cg;
+   else
+   page[2].private = (unsigned long)h_cg;
return 0;
 }

@@ -52,26 +60,33 @@ static inline bool hugetlb_cgroup_disabled(void)
 }

 extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-   struct hugetlb_cgroup **ptr);
+   struct hugetlb_cgroup **ptr,
+   bool reserved);
 extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
 struct hugetlb_cgroup *h_cg,
-struct page *page);
+struct page *page, bool reserved);
 extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
-struct page *page);
+struct page *page, bool reserved);
+
 extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
-  struct hugetlb_cgroup *h_cg);
+  struct hugetlb_cgroup *h_cg,
+  bool reserved);
+extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
+   unsigned long nr_pages);
+
 extern void hugetlb_cgroup_file_init(void) __init;
 extern void hugetlb_cgroup_migrate(struct page *oldhpage,
   struct page *newhpage);

 #else
-static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page 
*page)
+static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page 
*page,
+ bool reserved)
 {
return NULL;
 }

-static inline
-int set_hugetlb_cgroup(struct page *page, struct hugetlb_cgroup *h_cg)
+static inline int set_hugetlb_cgroup(struct page *page,
+struct hugetlb_cgroup *h_cg, bool reserved)
 {
return 0;
 }
@@ -81,28 +96,30 @@ static inline bool hugetlb_cgroup_disabled(void)
return true;
 }

-static inline int
-hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-struct hugetlb_cgroup **ptr)
+static inline int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
+  struct hugetlb_cgroup **ptr,
+  bool reserved)
 {
return 0;
 }

-static inline void
-hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
-struct hugetlb_cgroup *h_cg,
-struct page *page)
+static inline void hugetlb_cgroup_commit_charge(int idx, unsigne

[PATCH v6 1/9] hugetlb_cgroup: Add hugetlb_cgroup reservation counter

2019-10-12 Thread Mina Almasry
M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- shmoverride_linked_static (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked_static (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so shmoverride_unlinked (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so HUGETLB_SHM=yes shmoverride_unlinked (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Signed-off-by: Mina Almasry 
Acked-by: Hillf Danton 

---
 include/linux/hugetlb.h |  23 -
 mm/hugetlb_cgroup.c | 111 ++--
 2 files changed, 107 insertions(+), 27 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 53fc34f930d08..9c49a0ba894d3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -320,6 +320,27 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, 
unsigned long addr,

 #ifdef CONFIG_HUGETLB_PAGE

+enum {
+   /* Tracks hugetlb memory faulted in. */
+   HUGETLB_RES_USAGE,
+   /* Tracks hugetlb memory reserved. */
+   HUGETLB_RES_RESERVATION_USAGE,
+   /* Limit for hugetlb memory faulted in. */
+   HUGETLB_RES_LIMIT,
+   /* Limit for hugetlb memory reserved. */
+   HUGETLB_RES_RESERVATION_LIMIT,
+   /* Max usage for hugetlb memory faulted in. */
+   HUGETLB_RES_MAX_USAGE,
+   /* Max usage for hugetlb memory reserved. */
+   HUGETLB_RES_RESERVATION_MAX_USAGE,
+   /* Faulted memory accounting fail count. */
+   HUGETLB_RES_FAILCNT,
+   /* Reserved memory accounting fail count. */
+   HUGETLB_RES_RESERVATION_FAILCNT,
+   HUGETLB_RES_NULL,
+   HUGETLB_RES_MAX,
+};
+
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
@@ -340,7 +361,7 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 #ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
-   struct cftype cgroup_files[5];
+   struct cftype cgroup_files[HUGETLB_RES_MAX];
 #endif
char name[HSTATE_NAME_LEN];
 };
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index f1930fa0b445d..1ed4448ca41d3 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -25,6 +25,10 @@ struct hugetlb_cgroup {
 * the counter to account for hugepages from hugetlb.
 */
struct page_counter hugepage[HUGE_MAX_HSTATE];
+   /*
+* the counter to account for hugepage reservations from hugetlb.
+*/
+   struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
 };

 #define MEMFILE_PRIVATE(x, val)(((x) << 16) | (val))
@@ -33,6 +37,14 @@ struct hugetlb_cgroup {

 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

+static inline struct page_counter *
+hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx, bool reserved)
+{
+   if (reserved)
+   return &h_cg->reserved_hugepage[idx];
+   return &h_cg->hugepage[idx];
+}
+
 static inline
 struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
 {
@@ -254,30 +266,33 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned 
long nr_pages,
return;
 }

-enum {
-   RES_USAGE,
-   RES_LIMIT,
-   RES_MAX_USAGE,
-   RES_FAILCNT,
-};
-
 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
   struct cftype *cft)
 {
struct page_counter *counter;
+   struct page_counter *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+   reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];

switch (MEMFILE_ATTR(cft->private)) {
-   case RES_USAGE:
+   case HUGETLB_RES_USAGE:
return (u64)page_counter_read(counter) * PAGE_SIZE;
-   case RES_LIMIT:
+   case HUGETLB_RES_RESERVATION_USAGE:
+   return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
+   case HUGETLB_RES_LIMIT:
return (u64)counter->max * PAGE_SIZE;
-   case RES_MAX_USAGE:
+   case HUGETLB_RES_RESERVATION_LIMIT:
+   return (u64)reserved_counter->max * PAGE_SIZE;
+   case HUGETLB_RES_MAX_USAGE:
return (u64)counter->watermark * PAGE_SIZE;
-   case RES_FAILCNT:
+   case HUGETLB_RES_RESERVATION_MAX_USAGE:
+   return (u64)reserved_counter->watermark * PAGE_SIZE;
+   case HUGETLB_RES_FAILCNT:
return counter->failcnt;
+   case HUGETLB_RES_RESERVATION_FAILCNT:
+   return reserved_counter->failcnt;
default:
BUG()

[PATCH v6 3/9] hugetlb_cgroup: add cgroup-v2 support

2019-10-12 Thread Mina Almasry
---
 mm/hugetlb_cgroup.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 854117513979b..ac1500205faf7 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -503,8 +503,13 @@ static void __init __hugetlb_cgroup_file_init(int idx)
cft = &h->cgroup_files[HUGETLB_RES_NULL];
memset(cft, 0, sizeof(*cft));

-   WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
- h->cgroup_files));
+   if (cgroup_subsys_on_dfl(hugetlb_cgrp_subsys)) {
+   WARN_ON(cgroup_add_dfl_cftypes(&hugetlb_cgrp_subsys,
+  h->cgroup_files));
+   } else {
+   WARN_ON(cgroup_add_legacy_cftypes(&hugetlb_cgrp_subsys,
+ h->cgroup_files));
+   }
 }

 void __init hugetlb_cgroup_file_init(void)
@@ -548,8 +553,14 @@ void hugetlb_cgroup_migrate(struct page *oldhpage, struct 
page *newhpage)
return;
 }

+static struct cftype hugetlb_files[] = {
+   {} /* terminate */
+};
+
 struct cgroup_subsys hugetlb_cgrp_subsys = {
.css_alloc  = hugetlb_cgroup_css_alloc,
.css_offline= hugetlb_cgroup_css_offline,
.css_free   = hugetlb_cgroup_css_free,
+   .dfl_cftypes = hugetlb_files,
+   .legacy_cftypes = hugetlb_files,
 };
--
2.23.0.700.g56cf767bdb-goog
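
With this change the same hugetlb control files can be exposed on a cgroup-v2
hierarchy. A minimal sketch of enabling them, mirroring what the selftest in
patch 8/9 does (the mount point, group name and limit value are assumptions):

  # mount -t cgroup2 none /sys/fs/cgroup/unified
  # echo "+hugetlb" > /sys/fs/cgroup/unified/cgroup.subtree_control
  # mkdir /sys/fs/cgroup/unified/g1
  # echo 20971520 > /sys/fs/cgroup/unified/g1/hugetlb.2MB.reservation_limit_in_bytes
  # echo $$ > /sys/fs/cgroup/unified/g1/cgroup.procs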


[PATCH v6 5/9] hugetlb: disable region_add file_region coalescing

2019-10-12 Thread Mina Almasry
A follow up patch in this series adds hugetlb cgroup uncharge info to the
file_region entries in resv->regions. The cgroup uncharge info may
differ for different regions, so they can no longer be coalesced at
region_add time. So, disable region coalescing in region_add in this
patch.

Behavior change:

Say a resv_map exists like this [0->1], [2->3], and [5->6].

Then a region_chg/add call comes in region_chg/add(f=0, t=5).

Old code would generate resv->regions: [0->5], [5->6].
New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
[5->6].

Special care needs to be taken to handle the resv->adds_in_progress
variable correctly. In the past, only 1 region would be added for every
region_chg and region_add call. But now, each call may add multiple
regions, so we can no longer increment adds_in_progress by 1 in region_chg,
or decrement adds_in_progress by 1 after region_add or region_abort. Instead,
region_chg calls add_reservation_in_range() to count the number of regions
needed and allocates those, and that info is passed to region_add and
region_abort to decrement adds_in_progress correctly.

Signed-off-by: Mina Almasry 

---

Changes in v6:
- Fix bug in number of region_caches allocated by region_chg

---
 mm/hugetlb.c | 256 +--
 1 file changed, 147 insertions(+), 109 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4a60d7d44b4c3..f9c1947925bb9 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -244,6 +244,12 @@ struct file_region {
long to;
 };

+/* Helper that removes a struct file_region from the resv_map cache and returns
+ * it for use.
+ */
+static struct file_region *
+get_file_region_entry_from_cache(struct resv_map *resv, long from, long to);
+
 /* Must be called with resv->lock held. Calling this with count_only == true
  * will count the number of pages to be added but will not modify the linked
  * list.
@@ -251,51 +257,61 @@ struct file_region {
 static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 bool count_only)
 {
-   long chg = 0;
+   long add = 0;
struct list_head *head = &resv->regions;
+   long last_accounted_offset = f;
struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;

-   /* Locate the region we are before or in. */
-   list_for_each_entry (rg, head, link)
-   if (f <= rg->to)
-   break;
-
-   /* Round our left edge to the current segment if it encloses us. */
-   if (f > rg->from)
-   f = rg->from;
-
-   chg = t - f;
+   /* In this loop, we essentially handle an entry for the range
+* last_accounted_offset -> rg->from, at every iteration, with some
+* bounds checking.
+*/
+   list_for_each_entry_safe(rg, trg, head, link) {
+   /* Skip irrelevant regions that start before our range. */
+   if (rg->from < f) {
+   /* If this region ends after the last accounted offset,
+* then we need to update last_accounted_offset.
+*/
+   if (rg->to > last_accounted_offset)
+   last_accounted_offset = rg->to;
+   continue;
+   }

-   /* Check for and consume any regions we now overlap with. */
-   nrg = rg;
-   list_for_each_entry_safe (rg, trg, rg->link.prev, link) {
-   if (&rg->link == head)
-   break;
+   /* When we find a region that starts beyond our range, we've
+* finished.
+*/
if (rg->from > t)
break;

-   /* We overlap with this area, if it extends further than
-* us then we must extend ourselves.  Account for its
-* existing reservation.
+   /* Add an entry for last_accounted_offset -> rg->from, and
+* update last_accounted_offset.
 */
-   if (rg->to > t) {
-   chg += rg->to - t;
-   t = rg->to;
+   if (rg->from > last_accounted_offset) {
+   add += rg->from - last_accounted_offset;
+   if (!count_only) {
+   nrg = get_file_region_entry_from_cache(
+   resv, last_accounted_offset, rg->from);
+   list_add(&nrg->link, rg->link.prev);
+   }
}
-   chg -= rg->to - rg->from;

-   if (!count_only && rg != nrg) {
-   list_del(&rg->link);
-   kfree(rg);
-   }
+   la

Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-10-11 Thread Mina Almasry
On Fri, Oct 11, 2019 at 12:10 PM Mina Almasry  wrote:
>
> On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz  wrote:
> >
> > On 9/19/19 3:24 PM, Mina Almasry wrote:
> > > Patch series implements hugetlb_cgroup reservation usage and limits, which
> > > track hugetlb reservations rather than hugetlb memory faulted in. Details 
> > > of
> > > the approach is 1/7.
> >
> > Thanks for your continued efforts Mina.
> >
> > One thing that has bothered me with this approach from the beginning is that
> > hugetlb reservations are related to, but somewhat distinct from hugetlb
> > allocations.  The original (existing) hugetlb cgroup implementation does not
> > take reservations into account.  This is an issue you are trying to address
> > by adding a cgroup support for hugetlb reservations.  However, this new
> > reservation cgroup ignores hugetlb allocations at fault time.
> >
> > I 'think' the whole purpose of any hugetlb cgroup is to manage the 
> > allocation
> > of hugetlb pages.  Both the existing cgroup code and the reservation 
> > approach
> > have what I think are some serious flaws.  Consider a system with 100 
> > hugetlb
> > pages available.  A sysadmin, has two groups A and B and wants to limit 
> > hugetlb
> > usage to 50 pages each.
> >
> > With the existing implementation, a task in group A could create a mmap of
> > 100 pages in size and reserve all 100 pages.  Since the pages are 
> > 'reserved',
> > nobody in group B can allocate ANY huge pages.  This is true even though
> > no pages have been allocated in A (or B).
> >
> > With the reservation implementation, a task in group A could use 
> > MAP_NORESERVE
> > and allocate all 100 pages without taking any reservations.
> >
> > As mentioned in your documentation, it would be possible to use both the
> > existing (allocation) and new reservation cgroups together.  Perhaps if both
> > are setup for the 50/50 split things would work a little better.
> >
> > However, instead of creating a new reservation cgroup how about adding
> > reservation support to the existing allocation cgroup support.  One could
> > even argue that a reservation is an allocation as it sets aside huge pages
> > that can only be used for a specific purpose.  Here is something that
> > may work.
> >
> > Starting with the existing allocation cgroup.
> > - When hugetlb pages are reserved, the cgroup of the task making the
> >   reservations is charged.  Tracking for the charged cgroup is done in the
> >   reservation map in the same way proposed by this patch set.
> > - At page fault time,
> >   - If a reservation already exists for that specific area do not charge the
> > faulting task.  No tracking in page, just the reservation map.
> >   - If no reservation exists, charge the group of the faulting task.  
> > Tracking
> > of this information is in the page itself as implemented today.
> > - When the hugetlb object is removed, compare the reservation map with any
> >   allocated pages.  If cgroup tracking information exists in page, uncharge
> >   that group.  Otherwise, uncharge the group (if any) in the reservation map.
> >
>
> Sorry for the late response here. I've been prototyping the
> suggestions from this conversation:
>
> 1. Supporting cgroup-v2 on the current controller seems trivial.
> Basically just specifying the dfl files seems to do it, and my tests
> on top of cgroup-v2 don't see any problems so far at least. In light
> of this I'm not sure it's best to create a new controller per se.
> Seems like it would duplicate a lot of code with the current
> controller, so I've tentatively just stuck to the plan in my current
> patchset, a new counter on the existing controller.
>
> 2. I've been working on transitioning the new counter to the behavior
> Mike specified in the email I'm responding to. So far I have a flow
> that works for shared mappings but not private mappings:
>
> - On reservation, charge the new counter and store the info in the
> resv_map. The counter gets uncharged when the resv_map entry gets
> removed (works fine).
> - On alloc_huge_page(), check if there is a reservation for the page
> being allocated. If not, charge the new counter and store the
> information in resv_map. The counter still gets uncharged when the
> resv_map entry gets removed.
>
> The above works for all shared mappings and reserved private mappings,
> but I'm having trouble supporting private NORESERVE mappings. Charging
> can work the same as for shared mappings: char

Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-10-11 Thread Mina Almasry
On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz  wrote:
>
> On 9/19/19 3:24 PM, Mina Almasry wrote:
> > Patch series implements hugetlb_cgroup reservation usage and limits, which
> > track hugetlb reservations rather than hugetlb memory faulted in. Details of
> > the approach is 1/7.
>
> Thanks for your continued efforts Mina.
>
> One thing that has bothered me with this approach from the beginning is that
> hugetlb reservations are related to, but somewhat distinct from hugetlb
> allocations.  The original (existing) hugetlb cgroup implementation does not
> take reservations into account.  This is an issue you are trying to address
> by adding a cgroup support for hugetlb reservations.  However, this new
> reservation cgroup ignores hugetlb allocations at fault time.
>
> I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation
> of hugetlb pages.  Both the existing cgroup code and the reservation approach
> have what I think are some serious flaws.  Consider a system with 100 hugetlb
> pages available.  A sysadmin, has two groups A and B and wants to limit 
> hugetlb
> usage to 50 pages each.
>
> With the existing implementation, a task in group A could create a mmap of
> 100 pages in size and reserve all 100 pages.  Since the pages are 'reserved',
> nobody in group B can allocate ANY huge pages.  This is true even though
> no pages have been allocated in A (or B).
>
> With the reservation implementation, a task in group A could use MAP_NORESERVE
> and allocate all 100 pages without taking any reservations.
>
> As mentioned in your documentation, it would be possible to use both the
> existing (allocation) and new reservation cgroups together.  Perhaps if both
> are setup for the 50/50 split things would work a little better.
>
> However, instead of creating a new reservation cgroup how about adding
> reservation support to the existing allocation cgroup support.  One could
> even argue that a reservation is an allocation as it sets aside huge pages
> that can only be used for a specific purpose.  Here is something that
> may work.
>
> Starting with the existing allocation cgroup.
> - When hugetlb pages are reserved, the cgroup of the task making the
>   reservations is charged.  Tracking for the charged cgroup is done in the
>   reservation map in the same way proposed by this patch set.
> - At page fault time,
>   - If a reservation already exists for that specific area do not charge the
> faulting task.  No tracking in page, just the reservation map.
>   - If no reservation exists, charge the group of the faulting task.  Tracking
> of this information is in the page itself as implemented today.
> - When the hugetlb object is removed, compare the reservation map with any
>   allocated pages.  If cgroup tracking information exists in page, uncharge
>   that group.  Otherwise, uncharge the group (if any) in the reservation map.
>

Sorry for the late response here. I've been prototyping the
suggestions from this conversation:

1. Supporting cgroup-v2 on the current controller seems trivial.
Basically just specifying the dfl files seems to do it, and my tests
on top of cgroup-v2 don't see any problems so far at least. In light
of this I'm not sure it's best to create a new controller per se.
Seems like it would duplicate a lot of code with the current
controller, so I've tentatively just stuck to the plan in my current
patchset, a new counter on the existing controller.

2. I've been working on transitioning the new counter to the behavior
Mike specified in the email I'm responding to. So far I have a flow
that works for shared mappings but not private mappings:

- On reservation, charge the new counter and store the info in the
resv_map. The counter gets uncharged when the resv_map entry gets
removed (works fine).
- On alloc_huge_page(), check if there is a reservation for the page
being allocated. If not, charge the new counter and store the
information in resv_map. The counter still gets uncharged when the
resv_map entry gets removed.

The above works for all shared mappings and reserved private mappings,
but I'm having trouble supporting private NORESERVE mappings. Charging
can work the same as for shared mappings: charge the new counter on
reservation and on allocations that do not have a reservation. But the
question still comes up: where to store the counter to uncharge this
page? I thought of a couple of things that don't seem to work:

1. I thought of putting the counter in resv_map->reservation_counter,
so that it gets uncharged on vm_op_close. But, private NORESERVE
mappings don't even have a resv_map allocated for them.

2. I thought of detecting on free_huge_page that the page being freed
belonged to a private NORESERVE mapping, 

Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-27 Thread Mina Almasry
On Fri, Sep 27, 2019 at 2:59 PM Mike Kravetz  wrote:
>
> On 9/26/19 5:55 PM, Mina Almasry wrote:
> > Provided we keep the existing controller untouched, should the new
> > controller track:
> >
> > 1. only reservations, or
> > 2. both reservations and allocations for which no reservations exist
> > (such as the MAP_NORESERVE case)?
> >
> > I like the 'both' approach. Seems to me a counter like that would work
> > automatically regardless of whether the application is allocating
> > hugetlb memory with NORESERVE or not. NORESERVE allocations cannot cut
> > into reserved hugetlb pages, correct?
>
> Correct.  One other easy way to allocate huge pages without reserves
> (that I know is used today) is via the fallocate system call.
>
> >   If so, then applications that
> > allocate with NORESERVE will get sigbused when they hit their limit,
> > and applications that allocate without NORESERVE may get an error at
> > mmap time but will always be within their limits while they access the
> > mmap'd memory, correct?
>
> Correct.  At page allocation time we can easily check to see if a reservation
> exists and not charge.  For any specific page within a hugetlbfs file,
> a charge would happen at mmap time or allocation time.
>
> One exception (that I can think of) to this mmap(RESERVE) will not cause
> a SIGBUS rule is in the case of hole punch.  If someone punches a hole in
> a file, not only do they remove pages associated with the file but the
> reservation information as well.  Therefore, a subsequent fault will be
> the same as an allocation without reservation.
>

I don't think it causes a sigbus. This is the scenario, right:

1. Make cgroup with limit X bytes.
2. Task in cgroup mmaps a file with X bytes, causing the cgroup to get charged
3. A hole of size Y is punched in the file, causing the cgroup to get
uncharged Y bytes.
4. The task faults in memory from the hole, getting charged up to Y
bytes again. But they will be still within their limits.

IIUC userspace only gets sigbus'd if the limit is lowered between
steps 3 and 4, and it's ok if it gets sigbus'd there in my opinion.
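
For reference, the scenario above can be approximated from userspace roughly
like this (a sketch; the paths, sizes, and the use of fallocate in place of
the mmap in step 2 are all assumptions, fallocate being an easy way to
populate and hole-punch a hugetlbfs file from a shell):

  # echo 4 > /proc/sys/vm/nr_hugepages
  # mkdir /sys/fs/cgroup/g1
  # echo $((4 * 1024 * 1024)) > /sys/fs/cgroup/g1/hugetlb.2MB.reservation_limit_in_bytes
  # echo $$ > /sys/fs/cgroup/g1/tasks
  # mount -t hugetlbfs none /mnt/huge
  # fallocate -l 4M /mnt/huge/file
  # fallocate -p -o 0 -l 2M /mnt/huge/file
  # cat /sys/fs/cgroup/g1/hugetlb.2MB.reservation_usage_in_bytes

The first fallocate plays the role of step 2 and charges 4MB, the hole punch
is step 3 and should uncharge 2MB, and faulting the hole back in (step 4)
would charge it again while staying within the 4MB limit.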

> I 'think' the code to remove/truncate a file will work correctly as it
> is today, but I need to think about this some more.
>
> > mmap'd memory, correct? So the 'both' counter seems like a one size
> > fits all.
> >
> > I think the only sticking point left is whether an added controller
> > can support both cgroup-v2 and cgroup-v1. If I could get confirmation
> > on that I'll provide a patchset.
>
> Sorry, but I can not provide cgroup expertise.
> --
> Mike Kravetz


Re: [PATCH v5 4/7] hugetlb: disable region_add file_region coalescing

2019-09-27 Thread Mina Almasry
On Fri, Sep 27, 2019 at 2:44 PM Mike Kravetz  wrote:
>
> On 9/19/19 3:24 PM, Mina Almasry wrote:
> > A follow up patch in this series adds hugetlb cgroup uncharge info to the
> > file_region entries in resv->regions. The cgroup uncharge info may
> > differ for different regions, so they can no longer be coalesced at
> > region_add time. So, disable region coalescing in region_add in this
> > patch.
> >
> > Behavior change:
> >
> > Say a resv_map exists like this [0->1], [2->3], and [5->6].
> >
> > Then a region_chg/add call comes in region_chg/add(f=0, t=5).
> >
> > Old code would generate resv->regions: [0->5], [5->6].
> > New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
> > [5->6].
> >
> > Special care needs to be taken to handle the resv->adds_in_progress
> > variable correctly. In the past, only 1 region would be added for every
> > region_chg and region_add call. But now, each call may add multiple
> > regions, so we can no longer increment adds_in_progress by 1 in region_chg,
> > or decrement adds_in_progress by 1 after region_add or region_abort. 
> > Instead,
> > region_chg calls add_reservation_in_range() to count the number of regions
> > needed and allocates those, and that info is passed to region_add and
> > region_abort to decrement adds_in_progress correctly.
> >
> > Signed-off-by: Mina Almasry 
> >
> > ---
> >  mm/hugetlb.c | 273 +--
> >  1 file changed, 158 insertions(+), 115 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bac1cbdd027c..d03b048084a3 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -244,6 +244,12 @@ struct file_region {
> >   long to;
> >  };
> >
> > +/* Helper that removes a struct file_region from the resv_map cache and 
> > returns
> > + * it for use.
> > + */
> > +static struct file_region *
> > +get_file_region_entry_from_cache(struct resv_map *resv, long from, long 
> > to);
> > +
>
> Instead of the forward declaration, just put the function here.
>
> >  /* Must be called with resv->lock held. Calling this with count_only == 
> > true
> >   * will count the number of pages to be added but will not modify the 
> > linked
> >   * list.
> > @@ -251,51 +257,61 @@ struct file_region {
> >  static long add_reservation_in_range(struct resv_map *resv, long f, long t,
> >bool count_only)
> >  {
> > - long chg = 0;
> > + long add = 0;
> >   struct list_head *head = &resv->regions;
> > + long last_accounted_offset = f;
> >   struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
> >
> > - /* Locate the region we are before or in. */
> > - list_for_each_entry (rg, head, link)
> > - if (f <= rg->to)
> > - break;
> > -
> > - /* Round our left edge to the current segment if it encloses us. */
> > - if (f > rg->from)
> > - f = rg->from;
> > -
> > - chg = t - f;
> > + /* In this loop, we essentially handle an entry for the range
> > +  * last_accounted_offset -> rg->from, at every iteration, with some
> > +  * bounds checking.
> > +  */
> > + list_for_each_entry_safe(rg, trg, head, link) {
> > + /* Skip irrelevant regions that start before our range. */
> > + if (rg->from < f) {
> > + /* If this region ends after the last accounted 
> > offset,
> > +  * then we need to update last_accounted_offset.
> > +  */
> > + if (rg->to > last_accounted_offset)
> > + last_accounted_offset = rg->to;
> > + continue;
> > + }
> >
> > - /* Check for and consume any regions we now overlap with. */
> > - nrg = rg;
> > - list_for_each_entry_safe (rg, trg, rg->link.prev, link) {
> > - if (&rg->link == head)
> > - break;
> > + /* When we find a region that starts beyond our range, we've
> > +  * finished.
> > +  */
> >   if (rg->from > t)
> >   break;
> >
> > - /* We overlap with this area, if it extends further than
> > -  * us then we must

Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-26 Thread Mina Almasry
On Thu, Sep 26, 2019 at 2:23 PM Mike Kravetz  wrote:
>
> On 9/26/19 12:28 PM, David Rientjes wrote:
> > On Tue, 24 Sep 2019, Mina Almasry wrote:
> >
> >>> I personally prefer the one counter approach only for the reason that it
> >>> exposes less information about hugetlb reservations.  I was not around
> >>> for the introduction of hugetlb reservations, but I have fixed several
> >>> issues having to do with reservations.  IMO, reservations should be hidden
> >>> from users as much as possible.  Others may disagree.
> >>>
> >>> I really hope that Aneesh will comment.  He added the existing hugetlb
> >>> cgroup code.  I was not involved in that effort, but it looks like there
> >>> might have been some thought given to reservations in early versions of
> >>> that code.  It would be interesting to get his perspective.
> >>>
> >>> Changes included in patch 4 (disable region_add file_region coalescing)
> >>> would be needed in a one counter approach as well, so I do plan to
> >>> review those changes.
> >>
> >> OK, FWIW, the 1 counter approach should be sufficient for us, so I'm
> >> not really opposed. David, maybe chime in if you see a problem here?
> >> From the perspective of hiding reservations from the user as much as
> >> possible, it is an improvement.
> >>
> >> I'm only wary about changing the behavior of the current controller and having
> >> that regress applications. I'm hoping you and Aneesh can shed light on
> >> this.
> >>
> >
> > I think neither Aneesh nor myself are going to be able to provide a
> > complete answer on the use of hugetlb cgroup today, anybody could be using
> > it without our knowledge and that opens up the possibility that combining
> > the limits would adversely affect a real system configuration.
>
> I agree that nobody can provide complete information on hugetlb cgroup usage
> today.  My interest was in anything Aneesh could remember about development
> of the current cgroup code.  It 'appears' that the idea of including
> reservations or mmap ranges was considered or at least discussed.  But, those
> discussions happened more than 7 years ago and my searches are not providing
> a complete picture.  My hope was that Aneesh may remember those discussions.
>
> > If that is a possibility, I think we need to do some due diligence and try
> > to deprecate allocation limits if possible.  One of the benefits to
> > separate limits is that we can make reasonable steps to deprecating the
> > actual allocation limits, if possible: we could add warnings about the
> > deprecation of allocation limits and see if anybody complains.
> >
> > That could take the form of two separate limits or a tunable in the root
> > hugetlb cgroup that defines whether the limits are for allocation or
> > reservation.
> >
> > Combining them in the first pass seems to be very risky and could cause
> > pain for users that will not detect this during an rc cycle and will
> > report the issue only when their distro gets it.  Then we are left with no
> > alternative other than stable backports and the separation of the limits
> > anyway.
>
> I agree that changing behavior of the existing controller is too risky.
> Such a change is likely to break someone.

I'm glad we're converging on keeping the existing behavior unchanged.

> The more I think about it, the
> best way forward will be to retain the existing controller and create a
> new controller that satisfies the new use cases.

My guess is that a new controller needs to support cgroups-v2, which
is fine. But can a new controller also support v1? Or is there a
requirement that new controllers support *only* v2? I need whatever
solution here to work on v1. Added Tejun to hopefully comment on this.

>The question remains as
> to what that new controller will be.  Does it control reservations only?
> Is it a combination of reservations and allocations?  If a combined
> controller will work for new use cases, that would be my preference.  Of
> course, I have not prototyped such a controller so there may be issues when
> we get into the details.  For a reservation only or combined controller,
> the region_* changes proposed by Mina would be used.

Provided we keep the existing controller untouched, should the new
controller track:

1. only reservations, or
2. both reservations and allocations for which no reservations exist
(such as the MAP_NORESERVE case)?

I like the 'both' approach. Seems to me a counter like that would work
automatically regardless of whether the appli

Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-24 Thread Mina Almasry
On Mon, Sep 23, 2019 at 2:27 PM Mike Kravetz  wrote:
>
> On 9/23/19 12:18 PM, Mina Almasry wrote:
> > On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz  
> > wrote:
> >>
> >> On 9/19/19 3:24 PM, Mina Almasry wrote:
> >>> Patch series implements hugetlb_cgroup reservation usage and limits, which
> >>> track hugetlb reservations rather than hugetlb memory faulted in. Details 
> >>> of
> >>> the approach is 1/7.
> >>
> >> Thanks for your continued efforts Mina.
> >>
> >
> > And thanks for your reviews so far.
> >
> >> One thing that has bothered me with this approach from the beginning is 
> >> that
> >> hugetlb reservations are related to, but somewhat distinct from hugetlb
> >> allocations.  The original (existing) hugetlb cgroup implementation does 
> >> not
> >> take reservations into account.  This is an issue you are trying to address
> >> by adding a cgroup support for hugetlb reservations.  However, this new
> >> reservation cgroup ignores hugetlb allocations at fault time.
> >>
> >> I 'think' the whole purpose of any hugetlb cgroup is to manage the 
> >> allocation
> >> of hugetlb pages.  Both the existing cgroup code and the reservation 
> >> approach
> >> have what I think are some serious flaws.  Consider a system with 100 
> >> hugetlb
> >> pages available.  A sysadmin, has two groups A and B and wants to limit 
> >> hugetlb
> >> usage to 50 pages each.
> >>
> >> With the existing implementation, a task in group A could create a mmap of
> >> 100 pages in size and reserve all 100 pages.  Since the pages are 
> >> 'reserved',
> >> nobody in group B can allocate ANY huge pages.  This is true even though
> >> no pages have been allocated in A (or B).
> >>
> >> With the reservation implementation, a task in group A could use 
> >> MAP_NORESERVE
> >> and allocate all 100 pages without taking any reservations.
> >>
> >> As mentioned in your documentation, it would be possible to use both the
> >> existing (allocation) and new reservation cgroups together.  Perhaps if 
> >> both
> >> are setup for the 50/50 split things would work a little better.
> >>
> >> However, instead of creating a new reservation cgroup how about adding
> >> reservation support to the existing allocation cgroup support.  One could
> >> even argue that a reservation is an allocation as it sets aside huge pages
> >> that can only be used for a specific purpose.  Here is something that
> >> may work.
> >>
> >> Starting with the existing allocation cgroup.
> >> - When hugetlb pages are reserved, the cgroup of the task making the
> >>   reservations is charged.  Tracking for the charged cgroup is done in the
> >>   reservation map in the same way proposed by this patch set.
> >> - At page fault time,
> >>   - If a reservation already exists for that specific area do not charge 
> >> the
> >> faulting task.  No tracking in page, just the reservation map.
> >>   - If no reservation exists, charge the group of the faulting task.  
> >> Tracking
> >> of this information is in the page itself as implemented today.
> >> - When the hugetlb object is removed, compare the reservation map with any
> >>   allocated pages.  If cgroup tracking information exists in page, uncharge
> >>   that group.  Otherwise, uncharge the group (if any) in the reservation 
> >> map.
> >>
> >> One of the advantages of a separate reservation cgroup is that the existing
> >> code is unmodified.  Combining the two provides a more complete/accurate
> >> solution IMO.  But, it has the potential to break existing users.
> >>
> >> I really would like to get feedback from anyone that knows how the existing
> >> hugetlb cgroup controller may be used today.  Comments from Aneesh would
> >> be very welcome to know if reservations were considered in development of 
> >> the
> >> existing code.
> >> --
> >
> > FWIW, I'm aware of the interaction with NORESERVE and my thoughts are:
> >
> > AFAICT, the 2 counter approach we have here is strictly superior to
> > the 1 upgraded counter approach. Consider these points:
> >
> > - From what I can tell so far, everything you can do with the 1
> > counter approach, you can do with the two counter approach by setting
> > both limit_in_bytes and 

Re: [PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-23 Thread Mina Almasry
On Mon, Sep 23, 2019 at 10:47 AM Mike Kravetz  wrote:
>
> On 9/19/19 3:24 PM, Mina Almasry wrote:
> > Patch series implements hugetlb_cgroup reservation usage and limits, which
> > track hugetlb reservations rather than hugetlb memory faulted in. Details of
> > the approach is 1/7.
>
> Thanks for your continued efforts Mina.
>

And thanks for your reviews so far.

> One thing that has bothered me with this approach from the beginning is that
> hugetlb reservations are related to, but somewhat distinct from hugetlb
> allocations.  The original (existing) hugetlb cgroup implementation does not
> take reservations into account.  This is an issue you are trying to address
> by adding a cgroup support for hugetlb reservations.  However, this new
> reservation cgroup ignores hugetlb allocations at fault time.
>
> I 'think' the whole purpose of any hugetlb cgroup is to manage the allocation
> of hugetlb pages.  Both the existing cgroup code and the reservation approach
> have what I think are some serious flaws.  Consider a system with 100 hugetlb
> pages available.  A sysadmin, has two groups A and B and wants to limit 
> hugetlb
> usage to 50 pages each.
>
> With the existing implementation, a task in group A could create a mmap of
> 100 pages in size and reserve all 100 pages.  Since the pages are 'reserved',
> nobody in group B can allocate ANY huge pages.  This is true even though
> no pages have been allocated in A (or B).
>
> With the reservation implementation, a task in group A could use MAP_NORESERVE
> and allocate all 100 pages without taking any reservations.
>
> As mentioned in your documentation, it would be possible to use both the
> existing (allocation) and new reservation cgroups together.  Perhaps if both
> are setup for the 50/50 split things would work a little better.
>
> However, instead of creating a new reservation cgroup how about adding
> reservation support to the existing allocation cgroup support.  One could
> even argue that a reservation is an allocation as it sets aside huge pages
> that can only be used for a specific purpose.  Here is something that
> may work.
>
> Starting with the existing allocation cgroup.
> - When hugetlb pages are reserved, the cgroup of the task making the
>   reservations is charged.  Tracking for the charged cgroup is done in the
>   reservation map in the same way proposed by this patch set.
> - At page fault time,
>   - If a reservation already exists for that specific area do not charge the
> faulting task.  No tracking in page, just the reservation map.
>   - If no reservation exists, charge the group of the faulting task.  Tracking
> of this information is in the page itself as implemented today.
> - When the hugetlb object is removed, compare the reservation map with any
>   allocated pages.  If cgroup tracking information exists in page, uncharge
>   that group.  Otherwise, uncharge the group (if any) in the reservation map.
>
> One of the advantages of a separate reservation cgroup is that the existing
> code is unmodified.  Combining the two provides a more complete/accurate
> solution IMO.  But, it has the potential to break existing users.
>
> I really would like to get feedback from anyone that knows how the existing
> hugetlb cgroup controller may be used today.  Comments from Aneesh would
> be very welcome to know if reservations were considered in development of the
> existing code.
> --

FWIW, I'm aware of the interaction with NORESERVE and my thoughts are:

AFAICT, the 2 counter approach we have here is strictly superior to
the 1 upgraded counter approach. Consider these points:

- From what I can tell so far, everything you can do with the 1
counter approach, you can do with the two counter approach by setting
both limit_in_bytes and reservation_limit_in_bytes to the limit value.
That will limit both reservations and at fault allocations.

- The 2 counter approach preserves existing usage of hugetlb cgroups,
so no need to muck around with reverting the feature some time from
now because of broken users. No existing users of hugetlb cgroups need
to worry about the effect of this on their usage.

- Users that use hugetlb memory strictly through reservations can use
only reservation_limit_in_bytes and enjoy cgroup limits that never
SIGBUS the application. This is our usage for example.

- The 2 counter approach provides more info to the sysadmin. The
sysadmin knows exactly how much reserved bytes there are via
reservation_usage_in_bytes, and how much actually in use bytes there
are via usage_in_bytes. They can even detect NORESERVE usage if
usage_in_bytes > reservation_usage_in_bytes. failcnt shows failed
reservations *and* failed allocations at fault, etc. All around better
debuggability when things go wrong. 
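
As a rough sketch of what the 50/50 split from Mike's example would look
like with both counters (illustrative only, not from the series; it
assumes 2MB hugepages and the hugetlb controller mounted at
/sys/fs/cgroup as in the docs patch):

  for g in A B; do
    mkdir /sys/fs/cgroup/$g
    # 50 pages x 2MB, applied to both counters, so neither reservations
    # nor MAP_NORESERVE faults can push group $g past its share.
    echo $((50 * 2 * 1024 * 1024)) > /sys/fs/cgroup/$g/hugetlb.2MB.reservation_limit_in_bytes
    echo $((50 * 2 * 1024 * 1024)) > /sys/fs/cgroup/$g/hugetlb.2MB.limit_in_bytes
  done
  # NORESERVE usage then shows up as usage_in_bytes > reservation_usage_in_bytes.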

[PATCH v5 0/7] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-19 Thread Mina Almasry
Patch series implements hugetlb_cgroup reservation usage and limits, which
track hugetlb reservations rather than hugetlb memory faulted in. Details of
the approach are in patch 1/7.

Changes in v5:
- Moved the bulk of the description to the first patch in the series.
- Clang formatted the entire series.
- Split off 'hugetlb: remove duplicated code' and 'hugetlb: region_chg provides
  only cache entry' into their own patch series.
- Added comments to HUGETLB_RES enum.
- Fixed bug in 'hugetlb: disable region_add file_region coalescing' calculating
  the wrong number of regions_needed in some cases.
- Changed sleeps in test to proper conditions.
- Misc fixes in test based on shuah@ review.

Changes in v4:
- Split up 'hugetlb_cgroup: add accounting for shared mappings' into 4 patches
  for better isolation and context on the individual changes:
  - hugetlb_cgroup: add accounting for shared mappings
  - hugetlb: disable region_add file_region coalescing
  - hugetlb: remove duplicated code
  - hugetlb: region_chg provides only cache entry
- Fixed resv->adds_in_progress accounting.
- Retained the behavior that region_add never fails; in earlier patchsets region_add
  could return failure.
- Fixed libhugetlbfs failure.
- Minor fix to the added tests that was preventing them from running on some
  environments.

Changes in v3:
- Addressed comments of Hillf Danton:
  - Added docs.
  - cgroup_files now uses enum.
  - Various readability improvements.
- Addressed comments of Mike Kravetz.
  - region_* functions no longer coalesce file_region entries in the resv_map.
  - region_add() and region_chg() refactored to make them much easier to
understand and remove duplicated code so this patch doesn't add too much
complexity.
  - Refactored common functionality into helpers.

Changes in v2:
- Split the patch into a 5 patch series.
- Fixed patch subject.

Mina Almasry (7):
  hugetlb_cgroup: Add hugetlb_cgroup reservation counter
  hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  hugetlb_cgroup: add reservation accounting for private mappings
  hugetlb: disable region_add file_region coalescing
  hugetlb_cgroup: add accounting for shared mappings
  hugetlb_cgroup: Add hugetlb_cgroup reservation tests
  hugetlb_cgroup: Add hugetlb_cgroup reservation docs

 .../admin-guide/cgroup-v1/hugetlb.rst |  85 +++-
 include/linux/hugetlb.h   |  31 +-
 include/linux/hugetlb_cgroup.h|  33 +-
 mm/hugetlb.c  | 423 +++-
 mm/hugetlb_cgroup.c   | 190 ++--
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 461 ++
 .../selftests/vm/write_hugetlb_memory.sh  |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 250 ++
 10 files changed, 1306 insertions(+), 191 deletions(-)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

--
2.23.0.351.gc4317032e6-goog


[PATCH v5 7/7] hugetlb_cgroup: Add hugetlb_cgroup reservation docs

2019-09-19 Thread Mina Almasry
Add docs for how to use hugetlb_cgroup reservations, and their behavior.

Signed-off-by: Mina Almasry 
Acked-by: Hillf Danton 

---
 .../admin-guide/cgroup-v1/hugetlb.rst | 85 ---
 1 file changed, 74 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst 
b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index a3902aa253a9..70c10bd9a0b7 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -2,13 +2,6 @@
 HugeTLB Controller
 ==

-The HugeTLB controller allows to limit the HugeTLB usage per control group and
-enforces the controller limit during page fault. Since HugeTLB doesn't
-support page reclaim, enforcing the limit at page fault time implies that,
-the application will get SIGBUS signal if it tries to access HugeTLB pages
-beyond its limit. This requires the application to know beforehand how much
-HugeTLB pages it would require for its use.
-
 HugeTLB controller can be created by first mounting the cgroup filesystem.

 # mount -t cgroup -o hugetlb none /sys/fs/cgroup
@@ -28,10 +21,14 @@ process (bash) into it.

 Brief summary of control files::

- hugetlb..limit_in_bytes # set/show limit of "hugepagesize" 
hugetlb usage
- hugetlb..max_usage_in_bytes # show max "hugepagesize" hugetlb  
usage recorded
- hugetlb..usage_in_bytes # show current usage for 
"hugepagesize" hugetlb
- hugetlb..failcnt   # show the number of 
allocation failure due to HugeTLB limit
+ hugetlb..reservation_limit_in_bytes # set/show limit of 
"hugepagesize" hugetlb reservations
+ hugetlb..reservation_max_usage_in_bytes # show max 
"hugepagesize" hugetlb reservations recorded
+ hugetlb..reservation_usage_in_bytes # show current 
reservations for "hugepagesize" hugetlb
+ hugetlb..reservation_failcnt# show the number of 
allocation failure due to HugeTLB reservation limit
+ hugetlb..limit_in_bytes # set/show limit of 
"hugepagesize" hugetlb faults
+ hugetlb..max_usage_in_bytes # show max 
"hugepagesize" hugetlb  usage recorded
+ hugetlb..usage_in_bytes # show current usage 
for "hugepagesize" hugetlb
+ hugetlb..failcnt# show the number of 
allocation failure due to HugeTLB usage limit

 For a system supporting three hugepage sizes (64k, 32M and 1G), the control
 files include::
@@ -40,11 +37,77 @@ files include::
   hugetlb.1GB.max_usage_in_bytes
   hugetlb.1GB.usage_in_bytes
   hugetlb.1GB.failcnt
+  hugetlb.1GB.reservation_limit_in_bytes
+  hugetlb.1GB.reservation_max_usage_in_bytes
+  hugetlb.1GB.reservation_usage_in_bytes
+  hugetlb.1GB.reservation_failcnt
   hugetlb.64KB.limit_in_bytes
   hugetlb.64KB.max_usage_in_bytes
   hugetlb.64KB.usage_in_bytes
   hugetlb.64KB.failcnt
+  hugetlb.64KB.reservation_limit_in_bytes
+  hugetlb.64KB.reservation_max_usage_in_bytes
+  hugetlb.64KB.reservation_usage_in_bytes
+  hugetlb.64KB.reservation_failcnt
   hugetlb.32MB.limit_in_bytes
   hugetlb.32MB.max_usage_in_bytes
   hugetlb.32MB.usage_in_bytes
   hugetlb.32MB.failcnt
+  hugetlb.32MB.reservation_limit_in_bytes
+  hugetlb.32MB.reservation_max_usage_in_bytes
+  hugetlb.32MB.reservation_usage_in_bytes
+  hugetlb.32MB.reservation_failcnt
+
+
+1. Reservation limits
+
+The HugeTLB controller allows to limit the HugeTLB reservations per control
+group and enforces the controller limit at reservation time. Reservation limits
+are superior to Page fault limits (see section 2), since Reservation limits are
+enforced at reservation time, and never cause the application to get a SIGBUS
+signal. Instead, if the application is violating its limits, then it gets an
+error at reservation time, i.e. the mmap or shmget call returns an error.
+
+
+2. Page fault limits
+
+The HugeTLB controller allows to limit the HugeTLB usage (page fault) per
+control group and enforces the controller limit during page fault. Since 
HugeTLB
+doesn't support page reclaim, enforcing the limit at page fault time implies
+that, the application will get SIGBUS signal if it tries to access HugeTLB
+pages beyond its limit. This requires the application to know beforehand how
+much HugeTLB pages it would require for its use.
+
+
+3. Caveats with shared memory
+
+a. Charging and uncharging:
+
+For shared hugetlb memory, both hugetlb reservation and usage (page faults) are
+charged to the first task that causes the memory to be reserved or faulted,
+and all subsequent uses of this reserved or faulted memory are done without
+charging.
+
+Shared hugetlb memory is only uncharged when it is unreserved or deallocated.
+This is usually when the hugetlbfs file is deleted, and not when the task that
+caused the reservation or fault has exited.
+
+b. Interaction between reservation limit and fault limit.
+
+Generally, it's not recommended to s
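
For reference (not part of the diff above), a minimal sketch of the
section 1 workflow, assuming 2MB hugepages and the /sys/fs/cgroup mount
shown earlier; the "app" group name and the 100MB figure are purely
illustrative:

  mkdir /sys/fs/cgroup/app
  # Allow up to 100MB (50 x 2MB pages) of reservations for this group.
  echo $((100 * 1024 * 1024)) > /sys/fs/cgroup/app/hugetlb.2MB.reservation_limit_in_bytes
  echo $$ > /sys/fs/cgroup/app/tasks
  # An mmap()/shmget() that would push the group's reservations past the
  # limit now fails at reservation time; such failures are counted in:
  cat /sys/fs/cgroup/app/hugetlb.2MB.reservation_failcnt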

[PATCH v5 5/7] hugetlb_cgroup: add accounting for shared mappings

2019-09-19 Thread Mina Almasry
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.

After a call to region_chg, we charge the appropriate hugetlb_cgroup, and if
successful, we pass on the hugetlb_cgroup info to a follow up region_add call.
When a file_region entry is added to the resv_map via region_add, we put the
pointer to that cgroup in file_region->reservation_counter. If charging doesn't
succeed, we report the error to the caller, so that the kernel fails the
reservation.

On region_del, which is when the hugetlb memory is unreserved, we also uncharge
the file_region->reservation_counter.

Signed-off-by: Mina Almasry 

---
 mm/hugetlb.c | 126 ++-
 1 file changed, 105 insertions(+), 21 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index d03b048084a3..ae573eff80bb 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -242,6 +242,15 @@ struct file_region {
struct list_head link;
long from;
long to;
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* On shared mappings, each reserved region appears as a struct
+* file_region in resv_map. These fields hold the info needed to
+* uncharge each reservation.
+*/
+   struct page_counter *reservation_counter;
+   unsigned long pages_per_hpage;
+#endif
 };

 /* Helper that removes a struct file_region from the resv_map cache and returns
@@ -250,12 +259,30 @@ struct file_region {
 static struct file_region *
 get_file_region_entry_from_cache(struct resv_map *resv, long from, long to);

+/* Helper that records hugetlb_cgroup uncharge info. */
+static void record_hugetlb_cgroup_uncharge_info(struct hugetlb_cgroup *h_cg,
+   struct file_region *nrg,
+   struct hstate *h)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+   if (h_cg) {
+   nrg->reservation_counter =
+   &h_cg->reserved_hugepage[hstate_index(h)];
+   nrg->pages_per_hpage = pages_per_huge_page(h);
+   } else {
+   nrg->reservation_counter = NULL;
+   nrg->pages_per_hpage = 0;
+   }
+#endif
+}
+
 /* Must be called with resv->lock held. Calling this with count_only == true
  * will count the number of pages to be added but will not modify the linked
  * list.
  */
 static long add_reservation_in_range(struct resv_map *resv, long f, long t,
-bool count_only)
+struct hugetlb_cgroup *h_cg,
+struct hstate *h, bool count_only)
 {
long add = 0;
struct list_head *head = &resv->regions;
@@ -291,6 +318,8 @@ static long add_reservation_in_range(struct resv_map *resv, 
long f, long t,
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, rg->from);
+   record_hugetlb_cgroup_uncharge_info(h_cg, nrg,
+   h);
list_add(&nrg->link, rg->link.prev);
}
}
@@ -306,11 +335,13 @@ static long add_reservation_in_range(struct resv_map 
*resv, long f, long t,
if (!count_only) {
nrg = get_file_region_entry_from_cache(
resv, last_accounted_offset, t);
+   record_hugetlb_cgroup_uncharge_info(h_cg, nrg, h);
list_add(&nrg->link, rg->link.prev);
}
last_accounted_offset = t;
}

+   VM_BUG_ON(add < 0);
return add;
 }

@@ -327,7 +358,8 @@ static long add_reservation_in_range(struct resv_map *resv, 
long f, long t,
  * Return the number of new huge pages added to the map.  This
  * number is greater than or equal to zero.
  */
-static long region_add(struct resv_map *resv, long f, long t,
+static long region_add(struct hstate *h, struct hugetlb_cgroup *h_cg,
+  struct resv_map *resv, long f, long t,
   long regions_needed)
 {
long add = 0;
@@ -336,7 +368,7 @@ static long region_add(struct resv_map *resv, long f, long 
t,

VM_BUG_ON(resv->region_cache_count < regions_needed);

-   add = add_reservation_in_range(resv, f, t, false);
+   add = add_reservation_in_range(resv, f, t, h_cg, h, false);
resv->adds_in_progress -= regions_needed;

spin_unlock(&resv->lock);
@@ -380,7 +412,7 @@ static long region_chg(struct resv_map *resv, long f, long 
t,
spin_lock(&resv->lock);

	/* Count how many hugepages in this range are NOT represented. */
-   chg = add_reservation_in_

[PATCH v5 6/7] hugetlb_cgroup: Add hugetlb_cgroup reservation tests

2019-09-19 Thread Mina Almasry
The tests use both shared and private mapped hugetlb memory, and
monitors the hugetlb usage counter as well as the hugetlb reservation
counter. They test different configurations such as hugetlb memory usage
via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
MAP_POPULATE.
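
A rough way to run these from the kernel tree (illustrative; assumes root
and a 2MB default hugepage size):

  cd tools/testing/selftests/vm
  make
  sudo ./charge_reserved_hugetlb.sh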

Signed-off-by: Mina Almasry 
---
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   1 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 461 ++
 .../selftests/vm/write_hugetlb_memory.sh  |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 250 ++
 5 files changed, 735 insertions(+)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

diff --git a/tools/testing/selftests/vm/.gitignore 
b/tools/testing/selftests/vm/.gitignore
index 31b3c98b6d34..d3bed9407773 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -14,3 +14,4 @@ virtual_address_range
 gup_benchmark
 va_128TBswitch
 map_fixed_noreplace
+write_to_hugetlbfs
diff --git a/tools/testing/selftests/vm/Makefile 
b/tools/testing/selftests/vm/Makefile
index 9534dc2bc929..31c2cc5cf30b 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
 TEST_GEN_FILES += va_128TBswitch
 TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += write_to_hugetlbfs

 TEST_PROGS := run_vmtests

diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh 
b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
new file mode 100755
index ..17315db4111c
--- /dev/null
+++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
@@ -0,0 +1,461 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+if [[ $(id -u) -ne 0 ]]; then
+   echo "This test must be run as root. Skipping..."
+   exit 0
+fi
+
+cgroup_path=/dev/cgroup/memory
+if [[ ! -e $cgroup_path ]]; then
+  mkdir -p $cgroup_path
+  mount -t cgroup -o hugetlb,memory cgroup $cgroup_path
+fi
+
+cleanup () {
+   echo $$ > $cgroup_path/tasks
+
+   if [[ -e /mnt/huge ]]; then
+ rm -rf /mnt/huge/*
+ umount /mnt/huge || echo error
+ rmdir /mnt/huge
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test1
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test2
+   fi
+   echo 0 > /proc/sys/vm/nr_hugepages
+   echo CLEANUP DONE
+}
+
+function expect_equal() {
+  local expected="$1"
+  local actual="$2"
+  local error="$3"
+
+  if [[ "$expected" != "$actual" ]]; then
+   echo "expected ($expected) != actual ($actual): $3"
+   cleanup
+   exit 1
+  fi
+}
+
+function setup_cgroup() {
+  local name="$1"
+  local cgroup_limit="$2"
+  local reservation_limit="$3"
+
+  mkdir $cgroup_path/$name
+
+  echo writing cgroup limit: "$cgroup_limit"
+  echo "$cgroup_limit" > $cgroup_path/$name/hugetlb.2MB.limit_in_bytes
+
+  echo writing reservation limit: "$reservation_limit"
+  echo "$reservation_limit" > \
+   $cgroup_path/$name/hugetlb.2MB.reservation_limit_in_bytes
+  if [ -e "$cgroup_path/$name/cpuset.cpus" ]; then
+echo 0 > $cgroup_path/$name/cpuset.cpus
+  fi
+  if [ -e "$cgroup_path/$name/cpuset.mems" ]; then
+echo 0 > $cgroup_path/$name/cpuset.mems
+  fi
+}
+
+function wait_for_hugetlb_memory_to_get_depleted {
+   local cgroup="$1"
+   local 
path="/dev/cgroup/memory/$cgroup/hugetlb.2MB.reservation_usage_in_bytes"
+   # Wait for hugetlbfs memory to get depleted.
+   while [ $(cat $path) != 0 ]; do
+  echo Waiting for hugetlb memory to get depleted.
+  sleep 0.5
+   done
+}
+
+function wait_for_hugetlb_memory_to_get_written {
+   local cgroup="$1"
+   local size="$2"
+
+   local 
path="/dev/cgroup/memory/$cgroup/hugetlb.2MB.reservation_usage_in_bytes"
+   # Wait for hugetlbfs memory to get written.
+   while [ $(cat $path) != $size ]; do
+  echo Waiting for hugetlb memory to reach size $size.
+  sleep 0.5
+   done
+}
+
+function write_hugetlbfs_and_get_usage() {
+  local cgroup="$1"
+  local size="$2"
+  local populate="$3"
+  local write="$4"
+  local path="$5"
+  

[PATCH v5 2/7] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations

2019-09-19 Thread Mina Almasry
Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb
usage or hugetlb reservation counter.

Adds a new interface to uncharge a hugetlb_cgroup counter via
hugetlb_cgroup_uncharge_counter.

Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.

Signed-off-by: Mina Almasry 

---
 include/linux/hugetlb_cgroup.h | 22 ++
 mm/hugetlb.c   |  6 ++-
 mm/hugetlb_cgroup.c| 77 --
 3 files changed, 83 insertions(+), 22 deletions(-)

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 063962f6dfc6..de35997bb5f9 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -52,14 +52,19 @@ static inline bool hugetlb_cgroup_disabled(void)
 }

 extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-   struct hugetlb_cgroup **ptr);
+   struct hugetlb_cgroup **ptr,
+   bool reserved);
 extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
 struct hugetlb_cgroup *h_cg,
 struct page *page);
 extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 struct page *page);
 extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
-  struct hugetlb_cgroup *h_cg);
+  struct hugetlb_cgroup *h_cg,
+  bool reserved);
+extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
+   unsigned long nr_pages);
+
 extern void hugetlb_cgroup_file_init(void) __init;
 extern void hugetlb_cgroup_migrate(struct page *oldhpage,
   struct page *newhpage);
@@ -81,9 +86,9 @@ static inline bool hugetlb_cgroup_disabled(void)
return true;
 }

-static inline int
-hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-struct hugetlb_cgroup **ptr)
+static inline int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
+  struct hugetlb_cgroup **ptr,
+  bool reserved)
 {
return 0;
 }
@@ -100,9 +105,10 @@ hugetlb_cgroup_uncharge_page(int idx, unsigned long 
nr_pages, struct page *page)
 {
 }

-static inline void
-hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
-  struct hugetlb_cgroup *h_cg)
+static inline void hugetlb_cgroup_uncharge_cgroup(int idx,
+ unsigned long nr_pages,
+ struct hugetlb_cgroup *h_cg,
+ bool reserved)
 {
 }

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 052a2532528a..a52efcb70d04 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2032,7 +2032,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
gbl_chg = 1;
}

-   ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
+   ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg,
+  false);
if (ret)
goto out_subpool_put;

@@ -2080,7 +2081,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
return page;

 out_uncharge_cgroup:
-   hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
+   hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg,
+  false);
 out_subpool_put:
if (map_chg || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 1386da79c9d7..dc1ddc9b09c4 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -73,8 +73,12 @@ static inline bool hugetlb_cgroup_have_usage(struct 
hugetlb_cgroup *h_cg)
int idx;

for (idx = 0; idx < hugetlb_max_hstate; idx++) {
-   if (page_counter_read(&h_cg->hugepage[idx]))
+   if (page_counter_read(
+   hugetlb_cgroup_get_counter(h_cg, idx, true)) ||
+   page_counter_read(
+   hugetlb_cgroup_get_counter(h_cg, idx, false))) {
return true;
+   }
}
return false;
 }
@@ -85,18 +89,32 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup 
*h_cgroup,
int idx;

for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
-   struct page_counter *counter = &h_cgroup->hugepage[idx];
struct

[PATCH v5 3/7] hugetlb_cgroup: add reservation accounting for private mappings

2019-09-19 Thread Mina Almasry
Normally the pointer to the cgroup to uncharge hangs off the struct
page, and gets queried when it's time to free the page. With
hugetlb_cgroup reservations, this is not possible, because a page may be
reserved by one task and actually faulted in by another task.

The best place to put the hugetlb_cgroup pointer to uncharge for
reservations is in the resv_map. But, because the resv_map has different
semantics for private and shared mappings, the code path to
charge/uncharge shared and private mappings is different. This patch
implements charging and uncharging for private mappings.

For private mappings, the counter to uncharge is in
resv_map->reservation_counter. On initializing the resv_map this is set
to NULL. On reservation of a region in a private mapping, the task's
hugetlb_cgroup is charged and the hugetlb_cgroup is placed in
resv_map->reservation_counter.

On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.

Signed-off-by: Mina Almasry 
Acked-by: Hillf Danton 

---
 include/linux/hugetlb.h|  8 +++
 include/linux/hugetlb_cgroup.h | 11 +
 mm/hugetlb.c   | 44 +-
 mm/hugetlb_cgroup.c| 12 --
 4 files changed, 62 insertions(+), 13 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 3d70a17cc0c3..230f44f730fa 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -46,6 +46,14 @@ struct resv_map {
long adds_in_progress;
struct list_head region_cache;
long region_cache_count;
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* On private mappings, the counter to uncharge reservations is stored
+* here. If these fields are 0, then the mapping is shared.
+*/
+   struct page_counter *reservation_counter;
+   unsigned long pages_per_hpage;
+#endif
 };
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index de35997bb5f9..31c4a9e1cf91 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -25,6 +25,17 @@ struct hugetlb_cgroup;
 #define HUGETLB_CGROUP_MIN_ORDER   2

 #ifdef CONFIG_CGROUP_HUGETLB
+struct hugetlb_cgroup {
+   struct cgroup_subsys_state css;
+   /*
+* the counter to account for hugepages from hugetlb.
+*/
+   struct page_counter hugepage[HUGE_MAX_HSTATE];
+   /*
+* the counter to account for hugepage reservations from hugetlb.
+*/
+   struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
+};

 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page 
*page)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a52efcb70d04..bac1cbdd027c 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -665,6 +665,16 @@ struct resv_map *resv_map_alloc(void)
INIT_LIST_HEAD(&resv_map->regions);

resv_map->adds_in_progress = 0;
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* Initialize these to 0. On shared mappings, 0's here indicate these
+* fields don't do cgroup accounting. On private mappings, these will be
+* re-initialized to the proper values, to indicate that hugetlb cgroup
+* reservations are to be un-charged from here.
+*/
+   resv_map->reservation_counter = NULL;
+   resv_map->pages_per_hpage = 0;
+#endif

INIT_LIST_HEAD(&resv_map->region_cache);
list_add(&rg->link, &resv_map->region_cache);
@@ -3147,7 +3157,18 @@ static void hugetlb_vm_op_close(struct vm_area_struct 
*vma)

reserve = (end - start) - region_count(resv, start, end);

-   kref_put(&resv->refs, resv_map_release);
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* Since we check for HPAGE_RESV_OWNER above, this must be a private
+* mapping, and these values should be non-zero, and should point to
+* the hugetlb_cgroup counter to uncharge for this reservation.
+*/
+   WARN_ON(!resv->reservation_counter);
+   WARN_ON(!resv->pages_per_hpage);
+
+   hugetlb_cgroup_uncharge_counter(resv->reservation_counter,
+   (end - start) * resv->pages_per_hpage);
+#endif

if (reserve) {
/*
@@ -3157,6 +3178,8 @@ static void hugetlb_vm_op_close(struct vm_area_struct 
*vma)
gbl_reserve = hugepage_subpool_put_pages(spool, reserve);
hugetlb_acct_memory(h, -gbl_reserve);
}
+
+   kref_put(&resv->refs, resv_map_release);
 }

 static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
@@ -4490,6 +4513,7 @@ int hugetlb_reserve_pages(struct inode *inode,
struct hstate *h = hstate_inode(inode);
struct hugepage_subpool *spool = subpool_inode(inode);
struct resv_map *resv_map;
+   struct hugetlb

[PATCH v5 4/7] hugetlb: disable region_add file_region coalescing

2019-09-19 Thread Mina Almasry
A follow up patch in this series adds hugetlb cgroup uncharge info to the
file_region entries in resv->regions. The cgroup uncharge info may
differ for different regions, so they can no longer be coalesced at
region_add time. So, disable region coalescing in region_add in this
patch.

Behavior change:

Say a resv_map exists like this [0->1], [2->3], and [5->6].

Then a region_chg/add call comes in region_chg/add(f=0, t=5).

Old code would generate resv->regions: [0->5], [5->6].
New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
[5->6].

Special care needs to be taken to handle the resv->adds_in_progress
variable correctly. In the past, only 1 region would be added for every
region_chg and region_add call. But now, each call may add multiple
regions, so we can no longer increment adds_in_progress by 1 in region_chg,
or decrement adds_in_progress by 1 after region_add or region_abort. Instead,
region_chg calls add_reservation_in_range() to count the number of regions
needed and allocates those, and that info is passed to region_add and
region_abort to decrement adds_in_progress correctly.

Signed-off-by: Mina Almasry 

---
 mm/hugetlb.c | 273 +--
 1 file changed, 158 insertions(+), 115 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bac1cbdd027c..d03b048084a3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -244,6 +244,12 @@ struct file_region {
long to;
 };

+/* Helper that removes a struct file_region from the resv_map cache and returns
+ * it for use.
+ */
+static struct file_region *
+get_file_region_entry_from_cache(struct resv_map *resv, long from, long to);
+
 /* Must be called with resv->lock held. Calling this with count_only == true
  * will count the number of pages to be added but will not modify the linked
  * list.
@@ -251,51 +257,61 @@ struct file_region {
 static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 bool count_only)
 {
-   long chg = 0;
+   long add = 0;
struct list_head *head = &resv->regions;
+   long last_accounted_offset = f;
struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;

-   /* Locate the region we are before or in. */
-   list_for_each_entry (rg, head, link)
-   if (f <= rg->to)
-   break;
-
-   /* Round our left edge to the current segment if it encloses us. */
-   if (f > rg->from)
-   f = rg->from;
-
-   chg = t - f;
+   /* In this loop, we essentially handle an entry for the range
+* last_accounted_offset -> rg->from, at every iteration, with some
+* bounds checking.
+*/
+   list_for_each_entry_safe(rg, trg, head, link) {
+   /* Skip irrelevant regions that start before our range. */
+   if (rg->from < f) {
+   /* If this region ends after the last accounted offset,
+* then we need to update last_accounted_offset.
+*/
+   if (rg->to > last_accounted_offset)
+   last_accounted_offset = rg->to;
+   continue;
+   }

-   /* Check for and consume any regions we now overlap with. */
-   nrg = rg;
-   list_for_each_entry_safe (rg, trg, rg->link.prev, link) {
-   if (&rg->link == head)
-   break;
+   /* When we find a region that starts beyond our range, we've
+* finished.
+*/
if (rg->from > t)
break;

-   /* We overlap with this area, if it extends further than
-* us then we must extend ourselves.  Account for its
-* existing reservation.
+   /* Add an entry for last_accounted_offset -> rg->from, and
+* update last_accounted_offset.
 */
-   if (rg->to > t) {
-   chg += rg->to - t;
-   t = rg->to;
+   if (rg->from > last_accounted_offset) {
+   add += rg->from - last_accounted_offset;
+   if (!count_only) {
+   nrg = get_file_region_entry_from_cache(
+   resv, last_accounted_offset, rg->from);
+   list_add(&nrg->link, rg->link.prev);
+   }
}
-   chg -= rg->to - rg->from;

-   if (!count_only && rg != nrg) {
-   list_del(&rg->link);
-   kfree(rg);
-   }
+   last_accounted_offset = rg->to;
}

-   if (!count_only) {
-

[PATCH v5 1/7] hugetlb_cgroup: Add hugetlb_cgroup reservation counter

2019-09-19 Thread Mina Almasry
D=libhugetlbfs.so
  libheapshrink.so HUGETLB_MORECORE=yes heapshrink (2M: 32):
  FAIL Heap not on hugepages
- HUGETLB_ELFMAP=RW linkhuge_rw (2M: 32): FAIL small_data is not hugepage
- HUGETLB_ELFMAP=RW HUGETLB_MINIMAL_COPY=no linkhuge_rw (2M: 32):
  FAIL small_data is not hugepage
- alloc-instantiate-race shared (2M: 32):
  Bad configuration: sched_setaffinity(cpu1): Invalid argument -
  FAIL Child 1 killed by signal Killed
- shmoverride_linked (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- shmoverride_linked_static (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked_static (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so shmoverride_unlinked (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so HUGETLB_SHM=yes shmoverride_unlinked (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Signed-off-by: Mina Almasry 
Acked-by: Hillf Danton 

---
 include/linux/hugetlb.h |  23 -
 mm/hugetlb_cgroup.c | 111 ++--
 2 files changed, 107 insertions(+), 27 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edfca4278319..3d70a17cc0c3 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -320,6 +320,27 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, 
unsigned long addr,

 #ifdef CONFIG_HUGETLB_PAGE

+enum {
+   /* Tracks hugetlb memory faulted in. */
+   HUGETLB_RES_USAGE,
+   /* Tracks hugetlb memory reserved. */
+   HUGETLB_RES_RESERVATION_USAGE,
+   /* Limit for hugetlb memory faulted in. */
+   HUGETLB_RES_LIMIT,
+   /* Limit for hugetlb memory reserved. */
+   HUGETLB_RES_RESERVATION_LIMIT,
+   /* Max usage for hugetlb memory faulted in. */
+   HUGETLB_RES_MAX_USAGE,
+   /* Max usage for hugetlb memory reserved. */
+   HUGETLB_RES_RESERVATION_MAX_USAGE,
+   /* Faulted memory accounting fail count. */
+   HUGETLB_RES_FAILCNT,
+   /* Reserved memory accounting fail count. */
+   HUGETLB_RES_RESERVATION_FAILCNT,
+   HUGETLB_RES_NULL,
+   HUGETLB_RES_MAX,
+};
+
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
@@ -340,7 +361,7 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 #ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
-   struct cftype cgroup_files[5];
+   struct cftype cgroup_files[HUGETLB_RES_MAX];
 #endif
char name[HSTATE_NAME_LEN];
 };
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 68c2f2f3c05b..1386da79c9d7 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -25,6 +25,10 @@ struct hugetlb_cgroup {
 * the counter to account for hugepages from hugetlb.
 */
struct page_counter hugepage[HUGE_MAX_HSTATE];
+   /*
+* the counter to account for hugepage reservations from hugetlb.
+*/
+   struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
 };

 #define MEMFILE_PRIVATE(x, val)(((x) << 16) | (val))
@@ -33,6 +37,14 @@ struct hugetlb_cgroup {

 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

+static inline struct page_counter *
+hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx, bool reserved)
+{
+   if (reserved)
+   return &h_cg->reserved_hugepage[idx];
+   return &h_cg->hugepage[idx];
+}
+
 static inline
 struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
 {
@@ -254,30 +266,33 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned 
long nr_pages,
return;
 }

-enum {
-   RES_USAGE,
-   RES_LIMIT,
-   RES_MAX_USAGE,
-   RES_FAILCNT,
-};
-
 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
   struct cftype *cft)
 {
struct page_counter *counter;
+   struct page_counter *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+   reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];

switch (MEMFILE_ATTR(cft->private)) {
-   case RES_USAGE:
+   case HUGETLB_RES_USAGE:
return (u64)page_counter_read(counter) * PAGE_SIZE;
-   case RES_LIMIT:
+   case HUGETLB_RES_RESERVATION_USAGE:
+   return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
+   case HUGETLB_RES_LIMIT:
return (u64)counter->m

[PATCH 2/2] hugetlb: remove duplicated code

2019-09-19 Thread Mina Almasry
Remove duplicated code between region_chg and region_add, and refactor it into
a common function, add_reservation_in_range. This is mostly done because
there is a follow up change in another series that disables region
coalescing in region_add, and I want to make that change in one place
only. It should improve maintainability anyway on its own.

Signed-off-by: Mina Almasry 
Reviewed-by: Mike Kravetz 

---
 mm/hugetlb.c | 119 ---
 1 file changed, 57 insertions(+), 62 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a14f6047fc7e..052a2532528a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -244,6 +244,60 @@ struct file_region {
long to;
 };

+/* Must be called with resv->lock held. Calling this with count_only == true
+ * will count the number of pages to be added but will not modify the linked
+ * list.
+ */
+static long add_reservation_in_range(struct resv_map *resv, long f, long t,
+bool count_only)
+{
+   long chg = 0;
+   struct list_head *head = &resv->regions;
+   struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
+
+   /* Locate the region we are before or in. */
+   list_for_each_entry (rg, head, link)
+   if (f <= rg->to)
+   break;
+
+   /* Round our left edge to the current segment if it encloses us. */
+   if (f > rg->from)
+   f = rg->from;
+
+   chg = t - f;
+
+   /* Check for and consume any regions we now overlap with. */
+   nrg = rg;
+   list_for_each_entry_safe (rg, trg, rg->link.prev, link) {
+   if (&rg->link == head)
+   break;
+   if (rg->from > t)
+   break;
+
+   /* We overlap with this area, if it extends further than
+* us then we must extend ourselves.  Account for its
+* existing reservation.
+*/
+   if (rg->to > t) {
+   chg += rg->to - t;
+   t = rg->to;
+   }
+   chg -= rg->to - rg->from;
+
+   if (!count_only && rg != nrg) {
+   list_del(&rg->link);
+   kfree(rg);
+   }
+   }
+
+   if (!count_only) {
+   nrg->from = f;
+   nrg->to = t;
+   }
+
+   return chg;
+}
+
 /*
  * Add the huge page range represented by [f, t) to the reserve
  * map.  Existing regions will be expanded to accommodate the specified
@@ -257,7 +311,7 @@ struct file_region {
 static long region_add(struct resv_map *resv, long f, long t)
 {
struct list_head *head = &resv->regions;
-   struct file_region *rg, *nrg, *trg;
+   struct file_region *rg, *nrg;
long add = 0;

spin_lock(&resv->lock);
@@ -287,38 +341,7 @@ static long region_add(struct resv_map *resv, long f, long 
t)
goto out_locked;
}

-   /* Round our left edge to the current segment if it encloses us. */
-   if (f > rg->from)
-   f = rg->from;
-
-   /* Check for and consume any regions we now overlap with. */
-   nrg = rg;
-   list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-   if (&rg->link == head)
-   break;
-   if (rg->from > t)
-   break;
-
-   /* If this area reaches higher then extend our area to
-* include it completely.  If this is not the first area
-* which we intend to reuse, free it. */
-   if (rg->to > t)
-   t = rg->to;
-   if (rg != nrg) {
-   /* Decrement return value by the deleted range.
-* Another range will span this area so that by
-* end of routine add will be >= zero
-*/
-   add -= (rg->to - rg->from);
-   list_del(&rg->link);
-   kfree(rg);
-   }
-   }
-
-   add += (nrg->from - f); /* Added to beginning of region */
-   nrg->from = f;
-   add += t - nrg->to; /* Added to end of region */
-   nrg->to = t;
+   add = add_reservation_in_range(resv, f, t, false);

 out_locked:
resv->adds_in_progress--;
@@ -345,8 +368,6 @@ static long region_add(struct resv_map *resv, long f, long 
t)
  */
 static long region_chg(struct resv_map *resv, long f, long t)
 {
-   struct list_head *head = &resv->regions;
-   struct file_region *rg;
long chg = 0;

spin_lock(&resv->lock);
@@ -375,34 +396,8 @@ static long region_chg(struct resv_map *resv, long f, long 
t)
goto retry_locke

[PATCH 1/2] hugetlb: region_chg provides only cache entry

2019-09-19 Thread Mina Almasry
Current behavior is that region_chg provides both a cache entry in
resv->region_cache, AND a placeholder entry in resv->regions. region_add
first tries to use the placeholder, and if it finds that the placeholder
has been deleted by a racing region_del call, it uses the cache entry.

This behavior is completely unnecessary and is removed in this patch for
a couple of reasons:

1. region_add needs to either find a cached file_region entry in
   resv->region_cache, or find an entry in resv->regions to expand. It
   does not need both.
2. region_chg adding a placeholder entry in resv->regions opens up
   a possible race with region_del, where region_chg adds a placeholder
   region in resv->regions, and this region is deleted by a racing call
   to region_del during region_chg execution or before region_add is
   called. Removing the race makes the code easier to reason about and
   maintain.

In addition, a follow-up patch in another series disables region
coalescing, which would be further complicated if the race with
region_del existed.

Signed-off-by: Mina Almasry 
Reviewed-by: Mike Kravetz 

---
 mm/hugetlb.c | 63 +---
 1 file changed, 11 insertions(+), 52 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6d7296dd11b8..a14f6047fc7e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -246,14 +246,10 @@ struct file_region {

 /*
  * Add the huge page range represented by [f, t) to the reserve
- * map.  In the normal case, existing regions will be expanded
- * to accommodate the specified range.  Sufficient regions should
- * exist for expansion due to the previous call to region_chg
- * with the same range.  However, it is possible that region_del
- * could have been called after region_chg and modifed the map
- * in such a way that no region exists to be expanded.  In this
- * case, pull a region descriptor from the cache associated with
- * the map and use that for the new range.
+ * map.  Existing regions will be expanded to accommodate the specified
+ * range, or a region will be taken from the cache.  Sufficient regions
+ * must exist in the cache due to the previous call to region_chg with
+ * the same range.
  *
  * Return the number of new huge pages added to the map.  This
  * number is greater than or equal to zero.
@@ -272,9 +268,8 @@ static long region_add(struct resv_map *resv, long f, long 
t)

/*
 * If no region exists which can be expanded to include the
-* specified range, the list must have been modified by an
-* interleving call to region_del().  Pull a region descriptor
-* from the cache and use it for this range.
+* specified range, pull a region descriptor from the cache
+* and use it for this range.
 */
if (&rg->link == head || t < rg->from) {
VM_BUG_ON(resv->region_cache_count <= 0);
@@ -339,15 +334,9 @@ static long region_add(struct resv_map *resv, long f, long 
t)
  * call to region_add that will actually modify the reserve
  * map to add the specified range [f, t).  region_chg does
  * not change the number of huge pages represented by the
- * map.  However, if the existing regions in the map can not
- * be expanded to represent the new range, a new file_region
- * structure is added to the map as a placeholder.  This is
- * so that the subsequent region_add call will have all the
- * regions it needs and will not fail.
- *
- * Upon entry, region_chg will also examine the cache of region descriptors
- * associated with the map.  If there are not enough descriptors cached, one
- * will be allocated for the in progress add operation.
+ * map.  A new file_region structure is added to the cache
+ * as a placeholder, so that the subsequent region_add
+ * call will have all the regions it needs and will not fail.
  *
  * Returns the number of huge pages that need to be added to the existing
  * reservation map for the range [f, t).  This number is greater or equal to
@@ -357,10 +346,9 @@ static long region_add(struct resv_map *resv, long f, long 
t)
 static long region_chg(struct resv_map *resv, long f, long t)
 {
struct list_head *head = &resv->regions;
-   struct file_region *rg, *nrg = NULL;
+   struct file_region *rg;
long chg = 0;

-retry:
spin_lock(&resv->lock);
 retry_locked:
resv->adds_in_progress++;
@@ -378,10 +366,8 @@ static long region_chg(struct resv_map *resv, long f, long 
t)
spin_unlock(&resv->lock);

trg = kmalloc(sizeof(*trg), GFP_KERNEL);
-   if (!trg) {
-   kfree(nrg);
+   if (!trg)
return -ENOMEM;
-   }

spin_lock(&resv->lock);
list_add(&trg->link, &resv->region_cache);
@@ -394,28 +380,6 @@ static long region_chg(struct resv_map *resv, long f, long 
t)
   

[PATCH 0/2] Cleanups to hugetlb code

2019-09-19 Thread Mina Almasry
These couple of patches were part of my 'hugetlb_cgroup: Add hugetlb_cgroup
reservation limits' patch series, and Mike recommended that they are split off
into their own since they are generic cleanups that should apply regardless.
Hence, I upload them here as a their own patch series.

They have been already reviewed by Mike as part of the previous series, so
already hold the Reviewed-by tag.

Mina Almasry (2):
  hugetlb: region_chg provides only cache entry
  hugetlb: remove duplicated code

 mm/hugetlb.c | 180 +++
 1 file changed, 67 insertions(+), 113 deletions(-)

--
2.23.0.351.gc4317032e6-goog


Re: [PATCH v4 8/9] hugetlb_cgroup: Add hugetlb_cgroup reservation tests

2019-09-18 Thread Mina Almasry
On Mon, Sep 16, 2019 at 6:52 PM shuah  wrote:
>
> On 9/10/19 5:31 PM, Mina Almasry wrote:
> > The tests use both shared and private mapped hugetlb memory, and
> > monitors the hugetlb usage counter as well as the hugetlb reservation
> > counter. They test different configurations such as hugetlb memory usage
> > via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
> > MAP_POPULATE.
> >
> > Signed-off-by: Mina Almasry 
> > ---
> >   tools/testing/selftests/vm/.gitignore |   1 +
> >   tools/testing/selftests/vm/Makefile   |   4 +
> >   .../selftests/vm/charge_reserved_hugetlb.sh   | 440 ++
> >   .../selftests/vm/write_hugetlb_memory.sh  |  22 +
> >   .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++
> >   5 files changed, 719 insertions(+)
> >   create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
> >   create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
> >   create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c
> >
> > diff --git a/tools/testing/selftests/vm/.gitignore 
> > b/tools/testing/selftests/vm/.gitignore
> > index 31b3c98b6d34d..d3bed9407773c 100644
> > --- a/tools/testing/selftests/vm/.gitignore
> > +++ b/tools/testing/selftests/vm/.gitignore
> > @@ -14,3 +14,4 @@ virtual_address_range
> >   gup_benchmark
> >   va_128TBswitch
> >   map_fixed_noreplace
> > +write_to_hugetlbfs
> > diff --git a/tools/testing/selftests/vm/Makefile 
> > b/tools/testing/selftests/vm/Makefile
> > index 9534dc2bc9295..8d37d5409b52c 100644
> > --- a/tools/testing/selftests/vm/Makefile
> > +++ b/tools/testing/selftests/vm/Makefile
> > @@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress
> >   TEST_GEN_FILES += userfaultfd
> >   TEST_GEN_FILES += va_128TBswitch
> >   TEST_GEN_FILES += virtual_address_range
> > +TEST_GEN_FILES += write_to_hugetlbfs
> >
> >   TEST_PROGS := run_vmtests
> >
> > @@ -29,3 +30,6 @@ include ../lib.mk
> >   $(OUTPUT)/userfaultfd: LDLIBS += -lpthread
> >
> >   $(OUTPUT)/mlock-random-test: LDLIBS += -lcap
> > +
> > +# Why does adding $(OUTPUT)/ like above not apply this flag..?
>
> Can you verify the following and remove this comment, once you figure
> out if you need $(OUTPUT)/
> > +write_to_hugetlbfs: CFLAGS += -static
>
> It should. Did you test "make O=" and "KBUILD_OUTPUT" kselftest
> use-cases?
>

Turns out I don't need -static actually.

> > diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh 
> > b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
> > new file mode 100755
> > index 0..09e90e8f6fab4
> > --- /dev/null
> > +++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
> > @@ -0,0 +1,440 @@
> > +#!/bin/sh
> > +# SPDX-License-Identifier: GPL-2.0
> > +
> > +set -e
> > +
> > +cgroup_path=/dev/cgroup/memory
> > +if [[ ! -e $cgroup_path ]]; then
> > +  mkdir -p $cgroup_path
> > +  mount -t cgroup -o hugetlb,memory cgroup $cgroup_path
> > +fi
> > +
>
> Does this test need root access? If yes, please add root check
> and skip the test when a non-root runs the test.
>
> > +cleanup () {
> > + echo $$ > $cgroup_path/tasks
> > +
> > + set +e
> > + if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
> > +   kill -2 write_to_hugetlbfs
> > +   # Wait for hugetlbfs memory to get depleted.
> > +   sleep 0.5
>
> This time looks arbitrary. How can you be sure it gets depleted?
> Is there another way to check for it.
>
> > + fi
> > + set -e
> > +
> > + if [[ -e /mnt/huge ]]; then
> > +   rm -rf /mnt/huge/*
> > +   umount /mnt/huge || echo error
> > +   rmdir /mnt/huge
> > + fi
> > + if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then
> > +   rmdir $cgroup_path/hugetlb_cgroup_test
> > + fi
> > + if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then
> > +   rmdir $cgroup_path/hugetlb_cgroup_test1
> > + fi
> > + if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then
> > +   rmdir $cgroup_path/hugetlb_cgroup_test2
> > + fi
> > + echo 0 > /proc/sys/vm/nr_hugepages
> > + echo CLEANUP DONE
> > +}
> > +
> > +cleanup
> > +
> > +function expect_equal() {
> > +  local expected="$1"
> > +  local actual="$2"
> > +

Re: [PATCH v4 6/9] hugetlb: disable region_add file_region coalescing

2019-09-16 Thread Mina Almasry
On Mon, Sep 16, 2019 at 4:57 PM Mike Kravetz  wrote:
>
> On 9/10/19 4:31 PM, Mina Almasry wrote:
> > A follow up patch in this series adds hugetlb cgroup uncharge info to the
> > file_region entries in resv->regions. The cgroup uncharge info may
> > differ for different regions, so they can no longer be coalesced at
> > region_add time. So, disable region coalescing in region_add in this
> > patch.
> >
> > Behavior change:
> >
> > Say a resv_map exists like this [0->1], [2->3], and [5->6].
> >
> > Then a region_chg/add call comes in region_chg/add(f=0, t=5).
> >
> > Old code would generate resv->regions: [0->5], [5->6].
> > New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
> > [5->6].
> >
> > Special care needs to be taken to handle the resv->adds_in_progress
> > variable correctly. In the past, only 1 region would be added for every
> > region_chg and region_add call. But now, each call may add multiple
> > regions, so we can no longer increment adds_in_progress by 1 in region_chg,
> > or decrement adds_in_progress by 1 after region_add or region_abort. 
> > Instead,
> > region_chg calls add_reservation_in_range() to count the number of regions
> > needed and allocates those, and that info is passed to region_add and
> > region_abort to decrement adds_in_progress correctly.
>
> Hate to throw more theoretical examples at you but ...
>
> Consider an existing reserv_map like [3-10]
> Then a region_chg/add call comes in region_chg/add(f=0, t=10).
> The region_chg is going to return 3 (additional reservations needed), and
> also out_regions_needed = 1 as it would want to create a region [0-3].
> Correct?
> But, there is nothing to prevent another thread from doing a region_del [5-7]
> after the region_chg and before region_add.  Correct?
> If so, it seems the region_add would need to create two regions, but there
> is only one in the cache and we would BUG in get_file_region_entry_from_cache.
> Am I reading the code correctly?
>
> The existing code wants to make sure region_add called after region_chg will
> never return error.  This is why all needed allocations were done in the
> region_chg call, and it was relatively easy to do in existing code when
> region_chg would only need one additional region at most.
>
> I'm thinking that we may have to make region_chg allocate the worst case
> number of regions (t - f)/2, OR change to the code such that region_add
> could return an error.

Yep you are right, I missed reasoning about the region_del punch hole
into the reservations case. Let me consider these 2 options.

> --
> Mike Kravetz


[PATCH v4 3/9] hugetlb_cgroup: add reservation accounting for private mappings

2019-09-10 Thread Mina Almasry
Normally the pointer to the cgroup to uncharge hangs off the struct
page, and gets queried when it's time to free the page. With
hugetlb_cgroup reservations, this is not possible, because a page may be
reserved by one task and actually faulted in by another task.

The best place to put the hugetlb_cgroup pointer to uncharge for
reservations is in the resv_map. But, because the resv_map has different
semantics for private and shared mappings, the code path to
charge/uncharge shared and private mappings is different. This patch
implements charging and uncharging for private mappings.

For private mappings, the counter to uncharge is in
resv_map->reservation_counter. On initializing the resv_map this is set
to NULL. On reservation of a region in a private mapping, the task's
hugetlb_cgroup is charged and the hugetlb_cgroup is placed in
resv_map->reservation_counter.

On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.

Signed-off-by: Mina Almasry 
---
 include/linux/hugetlb.h|  8 ++
 include/linux/hugetlb_cgroup.h | 11 
 mm/hugetlb.c   | 47 --
 mm/hugetlb_cgroup.c| 12 -
 4 files changed, 64 insertions(+), 14 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 128ff1aff1c93..536cb144cf484 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -46,6 +46,14 @@ struct resv_map {
long adds_in_progress;
struct list_head region_cache;
long region_cache_count;
+ #ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* On private mappings, the counter to uncharge reservations is stored
+* here. If these fields are 0, then the mapping is shared.
+*/
+   struct page_counter *reservation_counter;
+   unsigned long pages_per_hpage;
+#endif
 };
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index c467715dd8fb8..8c6ea58c63c89 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -25,6 +25,17 @@ struct hugetlb_cgroup;
 #define HUGETLB_CGROUP_MIN_ORDER   2

 #ifdef CONFIG_CGROUP_HUGETLB
+struct hugetlb_cgroup {
+   struct cgroup_subsys_state css;
+   /*
+* the counter to account for hugepages from hugetlb.
+*/
+   struct page_counter hugepage[HUGE_MAX_HSTATE];
+   /*
+* the counter to account for hugepage reservations from hugetlb.
+*/
+   struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
+};

 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page 
*page)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e975f55aede94..fbd7c52e17348 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -711,6 +711,16 @@ struct resv_map *resv_map_alloc(void)
INIT_LIST_HEAD(&resv_map->regions);

resv_map->adds_in_progress = 0;
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* Initialize these to 0. On shared mappings, 0's here indicate these
+* fields don't do cgroup accounting. On private mappings, these will be
+* re-initialized to the proper values, to indicate that hugetlb cgroup
+* reservations are to be un-charged from here.
+*/
+   resv_map->reservation_counter = NULL;
+   resv_map->pages_per_hpage = 0;
+#endif

INIT_LIST_HEAD(&resv_map->region_cache);
list_add(&rg->link, &resv_map->region_cache);
@@ -3193,7 +3203,19 @@ static void hugetlb_vm_op_close(struct vm_area_struct 
*vma)

reserve = (end - start) - region_count(resv, start, end);

-   kref_put(&resv->refs, resv_map_release);
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* Since we check for HPAGE_RESV_OWNER above, this must be a private
+* mapping, and these values should be non-zero, and should point to
+* the hugetlb_cgroup counter to uncharge for this reservation.
+*/
+   WARN_ON(!resv->reservation_counter);
+   WARN_ON(!resv->pages_per_hpage);
+
+   hugetlb_cgroup_uncharge_counter(
+   resv->reservation_counter,
+   (end - start) * resv->pages_per_hpage);
+#endif

if (reserve) {
/*
@@ -3203,6 +3225,8 @@ static void hugetlb_vm_op_close(struct vm_area_struct 
*vma)
gbl_reserve = hugepage_subpool_put_pages(spool, reserve);
hugetlb_acct_memory(h, -gbl_reserve);
}
+
+   kref_put(&resv->refs, resv_map_release);
 }

 static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
@@ -4536,6 +4560,7 @@ int hugetlb_reserve_pages(struct inode *inode,
struct hstate *h = hstate_inode(inode);
struct hugepage_subpool *spool = subpool_inode(inode);
struct resv_map *resv_map;
+   struct hugetlb_cgroup *

[PATCH v4 2/9] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations

2019-09-10 Thread Mina Almasry
Augments hugetlb_cgroup_charge_cgroup to be able to charge either the hugetlb
usage or hugetlb reservation counter.

Adds a new interface to uncharge a hugetlb_cgroup counter via
hugetlb_cgroup_uncharge_counter.

Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.

Signed-off-by: Mina Almasry 
---
 include/linux/hugetlb_cgroup.h | 13 --
 mm/hugetlb.c   |  6 ++-
 mm/hugetlb_cgroup.c| 82 +++---
 3 files changed, 80 insertions(+), 21 deletions(-)

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 063962f6dfc6a..c467715dd8fb8 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -52,14 +52,19 @@ static inline bool hugetlb_cgroup_disabled(void)
 }

 extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-   struct hugetlb_cgroup **ptr);
+   struct hugetlb_cgroup **ptr,
+   bool reserved);
 extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
 struct hugetlb_cgroup *h_cg,
 struct page *page);
 extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 struct page *page);
 extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
-  struct hugetlb_cgroup *h_cg);
+  struct hugetlb_cgroup *h_cg,
+  bool reserved);
+extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
+   unsigned long nr_pages);
+
 extern void hugetlb_cgroup_file_init(void) __init;
 extern void hugetlb_cgroup_migrate(struct page *oldhpage,
   struct page *newhpage);
@@ -83,7 +88,7 @@ static inline bool hugetlb_cgroup_disabled(void)

 static inline int
 hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-struct hugetlb_cgroup **ptr)
+struct hugetlb_cgroup **ptr, bool reserved)
 {
return 0;
 }
@@ -102,7 +107,7 @@ hugetlb_cgroup_uncharge_page(int idx, unsigned long 
nr_pages, struct page *page)

 static inline void
 hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
-  struct hugetlb_cgroup *h_cg)
+  struct hugetlb_cgroup *h_cg, bool reserved)
 {
 }

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6d7296dd11b83..e975f55aede94 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2078,7 +2078,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
gbl_chg = 1;
}

-   ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
+   ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg,
+  false);
if (ret)
goto out_subpool_put;

@@ -2126,7 +2127,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
return page;

 out_uncharge_cgroup:
-   hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
+   hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg,
+   false);
 out_subpool_put:
if (map_chg || avoid_reserve)
hugepage_subpool_put_pages(spool, 1);
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 51a72624bd1ff..2ab36a98d834e 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -38,8 +38,8 @@ struct hugetlb_cgroup {
 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

 static inline
-struct page_counter *hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, 
int idx,
-bool reserved)
+struct page_counter *hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg,
+   int idx, bool reserved)
 {
if (reserved)
return  &h_cg->reserved_hugepage[idx];
@@ -74,8 +74,12 @@ static inline bool hugetlb_cgroup_have_usage(struct 
hugetlb_cgroup *h_cg)
int idx;

for (idx = 0; idx < hugetlb_max_hstate; idx++) {
-   if (page_counter_read(&h_cg->hugepage[idx]))
+   if (page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx,
+   true)) ||
+   page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx,
+   false))) {
return true;
+   }
}
return false;
 }
@@ -86,18 +90,30 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup 
*h_cgroup,
int idx;

for (idx = 0; idx < HUGE_MAX_HSTATE; idx

[PATCH v4 5/9] hugetlb: remove duplicated code

2019-09-10 Thread Mina Almasry
Remove duplicated code between region_chg and region_add, and refactor it into
a common function, add_reservation_in_range. This is mostly done because
there is a follow-up change in this series that disables region
coalescing in region_add, and I want to make that change in one place
only. In any case, it should improve maintainability on its own.

Signed-off-by: Mina Almasry 
---
 mm/hugetlb.c | 116 ---
 1 file changed, 54 insertions(+), 62 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bea51ae422f63..ce5ed1056fefd 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -244,6 +244,57 @@ struct file_region {
long to;
 };

+static long add_reservation_in_range(
+   struct resv_map *resv, long f, long t, bool count_only)
+{
+
+   long chg = 0;
+   struct list_head *head = &resv->regions;
+   struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;
+
+   /* Locate the region we are before or in. */
+   list_for_each_entry(rg, head, link)
+   if (f <= rg->to)
+   break;
+
+   /* Round our left edge to the current segment if it encloses us. */
+   if (f > rg->from)
+   f = rg->from;
+
+   chg = t - f;
+
+   /* Check for and consume any regions we now overlap with. */
+   nrg = rg;
+   list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+   if (&rg->link == head)
+   break;
+   if (rg->from > t)
+   break;
+
+   /* We overlap with this area, if it extends further than
+* us then we must extend ourselves.  Account for its
+* existing reservation.
+*/
+   if (rg->to > t) {
+   chg += rg->to - t;
+   t = rg->to;
+   }
+   chg -= rg->to - rg->from;
+
+   if (!count_only && rg != nrg) {
+   list_del(&rg->link);
+   kfree(rg);
+   }
+   }
+
+   if (!count_only) {
+   nrg->from = f;
+   nrg->to = t;
+   }
+
+   return chg;
+}
+
 /*
  * Add the huge page range represented by [f, t) to the reserve
  * map.  Existing regions will be expanded to accommodate the specified
@@ -257,7 +308,7 @@ struct file_region {
 static long region_add(struct resv_map *resv, long f, long t)
 {
struct list_head *head = &resv->regions;
-   struct file_region *rg, *nrg, *trg;
+   struct file_region *rg, *nrg;
long add = 0;

spin_lock(&resv->lock);
@@ -287,38 +338,7 @@ static long region_add(struct resv_map *resv, long f, long 
t)
goto out_locked;
}

-   /* Round our left edge to the current segment if it encloses us. */
-   if (f > rg->from)
-   f = rg->from;
-
-   /* Check for and consume any regions we now overlap with. */
-   nrg = rg;
-   list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-   if (&rg->link == head)
-   break;
-   if (rg->from > t)
-   break;
-
-   /* If this area reaches higher then extend our area to
-* include it completely.  If this is not the first area
-* which we intend to reuse, free it. */
-   if (rg->to > t)
-   t = rg->to;
-   if (rg != nrg) {
-   /* Decrement return value by the deleted range.
-* Another range will span this area so that by
-* end of routine add will be >= zero
-*/
-   add -= (rg->to - rg->from);
-   list_del(&rg->link);
-   kfree(rg);
-   }
-   }
-
-   add += (nrg->from - f); /* Added to beginning of region */
-   nrg->from = f;
-   add += t - nrg->to; /* Added to end of region */
-   nrg->to = t;
+   add = add_reservation_in_range(resv, f, t, false);

 out_locked:
resv->adds_in_progress--;
@@ -345,8 +365,6 @@ static long region_add(struct resv_map *resv, long f, long 
t)
  */
 static long region_chg(struct resv_map *resv, long f, long t)
 {
-   struct list_head *head = &resv->regions;
-   struct file_region *rg;
long chg = 0;

spin_lock(&resv->lock);
@@ -375,34 +393,8 @@ static long region_chg(struct resv_map *resv, long f, long 
t)
goto retry_locked;
}

-   /* Locate the region we are before or in. */
-   list_for_each_entry(rg, head, link)
-   if (f <= rg->to)
-   break;
-
-   /* Round our left edge to the curre

[PATCH v4 9/9] hugetlb_cgroup: Add hugetlb_cgroup reservation docs

2019-09-10 Thread Mina Almasry
Add docs for how to use hugetlb_cgroup reservations, and their behavior.

Signed-off-by: Mina Almasry 
Acked-by: Hillf Danton 
---
 .../admin-guide/cgroup-v1/hugetlb.rst | 84 ---
 1 file changed, 73 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst 
b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index a3902aa253a96..cc6eb859fc722 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -2,13 +2,6 @@
 HugeTLB Controller
 ==

-The HugeTLB controller allows to limit the HugeTLB usage per control group and
-enforces the controller limit during page fault. Since HugeTLB doesn't
-support page reclaim, enforcing the limit at page fault time implies that,
-the application will get SIGBUS signal if it tries to access HugeTLB pages
-beyond its limit. This requires the application to know beforehand how much
-HugeTLB pages it would require for its use.
-
 HugeTLB controller can be created by first mounting the cgroup filesystem.

 # mount -t cgroup -o hugetlb none /sys/fs/cgroup
@@ -28,10 +21,14 @@ process (bash) into it.

 Brief summary of control files::

- hugetlb.<hugepagesize>.limit_in_bytes                 # set/show limit of "hugepagesize" hugetlb usage
- hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb usage recorded
- hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
- hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB limit
+ hugetlb.<hugepagesize>.reservation_limit_in_bytes     # set/show limit of "hugepagesize" hugetlb reservations
+ hugetlb.<hugepagesize>.reservation_max_usage_in_bytes # show max "hugepagesize" hugetlb reservations recorded
+ hugetlb.<hugepagesize>.reservation_usage_in_bytes     # show current reservations for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.reservation_failcnt            # show the number of allocation failure due to HugeTLB reservation limit
+ hugetlb.<hugepagesize>.limit_in_bytes                 # set/show limit of "hugepagesize" hugetlb faults
+ hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb usage recorded
+ hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit

 For a system supporting three hugepage sizes (64k, 32M and 1G), the control
 files include::
@@ -40,11 +37,76 @@ files include::
   hugetlb.1GB.max_usage_in_bytes
   hugetlb.1GB.usage_in_bytes
   hugetlb.1GB.failcnt
+  hugetlb.1GB.reservation_limit_in_bytes
+  hugetlb.1GB.reservation_max_usage_in_bytes
+  hugetlb.1GB.reservation_usage_in_bytes
+  hugetlb.1GB.reservation_failcnt
   hugetlb.64KB.limit_in_bytes
   hugetlb.64KB.max_usage_in_bytes
   hugetlb.64KB.usage_in_bytes
   hugetlb.64KB.failcnt
+  hugetlb.64KB.reservation_limit_in_bytes
+  hugetlb.64KB.reservation_max_usage_in_bytes
+  hugetlb.64KB.reservation_usage_in_bytes
+  hugetlb.64KB.reservation_failcnt
   hugetlb.32MB.limit_in_bytes
   hugetlb.32MB.max_usage_in_bytes
   hugetlb.32MB.usage_in_bytes
   hugetlb.32MB.failcnt
+  hugetlb.32MB.reservation_limit_in_bytes
+  hugetlb.32MB.reservation_max_usage_in_bytes
+  hugetlb.32MB.reservation_usage_in_bytes
+  hugetlb.32MB.reservation_failcnt
+
+
+1. Reservation limits
+
+The HugeTLB controller allows to limit the HugeTLB reservations per control
+group and enforces the controller limit at reservation time. Reservation limits
+are superior to Page fault limits (see section 2), since Reservation limits are
+enforced at reservation time, and never causes the application to get SIGBUS
+signal. Instead, if the application is violating its limits, then it gets an
+error on reservation time, i.e. the mmap or shmget return an error.
+
+
+2. Page fault limits
+
+The HugeTLB controller allows to limit the HugeTLB usage (page fault) per
+control group and enforces the controller limit during page fault. Since 
HugeTLB
+doesn't support page reclaim, enforcing the limit at page fault time implies
+that, the application will get SIGBUS signal if it tries to access HugeTLB
+pages beyond its limit. This requires the application to know beforehand how
+much HugeTLB pages it would require for its use.
+
+
+3. Caveats with shared memory
+
+a. Charging and uncharging:
+
+For shared hugetlb memory, both hugetlb reservation and usage (page faults) are
+charged to the first task that causes the memory to be reserved or faulted,
+and all subsequent uses of this reserved or faulted memory is done without
+charging.
+
+Shared hugetlb memory is only uncharged when it is unreseved or deallocated.
+This is usually when the hugetlbfs file is deleted, and not when the task that
+caused the reservation or fault has exited.
+
+b. Interaction between reservation limit and fault limit.
+
+Generally, it's not recommended to s

[PATCH v4 7/9] hugetlb_cgroup: add accounting for shared mappings

2019-09-10 Thread Mina Almasry
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.

After a call to region_chg, we charge the appropriate hugetlb_cgroup, and if
successful, we pass on the hugetlb_cgroup info to a follow-up region_add call.
When a file_region entry is added to the resv_map via region_add, we put the
pointer to that cgroup in file_region->reservation_counter. If charging doesn't
succeed, we report the error to the caller, so that the kernel fails the
reservation.

On region_del, which is when the hugetlb memory is unreserved, we also uncharge
the file_region->reservation_counter.

Signed-off-by: Mina Almasry 
---
 mm/hugetlb.c | 147 ---
 1 file changed, 115 insertions(+), 32 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5eca34d9b753d..711690b87dce5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -242,6 +242,15 @@ struct file_region {
struct list_head link;
long from;
long to;
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* On shared mappings, each reserved region appears as a struct
+* file_region in resv_map. These fields hold the info needed to
+* uncharge each reservation.
+*/
+   struct page_counter *reservation_counter;
+   unsigned long pages_per_hpage;
+#endif
 };

 /* Helper that removes a struct file_region from the resv_map cache and returns
@@ -250,9 +259,29 @@ struct file_region {
 static struct file_region *get_file_region_entry_from_cache(
struct resv_map *resv, long from, long to);

-static long add_reservation_in_range(
-   struct resv_map *resv,
+/* Helper that records hugetlb_cgroup uncharge info. */
+static void record_hugetlb_cgroup_uncharge_info(struct hugetlb_cgroup *h_cg,
+   struct file_region *nrg, struct hstate *h)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+   if (h_cg) {
+   nrg->reservation_counter =
+   &h_cg->reserved_hugepage[hstate_index(h)];
+   nrg->pages_per_hpage = pages_per_huge_page(h);
+   } else {
+   nrg->reservation_counter = NULL;
+   nrg->pages_per_hpage = 0;
+   }
+#endif
+}
+
+/* Must be called with resv->lock held. Calling this with dry_run == true will
+ * count the number of pages to be added but will not modify the linked list.
+ */
+static long add_reservation_in_range(struct resv_map *resv,
long f, long t,
+   struct hugetlb_cgroup *h_cg,
+   struct hstate *h,
long *regions_needed,
bool count_only)
 {
@@ -294,6 +323,8 @@ static long add_reservation_in_range(
nrg = get_file_region_entry_from_cache(resv,
last_accounted_offset,
rg->from);
+   record_hugetlb_cgroup_uncharge_info(h_cg, nrg,
+   h);
list_add(&nrg->link, rg->link.prev);
} else if (regions_needed)
*regions_needed += 1;
@@ -310,6 +341,7 @@ static long add_reservation_in_range(
if (!count_only) {
nrg = get_file_region_entry_from_cache(resv,
last_accounted_offset, t);
+   record_hugetlb_cgroup_uncharge_info(h_cg, nrg, h);
list_add(&nrg->link, rg->link.prev);
} else if (regions_needed)
*regions_needed += 1;
@@ -317,6 +349,7 @@ static long add_reservation_in_range(
last_accounted_offset = t;
}

+   VM_BUG_ON(add < 0);
return add;
 }

@@ -333,8 +366,8 @@ static long add_reservation_in_range(
  * Return the number of new huge pages added to the map.  This
  * number is greater than or equal to zero.
  */
-static long region_add(struct resv_map *resv, long f, long t,
-   long regions_needed)
+static long region_add(struct hstate *h, struct hugetlb_cgroup *h_cg,
+   struct resv_map *resv, long f, long t, long regions_needed)
 {
long add = 0;

@@ -342,7 +375,7 @@ static long region_add(struct resv_map *resv, long f, long 
t,

VM_BUG_ON(resv->region_cache_count < regions_needed);

-   add = add_reservation_in_range(resv, f, t, NULL, false);
+   add = add_reservation_in_range(resv, f, t, h_cg, h, NULL, false);
resv->adds_in_progress -= regions_needed;

spin_unlock(&resv->lock);
@@ -380,7 +413,8 @@ static long region_chg(struct resv_map *resv, long f, long 
t,
spin_lock(&resv->lock);

/* Count how many hugepages in this range are NOT respresented. */
-   chg = add_reservation_in_range(resv, f, t, 

[PATCH v4 0/9] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-10 Thread Mina Almasry
 is not hugepage
- HUGETLB_ELFMAP=RW HUGETLB_MINIMAL_COPY=no linkhuge_rw (2M: 32):
  FAIL small_data is not hugepage
- alloc-instantiate-race shared (2M: 32):
  Bad configuration: sched_setaffinity(cpu1): Invalid argument -
  FAIL Child 1 killed by signal Killed
- shmoverride_linked (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- shmoverride_linked_static (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- HUGETLB_SHM=yes shmoverride_linked_static (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so shmoverride_unlinked (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument
- LD_PRELOAD=libhugetlbfs.so HUGETLB_SHM=yes shmoverride_unlinked (2M: 32):
  FAIL shmget failed size 2097152 from line 176: Invalid argument

Signed-off-by: Mina Almasry 

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Changes in v4:
- Split up 'hugetlb_cgroup: add accounting for shared mappings' into 4 patches
  for better isolation and context on the individual changes:
  - hugetlb_cgroup: add accounting for shared mappings
  - hugetlb: disable region_add file_region coalescing
  - hugetlb: remove duplicated code
  - hugetlb: region_chg provides only cache entry
- Fixed resv->adds_in_progress accounting.
- Retained behavior that region_add never fails, in earlier patchsets region_add
  could return failure.
- Fixed libhugetlbfs failure.
- Minor fix to the added tests that was preventing them from running on some
  environments.

Changes in v3:
- Addressed comments of Hillf Danton:
  - Added docs.
  - cgroup_files now uses enum.
  - Various readability improvements.
- Addressed comments of Mike Kravetz.
  - region_* functions no longer coalesce file_region entries in the resv_map.
  - region_add() and region_chg() refactored to make them much easier to
understand and remove duplicated code so this patch doesn't add too much
complexity.
  - Refactored common functionality into helpers.

Changes in v2:
- Split the patch into a 5 patch series.
- Fixed patch subject.

Mina Almasry (9):
  hugetlb_cgroup: Add hugetlb_cgroup reservation counter
  hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  hugetlb_cgroup: add reservation accounting for private mappings
  hugetlb: region_chg provides only cache entry
  hugetlb: remove duplicated code
  hugetlb: disable region_add file_region coalescing
  hugetlb_cgroup: add accounting for shared mappings
  hugetlb_cgroup: Add hugetlb_cgroup reservation tests
  hugetlb_cgroup: Add hugetlb_cgroup reservation docs

 .../admin-guide/cgroup-v1/hugetlb.rst |  84 ++-
 include/linux/hugetlb.h   |  24 +-
 include/linux/hugetlb_cgroup.h|  24 +-
 mm/hugetlb.c  | 516 +++---
 mm/hugetlb_cgroup.c   | 189 +--
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   4 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 440 +++
 .../selftests/vm/write_hugetlb_memory.sh  |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 252 +
 10 files changed, 1304 insertions(+), 252 deletions(-)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

--
2.23.0.162.g0b9fbb3734-goog


[PATCH v4 8/9] hugetlb_cgroup: Add hugetlb_cgroup reservation tests

2019-09-10 Thread Mina Almasry
The tests use both shared and private mapped hugetlb memory, and
monitors the hugetlb usage counter as well as the hugetlb reservation
counter. They test different configurations such as hugetlb memory usage
via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
MAP_POPULATE.

Signed-off-by: Mina Almasry 
---
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   4 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 440 ++
 .../selftests/vm/write_hugetlb_memory.sh  |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++
 5 files changed, 719 insertions(+)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

diff --git a/tools/testing/selftests/vm/.gitignore 
b/tools/testing/selftests/vm/.gitignore
index 31b3c98b6d34d..d3bed9407773c 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -14,3 +14,4 @@ virtual_address_range
 gup_benchmark
 va_128TBswitch
 map_fixed_noreplace
+write_to_hugetlbfs
diff --git a/tools/testing/selftests/vm/Makefile 
b/tools/testing/selftests/vm/Makefile
index 9534dc2bc9295..8d37d5409b52c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
 TEST_GEN_FILES += va_128TBswitch
 TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += write_to_hugetlbfs

 TEST_PROGS := run_vmtests

@@ -29,3 +30,6 @@ include ../lib.mk
 $(OUTPUT)/userfaultfd: LDLIBS += -lpthread

 $(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+
+# Why does adding $(OUTPUT)/ like above not apply this flag..?
+write_to_hugetlbfs: CFLAGS += -static
diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh 
b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
new file mode 100755
index 0..09e90e8f6fab4
--- /dev/null
+++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
@@ -0,0 +1,440 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+cgroup_path=/dev/cgroup/memory
+if [[ ! -e $cgroup_path ]]; then
+  mkdir -p $cgroup_path
+  mount -t cgroup -o hugetlb,memory cgroup $cgroup_path
+fi
+
+cleanup () {
+   echo $$ > $cgroup_path/tasks
+
+   set +e
+   if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
+ kill -2 write_to_hugetlbfs
+ # Wait for hugetlbfs memory to get depleted.
+ sleep 0.5
+   fi
+   set -e
+
+   if [[ -e /mnt/huge ]]; then
+ rm -rf /mnt/huge/*
+ umount /mnt/huge || echo error
+ rmdir /mnt/huge
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test1
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test2
+   fi
+   echo 0 > /proc/sys/vm/nr_hugepages
+   echo CLEANUP DONE
+}
+
+cleanup
+
+function expect_equal() {
+  local expected="$1"
+  local actual="$2"
+  local error="$3"
+
+  if [[ "$expected" != "$actual" ]]; then
+   echo "expected ($expected) != actual ($actual): $3"
+   cleanup
+   exit 1
+  fi
+}
+
+function setup_cgroup() {
+  local name="$1"
+  local cgroup_limit="$2"
+  local reservation_limit="$3"
+
+  mkdir $cgroup_path/$name
+
+  echo writing cgroup limit: "$cgroup_limit"
+  echo "$cgroup_limit" > $cgroup_path/$name/hugetlb.2MB.limit_in_bytes
+
+  echo writing reseravation limit: "$reservation_limit"
+  echo "$reservation_limit" > \
+   $cgroup_path/$name/hugetlb.2MB.reservation_limit_in_bytes
+  echo 0 > $cgroup_path/$name/cpuset.cpus
+  echo 0 > $cgroup_path/$name/cpuset.mems
+}
+
+function write_hugetlbfs_and_get_usage() {
+  local cgroup="$1"
+  local size="$2"
+  local populate="$3"
+  local write="$4"
+  local path="$5"
+  local method="$6"
+  local private="$7"
+  local expect_failure="$8"
+
+  # Function return values.
+  reservation_failed=0
+  oom_killed=0
+  hugetlb_difference=0
+  reserved_difference=0
+
+  local hugetlb_usage=$cgroup_path/$cgroup/hugetlb.2MB.usage_in_bytes
+  local 
reserved_usage=$cgroup_path/$cgroup/hugetlb.2MB.reservation_usage_in_bytes
+
+  local hugetlb_before=$(cat $hugetlb_usage)
+  local reserved_before=$(cat $reserved_us

[PATCH v4 1/9] hugetlb_cgroup: Add hugetlb_cgroup reservation counter

2019-09-10 Thread Mina Almasry
These counters will track hugetlb reservations rather than hugetlb
memory faulted in. This patch only adds the counter, following patches
add the charging and uncharging of the counter.

Signed-off-by: Mina Almasry 
Acked-by: Hillf Danton 
---
 include/linux/hugetlb.h |  16 +-
 mm/hugetlb_cgroup.c | 111 ++--
 2 files changed, 100 insertions(+), 27 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edfca42783192..128ff1aff1c93 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -320,6 +320,20 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, 
unsigned long addr,

 #ifdef CONFIG_HUGETLB_PAGE

+enum {
+   HUGETLB_RES_USAGE,
+   HUGETLB_RES_RESERVATION_USAGE,
+   HUGETLB_RES_LIMIT,
+   HUGETLB_RES_RESERVATION_LIMIT,
+   HUGETLB_RES_MAX_USAGE,
+   HUGETLB_RES_RESERVATION_MAX_USAGE,
+   HUGETLB_RES_FAILCNT,
+   HUGETLB_RES_RESERVATION_FAILCNT,
+   HUGETLB_RES_NULL,
+   HUGETLB_RES_MAX,
+};
+
+
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
@@ -340,7 +354,7 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 #ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
-   struct cftype cgroup_files[5];
+   struct cftype cgroup_files[HUGETLB_RES_MAX];
 #endif
char name[HSTATE_NAME_LEN];
 };
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 68c2f2f3c05b7..51a72624bd1ff 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -25,6 +25,10 @@ struct hugetlb_cgroup {
 * the counter to account for hugepages from hugetlb.
 */
struct page_counter hugepage[HUGE_MAX_HSTATE];
+   /*
+* the counter to account for hugepage reservations from hugetlb.
+*/
+   struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
 };

 #define MEMFILE_PRIVATE(x, val)(((x) << 16) | (val))
@@ -33,6 +37,15 @@ struct hugetlb_cgroup {

 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

+static inline
+struct page_counter *hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, 
int idx,
+bool reserved)
+{
+   if (reserved)
+   return  &h_cg->reserved_hugepage[idx];
+   return &h_cg->hugepage[idx];
+}
+
 static inline
 struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
 {
@@ -254,30 +267,33 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned 
long nr_pages,
return;
 }

-enum {
-   RES_USAGE,
-   RES_LIMIT,
-   RES_MAX_USAGE,
-   RES_FAILCNT,
-};
-
 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
   struct cftype *cft)
 {
struct page_counter *counter;
+   struct page_counter *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+   reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];

switch (MEMFILE_ATTR(cft->private)) {
-   case RES_USAGE:
+   case HUGETLB_RES_USAGE:
return (u64)page_counter_read(counter) * PAGE_SIZE;
-   case RES_LIMIT:
+   case HUGETLB_RES_RESERVATION_USAGE:
+   return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
+   case HUGETLB_RES_LIMIT:
return (u64)counter->max * PAGE_SIZE;
-   case RES_MAX_USAGE:
+   case HUGETLB_RES_RESERVATION_LIMIT:
+   return (u64)reserved_counter->max * PAGE_SIZE;
+   case HUGETLB_RES_MAX_USAGE:
return (u64)counter->watermark * PAGE_SIZE;
-   case RES_FAILCNT:
+   case HUGETLB_RES_RESERVATION_MAX_USAGE:
+   return (u64)reserved_counter->watermark * PAGE_SIZE;
+   case HUGETLB_RES_FAILCNT:
return counter->failcnt;
+   case HUGETLB_RES_RESERVATION_FAILCNT:
+   return reserved_counter->failcnt;
default:
BUG();
}
@@ -291,6 +307,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file 
*of,
int ret, idx;
unsigned long nr_pages;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
+   bool reserved = false;

if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
return -EINVAL;
@@ -304,9 +321,13 @@ static ssize_t hugetlb_cgroup_write(struct 
kernfs_open_file *of,
nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));

switch (MEMFILE_ATTR(of_cft(of)->private)) {
-   case RES_LIMIT:
+   case HUGETLB_RES_RESERVATION_LIMIT:
+   reserved = true;
+   /* Fall through. */
+   case HUGETLB_RES_LIMIT:
mutex_lock(&hugetlb_limit_mutex);
-   ret = page_counter_

[PATCH v4 4/9] hugetlb: region_chg provides only cache entry

2019-09-10 Thread Mina Almasry
Current behavior is that region_chg provides both a cache entry in
resv->region_cache, AND a placeholder entry in resv->regions. region_add
first tries to use the placeholder, and if it finds that the placeholder
has been deleted by a racing region_del call, it uses the cache entry.

This behavior is completely unnecessary and is removed in this patch for
a couple of reasons:

1. region_add needs to either find a cached file_region entry in
   resv->region_cache, or find an entry in resv->regions to expand. It
   does not need both.
2. region_chg adding a placeholder entry in resv->regions opens up
   a possible race with region_del, where region_chg adds a placeholder
   region in resv->regions, and this region is deleted by a racing call
   to region_del during region_chg execution or before region_add is
   called. Removing the race makes the code easier to reason about and
   maintain.

In addition, a follow up patch in this series disables region
coalescing, which would be further complicated if the race with
region_del exists.

Signed-off-by: Mina Almasry 
---
 mm/hugetlb.c | 63 +---
 1 file changed, 11 insertions(+), 52 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fbd7c52e17348..bea51ae422f63 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -246,14 +246,10 @@ struct file_region {

 /*
  * Add the huge page range represented by [f, t) to the reserve
- * map.  In the normal case, existing regions will be expanded
- * to accommodate the specified range.  Sufficient regions should
- * exist for expansion due to the previous call to region_chg
- * with the same range.  However, it is possible that region_del
- * could have been called after region_chg and modifed the map
- * in such a way that no region exists to be expanded.  In this
- * case, pull a region descriptor from the cache associated with
- * the map and use that for the new range.
+ * map.  Existing regions will be expanded to accommodate the specified
+ * range, or a region will be taken from the cache.  Sufficient regions
+ * must exist in the cache due to the previous call to region_chg with
+ * the same range.
  *
  * Return the number of new huge pages added to the map.  This
  * number is greater than or equal to zero.
@@ -272,9 +268,8 @@ static long region_add(struct resv_map *resv, long f, long 
t)

/*
 * If no region exists which can be expanded to include the
-* specified range, the list must have been modified by an
-* interleving call to region_del().  Pull a region descriptor
-* from the cache and use it for this range.
+* specified range, pull a region descriptor from the cache
+* and use it for this range.
 */
if (&rg->link == head || t < rg->from) {
VM_BUG_ON(resv->region_cache_count <= 0);
@@ -339,15 +334,9 @@ static long region_add(struct resv_map *resv, long f, long 
t)
  * call to region_add that will actually modify the reserve
  * map to add the specified range [f, t).  region_chg does
  * not change the number of huge pages represented by the
- * map.  However, if the existing regions in the map can not
- * be expanded to represent the new range, a new file_region
- * structure is added to the map as a placeholder.  This is
- * so that the subsequent region_add call will have all the
- * regions it needs and will not fail.
- *
- * Upon entry, region_chg will also examine the cache of region descriptors
- * associated with the map.  If there are not enough descriptors cached, one
- * will be allocated for the in progress add operation.
+ * map.  A new file_region structure is added to the cache
+ * as a placeholder, so that the subsequent region_add
+ * call will have all the regions it needs and will not fail.
  *
  * Returns the number of huge pages that need to be added to the existing
  * reservation map for the range [f, t).  This number is greater or equal to
@@ -357,10 +346,9 @@ static long region_add(struct resv_map *resv, long f, long 
t)
 static long region_chg(struct resv_map *resv, long f, long t)
 {
struct list_head *head = &resv->regions;
-   struct file_region *rg, *nrg = NULL;
+   struct file_region *rg;
long chg = 0;

-retry:
spin_lock(&resv->lock);
 retry_locked:
resv->adds_in_progress++;
@@ -378,10 +366,8 @@ static long region_chg(struct resv_map *resv, long f, long 
t)
spin_unlock(&resv->lock);

trg = kmalloc(sizeof(*trg), GFP_KERNEL);
-   if (!trg) {
-   kfree(nrg);
+   if (!trg)
return -ENOMEM;
-   }

spin_lock(&resv->lock);
list_add(&trg->link, &resv->region_cache);
@@ -394,28 +380,6 @@ static long region_chg(struct resv_map *resv, long f, long 
t)
if (f <= rg->to)

[PATCH v4 6/9] hugetlb: disable region_add file_region coalescing

2019-09-10 Thread Mina Almasry
A follow-up patch in this series adds hugetlb cgroup uncharge info to the
file_region entries in resv->regions. The cgroup uncharge info may
differ for different regions, so they can no longer be coalesced at
region_add time. So, disable region coalescing in region_add in this
patch.

Behavior change:

Say a resv_map exists like this [0->1], [2->3], and [5->6].

Then a region_chg/add call comes in region_chg/add(f=0, t=5).

Old code would generate resv->regions: [0->5], [5->6].
New code would generate resv->regions: [0->1], [1->2], [2->3], [3->5],
[5->6].
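
For what it's worth, the expected list shape above can be reproduced with a
tiny stand-alone user-space model (simplified types, not the kernel
structures or the actual region_add() implementation):

#include <stdio.h>

struct region { long from, to; };

static long min_l(long a, long b) { return a < b ? a : b; }

int main(void)
{
	/* Existing resv_map entries from the example above. */
	struct region existing[] = { { 0, 1 }, { 2, 3 }, { 5, 6 } };
	struct region result[8];
	int nr = 0;
	long f = 0, t = 5, last = f;

	/* New behavior: fill each gap in [f, t) with its own entry and keep
	 * the existing entries as-is, instead of merging into one region. */
	for (unsigned i = 0; i < 3; i++) {
		if (existing[i].from > last && last < t)
			result[nr++] = (struct region){ last,
					min_l(existing[i].from, t) };
		result[nr++] = existing[i];
		if (existing[i].to > last)
			last = existing[i].to;
	}
	if (last < t)
		result[nr++] = (struct region){ last, t };

	for (int i = 0; i < nr; i++)
		printf("[%ld->%ld] ", result[i].from, result[i].to);
	printf("\n");	/* prints: [0->1] [1->2] [2->3] [3->5] [5->6] */
	return 0;
}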

Special care needs to be taken to handle the resv->adds_in_progress
variable correctly. In the past, only 1 region would be added for every
region_chg and region_add call. But now, each call may add multiple
regions, so we can no longer increment adds_in_progress by 1 in region_chg,
or decrement adds_in_progress by 1 after region_add or region_abort. Instead,
region_chg calls add_reservation_in_range() to count the number of regions
needed and allocates those, and that info is passed to region_add and
region_abort to decrement adds_in_progress correctly.

Signed-off-by: Mina Almasry 
---
 mm/hugetlb.c | 279 ++-
 1 file changed, 167 insertions(+), 112 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ce5ed1056fefd..5eca34d9b753d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -244,55 +244,80 @@ struct file_region {
long to;
 };

+/* Helper that removes a struct file_region from the resv_map cache and returns
+ * it for use.
+ */
+static struct file_region *get_file_region_entry_from_cache(
+   struct resv_map *resv, long from, long to);
+
 static long add_reservation_in_range(
-   struct resv_map *resv, long f, long t, bool count_only)
+   struct resv_map *resv,
+   long f, long t,
+   long *regions_needed,
+   bool count_only)
 {
-
-   long chg = 0;
+   long add = 0;
struct list_head *head = &resv->regions;
+   long last_accounted_offset = f;
struct file_region *rg = NULL, *trg = NULL, *nrg = NULL;

-   /* Locate the region we are before or in. */
-   list_for_each_entry(rg, head, link)
-   if (f <= rg->to)
-   break;
-
-   /* Round our left edge to the current segment if it encloses us. */
-   if (f > rg->from)
-   f = rg->from;
+   if (regions_needed)
+   *regions_needed = 0;

-   chg = t - f;
+   /* In this loop, we essentially handle an entry for the range
+* last_accounted_offset -> rg->from, at every iteration, with some
+* bounds checking.
+*/
+   list_for_each_entry_safe(rg, trg, head, link) {
+   /* Skip irrelevant regions that start before our range. */
+   if (rg->from < f) {
+   /* If this region ends after the last accounted offset,
+* then we need to update last_accounted_offset.
+*/
+   if (rg->to > last_accounted_offset)
+   last_accounted_offset = rg->to;
+   continue;
+   }

-   /* Check for and consume any regions we now overlap with. */
-   nrg = rg;
-   list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
-   if (&rg->link == head)
-   break;
+   /* When we find a region that starts beyond our range, we've
+* finished.
+*/
if (rg->from > t)
break;

-   /* We overlap with this area, if it extends further than
-* us then we must extend ourselves.  Account for its
-* existing reservation.
+   /* Add an entry for last_accounted_offset -> rg->from, and
+* update last_accounted_offset.
 */
-   if (rg->to > t) {
-   chg += rg->to - t;
-   t = rg->to;
+   if (rg->from > last_accounted_offset) {
+   add += rg->from - last_accounted_offset;
+   if (!count_only) {
+   nrg = get_file_region_entry_from_cache(resv,
+   last_accounted_offset,
+   rg->from);
+   list_add(&nrg->link, rg->link.prev);
+   } else if (regions_needed)
+   *regions_needed += 1;
}
-   chg -= rg->to - rg->from;

-   if (!count_only && rg != nrg) {
-   list_del(&rg->link);
-   kfree(

Re: [PATCH v3 0/6] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-05 Thread Mina Almasry
On Tue, Sep 3, 2019 at 4:46 PM Mike Kravetz  wrote:
>
> On 9/3/19 10:57 AM, Mike Kravetz wrote:
> > On 8/29/19 12:18 AM, Michal Hocko wrote:
> >> [Cc cgroups maintainers]
> >>
> >> On Wed 28-08-19 10:58:00, Mina Almasry wrote:
> >>> On Wed, Aug 28, 2019 at 4:23 AM Michal Hocko  wrote:
> >>>>
> >>>> On Mon 26-08-19 16:32:34, Mina Almasry wrote:
> >>>>>  mm/hugetlb.c  | 493 --
> >>>>>  mm/hugetlb_cgroup.c   | 187 +--
> >>>>
> >>>> This is a lot of changes to an already subtle code which hugetlb
> >>>> reservations undoubly are.
> >>>
> >>> For what it's worth, I think this patch series is a net decrease in
> >>> the complexity of the reservation code, especially the region_*
> >>> functions, which is where a lot of the complexity lies. I removed the
> >>> race between region_del and region_{add|chg}, refactored the main
> >>> logic into smaller code, moved common code to helpers and deleted the
> >>> duplicates, and finally added lots of comments to the hard to
> >>> understand pieces. I hope that when folks review the changes they will
> >>> see that! :)
> >>
> >> Post those improvements as standalone patches and sell them as
> >> improvements. We can talk about the net additional complexity of the
> >> controller much easier then.
> >
> > All such changes appear to be in patch 4 of this series.  The commit message
> > says "region_add() and region_chg() are heavily refactored to in this commit
> > to make the code easier to understand and remove duplication.".  However, 
> > the
> > modifications were also added to accommodate the new cgroup reservation
> > accounting.  I think it would be helpful to explain why the existing code 
> > does
> > not work with the new accounting.  For example, one change is because
> > "existing code coalesces resv_map entries for shared mappings.  new cgroup
> > accounting requires that resv_map entries be kept separate for proper
> > uncharging."
> >
> > I am starting to review the changes, but it would help if there was a high
> > level description.  I also like Michal's idea of calling out the region_*
> > changes separately.  If not a standalone patch, at least the first patch of
> > the series.  This new code will be exercised even if cgroup reservation
> > accounting not enabled, so it is very important than no subtle regressions
> > be introduced.
>
> While looking at the region_* changes, I started thinking about this no
> coalesce change for shared mappings which I think is necessary.  Am I
> mistaken, or is this a requirement?
>

No coalesce is a requirement, yes. The idea is that task A can reserve
range [0-1], and task B can reserve range [1-2]. We want the code to
put in 2 regions:

1. [0-1], with cgroup information that points to task A's cgroup.
2. [1-2], with cgroup information that points to task B's cgroup.

If coalescing is happening, then you end up with one region [0-2] with
cgroup information for only one of those cgroups, and the wrong cgroup gets
uncharged when the reservation is freed.

Technically we can still coalesce if the cgroup information is the
same and I can do that, but the region_* code becomes more
complicated, and you mentioned on an earlier patchset that you were
concerned with how complicated the region_* functions are as is.
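
To make that concrete, here is a minimal user-space model; struct
page_counter and struct file_region below are simplified stand-ins for the
kernel definitions, just to show why each entry needs its own uncharge
pointer:

#include <stdio.h>

/* Simplified stand-ins for the kernel page_counter / file_region. */
struct page_counter { const char *owner; };

struct file_region {
	long from, to;
	struct page_counter *reservation_counter;	/* who to uncharge */
};

int main(void)
{
	struct page_counter cgroup_a = { "task A's cgroup" };
	struct page_counter cgroup_b = { "task B's cgroup" };

	/*
	 * Without coalescing, [0, 1) and [1, 2) stay as two entries, so each
	 * reservation is uncharged from the cgroup that made it.  Coalescing
	 * them into one [0, 2) entry would lose one of the two pointers.
	 */
	struct file_region regions[] = {
		{ 0, 1, &cgroup_a },
		{ 1, 2, &cgroup_b },
	};

	for (unsigned i = 0; i < 2; i++)
		printf("[%ld, %ld) uncharges %s\n", regions[i].from,
		       regions[i].to, regions[i].reservation_counter->owner);
	return 0;
}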

> If it is a requirement, then think about some of the possible scenarios
> such as:
> - There is a hugetlbfs file of size 10 huge pages.
> - Task A has reservations for pages at offset 1 3 5 7 and 9
> - Task B then mmaps the entire file which should result in reservations
>   at 0 2 4 6 and 8.
> - region_chg will return 5, but will also need to allocate 5 resv_map
>   entries for the subsequent region_add which can not fail.  Correct?
>   The code does not appear to handle this.
>

I thought the code did handle this. region_chg calls
allocate_enough_cache_for_range_and_lock(), which in this scenario
will put 5 entries in resv_map->region_cache. region_add will use
these 5 region_cache entries to do its business.

I'll add a test in my suite to test this case to make sure.
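
To double-check the counting part of that scenario, here is a quick
stand-alone model of the count_only walk (user-space code with made-up
helper names, not the kernel functions):

#include <stdio.h>

struct file_region { long from, to; };

/* count_only walk: how many uncovered sub-ranges (i.e. new entries) does
 * [f, t) need, given sorted, non-overlapping existing regions? */
static long count_regions_needed(const struct file_region *rg, int nr,
				 long f, long t)
{
	long last = f, needed = 0;

	for (int i = 0; i < nr && rg[i].from < t; i++) {
		if (rg[i].to <= last)
			continue;
		if (rg[i].from > last)
			needed++;		/* gap [last, rg[i].from) */
		last = rg[i].to;
	}
	if (last < t)
		needed++;			/* trailing gap [last, t) */
	return needed;
}

int main(void)
{
	/* Task A already reserved pages 1, 3, 5, 7 and 9. */
	struct file_region existing[] = {
		{ 1, 2 }, { 3, 4 }, { 5, 6 }, { 7, 8 }, { 9, 10 },
	};

	/* Task B mmaps the whole 10-page file: prints 5 new entries needed. */
	printf("%ld\n", count_regions_needed(existing, 5, 0, 10));
	return 0;
}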

> BTW, this series will BUG when running libhugetlbfs test suite.  It will
> hit this in resv_map_release().
>
> VM_BUG_ON(resv_map->adds_in_progress);
>

Sorry about that, I've been having trouble running the libhugetlbfs
tests, but I'm still on it. I'll get to the bottom of this by next
patch series.

> --
> Mike Kravetz


Re: [PATCH v3 0/6] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-09-05 Thread Mina Almasry
On Tue, Sep 3, 2019 at 10:58 AM Mike Kravetz  wrote:
>
> On 8/29/19 12:18 AM, Michal Hocko wrote:
> > [Cc cgroups maintainers]
> >
> > On Wed 28-08-19 10:58:00, Mina Almasry wrote:
> >> On Wed, Aug 28, 2019 at 4:23 AM Michal Hocko  wrote:
> >>>
> >>> On Mon 26-08-19 16:32:34, Mina Almasry wrote:
> >>>>  mm/hugetlb.c  | 493 --
> >>>>  mm/hugetlb_cgroup.c   | 187 +--
> >>>
> >>> This is a lot of changes to an already subtle code which hugetlb
> >>> reservations undoubly are.
> >>
> >> For what it's worth, I think this patch series is a net decrease in
> >> the complexity of the reservation code, especially the region_*
> >> functions, which is where a lot of the complexity lies. I removed the
> >> race between region_del and region_{add|chg}, refactored the main
> >> logic into smaller code, moved common code to helpers and deleted the
> >> duplicates, and finally added lots of comments to the hard to
> >> understand pieces. I hope that when folks review the changes they will
> >> see that! :)
> >
> > Post those improvements as standalone patches and sell them as
> > improvements. We can talk about the net additional complexity of the
> > controller much easier then.
>
> All such changes appear to be in patch 4 of this series.  The commit message
> says "region_add() and region_chg() are heavily refactored to in this commit
> to make the code easier to understand and remove duplication.".  However, the
> modifications were also added to accommodate the new cgroup reservation
> accounting.  I think it would be helpful to explain why the existing code does
> not work with the new accounting.  For example, one change is because
> "existing code coalesces resv_map entries for shared mappings.  new cgroup
> accounting requires that resv_map entries be kept separate for proper
> uncharging."
>
> I am starting to review the changes, but it would help if there was a high
> level description.  I also like Michal's idea of calling out the region_*
> changes separately.  If not a standalone patch, at least the first patch of
> the series.  This new code will be exercised even if cgroup reservation
> accounting not enabled, so it is very important than no subtle regressions
> be introduced.
>

Yep, it seems I'm not calling out these changes as clearly as I should.
I'll look into breaking them out into separate patches. I'll probably put
them in a separate patch, or right behind patch 4 of the current series,
since they are really done to make removing the coalescing a bit easier. Let
me look into that.

> --
> Mike Kravetz


Re: [PATCH v3 0/6] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-08-28 Thread Mina Almasry
On Wed, Aug 28, 2019 at 4:23 AM Michal Hocko  wrote:
>
> On Mon 26-08-19 16:32:34, Mina Almasry wrote:
> >  mm/hugetlb.c  | 493 --
> >  mm/hugetlb_cgroup.c   | 187 +--
>
> This is a lot of changes to an already subtle code which hugetlb
> reservations undoubly are.

For what it's worth, I think this patch series is a net decrease in
the complexity of the reservation code, especially the region_*
functions, which is where a lot of the complexity lies. I removed the
race between region_del and region_{add|chg}, refactored the main
logic into smaller code, moved common code to helpers and deleted the
duplicates, and finally added lots of comments to the hard to
understand pieces. I hope that when folks review the changes they will
see that! :)

> Moreover cgroupv1 is feature frozen and I am
> not aware of any plans to port the controller to v2.

Also for what it's worth, if porting the controller to v2 is a
prerequisite for taking this, I'm happy to do that. As far as I understand
there is no reason hugetlb_cgroups shouldn't be in cgroups v2, and we
see value in them.

> That all doesn't
> sound in favor of this change. Mike is the maintainer of the hugetlb
> code so I will defer to him to make a decision but I wouldn't recommend
> that.
> --
> Michal Hocko
> SUSE Labs


[PATCH v3 5/6] hugetlb_cgroup: Add hugetlb_cgroup reservation tests

2019-08-26 Thread Mina Almasry
The tests use both shared and private mapped hugetlb memory, and
monitors the hugetlb usage counter as well as the hugetlb reservation
counter. They test different configurations such as hugetlb memory usage
via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
MAP_POPULATE.

---
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   4 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 438 ++
 .../selftests/vm/write_hugetlb_memory.sh  |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++
 5 files changed, 717 insertions(+)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

diff --git a/tools/testing/selftests/vm/.gitignore 
b/tools/testing/selftests/vm/.gitignore
index 31b3c98b6d34d..d3bed9407773c 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -14,3 +14,4 @@ virtual_address_range
 gup_benchmark
 va_128TBswitch
 map_fixed_noreplace
+write_to_hugetlbfs
diff --git a/tools/testing/selftests/vm/Makefile 
b/tools/testing/selftests/vm/Makefile
index 9534dc2bc9295..8d37d5409b52c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
 TEST_GEN_FILES += va_128TBswitch
 TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += write_to_hugetlbfs

 TEST_PROGS := run_vmtests

@@ -29,3 +30,6 @@ include ../lib.mk
 $(OUTPUT)/userfaultfd: LDLIBS += -lpthread

 $(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+
+# Why does adding $(OUTPUT)/ like above not apply this flag..?
+write_to_hugetlbfs: CFLAGS += -static
diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh 
b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
new file mode 100755
index 0..bf0b6dcec9977
--- /dev/null
+++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
@@ -0,0 +1,438 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+cgroup_path=/dev/cgroup/memory
+if [[ ! -e $cgroup_path ]]; then
+  mkdir -p $cgroup_path
+  mount -t cgroup -o hugetlb,memory cgroup $cgroup_path
+fi
+
+cleanup () {
+   echo $$ > $cgroup_path/tasks
+
+   set +e
+   if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
+ kill -2 write_to_hugetlbfs
+ # Wait for hugetlbfs memory to get depleted.
+ sleep 0.5
+   fi
+   set -e
+
+   if [[ -e /mnt/huge ]]; then
+ rm -rf /mnt/huge/*
+ umount /mnt/huge || echo error
+ rmdir /mnt/huge
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test1
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test2
+   fi
+   echo 0 > /proc/sys/vm/nr_hugepages
+   echo CLEANUP DONE
+}
+
+cleanup
+
+function expect_equal() {
+  local expected="$1"
+  local actual="$2"
+  local error="$3"
+
+  if [[ "$expected" != "$actual" ]]; then
+   echo "expected ($expected) != actual ($actual): $3"
+   cleanup
+   exit 1
+  fi
+}
+
+function setup_cgroup() {
+  local name="$1"
+  local cgroup_limit="$2"
+  local reservation_limit="$3"
+
+  mkdir $cgroup_path/$name
+
+  echo writing cgroup limit: "$cgroup_limit"
+  echo "$cgroup_limit" > $cgroup_path/$name/hugetlb.2MB.limit_in_bytes
+
+  echo writing reseravation limit: "$reservation_limit"
+  echo "$reservation_limit" > \
+   $cgroup_path/$name/hugetlb.2MB.reservation_limit_in_bytes
+}
+
+function write_hugetlbfs_and_get_usage() {
+  local cgroup="$1"
+  local size="$2"
+  local populate="$3"
+  local write="$4"
+  local path="$5"
+  local method="$6"
+  local private="$7"
+  local expect_failure="$8"
+
+  # Function return values.
+  reservation_failed=0
+  oom_killed=0
+  hugetlb_difference=0
+  reserved_difference=0
+
+  local hugetlb_usage=$cgroup_path/$cgroup/hugetlb.2MB.usage_in_bytes
+  local 
reserved_usage=$cgroup_path/$cgroup/hugetlb.2MB.reservation_usage_in_bytes
+
+  local hugetlb_before=$(cat $hugetlb_usage)
+  local reserved_before=$(cat $reserved_usage)
+
+  echo
+  echo Starting:
+  echo hugetlb_usage="$hugetlb_before"
+  echo reserved_usage="$reserved_before"
+  echo expect_failure is "$expect_failure"
+
+  set +e
+  if [[ "$method" == "1" ]] || [[ "$method" == 2 ]] || \
+   [[ "$private" == "-r" ]] && [[ "$expect_failure" != 1 ]]; then
+   bash write_hugetlb_memory.sh "$size" "

[PATCH v3 6/6] hugetlb_cgroup: Add hugetlb_cgroup reservation docs

2019-08-26 Thread Mina Almasry
Add docs for how to use hugetlb_cgroup reservations, and their behavior.

---
 .../admin-guide/cgroup-v1/hugetlb.rst | 84 ---
 1 file changed, 73 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst 
b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index a3902aa253a96..cc6eb859fc722 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -2,13 +2,6 @@
 HugeTLB Controller
 ==

-The HugeTLB controller allows to limit the HugeTLB usage per control group and
-enforces the controller limit during page fault. Since HugeTLB doesn't
-support page reclaim, enforcing the limit at page fault time implies that,
-the application will get SIGBUS signal if it tries to access HugeTLB pages
-beyond its limit. This requires the application to know beforehand how much
-HugeTLB pages it would require for its use.
-
 HugeTLB controller can be created by first mounting the cgroup filesystem.

 # mount -t cgroup -o hugetlb none /sys/fs/cgroup
@@ -28,10 +21,14 @@ process (bash) into it.

 Brief summary of control files::

- hugetlb.<hugepagesize>.limit_in_bytes                 # set/show limit of "hugepagesize" hugetlb usage
- hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb usage recorded
- hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
- hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB limit
+ hugetlb.<hugepagesize>.reservation_limit_in_bytes     # set/show limit of "hugepagesize" hugetlb reservations
+ hugetlb.<hugepagesize>.reservation_max_usage_in_bytes # show max "hugepagesize" hugetlb reservations recorded
+ hugetlb.<hugepagesize>.reservation_usage_in_bytes     # show current reservations for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.reservation_failcnt            # show the number of allocation failure due to HugeTLB reservation limit
+ hugetlb.<hugepagesize>.limit_in_bytes                 # set/show limit of "hugepagesize" hugetlb faults
+ hugetlb.<hugepagesize>.max_usage_in_bytes             # show max "hugepagesize" hugetlb usage recorded
+ hugetlb.<hugepagesize>.usage_in_bytes                 # show current usage for "hugepagesize" hugetlb
+ hugetlb.<hugepagesize>.failcnt                        # show the number of allocation failure due to HugeTLB usage limit

 For a system supporting three hugepage sizes (64k, 32M and 1G), the control
 files include::
@@ -40,11 +37,76 @@ files include::
   hugetlb.1GB.max_usage_in_bytes
   hugetlb.1GB.usage_in_bytes
   hugetlb.1GB.failcnt
+  hugetlb.1GB.reservation_limit_in_bytes
+  hugetlb.1GB.reservation_max_usage_in_bytes
+  hugetlb.1GB.reservation_usage_in_bytes
+  hugetlb.1GB.reservation_failcnt
   hugetlb.64KB.limit_in_bytes
   hugetlb.64KB.max_usage_in_bytes
   hugetlb.64KB.usage_in_bytes
   hugetlb.64KB.failcnt
+  hugetlb.64KB.reservation_limit_in_bytes
+  hugetlb.64KB.reservation_max_usage_in_bytes
+  hugetlb.64KB.reservation_usage_in_bytes
+  hugetlb.64KB.reservation_failcnt
   hugetlb.32MB.limit_in_bytes
   hugetlb.32MB.max_usage_in_bytes
   hugetlb.32MB.usage_in_bytes
   hugetlb.32MB.failcnt
+  hugetlb.32MB.reservation_limit_in_bytes
+  hugetlb.32MB.reservation_max_usage_in_bytes
+  hugetlb.32MB.reservation_usage_in_bytes
+  hugetlb.32MB.reservation_failcnt
+
+
+1. Reservation limits
+
+The HugeTLB controller allows to limit the HugeTLB reservations per control
+group and enforces the controller limit at reservation time. Reservation limits
+are superior to Page fault limits (see section 2), since Reservation limits are
+enforced at reservation time, and never causes the application to get SIGBUS
+signal. Instead, if the application is violating its limits, then it gets an
+error on reservation time, i.e. the mmap or shmget return an error.
+
+
+2. Page fault limits
+
+The HugeTLB controller allows to limit the HugeTLB usage (page fault) per
+control group and enforces the controller limit during page fault. Since 
HugeTLB
+doesn't support page reclaim, enforcing the limit at page fault time implies
+that, the application will get SIGBUS signal if it tries to access HugeTLB
+pages beyond its limit. This requires the application to know beforehand how
+much HugeTLB pages it would require for its use.
+
+
+3. Caveats with shared memory
+
+a. Charging and uncharging:
+
+For shared hugetlb memory, both hugetlb reservation and usage (page faults) are
+charged to the first task that causes the memory to be reserved or faulted,
+and all subsequent uses of this reserved or faulted memory are done without
+charging.
+
+Shared hugetlb memory is only uncharged when it is unreserved or deallocated.
+This is usually when the hugetlbfs file is deleted, and not when the task that
+caused the reservation or fault has exited.
+
+b. Interaction between reservation limit and fault limit.
+
+Generally, it's not recommended to set both the reservation limit and the fault
+limit in a cgroup. For private memory, the fault usage cannot exceed the
+reservation usage, so if you set both, one o

[PATCH v3 4/6] hugetlb_cgroup: add accounting for shared mappings

2019-08-26 Thread Mina Almasry
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.

When a file_region entry is added to the resv_map via region_add, we
also charge the appropriate hugetlb_cgroup and put the pointer to that
in file_region->reservation_counter. This is slightly delicate since we
need to not modify the resv_map until we know that charging the
reservation has succeeded. If charging doesn't succeed, we report the
error to the caller, so that the kernel fails the reservation.

On region_del, which is when the hugetlb memory is unreserved, we delete
the file_region entry in the resv_map, but also uncharge the
file_region->reservation_counter.

region_add() and region_chg() are heavily refactored in this commit
to make the code easier to understand and to remove duplication.

---
 mm/hugetlb.c | 443 ---
 1 file changed, 280 insertions(+), 163 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7c2df7574cf50..953e93359f021 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -242,208 +242,276 @@ struct file_region {
struct list_head link;
long from;
long to;
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* On shared mappings, each reserved region appears as a struct
+* file_region in resv_map. These fields hold the info needed to
+* uncharge each reservation.
+*/
+   struct page_counter *reservation_counter;
+   unsigned long pages_per_hpage;
+#endif
 };

-/*
- * Add the huge page range represented by [f, t) to the reserve
- * map.  In the normal case, existing regions will be expanded
- * to accommodate the specified range.  Sufficient regions should
- * exist for expansion due to the previous call to region_chg
- * with the same range.  However, it is possible that region_del
- * could have been called after region_chg and modifed the map
- * in such a way that no region exists to be expanded.  In this
- * case, pull a region descriptor from the cache associated with
- * the map and use that for the new range.
- *
- * Return the number of new huge pages added to the map.  This
- * number is greater than or equal to zero.
+/* Helper that removes a struct file_region from the resv_map cache and returns
+ * it for use.
  */
-static long region_add(struct resv_map *resv, long f, long t)
+static struct file_region *get_file_region_entry_from_cache(
+   struct resv_map *resv, long from, long to)
 {
-   struct list_head *head = &resv->regions;
-   struct file_region *rg, *nrg, *trg;
-   long add = 0;
+   struct file_region *nrg = NULL;

-   spin_lock(&resv->lock);
-   /* Locate the region we are either in or before. */
-   list_for_each_entry(rg, head, link)
-   if (f <= rg->to)
-   break;
+   VM_BUG_ON(resv->region_cache_count <= 0);

-   /*
-* If no region exists which can be expanded to include the
-* specified range, the list must have been modified by an
-* interleving call to region_del().  Pull a region descriptor
-* from the cache and use it for this range.
-*/
-   if (&rg->link == head || t < rg->from) {
-   VM_BUG_ON(resv->region_cache_count <= 0);
+   resv->region_cache_count--;
+   nrg = list_first_entry(&resv->region_cache, struct file_region,
+   link);
+   VM_BUG_ON(!nrg);
+   list_del(&nrg->link);

-   resv->region_cache_count--;
-   nrg = list_first_entry(&resv->region_cache, struct file_region,
-   link);
-   list_del(&nrg->link);
+   nrg->from = from;
+   nrg->to = to;

-   nrg->from = f;
-   nrg->to = t;
-   list_add(&nrg->link, rg->link.prev);
+   return nrg;
+}

-   add += t - f;
-   goto out_locked;
+/* Helper that records hugetlb_cgroup uncharge info. */
+static void record_hugetlb_cgroup_uncharge_info(struct hugetlb_cgroup *h_cg,
+   struct file_region *nrg, struct hstate *h)
+{
+#ifdef CONFIG_CGROUP_HUGETLB
+   if (h_cg) {
+   nrg->reservation_counter =
+   &h_cg->reserved_hugepage[hstate_index(h)];
+   nrg->pages_per_hpage = pages_per_huge_page(h);
}
+#endif
+}

-   /* Round our left edge to the current segment if it encloses us. */
-   if (f > rg->from)
-   f = rg->from;
+/* Must be called with resv->lock held. Calling this with dry_run == true will
+ * count the number of pages to be added but will not modify the linked list.
+ */
+static long add_reservations_in_range(struct resv_map *resv,
+   struct list_head *head, long f, long t,
+   struct hugetlb_cgroup *h_cg,
+   struct hstate *h,
+   bool dry_run)
+{
+   long add = 0;
+   long last_accounted_offset = f;

[PATCH v3 2/6] hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations

2019-08-26 Thread Mina Almasry
Augments hugetlb_cgroup_charge_cgroup to be able to charge the hugetlb
usage or hugetlb reservation counter.

Adds a new interface to uncharge a hugetlb_cgroup counter via
hugetlb_cgroup_uncharge_counter.

Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
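
For illustration only (not from this patch): a sketch of how a caller might
use the extended interface, based on the signatures in the diff below. The
wrapper function, its error handling, and the immediate uncharge are
hypothetical; the reserved_hugepage[] member only becomes visible to callers
later in the series.

static int reservation_charge_sketch(int idx, unsigned long nr_pages)
{
        struct hugetlb_cgroup *h_cg;
        int ret;

        /* reserved == true selects the reserved_hugepage[] counter */
        ret = hugetlb_cgroup_charge_cgroup(idx, nr_pages, &h_cg, true);
        if (ret)
                return ret;     /* reservation limit exceeded */

        /* h_cg may be NULL if the cgroup is disabled (assumption) */
        if (h_cg)
                hugetlb_cgroup_uncharge_counter(
                        &h_cg->reserved_hugepage[idx], nr_pages);
        return 0;
}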

---
 include/linux/hugetlb_cgroup.h |  8 +++-
 mm/hugetlb.c   |  3 +-
 mm/hugetlb_cgroup.c| 80 --
 3 files changed, 74 insertions(+), 17 deletions(-)

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 063962f6dfc6a..0725f809cd2d9 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -52,7 +52,8 @@ static inline bool hugetlb_cgroup_disabled(void)
 }

 extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-   struct hugetlb_cgroup **ptr);
+   struct hugetlb_cgroup **ptr,
+   bool reserved);
 extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
 struct hugetlb_cgroup *h_cg,
 struct page *page);
@@ -60,6 +61,9 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 struct page *page);
 extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
   struct hugetlb_cgroup *h_cg);
+extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
+   unsigned long nr_pages);
+
 extern void hugetlb_cgroup_file_init(void) __init;
 extern void hugetlb_cgroup_migrate(struct page *oldhpage,
   struct page *newhpage);
@@ -83,7 +87,7 @@ static inline bool hugetlb_cgroup_disabled(void)

 static inline int
 hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-struct hugetlb_cgroup **ptr)
+struct hugetlb_cgroup **ptr, bool reserved)
 {
return 0;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6d7296dd11b83..242cfeb7cc3e1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2078,7 +2078,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
gbl_chg = 1;
}

-   ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
+   ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg,
+  false);
if (ret)
goto out_subpool_put;

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 51a72624bd1ff..bd9b58474be51 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -38,8 +38,8 @@ struct hugetlb_cgroup {
 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

 static inline
-struct page_counter *hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx,
-                                                bool reserved)
+struct page_counter *hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg,
+   int idx, bool reserved)
 {
if (reserved)
return  &h_cg->reserved_hugepage[idx];
@@ -74,8 +74,12 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
int idx;

for (idx = 0; idx < hugetlb_max_hstate; idx++) {
-   if (page_counter_read(&h_cg->hugepage[idx]))
+   if (page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx,
+   true)) ||
+   page_counter_read(hugetlb_cgroup_get_counter(h_cg, idx,
+   false))) {
return true;
+   }
}
return false;
 }
@@ -86,18 +90,30 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
int idx;

for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
-   struct page_counter *counter = &h_cgroup->hugepage[idx];
struct page_counter *parent = NULL;
+   struct page_counter *reserved_parent = NULL;
unsigned long limit;
int ret;

-   if (parent_h_cgroup)
-   parent = &parent_h_cgroup->hugepage[idx];
-   page_counter_init(counter, parent);
+   if (parent_h_cgroup) {
+   parent = hugetlb_cgroup_get_counter(
+   parent_h_cgroup, idx, false);
+   reserved_parent = hugetlb_cgroup_get_counter(
+   parent_h_cgroup, idx, true);
+   }
+   page_counter_init(hugetlb_cgroup_get_counter(
+   h_cgroup, idx, false), parent);
+   page_counter_init(hugetlb_cgroup_ge

[PATCH v3 3/6] hugetlb_cgroup: add reservation accounting for private mappings

2019-08-26 Thread Mina Almasry
Normally the pointer to the cgroup to uncharge hangs off the struct
page, and gets queried when it's time to free the page. With
hugetlb_cgroup reservations, this is not possible. Because it's possible
for a page to be reserved by one task and actually faulted in by another
task.

The best place to put the hugetlb_cgroup pointer to uncharge for
reservations is in the resv_map. But, because the resv_map has different
semantics for private and shared mappings, the code path to
charge/uncharge shared and private mappings is different. This patch
implements charging and uncharging for private mappings.

For private mappings, the counter to uncharge is in
resv_map->reservation_counter. On initializing the resv_map this is set
to NULL. On reservation of a region in a private mapping, the task's
hugetlb_cgroup is charged and the hugetlb_cgroup is placed in
resv_map->reservation_counter.

On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.

---
 include/linux/hugetlb.h|  8 ++
 include/linux/hugetlb_cgroup.h | 11 
 mm/hugetlb.c   | 47 --
 mm/hugetlb_cgroup.c| 12 -
 4 files changed, 64 insertions(+), 14 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 128ff1aff1c93..536cb144cf484 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -46,6 +46,14 @@ struct resv_map {
long adds_in_progress;
struct list_head region_cache;
long region_cache_count;
+ #ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* On private mappings, the counter to uncharge reservations is stored
+* here. If these fields are 0, then the mapping is shared.
+*/
+   struct page_counter *reservation_counter;
+   unsigned long pages_per_hpage;
+#endif
 };
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 0725f809cd2d9..1fdde63a4e775 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -25,6 +25,17 @@ struct hugetlb_cgroup;
 #define HUGETLB_CGROUP_MIN_ORDER   2

 #ifdef CONFIG_CGROUP_HUGETLB
+struct hugetlb_cgroup {
+   struct cgroup_subsys_state css;
+   /*
+* the counter to account for hugepages from hugetlb.
+*/
+   struct page_counter hugepage[HUGE_MAX_HSTATE];
+   /*
+* the counter to account for hugepage reservations from hugetlb.
+*/
+   struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
+};

 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 242cfeb7cc3e1..7c2df7574cf50 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -711,6 +711,16 @@ struct resv_map *resv_map_alloc(void)
INIT_LIST_HEAD(&resv_map->regions);

resv_map->adds_in_progress = 0;
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* Initialize these to 0. On shared mappings, 0's here indicate these
+* fields don't do cgroup accounting. On private mappings, these will be
+* re-initialized to the proper values, to indicate that hugetlb cgroup
+* reservations are to be un-charged from here.
+*/
+   resv_map->reservation_counter = NULL;
+   resv_map->pages_per_hpage = 0;
+#endif

INIT_LIST_HEAD(&resv_map->region_cache);
list_add(&rg->link, &resv_map->region_cache);
@@ -3192,7 +3202,19 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)

reserve = (end - start) - region_count(resv, start, end);

-   kref_put(&resv->refs, resv_map_release);
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* Since we check for HPAGE_RESV_OWNER above, this must be a private
+* mapping, and these values should be non-zero, and should point to
+* the hugetlb_cgroup counter to uncharge for this reservation.
+*/
+   WARN_ON(!resv->reservation_counter);
+   WARN_ON(!resv->pages_per_hpage);
+
+   hugetlb_cgroup_uncharge_counter(
+   resv->reservation_counter,
+   (end - start) * resv->pages_per_hpage);
+#endif

if (reserve) {
/*
@@ -3202,6 +3224,8 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
gbl_reserve = hugepage_subpool_put_pages(spool, reserve);
hugetlb_acct_memory(h, -gbl_reserve);
}
+
+   kref_put(&resv->refs, resv_map_release);
 }

 static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
@@ -4535,6 +4559,7 @@ int hugetlb_reserve_pages(struct inode *inode,
struct hstate *h = hstate_inode(inode);
struct hugepage_subpool *spool = subpool_inode(inode);
struct resv_map *resv_map;
+   struct hugetlb_cgroup *h_cg;
long gbl_reserve;

/* This should never happen */
@@ -4568,11 +4593,29 @@ int hugetlb_reserve_pages(struc

[PATCH v3 0/6] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-08-26 Thread Mina Almasry
Problem:
Currently tasks attempting to allocate more hugetlb memory than is available get
a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
However, if a task attempts to allocate more hugetlb memory than its
hugetlb_cgroup limit allows (but within what is globally available), the
kernel will allow the mmap/shmget call,
but will SIGBUS the task when it attempts to fault the memory in.

We have developers interested in using hugetlb_cgroups, and they have expressed
dissatisfaction regarding this behavior. We'd like to improve this
behavior such that tasks violating the hugetlb_cgroup limits get an error at
mmap/shmget time, rather than getting SIGBUS'd when they try to fault
the excess memory in.

The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time.
Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
the offending task gets SIGBUS'd.

Proposed Solution:
A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
counter has slightly different semantics than
hugetlb.xMB.[limit|usage]_in_bytes:

- While usage_in_bytes tracks all *faulted* hugetlb memory,
reservation_usage_in_bytes tracks all *reserved* hugetlb memory.

- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than reservation_limit_in_bytes, the kernel will fail this
reservation.

This proposal is implemented in this patch, with tests to verify
functionality and show the usage.
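
To make the intended failure modes concrete, here is a minimal userspace
sketch (not part of the series); the 2MB huge page size, the mapping size,
and the cgroup configuration are assumptions:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)  /* assumption: 2MB huge pages */

int main(void)
{
        size_t len = 8 * HPAGE_SIZE;    /* assumed to exceed the cgroup limit */
        size_t off;
        char *p;

        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
                /* reservation_limit_in_bytes exceeded: error at mmap time */
                perror("mmap");
                return 1;
        }

        /*
         * With only limit_in_bytes (fault accounting), the mmap above
         * succeeds and the task is SIGBUS'd somewhere in this loop instead.
         */
        for (off = 0; off < len; off += HPAGE_SIZE)
                p[off] = 1;

        munmap(p, len);
        return 0;
}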

Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to
   the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
   duplication with hugetlb_cgroup. Keeping hugetlb related page counters under
   hugetlb_cgroup seemed cleaner as well.

2. Instead of adding a new counter, we considered adding a sysctl that modifies
   the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at
   reservation time rather than fault time. Adding a new page_counter seems
   better as userspace could, if it wants, choose to enforce different cgroups
   differently: one via limit_in_bytes, and another via
   reservation_limit_in_bytes. This could be very useful if you're
   transitioning how hugetlb memory is partitioned on your system one
   cgroup at a time, for example. Also, someone may find usage for both
   limit_in_bytes and reservation_limit_in_bytes concurrently, and this
   approach gives them the option to do so.

Caveats:
1. This support is implemented for cgroups-v1. I have not tried
   hugetlb_cgroups with cgroups v2, and AFAICT it's not supported yet.
   This is largely because we use cgroups-v1 for now. If required, I
   can add hugetlb_cgroup support to cgroups v2 in this patch or
   a follow up.
2. The most complicated bit of this patch, I believe, is where to store the
   pointer to the hugetlb_cgroup to uncharge at unreservation time?
   Normally the cgroup pointers hang off the struct page. But, with
   hugetlb_cgroup reservations, one task can reserve a specific page and another
   task may fault it in (I believe), so storing the pointer in struct
   page is not appropriate. Proposed approach here is to store the pointer in
   the resv_map. See patch for details.

Signed-off-by: Mina Almasry 

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Changes in v3:
- Addressed comments of Hillf Danton:
  - Added docs.
  - cgroup_files now uses enum.
  - Various readability improvements.
- Addressed comments of Mike Kravetz.
  - region_* functions no longer coalesce file_region entries in the resv_map.
  - region_add() and region_chg() refactored to make them much easier to
understand and remove duplicated code so this patch doesn't add too much
complexity.
  - Refactored common functionality into helpers.

Changes in v2:
- Split the patch into a 5 patch series.
- Fixed patch subject.

Mina Almasry (6):
  hugetlb_cgroup: Add hugetlb_cgroup reservation counter
  hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  hugetlb_cgroup: add reservation accounting for private mappings
  hugetlb_cgroup: add accounting for shared mappings
  hugetlb_cgroup: Add hugetlb_cgroup reservation tests
  hugetlb_cgroup: Add hugetlb_cgroup reservation docs

 .../admin-guide/cgroup-v1/hugetlb.rst |  84 ++-
 include/linux/hugetlb.h   |  24 +-
 include/linux/hugetlb_cgroup.h|  19 +-
 mm/hugetlb.c  | 493 --
 mm/hugetlb_cgroup.c   | 187 +--
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   4 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 438 
 .../selftests/vm/write_hugetlb_memory.sh  |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 252 +

[PATCH v3 1/6] hugetlb_cgroup: Add hugetlb_cgroup reservation counter

2019-08-26 Thread Mina Almasry
These counters will track hugetlb reservations rather than hugetlb
memory faulted in. This patch only adds the counter, following patches
add the charging and uncharging of the counter.
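
For illustration (not part of this patch), a userspace sketch that reads one
of the new counters; the cgroup-v1 mount point and the 2MB size name are
assumptions, adjust them to the local setup:

#include <stdio.h>

int main(void)
{
        const char *path =
                "/sys/fs/cgroup/hugetlb/hugetlb.2MB.reservation_usage_in_bytes";
        unsigned long long bytes;
        FILE *f = fopen(path, "r");

        if (!f) {
                perror(path);
                return 1;
        }
        if (fscanf(f, "%llu", &bytes) == 1)
                printf("reserved hugetlb: %llu bytes\n", bytes);
        fclose(f);
        return 0;
}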

---
 include/linux/hugetlb.h |  16 +-
 mm/hugetlb_cgroup.c | 111 ++--
 2 files changed, 100 insertions(+), 27 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edfca42783192..128ff1aff1c93 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -320,6 +320,20 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,

 #ifdef CONFIG_HUGETLB_PAGE

+enum {
+   HUGETLB_RES_USAGE,
+   HUGETLB_RES_RESERVATION_USAGE,
+   HUGETLB_RES_LIMIT,
+   HUGETLB_RES_RESERVATION_LIMIT,
+   HUGETLB_RES_MAX_USAGE,
+   HUGETLB_RES_RESERVATION_MAX_USAGE,
+   HUGETLB_RES_FAILCNT,
+   HUGETLB_RES_RESERVATION_FAILCNT,
+   HUGETLB_RES_NULL,
+   HUGETLB_RES_MAX,
+};
+
+
 #define HSTATE_NAME_LEN 32
 /* Defines one hugetlb page size */
 struct hstate {
@@ -340,7 +354,7 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 #ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
-   struct cftype cgroup_files[5];
+   struct cftype cgroup_files[HUGETLB_RES_MAX];
 #endif
char name[HSTATE_NAME_LEN];
 };
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 68c2f2f3c05b7..51a72624bd1ff 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -25,6 +25,10 @@ struct hugetlb_cgroup {
 * the counter to account for hugepages from hugetlb.
 */
struct page_counter hugepage[HUGE_MAX_HSTATE];
+   /*
+* the counter to account for hugepage reservations from hugetlb.
+*/
+   struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
 };

 #define MEMFILE_PRIVATE(x, val)(((x) << 16) | (val))
@@ -33,6 +37,15 @@ struct hugetlb_cgroup {

 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

+static inline
+struct page_counter *hugetlb_cgroup_get_counter(struct hugetlb_cgroup *h_cg, int idx,
+                                                bool reserved)
+{
+   if (reserved)
+   return  &h_cg->reserved_hugepage[idx];
+   return &h_cg->hugepage[idx];
+}
+
 static inline
 struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
 {
@@ -254,30 +267,33 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
return;
 }

-enum {
-   RES_USAGE,
-   RES_LIMIT,
-   RES_MAX_USAGE,
-   RES_FAILCNT,
-};
-
 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
   struct cftype *cft)
 {
struct page_counter *counter;
+   struct page_counter *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+   reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];

switch (MEMFILE_ATTR(cft->private)) {
-   case RES_USAGE:
+   case HUGETLB_RES_USAGE:
return (u64)page_counter_read(counter) * PAGE_SIZE;
-   case RES_LIMIT:
+   case HUGETLB_RES_RESERVATION_USAGE:
+   return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
+   case HUGETLB_RES_LIMIT:
return (u64)counter->max * PAGE_SIZE;
-   case RES_MAX_USAGE:
+   case HUGETLB_RES_RESERVATION_LIMIT:
+   return (u64)reserved_counter->max * PAGE_SIZE;
+   case HUGETLB_RES_MAX_USAGE:
return (u64)counter->watermark * PAGE_SIZE;
-   case RES_FAILCNT:
+   case HUGETLB_RES_RESERVATION_MAX_USAGE:
+   return (u64)reserved_counter->watermark * PAGE_SIZE;
+   case HUGETLB_RES_FAILCNT:
return counter->failcnt;
+   case HUGETLB_RES_RESERVATION_FAILCNT:
+   return reserved_counter->failcnt;
default:
BUG();
}
@@ -291,6 +307,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
int ret, idx;
unsigned long nr_pages;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
+   bool reserved = false;

if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
return -EINVAL;
@@ -304,9 +321,13 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));

switch (MEMFILE_ATTR(of_cft(of)->private)) {
-   case RES_LIMIT:
+   case HUGETLB_RES_RESERVATION_LIMIT:
+   reserved = true;
+   /* Fall through. */
+   case HUGETLB_RES_LIMIT:
mutex_lock(&hugetlb_limit_mutex);
-   ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
+   ret = page_counter_set_max(hugetlb_cgroup_get_counter(h_cg, 
idx, reserved),
+   

Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings

2019-08-16 Thread Mina Almasry
On Fri, Aug 16, 2019 at 9:29 AM Mike Kravetz  wrote:
>
> On 8/15/19 4:04 PM, Mina Almasry wrote:
> > On Wed, Aug 14, 2019 at 9:46 AM Mike Kravetz  
> > wrote:
> >>
> >> On 8/13/19 4:54 PM, Mike Kravetz wrote:
> >>> On 8/8/19 4:13 PM, Mina Almasry wrote:
> >>>> For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
> >>>> in the resv_map entries, in file_region->reservation_counter.
> >>>>
> >>>> When a file_region entry is added to the resv_map via region_add, we
> >>>> also charge the appropriate hugetlb_cgroup and put the pointer to that
> >>>> in file_region->reservation_counter. This is slightly delicate since we
> >>>> need to not modify the resv_map until we know that charging the
> >>>> reservation has succeeded. If charging doesn't succeed, we report the
> >>>> error to the caller, so that the kernel fails the reservation.
> >>>
> >>> I wish we did not need to modify these region_() routines as they are
> >>> already difficult to understand.  However, I see no other way with the
> >>> desired semantics.
> >>>
> >>
> >> I suspect you have considered this, but what about using the return value
> >> from region_chg() in hugetlb_reserve_pages() to charge reservation limits?
> >> There is a VERY SMALL race where the value could be too large, but that
> >> can be checked and adjusted at region_add time as is done with normal
> >> accounting today.
> >
> > I have not actually until now; I didn't consider doing stuff with the
> > resv_map while not holding onto the resv_map->lock. I guess that's the
> > small race you're talking about. Seems fine to me, but I'm more
> > worried about hanging off the vma below.
>
> This race is already handled for other 'reservation like' things in
> hugetlb_reserve_pages.  So, I don't think the race is much of an issue.
>
> >> If the question is, where would we store the information
> >> to uncharge?, then we can hang a structure off the vma.  This would be
> >> similar to what is done for private mappings.  In fact, I would suggest
> >> making them both use a new cgroup reserve structure hanging off the vma.
> >>
> >
> > I actually did consider hanging off the info to uncharge off the vma,
> > but I didn't for a couple of reasons:
> >
> > 1. region_del is called from hugetlb_unreserve_pages, and I don't have
> > access to the vma there. Maybe there is a way to query the proper vma
> > I don't know about?
>
> I am still thinking about closely tying cgroup revervation limits/usage
> to existing reservation accounting.  Of most concern (to me) is handling
> shared mappings.  Reservations created for shared mappings are more
> associated with the inode/file than individual mappings.  For example,
> consider a task which mmaps(MAP_SHARED) a hugetlbfs file.  At mmap time
> reservations are created based on the size of the mmap.  Now, if the task
> unmaps and/or exits the reservations still exist as they are associated
> with the file rather than the mapping.
>

I'm aware of this behavior, and IMO it seems fine. I believe it
works the same way with tmpfs today. I think a task that creates a file
in tmpfs gets charged the memory, and even if the task exits the
memory is still charged to its cgroup, and the memory remains charged
until the tmpfs file is deleted by someone.

Makes sense to me for hugetlb reservations to work the same way. The
memory remains charged until the hugetlbfs file gets deleted. But if
you can think of an improvement, I'm happy to oblige :)

> Honesty, I think this existing reservation bevahior is wrong or at least
> not desirable.  Because there are outstanding reservations, the number of
> reserved huge pages can not be used for other purposes.  It is also very
> difficult for a user or admin to determine the source of the reservations.
> No one is currently complaining about this behavior.  This proposal just
> made me think about it.
>
> Tying cgroup reservation limits/usage to existing reservation accounting
> will introduce the same issues there.  We will need to clearly document the
> behavior.
>

Yes, seems we're maybe converging on a solution here, so the next
patchset will include docs for your review.

> > 2. hugetlb_reserve_pages seems to be able to conduct a reservation
> > with a NULL *vma. Not sure what to do in that case.
> >
> > Is there a way to get around these that I'm missing here?
>
> You are correct.  The !vma case is ther

Re: [RFC PATCH v2 1/5] hugetlb_cgroup: Add hugetlb_cgroup reservation counter

2019-08-15 Thread Mina Almasry
On Wed, Aug 14, 2019 at 8:54 PM Hillf Danton  wrote:
>
>
> On Thu,  8 Aug 2019 16:13:36 -0700 Mina Almasry wrote:
> >
> > These counters will track hugetlb reservations rather than hugetlb
> > memory faulted in. This patch only adds the counter, following patches
> > add the charging and uncharging of the counter.
> > ---
>
>   !?!
>

Thanks for reviewing. I'm not sure what you're referring to though.
What's wrong here?

> >  include/linux/hugetlb.h |  2 +-
> >  mm/hugetlb_cgroup.c | 86 +
> >  2 files changed, 80 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index edfca42783192..6777b3013345d 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -340,7 +340,7 @@ struct hstate {
> >   unsigned int surplus_huge_pages_node[MAX_NUMNODES];
> >  #ifdef CONFIG_CGROUP_HUGETLB
> >   /* cgroup control files */
> > - struct cftype cgroup_files[5];
> > + struct cftype cgroup_files[9];
>
> Move that enum in this header file and replace numbers with characters
> to easy both reading and maintaining.
> >  #endif
> >   char name[HSTATE_NAME_LEN];
> >  };
> > diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
> > index 68c2f2f3c05b7..708103663988a 100644
> > --- a/mm/hugetlb_cgroup.c
> > +++ b/mm/hugetlb_cgroup.c
> > @@ -25,6 +25,10 @@ struct hugetlb_cgroup {
> >* the counter to account for hugepages from hugetlb.
> >*/
> >   struct page_counter hugepage[HUGE_MAX_HSTATE];
> > + /*
> > +  * the counter to account for hugepage reservations from hugetlb.
> > +  */
> > + struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
> >  };
> >
> >  #define MEMFILE_PRIVATE(x, val)  (((x) << 16) | (val))
> > @@ -33,6 +37,15 @@ struct hugetlb_cgroup {
> >
> >  static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
> >
> > +static inline
> > +struct page_counter *get_counter(struct hugetlb_cgroup *h_cg, int idx,
> > +  bool reserved)
>
> s/get_/hugetlb_cgroup_get_/ to make it not too generic.
> > +{
> > + if (reserved)
> > + return  &h_cg->reserved_hugepage[idx];
> > + return &h_cg->hugepage[idx];
> > +}
> > +
> >  static inline
> >  struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state 
> > *s)
> >  {
> > @@ -256,28 +269,42 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned 
> > long nr_pages,
> >
> >  enum {
> >   RES_USAGE,
> > + RES_RESERVATION_USAGE,
> >   RES_LIMIT,
> > + RES_RESERVATION_LIMIT,
> >   RES_MAX_USAGE,
> > + RES_RESERVATION_MAX_USAGE,
> >   RES_FAILCNT,
> > + RES_RESERVATION_FAILCNT,
> >  };
> >
> >  static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
> >  struct cftype *cft)
> >  {
> >   struct page_counter *counter;
> > + struct page_counter *reserved_counter;
> >   struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);
> >
> >   counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
> > + reserved_counter = 
> > &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];
> >
> >   switch (MEMFILE_ATTR(cft->private)) {
> >   case RES_USAGE:
> >   return (u64)page_counter_read(counter) * PAGE_SIZE;
> > + case RES_RESERVATION_USAGE:
> > + return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
> >   case RES_LIMIT:
> >   return (u64)counter->max * PAGE_SIZE;
> > + case RES_RESERVATION_LIMIT:
> > + return (u64)reserved_counter->max * PAGE_SIZE;
> >   case RES_MAX_USAGE:
> >   return (u64)counter->watermark * PAGE_SIZE;
> > + case RES_RESERVATION_MAX_USAGE:
> > + return (u64)reserved_counter->watermark * PAGE_SIZE;
> >   case RES_FAILCNT:
> >   return counter->failcnt;
> > + case RES_RESERVATION_FAILCNT:
> > + return reserved_counter->failcnt;
> >   default:
> >   BUG();
> >   }
> > @@ -291,6 +318,7 @@ static ssize_t hugetlb_cgroup_write(struct 
> > kernfs_open_file *of,
> >   int ret, idx;
> >   unsigned long nr_pages;
> >   struct hugetlb_cgroup *h_c

Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings

2019-08-15 Thread Mina Almasry
On Tue, Aug 13, 2019 at 4:54 PM Mike Kravetz  wrote:
>
> On 8/8/19 4:13 PM, Mina Almasry wrote:
> > For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
> > in the resv_map entries, in file_region->reservation_counter.
> >
> > When a file_region entry is added to the resv_map via region_add, we
> > also charge the appropriate hugetlb_cgroup and put the pointer to that
> > in file_region->reservation_counter. This is slightly delicate since we
> > need to not modify the resv_map until we know that charging the
> > reservation has succeeded. If charging doesn't succeed, we report the
> > error to the caller, so that the kernel fails the reservation.
>
> I wish we did not need to modify these region_() routines as they are
> already difficult to understand.  However, I see no other way with the
> desired semantics.
>
> > On region_del, which is when the hugetlb memory is unreserved, we delete
> > the file_region entry in the resv_map, but also uncharge the
> > file_region->reservation_counter.
> >
> > ---
> >  mm/hugetlb.c | 208 +--
> >  1 file changed, 170 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 235996aef6618..d76e3137110ab 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -242,8 +242,72 @@ struct file_region {
> >   struct list_head link;
> >   long from;
> >   long to;
> > +#ifdef CONFIG_CGROUP_HUGETLB
> > + /*
> > +  * On shared mappings, each reserved region appears as a struct
> > +  * file_region in resv_map. These fields hold the info needed to
> > +  * uncharge each reservation.
> > +  */
> > + struct page_counter *reservation_counter;
> > + unsigned long pages_per_hpage;
> > +#endif
> >  };
> >
> > +/* Must be called with resv->lock held. Calling this with dry_run == true 
> > will
> > + * count the number of pages added but will not modify the linked list.
> > + */
> > +static long consume_regions_we_overlap_with(struct file_region *rg,
> > + struct list_head *head, long f, long *t,
> > + struct hugetlb_cgroup *h_cg,
> > + struct hstate *h,
> > + bool dry_run)
> > +{
> > + long add = 0;
> > + struct file_region *trg = NULL, *nrg = NULL;
> > +
> > + /* Consume any regions we now overlap with. */
> > + nrg = rg;
> > + list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
> > + if (&rg->link == head)
> > + break;
> > + if (rg->from > *t)
> > + break;
> > +
> > + /* If this area reaches higher then extend our area to
> > +  * include it completely.  If this is not the first area
> > +  * which we intend to reuse, free it.
> > +  */
> > + if (rg->to > *t)
> > + *t = rg->to;
> > + if (rg != nrg) {
> > + /* Decrement return value by the deleted range.
> > +  * Another range will span this area so that by
> > +  * end of routine add will be >= zero
> > +  */
> > + add -= (rg->to - rg->from);
> > + if (!dry_run) {
> > + list_del(&rg->link);
> > + kfree(rg);
>
> Is it possible that the region struct we are deleting pointed to
> a reservation_counter?  Perhaps even for another cgroup?
> Just concerned with the way regions are coalesced that we may be
> deleting counters.
>

Yep, that needs to be handled I think. Thanks for catching!


> --
> Mike Kravetz


Re: [RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings

2019-08-15 Thread Mina Almasry
On Wed, Aug 14, 2019 at 9:46 AM Mike Kravetz  wrote:
>
> On 8/13/19 4:54 PM, Mike Kravetz wrote:
> > On 8/8/19 4:13 PM, Mina Almasry wrote:
> >> For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
> >> in the resv_map entries, in file_region->reservation_counter.
> >>
> >> When a file_region entry is added to the resv_map via region_add, we
> >> also charge the appropriate hugetlb_cgroup and put the pointer to that
> >> in file_region->reservation_counter. This is slightly delicate since we
> >> need to not modify the resv_map until we know that charging the
> >> reservation has succeeded. If charging doesn't succeed, we report the
> >> error to the caller, so that the kernel fails the reservation.
> >
> > I wish we did not need to modify these region_() routines as they are
> > already difficult to understand.  However, I see no other way with the
> > desired semantics.
> >
>
> I suspect you have considered this, but what about using the return value
> from region_chg() in hugetlb_reserve_pages() to charge reservation limits?
> There is a VERY SMALL race where the value could be too large, but that
> can be checked and adjusted at region_add time as is done with normal
> accounting today.

I have not actually until now; I didn't consider doing stuff with the
resv_map while not holding onto the resv_map->lock. I guess that's the
small race you're talking about. Seems fine to me, but I'm more
worried about hanging off the vma below.

> If the question is, where would we store the information
> to uncharge?, then we can hang a structure off the vma.  This would be
> similar to what is done for private mappings.  In fact, I would suggest
> making them both use a new cgroup reserve structure hanging off the vma.
>

I actually did consider hanging off the info to uncharge off the vma,
but I didn't for a couple of reasons:

1. region_del is called from hugetlb_unreserve_pages, and I don't have
access to the vma there. Maybe there is a way to query the proper vma
I don't know about?
2. hugetlb_reserve_pages seems to be able to conduct a reservation
with a NULL *vma. Not sure what to do in that case.

Is there a way to get around these that I'm missing here?

FWIW I think tracking is better in the resv_map since the reservations
are in the resv_map themselves. If I do another structure, then for each
reservation there will be an entry in the resv_map and an entry in the
new structure; they need to be kept in sync, and I need to handle errors
for when they get out of sync.

> One issue I see is what to do if a vma is split?  The private mapping case
> 'should' handle this today, but I would not be surprised if such code is
> missing or incorrect.
>
> --
> Mike Kravetz


Re: [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-08-10 Thread Mina Almasry
On Sat, Aug 10, 2019 at 11:58 AM Mike Kravetz  wrote:
>
> On 8/9/19 12:42 PM, Mina Almasry wrote:
> > On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz  
> > wrote:
> >> On 8/8/19 4:13 PM, Mina Almasry wrote:
> >>> Problem:
> >>> Currently tasks attempting to allocate more hugetlb memory than is 
> >>> available get
> >>> a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations 
> >>> [1].
> >>> However, if a task attempts to allocate hugetlb memory only more than its
> >>> hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
> >>> but will SIGBUS the task when it attempts to fault the memory in.
> 
> >> I believe tracking reservations for shared mappings can get quite 
> >> complicated.
> >> The hugetlbfs reservation code around shared mappings 'works' on the basis
> >> that shared mapping reservations are global.  As a result, reservations are
> >> more associated with the inode than with the task making the reservation.
> >
> > FWIW, I found it not too bad. And my tests at least don't detect an
> > anomaly around shared mappings. The key I think is that I'm tracking
> > cgroup to uncharge on the file_region entry inside the resv_map, so we
> > know who allocated each file_region entry exactly and we can uncharge
> > them when the entry is region_del'd.
> >
> >> For example, consider a file of size 4 hugetlb pages.
> >> Task A maps the first 2 pages, and 2 reservations are taken.  Task B maps
> >> all 4 pages, and 2 additional reservations are taken.  I am not really sure
> >> of the desired semantics here for reservation limits if A and B are in 
> >> separate
> >> cgroups.  Should B be charged for 4 or 2 reservations?
> >
> > Task A's cgroup is charged 2 pages to its reservation usage.
> > Task B's cgroup is charged 2 pages to its reservation usage.
>
> OK,
> Suppose Task B's cgroup allowed 2 huge pages reservation and 2 huge pages
> allocation.  The mmap would succeed, but Task B could potentially need to
> allocate more than 2 huge pages.  So, when faulting in more than 2 huge
> pages B would get a SIGBUS.  Correct?  Or, am I missing something?
>
> Perhaps reservation charge should always be the same as map size/maximum
> allocation size?

I'm thinking this would work similarly to how other shared memory, like
tmpfs, is accounted for right now. I.e. if a task conducts an operation
that causes memory to be allocated then that task is charged for that
memory, and if another task uses memory that has already been
allocated and charged by another task, then it can use the memory
without being charged.

So in the case of hugetlb memory, if a task is mmaping memory that causes
a new reservation to be made, and new entries to be created in the
resv_map for the shared mapping, then that task gets charged. If the
task is mmaping memory that is already reserved or faulted, then it
reserves or faults it without getting charged.

In the example above, in chronological order:
- Task A mmaps 2 hugetlb pages, gets charged 2 hugetlb reservations.
- Task B mmaps 4 hugetlb pages, gets charged only 2 hugetlb
reservations because the first 2 are charged already and can be used
without incurring a charge.
- Task B accesses 4 hugetlb pages, gets charged *4* hugetlb faults,
since none of the 4 pages are faulted in yet. If the task is only
allowed 2 hugetlb page faults then it will actually get a SIGBUS.
- Task A accesses 4 hugetlb pages, gets charged no faults, since all
the hugetlb faults are charged to Task B.

So, yes, I can see a scenario where userspace still gets SIGBUS'd, but
I think that's fine because:
1. Notice that the SIGBUS is due to the faulting limit, and not the
reservation limit, and not the reservation limit, so we're not regressing the status quo per se.
Folks using the fault limit today understand the SIGBUS risk.
2. The way I expect folks to use this is to use 'reservation limits'
to partition the available hugetlb memory on the machine using it and
forgo using the existing fault limits. Using both at the same time I
think would be a superuser feature for folks that really know what
they are doing, and understand the risk of SIGBUS that comes with
using the existing fault limits.
3. I expect userspace in general to handle this correctly because
there are similar challenges with all shared memory and accounting of
it, even in tmpfs, I think.

I would not like to charge the full reservation to every process that
does the mmap. Think of this much more common scenario: Tasks A and B
are supposed to collaborate on 10 hugetlb pages of data. Task B
should not access any hugetlb memory other than the memory it is
working on with Ta

Re: [RFC PATCH] hugetlbfs: Add hugetlb_cgroup reservation limits

2019-08-09 Thread Mina Almasry
On Fri, Aug 9, 2019 at 1:39 PM Mike Kravetz  wrote:
>
> On 8/9/19 11:05 AM, Mina Almasry wrote:
> > On Fri, Aug 9, 2019 at 4:27 AM Michal Koutný  wrote:
> >>> Alternatives considered:
> >>> [...]
> >> (I did not try that but) have you considered:
> >> 3) MAP_POPULATE while you're making the reservation,
> >
> > I have tried this, and the behaviour is not great. Basically if
> > userspace mmaps more memory than its cgroup limit allows with
> > MAP_POPULATE, the kernel will reserve the total amount requested by
> > the userspace, it will fault in up to the cgroup limit, and then it
> > will SIGBUS the task when it tries to access the rest of its
> > 'reserved' memory.
> >
> > So for example:
> > - if /proc/sys/vm/nr_hugepages == 10, and
> > - your cgroup limit is 5 pages, and
> > - you mmap(MAP_POPULATE) 7 pages.
> >
> > Then the kernel will reserve 7 pages, and will fault in 5 of those 7
> > pages, and will SIGBUS you when you try to access the remaining 2
> > pages. So the problem persists. Folks would still like to know they
> > are crossing the limits on mmap time.
>
> If you got the failure at mmap time in the MAP_POPULATE case would this
> be useful?
>
> Just thinking that would be a relatively simple change.

Not quite, unfortunately. A subset of the folks that want to use
hugetlb memory don't want to use MAP_POPULATE (IIRC, something about
mmaping a huge amount of hugetlb memory at their jobs' startup, and
doing that with MAP_POPULATE adds so much to their startup time that
it is prohibitively expensive - but that's just what I vaguely recall
offhand. I can get you the details if you're interested).

> --
> Mike Kravetz


Re: [RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-08-09 Thread Mina Almasry
On Fri, Aug 9, 2019 at 10:54 AM Mike Kravetz  wrote:
>
> (+CC  Michal Koutný, cgro...@vger.kernel.org, Aneesh Kumar)
>
> On 8/8/19 4:13 PM, Mina Almasry wrote:
> > Problem:
> > Currently tasks attempting to allocate more hugetlb memory than is 
> > available get
> > a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
> > However, if a task attempts to allocate hugetlb memory only more than its
> > hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
> > but will SIGBUS the task when it attempts to fault the memory in.
> >
> > We have developers interested in using hugetlb_cgroups, and they have 
> > expressed
> > dissatisfaction regarding this behavior. We'd like to improve this
> > behavior such that tasks violating the hugetlb_cgroup limits get an error on
> > mmap/shmget time, rather than getting SIGBUS'd when they try to fault
> > the excess memory in.
> >
> > The underlying problem is that today's hugetlb_cgroup accounting happens
> > at hugetlb memory *fault* time, rather than at *reservation* time.
> > Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
> > the offending task gets SIGBUS'd.
> >
> > Proposed Solution:
> > A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. 
> > This
> > counter has slightly different semantics than
> > hugetlb.xMB.[limit|usage]_in_bytes:
> >
> > - While usage_in_bytes tracks all *faulted* hugetlb memory,
> > reservation_usage_in_bytes tracks all *reserved* hugetlb memory.
> >
> > - If a task attempts to reserve more memory than limit_in_bytes allows,
> > the kernel will allow it to do so. But if a task attempts to reserve
> > more memory than reservation_limit_in_bytes, the kernel will fail this
> > reservation.
> >
> > This proposal is implemented in this patch, with tests to verify
> > functionality and show the usage.
>
> Thanks for taking on this effort Mina.
>
No problem! Thanks for reviewing!

> Before looking at the details of the code, it might be helpful to discuss
> the expected semantics of the proposed reservation limits.
>
> I see you took into account the differences between private and shared
> mappings.  This is good, as the reservation behavior is different for each
> of these cases.  First let's look at private mappings.
>
> For private mappings, the reservation usage will be the size of the mapping.
> This should be fairly simple.  As reservations are consumed in the hugetlbfs
> code, reservations in the resv_map are removed.  I see you have a hook into
> region_del.  So, the expectation is that as reservations are consumed the
> reservation usage will drop for the cgroup.  Correct?

I assume by 'reservations are consumed' you mean when a reservation
goes from just 'reserved' to actually in use (as in the task is
writing to the hugetlb page or something). If so, then the answer is
no, that is not correct. When reservations are consumed, the
reservation usage stays the same. I.e. the reservation usage tracks
hugetlb memory (reserved + used) you could say. This is 100% the
intention, as we want to know at mmap time if there is enough 'free'
(that is unreserved and unused) memory left over in the cgroup to
satisfy the mmap call.

The hooks in region_add and region_del are to account shared mappings
only. There is a check in those code blocks that makes sure the code
is only engaged in shared mappings. The commit messages of patches 3/5
and 4/5 go into more details regarding this.

> The only tricky thing about private mappings is COW because of fork.  Current
> reservation semantics specify that all reservations stay with the parent.
> If child faults and can not get page, SIGBUS.  I assume the new reservation
> limits will work the same.
>

Although I did not explicitly try it, yes. It should work the same.
The additional reservation due to the COW will get charged to whatever
cgroup the fork is in. If the task can't get a page it gets SIGBUS'd.
If there is not enough room to charge the cgroup it's in, then the
charge will fail, which I assume will trigger the error path that also
leads to SIGBUS.

> I believe tracking reservations for shared mappings can get quite complicated.
> The hugetlbfs reservation code around shared mappings 'works' on the basis
> that shared mapping reservations are global.  As a result, reservations are
> more associated with the inode than with the task making the reservation.

FWIW, I found it not too bad. And my tests at least don't detect an
anomaly around shared mappings. The key I think is that I'm tracking
cgroup to uncharge on the file_region entry in

Re: [RFC PATCH] hugetlbfs: Add hugetlb_cgroup reservation limits

2019-08-09 Thread Mina Almasry
On Fri, Aug 9, 2019 at 4:27 AM Michal Koutný  wrote:
>
> (+CC cgro...@vger.kernel.org)
>
> On Thu, Aug 08, 2019 at 12:40:02PM -0700, Mina Almasry 
>  wrote:
> > We have developers interested in using hugetlb_cgroups, and they have 
> > expressed
> > dissatisfaction regarding this behavior.
> I assume you still want to enforce a limit on a particular group and the
> application must be able to handle resource scarcity (but better
> notified than SIGBUS).
>
> > Alternatives considered:
> > [...]
> (I did not try that but) have you considered:
> 3) MAP_POPULATE while you're making the reservation,

I have tried this, and the behaviour is not great. Basically if
userspace mmaps more memory than its cgroup limit allows with
MAP_POPULATE, the kernel will reserve the total amount requested by
the userspace, it will fault in up to the cgroup limit, and then it
will SIGBUS the task when it tries to access the rest of its
'reserved' memory.

So for example:
- if /proc/sys/vm/nr_hugepages == 10, and
- your cgroup limit is 5 pages, and
- you mmap(MAP_POPULATE) 7 pages.

Then the kernel will reserve 7 pages, and will fault in 5 of those 7
pages, and will SIGBUS you when you try to access the remaining 2
pages. So the problem persists. Folks would still like to know they
are crossing the limits on mmap time.
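
A minimal sketch of that scenario (assuming 2MB huge pages, nr_hugepages=10,
and a cgroup fault limit of 5 pages; illustration only):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)  /* assumption: 2MB huge pages */

int main(void)
{
        size_t len = 7 * HPAGE_SIZE;    /* 7 pages, cgroup limit assumed at 5 */
        size_t off;
        char *p;

        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE,
                 -1, 0);
        if (p == MAP_FAILED) {
                perror("mmap");         /* does not trip in this scenario */
                return 1;
        }

        /*
         * All 7 pages were reserved, but MAP_POPULATE only faulted in 5
         * before hitting the cgroup limit; touching the last 2 raises SIGBUS.
         */
        for (off = 0; off < len; off += HPAGE_SIZE)
                p[off] = 1;

        return 0;
}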

> 4) Using multple hugetlbfs mounts with respective limits.
>

I assume you mean the size= option on the hugetlbfs mount. This
would only limit hugetlb memory usage via the hugetlbfs mount. Tasks
can still allocate hugetlb memory without any mount via
mmap(MAP_HUGETLB) and shmget/shmat APIs, and all these calls will
deplete the global, shared hugetlb memory pool.
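
For illustration, a sketch (assuming 2MB huge pages) of two allocation paths
that never touch a hugetlbfs mount, so a per-mount size= limit cannot bound
them:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)  /* assumption: 2MB huge pages */

int main(void)
{
        /* 1) anonymous hugetlb mapping: no hugetlbfs file or mount involved */
        void *p = mmap(NULL, 2 * HPAGE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED)
                perror("mmap(MAP_HUGETLB)");

        /* 2) SysV shared memory segment backed by huge pages */
        int id = shmget(IPC_PRIVATE, 2 * HPAGE_SIZE,
                        IPC_CREAT | SHM_HUGETLB | 0600);
        if (id < 0)
                perror("shmget(SHM_HUGETLB)");
        else
                shmctl(id, IPC_RMID, NULL);     /* mark the segment for removal */

        return 0;
}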

> > Caveats:
> > 1. This support is implemented for cgroups-v1. I have not tried
> >hugetlb_cgroups with cgroups v2, and AFAICT it's not supported yet.
> >This is largely because we use cgroups-v1 for now.
> Adding something new into v1 without v2 counterpart, is making migration
> harder, that's one of the reasons why v1 API is rather frozen now. (I'm
> not sure whether current hugetlb controller fits into v2 at all though.)
>

In my estimation it's maybe fine to make this change in v1 because, as
far as I understand, hugetlb_cgroups are a little-used feature of the
kernel (although we see it getting requested) and hugetlb_cgroups
aren't supported in v2 yet, and I don't *think* this change makes it
any harder to port hugetlb_cgroups to v2.

But, like I said, if there is consensus that this must not be checked
in without hugetlb_cgroup v2 support being added alongside, I can take
a look at that.

> Michal


[RFC PATCH v2 3/5] hugetlb_cgroup: Add reservation accounting for private mappings

2019-08-08 Thread Mina Almasry
Normally the pointer to the cgroup to uncharge hangs off the struct
page, and gets queried when it's time to free the page. With
hugetlb_cgroup reservations, this is not possible. Because it's possible
for a page to be reserved by one task and actually faulted in by another
task.

The best place to put the hugetlb_cgroup pointer to uncharge for
reservations is in the resv_map. But, because the resv_map has different
semantics for private and shared mappings, the code path to
charge/uncharge shared and private mappings is different. This patch
implements charging and uncharging for private mappings.

For private mappings, the counter to uncharge is in
resv_map->reservation_counter. On initializing the resv_map this is set
to NULL. On reservation of a region in private mapping, the tasks
hugetlb_cgroup is charged and the hugetlb_cgroup is placed is
resv_map->reservation_counter.

On hugetlb_vm_op_close, we uncharge resv_map->reservation_counter.

---
 include/linux/hugetlb.h|  8 ++
 include/linux/hugetlb_cgroup.h | 11 
 mm/hugetlb.c   | 47 --
 mm/hugetlb_cgroup.c| 12 -
 4 files changed, 64 insertions(+), 14 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6777b3013345d..90b3c928d16c1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -46,6 +46,14 @@ struct resv_map {
long adds_in_progress;
struct list_head region_cache;
long region_cache_count;
+ #ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* On private mappings, the counter to uncharge reservations is stored
+* here. If these fields are 0, then the mapping is shared.
+*/
+   struct page_counter *reservation_counter;
+   unsigned long pages_per_hpage;
+#endif
 };
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 0725f809cd2d9..1fdde63a4e775 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -25,6 +25,17 @@ struct hugetlb_cgroup;
 #define HUGETLB_CGROUP_MIN_ORDER   2

 #ifdef CONFIG_CGROUP_HUGETLB
+struct hugetlb_cgroup {
+   struct cgroup_subsys_state css;
+   /*
+* the counter to account for hugepages from hugetlb.
+*/
+   struct page_counter hugepage[HUGE_MAX_HSTATE];
+   /*
+* the counter to account for hugepage reservations from hugetlb.
+*/
+   struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
+};

 static inline struct hugetlb_cgroup *hugetlb_cgroup_from_page(struct page *page)
 {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c153bef42e729..235996aef6618 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -711,6 +711,16 @@ struct resv_map *resv_map_alloc(void)
INIT_LIST_HEAD(&resv_map->regions);

resv_map->adds_in_progress = 0;
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* Initialize these to 0. On shared mappings, 0's here indicate these
+* fields don't do cgroup accounting. On private mappings, these will be
+* re-initialized to the proper values, to indicate that hugetlb cgroup
+* reservations are to be un-charged from here.
+*/
+   resv_map->reservation_counter = NULL;
+   resv_map->pages_per_hpage = 0;
+#endif

INIT_LIST_HEAD(&resv_map->region_cache);
list_add(&rg->link, &resv_map->region_cache);
@@ -3192,7 +3202,19 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)

reserve = (end - start) - region_count(resv, start, end);

-   kref_put(&resv->refs, resv_map_release);
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* Since we check for HPAGE_RESV_OWNER above, this must be a private
+* mapping, and these values should be non-zero, and should point to
+* the hugetlb_cgroup counter to uncharge for this reservation.
+*/
+   WARN_ON(!resv->reservation_counter);
+   WARN_ON(!resv->pages_per_hpage);
+
+   hugetlb_cgroup_uncharge_counter(
+   resv->reservation_counter,
+   (end - start) * resv->pages_per_hpage);
+#endif

if (reserve) {
/*
@@ -3202,6 +3224,8 @@ static void hugetlb_vm_op_close(struct vm_area_struct *vma)
gbl_reserve = hugepage_subpool_put_pages(spool, reserve);
hugetlb_acct_memory(h, -gbl_reserve);
}
+
+   kref_put(&resv->refs, resv_map_release);
 }

 static int hugetlb_vm_op_split(struct vm_area_struct *vma, unsigned long addr)
@@ -4516,6 +4540,7 @@ int hugetlb_reserve_pages(struct inode *inode,
struct hstate *h = hstate_inode(inode);
struct hugepage_subpool *spool = subpool_inode(inode);
struct resv_map *resv_map;
+   struct hugetlb_cgroup *h_cg;
long gbl_reserve;

/* This should never happen */
@@ -4549,11 +4574,29 @@ int hugetlb_reserve_pages(struc

[RFC PATCH v2 4/5] hugetlb_cgroup: Add accounting for shared mappings

2019-08-08 Thread Mina Almasry
For shared mappings, the pointer to the hugetlb_cgroup to uncharge lives
in the resv_map entries, in file_region->reservation_counter.

When a file_region entry is added to the resv_map via region_add, we
also charge the appropriate hugetlb_cgroup and put the pointer to that
in file_region->reservation_counter. This is slightly delicate since we
need to not modify the resv_map until we know that charging the
reservation has succeeded. If charging doesn't succeed, we report the
error to the caller, so that the kernel fails the reservation.

On region_del, which is when the hugetlb memory is unreserved, we delete
the file_region entry in the resv_map, but also uncharge the
file_region->reservation_counter.

---
 mm/hugetlb.c | 208 +--
 1 file changed, 170 insertions(+), 38 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 235996aef6618..d76e3137110ab 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -242,8 +242,72 @@ struct file_region {
struct list_head link;
long from;
long to;
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* On shared mappings, each reserved region appears as a struct
+* file_region in resv_map. These fields hold the info needed to
+* uncharge each reservation.
+*/
+   struct page_counter *reservation_counter;
+   unsigned long pages_per_hpage;
+#endif
 };

+/* Must be called with resv->lock held. Calling this with dry_run == true will
+ * count the number of pages added but will not modify the linked list.
+ */
+static long consume_regions_we_overlap_with(struct file_region *rg,
+   struct list_head *head, long f, long *t,
+   struct hugetlb_cgroup *h_cg,
+   struct hstate *h,
+   bool dry_run)
+{
+   long add = 0;
+   struct file_region *trg = NULL, *nrg = NULL;
+
+   /* Consume any regions we now overlap with. */
+   nrg = rg;
+   list_for_each_entry_safe(rg, trg, rg->link.prev, link) {
+   if (&rg->link == head)
+   break;
+   if (rg->from > *t)
+   break;
+
+   /* If this area reaches higher then extend our area to
+* include it completely.  If this is not the first area
+* which we intend to reuse, free it.
+*/
+   if (rg->to > *t)
+   *t = rg->to;
+   if (rg != nrg) {
+   /* Decrement return value by the deleted range.
+* Another range will span this area so that by
+* end of routine add will be >= zero
+*/
+   add -= (rg->to - rg->from);
+   if (!dry_run) {
+   list_del(&rg->link);
+   kfree(rg);
+   }
+   }
+   }
+
+   add += (nrg->from - f); /* Added to beginning of region */
+   add += *t - nrg->to;/* Added to end of region */
+
+   if (!dry_run) {
+   nrg->from = f;
+   nrg->to = *t;
+#ifdef CONFIG_CGROUP_HUGETLB
+   nrg->reservation_counter =
+   &h_cg->reserved_hugepage[hstate_index(h)];
+   nrg->pages_per_hpage = pages_per_huge_page(h);
+#endif
+   }
+
+   return add;
+}
+
 /*
  * Add the huge page range represented by [f, t) to the reserve
  * map.  In the normal case, existing regions will be expanded
@@ -258,11 +322,13 @@ struct file_region {
  * Return the number of new huge pages added to the map.  This
  * number is greater than or equal to zero.
  */
-static long region_add(struct resv_map *resv, long f, long t)
+static long region_add(struct hstate *h, struct resv_map *resv, long f, long t)
 {
struct list_head *head = &resv->regions;
-   struct file_region *rg, *nrg, *trg;
-   long add = 0;
+   struct file_region *rg, *nrg;
+   long add = 0, newadd = 0;
+   struct hugetlb_cgroup *h_cg = NULL;
+   int ret = 0;

spin_lock(&resv->lock);
/* Locate the region we are either in or before. */
@@ -277,6 +343,23 @@ static long region_add(struct resv_map *resv, long f, long t)
 * from the cache and use it for this range.
 */
if (&rg->link == head || t < rg->from) {
+#ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* If resv->reservation_counter is NULL, then it means this is
+* a shared mapping, and hugetlb cgroup accounting should be
+* done on the file_region entries inside resv_map.
+*/
+   if (!resv->reservation_counter) {
+   ret = hugetlb_cgroup_charge_cgroup(
+   hstate_index(h),
+   (t - f) * pages_per_huge_page(h),
+   &h_cg, 

[RFC PATCH v2 5/5] hugetlb_cgroup: Add hugetlb_cgroup reservation tests

2019-08-08 Thread Mina Almasry
The tests use both shared and private mapped hugetlb memory, and
monitors the hugetlb usage counter as well as the hugetlb reservation
counter. They test different configurations such as hugetlb memory usage
via hugetlbfs, or MAP_HUGETLB, or shmget/shmat, and with and without
MAP_POPULATE.
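
For reference, a stripped-down illustration (not the selftest itself; error
handling trimmed, and /mnt/huge plus the 4MB size are simply the values the
script below uses) of the three allocation methods exercised:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <unistd.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000
#endif

#define SIZE (4UL << 20)	/* two 2MB hugepages */

int main(void)
{
	/* 1. hugetlbfs-backed shared file mapping. */
	int fd = open("/mnt/huge/test", O_CREAT | O_RDWR, 0644);
	char *p1 = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* 2. Anonymous MAP_HUGETLB mapping (add MAP_POPULATE to prefault). */
	char *p2 = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	/* 3. SysV shared memory segment with SHM_HUGETLB. */
	int shmid = shmget(IPC_PRIVATE, SIZE, IPC_CREAT | SHM_HUGETLB | 0600);
	char *p3 = shmid < 0 ? (void *)-1 : shmat(shmid, NULL, 0);

	/* Reservation counters move at mmap/shmget time; writing faults the
	 * pages in and moves the existing usage_in_bytes counters.
	 */
	if (p1 != MAP_FAILED)
		memset(p1, 0, SIZE);
	if (p2 != MAP_FAILED)
		memset(p2, 0, SIZE);
	if (p3 != (void *)-1)
		memset(p3, 0, SIZE);

	printf("done\n");
	return 0;
}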

---
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   4 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 438 ++
 .../selftests/vm/write_hugetlb_memory.sh  |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++
 5 files changed, 717 insertions(+)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 31b3c98b6d34d..d3bed9407773c 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -14,3 +14,4 @@ virtual_address_range
 gup_benchmark
 va_128TBswitch
 map_fixed_noreplace
+write_to_hugetlbfs
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 9534dc2bc9295..8d37d5409b52c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -18,6 +18,7 @@ TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += userfaultfd
 TEST_GEN_FILES += va_128TBswitch
 TEST_GEN_FILES += virtual_address_range
+TEST_GEN_FILES += write_to_hugetlbfs

 TEST_PROGS := run_vmtests

@@ -29,3 +30,6 @@ include ../lib.mk
 $(OUTPUT)/userfaultfd: LDLIBS += -lpthread

 $(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+
+# Why does adding $(OUTPUT)/ like above not apply this flag..?
+write_to_hugetlbfs: CFLAGS += -static
diff --git a/tools/testing/selftests/vm/charge_reserved_hugetlb.sh b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
new file mode 100755
index 0..bf0b6dcec9977
--- /dev/null
+++ b/tools/testing/selftests/vm/charge_reserved_hugetlb.sh
@@ -0,0 +1,438 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+set -e
+
+cgroup_path=/dev/cgroup/memory
+if [[ ! -e $cgroup_path ]]; then
+  mkdir -p $cgroup_path
+  mount -t cgroup -o hugetlb,memory cgroup $cgroup_path
+fi
+
+cleanup () {
+   echo $$ > $cgroup_path/tasks
+
+   set +e
+   if [[ "$(pgrep write_to_hugetlbfs)" != "" ]]; then
+ kill -2 $(pgrep write_to_hugetlbfs)
+ # Wait for hugetlbfs memory to get depleted.
+ sleep 0.5
+   fi
+   set -e
+
+   if [[ -e /mnt/huge ]]; then
+ rm -rf /mnt/huge/*
+ umount /mnt/huge || echo error
+ rmdir /mnt/huge
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test1 ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test1
+   fi
+   if [[ -e $cgroup_path/hugetlb_cgroup_test2 ]]; then
+ rmdir $cgroup_path/hugetlb_cgroup_test2
+   fi
+   echo 0 > /proc/sys/vm/nr_hugepages
+   echo CLEANUP DONE
+}
+
+cleanup
+
+function expect_equal() {
+  local expected="$1"
+  local actual="$2"
+  local error="$3"
+
+  if [[ "$expected" != "$actual" ]]; then
+   echo "expected ($expected) != actual ($actual): $3"
+   cleanup
+   exit 1
+  fi
+}
+
+function setup_cgroup() {
+  local name="$1"
+  local cgroup_limit="$2"
+  local reservation_limit="$3"
+
+  mkdir $cgroup_path/$name
+
+  echo writing cgroup limit: "$cgroup_limit"
+  echo "$cgroup_limit" > $cgroup_path/$name/hugetlb.2MB.limit_in_bytes
+
+  echo writing reservation limit: "$reservation_limit"
+  echo "$reservation_limit" > \
+   $cgroup_path/$name/hugetlb.2MB.reservation_limit_in_bytes
+}
+
+function write_hugetlbfs_and_get_usage() {
+  local cgroup="$1"
+  local size="$2"
+  local populate="$3"
+  local write="$4"
+  local path="$5"
+  local method="$6"
+  local private="$7"
+  local expect_failure="$8"
+
+  # Function return values.
+  reservation_failed=0
+  oom_killed=0
+  hugetlb_difference=0
+  reserved_difference=0
+
+  local hugetlb_usage=$cgroup_path/$cgroup/hugetlb.2MB.usage_in_bytes
+  local reserved_usage=$cgroup_path/$cgroup/hugetlb.2MB.reservation_usage_in_bytes
+
+  local hugetlb_before=$(cat $hugetlb_usage)
+  local reserved_before=$(cat $reserved_usage)
+
+  echo
+  echo Starting:
+  echo hugetlb_usage="$hugetlb_before"
+  echo reserved_usage="$reserved_before"
+  echo expect_failure is "$expect_failure"
+
+  set +e
+  if [[ "$method" == "1" ]] || [[ "$method" == 2 ]] || \
+   [[ "$private" == "-r" ]] && [[ "$expect_failure" != 1 ]]; then
+   bash write_hugetlb_memory.sh "$size" "

[RFC PATCH v2 1/5] hugetlb_cgroup: Add hugetlb_cgroup reservation counter

2019-08-08 Thread Mina Almasry
These counters will track hugetlb reservations rather than hugetlb
memory faulted in. This patch only adds the counter, following patches
add the charging and uncharging of the counter.
---
 include/linux/hugetlb.h |  2 +-
 mm/hugetlb_cgroup.c | 86 +
 2 files changed, 80 insertions(+), 8 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edfca42783192..6777b3013345d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -340,7 +340,7 @@ struct hstate {
unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 #ifdef CONFIG_CGROUP_HUGETLB
/* cgroup control files */
-   struct cftype cgroup_files[5];
+   struct cftype cgroup_files[9];
 #endif
char name[HSTATE_NAME_LEN];
 };
diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 68c2f2f3c05b7..708103663988a 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -25,6 +25,10 @@ struct hugetlb_cgroup {
 * the counter to account for hugepages from hugetlb.
 */
struct page_counter hugepage[HUGE_MAX_HSTATE];
+   /*
+* the counter to account for hugepage reservations from hugetlb.
+*/
+   struct page_counter reserved_hugepage[HUGE_MAX_HSTATE];
 };

 #define MEMFILE_PRIVATE(x, val)(((x) << 16) | (val))
@@ -33,6 +37,15 @@ struct hugetlb_cgroup {

 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;

+static inline
+struct page_counter *get_counter(struct hugetlb_cgroup *h_cg, int idx,
+bool reserved)
+{
+   if (reserved)
+   return  &h_cg->reserved_hugepage[idx];
+   return &h_cg->hugepage[idx];
+}
+
 static inline
 struct hugetlb_cgroup *hugetlb_cgroup_from_css(struct cgroup_subsys_state *s)
 {
@@ -256,28 +269,42 @@ void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,

 enum {
RES_USAGE,
+   RES_RESERVATION_USAGE,
RES_LIMIT,
+   RES_RESERVATION_LIMIT,
RES_MAX_USAGE,
+   RES_RESERVATION_MAX_USAGE,
RES_FAILCNT,
+   RES_RESERVATION_FAILCNT,
 };

 static u64 hugetlb_cgroup_read_u64(struct cgroup_subsys_state *css,
   struct cftype *cft)
 {
struct page_counter *counter;
+   struct page_counter *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(css);

counter = &h_cg->hugepage[MEMFILE_IDX(cft->private)];
+   reserved_counter = &h_cg->reserved_hugepage[MEMFILE_IDX(cft->private)];

switch (MEMFILE_ATTR(cft->private)) {
case RES_USAGE:
return (u64)page_counter_read(counter) * PAGE_SIZE;
+   case RES_RESERVATION_USAGE:
+   return (u64)page_counter_read(reserved_counter) * PAGE_SIZE;
case RES_LIMIT:
return (u64)counter->max * PAGE_SIZE;
+   case RES_RESERVATION_LIMIT:
+   return (u64)reserved_counter->max * PAGE_SIZE;
case RES_MAX_USAGE:
return (u64)counter->watermark * PAGE_SIZE;
+   case RES_RESERVATION_MAX_USAGE:
+   return (u64)reserved_counter->watermark * PAGE_SIZE;
case RES_FAILCNT:
return counter->failcnt;
+   case RES_RESERVATION_FAILCNT:
+   return reserved_counter->failcnt;
default:
BUG();
}
@@ -291,6 +318,7 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
int ret, idx;
unsigned long nr_pages;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));
+   bool reserved = false;

if (hugetlb_cgroup_is_root(h_cg)) /* Can't set limit on root */
return -EINVAL;
@@ -303,10 +331,16 @@ static ssize_t hugetlb_cgroup_write(struct kernfs_open_file *of,
idx = MEMFILE_IDX(of_cft(of)->private);
nr_pages = round_down(nr_pages, 1 << huge_page_order(&hstates[idx]));

+   if (MEMFILE_ATTR(of_cft(of)->private) == RES_RESERVATION_LIMIT) {
+   reserved = true;
+   }
+
switch (MEMFILE_ATTR(of_cft(of)->private)) {
+   case RES_RESERVATION_LIMIT:
case RES_LIMIT:
mutex_lock(&hugetlb_limit_mutex);
-   ret = page_counter_set_max(&h_cg->hugepage[idx], nr_pages);
+   ret = page_counter_set_max(get_counter(h_cg, idx, reserved),
+  nr_pages);
mutex_unlock(&hugetlb_limit_mutex);
break;
default:
@@ -320,18 +354,26 @@ static ssize_t hugetlb_cgroup_reset(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
 {
int ret = 0;
-   struct page_counter *counter;
+   struct page_counter *counter, *reserved_counter;
struct hugetlb_cgroup *h_cg = hugetlb_cgroup_from_css(of_css(of));

counter = &h_cg->hugepage[MEMFILE_IDX(of_cft(of)->private)];
+   reserved_counter = &h_cg->reserved_huge

[RFC PATCH v2 0/5] hugetlb_cgroup: Add hugetlb_cgroup reservation limits

2019-08-08 Thread Mina Almasry
Problem:
Currently tasks attempting to allocate more hugetlb memory than is available get
a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
However, if a task attempts to allocate hugetlb memory that exceeds only its
hugetlb_cgroup limit (but is still globally available), the kernel will allow
the mmap/shmget call, but will SIGBUS the task when it attempts to fault the
memory in.

We have developers interested in using hugetlb_cgroups, and they have expressed
dissatisfaction regarding this behavior. We'd like to improve this
behavior such that tasks violating the hugetlb_cgroup limits get an error at
mmap/shmget time, rather than getting SIGBUS'd when they try to fault
the excess memory in.

The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time.
Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
the offending task gets SIGBUS'd.

Proposed Solution:
A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
counter has slightly different semantics than
hugetlb.xMB.[limit|usage]_in_bytes:

- While usage_in_bytes tracks all *faulted* hugetlb memory,
reservation_usage_in_bytes tracks all *reserved* hugetlb memory.

- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than reservation_limit_in_bytes, the kernel will fail this
reservation.

This proposal is implemented in this patch, with tests to verify
functionality and show the usage.
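
For concreteness, here is a minimal userspace sketch of the intended
semantics (illustrative only, not part of the series). It assumes cgroup-v1
with the hugetlb controller mounted at /dev/cgroup/memory as in the selftest,
the calling task already moved into a "test" child cgroup, and enough 2MB
hugepages preallocated via /proc/sys/vm/nr_hugepages:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		exit(1);
	}
	close(fd);
}

int main(void)
{
	/* Allow 4MB (two 2MB pages) of hugetlb *reservations* in this cgroup. */
	write_file("/dev/cgroup/memory/test/hugetlb.2MB.reservation_limit_in_bytes",
		   "4194304");

	/* A 2MB reservation fits within the limit, so mmap() succeeds. */
	void *ok = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	printf("2MB: %s\n", ok == MAP_FAILED ? "failed" : "reserved");

	/* An 8MB reservation exceeds reservation_limit_in_bytes: with this
	 * series the task gets the error here, at mmap() time, instead of a
	 * SIGBUS later when it faults the memory in.
	 */
	void *big = mmap(NULL, 8UL << 20, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	printf("8MB: %s\n",
	       big == MAP_FAILED ? "failed as intended" : "unexpectedly reserved");
	return 0;
}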

Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to
   the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
   duplication with hugetlb_cgroup. Keeping hugetlb related page counters under
   hugetlb_cgroup seemed cleaner as well.

2. Instead of adding a new counter, we considered adding a sysctl that modifies
   the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at
   reservation time rather than fault time. Adding a new page_counter seems
   better as userspace could, if it wants, choose to enforce different cgroups
   differently: one via limit_in_bytes, and another via
   reservation_limit_in_bytes. This could be very useful if you're
   transitioning how hugetlb memory is partitioned on your system one
   cgroup at a time, for example. Also, someone may find usage for both
   limit_in_bytes and reservation_limit_in_bytes concurrently, and this
   approach gives them the option to do so.

Caveats:
1. This support is implemented for cgroups-v1. I have not tried
   hugetlb_cgroups with cgroups v2, and AFAICT it's not supported yet.
   This is largely because we use cgroups-v1 for now. If required, I
   can add hugetlb_cgroup support to cgroups v2 in this patch or
   a follow up.
2. Most complicated bit of this patch I believe is: where to store the
   pointer to the hugetlb_cgroup to uncharge at unreservation time?
   Normally the cgroup pointers hang off the struct page. But, with
   hugetlb_cgroup reservations, one task can reserve a specific page and another
   task may fault it in (I believe), so storing the pointer in struct
   page is not appropriate. Proposed approach here is to store the pointer in
   the resv_map. See patch for details.

Signed-off-by: Mina Almasry 

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Changes in v2:
- Split the patch into a 5 patch series.
- Fixed patch subject.

Mina Almasry (5):
  hugetlb_cgroup: Add hugetlb_cgroup reservation counter
  hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations
  hugetlb_cgroup: add reservation accounting for private mappings
  hugetlb_cgroup: add accounting for shared mappings
  hugetlb_cgroup: Add hugetlb_cgroup reservation tests

 include/linux/hugetlb.h   |  10 +-
 include/linux/hugetlb_cgroup.h|  19 +-
 mm/hugetlb.c  | 256 --
 mm/hugetlb_cgroup.c   | 153 +-
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   4 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 438 ++
 .../selftests/vm/write_hugetlb_memory.sh  |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++
 9 files changed, 1087 insertions(+), 68 deletions(-)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

--
2.23.0.rc1.153.gdeed80330f-goog


[RFC PATCH v2 2/5] hugetlb_cgroup: Add interface for charge/uncharge

2019-08-08 Thread Mina Almasry
Augments hugetlb_cgroup_charge_cgroup to be able to charge hugetlb
usage or hugetlb reservation counter.

Adds a new interface to uncharge a hugetlb_cgroup counter via
hugetlb_cgroup_uncharge_counter.

Integrates the counter with hugetlb_cgroup, via hugetlb_cgroup_init,
hugetlb_cgroup_have_usage, and hugetlb_cgroup_css_offline.
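
A rough userspace model of how the two halves are meant to pair up (stub
types and deliberately simplified signatures, not the kernel interfaces
themselves): charge the reservation counter when the reservation is taken,
keep a pointer to the counter that was actually charged, and release it
through the uncharge-by-counter helper when the reservation is dropped.

#include <stdio.h>

struct page_counter { long usage, max; };

/* Stand-in for charging with reserved=true: fail if the charge would exceed
 * the counter's limit, otherwise record it.
 */
static int charge_reservation(struct page_counter *c, unsigned long nr_pages)
{
	if (c->usage + (long)nr_pages > c->max)
		return -1;
	c->usage += nr_pages;
	return 0;
}

/* Stand-in for the new uncharge-by-counter helper. */
static void uncharge_counter(struct page_counter *c, unsigned long nr_pages)
{
	c->usage -= nr_pages;
}

int main(void)
{
	struct page_counter reserved = { .usage = 0, .max = 1024 };
	struct page_counter *to_uncharge = NULL;

	/* Reservation time: charge and remember which counter was charged. */
	if (charge_reservation(&reserved, 512) == 0)
		to_uncharge = &reserved;
	printf("after reserve:   %ld pages\n", reserved.usage);

	/* Unreservation time: uncharge exactly what was charged earlier. */
	if (to_uncharge)
		uncharge_counter(to_uncharge, 512);
	printf("after unreserve: %ld pages\n", reserved.usage);
	return 0;
}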

---
 include/linux/hugetlb_cgroup.h |  8 +++--
 mm/hugetlb.c   |  3 +-
 mm/hugetlb_cgroup.c| 63 --
 3 files changed, 61 insertions(+), 13 deletions(-)

diff --git a/include/linux/hugetlb_cgroup.h b/include/linux/hugetlb_cgroup.h
index 063962f6dfc6a..0725f809cd2d9 100644
--- a/include/linux/hugetlb_cgroup.h
+++ b/include/linux/hugetlb_cgroup.h
@@ -52,7 +52,8 @@ static inline bool hugetlb_cgroup_disabled(void)
 }

 extern int hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-   struct hugetlb_cgroup **ptr);
+   struct hugetlb_cgroup **ptr,
+   bool reserved);
 extern void hugetlb_cgroup_commit_charge(int idx, unsigned long nr_pages,
 struct hugetlb_cgroup *h_cg,
 struct page *page);
@@ -60,6 +61,9 @@ extern void hugetlb_cgroup_uncharge_page(int idx, unsigned long nr_pages,
 struct page *page);
 extern void hugetlb_cgroup_uncharge_cgroup(int idx, unsigned long nr_pages,
   struct hugetlb_cgroup *h_cg);
+extern void hugetlb_cgroup_uncharge_counter(struct page_counter *p,
+   unsigned long nr_pages);
+
 extern void hugetlb_cgroup_file_init(void) __init;
 extern void hugetlb_cgroup_migrate(struct page *oldhpage,
   struct page *newhpage);
@@ -83,7 +87,7 @@ static inline bool hugetlb_cgroup_disabled(void)

 static inline int
 hugetlb_cgroup_charge_cgroup(int idx, unsigned long nr_pages,
-struct hugetlb_cgroup **ptr)
+struct hugetlb_cgroup **ptr, bool reserved)
 {
return 0;
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab2..c153bef42e729 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2078,7 +2078,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
gbl_chg = 1;
}

-   ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
+   ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg,
+  false);
if (ret)
goto out_subpool_put;

diff --git a/mm/hugetlb_cgroup.c b/mm/hugetlb_cgroup.c
index 708103663988a..119176a0b2ec5 100644
--- a/mm/hugetlb_cgroup.c
+++ b/mm/hugetlb_cgroup.c
@@ -74,8 +74,10 @@ static inline bool hugetlb_cgroup_have_usage(struct hugetlb_cgroup *h_cg)
int idx;

for (idx = 0; idx < hugetlb_max_hstate; idx++) {
-   if (page_counter_read(&h_cg->hugepage[idx]))
+   if (page_counter_read(get_counter(h_cg, idx, true)) ||
+   page_counter_read(get_counter(h_cg, idx, false))) {
return true;
+   }
}
return false;
 }
@@ -86,18 +88,27 @@ static void hugetlb_cgroup_init(struct hugetlb_cgroup *h_cgroup,
int idx;

for (idx = 0; idx < HUGE_MAX_HSTATE; idx++) {
-   struct page_counter *counter = &h_cgroup->hugepage[idx];
struct page_counter *parent = NULL;
+   struct page_counter *reserved_parent = NULL;
unsigned long limit;
int ret;

-   if (parent_h_cgroup)
-   parent = &parent_h_cgroup->hugepage[idx];
-   page_counter_init(counter, parent);
+   if (parent_h_cgroup) {
+   parent = get_counter(parent_h_cgroup, idx, false);
+   reserved_parent = get_counter(parent_h_cgroup, idx,
+ true);
+   }
+   page_counter_init(get_counter(h_cgroup, idx, false), parent);
+   page_counter_init(get_counter(h_cgroup, idx, true),
+ reserved_parent);

limit = round_down(PAGE_COUNTER_MAX,
   1 << huge_page_order(&hstates[idx]));
-   ret = page_counter_set_max(counter, limit);
+
+   ret = page_counter_set_max(get_counter(
+   h_cgroup, idx, false), limit);
+   ret = page_counter_set_max(get_counter(
+   h_cgroup, idx, true), limit);
VM_BUG_ON(ret);
}
 }
@@ -127,6 +138,25 @@ static void hugetlb_cgroup_css_free(struct cgroup_subsys_state *css)
kfree(h_cgroup);
 }

+static void hugetlb_cgroup_move_par

Re: [RFC PATCH] hugetlbfs: Add hugetlb_cgroup reservation limits

2019-08-08 Thread Mina Almasry
On Thu, Aug 8, 2019 at 1:23 PM shuah  wrote:
>
> On 8/8/19 1:40 PM, Mina Almasry wrote:
> > Problem:
> > Currently tasks attempting to allocate more hugetlb memory than is 
> > available get
> > a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
> > However, if a task attempts to allocate hugetlb memory only more than its
> > hugetlb_cgroup limit allows, the kernel will allow the mmap/shmget call,
> > but will SIGBUS the task when it attempts to fault the memory in.
> >
> > We have developers interested in using hugetlb_cgroups, and they have 
> > expressed
> > dissatisfaction regarding this behavior. We'd like to improve this
> > behavior such that tasks violating the hugetlb_cgroup limits get an error on
> > mmap/shmget time, rather than getting SIGBUS'd when they try to fault
> > the excess memory in.
> >
> > The underlying problem is that today's hugetlb_cgroup accounting happens
> > at hugetlb memory *fault* time, rather than at *reservation* time.
> > Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
> > the offending task gets SIGBUS'd.
> >
> > Proposed Solution:
> > A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. 
> > This
> > counter has slightly different semantics than
> > hugetlb.xMB.[limit|usage]_in_bytes:
> >
> > - While usage_in_bytes tracks all *faulted* hugetlb memory,
> > reservation_usage_in_bytes tracks all *reserved* hugetlb memory.
> >
> > - If a task attempts to reserve more memory than limit_in_bytes allows,
> > the kernel will allow it to do so. But if a task attempts to reserve
> > more memory than reservation_limit_in_bytes, the kernel will fail this
> > reservation.
> >
> > This proposal is implemented in this patch, with tests to verify
> > functionality and show the usage.
> >
> > Alternatives considered:
> > 1. A new cgroup, instead of only a new page_counter attached to
> > the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of 
> > code
> > duplication with hugetlb_cgroup. Keeping hugetlb related page counters 
> > under
> > hugetlb_cgroup seemed cleaner as well.
> >
> > 2. Instead of adding a new counter, we considered adding a sysctl that 
> > modifies
> > the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at
> > reservation time rather than fault time. Adding a new page_counter seems
> > better as userspace could, if it wants, choose to enforce different 
> > cgroups
> > differently: one via limit_in_bytes, and another via
> > reservation_limit_in_bytes. This could be very useful if you're
> > transitioning how hugetlb memory is partitioned on your system one
> > cgroup at a time, for example. Also, someone may find usage for both
> > limit_in_bytes and reservation_limit_in_bytes concurrently, and this
> > approach gives them the option to do so.
> >
> > Caveats:
> > 1. This support is implemented for cgroups-v1. I have not tried
> > hugetlb_cgroups with cgroups v2, and AFAICT it's not supported yet.
> > This is largely because we use cgroups-v1 for now. If required, I
> > can add hugetlb_cgroup support to cgroups v2 in this patch or
> > a follow up.
> > 2. Most complicated bit of this patch I believe is: where to store the
> > pointer to the hugetlb_cgroup to uncharge at unreservation time?
> > Normally the cgroup pointers hang off the struct page. But, with
> > hugetlb_cgroup reservations, one task can reserve a specific page and 
> > another
> > task may fault it in (I believe), so storing the pointer in struct
> > page is not appropriate. Proposed approach here is to store the pointer 
> > in
> > the resv_map. See patch for details.
> >
> > [1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html
> >
> > Signed-off-by: Mina Almasry 
> > ---
> >   include/linux/hugetlb.h   |  10 +-
> >   include/linux/hugetlb_cgroup.h|  19 +-
> >   mm/hugetlb.c  | 256 --
> >   mm/hugetlb_cgroup.c   | 153 +-
>
> Is there a reason why all these changes are in a single patch?
> I can see these split in at least 2 or 3 patches with the test
> as a separate patch.
>

Only because I was expecting feedback on the approach and alternative
approaches before an in-detail review. But, no problem; I'll break it
into smaller patches now.
> Makes it lot easier to review.
>
> thanks,
> -- Shuah


[RFC PATCH] hugetlbfs: Add hugetlb_cgroup reservation limits

2019-08-08 Thread Mina Almasry
Problem:
Currently tasks attempting to allocate more hugetlb memory than is available get
a failure at mmap/shmget time. This is thanks to Hugetlbfs Reservations [1].
However, if a task attempts to allocate hugetlb memory that exceeds only its
hugetlb_cgroup limit (but is still globally available), the kernel will allow
the mmap/shmget call, but will SIGBUS the task when it attempts to fault the
memory in.

We have developers interested in using hugetlb_cgroups, and they have expressed
dissatisfaction regarding this behavior. We'd like to improve this
behavior such that tasks violating the hugetlb_cgroup limits get an error at
mmap/shmget time, rather than getting SIGBUS'd when they try to fault
the excess memory in.

The underlying problem is that today's hugetlb_cgroup accounting happens
at hugetlb memory *fault* time, rather than at *reservation* time.
Thus, enforcing the hugetlb_cgroup limit only happens at fault time, and
the offending task gets SIGBUS'd.

Proposed Solution:
A new page counter named hugetlb.xMB.reservation_[limit|usage]_in_bytes. This
counter has slightly different semantics than
hugetlb.xMB.[limit|usage]_in_bytes:

- While usage_in_bytes tracks all *faulted* hugetlb memory,
reservation_usage_in_bytes tracks all *reserved* hugetlb memory.

- If a task attempts to reserve more memory than limit_in_bytes allows,
the kernel will allow it to do so. But if a task attempts to reserve
more memory than reservation_limit_in_bytes, the kernel will fail this
reservation.

This proposal is implemented in this patch, with tests to verify
functionality and show the usage.

Alternatives considered:
1. A new cgroup, instead of only a new page_counter attached to
   the existing hugetlb_cgroup. Adding a new cgroup seemed like a lot of code
   duplication with hugetlb_cgroup. Keeping hugetlb related page counters under
   hugetlb_cgroup seemed cleaner as well.

2. Instead of adding a new counter, we considered adding a sysctl that modifies
   the behavior of hugetlb.xMB.[limit|usage]_in_bytes, to do accounting at
   reservation time rather than fault time. Adding a new page_counter seems
   better as userspace could, if it wants, choose to enforce different cgroups
   differently: one via limit_in_bytes, and another via
   reservation_limit_in_bytes. This could be very useful if you're
   transitioning how hugetlb memory is partitioned on your system one
   cgroup at a time, for example. Also, someone may find usage for both
   limit_in_bytes and reservation_limit_in_bytes concurrently, and this
   approach gives them the option to do so.

Caveats:
1. This support is implemented for cgroups-v1. I have not tried
   hugetlb_cgroups with cgroups v2, and AFAICT it's not supported yet.
   This is largely because we use cgroups-v1 for now. If required, I
   can add hugetlb_cgroup support to cgroups v2 in this patch or
   a follow up.
2. Most complicated bit of this patch I believe is: where to store the
   pointer to the hugetlb_cgroup to uncharge at unreservation time?
   Normally the cgroup pointers hang off the struct page. But, with
   hugetlb_cgroup reservations, one task can reserve a specific page and another
   task may fault it in (I believe), so storing the pointer in struct
   page is not appropriate. Proposed approach here is to store the pointer in
   the resv_map. See patch for details.

[1]: https://www.kernel.org/doc/html/latest/vm/hugetlbfs_reserv.html

Signed-off-by: Mina Almasry 
---
 include/linux/hugetlb.h   |  10 +-
 include/linux/hugetlb_cgroup.h|  19 +-
 mm/hugetlb.c  | 256 --
 mm/hugetlb_cgroup.c   | 153 +-
 tools/testing/selftests/vm/.gitignore |   1 +
 tools/testing/selftests/vm/Makefile   |   4 +
 .../selftests/vm/charge_reserved_hugetlb.sh   | 438 ++
 .../selftests/vm/write_hugetlb_memory.sh  |  22 +
 .../testing/selftests/vm/write_to_hugetlbfs.c | 252 ++
 9 files changed, 1087 insertions(+), 68 deletions(-)
 create mode 100755 tools/testing/selftests/vm/charge_reserved_hugetlb.sh
 create mode 100644 tools/testing/selftests/vm/write_hugetlb_memory.sh
 create mode 100644 tools/testing/selftests/vm/write_to_hugetlbfs.c

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index edfca42783192..90b3c928d16c1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -46,6 +46,14 @@ struct resv_map {
long adds_in_progress;
struct list_head region_cache;
long region_cache_count;
+ #ifdef CONFIG_CGROUP_HUGETLB
+   /*
+* On private mappings, the counter to uncharge reservations is stored
+* here. If these fields are 0, then the mapping is shared.
+*/
+   struct page_counter *reservation_counter;
+   unsigned long pages_per_hpage;
+#endif
 };
 extern struct resv_map *resv_map_alloc(void);
 void resv_map_release(struct kref *ref);
@@ -340,7 +348,7 @@ struct

Re: [PATCH v2] fs: Fix ovl_i_mutex_dir_key/p->lock/cred cred_guard_mutex deadlock

2019-05-10 Thread Mina Almasry
From: Mina Almasry 
Date: Tue, Apr 23, 2019 at 2:32 PM
To: Mina Almasry, Greg Thelen, Shakeel B, overlayfs
Cc: Alexander Viro, open list:FILESYSTEMS (VFS and infrastructure), open list

> On Fri, Apr 12, 2019 at 4:11 PM Mina Almasry  wrote:
> >
> > These 3 locks are acquired simultaneously in different order causing
> > deadlock:
> >
> > https://syzkaller.appspot.com/bug?id=00f119b8bb35a3acbcfafb9d36a2752b364e8d66
> >
> > ==
> > WARNING: possible circular locking dependency detected
> > 4.19.0-rc5+ #253 Not tainted
> > --
> > syz-executor1/545 is trying to acquire lock:
> > b04209e4 (&ovl_i_mutex_dir_key[depth]){}, at: inode_lock_shared 
> > include/linux/fs.h:748 [inline]
> > b04209e4 (&ovl_i_mutex_dir_key[depth]){}, at: do_last 
> > fs/namei.c:3323 [inline]
> > b04209e4 (&ovl_i_mutex_dir_key[depth]){}, at: 
> > path_openat+0x250d/0x5160 fs/namei.c:3534
> >
> > but task is already holding lock:
> > 44500cca (&sig->cred_guard_mutex){+.+.}, at: 
> > prepare_bprm_creds+0x53/0x120 fs/exec.c:1404
> >
> > which lock already depends on the new lock.
> >
> > the existing dependency chain (in reverse order) is:
> >
> > -> #3 (&sig->cred_guard_mutex){+.+.}:
> >__mutex_lock_common kernel/locking/mutex.c:925 [inline]
> >__mutex_lock+0x166/0x1700 kernel/locking/mutex.c:1072
> >mutex_lock_killable_nested+0x16/0x20 kernel/locking/mutex.c:1102
> >lock_trace+0x4c/0xe0 fs/proc/base.c:384
> >proc_pid_stack+0x196/0x3b0 fs/proc/base.c:420
> >proc_single_show+0x101/0x190 fs/proc/base.c:723
> >seq_read+0x4af/0x1150 fs/seq_file.c:229
> >do_loop_readv_writev fs/read_write.c:700 [inline]
> >do_iter_read+0x4a3/0x650 fs/read_write.c:924
> >vfs_readv+0x175/0x1c0 fs/read_write.c:986
> >do_preadv+0x1cc/0x280 fs/read_write.c:1070
> >__do_sys_preadv fs/read_write.c:1120 [inline]
> >__se_sys_preadv fs/read_write.c:1115 [inline]
> >__x64_sys_preadv+0x9a/0xf0 fs/read_write.c:1115
> >do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
> >entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > -> #2 (&p->lock){+.+.}:
> >__mutex_lock_common kernel/locking/mutex.c:925 [inline]
> >__mutex_lock+0x166/0x1700 kernel/locking/mutex.c:1072
> >mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1087
> >seq_read+0x71/0x1150 fs/seq_file.c:161
> >do_loop_readv_writev fs/read_write.c:700 [inline]
> >do_iter_read+0x4a3/0x650 fs/read_write.c:924
> >vfs_readv+0x175/0x1c0 fs/read_write.c:986
> >kernel_readv fs/splice.c:362 [inline]
> >default_file_splice_read+0x53c/0xb20 fs/splice.c:417
> >do_splice_to+0x12e/0x190 fs/splice.c:881
> >splice_direct_to_actor+0x270/0x8f0 fs/splice.c:953
> >do_splice_direct+0x2d4/0x420 fs/splice.c:1062
> >do_sendfile+0x62a/0xe20 fs/read_write.c:1440
> >__do_sys_sendfile64 fs/read_write.c:1495 [inline]
> >__se_sys_sendfile64 fs/read_write.c:1487 [inline]
> >__x64_sys_sendfile64+0x15d/0x250 fs/read_write.c:1487
> >do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
> >entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > -> #1 (sb_writers#5){.+.+}:
> >percpu_down_read_preempt_disable include/linux/percpu-rwsem.h:36 
> > [inline]
> >percpu_down_read include/linux/percpu-rwsem.h:59 [inline]
> >__sb_start_write+0x214/0x370 fs/super.c:1387
> >sb_start_write include/linux/fs.h:1566 [inline]
> >mnt_want_write+0x3f/0xc0 fs/namespace.c:360
> >ovl_want_write+0x76/0xa0 fs/overlayfs/util.c:24
> >ovl_create_object+0x142/0x3a0 fs/overlayfs/dir.c:596
> >ovl_create+0x2b/0x30 fs/overlayfs/dir.c:627
> >lookup_open+0x1319/0x1b90 fs/namei.c:3234
> >do_last fs/namei.c:3324 [inline]
> >path_openat+0x15e7/0x5160 fs/namei.c:3534
> >do_filp_open+0x255/0x380 fs/namei.c:3564
> >do_sys_open+0x568/0x700 fs/open.c:1063
> >ksys_open include/linux/syscalls.h:1276 [inline]
> >__do_sys_creat fs/open.c:1121 [inline]
> >__se_sys_creat fs/open.c:1119 [inline]
> >__x64_sys_creat+0x61/0x80 fs/open.c:1119
> >do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>

Re: [PATCH v2] fs: Fix ovl_i_mutex_dir_key/p->lock/cred cred_guard_mutex deadlock

2019-04-23 Thread Mina Almasry
On Fri, Apr 12, 2019 at 4:11 PM Mina Almasry  wrote:
>
> These 3 locks are acquired simultaneously in different order causing
> deadlock:
>
> https://syzkaller.appspot.com/bug?id=00f119b8bb35a3acbcfafb9d36a2752b364e8d66
>
> ==
> WARNING: possible circular locking dependency detected
> 4.19.0-rc5+ #253 Not tainted
> --
> syz-executor1/545 is trying to acquire lock:
> b04209e4 (&ovl_i_mutex_dir_key[depth]){}, at: inode_lock_shared 
> include/linux/fs.h:748 [inline]
> b04209e4 (&ovl_i_mutex_dir_key[depth]){}, at: do_last 
> fs/namei.c:3323 [inline]
> b04209e4 (&ovl_i_mutex_dir_key[depth]){}, at: 
> path_openat+0x250d/0x5160 fs/namei.c:3534
>
> but task is already holding lock:
> 44500cca (&sig->cred_guard_mutex){+.+.}, at: 
> prepare_bprm_creds+0x53/0x120 fs/exec.c:1404
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #3 (&sig->cred_guard_mutex){+.+.}:
>__mutex_lock_common kernel/locking/mutex.c:925 [inline]
>__mutex_lock+0x166/0x1700 kernel/locking/mutex.c:1072
>mutex_lock_killable_nested+0x16/0x20 kernel/locking/mutex.c:1102
>lock_trace+0x4c/0xe0 fs/proc/base.c:384
>proc_pid_stack+0x196/0x3b0 fs/proc/base.c:420
>proc_single_show+0x101/0x190 fs/proc/base.c:723
>seq_read+0x4af/0x1150 fs/seq_file.c:229
>do_loop_readv_writev fs/read_write.c:700 [inline]
>do_iter_read+0x4a3/0x650 fs/read_write.c:924
>vfs_readv+0x175/0x1c0 fs/read_write.c:986
>do_preadv+0x1cc/0x280 fs/read_write.c:1070
>__do_sys_preadv fs/read_write.c:1120 [inline]
>__se_sys_preadv fs/read_write.c:1115 [inline]
>__x64_sys_preadv+0x9a/0xf0 fs/read_write.c:1115
>do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> -> #2 (&p->lock){+.+.}:
>__mutex_lock_common kernel/locking/mutex.c:925 [inline]
>__mutex_lock+0x166/0x1700 kernel/locking/mutex.c:1072
>mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1087
>seq_read+0x71/0x1150 fs/seq_file.c:161
>do_loop_readv_writev fs/read_write.c:700 [inline]
>do_iter_read+0x4a3/0x650 fs/read_write.c:924
>vfs_readv+0x175/0x1c0 fs/read_write.c:986
>kernel_readv fs/splice.c:362 [inline]
>default_file_splice_read+0x53c/0xb20 fs/splice.c:417
>do_splice_to+0x12e/0x190 fs/splice.c:881
>splice_direct_to_actor+0x270/0x8f0 fs/splice.c:953
>do_splice_direct+0x2d4/0x420 fs/splice.c:1062
>do_sendfile+0x62a/0xe20 fs/read_write.c:1440
>__do_sys_sendfile64 fs/read_write.c:1495 [inline]
>__se_sys_sendfile64 fs/read_write.c:1487 [inline]
>__x64_sys_sendfile64+0x15d/0x250 fs/read_write.c:1487
>do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> -> #1 (sb_writers#5){.+.+}:
>percpu_down_read_preempt_disable include/linux/percpu-rwsem.h:36 
> [inline]
>percpu_down_read include/linux/percpu-rwsem.h:59 [inline]
>__sb_start_write+0x214/0x370 fs/super.c:1387
>sb_start_write include/linux/fs.h:1566 [inline]
>mnt_want_write+0x3f/0xc0 fs/namespace.c:360
>ovl_want_write+0x76/0xa0 fs/overlayfs/util.c:24
>ovl_create_object+0x142/0x3a0 fs/overlayfs/dir.c:596
>ovl_create+0x2b/0x30 fs/overlayfs/dir.c:627
>lookup_open+0x1319/0x1b90 fs/namei.c:3234
>do_last fs/namei.c:3324 [inline]
>path_openat+0x15e7/0x5160 fs/namei.c:3534
>do_filp_open+0x255/0x380 fs/namei.c:3564
>do_sys_open+0x568/0x700 fs/open.c:1063
>ksys_open include/linux/syscalls.h:1276 [inline]
>__do_sys_creat fs/open.c:1121 [inline]
>__se_sys_creat fs/open.c:1119 [inline]
>__x64_sys_creat+0x61/0x80 fs/open.c:1119
>do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> -> #0 (&ovl_i_mutex_dir_key[depth]){}:
>lock_acquire+0x1ed/0x520 kernel/locking/lockdep.c:3900
>down_read+0xb0/0x1d0 kernel/locking/rwsem.c:24
>inode_lock_shared include/linux/fs.h:748 [inline]
>do_last fs/namei.c:3323 [inline]
>path_openat+0x250d/0x5160 fs/namei.c:3534
>do_filp_open+0x255/0x380 fs/namei.c:3564
>do_open_execat+0x221/0x8e0 fs/exec.c:853
>__do_execve_file.isra.33+0x173f/0x2540 fs/exec.c:1755
>do_execveat_common fs/exec.c:1866 [inline]
>   

Re: [PATCH] fs: Fix ovl_i_mutex_dir_key/p->lock/cred cred_guard_mutex deadlock

2019-04-23 Thread Mina Almasry
On Tue, Apr 23, 2019 at 7:28 AM Miklos Szeredi  wrote:
>
> Cc: linux-unionfs
>
> On Thu, Apr 11, 2019 at 6:48 PM Mina Almasry  wrote:
> >
> > These 3 locks are acquired simultaneously in different order causing
> > deadlock:
> >
> > https://syzkaller.appspot.com/bug?id=00f119b8bb35a3acbcfafb9d36a2752b364e8d66
> >
> > ==
> > WARNING: possible circular locking dependency detected
> > 4.19.0-rc5+ #253 Not tainted
> > --
> > syz-executor1/545 is trying to acquire lock:
> > b04209e4 (&ovl_i_mutex_dir_key[depth]){}, at: inode_lock_shared 
> > include/linux/fs.h:748 [inline]
> > b04209e4 (&ovl_i_mutex_dir_key[depth]){}, at: do_last 
> > fs/namei.c:3323 [inline]
> > b04209e4 (&ovl_i_mutex_dir_key[depth]){}, at: 
> > path_openat+0x250d/0x5160 fs/namei.c:3534
> >
> > but task is already holding lock:
> > 44500cca (&sig->cred_guard_mutex){+.+.}, at: 
> > prepare_bprm_creds+0x53/0x120 fs/exec.c:1404
> >
> > which lock already depends on the new lock.
> >
> > the existing dependency chain (in reverse order) is:
> >
> > -> #3 (&sig->cred_guard_mutex){+.+.}:
> >__mutex_lock_common kernel/locking/mutex.c:925 [inline]
> >__mutex_lock+0x166/0x1700 kernel/locking/mutex.c:1072
> >mutex_lock_killable_nested+0x16/0x20 kernel/locking/mutex.c:1102
> >lock_trace+0x4c/0xe0 fs/proc/base.c:384
> >proc_pid_stack+0x196/0x3b0 fs/proc/base.c:420
> >proc_single_show+0x101/0x190 fs/proc/base.c:723
> >seq_read+0x4af/0x1150 fs/seq_file.c:229
> >do_loop_readv_writev fs/read_write.c:700 [inline]
> >do_iter_read+0x4a3/0x650 fs/read_write.c:924
> >vfs_readv+0x175/0x1c0 fs/read_write.c:986
> >do_preadv+0x1cc/0x280 fs/read_write.c:1070
> >__do_sys_preadv fs/read_write.c:1120 [inline]
> >__se_sys_preadv fs/read_write.c:1115 [inline]
> >__x64_sys_preadv+0x9a/0xf0 fs/read_write.c:1115
> >do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
> >entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > -> #2 (&p->lock){+.+.}:
> >__mutex_lock_common kernel/locking/mutex.c:925 [inline]
> >__mutex_lock+0x166/0x1700 kernel/locking/mutex.c:1072
> >mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1087
> >seq_read+0x71/0x1150 fs/seq_file.c:161
> >do_loop_readv_writev fs/read_write.c:700 [inline]
> >do_iter_read+0x4a3/0x650 fs/read_write.c:924
> >vfs_readv+0x175/0x1c0 fs/read_write.c:986
> >kernel_readv fs/splice.c:362 [inline]
> >default_file_splice_read+0x53c/0xb20 fs/splice.c:417
> >do_splice_to+0x12e/0x190 fs/splice.c:881
> >splice_direct_to_actor+0x270/0x8f0 fs/splice.c:953
> >do_splice_direct+0x2d4/0x420 fs/splice.c:1062
> >do_sendfile+0x62a/0xe20 fs/read_write.c:1440
> >__do_sys_sendfile64 fs/read_write.c:1495 [inline]
> >__se_sys_sendfile64 fs/read_write.c:1487 [inline]
> >__x64_sys_sendfile64+0x15d/0x250 fs/read_write.c:1487
> >do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
> >entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > -> #1 (sb_writers#5){.+.+}:
> >percpu_down_read_preempt_disable include/linux/percpu-rwsem.h:36 
> > [inline]
> >percpu_down_read include/linux/percpu-rwsem.h:59 [inline]
> >__sb_start_write+0x214/0x370 fs/super.c:1387
> >sb_start_write include/linux/fs.h:1566 [inline]
> >mnt_want_write+0x3f/0xc0 fs/namespace.c:360
> >ovl_want_write+0x76/0xa0 fs/overlayfs/util.c:24
> >ovl_create_object+0x142/0x3a0 fs/overlayfs/dir.c:596
> >ovl_create+0x2b/0x30 fs/overlayfs/dir.c:627
> >lookup_open+0x1319/0x1b90 fs/namei.c:3234
> >do_last fs/namei.c:3324 [inline]
> >path_openat+0x15e7/0x5160 fs/namei.c:3534
> >do_filp_open+0x255/0x380 fs/namei.c:3564
> >do_sys_open+0x568/0x700 fs/open.c:1063
> >ksys_open include/linux/syscalls.h:1276 [inline]
> >__do_sys_creat fs/open.c:1121 [inline]
> >__se_sys_creat fs/open.c:1119 [inline]
> >__x64_sys_creat+0x61/0x80 fs/open.c:1119
> >do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
> >entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >
> > -> #0 (&ovl_i_mutex_dir_key[depth]){+++

[PATCH] fs: Fix ovl_i_mutex_dir_key/p->lock/cred cred_guard_mutex deadlock

2019-04-11 Thread Mina Almasry
 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&sig->cred_guard_mutex);
                               lock(&p->lock);
                               lock(&sig->cred_guard_mutex);
  lock(&ovl_i_mutex_dir_key[depth]);

 *** DEADLOCK ***

Solution: I establish this locking order for these locks:

1. ovl_i_mutex_dir_key
2. p->lock
3. sig->cred_guard_mutex

In this change I fix the locking order of exec.c, which is the only
instance that violates this order.
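
The principle is the usual one: as long as every code path that takes more
than one of these locks acquires them in the same global order, no cycle can
form. A tiny userspace sketch with pthread mutexes standing in for the kernel
locks (purely illustrative, not kernel code):

#include <pthread.h>
#include <stdio.h>

/* Order: 1. ovl_i_mutex_dir_key  2. p->lock  3. cred_guard_mutex */
static pthread_mutex_t ovl_dir = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t p_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t cred_guard = PTHREAD_MUTEX_INITIALIZER;

/* execve path: open the file (ovl lock) before taking cred_guard_mutex. */
static void *exec_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&ovl_dir);
	pthread_mutex_lock(&cred_guard);
	pthread_mutex_unlock(&cred_guard);
	pthread_mutex_unlock(&ovl_dir);
	return NULL;
}

/* /proc read path: p->lock, then cred_guard_mutex -- same relative order. */
static void *proc_read_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&p_lock);
	pthread_mutex_lock(&cred_guard);
	pthread_mutex_unlock(&cred_guard);
	pthread_mutex_unlock(&p_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, exec_path, NULL);
	pthread_create(&b, NULL, proc_read_path, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("no deadlock: both paths respect the same lock order");
	return 0;
}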

Signed-off-by: Mina Almasry 
---
 fs/exec.c | 20 
 1 file changed, 8 insertions(+), 12 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 2e0033348d8e..423d90bc75cc 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1742,6 +1742,12 @@ static int __do_execve_file(int fd, struct filename *filename,
if (retval)
goto out_ret;
 
+   if (!file)
+   file = do_open_execat(fd, filename, flags);
+   retval = PTR_ERR(file);
+   if (IS_ERR(file))
+   goto out_free;
+
retval = -ENOMEM;
bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
if (!bprm)
@@ -1754,12 +1760,6 @@ static int __do_execve_file(int fd, struct filename *filename,
check_unsafe_exec(bprm);
current->in_execve = 1;
 
-   if (!file)
-   file = do_open_execat(fd, filename, flags);
-   retval = PTR_ERR(file);
-   if (IS_ERR(file))
-   goto out_unmark;
-
sched_exec();
 
bprm->file = file;
@@ -1775,7 +1775,7 @@ static int __do_execve_file(int fd, struct filename *filename,
fd, filename->name);
if (!pathbuf) {
retval = -ENOMEM;
-   goto out_unmark;
+   goto out_free;
}
/*
 * Record that a name derived from an O_CLOEXEC fd will be
@@ -1790,7 +1790,7 @@ static int __do_execve_file(int fd, struct filename *filename,
 
retval = bprm_mm_init(bprm);
if (retval)
-   goto out_unmark;
+   goto out_free;
 
retval = prepare_arg_pages(bprm, argv, envp);
if (retval < 0)
@@ -1840,10 +1840,6 @@ static int __do_execve_file(int fd, struct filename *filename,
mmput(bprm->mm);
}
 
-out_unmark:
-   current->fs->in_exec = 0;
-   current->in_execve = 0;
-
 out_free:
free_bprm(bprm);
kfree(pathbuf);
-- 
2.21.0.392.gf8f6787159e-goog