Re: [dm-devel] [PATCH 6/6] mm: Add memalloc_nowait

2020-06-30 Thread Michal Hocko
On Wed 01-07-20 05:12:03, Matthew Wilcox wrote:
> On Tue, Jun 30, 2020 at 08:34:36AM +0200, Michal Hocko wrote:
> > On Mon 29-06-20 22:28:30, Matthew Wilcox wrote:
> > [...]
> > > The documentation is hard to add a new case to, so I rewrote it.  What
> > > do you think?  (Obviously I'll split this out differently for submission;
> > > this is just what I have in my tree right now).
> > 
> > I am fine with your changes. Few notes below.
> 
> Thanks!
> 
> > > -It turned out though that above approach has led to
> > > -abuses when the restricted gfp mask is used "just in case" without a
> > > -deeper consideration which leads to problems because an excessive use
> > > -of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> > > -reclaim issues.
> > 
> > I believe this is an important part because it shows that new people
> > coming to the existing code shouldn't take it as correct but rather
> > question it. It also gives a clear indication that overuse causes real
> > problems that might not be immediately visible to subsystems outside
> > of MM.
> 
> It seemed to say a lot of the same things as this paragraph:
> 
> +You may notice that quite a few allocations in the existing code specify
> +``GFP_NOIO`` or ``GFP_NOFS``. Historically, they were used to prevent
> +recursion deadlocks caused by direct memory reclaim calling back into
> +the FS or IO paths and blocking on already held resources. Since 4.12
> +the preferred way to address this issue is to use the new scope APIs
> +described below.
> 
> Since this is in core-api/ rather than vm/, I felt that discussion of
> the problems that it causes to the mm was a bit too much detail for the
> people who would be reading this document.  Maybe I could move that
> information into a new Documentation/vm/reclaim.rst file?

Hmm, my experience is that at least some users of NOFS/NOIO use the
flag just to be on the safe side, without realizing that this might
have a very negative effect on the operation of the whole system.
That was the main motivation for having an explicit note there.
I am not sure that having it in MM-internal documentation will make it
stand out for a general reader.

But I will not insist of course.

> Let's see if Our Grumpy Editor has time to give us his advice on this.
> 
> > > -FS/IO code then simply calls the appropriate save function before
> > > -any critical section with respect to the reclaim is started - e.g.
> > > -lock shared with the reclaim context or when a transaction context
> > > -nesting would be possible via reclaim.  
> > 
> > [...]
> > 
> > > +These functions should be called at the point where any memory allocation
> > > +would start to cause problems.  That is, do not simply wrap individual
> > > +memory allocation calls which currently use ``GFP_NOFS`` with a pair
> > > +of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> > > +find the lock which is taken that would cause problems if memory reclaim
> > > +reentered the filesystem, place a call to memalloc_nofs_save() before it
> > > +is acquired and a call to memalloc_nofs_restore() after it is released.
> > > +Ideally also add a comment explaining why this lock will be problematic.
> > 
> > The above text mentioned the transaction context nesting as well, and
> > that was a hint by Dave IIRC. It is imho good to have an example of
> > reentrant points other than just locks. I believe another useful example
> > would be something like the loop device, which mixes the IO and FS
> > layers, but I am not familiar with all the details to give you a useful
> > text.
> 
> I'll let Mikulas & Dave finish fighting about that before I write any
> text mentioning the loop driver.  How about this for mentioning the
> filesystem transaction possibility?
> 
> @@ -103,12 +103,16 @@ flags specified by any particular call to allocate 
> memory.
>  
>  These functions should be called at the point where any memory allocation
>  would start to cause problems.  That is, do not simply wrap individual
> -memory allocation calls which currently use ``GFP_NOFS`` with a pair
> -of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> -find the lock which is taken that would cause problems if memory reclaim
> +memory allocation calls which currently use ``GFP_NOFS`` with a pair of
> +calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead, find
> +the resource which is acquired that would cause problems if memory reclaim
>  reentered the filesystem, place a call to memalloc_nofs_save() before it
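
For illustration, the idiom being documented can be modeled in userspace. Everything below (the flag values, the fs_lock placeholder, locked_allocation()) is an illustrative sketch, not kernel code; only the names memalloc_nofs_save(), memalloc_nofs_restore() and current_gfp_context() come from the kernel API discussed above:

```c
#include <assert.h>

/* Illustrative gfp bits -- not the kernel's real values. */
#define __GFP_FS   0x1u
#define __GFP_IO   0x2u
#define GFP_KERNEL (__GFP_FS | __GFP_IO)

static unsigned int memalloc_nofs;	/* models current->memalloc_nofs */

static unsigned int memalloc_nofs_save(void)
{
	unsigned int old = memalloc_nofs;

	memalloc_nofs = 1;
	return old;
}

static void memalloc_nofs_restore(unsigned int old)
{
	memalloc_nofs = old;
}

/* Models current_gfp_context(): the scope masks __GFP_FS off any request. */
static unsigned int current_gfp_context(unsigned int flags)
{
	return memalloc_nofs ? flags & ~__GFP_FS : flags;
}

/*
 * The documented idiom: enter the scope before acquiring the resource that
 * reclaim could deadlock on, leave it after release.  Allocations in
 * between simply pass GFP_KERNEL and get the FS bit stripped centrally.
 */
static unsigned int locked_allocation(void)
{
	unsigned int nofs = memalloc_nofs_save();
	unsigned int effective;

	/* lock(fs_lock): reclaim reentering the fs would block on this */
	effective = current_gfp_context(GFP_KERNEL);
	/* unlock(fs_lock) */
	memalloc_nofs_restore(nofs);
	return effective;
}
```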

Re: [dm-devel] [PATCH 6/6] mm: Add memalloc_nowait

2020-06-29 Thread Michal Hocko
On Mon 29-06-20 22:28:30, Matthew Wilcox wrote:
[...]
> The documentation is hard to add a new case to, so I rewrote it.  What
> do you think?  (Obviously I'll split this out differently for submission;
> this is just what I have in my tree right now).

I am fine with your changes. Few notes below.

> -It turned out though that above approach has led to
> -abuses when the restricted gfp mask is used "just in case" without a
> -deeper consideration which leads to problems because an excessive use
> -of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> -reclaim issues.

I believe this is an important part because it shows that new people
coming to the existing code shouldn't take it as correct but rather
question it. It also gives a clear indication that overuse causes real
problems that might not be immediately visible to subsystems outside
of MM.

> -FS/IO code then simply calls the appropriate save function before
> -any critical section with respect to the reclaim is started - e.g.
> -lock shared with the reclaim context or when a transaction context
> -nesting would be possible via reclaim.  

[...]

> +These functions should be called at the point where any memory allocation
> +would start to cause problems.  That is, do not simply wrap individual
> +memory allocation calls which currently use ``GFP_NOFS`` with a pair
> +of calls to memalloc_nofs_save() and memalloc_nofs_restore().  Instead,
> +find the lock which is taken that would cause problems if memory reclaim
> +reentered the filesystem, place a call to memalloc_nofs_save() before it
> +is acquired and a call to memalloc_nofs_restore() after it is released.
> +Ideally also add a comment explaining why this lock will be problematic.

The above text mentioned the transaction context nesting as well, and
that was a hint by Dave IIRC. It is imho good to have an example of
reentrant points other than just locks. I believe another useful example
would be something like the loop device, which mixes the IO and FS
layers, but I am not familiar with all the details to give you a useful
text.

[...]
> @@ -104,16 +134,19 @@ ARCH_KMALLOC_MINALIGN bytes.  For sizes which are a 
> power of two, the
>  alignment is also guaranteed to be at least the respective size.
>  
>  For large allocations you can use vmalloc() and vzalloc(), or directly
> -request pages from the page allocator. The memory allocated by `vmalloc`
> -and related functions is not physically contiguous.
> +request pages from the page allocator.  The memory allocated by `vmalloc`
> +and related functions is not physically contiguous.  The `vmalloc`
> +family of functions don't support the old ``GFP_NOFS`` or ``GFP_NOIO``
> +flags because there are hardcoded ``GFP_KERNEL`` allocations deep inside
> +the allocator which are hard to remove.  However, the scope APIs described
> +above can be used to limit the `vmalloc` functions.

I would reiterate "Do not just wrap vmalloc in the scope API but rather
rely on the real scope for the NOFS/NOIO context". Maybe we also want to
stress that once a scope is defined it is sticky to _all_
allocations and all allocators within that scope. The text already
says that but maybe we want to make it explicit and make it stand out.
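
The stickiness point can be sketched in a userspace model. The flag, scope helpers and the two mock allocators below are illustrative; the idea they model is that the masking happens at one central point (as in current_gfp_context()), so every allocator inside the scope -- including one that hardcodes GFP_KERNEL internally, like the vmalloc family -- inherits the restriction:

```c
#include <assert.h>

/* Illustrative gfp bits -- not the kernel's real values. */
#define __GFP_FS   0x1u
#define __GFP_IO   0x2u
#define GFP_KERNEL (__GFP_FS | __GFP_IO)

static unsigned int in_nofs_scope;	/* models current->memalloc_nofs */

static void enter_nofs_scope(void) { in_nofs_scope = 1; }
static void leave_nofs_scope(void) { in_nofs_scope = 0; }

/* Central masking point, as in current_gfp_context(). */
static unsigned int effective_flags(unsigned int flags)
{
	return in_nofs_scope ? flags & ~__GFP_FS : flags;
}

/* Two different "allocators": both consult the same per-task scope. */
static unsigned int mock_kmalloc(unsigned int flags)
{
	return effective_flags(flags);
}

static unsigned int mock_vmalloc_internal_alloc(void)
{
	/* hardcoded GFP_KERNEL deep inside the allocator */
	return effective_flags(GFP_KERNEL);
}
```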

[...]
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 6484569f50df..9fc091274d1d 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -186,9 +186,10 @@ static inline gfp_t current_gfp_context(gfp_t flags)
>* them.  noio implies neither IO nor FS and it is a weaker
>* context so always make sure it takes precedence.
>*/
> - if (current->memalloc_nowait)
> + if (current->memalloc_nowait) {
>   flags &= ~__GFP_DIRECT_RECLAIM;
> - else if (current->memalloc_noio)
> + flags |= __GFP_NOWARN;

I dunno. I wouldn't make nowait implicitly NOWARN as well. At least not
in the initial implementation. Maybe we will learn later that there is
just too much unhelpful noise in the kernel log and reconsider, but
I wouldn't start with that. Also we might learn that other
modifiers for atomic (or should I say non-sleeping) scopes need to be
defined, e.g. access to memory reserves, but let's just wait for real
usecases.
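
For reference, the variant being argued for here -- nowait stripping direct reclaim without implying NOWARN -- can be sketched as a userspace model of current_gfp_context(). The bit values are illustrative; only the field names and the precedence order (nowait is the weakest context, then noio, then nofs) follow the quoted patch:

```c
#include <assert.h>

/* Illustrative gfp bits -- not the kernel's real values. */
#define __GFP_FS             0x1u
#define __GFP_IO             0x2u
#define __GFP_DIRECT_RECLAIM 0x4u
#define __GFP_NOWARN         0x8u
#define GFP_KERNEL (__GFP_FS | __GFP_IO | __GFP_DIRECT_RECLAIM)

/* Models the task_struct bitfields from the patch series. */
struct task {
	unsigned memalloc_nowait:1;
	unsigned memalloc_noio:1;
	unsigned memalloc_nofs:1;
};

static unsigned int current_gfp_context(const struct task *t,
					unsigned int flags)
{
	if (t->memalloc_nowait)
		flags &= ~__GFP_DIRECT_RECLAIM;	/* no implicit __GFP_NOWARN */
	else if (t->memalloc_noio)
		flags &= ~(__GFP_IO | __GFP_FS);
	else if (t->memalloc_nofs)
		flags &= ~__GFP_FS;
	return flags;
}
```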


Thanks a lot Matthew!
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel



Re: [dm-devel] [PATCH 6/6] mm: Add memalloc_nowait

2020-06-29 Thread Michal Hocko
On Mon 29-06-20 13:18:16, Matthew Wilcox wrote:
> On Mon, Jun 29, 2020 at 08:08:51AM +0300, Mike Rapoport wrote:
> > > @@ -886,8 +868,12 @@ static struct dm_buffer 
> > > *__alloc_buffer_wait_no_callback(struct dm_bufio_client
> > >   return NULL;
> > >  
> > >   if (dm_bufio_cache_size_latch != 1 && !tried_noio_alloc) {
> > > + unsigned noio_flag;
> > > +
> > >   dm_bufio_unlock(c);
> > > - b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | 
> > > __GFP_NOMEMALLOC | __GFP_NOWARN);
> > > + noio_flag = memalloc_noio_save();
> > 
> > I've read the series twice and I'm still missing the definition of
> > memalloc_noio_save().
> > 
> > And also it would be nice to have a paragraph about it in
> > Documentation/core-api/memory-allocation.rst
> 
> Documentation/core-api/gfp_mask-from-fs-io.rst:``memalloc_nofs_save``, 
> ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
> Documentation/core-api/gfp_mask-from-fs-io.rst:   :functions: 
> memalloc_noio_save memalloc_noio_restore
> Documentation/core-api/gfp_mask-from-fs-io.rst:allows nesting so it is safe 
> to call ``memalloc_noio_save`` or
 
The patch is adding memalloc_nowait* and I suspect Mike had that in
mind, which would be a fair request. Btw. we are missing memalloc_nocma*
documentation as well - I was just reminded of its existence today...
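
The nesting guarantee the quoted gfp_mask-from-fs-io.rst text refers to ("allows nesting so it is safe to call...") comes from save() returning the previous flag value and restore() reinstating exactly that. A minimal userspace sketch (the flag models current->memalloc_noio; nested_scopes() is a test helper, not kernel code):

```c
#include <assert.h>

static unsigned int memalloc_noio;	/* models current->memalloc_noio */

static unsigned int memalloc_noio_save(void)
{
	unsigned int old = memalloc_noio;

	memalloc_noio = 1;
	return old;
}

static void memalloc_noio_restore(unsigned int old)
{
	memalloc_noio = old ? 1 : 0;
}

/* An inner save/restore pair cannot clear an outer scope. */
static unsigned int nested_scopes(void)
{
	unsigned int outer = memalloc_noio_save();	/* outer scope */
	unsigned int inner = memalloc_noio_save();	/* nested scope */
	unsigned int still_set;

	memalloc_noio_restore(inner);	/* outer scope must stay active */
	still_set = memalloc_noio;
	memalloc_noio_restore(outer);
	return still_set;
}
```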

-- 
Michal Hocko
SUSE Labs




Re: [dm-devel] [PATCH 0/6] Overhaul memalloc_no*

2020-06-29 Thread Michal Hocko
On Sat 27-06-20 09:08:47, Dave Chinner wrote:
> On Fri, Jun 26, 2020 at 11:02:19AM -0400, Mikulas Patocka wrote:
> > Hi
> > 
> > I suggest to join memalloc_noio and memalloc_nofs into just one flag that 
> > prevents both filesystem recursion and i/o recursion.
> > 
> > Note that any I/O can recurse into a filesystem via the loop device, thus 
> > it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set 
> > and PF_MEMALLOC_NOIO is not set.
> 
> Correct me if I'm wrong, but I think that will prevent swapping from
> GFP_NOFS memory reclaim contexts. IOWs, this will substantially
> change the behaviour of the memory reclaim system under sustained
> GFP_NOFS memory pressure. Sustained GFP_NOFS memory pressure is
> quite common, so I really don't think we want to telling memory
> reclaim "you can't do IO at all" when all we are trying to do is
> prevent recursion back into the same filesystem.
> 
> Given that the loop device IO path already operates under
> memalloc_noio context, (i.e. the recursion restriction is applied in
> only the context that needs is) I see no reason for making that a
> global reclaim limitation
> 
> In reality, we need to be moving the other way with GFP_NOFS - to
> fine grained anti-recursion contexts, not more broad contexts.

Absolutely agreed! It is not really hard to see system struggling due to
heavy FS metadata workload while there are objects which could be
reclaimed.

> That is, GFP_NOFS prevents recursion into any filesystem, not just
> the one that we are actively operating on and needing to prevent
> recursion back into. We can safely have reclaim do reclaim work on
> other filesystems without fear of recursion deadlocks, but the
> memory reclaim infrastructure does not provide that capability.(*)
> 
> e.g. if memalloc_nofs_save() took a reclaim context structure into
> which the filesystem put the superblock, the superblock's nesting depth
> (because layering on loop devices can create cross-filesystem
> recursion dependencies), and any other filesystem private data the
> fs wanted to add, we could actually have reclaim only avoid reclaim
> from filesystems where there is a deadlock possibility. e.g.:
> 
>   - superblock nesting depth is different: apply GFP_NOFS
> reclaim unconditionally
>   - superblock is different: apply GFP_KERNEL reclaim
>   - superblock is the same: pass the context to the filesystem to
> decide if reclaim from that superblock is safe.
> 
> At this point, memory reclaim is always able to
> reclaim from filesystems that are not at risk of recursion
> deadlocks. Direct reclaim is much more likely to be able to make
> progress now because it is much less restricted in what it can
> reclaim. That's going to make direct reclaim faster and more
> efficient, and that's the ultimate goal we are aiming to achieve
> here...

Yes, we discussed something like that a few years back at LSFMM IIRC.
The scoped NOFS/NOIO api was just a first step to reduce explicit
NOFS/NOIO usage, with a hope that we would get the no-recursion entry
points much more well defined and get rid of many instances of "this is
fs code so it has to use the NOFS gfp mask".

Some of that has happened and that is really great. On the other hand,
many people still like to use that api as a workaround for an immediate
problem, because no-recursion scopes are much harder to recognize unless
you are super familiar with the specific fs/IO layer implementation.
So this is definitely not a project where somebody can just go over all
the code and do the cleanup.

Thanks!
-- 
Michal Hocko
SUSE Labs




Re: [dm-devel] [PATCH 0/6] Overhaul memalloc_no*

2020-06-25 Thread Michal Hocko
On Thu 25-06-20 11:48:32, Darrick J. Wong wrote:
> On Thu, Jun 25, 2020 at 12:31:16PM +0100, Matthew Wilcox (Oracle) wrote:
> > I want a memalloc_nowait like we have memalloc_noio and memalloc_nofs
> > for an upcoming patch series, and Jens also wants it for non-blocking
> > io_uring.  It turns out we already have dm-bufio which could benefit
> > from memalloc_nowait, so it may as well go into the tree now.
> > 
> > The biggest problem is that we're basically out of PF_ flags, so we need
> > to find somewhere else to store the PF_MEMALLOC_NOWAIT flag.  It turns
> > out the PF_ flags are really supposed to be used for flags which are
> > accessed from other tasks, and the MEMALLOC flags are only going to
> > be used by this task.  So shuffling everything around frees up some PF
> > flags and generally makes the world a better place.
> 
> So, uh, how does this intersect with the patch "xfs: reintroduce
> PF_FSTRANS for transaction reservation recursion protection" that
> re-adds PF_TRANS because uh I guess we lost some subtlety or another at
> some point?

This is independent, really. It just relocates the NOFS flag. PF_TRANS
is reintroduced for a different reason. When I replaced the
original PF_TRANS with PF_MEMALLOC_NOFS I didn't realize that xfs needs
not only the NOFS semantic but also the transaction tracking, so this
cannot be a single bit only. So it has to be added back. But
PF_MEMALLOC_NOFS needs to stay for the scoped NOFS semantic.

Hope this clarifies it a bit.
-- 
Michal Hocko
SUSE Labs




Re: [dm-devel] [PATCH 4/6] mm: Replace PF_MEMALLOC_NOFS with memalloc_nofs

2020-06-25 Thread Michal Hocko
On Thu 25-06-20 12:31:20, Matthew Wilcox wrote:
> We're short on PF_* flags, so make memalloc_nofs its own bit where we
> have plenty of space.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

forgot to add
Acked-by: Michal Hocko 

> ---
>  fs/iomap/buffered-io.c   |  2 +-
>  include/linux/sched.h|  2 +-
>  include/linux/sched/mm.h | 13 ++---
>  3 files changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index bcfc288dba3f..87d66c13bf5c 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1502,7 +1502,7 @@ iomap_do_writepage(struct page *page, struct 
> writeback_control *wbc, void *data)
>* Given that we do not allow direct reclaim to call us, we should
>* never be called in a recursive filesystem reclaim context.
>*/
> - if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> + if (WARN_ON_ONCE(current->memalloc_nofs))
>   goto redirty;
>  
>   /*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index cf18a3d2bc4c..eaf36ae1fde2 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -802,6 +802,7 @@ struct task_struct {
>   unsignedin_memstall:1;
>  #endif
>   unsignedmemalloc_noio:1;
> + unsignedmemalloc_nofs:1;
>  
>   unsigned long   atomic_flags; /* Flags requiring atomic 
> access. */
>  
> @@ -1505,7 +1506,6 @@ extern struct pid *cad_pid;
>  #define PF_NOFREEZE  0x8000  /* This thread should not be 
> frozen */
>  #define PF_FROZEN0x0001  /* Frozen for system suspend */
>  #define PF_KSWAPD0x0002  /* I am kswapd */
> -#define PF_MEMALLOC_NOFS 0x0004  /* All allocation requests will 
> inherit GFP_NOFS */
>  #define PF_LOCAL_THROTTLE0x0010  /* Throttle writes only against 
> the bdi I write to,
>* I am cleaning dirty pages 
> from some other bdi. */
>  #define PF_KTHREAD   0x0020  /* I am a kernel thread */
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index b0089eadc367..08bc9d0606a8 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -175,20 +175,19 @@ static inline bool in_vfork(struct task_struct *tsk)
>  
>  /*
>   * Applies per-task gfp context to the given allocation flags.
> - * PF_MEMALLOC_NOFS implies GFP_NOFS
>   * PF_MEMALLOC_NOCMA implies no allocation from CMA region.
>   */
>  static inline gfp_t current_gfp_context(gfp_t flags)
>  {
> - if (unlikely(current->flags & (PF_MEMALLOC_NOFS | PF_MEMALLOC_NOCMA) ||
> -  current->memalloc_noio)) {
> + if (unlikely((current->flags & PF_MEMALLOC_NOCMA) ||
> +  current->memalloc_noio || current->memalloc_nofs)) {
>   /*
>* NOIO implies both NOIO and NOFS and it is a weaker context
>* so always make sure it makes precedence
>*/
>   if (current->memalloc_noio)
>   flags &= ~(__GFP_IO | __GFP_FS);
> - else if (current->flags & PF_MEMALLOC_NOFS)
> + else if (current->memalloc_nofs)
>   flags &= ~__GFP_FS;
>  #ifdef CONFIG_CMA
>   if (current->flags & PF_MEMALLOC_NOCMA)
> @@ -254,8 +253,8 @@ static inline void memalloc_noio_restore(unsigned int 
> flags)
>   */
>  static inline unsigned int memalloc_nofs_save(void)
>  {
> - unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> - current->flags |= PF_MEMALLOC_NOFS;
> + unsigned int flags = current->memalloc_nofs;
> + current->memalloc_nofs = 1;
>   return flags;
>  }
>  
> @@ -269,7 +268,7 @@ static inline unsigned int memalloc_nofs_save(void)
>   */
>  static inline void memalloc_nofs_restore(unsigned int flags)
>  {
> - current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
> + current->memalloc_nofs = flags ? 1 : 0;
>  }
>  
>  static inline unsigned int memalloc_noreclaim_save(void)
> -- 
> 2.27.0

-- 
Michal Hocko
SUSE Labs




Re: [dm-devel] [PATCH 6/6] mm: Add memalloc_nowait

2020-06-25 Thread Michal Hocko
On Thu 25-06-20 14:10:55, Matthew Wilcox wrote:
> On Thu, Jun 25, 2020 at 02:40:17PM +0200, Michal Hocko wrote:
> > On Thu 25-06-20 12:31:22, Matthew Wilcox wrote:
> > > Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> > > guarantees we will not sleep to reclaim memory.  Use it to simplify
> > > dm-bufio's allocations.
> > 
> > memalloc_nowait is a good idea! I suspect the primary usecase would be
> > vmalloc.
> 
> That's funny.  My use case is allocating page tables in an RCU protected
> page fault handler.  Jens' use case is allocating page cache.  This one
> is a vmalloc consumer (which is also indirectly page table allocation).
> 
> > > @@ -877,7 +857,9 @@ static struct dm_buffer 
> > > *__alloc_buffer_wait_no_callback(struct dm_bufio_client
> > >*/
> > >   while (1) {
> > >   if (dm_bufio_cache_size_latch != 1) {
> > > - b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | 
> > > __GFP_NOMEMALLOC | __GFP_NOWARN);
> > > + unsigned nowait_flag = memalloc_nowait_save();
> > > + b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | 
> > > __GFP_NOWARN);
> > > + memalloc_nowait_restore(nowait_flag);
> > 
> > This looks confusing though. I am not familiar with alloc_buffer and
> > there is quite some tweaking around __GFP_NORETRY in alloc_buffer_data
> > which I do not follow, but GFP_KERNEL just struck my eye. So why can't
> > we have 
> > alloc_buffer(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
> 
> Actually, I wanted to ask about the proliferation of __GFP_NOMEMALLOC
> in the block layer.  Am I right in thinking it really has no effect
> unless GFP_ATOMIC is set?

It does have an effect: it inhibits __GFP_MEMALLOC and, respectively,
its process-scope variant PF_MEMALLOC.

> It seems like a magic flag that some driver
> developers are sprinkling around randomly, so we probably need to clarify
> the documentation on it.

Would the following make more sense?
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67a0774e080b..014aa7a6d36a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -116,8 +116,9 @@ struct vm_area_struct;
  * Usage of a pre-allocated pool (e.g. mempool) should be always considered
  * before using this flag.
  *
- * %__GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
- * This takes precedence over the %__GFP_MEMALLOC flag if both are set.
+ * %__GFP_NOMEMALLOC is used to inhibit %__GFP_MEMALLOC and its process
+ * scope variant, PF_MEMALLOC. So use this flag if the allocation context
+ * of the caller might have one or the other set.
  */
 #define __GFP_ATOMIC   ((__force gfp_t)___GFP_ATOMIC)
 #define __GFP_HIGH ((__force gfp_t)___GFP_HIGH)
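
A sketch of the precedence described here: %__GFP_NOMEMALLOC inhibits both %__GFP_MEMALLOC and the process-scope PF_MEMALLOC, so emergency reserves stay off limits. The flag values and gfp_allows_reserves() are illustrative, not the page allocator's real logic:

```c
#include <assert.h>

/* Illustrative flag values -- not the kernel's real gfp bits. */
#define __GFP_MEMALLOC   0x1u
#define __GFP_NOMEMALLOC 0x2u

/* May this allocation dip into emergency reserves? */
static int gfp_allows_reserves(unsigned int flags, int pf_memalloc)
{
	if (flags & __GFP_NOMEMALLOC)
		return 0;	/* inhibitor: never touch reserves */
	return (flags & __GFP_MEMALLOC) || pf_memalloc;
}
```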

> What I was trying to do was just use the memalloc_nofoo API to control
> what was going on and then the driver can just use GFP_KERNEL.  I should
> probably have completed that thought before sending the patches out.

Yes, the effect will be the same, but it really caught my eye since it
was in the same diff. IMHO GFP_NOWAIT would be easier to grasp, but that
is nothing I would dare to insist on.
-- 
Michal Hocko
SUSE Labs




Re: [dm-devel] [PATCH 1/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_noio

2020-06-25 Thread Michal Hocko
On Thu 25-06-20 13:34:18, Matthew Wilcox wrote:
> On Thu, Jun 25, 2020 at 02:22:39PM +0200, Michal Hocko wrote:
> > On Thu 25-06-20 12:31:17, Matthew Wilcox wrote:
> > > We're short on PF_* flags, so make memalloc_noio its own bit where we
> > > have plenty of space.
> > 
I do not mind moving that outside of the PF_* space. Unless I
misremember, all flags in this space were intended to be set only on
current, which rules out any RMW races, and therefore they can be
lockless. I am not sure this holds for the bitfield you are adding this
to. At least in_memstall seems to be set on an external task as well. But
this would require double checking. Maybe that is not really intended or
is just a bug.
> 
> I was going from the comment:
> 
> /* Unserialized, strictly 'current' */
> (which you can't see from the context of the diff, but is above the block)
> 
> The situation with ->flags is a little more ambiguous:
> 
> /*
>  * Only the _current_ task can read/write to tsk->flags, but other
>  * tasks can access tsk->flags in readonly mode for example
>  * with tsk_used_math (like during threaded core dumping).
>  * There is however an exception to this rule during ptrace
>  * or during fork: the ptracer task is allowed to write to the
>  * child->flags of its traced child (same goes for fork, the parent
>  * can write to the child->flags), because we're guaranteed the
>  * child is not running and in turn not changing child->flags
>  * at the same time the parent does it.
>  */

OK, I have obviously missed that.

> but it wasn't unsafe to use the PF_ flags in the way that you were.
> It's just crowded.
> 
> If in_memstall is set on other tasks, then it should be moved to the
> PFA flags, which there are plenty of.
> 
> But a quick grep shows it only being read on other tasks and always
> set on current:
> 
> kernel/sched/psi.c: *flags = current->in_memstall;
> kernel/sched/psi.c:  * in_memstall setting & accounting needs to be 
> atomic wrt
> kernel/sched/psi.c: current->in_memstall = 1;
> kernel/sched/psi.c:  * in_memstall clearing & accounting needs to be 
> atomic wrt
> kernel/sched/psi.c: current->in_memstall = 0;
> kernel/sched/psi.c: if (task->in_memstall)

Have a look at cgroup_move_task. So I believe this is something to be
fixed, but independent of your change.

Feel free to add
Acked-by: Michal Hocko 

> kernel/sched/stats.h:   if (p->in_memstall)
> kernel/sched/stats.h:   if (p->in_memstall)
> kernel/sched/stats.h:   if (unlikely(p->in_iowait || p->in_memstall)) {
> kernel/sched/stats.h:   if (p->in_memstall)
> kernel/sched/stats.h:   if (unlikely(rq->curr->in_memstall))
> 
> so I think everything is fine.

-- 
Michal Hocko
SUSE Labs




Re: [dm-devel] [PATCH 6/6] mm: Add memalloc_nowait

2020-06-25 Thread Michal Hocko
On Thu 25-06-20 12:31:22, Matthew Wilcox wrote:
> Similar to memalloc_noio() and memalloc_nofs(), memalloc_nowait()
> guarantees we will not sleep to reclaim memory.  Use it to simplify
> dm-bufio's allocations.

memalloc_nowait is a good idea! I suspect the primary usecase would be
vmalloc.

> Signed-off-by: Matthew Wilcox (Oracle) 

[...]
> @@ -877,7 +857,9 @@ static struct dm_buffer 
> *__alloc_buffer_wait_no_callback(struct dm_bufio_client
>*/
>   while (1) {
>   if (dm_bufio_cache_size_latch != 1) {
> - b = alloc_buffer(c, GFP_NOWAIT | __GFP_NORETRY | 
> __GFP_NOMEMALLOC | __GFP_NOWARN);
> + unsigned nowait_flag = memalloc_nowait_save();
> + b = alloc_buffer(c, GFP_KERNEL | __GFP_NOMEMALLOC | 
> __GFP_NOWARN);
> + memalloc_nowait_restore(nowait_flag);

This looks confusing though. I am not familiar with alloc_buffer and
there is quite some tweaking around __GFP_NORETRY in alloc_buffer_data
which I do not follow, but GFP_KERNEL just struck my eye. So why can't
we have 
alloc_buffer(GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
-- 
Michal Hocko
SUSE Labs




Re: [dm-devel] [PATCH 2/6] mm: Add become_kswapd and restore_kswapd

2020-06-25 Thread Michal Hocko
On Thu 25-06-20 12:31:18, Matthew Wilcox wrote:
> Since XFS needs to pretend to be kswapd in some of its worker threads,
> create methods to save & restore kswapd state.  Don't bother restoring
> kswapd state in kswapd -- the only time we reach this code is when we're
> exiting and the task_struct is about to be destroyed anyway.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Certainly better than an open-coded PF_$FOO manipulation.
Acked-by: Michal Hocko 

I would just ask for a clarification, because following this relies on
having good MM knowledge

> +/*
> + * Tell the memory management that we're a "memory allocator",

I would go with:
Tell the memory management that the caller is working on behalf of the
background memory reclaim (aka kswapd) and help it to make forward
progress. That means that it will get access to memory reserves
should there be a need to allocate memory in order to make forward
progress. Note that the caller has to be extremely careful when doing
that.

Or something like that.

> + * and that if we need more memory we should get access to it
> + * regardless (see "__alloc_pages()"). "kswapd" should
> + * never get caught in the normal page freeing logic.
> + *
> + * (Kswapd normally doesn't need memory anyway, but sometimes
> + * you need a small amount of memory in order to be able to
> + * page out something else, and this flag essentially protects
> + * us from recursively trying to free more memory as we're
> + * trying to free the first piece of memory in the first place).
> + */
> +#define KSWAPD_PF_FLAGS  (PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD)
> +
> +static inline unsigned long become_kswapd(void)
> +{
> + unsigned long flags = current->flags & KSWAPD_PF_FLAGS;
> + current->flags |= KSWAPD_PF_FLAGS;
> + return flags;
> +}
> +
> +static inline void restore_kswapd(unsigned long flags)
> +{
> + current->flags &= ~(flags ^ KSWAPD_PF_FLAGS);
> +}
> +
>  static inline void set_current_io_flusher(void)
>  {
>   current->flags |= PF_LOCAL_THROTTLE;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b6d84326bdf2..27ae76699899 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3870,19 +3870,7 @@ static int kswapd(void *p)
>   if (!cpumask_empty(cpumask))
>   set_cpus_allowed_ptr(tsk, cpumask);
>  
> - /*
> -  * Tell the memory management that we're a "memory allocator",
> -  * and that if we need more memory we should get access to it
> -  * regardless (see "__alloc_pages()"). "kswapd" should
> -  * never get caught in the normal page freeing logic.
> -  *
> -  * (Kswapd normally doesn't need memory anyway, but sometimes
> -  * you need a small amount of memory in order to be able to
> -  * page out something else, and this flag essentially protects
> -  * us from recursively trying to free more memory as we're
> -  * trying to free the first piece of memory in the first place).
> -  */
> - tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> + become_kswapd();
>   set_freezable();
>  
>   WRITE_ONCE(pgdat->kswapd_order, 0);
> @@ -3932,8 +3920,6 @@ static int kswapd(void *p)
>   goto kswapd_try_sleep;
>   }
>  
> - tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
> -
>   return 0;
>  }
>  
> -- 
> 2.27.0
> 

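The restore_kswapd() bit trick in the quoted patch can be checked in isolation: `flags &= ~(saved ^ KSWAPD_PF_FLAGS)` clears exactly the kswapd bits that were not already set before become_kswapd(), leaving preexisting bits intact. A userspace sketch with illustrative flag values (roundtrip() is a test helper, not kernel code):

```c
#include <assert.h>

/* Illustrative flag values -- not the kernel's real PF_* bits. */
#define PF_MEMALLOC  0x1ul
#define PF_SWAPWRITE 0x2ul
#define PF_KSWAPD    0x4ul
#define PF_OTHER     0x8ul	/* unrelated flag that must survive */
#define KSWAPD_PF_FLAGS (PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD)

static unsigned long task_flags;	/* models current->flags */

static unsigned long become_kswapd(void)
{
	unsigned long saved = task_flags & KSWAPD_PF_FLAGS;

	task_flags |= KSWAPD_PF_FLAGS;
	return saved;
}

static void restore_kswapd(unsigned long saved)
{
	/* saved ^ KSWAPD_PF_FLAGS == the kswapd bits that were clear before */
	task_flags &= ~(saved ^ KSWAPD_PF_FLAGS);
}

/* Set flags, become kswapd, restore; the initial flags must come back. */
static unsigned long roundtrip(unsigned long initial)
{
	unsigned long saved;

	task_flags = initial;
	saved = become_kswapd();
	restore_kswapd(saved);
	return task_flags;
}
```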
-- 
Michal Hocko
SUSE Labs




Re: [dm-devel] [PATCH 1/6] mm: Replace PF_MEMALLOC_NOIO with memalloc_noio

2020-06-25 Thread Michal Hocko
gs & PF_MEMALLOC_NOFS)
>   flags &= ~__GFP_FS;
> @@ -224,8 +223,8 @@ static inline void fs_reclaim_release(gfp_t gfp_mask) { }
>   */
>  static inline unsigned int memalloc_noio_save(void)
>  {
> - unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
> - current->flags |= PF_MEMALLOC_NOIO;
> + unsigned int flags = current->memalloc_noio;
> + current->memalloc_noio = 1;
>   return flags;
>  }
>  
> @@ -239,7 +238,7 @@ static inline unsigned int memalloc_noio_save(void)
>   */
>  static inline void memalloc_noio_restore(unsigned int flags)
>  {
> - current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
> + current->memalloc_noio = flags ? 1 : 0;
>  }
>  
>  /**
> @@ -309,6 +308,23 @@ static inline void memalloc_nocma_restore(unsigned int 
> flags)
>  }
>  #endif
>  
> +static inline void set_current_io_flusher(void)
> +{
> + current->flags |= PF_LOCAL_THROTTLE;
> + current->memalloc_noio = 1;
> +}
> +
> +static inline void clear_current_io_flusher(void)
> +{
> + current->flags &= ~PF_LOCAL_THROTTLE;
> + current->memalloc_noio = 0;
> +}
> +
> +static inline bool get_current_io_flusher(void)
> +{
> + return current->flags & PF_LOCAL_THROTTLE;
> +}
> +
>  #ifdef CONFIG_MEMCG
>  /**
>   * memalloc_use_memcg - Starts the remote memcg charging scope.
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 00a96746e28a..78c90d1e92f4 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -2275,8 +2275,6 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
>   return -EINVAL;
>  }
>  
> -#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
> -
>  SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>   unsigned long, arg4, unsigned long, arg5)
>  {
> @@ -2512,9 +2510,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>   return -EINVAL;
>  
>   if (arg2 == 1)
> - current->flags |= PR_IO_FLUSHER;
> + set_current_io_flusher();
>   else if (!arg2)
> - current->flags &= ~PR_IO_FLUSHER;
> + clear_current_io_flusher();
>   else
>   return -EINVAL;
>   break;
> @@ -2525,7 +2523,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>   if (arg2 || arg3 || arg4 || arg5)
>   return -EINVAL;
>  
> - error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
> + error = get_current_io_flusher();
>   break;
>   default:
>   error = -EINVAL;
> -- 
> 2.27.0
> 

-- 
Michal Hocko
SUSE Labs




Re: [dm-devel] [PATCH] mm: convert totalram_pages, totalhigh_pages and managed_pages to atomic.

2018-10-23 Thread Michal Hocko
On Mon 22-10-18 22:53:22, Arun KS wrote:
> Remove managed_page_count_lock spinlock and instead use atomic
> variables.

I assume this has been auto-generated. If yes, it would be better to
mention the script so that people can review it and regenerate for
comparison. Such a large change is hard to review manually.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-09-04 Thread Michal Hocko
On Tue 04-09-18 13:30:44, Mikulas Patocka wrote:
> 
> 
> On Tue, 4 Sep 2018, Michal Hocko wrote:
> 
> > Regarding other workloads. AFAIR the problem was due to the
> > wait_iff_congested in the direct reclaim. And I've been arguing that
> > special casing __GFP_NORETRY is not a proper way to handle that case.
> > We have PF_LESS_THROTTLE to handle cases where the caller cannot be
> > really throttled because it is a part of the congestion control. I dunno
> > what happened in that regard since then though.
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> So, could you take this patch https://patchwork.kernel.org/patch/10505523/ 
> and commit it to the kernel?
> 
> You agreed with that patch, but you didn't commit it.

Because I do not maintain any tree that Linus pulls from. You need to
involve Andrew Morton to get this to the mm tree and then to Linus.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-09-04 Thread Michal Hocko
On Tue 04-09-18 11:18:44, Mike Snitzer wrote:
> On Tue, Sep 04 2018 at  3:08am -0400,
> Michal Hocko  wrote:
> 
> > On Mon 03-09-18 18:23:17, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Wed, 1 Aug 2018, jing xia wrote:
> > > 
> > > > We reproduced this issue again and found out the root cause.
> > > > dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
> > > > takes a long time to do the soft_limit_reclaim, because of the huge
> > > > amount of memory in excess of the memcg limit.
> > > > Then, all the tasks that do shrink_slab() wait for dm_bufio_lock.
> > > > 
> > > > Any suggestions for this? Thanks.
> > > 
> > > There's hardly any solution because Michal Hocko refuses to change 
> > > __GFP_NORETRY behavior.
> > > 
> > > The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and 
> > > d12067f428c037b4575aaeb2be00847fc214c24a could reduce the lock contention 
> > > on the dm-bufio lock - the patches don't fix the high CPU consumption 
> > > inside the memory allocation, but the kernel code should wait less on the 
> > > bufio lock.
> > 
> > If you actually looked at the bottom line of the problem then you would
> > quickly find out that dm-bufio lock is the least of the problem with the
> > soft limit reclaim. This is a misfeature which has been merged and we
> > have to live with it. All we can do is to discourage people from using
> > it and use the much saner low limit instead.
> > 
> > So please stop this stupid blaming and try to understand the reasoning
> > behind my arguments.
> 
> Yes, this bickering isn't productive.  Michal, your responses are pretty
> hard to follow.  I'm just trying to follow along on what it is you're
> saying should be done.  It isn't clear to me.
> 
> PLEASE, restate what we should be doing differently.  Or what changes
> need to happen outside of DM, etc.

For this particular case I can only recommend to not use the memcg soft
limit. This is guaranteed to stall and there is no way around it because
this is the semantic of the soft limit. Sad, I know.

Regarding other workloads. AFAIR the problem was due to the
wait_iff_congested in the direct reclaim. And I've been arguing that
special casing __GFP_NORETRY is not a proper way to handle that case.
We have PF_LESS_THROTTLE to handle cases where the caller cannot be
really throttled because it is a part of the congestion control. I dunno
what happened in that regard since then though.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-09-04 Thread Michal Hocko
On Mon 03-09-18 18:23:17, Mikulas Patocka wrote:
> 
> 
> On Wed, 1 Aug 2018, jing xia wrote:
> 
> > We reproduced this issue again and found out the root cause.
> > dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
> > takes a long time to do the soft_limit_reclaim, because of the huge
> > amount of memory in excess of the memcg limit.
> > Then, all the tasks that do shrink_slab() wait for dm_bufio_lock.
> > 
> > Any suggestions for this? Thanks.
> 
> There's hardly any solution because Michal Hocko refuses to change 
> __GFP_NORETRY behavior.
> 
> The patches 41c73a49df31151f4ff868f28fe4f129f113fa2c and 
> d12067f428c037b4575aaeb2be00847fc214c24a could reduce the lock contention 
> on the dm-bufio lock - the patches don't fix the high CPU consumption 
> inside the memory allocation, but the kernel code should wait less on the 
> bufio lock.

If you actually looked at the bottom line of the problem then you would
quickly find out that dm-bufio lock is the least of the problem with the
soft limit reclaim. This is a misfeature which has been merged and we
have to live with it. All we can do is to discourage people from using
it and use the much saner low limit instead.

So please stop this stupid blaming and try to understand the reasoning
behind my arguments.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-08-05 Thread Michal Hocko
On Wed 01-08-18 10:48:00, jing xia wrote:
> We reproduced this issue again and found out the root cause.
> dm_bufio_prefetch() with dm_bufio_lock enters the direct reclaim and
> takes a long time to do the soft_limit_reclaim, because of the huge
> amount of memory in excess of the memcg limit.
> Then, all the tasks that do shrink_slab() wait for dm_bufio_lock.
> 
> Any suggestions for this? Thanks.

Do not use soft limit?
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-29 Thread Michal Hocko
On Thu 28-06-18 22:43:29, Mikulas Patocka wrote:
> 
> 
> On Mon, 25 Jun 2018, Michal Hocko wrote:
> 
> > On Mon 25-06-18 10:42:30, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Mon, 25 Jun 2018, Michal Hocko wrote:
> > > 
> > > > > And the throttling in dm-bufio prevents kswapd from making forward 
> > > > > progress, causing this situation...
> > > > 
> > > > Which is what we have PF_LESS_THROTTLE for. Geez, do we have to go in
> > > > circles like that? Are you even listening?
> > > > 
> > > > [...]
> > > > 
> > > > > And so what do you want to do to prevent block drivers from sleeping?
> > > > 
> > > > use the existing means we have.
> > > > -- 
> > > > Michal Hocko
> > > > SUSE Labs
> > > 
> > > So - do you want this patch?
> > > 
> > > There is no behavior difference between changing the allocator (so that 
> > > it 
> > > implies PF_LESS_THROTTLE for block drivers) and changing all the block 
> > > drivers to explicitly set PF_LESS_THROTTLE.
> > 
> > As long as you can reliably detect those users. And using gfp_mask is
> 
> You can detect them if __GFP_IO is not set and __GFP_NORETRY is set. You 
> can grep the kernel for __GFP_NORETRY to find all the users.

It seems that arguing doesn't make much sense here. I will not repeat
myself...

> > about the worst way to achieve that because users tend to be creative
> > when it comes to using gfp mask. PF_LESS_THROTTLE in general is a
> > way to tell the allocator that _you_ are the one to help the reclaim by
> > cleaning data.
> 
> But using PF_LESS_THROTTLE explicitly adds more lines of code than 
> implying PF_LESS_THROTTLE in the allocator.

Yes, and it will also make the code more explicit about the intention,
so it will be easier to maintain long term.

> From: Mikulas Patocka 
> Subject: [PATCH] mm: set PF_LESS_THROTTLE when allocating memory for i/o
> 
> When doing __GFP_NORETRY allocation, the system may sleep in
> wait_iff_congested if there are too many dirty pages. Unfortunately this
> sleeping may slow down kswapd, preventing it from doing writeback and
> resolving the congestion.

This description is misleading at best, if not outright wrong:

	if (!sc->hibernation_mode && !current_is_kswapd() &&
	    current_may_throttle() && pgdat_memcg_congested(pgdat, root))
		wait_iff_congested(BLK_RW_ASYNC, HZ/10);

so this is an explicit throttling of the direct reclaim.

So I would use the following wording instead:
"
It has been noticed that congestion throttling can slow down allocation
paths that participate in the IO and thus help the memory reclaim.
Stalling those allocations is therefore not productive. Moreover, the
mempool allocator and the md variants of the same already implement
their own throttling, which is better because it is feedback driven.
Stalling at the page allocator is therefore even counterproductive.

PF_LESS_THROTTLE is a task flag denoting an allocation context that is
participating in the memory reclaim, which fits this IO-path model, so
use the flag to make the page allocator aware these allocations are
special and do not really want any dirty-data throttling.
"

with a clearer patch description and some data to back it up, you can
add

Acked-by: Michal Hocko  # mempool_alloc and bvec_alloc

I cannot really comment on other md allocators though because I am not
familiar with those. 

> This patch fixes it by setting PF_LESS_THROTTLE when allocating memory for
> block device drivers.
> 
> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org
> 
> ---
>  block/bio.c   |4 
>  drivers/md/dm-bufio.c |   14 +++---
>  drivers/md/dm-crypt.c |8 
>  drivers/md/dm-integrity.c |4 
>  drivers/md/dm-kcopyd.c|3 +++
>  drivers/md/dm-verity-target.c |4 
>  drivers/md/dm-writecache.c|4 
>  mm/mempool.c  |4 
>  8 files changed, 42 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6/mm/mempool.c
> ===
> --- linux-2.6.orig/mm/mempool.c   2018-06-29 03:47:16.29000 +0200
> +++ linux-2.6/mm/mempool.c2018-06-29 03:47:16.27000 +0200
> @@ -369,6 +369,7 @@ void *mempool_alloc(mempool_t *pool, gfp
>   unsigned long flags;
>   wait_queue_entry_t wait;
>   gfp_t gfp_temp;
> + unsigned old_flags;
>  
>   VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
>   might_sleep_if(gfp_mask & __GFP_DI

Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-25 Thread Michal Hocko
On Mon 25-06-18 10:42:30, Mikulas Patocka wrote:
> 
> 
> On Mon, 25 Jun 2018, Michal Hocko wrote:
> 
> > > And the throttling in dm-bufio prevents kswapd from making forward 
> > > progress, causing this situation...
> > 
> > Which is what we have PF_LESS_THROTTLE for. Geez, do we have to go in
> > circles like that? Are you even listening?
> > 
> > [...]
> > 
> > > And so what do you want to do to prevent block drivers from sleeping?
> > 
> > use the existing means we have.
> > -- 
> > Michal Hocko
> > SUSE Labs
> 
> So - do you want this patch?
> 
> There is no behavior difference between changing the allocator (so that it 
> implies PF_LESS_THROTTLE for block drivers) and changing all the block 
> drivers to explicitly set PF_LESS_THROTTLE.

As long as you can reliably detect those users. And using gfp_mask is
about the worst way to achieve that because users tend to be creative
when it comes to using gfp mask. PF_LESS_THROTTLE in general is a
way to tell the allocator that _you_ are the one to help the reclaim by
cleaning data.

> But if you insist that the allocator can't be changed, we have to repeat 
> the same code over and over again in the block drivers.

I am not familiar with the patched code but the mempool change at least
makes sense (bvec_alloc seems to fall back to mempool, which then makes
sense as well), and presumably the same goes for the others in md/ that
do the same thing.

I would just use current_restore_flags rather than open code it.

Thanks!
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-25 Thread Michal Hocko
On Mon 25-06-18 09:53:34, Mikulas Patocka wrote:
> 
> On Mon, 25 Jun 2018, Michal Hocko wrote:
> 
> > On Fri 22-06-18 14:57:10, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > > 
> > > > On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
> > > > > 
> > > > > 
> > > > > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > > > > 
> > > > > > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > > > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > > > > > [...]
> > > > > > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set 
> > > > > > > > (i.e. the 
> > > > > > > > request comes from a block device driver or a filesystem), we 
> > > > > > > > should not 
> > > > > > > > sleep.
> > > > > > > 
> > > > > > > Why? How are you going to audit all the callers that the behavior 
> > > > > > > makes
> > > > > > > sense and moreover how are you going to ensure that future usage 
> > > > > > > will
> > > > > > > still make sense. The more subtle side effects gfp flags have the 
> > > > > > > harder
> > > > > > > they are to maintain.
> > > > > > 
> > > > > > So just as an exercise, try to explain the above semantics to 
> > > > > > users. We
> > > > > > currently have the following.
> > > > > > 
> > > > > >  * __GFP_NORETRY: The VM implementation will try only very 
> > > > > > lightweight
> > > > > >  *   memory direct reclaim to get some memory under memory pressure 
> > > > > > (thus
> > > > > >  *   it can sleep). It will avoid disruptive actions like OOM 
> > > > > > killer. The
> > > > > >  *   caller must handle the failure which is quite likely to happen 
> > > > > > under
> > > > > >  *   heavy memory pressure. The flag is suitable when failure can 
> > > > > > easily be
> > > > > >  *   handled at small cost, such as reduced throughput
> > > > > > 
> > > > > >  * __GFP_FS can call down to the low-level FS. Clearing the flag 
> > > > > > avoids the
> > > > > >  *   allocator recursing into the filesystem which might already be 
> > > > > > holding
> > > > > >  *   locks.
> > > > > > 
> > > > > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? 
> > > > > > What
> > > > > > is the actual semantic without explaining the whole reclaim or force
> > > > > > users to look into the code to understand that? What about GFP_NOIO 
> > > > > > |
> > > > > > __GFP_NORETRY? What does it mean to that "should not sleep". Do all
> > > > > > shrinkers have to follow that as well?
> > > > > 
> > > > > My reasoning was that there is broken code that uses __GFP_NORETRY 
> > > > > and 
> > > > > assumes that it can't fail - so conditioning the change on !__GFP_FS 
> > > > > would 
> > > > > minimize the disruption to the broken code.
> > > > > 
> > > > > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 
> > > > > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with 
> > > > > that.
> > > > 
> > > > As I've already said, this is a subtle change which is really hard to
> > > > reason about. Throttling on congestion has its meaning and reason. Look
> > > > at why we are doing that in the first place. You cannot simply say this
> > > 
> > > So - explain why is throttling needed. You support throttling, I don't, 
> > > so 
> > > you have to explain it :)
> > > 
> > > > is ok based on your specific usecase. We do have means to achieve that.
> > > > It is explicit and thus it will be applied only where it makes sense.
> > > > You keep repeating that implicit behavior change for everybody is
> > > > better.
> > > 
> > > I don't want to change it for everybody. I want to change it for block 
> > > device drivers. I don't care what you do with non-block drivers.
> > 
> > Well, it is usually the onus of the patch submitter to justify any change.
> > But let me be nice to you, for once. This throttling is triggered only
> > if all the pages we have encountered during the reclaim attempt are
> > dirty, which means that we are rushing through the LRU list quicker
> > than the flushers are able to clean. If we didn't throttle we could hit
> > stronger reclaim priorities (aka scan more to reclaim memory) and
> > reclaim more pages as a result.
> 
> And the throttling in dm-bufio prevents kswapd from making forward 
> progress, causing this situation...

Which is what we have PF_LESS_THROTTLE for. Geez, do we have to go in
circles like that? Are you even listening?

[...]

> And so what do you want to do to prevent block drivers from sleeping?

use the existing means we have.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-25 Thread Michal Hocko
On Fri 22-06-18 14:57:10, Mikulas Patocka wrote:
> 
> 
> On Fri, 22 Jun 2018, Michal Hocko wrote:
> 
> > On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Fri, 22 Jun 2018, Michal Hocko wrote:
> > > 
> > > > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > > > [...]
> > > > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set 
> > > > > > (i.e. the 
> > > > > > request comes from a block device driver or a filesystem), we 
> > > > > > should not 
> > > > > > sleep.
> > > > > 
> > > > > Why? How are you going to audit all the callers that the behavior 
> > > > > makes
> > > > > sense and moreover how are you going to ensure that future usage will
> > > > > still make sense. The more subtle side effects gfp flags have the 
> > > > > harder
> > > > > they are to maintain.
> > > > 
> > > > So just as an exercise, try to explain the above semantics to users. We
> > > > currently have the following.
> > > > 
> > > >  * __GFP_NORETRY: The VM implementation will try only very lightweight
> > > >  *   memory direct reclaim to get some memory under memory pressure 
> > > > (thus
> > > >  *   it can sleep). It will avoid disruptive actions like OOM killer. 
> > > > The
> > > >  *   caller must handle the failure which is quite likely to happen 
> > > > under
> > > >  *   heavy memory pressure. The flag is suitable when failure can 
> > > > easily be
> > > >  *   handled at small cost, such as reduced throughput
> > > > 
> > > >  * __GFP_FS can call down to the low-level FS. Clearing the flag avoids 
> > > > the
> > > >  *   allocator recursing into the filesystem which might already be 
> > > > holding
> > > >  *   locks.
> > > > 
> > > > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
> > > > is the actual semantic without explaining the whole reclaim or force
> > > > users to look into the code to understand that? What about GFP_NOIO |
> > > > __GFP_NORETRY? What does it mean to that "should not sleep". Do all
> > > > shrinkers have to follow that as well?
> > > 
> > > My reasoning was that there is broken code that uses __GFP_NORETRY and 
> > > assumes that it can't fail - so conditioning the change on !__GFP_FS 
> > > would 
> > > minimize the disruption to the broken code.
> > > 
> > > Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 
> > > broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that.
> > 
> > As I've already said, this is a subtle change which is really hard to
> > reason about. Throttling on congestion has its meaning and reason. Look
> > at why we are doing that in the first place. You cannot simply say this
> 
> So - explain why is throttling needed. You support throttling, I don't, so 
> you have to explain it :)
> 
> > is ok based on your specific usecase. We do have means to achieve that.
> > It is explicit and thus it will be applied only where it makes sense.
> > You keep repeating that implicit behavior change for everybody is
> > better.
> 
> I don't want to change it for everybody. I want to change it for block 
> device drivers. I don't care what you do with non-block drivers.

Well, it is usually the onus of the patch submitter to justify any change.
But let me be nice to you, for once. This throttling is triggered only
if all the pages we have encountered during the reclaim attempt are
dirty, which means that we are rushing through the LRU list quicker
than the flushers are able to clean. If we didn't throttle we could hit
stronger reclaim priorities (aka scan more to reclaim memory) and
reclaim more pages as a result.

> > I guess we will not agree on that part. I consider it a hack
> > rather than a systematic solution. I can easily imagine that we just
> > find out other call sites that would cause over reclaim or similar
> 
> If a __GFP_NORETRY allocation does overreclaim - it could be fixed by 
> returning NULL instead of doing overreclaim. The specification says that 
> callers must handle failure of __GFP_NORETRY allocations.
> 
> So yes - if you think that just skipping throttling on __GFP_NORETRY could 
> cause excessive CPU consumption trying to reclaim unreclaimable pages or 
> something like that - then you can add more points where the __GFP_NORETRY 
> is failed with NULL to avoid the excessive CPU consumption.

Which is exactly something I do not want to do. Spread __GFP_NORETRY all
over the reclaim code. Especially for something we already have means
for...
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-22 Thread Michal Hocko
On Fri 22-06-18 08:44:52, Mikulas Patocka wrote:
> On Fri, 22 Jun 2018, Michal Hocko wrote:
[...]
> > Why? How are you going to audit all the callers that the behavior makes
> > sense and moreover how are you going to ensure that future usage will
> > still make sense. The more subtle side effects gfp flags have the harder
> > they are to maintain.
> 
> I did audit them - see the previous email in this thread: 
> https://www.redhat.com/archives/dm-devel/2018-June/thread.html

I do not see any mention of throttling expectations for those users.
You have focused only on the allocation failure fallback, AFAIR.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-22 Thread Michal Hocko
On Fri 22-06-18 08:52:09, Mikulas Patocka wrote:
> 
> 
> On Fri, 22 Jun 2018, Michal Hocko wrote:
> 
> > On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> > > On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
> > [...]
> > > > What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. 
> > > > the 
> > > > request comes from a block device driver or a filesystem), we should 
> > > > not 
> > > > sleep.
> > > 
> > > Why? How are you going to audit all the callers that the behavior makes
> > > sense and moreover how are you going to ensure that future usage will
> > > still make sense. The more subtle side effects gfp flags have the harder
> > > they are to maintain.
> > 
> > So just as an exercise, try to explain the above semantics to users. We
> > currently have the following.
> > 
> >  * __GFP_NORETRY: The VM implementation will try only very lightweight
> >  *   memory direct reclaim to get some memory under memory pressure (thus
> >  *   it can sleep). It will avoid disruptive actions like OOM killer. The
> >  *   caller must handle the failure which is quite likely to happen under
> >  *   heavy memory pressure. The flag is suitable when failure can easily be
> >  *   handled at small cost, such as reduced throughput
> > 
> >  * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
> >  *   allocator recursing into the filesystem which might already be holding
> >  *   locks.
> > 
> > So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
> > is the actual semantic without explaining the whole reclaim or force
> > users to look into the code to understand that? What about GFP_NOIO |
> > __GFP_NORETRY? What does it mean to that "should not sleep". Do all
> > shrinkers have to follow that as well?
> 
> My reasoning was that there is broken code that uses __GFP_NORETRY and 
> assumes that it can't fail - so conditioning the change on !__GFP_FS would 
> minimize the disruption to the broken code.
> 
> Anyway - if you want to test only on __GFP_NORETRY (and fix those 16 
> broken cases that assume that __GFP_NORETRY can't fail), I'm OK with that.

As I've already said, this is a subtle change which is really hard to
reason about. Throttling on congestion has its meaning and reason. Look
at why we are doing that in the first place. You cannot simply say this
is ok based on your specific usecase. We do have means to achieve that.
It is explicit and thus it will be applied only where it makes sense.
You keep repeating that implicit behavior change for everybody is
better. I guess we will not agree on that part. I consider it a hack
rather than a systematic solution. I can easily imagine that we just
find out other call sites that would cause over reclaim or similar
problems because they are not throttled on too many dirty pages due to
congested storage. What are we going to do then? Add another magic gfp
flag? Really, try to not add even more subtle side effects for gfp
flags. We _do_ have ways to accomplish what your particular usecase
requires.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-22 Thread Michal Hocko
On Fri 22-06-18 11:01:51, Michal Hocko wrote:
> On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
[...]
> > What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. the 
> > request comes from a block device driver or a filesystem), we should not 
> > sleep.
> 
> Why? How are you going to audit all the callers that the behavior makes
> sense and moreover how are you going to ensure that future usage will
> still make sense. The more subtle side effects gfp flags have the harder
> they are to maintain.

So just as an exercise, try to explain the above semantics to users. We
currently have the following.

 * __GFP_NORETRY: The VM implementation will try only very lightweight
 *   memory direct reclaim to get some memory under memory pressure (thus
 *   it can sleep). It will avoid disruptive actions like OOM killer. The
 *   caller must handle the failure which is quite likely to happen under
 *   heavy memory pressure. The flag is suitable when failure can easily be
 *   handled at small cost, such as reduced throughput

 * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
 *   allocator recursing into the filesystem which might already be holding
 *   locks.

So how are you going to explain gfp & (__GFP_NORETRY | ~__GFP_FS)? What
is the actual semantic without explaining the whole reclaim or force
users to look into the code to understand that? What about GFP_NOIO |
__GFP_NORETRY? What does it mean to that "should not sleep". Do all
shrinkers have to follow that as well?
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-22 Thread Michal Hocko
On Thu 21-06-18 21:17:24, Mikulas Patocka wrote:
[...]
> > But seriously, isn't the best way around the throttling issue to use
> > PF_LESS_THROTTLE?
> 
> Yes - it could be done by setting PF_LESS_THROTTLE. But I think it would 
> be better to change it just in one place than to add PF_LESS_THROTTLE to 
> every block device driver (because adding it to every block driver results 
> in more code).

Why would every block device need this? I thought we were talking about
mempool allocator and the md variant of it. They are explicitly doing
their own back off so PF_LESS_THROTTLE sounds appropriate to me.

> What about this patch? If __GFP_NORETRY and __GFP_FS is not set (i.e. the 
> request comes from a block device driver or a filesystem), we should not 
> sleep.

Why? How are you going to audit all the callers that the behavior makes
sense and moreover how are you going to ensure that future usage will
still make sense. The more subtle side effects gfp flags have the harder
they are to maintain.

> Signed-off-by: Mikulas Patocka 
> Cc: sta...@vger.kernel.org
> 
> Index: linux-2.6/mm/vmscan.c
> ===
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -2674,6 +2674,7 @@ static bool shrink_node(pg_data_t *pgdat
>* the LRU too quickly.
>*/
>   if (!sc->hibernation_mode && !current_is_kswapd() &&
> + (sc->gfp_mask & (__GFP_NORETRY | __GFP_FS)) != __GFP_NORETRY &&
>  current_may_throttle() && pgdat_memcg_congested(pgdat, root))
>   wait_iff_congested(BLK_RW_ASYNC, HZ/10);
>  

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-19 Thread Michal Hocko
On Mon 18-06-18 18:11:26, Mikulas Patocka wrote:
[...]
> I grepped the kernel for __GFP_NORETRY and triaged them. I found 16 cases 
> without a fallback - those are bugs that make various functions randomly 
> return -ENOMEM.

Well, maybe those are just optimistic attempts to allocate memory and
have a fallback somewhere else. So I would be careful calling them
outright bugs. But maybe you are right.

> Most of the callers provide callback.
> 
> There is another strange flag - __GFP_RETRY_MAYFAIL - it provides two 
> different functions - if the allocation is larger than 
> PAGE_ALLOC_COSTLY_ORDER, it retries the allocation as if it were smaller. 
> If the allocations is smaller than PAGE_ALLOC_COSTLY_ORDER, 
> __GFP_RETRY_MAYFAIL will avoid the oom killer (larger order allocations 
> don't trigger the oom killer at all).

Well, the primary purpose of this flag is to provide a consistent
failure behavior for all requests regardless of the size.

> So, perhaps __GFP_RETRY_MAYFAIL could be used instead of __GFP_NORETRY in 
> the cases where the caller wants to avoid trigerring the oom killer (the 
> problem is that __GFP_NORETRY causes random failure even in no-oom 
> situations but __GFP_RETRY_MAYFAIL doesn't).

Maybe yes.

> So my suggestion is - fix these obvious bugs when someone allocates memory 
> with __GFP_NORETRY without any fallback - and then, __GFP_NORETRY could be 
> just changed to return NULL instead of sleeping.

No real objection to fixing wrong __GFP_NORETRY usage. But __GFP_NORETRY
can sleep, and nothing will really change in that regard: it still does a
round of reclaim and that _might_ sleep.

But seriously, isn't the best way around the throttling issue to use
PF_LESS_THROTTLE?
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-15 Thread Michal Hocko
On Fri 15-06-18 08:47:52, Mikulas Patocka wrote:
> 
> 
> On Fri, 15 Jun 2018, Michal Hocko wrote:
> 
> > On Fri 15-06-18 07:35:07, Mikulas Patocka wrote:
> > > 
> > > Because mempool uses it. Mempool uses allocations with "GFP_NOIO | 
> > > __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". And so dm-bufio uses 
> > > these flags too. dm-bufio is just a big mempool.
> > 
> > This doesn't answer my question though. Somebody else is doing it is not
> > an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO
> > allocation AFAICS. So why do you really need it now? Why cannot you
> 
> dm-bufio always used "GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | 
> __GFP_NOWARN" since the kernel 3.2 when it was introduced.
> 
> In the kernel 4.10, dm-bufio was changed so that it does GFP_NOWAIT 
> allocation, then drops the lock and does GFP_NOIO with the dropped lock 
> (because someone was likely experiencing the same issue that is reported 
> in this thread) - there are two commits that change it - 9ea61cac0 and 
> 41c73a49df31.

OK, I see. Is there any fundamental reason why this path has to do one
round of GFP_NOIO, or could it keep NOWAIT, drop the lock, sleep and
retry again?

[...]
> > is the same class of problem, honestly, I dunno. And I've already said
> > that stalling __GFP_NORETRY might be a good way around that but that
> > needs much more consideration and existing users examination. I am not
> > aware anybody has done that. Doing changes like that based on a single
> > user is certainly risky.
> 
> Why don't you set any rules how these flags should be used?

It is really hard to change rules during the game. You basically have to
examine all existing users and that is well beyond my time budget. I've
tried that where it was possible - e.g. I took __GFP_REPEAT and gave it a
well defined semantic. __GFP_NORETRY is a bigger beast.

Anyway, I believe that it would be much safer to look at the problem
from a high-level perspective. You seem to be focused on __GFP_NORETRY a
little bit too much IMHO. We are not throttling callers which explicitly
do not want to or cannot be throttled - see current_may_throttle. Is it possible that
both your md and mempool allocators can either (ab)use PF_LESS_THROTTLE
or use other means? E.g. do you have backing_dev_info at that time?
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-15 Thread Michal Hocko
On Fri 15-06-18 07:35:07, Mikulas Patocka wrote:
> 
> 
> On Fri, 15 Jun 2018, Michal Hocko wrote:
> 
> > On Thu 14-06-18 14:34:06, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Thu, 14 Jun 2018, Michal Hocko wrote:
> > > 
> > > > On Thu 14-06-18 15:18:58, jing xia wrote:
> > > > [...]
> > > > > PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
> > > > >  #0 [ffc0282af3d0] __switch_to at ff8008085e48
> > > > >  #1 [ffc0282af3f0] __schedule at ff8008850cc8
> > > > >  #2 [ffc0282af450] schedule at ff8008850f4c
> > > > >  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
> > > > >  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
> > > > >  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
> > > > 
> > > > This trace doesn't provide the full picture unfortunately. Waiting in
> > > > the direct reclaim means that the underlying bdi is congested. The real
> > > > question is why it doesn't flush IO in time.
> > > 
> > > I pointed this out two years ago and you just refused to fix it:
> > > http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html
> > 
> > Let me be evil again and let me quote the old discussion:
> > : > I agree that mempool_alloc should _primarily_ sleep on their own
> > : > throttling mechanism. I am not questioning that. I am just saying that
> > : > the page allocator has its own throttling which it relies on and that
> > : > cannot be just ignored because that might have other undesirable side
> > : > effects. So if the right approach is really to never throttle certain
> > : > requests then we have to bail out from a congested nodes/zones as soon
> > : > as the congestion is detected.
> > : >
> > : > Now, I would like to see that something like that is _really_ necessary.
> > :
> > : Currently, it is not a problem - device mapper reports the device as
> > : congested only if the underlying physical disks are congested.
> > :
> > : But once we change it so that device mapper reports congested state on its
> > : own (when it has too many bios in progress), this starts being a problem.
> > 
> > So has this changed since then? If yes then we can think of a proper
> > solution but that would require to actually describe why we see the
> > congestion, why it does help to wait on the caller rather than the
> > allocator etc...
> 
> Device mapper doesn't report congested state - but something else does 
> (perhaps the user inserted a cheap slow usb stick or sdcard?). And device 
> mapper is just a victim of that.

Maybe yes and that would require some more debugging to find out,
analyze and think of a proper solution.

> Why should device mapper sleep because some other random block device is 
> congested?

Well, the direct reclaim is a way to throttle memory allocations. There
is no real concept of who is asking for the memory. If you do not want
to get blocked then use GFP_NOWAIT.

> > Throwing statements like ...
> > 
> > > I'm sure you'll come up with another creative excuse why GFP_NORETRY 
> > > allocations need incur deliberate 100ms delays in block device drivers.
> > 
> > ... is not really productive. I've tried to explain why I am not _sure_ what
> > possible side effects such a change might have and your hand waving
> > didn't really convince me. MD is not the only user of the page
> > allocator...
> > 
> > E.g. why has 41c73a49df31 ("dm bufio: drop the lock when doing GFP_NOIO
> > allocation") even added GFP_NOIO request in the first place when you
> > keep retrying and sleep yourself?
> 
> Because mempool uses it. Mempool uses allocations with "GFP_NOIO | 
> __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN". And so dm-bufio uses 
> these flags too. dm-bufio is just a big mempool.

This doesn't answer my question though. Somebody else is doing it is not
an explanation. Prior to your 41c73a49df31 there was no GFP_NOIO
allocation AFAICS. So why do you really need it now? Why cannot you
simply keep retrying GFP_NOWAIT with your own throttling?
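The NOWAIT-plus-own-throttling pattern suggested here could look roughly
like this (an illustrative kernel-style sketch; the structure name, the
helper name and the 10ms backoff are invented, not dm-bufio's actual
code):

```c
struct my_cache {
	struct mutex lock;
	/* ... */
};

/*
 * Opportunistic allocation under the lock; on failure, drop the lock
 * and throttle on the subsystem's own terms instead of entering a
 * blocking GFP_NOIO allocation inside the page allocator.
 */
static void *alloc_nowait_throttled(struct my_cache *c, size_t size)
{
	void *p;

	for (;;) {
		p = kmalloc(size, GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
		if (p)
			return p;
		mutex_unlock(&c->lock);
		/* subsystem-private throttling, not direct reclaim */
		schedule_timeout_uninterruptible(HZ / 100);
		mutex_lock(&c->lock);
	}
}
```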

Note that I am not trying to say that 41c73a49df31 is wrong; I am merely
trying to understand why this blocking allocation is done in the first place.
 
> If you argue that these flags are incorrect - then fix mempool_alloc.

AFAICS there is no report about mempool_alloc stalling here. Maybe this
is the same class of problem, honestly, I dunno. And I've already said
that stalling __GFP_NORETRY might be a good way around that but that
needs much more consideration and existing users examination. I am not
aware anybody has done that. Doing changes like that based on a single
user is certainly risky.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-15 Thread Michal Hocko
On Thu 14-06-18 14:34:06, Mikulas Patocka wrote:
> 
> 
> On Thu, 14 Jun 2018, Michal Hocko wrote:
> 
> > On Thu 14-06-18 15:18:58, jing xia wrote:
> > [...]
> > > PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
> > >  #0 [ffc0282af3d0] __switch_to at ff8008085e48
> > >  #1 [ffc0282af3f0] __schedule at ff8008850cc8
> > >  #2 [ffc0282af450] schedule at ff8008850f4c
> > >  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
> > >  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
> > >  #5 [ffc0282af530] wait_iff_congested at ff8008181b40
> > 
> > This trace doesn't provide the full picture unfortunately. Waiting in
> > the direct reclaim means that the underlying bdi is congested. The real
> > question is why it doesn't flush IO in time.
> 
> I pointed this out two years ago and you just refused to fix it:
> http://lkml.iu.edu/hypermail/linux/kernel/1608.1/04507.html

Let me be evil again and let me quote the old discussion:
: > I agree that mempool_alloc should _primarily_ sleep on their own
: > throttling mechanism. I am not questioning that. I am just saying that
: > the page allocator has its own throttling which it relies on and that
: > cannot be just ignored because that might have other undesirable side
: > effects. So if the right approach is really to never throttle certain
: > requests then we have to bail out from a congested nodes/zones as soon
: > as the congestion is detected.
: >
: > Now, I would like to see that something like that is _really_ necessary.
:
: Currently, it is not a problem - device mapper reports the device as
: congested only if the underlying physical disks are congested.
:
: But once we change it so that device mapper reports congested state on its
: own (when it has too many bios in progress), this starts being a problem.

So has this changed since then? If yes then we can think of a proper
solution but that would require to actually describe why we see the
congestion, why it does help to wait on the caller rather than the
allocator etc...

Throwing statements like ...

> I'm sure you'll come up with another creative excuse why GFP_NORETRY 
> allocations need incur deliberate 100ms delays in block device drivers.

... is not really productive. I've tried to explain why I am not _sure_ what
possible side effects such a change might have and your hand waving
didn't really convince me. MD is not the only user of the page
allocator...

E.g. why has 41c73a49df31 ("dm bufio: drop the lock when doing GFP_NOIO
allocation") even added GFP_NOIO request in the first place when you
keep retrying and sleep yourself? The changelog only describes what but
doesn't explain why. Or did I misread the code and this is not the
allocation which is stalling due to congestion?
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] dm bufio: Reduce dm_bufio_lock contention

2018-06-14 Thread Michal Hocko
On Thu 14-06-18 15:18:58, jing xia wrote:
[...]
> PID: 22920  TASK: ffc0120f1a00  CPU: 1   COMMAND: "kworker/u8:2"
>  #0 [ffc0282af3d0] __switch_to at ff8008085e48
>  #1 [ffc0282af3f0] __schedule at ff8008850cc8
>  #2 [ffc0282af450] schedule at ff8008850f4c
>  #3 [ffc0282af470] schedule_timeout at ff8008853a0c
>  #4 [ffc0282af520] schedule_timeout_uninterruptible at ff8008853aa8
>  #5 [ffc0282af530] wait_iff_congested at ff8008181b40

This trace doesn't provide the full picture unfortunately. Waiting in
the direct reclaim means that the underlying bdi is congested. The real
question is why it doesn't flush IO in time.

>  #6 [ffc0282af5b0] shrink_inactive_list at ff8008177c80
>  #7 [ffc0282af680] shrink_lruvec at ff8008178510
>  #8 [ffc0282af790] mem_cgroup_shrink_node_zone at ff80081793bc
>  #9 [ffc0282af840] mem_cgroup_soft_limit_reclaim at ff80081b6040
> #10 [ffc0282af8f0] do_try_to_free_pages at ff8008178b6c
> #11 [ffc0282af990] try_to_free_pages at ff8008178f3c
> #12 [ffc0282afa30] __perform_reclaim at ff8008169130
> #13 [ffc0282afab0] __alloc_pages_nodemask at ff800816c9b8
> #14 [ffc0282afbd0] __get_free_pages at ff800816cd6c
> #15 [ffc0282afbe0] alloc_buffer at ff8008591a94
> #16 [ffc0282afc20] __bufio_new at ff8008592e94
> #17 [ffc0282afc70] dm_bufio_prefetch at ff8008593198
> #18 [ffc0282afd20] verity_prefetch_io at ff8008598384
> #19 [ffc0282afd70] process_one_work at ff80080b5b3c
> #20 [ffc0282afdc0] worker_thread at ff80080b64fc
> #21 [ffc0282afe20] kthread at ff80080bae34
> 
> > Mikulas

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-27 Thread Michal Hocko
On Thu 26-04-18 18:52:05, Mikulas Patocka wrote:
> 
> 
> On Fri, 27 Apr 2018, Michael S. Tsirkin wrote:
[...]
> >But assuming it's important to control this kind of
> >fault injection to be controlled from
> >a dedicated menuconfig option, why not the rest of
> >faults?
> 
> The injected faults cause damage to the user, so there's no point to 
> enable them by default. vmalloc fallback should not cause any damage 
> (assuming that the code is correctly written).

But you want to find those bugs which would BUG_ON more easily, so there
is a risk of harm IIUC, and this is not much different from other fault
injection paths.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options

2018-04-26 Thread Michal Hocko
On Wed 25-04-18 18:42:57, Mikulas Patocka wrote:
> 
> 
> On Wed, 25 Apr 2018, James Bottomley wrote:
[...]
> > Kconfig proliferation, conversely, is a bit of a nightmare from both
> > the user and the tester's point of view, so we're trying to avoid it
> > unless absolutely necessary.
> > 
> > James
> 
> I already offered that we don't need to introduce a new kernel option and 
> we can bind this feature to any other kernel option, that is enabled in 
> the debug kernel, for example CONFIG_DEBUG_SG. Michal said no and he said 
> that he wants a new kernel option instead.

Just for the record. I didn't say I _want_ a config option. Do not
misinterpret my words. I've said that a config option would be
acceptable if there is no way to deliver the functionality via kernel
package automatically. You haven't provided any argument that would
explain why the kernel package cannot add a boot option. Maybe there are
some but I do not see them right now.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG

2018-04-24 Thread Michal Hocko
On Tue 24-04-18 13:28:49, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Tue 24-04-18 13:00:11, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > 
> > > > On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> > > > > 
> > > > > 
> > > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > > > 
> > > > > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > > > > [...]
> > > > > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > > > >*/
> > > > > > >   WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > > > > >  
> > > > > > > +#ifdef CONFIG_DEBUG_SG
> > > > > > > + /* Catch bugs when the caller uses DMA API on the result of 
> > > > > > > kvmalloc. */
> > > > > > > + if (!(prandom_u32_max(2) & 1))
> > > > > > > + goto do_vmalloc;
> > > > > > > +#endif
> > > > > > 
> > > > > > I really do not think there is anything DEBUG_SG specific here. Why 
> > > > > > you
> > > > > > simply do not follow should_failslab path or even reuse the 
> > > > > > function?
> > > > > 
> > > > > CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernel 
> > > > > (if 
> > > > > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled 
> > > > > there).
> > > > 
> > > > Are you telling me that you are shaping a debugging functionality basing
> > > > on what RHEL has enabled? And you call me evil. This is just rediculous.
> > > > 
> > > > > Fail-injection framework is if off by default and it must be 
> > > > > explicitly 
> > > > > enabled and configured by the user - and most users won't enable it.
> > > > 
> > > > It can be enabled easily. And if you care enough for your debugging
> > > > kernel then just make it enabled unconditionally.
> > > 
> > > So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not 
> > > quite sure if 3 lines of debugging code need an extra option, but if you 
> > > don't want to reuse any existing debug option, it may be possible. Adding 
> > > it to the RHEL debug kernel would be trivial.
> > 
> > Wouldn't it be equally trivial to simply enable the fault injection? You
> > would get additional failure paths testing as a bonus.
> 
> The RHEL and Fedora debugging kernels are compiled with fault injection. 
> But the fault-injection framework will do nothing unless it is enabled by 
> a kernel parameter or debugfs write.
> 
> Most users don't know about the fault injection kernel parameters or 
> debugfs files and won't enabled it. We need a CONFIG_ option to enable it 
> by default in the debugging kernels (and we could add a kernel parameter 
> to override the default, fine-tune the fallback probability etc.)

If it is a real issue to install the debugging kernel with the required
kernel parameter, then a config option to enable the fallback by default
makes sense to me.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG

2018-04-24 Thread Michal Hocko
On Tue 24-04-18 13:00:11, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> > > 
> > > 
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > 
> > > > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > > > [...]
> > > > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > > > >*/
> > > > >   WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > > > >  
> > > > > +#ifdef CONFIG_DEBUG_SG
> > > > > + /* Catch bugs when the caller uses DMA API on the result of 
> > > > > kvmalloc. */
> > > > > + if (!(prandom_u32_max(2) & 1))
> > > > > + goto do_vmalloc;
> > > > > +#endif
> > > > 
> > > > I really do not think there is anything DEBUG_SG specific here. Why you
> > > > simply do not follow should_failslab path or even reuse the function?
> > > 
> > > CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernel (if 
> > > you don't like CONFIG_DEBUG_SG, pick any other option that is enabled 
> > > there).
> > 
> > Are you telling me that you are shaping a debugging functionality basing
> > on what RHEL has enabled? And you call me evil. This is just rediculous.
> > 
> > > Fail-injection framework is if off by default and it must be explicitly 
> > > enabled and configured by the user - and most users won't enable it.
> > 
> > It can be enabled easily. And if you care enough for your debugging
> > kernel then just make it enabled unconditionally.
> 
> So, should we add a new option CONFIG_KVMALLOC_FALLBACK_DEFAULT? I'm not 
> quite sure if 3 lines of debugging code need an extra option, but if you 
> don't want to reuse any existing debug option, it may be possible. Adding 
> it to the RHEL debug kernel would be trivial.

Wouldn't it be equally trivial to simply enable the fault injection? You
would get additional failure paths testing as a bonus.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG

2018-04-24 Thread Michal Hocko
On Tue 24-04-18 11:50:30, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
> > [...]
> > > @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
> > >*/
> > >   WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
> > >  
> > > +#ifdef CONFIG_DEBUG_SG
> > > + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> > > + if (!(prandom_u32_max(2) & 1))
> > > + goto do_vmalloc;
> > > +#endif
> > 
> > I really do not think there is anything DEBUG_SG specific here. Why you
> > simply do not follow should_failslab path or even reuse the function?
> 
> CONFIG_DEBUG_SG is enabled by default in RHEL and Fedora debug kernel (if 
> you don't like CONFIG_DEBUG_SG, pick any other option that is enabled 
> there).

Are you telling me that you are shaping a debugging functionality based
on what RHEL has enabled? And you call me evil. This is just ridiculous.

> Fail-injection framework is off by default and it must be explicitly 
> enabled and configured by the user - and most users won't enable it.

It can be enabled easily. And if you care enough for your debugging
kernel then just make it enabled unconditionally.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-24 Thread Michal Hocko
On Tue 24-04-18 10:12:42, Michal Hocko wrote:
> On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> > 
> > 
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > 
> > > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > > 
> > > > Fixing __vmalloc code 
> > > > is easy and it doesn't require cooperation with maintainers.
> > > 
> > > But it is a hack against the intention of the scope api.
> > 
> > It is not!
> 
> This discussion simply doesn't make much sense it seems. The scope API
> is to document the scope of the reclaim recursion critical section. That
> certainly is not a utility function like vmalloc.

http://lkml.kernel.org/r/20180424162712.gl17...@dhcp22.suse.cz

let's see how it rolls this time.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-24 Thread Michal Hocko
On Tue 24-04-18 11:30:40, Mikulas Patocka wrote:
> 
> 
> On Tue, 24 Apr 2018, Michal Hocko wrote:
> 
> > On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> > 
> > > Fixing __vmalloc code 
> > > is easy and it doesn't require cooperation with maintainers.
> > 
> > But it is a hack against the intention of the scope api.
> 
> It is not!

This discussion simply doesn't make much sense it seems. The scope API
is to document the scope of the reclaim recursion critical section. That
certainly is not a utility function like vmalloc.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-24 Thread Michal Hocko
On Mon 23-04-18 20:25:15, Mikulas Patocka wrote:
> 
> 
> On Mon, 23 Apr 2018, Michal Hocko wrote:
> 
> > On Mon 23-04-18 10:06:08, Mikulas Patocka wrote:
> > 
> > > > > He didn't want to fix vmalloc(GFP_NOIO)
> > > > 
> > > > I don't remember that conversation, so I don't know whether I agree with
> > > > his reasoning or not.  But we are supposed to be moving away from 
> > > > GFP_NOIO
> > > > towards marking regions with memalloc_noio_save() / restore.  If you do
> > > > that, you won't need vmalloc(GFP_NOIO).
> > > 
> > > He said the same thing a year ago. And there was small progress. 6 out of 
> > > 27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in 
> > > infiniband and 1 in btrfs. (the whole discussion is here 
> > > http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html )
> > 
> > Well this is not that easy. It requires a cooperation from maintainers.
> > I can only do as much. I've posted patches in the past and actively
> > bringing up this topic at LSFMM last two years...
> 
> You're right - but you have chosen the uneasy path.

Yes.

> Fixing __vmalloc code 
> is easy and it doesn't require cooperation with maintainers.

But it is a hack against the intention of the scope api. It also allows
maintainers to not care about their broken code.

> > > He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 
> > > 4 
> > > years, the kernel will be refactored and GFP_NOIO will be eliminated. Why 
> > > does he have veto over this part of the code? I'd much rather argue with 
> > > people who have constructive comments about fixing bugs than with him.
> > 
> > I didn't NACK the patch AFAIR. I've said it is not a good idea longterm.
> > I would be much more willing to change my mind if you would back your
> > patch by a real bug report. Hacks are acceptable when we have a real
> > issue in hands. But if we want to fix potential issue then better make
> > it properly.
> 
> Developers should fix bugs in advance, not wait until a crash happens, 
> is analyzed and reported.

I agree. But are those existing users broken in the first place? I have
seen so many GFP_NOFS abuses that I would dare to guess that most of
those vmalloc NOFS abusers can be simply turned into GFP_KERNEL. Maybe
that is the reason we haven't heard any complaints in years.

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG

2018-04-24 Thread Michal Hocko
On Mon 23-04-18 20:06:16, Mikulas Patocka wrote:
[...]
> @@ -404,6 +405,12 @@ void *kvmalloc_node(size_t size, gfp_t f
>*/
>   WARN_ON_ONCE((flags & GFP_KERNEL) != GFP_KERNEL);
>  
> +#ifdef CONFIG_DEBUG_SG
> + /* Catch bugs when the caller uses DMA API on the result of kvmalloc. */
> + if (!(prandom_u32_max(2) & 1))
> + goto do_vmalloc;
> +#endif

I really do not think there is anything DEBUG_SG specific here. Why you
simply do not follow should_failslab path or even reuse the function?

> +
>   /*
>* We want to attempt a large physically contiguous block first because
>* it is less likely to fragment multiple larger blocks and therefore
> @@ -427,6 +434,9 @@ void *kvmalloc_node(size_t size, gfp_t f
>   if (ret || size <= PAGE_SIZE)
>   return ret;
>  
> +#ifdef CONFIG_DEBUG_SG
> +do_vmalloc:
> +#endif
>   return __vmalloc_node_flags_caller(size, node, flags,
>   __builtin_return_address(0));
>  }

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-23 Thread Michal Hocko
On Mon 23-04-18 10:06:08, Mikulas Patocka wrote:
> 
> 
> On Sat, 21 Apr 2018, Matthew Wilcox wrote:
> 
> > On Fri, Apr 20, 2018 at 05:21:26PM -0400, Mikulas Patocka wrote:
> > > On Fri, 20 Apr 2018, Matthew Wilcox wrote:
> > > > On Fri, Apr 20, 2018 at 04:54:53PM -0400, Mikulas Patocka wrote:
> > > > > On Fri, 20 Apr 2018, Michal Hocko wrote:
> > > > > > No way. This is just wrong! First of all, you will explode most 
> > > > > > likely
> > > > > > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends 
> > > > > > to be
> > > > > > enabled quite often.
> > > > > 
> > > > > You're an evil person who doesn't want to fix bugs.
> > > > 
> > > > Steady on.  There's no need for that.  Michal isn't evil.  Please
> > > > apologise.
> > > 
> > > I see this attitude from Michal again and again.
> > 
> > Fine; then *say that*.  I also see Michal saying "No" a lot.  Sometimes
> > I agree with him, sometimes I don't.  I think he genuinely wants the best
> > code in the kernel, and saying "No" is part of it.
> > 
> > > He didn't want to fix vmalloc(GFP_NOIO)
> > 
> > I don't remember that conversation, so I don't know whether I agree with
> > his reasoning or not.  But we are supposed to be moving away from GFP_NOIO
> > towards marking regions with memalloc_noio_save() / restore.  If you do
> > that, you won't need vmalloc(GFP_NOIO).
> 
> He said the same thing a year ago. And there was small progress. 6 out of 
> 27 __vmalloc calls were converted to memalloc_noio_save in a year - 5 in 
> infiniband and 1 in btrfs. (the whole discussion is here 
> http://lkml.iu.edu/hypermail/linux/kernel/1706.3/04681.html )

Well this is not that easy. It requires a cooperation from maintainers.
I can only do as much. I've posted patches in the past and actively
bringing up this topic at LSFMM last two years...

> He refuses 15-line patch to fix GFP_NOIO bug because he believes that in 4 
> years, the kernel will be refactored and GFP_NOIO will be eliminated. Why 
> does he have veto over this part of the code? I'd much rather argue with 
> people who have constructive comments about fixing bugs than with him.

I didn't NACK the patch AFAIR. I've said it is not a good idea longterm.
I would be much more willing to change my mind if you would back your
patch by a real bug report. Hacks are acceptable when we have a real
issue in hands. But if we want to fix potential issue then better make
it properly.

[...]

> I sent the CONFIG_DEBUG_SG patch before (I wonder why he didn't repond to 
> it). I'll send a third version of the patch that actually randomly chooses 
> between kmalloc and vmalloc, because some abuses can only be detected with 
> kmalloc and we should test both.
> 
> For bisecting, it is better to always fallback to vmalloc, but for general 
> testing, it is better to test both branches.

Agreed!

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-23 Thread Michal Hocko
On Mon 23-04-18 10:24:02, Mikulas Patocka wrote:
[...]

I am not going to comment on your continuous accusations. We can discuss
patches but general rants make very limited sense.
 
> > > > I already said that we can change it from CONFIG_DEBUG_VM to 
> > > > CONFIG_DEBUG_SG - or to whatever other option you may want, just to 
> > > > make 
> > > > sure that it is enabled in distro debug kernels by default.
> > > 
> > > Yes, and I think that's the right idea.  So send a v2 and ignore the
> > > replies that are clearly relating to an earlier version of the patch.
> > > Not everybody reads every mail in the thread before responding to one they
> > > find interesting.  Yes, ideally, one would, but sometimes one doesn't.
> > 
> > And look here. This is yet another ad-hoc idea. We have many users of
> > kvmalloc which have no relation to SG, yet you are going to control
> > their behavior by CONFIG_DEBUG_SG? No way! (yeah evil again)
> 
> Why aren't you constructive and pick up the CONFIG flag?

Because a config option doesn't make that much sense. You do not want a
permanent vmalloc fallback unconditionally. A debugging option which
changes the behavior completely is not useful IMNHO. Besides that, who is
going to enable it?

> > Really, we have a fault injection framework and this sounds like
> > something to hook in there.
> 
> The testing people won't set it up. They install the "kernel-debug" 
> package and run the tests in it.
> 
> If you introduce a hidden option that no one knows about, no one will use 
> it.

then make sure people know about it. Fuzzers already do test fault
injections.
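
For reference, enabling the existing failslab fault injection does not
require any new code; on a kernel built with CONFIG_FAILSLAB it is a
matter of a few debugfs writes or a boot parameter (paths and the boot
option format per Documentation/fault-injection/fault-injection.rst;
the 10% probability below is just an example value):

```sh
# Runtime configuration via debugfs (requires CONFIG_FAULT_INJECTION_DEBUG_FS):
echo 1  > /sys/kernel/debug/failslab/interval
echo 10 > /sys/kernel/debug/failslab/probability   # fail ~10% of slab allocations
echo -1 > /sys/kernel/debug/failslab/times         # no cap on the number of failures

# Or at boot time:
#   failslab=<interval>,<probability>,<space>,<times>
```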
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-22 Thread Michal Hocko
On Sat 21-04-18 07:47:57, Matthew Wilcox wrote:
> On Fri, Apr 20, 2018 at 05:21:26PM -0400, Mikulas Patocka wrote:
> > On Fri, 20 Apr 2018, Matthew Wilcox wrote:
> > > On Fri, Apr 20, 2018 at 04:54:53PM -0400, Mikulas Patocka wrote:
> > > > On Fri, 20 Apr 2018, Michal Hocko wrote:
> > > > > No way. This is just wrong! First of all, you will explode most likely
> > > > > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > > > > enabled quite often.
> > > > 
> > > > You're an evil person who doesn't want to fix bugs.
> > > 
> > > Steady on.  There's no need for that.  Michal isn't evil.  Please
> > > apologise.
> > 
> > I see this attitude from Michal again and again.
> 
> Fine; then *say that*.  I also see Michal saying "No" a lot.  Sometimes
> I agree with him, sometimes I don't.  I think he genuinely wants the best
> code in the kernel, and saying "No" is part of it.

Exactly! We have a lot of legacy heritage and random semantics because
we were too eager to merge stuff. I think it is time to stop that and
think twice before merging something. If you call that evil, then I am
fine with being evil.

> > He didn't want to fix vmalloc(GFP_NOIO)
> 
> I don't remember that conversation, so I don't know whether I agree with
> his reasoning or not.  But we are supposed to be moving away from GFP_NOIO
> towards marking regions with memalloc_noio_save() / restore.  If you do
> that, you won't need vmalloc(GFP_NOIO).

It was basically to detect GFP_NOIO context _inside_ vmalloc and use the
scope API to enforce it there. Does it solve potential problems? Yes, it
does. Does it solve any existing report? No, I am not aware of any. Is
it a good fix long term? Absolutely not, because the scope API should be
used _at the place_ where the scope starts rather than in a random
utility function. If we go the easier way now, we will never teach users
to use the API properly. And I am willing to risk keeping the broken
code we have had for years rather than allow a random hack that only
seemingly fixes it.

Btw. I was pretty much explicit with this reasoning when rejecting the
patch. Do you still call that evil?

> > he didn't want to fix alloc_pages sleeping when __GFP_NORETRY is used.
> 
> The GFP flags are a mess, still.

I do not remember that one, but __GFP_NORETRY is _allowed_ to sleep. And
yes, I do _agree_ gfp flags are a mess that is really hard to fix
because they have lacked a good design from the very beginning. Fixing
some of those issues today is a complete PITA.

> > So what should I say? Fix them and 
> > you won't be evil :-)
> 
> No, you should reserve calling somebody evil for truly evil things.
> 
> > (he could also fix the oom killer, so that it is triggered when 
> > free_memory+cache+free_swap goes beyond a threshold and not when you loop 
> > too long in the allocator)
> 
> ... that also doesn't make somebody evil.

And again, it is much easier to claim that something will get fixed
than to deal with a reality that is much more complicated. I've tried
to explain those risks as well.

> > I already said that we can change it from CONFIG_DEBUG_VM to 
> > CONFIG_DEBUG_SG - or to whatever other option you may want, just to make 
> > sure that it is enabled in distro debug kernels by default.
> 
> Yes, and I think that's the right idea.  So send a v2 and ignore the
> replies that are clearly relating to an earlier version of the patch.
> Not everybody reads every mail in the thread before responding to one they
> find interesting.  Yes, ideally, one would, but sometimes one doesn't.

And look here. This is yet another ad-hoc idea. We have many users of
kvmalloc which have no relation to SG, yet you are going to control
their behavior by CONFIG_DEBUG_SG? No way! (yeah evil again)

Really, we have a fault injection framework and this sounds like
something to hook in there.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-20 Thread Michal Hocko
On Fri 20-04-18 06:41:36, Matthew Wilcox wrote:
> On Fri, Apr 20, 2018 at 03:08:52PM +0200, Michal Hocko wrote:
> > > In order to detect these bugs reliably I submit this patch that changes
> > > kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.
> > 
> > No way. This is just wrong! First of all, you will explode most likely
> > on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
> > enabled quite often.
> 
> I think it'll still suit Mikulas' debugging needs if we always use
> vmalloc for sizes above PAGE_SIZE?

Even if that were the case, this doesn't sound like CONFIG_DEBUG_VM
material. We do not want completely different behavior when the config
is enabled. If we really need better fallback testing coverage, then
fault injection, as suggested by Vlastimil, sounds much more
reasonable to me.

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM

2018-04-20 Thread Michal Hocko
On Thu 19-04-18 12:12:38, Mikulas Patocka wrote:
[...]
> From: Mikulas Patocka 
> Subject: [PATCH] kvmalloc: always use vmalloc if CONFIG_DEBUG_VM
> 
> The kvmalloc function tries to use kmalloc and falls back to vmalloc if
> kmalloc fails.
> 
> Unfortunately, some kernel code has bugs - it uses kvmalloc and then
> uses DMA-API on the returned memory or frees it with kfree. Such bugs were
> found in the virtio-net driver, dm-integrity or RHEL7 powerpc-specific
> code.
> 
> These bugs are hard to reproduce because kvmalloc falls back to vmalloc
> only if memory is fragmented.
> 
> In order to detect these bugs reliably I submit this patch that changes
> kvmalloc to always use vmalloc if CONFIG_DEBUG_VM is turned on.

No way. This is just wrong! First of all, you will explode most likely
on many allocations of small sizes. Second, CONFIG_DEBUG_VM tends to be
enabled quite often.

> Signed-off-by: Mikulas Patocka 

Nacked-by: Michal Hocko 

> ---
>  mm/util.c |2 ++
>  1 file changed, 2 insertions(+)
> 
> Index: linux-2.6/mm/util.c
> ===
> --- linux-2.6.orig/mm/util.c  2018-04-18 15:46:23.0 +0200
> +++ linux-2.6/mm/util.c   2018-04-18 16:00:43.0 +0200
> @@ -395,6 +395,7 @@ EXPORT_SYMBOL(vm_mmap);
>   */
>  void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  {
> +#ifndef CONFIG_DEBUG_VM
>   gfp_t kmalloc_flags = flags;
>   void *ret;
>  
> @@ -426,6 +427,7 @@ void *kvmalloc_node(size_t size, gfp_t f
>*/
>   if (ret || size <= PAGE_SIZE)
>   return ret;
> +#endif
>  
>       return __vmalloc_node_flags_caller(size, node, flags,
>   __builtin_return_address(0));

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH] SLUB: Do not fall back to minimum order if __GFP_NORETRY is set

2018-04-19 Thread Michal Hocko
On Wed 18-04-18 09:45:39, Christoph Lameter wrote:
> Mikulas Patocka wants to ensure that no fallback to lower order happens. I
> think __GFP_NORETRY should work correctly in that case too and not fall
> back.

Overriding __GFP_NORETRY is just a bad idea. It will make the semantics
of the flag even more confusing. Note there are users who use
__GFP_NORETRY as a way to suppress heavy memory pressure and/or the OOM
killer. You do not want to change the semantics for them.

Besides that, the changelog is less than optimal. What is the actual
problem? Why does somebody not want a fallback? Is there a configuration
that could prevent it?

> Allocating at a smaller order is a retry operation and should not
> be attempted.
> 
> If the caller does not want retries then respect that.
> 
> GFP_NORETRY allows callers to ensure that only maximum order
> allocations are attempted.
> 
> Signed-off-by: Christoph Lameter 
> 
> Index: linux/mm/slub.c
> ===
> --- linux.orig/mm/slub.c
> +++ linux/mm/slub.c
> @@ -1598,7 +1598,7 @@ static struct page *allocate_slab(struct
>   alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & 
> ~(__GFP_RECLAIM|__GFP_NOFAIL);
> 
>   page = alloc_slab_page(s, alloc_gfp, node, oo);
> - if (unlikely(!page)) {
> + if (unlikely(!page) && !(flags & __GFP_NORETRY)) {
>   oo = s->min;
>   alloc_gfp = flags;
>   /*

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH] mm: introduce kv[mz]alloc helpers

2016-12-21 Thread Michal Hocko
It seems that this email didn't get delivered due to some stupid gmail
spam policy. Let me try to repost via a different relay. Sorry to those
who have seen the original message and get a duplicate now.

On Wed 21-12-16 08:03:53, Michal Hocko wrote:
> On Tue 20-12-16 14:13:41, Andrew Morton wrote:
> > On Tue, 20 Dec 2016 09:38:22 -0800 Joe Perches  wrote:
> > 
> > > > So what are we going to do about this patch?
> > > 
> > > Well if Andrew doesn't object again, it should probably be applied.
> > > Unless his silence here acts like a pocket-veto.
> > > 
> > > Andrew?  Anything to add?
> > 
> > I guess we should give in to reality and do this, or something like it.
> > But Al said he was going to dig out some patches for us to consider?
> 
> Al wanted to cover vmalloc GFP_NOFS context _inside_ the vmalloc
> code.  This is mostly orthogonal to this patch I believe. Besides
> that I _think_ that it would be better to convert those vmalloc(NOFS)
> users to the scope api rather than tweak the vmalloc. One reason to go
> that way is that those vmalloc(NOFS) users need to be checked anyway
> and something tells me that some of them can really be changed to
> GFP_KERNEL.
> 
> This helper is clear about its gfp mask expectations and complains loudly
> if somebody wants something unexpected, which is a good start I believe.
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH] mm: introduce kv[mz]alloc helpers

2016-12-20 Thread Michal Hocko
On Wed 14-12-16 09:59:16, Michal Hocko wrote:
> On Tue 13-12-16 14:07:33, Joe Perches wrote:
> > On Tue, 2016-12-13 at 11:14 +0100, Michal Hocko wrote:
> > > Are there any more comments or objections to this patch? Is this a good
> > > start, or does kv[mz]alloc have to provide a way to cover GFP_NOFS users
> > > as well in the initial version?
> > 
> > Did Andrew Morton ever comment on this?
> > I believe he was the primary objector in the past.
> > 
> > Last I recollect was over a year ago:
> > 
> > https://lkml.org/lkml/2015/7/7/1050
> 
> Let me quote:
> : Sigh.  We've resisted doing this because vmalloc() is somewhat of a bad
> : thing, and we don't want to make it easy for people to do bad things.
> : 
> : And vmalloc is bad because a) it's slow and b) it does GFP_KERNEL
> : allocations for page tables and c) it is susceptible to arena
> : fragmentation.
> : 
> : We'd prefer that people fix their junk so it doesn't depend upon large
> : contiguous allocations.  This isn't userspace - kernel space is hostile
> : and kernel code should be robust.
> : 
> : So I dunno.  Should we continue to make it a bit more awkward to use
> : vmalloc()?  Probably that tactic isn't being very successful - people
> : will just go ahead and open-code it.  And given the surprising amount
> : of stuff you've placed in kvmalloc_node(), they'll implement it
> : incorrectly...
> : 
> : How about we compromise: add kvmalloc_node(), but include a BUG_ON("you
> : suck") to it?
> 
> While I agree with some of those points, the reality really sucks,
> though. We have tried the same tactic with __GFP_NOFAIL and failed as
> well. I guess we should just bite the bullet and provide an API which is
> so commonly needed that people keep reinventing their own ways around its
> absence, many times wrongly or suboptimally. BUG_ON("you suck") is just
> not going to help much, I am afraid.
> 
> What do you think Andrew?

So what are we going to do about this patch?
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH] mm: introduce kv[mz]alloc helpers

2016-12-14 Thread Michal Hocko
On Tue 13-12-16 13:55:46, Andreas Dilger wrote:
> On Dec 13, 2016, at 3:14 AM, Michal Hocko  wrote:
> > 
> > Are there any more comments or objections to this patch? Is this a good
> > start, or does kv[mz]alloc have to provide a way to cover GFP_NOFS users
> > as well in the initial version?
> 
> I'm in favour of this cleanup as a starting point.  I definitely agree
> that this same functionality is in use in a number of places and should
> be consolidated.
> 
> The vmalloc() from GFP_NOFS can be addressed separately in later patches.
> That is an issue for several filesystems, and while XFS works around this,
> it would be better to lift that out of the filesystem code into the VM.

Well, my longer-term plan is to change how GFP_NOFS is used from the fs
code rather than tweak the VM layer. The current situation with nofs
is messy and confusing. In many contexts it is used without a good
reason - just to be sure that nothing will break. I strongly believe
that we should use a scope api [1] which marks whole regions of
potentially reclaim-dangerous code paths, so that all the allocations
within that region inherit the nofs protection automatically. That would
solve the vmalloc(GFP_NOFS) problem as well. The route to get there is
neither short nor easy. I am planning to repost the scope patchset,
hopefully soon, with ext4 converted.

[1] http://lkml.kernel.org/r/1461671772-1269-1-git-send-email-mho...@kernel.org

> Really, there are several of things about vmalloc() that could improve
> if we decided to move it out of the dog house and allow it to become a
> first class citizen, but that needs a larger discussion, and you can
> already do a lot of cleanup with just the introduction of kvmalloc().
> 
> Since this is changing the ext4 code, you can add my:
> 
> Reviewed-by: Andreas Dilger 

thanks!
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH] mm: introduce kv[mz]alloc helpers

2016-12-14 Thread Michal Hocko
On Tue 13-12-16 14:07:33, Joe Perches wrote:
> On Tue, 2016-12-13 at 11:14 +0100, Michal Hocko wrote:
> > Are there any more comments or objections to this patch? Is this a good
> > start, or does kv[mz]alloc have to provide a way to cover GFP_NOFS users
> > as well in the initial version?
> 
> Did Andrew Morton ever comment on this?
> I believe he was the primary objector in the past.
> 
> Last I recollect was over a year ago:
> 
> https://lkml.org/lkml/2015/7/7/1050

Let me quote:
: Sigh.  We've resisted doing this because vmalloc() is somewhat of a bad
: thing, and we don't want to make it easy for people to do bad things.
: 
: And vmalloc is bad because a) it's slow and b) it does GFP_KERNEL
: allocations for page tables and c) it is susceptible to arena
: fragmentation.
: 
: We'd prefer that people fix their junk so it doesn't depend upon large
: contiguous allocations.  This isn't userspace - kernel space is hostile
: and kernel code should be robust.
: 
: So I dunno.  Should we continue to make it a bit more awkward to use
: vmalloc()?  Probably that tactic isn't being very successful - people
: will just go ahead and open-code it.  And given the surprising amount
: of stuff you've placed in kvmalloc_node(), they'll implement it
: incorrectly...
: 
: How about we compromise: add kvmalloc_node(), but include a BUG_ON("you
: suck") to it?

While I agree with some of those points, the reality really sucks,
though. We have tried the same tactic with __GFP_NOFAIL and failed as
well. I guess we should just bite the bullet and provide an API which is
so commonly needed that people keep reinventing their own ways around its
absence, many times wrongly or suboptimally. BUG_ON("you suck") is just
not going to help much, I am afraid.

What do you think Andrew?
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH] mm: introduce kv[mz]alloc helpers

2016-12-13 Thread Michal Hocko
Are there any more comments or objections to this patch? Is this a good
start, or does kv[mz]alloc have to provide a way to cover GFP_NOFS users
as well in the initial version?

On Thu 08-12-16 11:33:00, Michal Hocko wrote:
> From: Michal Hocko 
> 
> Using kmalloc with the vmalloc fallback for larger allocations is a
> common pattern in the kernel code. Yet we do not have any common helper
> for that and so users have invented their own helpers. Some of them are
> really creative when doing so. Let's just add kv[mz]alloc and make sure
> it is implemented properly. This implementation makes sure to not make
> a large memory pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also
> to not warn about allocation failures. This also rules out the OOM
> killer as the vmalloc is a more appropriate fallback than a disruptive
> user visible action.
> 
> This patch also changes some existing users and removes helpers which
> are specific for them. In some cases this is not possible (e.g.
> ext4_kvmalloc, libcfs_kvzalloc, __aa_kvmalloc) because those seem to be
> broken and require GFP_NO{FS,IO} context which is not vmalloc compatible
> in general (note that the page table allocation is GFP_KERNEL). Those
> need to be fixed separately.
> 
> apparmor has already claimed kv[mz]alloc so remove those and use
> __aa_kvmalloc instead to prevent naming clashes.
> 
> Cc: Paolo Bonzini 
> Cc: Mike Snitzer 
> Cc: dm-devel@redhat.com
> Cc: "Michael S. Tsirkin" 
> Cc: "Theodore Ts'o" 
> Cc: k...@vger.kernel.org
> Cc: linux-e...@vger.kernel.org
> Cc: linux-f2fs-de...@lists.sourceforge.net
> Cc: linux-security-mod...@vger.kernel.org
> Signed-off-by: Michal Hocko 
> ---
> 
> Hi,
> this has been brought up during [1] discussion. I think we are long overdue
> with kvmalloc helpers provided by the core mm code. There are so many users
> out there. This patch doesn't try to convert all existing users. I have just
> tried to identify those who have invented their own helpers. There are many
> others who are open-coding it. This is something for a coccinelle script to
> automate.
> 
> While looking into this I have encountered many (as noted in the
> changelog) users who are broken. Especially GFP_NOFS users which might
> go down the vmalloc path are worrying. Those need to be fixed but that
> is out of scope of this patch. I have simply left them in place. A proper
> fix for them is to not use GFP_NOFS and rather move over to a scope gfp_nofs
> api [2]. This will take quite some time though.
> 
> One thing I haven't considered in this patch - but I can if there is a
> demand - is that the current callers of kv[mz]alloc cannot really override
> GFP_NORETRY for larger requests. This flag is implicit. I can imagine some
> users would rather prefer to retry hard before falling back to vmalloc
> though. There doesn't seem to be any such user in the tree right now
> AFAICS. vhost_kvzalloc used __GFP_REPEAT but git history doesn't show any
> sign there would be a strong reason for that. I might be wrong here. If
> that is the case then it is not a problem to do
> 
> 	/*
> 	 * Make sure that larger requests are not too disruptive as long as
> 	 * the caller doesn't insist by giving __GFP_REPEAT. No OOM killer
> 	 * and no allocation failure warnings, as we have the vmalloc
> 	 * fallback; this is done by default.
> 	 */
> 	if (size > PAGE_SIZE) {
> 		kmalloc_flags |= __GFP_NOWARN;
> 
> 		if (!(kmalloc_flags & __GFP_REPEAT))
> 			kmalloc_flags |= __GFP_NORETRY;
> 	}
> 
> [1] http://lkml.kernel.org/r/1480554981-195198-1-git-send-email-astepa...@cloudlinux.com
> [2] http://lkml.kernel.org/r/1461671772-1269-1-git-send-email-mho...@kernel.org
> 
>  arch/x86/kvm/lapic.c |  4 ++--
>  arch/x86/kvm/page_track.c|  4 ++--
>  arch/x86/kvm/x86.c   |  4 ++--
>  drivers/md/dm-stats.c|  7 +--
>  drivers/vhost/vhost.c| 15 +++---
>  fs/ext4/mballoc.c|  2 +-
>  fs/ext4/super.c  |  4 ++--
>  fs/f2fs/f2fs.h   | 20 --
>  fs/f2fs/file.c   |  4 ++--
>  fs/f2fs/segment.c| 14 ++---
>  fs/seq_file.c| 16 +--
>  include/linux/kvm_host.h |  2 --
>  include/linux/mm.h   | 14 +
>  include/linux/vmalloc.h  |  1 +
>  mm/util.c| 40 
>  mm/vmalloc.c |  2 +-
>  sec

Re: [dm-devel] [RFC PATCH] mm: introduce kv[mz]alloc helpers

2016-12-08 Thread Michal Hocko
On Fri 09-12-16 06:38:04, Al Viro wrote:
> On Fri, Dec 09, 2016 at 07:22:25AM +0100, Michal Hocko wrote:
> 
> > > Easier to handle those in vmalloc() itself.
> > 
> > I think there were some attempts in the past but some of the code paths
> > are buried too deep and adding gfp_mask all the way down there seemed
> > like a major surgery.
> 
> No need to propagate gfp_mask - the same trick XFS is doing right now can
> be done in vmalloc.c in a couple of places and that's it; I'll resurrect the
> patches and post them tomorrow after I get some sleep.

That would work as an immediate mitigation, no question about that, but
what I've tried to point out in the reply to Dave is that long term we
shouldn't hide this trickiness inside vmalloc but rather handle those
users who are requesting a NOFS/NOIO context from vmalloc. We already
have a scope API for NOIO and I want to add the same for NOFS. I believe
that a much saner approach is to use the API at those places which
really start/stop a reclaim-recursion-dangerous scope (e.g. the
transaction context) rather than using GFP_NOFS randomly, because the
random approach has proven not to work properly over the years. We have
so many places using GFP_NOFS just because nobody bothered to think
whether it is needed - only assumed it must surely be safe - that it is
not funny.

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH] mm: introduce kv[mz]alloc helpers

2016-12-08 Thread Michal Hocko
On Fri 09-12-16 02:00:17, Al Viro wrote:
> On Fri, Dec 09, 2016 at 12:44:17PM +1100, Dave Chinner wrote:
> > On Thu, Dec 08, 2016 at 11:33:00AM +0100, Michal Hocko wrote:
> > > From: Michal Hocko 
> > > 
> > > Using kmalloc with the vmalloc fallback for larger allocations is a
> > > common pattern in the kernel code. Yet we do not have any common helper
> > > for that and so users have invented their own helpers. Some of them are
> > > really creative when doing so. Let's just add kv[mz]alloc and make sure
> > > it is implemented properly. This implementation makes sure to not make
> > > a large memory pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also
> > > to not warn about allocation failures. This also rules out the OOM
> > > killer as the vmalloc is a more appropriate fallback than a disruptive
> > > user visible action.
> > > 
> > > This patch also changes some existing users and removes helpers which
> > > are specific for them. In some cases this is not possible (e.g.
> > > ext4_kvmalloc, libcfs_kvzalloc, __aa_kvmalloc) because those seem to be
> > > broken and require GFP_NO{FS,IO} context which is not vmalloc compatible
> > > in general (note that the page table allocation is GFP_KERNEL). Those
> > > need to be fixed separately.
> > 
> > See fs/xfs/kmem.c::kmem_zalloc_large(), which is XFS's version of
> > kvmalloc() that is GFP_NOFS/GFP_NOIO safe. Any generic API for this
> > functionality will have to play these memalloc_noio_save/
> > memalloc_noio_restore games to ensure they are GFP_NOFS safe
> 
> Easier to handle those in vmalloc() itself.

I think there were some attempts in the past but some of the code paths
are buried too deep and adding gfp_mask all the way down there seemed
like a major surgery.

> The problem I have with these
> helpers is that different places have different cutoff thresholds for
> switch from kmalloc to vmalloc; has anyone done an analysis of those?

Yes, I have noticed some creativity as well. Some of them didn't bother
with kmalloc at all for size > PAGE_SIZE. Some were playing tricks with
PAGE_ALLOC_COSTLY_ORDER. I believe the right thing to do is to simply
not hammer the system for size > PAGE_SIZE requests, which means
__GFP_NORETRY for them and a fallback to vmalloc on failure (basically
what seq_buf_alloc did). I cannot offer any numbers, but at least
seq_buf_alloc has proven to do the right thing over time.

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH] mm: introduce kv[mz]alloc helpers

2016-12-08 Thread Michal Hocko
On Fri 09-12-16 12:44:17, Dave Chinner wrote:
> On Thu, Dec 08, 2016 at 11:33:00AM +0100, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > Using kmalloc with the vmalloc fallback for larger allocations is a
> > common pattern in the kernel code. Yet we do not have any common helper
> > for that and so users have invented their own helpers. Some of them are
> > really creative when doing so. Let's just add kv[mz]alloc and make sure
> > it is implemented properly. This implementation makes sure to not make
> > a large memory pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also
> > to not warn about allocation failures. This also rules out the OOM
> > killer as the vmalloc is a more appropriate fallback than a disruptive
> > user visible action.
> > 
> > This patch also changes some existing users and removes helpers which
> > are specific for them. In some cases this is not possible (e.g.
> > ext4_kvmalloc, libcfs_kvzalloc, __aa_kvmalloc) because those seem to be
> > broken and require GFP_NO{FS,IO} context which is not vmalloc compatible
> > in general (note that the page table allocation is GFP_KERNEL). Those
> > need to be fixed separately.
> 
> See fs/xfs/kmem.c::kmem_zalloc_large(), which is XFS's version of
> kvmalloc() that is GFP_NOFS/GFP_NOIO safe. Any generic API for this
> functionality will have to play these memalloc_noio_save/
> memalloc_noio_restore games to ensure they are GFP_NOFS safe

Well, I didn't want to play these games in the generic kvmalloc, at
least not now, because none of the converted users really needed it so
far, and I believe that the existing users need inspection to check
whether a NO{FS,IO} context is really needed. I also still believe that
the scope nofs API should be used long term rather than an explicit
GFP_NOFS. I am already working on the ext[34] code.

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH] mm: introduce kv[mz]alloc helpers

2016-12-08 Thread Michal Hocko
On Thu 08-12-16 14:00:20, David Hildenbrand wrote:
> Am 08.12.2016 um 11:33 schrieb Michal Hocko:
> > From: Michal Hocko 
> > 
> > Using kmalloc with the vmalloc fallback for larger allocations is a
> > common pattern in the kernel code. Yet we do not have any common helper
> > for that and so users have invented their own helpers. Some of them are
> > really creative when doing so. Let's just add kv[mz]alloc and make sure
> > it is implemented properly. This implementation makes sure to not make
> > a large memory pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also
> > to not warn about allocation failures. This also rules out the OOM
> > killer as the vmalloc is a more appropriate fallback than a disruptive
> > user visible action.
> > 
> > This patch also changes some existing users and removes helpers which
> > are specific for them. In some cases this is not possible (e.g.
> > ext4_kvmalloc, libcfs_kvzalloc, __aa_kvmalloc) because those seem to be
> > broken and require GFP_NO{FS,IO} context which is not vmalloc compatible
> > in general (note that the page table allocation is GFP_KERNEL). Those
> > need to be fixed separately.
> > 
> > apparmor has already claimed kv[mz]alloc so remove those and use
> > __aa_kvmalloc instead to prevent naming clashes.
> > 
> > Cc: Paolo Bonzini 
> > Cc: Mike Snitzer 
> > Cc: dm-devel@redhat.com
> > Cc: "Michael S. Tsirkin" 
> > Cc: "Theodore Ts'o" 
> > Cc: k...@vger.kernel.org
> > Cc: linux-e...@vger.kernel.org
> > Cc: linux-f2fs-de...@lists.sourceforge.net
> > Cc: linux-security-mod...@vger.kernel.org
> > Signed-off-by: Michal Hocko 
> 
> I remember yet another similar user in arch/s390/kvm/kvm-s390.c
> -> kvm_s390_set_skeys()
> 
> ...
> keys = kmalloc_array(args->count, sizeof(uint8_t),
>  GFP_KERNEL | __GFP_NOWARN);
> if (!keys)
> keys = vmalloc(sizeof(uint8_t) * args->count);
> ...
> 
> would kvmalloc_array make sense? (it would even make the code here
> less error prone and better to read)

Well, if there are more users like that then why not. I just do not want
to duplicate the whole kmalloc API right now. The above could be
trivially changed to kvmalloc(args->count * sizeof(uint8_t), GFP_KERNEL),
so a special API might not really be needed.

-- 
Michal Hocko
SUSE Labs



[dm-devel] [RFC PATCH] mm: introduce kv[mz]alloc helpers

2016-12-08 Thread Michal Hocko
From: Michal Hocko 

Using kmalloc with the vmalloc fallback for larger allocations is a
common pattern in the kernel code. Yet we do not have any common helper
for that and so users have invented their own helpers. Some of them are
really creative when doing so. Let's just add kv[mz]alloc and make sure
it is implemented properly. This implementation makes sure to not make
a large memory pressure for > PAGE_SIZE requests (__GFP_NORETRY) and also
to not warn about allocation failures. This also rules out the OOM
killer as the vmalloc is a more appropriate fallback than a disruptive
user visible action.

This patch also changes some existing users and removes helpers which
are specific for them. In some cases this is not possible (e.g.
ext4_kvmalloc, libcfs_kvzalloc, __aa_kvmalloc) because those seem to be
broken and require GFP_NO{FS,IO} context which is not vmalloc compatible
in general (note that the page table allocation is GFP_KERNEL). Those
need to be fixed separately.

apparmor has already claimed kv[mz]alloc so remove those and use
__aa_kvmalloc instead to prevent naming clashes.

Cc: Paolo Bonzini 
Cc: Mike Snitzer 
Cc: dm-devel@redhat.com
Cc: "Michael S. Tsirkin" 
Cc: "Theodore Ts'o" 
Cc: k...@vger.kernel.org
Cc: linux-e...@vger.kernel.org
Cc: linux-f2fs-de...@lists.sourceforge.net
Cc: linux-security-mod...@vger.kernel.org
Signed-off-by: Michal Hocko 
---

Hi,
this has been brought up during [1] discussion. I think we are long overdue
with kvmalloc helpers provided by the core mm code. There are so many users
out there. This patch doesn't try to convert all existing users. I have just
tried to identify those who have invented their own helpers. There are many
others who are open-coding it. This is something for a coccinelle script to
automate.

While looking into this I have encountered many (as noted in the
changelog) users who are broken. Especially GFP_NOFS users which might
go down the vmalloc path are worrying. Those need to be fixed but that
is out of scope of this patch. I have simply left them in place. A proper
fix for them is to not use GFP_NOFS and rather move over to a scope gfp_nofs
api [2]. This will take quite some time though.

One thing I haven't considered in this patch - but I can if there is a demand -
is that the current callers of kv[mz]alloc cannot really override GFP_NORETRY
for larger requests. This flag is implicit. I can imagine some users would
rather prefer to retry hard before falling back to vmalloc though. There doesn't
seem to be any such user in the tree right now AFAICS. vhost_kvzalloc
used __GFP_REPEAT but git history doesn't show any sign there would be a strong
reason for that. I might be wrong here. If that is the case then it is not a
problem to do

/*
 * Make sure that larger requests are not too disruptive as long as
 * the caller doesn't insist by giving __GFP_REPEAT. No OOM killer
 * and no allocation failure warnings by default, as we have a
 * vmalloc fallback.
 */
if (size > PAGE_SIZE) {
	kmalloc_flags |= __GFP_NOWARN;

	if (!(flags & __GFP_REPEAT))
		kmalloc_flags |= __GFP_NORETRY;
}
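For illustration only, the policy above can be exercised as a standalone
userspace sketch. The flag values below are made-up stand-ins for the real
gfp.h definitions, and kvmalloc_gfp_adjust is a hypothetical helper name,
not something from this patch:

```c
#include <assert.h>

/* Made-up stand-ins for the real gfp.h flag values. */
#define __GFP_NOWARN   0x01u
#define __GFP_NORETRY  0x02u
#define __GFP_REPEAT   0x04u
#define PAGE_SIZE      4096ul

/*
 * For requests larger than a page: suppress allocation failure
 * warnings and, unless the caller insists via __GFP_REPEAT, do not
 * retry hard -- the vmalloc fallback handles the failure instead.
 */
static unsigned int kvmalloc_gfp_adjust(unsigned int flags, unsigned long size)
{
	unsigned int kmalloc_flags = flags;

	if (size > PAGE_SIZE) {
		kmalloc_flags |= __GFP_NOWARN;

		if (!(flags & __GFP_REPEAT))
			kmalloc_flags |= __GFP_NORETRY;
	}
	return kmalloc_flags;
}
```

With this shape, a caller passing __GFP_REPEAT keeps the hard-retry
behaviour while everyone else gets the cheap __GFP_NORETRY attempt.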

[1] 
http://lkml.kernel.org/r/1480554981-195198-1-git-send-email-astepa...@cloudlinux.com
[2] http://lkml.kernel.org/r/1461671772-1269-1-git-send-email-mho...@kernel.org

 arch/x86/kvm/lapic.c |  4 ++--
 arch/x86/kvm/page_track.c|  4 ++--
 arch/x86/kvm/x86.c   |  4 ++--
 drivers/md/dm-stats.c|  7 +--
 drivers/vhost/vhost.c| 15 +++---
 fs/ext4/mballoc.c|  2 +-
 fs/ext4/super.c  |  4 ++--
 fs/f2fs/f2fs.h   | 20 --
 fs/f2fs/file.c   |  4 ++--
 fs/f2fs/segment.c| 14 ++---
 fs/seq_file.c| 16 +--
 include/linux/kvm_host.h |  2 --
 include/linux/mm.h   | 14 +
 include/linux/vmalloc.h  |  1 +
 mm/util.c| 40 
 mm/vmalloc.c |  2 +-
 security/apparmor/apparmorfs.c   |  2 +-
 security/apparmor/include/apparmor.h | 10 -
 security/apparmor/match.c|  2 +-
 virt/kvm/kvm_main.c  | 18 +++-
 20 files changed, 84 insertions(+), 101 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b62c85229711..465e5ff4c304 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -167,8 +167,8 @@ static void recalculate_apic_map(struct kvm *kvm)
if (kvm_apic_present(vcpu))
max_id = max(max_id, kvm_apic_id(vcpu->arch.apic));
 
-   new = kvm_kvzalloc(sizeof(struct kvm_apic_map) +
-  siz

Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

2016-08-03 Thread Michal Hocko
On Wed 03-08-16 09:59:11, Mikulas Patocka wrote:
> 
> 
> On Wed, 27 Jul 2016, Michal Hocko wrote:
> 
> > On Wed 27-07-16 10:28:40, Mikulas Patocka wrote:
[...]
> > > I think that approach with PF_LESS_THROTTLE in mempool_alloc is incorrect 
> > > and that mempool allocations should never be throttled.
> > 
> > I'm not really sure this is the right approach. If a particular mempool
> > user cannot ever be throttled by the page allocator then it should
> > perform GFP_NOWAIT.
> 
> Then, all block device drivers should have GFP_NOWAIT - which means that 
> we can as well make it default.
> 
> But GFP_NOWAIT also disables direct reclaim. We really want direct reclaim 
> when allocating from mempool - we just don't want to throttle due to block 
> device congestion.
> 
> We could use __GFP_NORETRY as an indication that we don't want to sleep - 
> or make a new flag __GFP_NO_THROTTLE.

__GFP_NORETRY is used for other contexts so it is not suitable.
__GFP_NO_THROTTLE would be possible but I would still prefer if we
didn't go that way unless really necessary.
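To make the trade-off concrete, here is a minimal userspace sketch of the
flag semantics being discussed; the bit values are illustrative only (in
the real gfp.h, GFP_NOWAIT is __GFP_KSWAPD_RECLAIM alone, i.e. kswapd may
be woken but the caller never enters direct reclaim itself):

```c
#include <assert.h>

/* Illustrative bit values; the real definitions live in gfp.h. */
#define __GFP_DIRECT_RECLAIM 0x1u
#define __GFP_KSWAPD_RECLAIM 0x2u
#define __GFP_RECLAIM        (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
#define GFP_NOWAIT           __GFP_KSWAPD_RECLAIM

/* A GFP_NOWAIT allocation cannot be throttled in the reclaim path
 * simply because it never enters direct reclaim in the first place. */
static int can_direct_reclaim(unsigned int gfp)
{
	return !!(gfp & __GFP_DIRECT_RECLAIM);
}
```

This is why requiring mempool users to pass GFP_NOWAIT would also cost
them direct reclaim, which is exactly the objection raised above.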

> > Even mempool allocations shouldn't allow reclaim to
> > scan pages too quickly even when LRU lists are full of dirty pages. But
> > as I've said that would restrict the success rates even under light page
> > cache load. Throttling on the wait_iff_congested should be quite rare.
> > 
> > Anyway do you see an excessive throttling with the patch posted
> > http://lkml.kernel.org/r/20160725192344.gd2...@dhcp22.suse.cz ? Or from
> 
> It didn't have much effect.
> 
> Since the patch 4e390b2b2f34b8daaabf2df1df0cf8f798b87ddb (revert of the 
> limitless mempool allocations), swapping to dm-crypt works in the simple 
> example.

OK. Do you see any throttling due to wait_iff_congested? The
writeback_wait_iff_congested trace point should help here. If not, maybe
we should start with the above patch and see how it works in practice.
If there is still excessive and unexpected throttling then we
should move on to a solution specific to mempool/block layer users.
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

2016-07-28 Thread Michal Hocko
On Thu 28-07-16 07:33:19, NeilBrown wrote:
> On Thu, Jul 28 2016, Michal Hocko wrote:
> 
> > On Wed 27-07-16 13:43:35, NeilBrown wrote:
> >> On Mon, Jul 25 2016, Michal Hocko wrote:
> >> 
> >> > On Sat 23-07-16 10:12:24, NeilBrown wrote:
> > [...]
> >> So should there be a limit on dirty
> >> pages in the swap cache just like there is for dirty pages in any
> >> filesystem (the max_dirty_ratio thing) ??
> >> Maybe there is?
> >
> > There is no limit AFAIK. We are relying on the reclaim being throttled
> > when necessary.
> 
> Is that a bit indirect?

Yes it is. Dunno how much of a problem that is, though.

> It is hard to tell without a clear big-picture.
> Something to keep in mind anyway.
> 
> >
> >> I think we'd end up with cleaner code if we removed the cute-hacks.  And
> >> we'd be able to use 6 more GFP flags!!  (though I do wonder if we really
> >> need all those 26).
> >
> > Well, maybe we are able to remove those hacks; I definitely wouldn't
> > be opposed.  But right now I am not even convinced that a
> > mempool-specific gfp flag is the right way to go.
> 
> I'm not suggesting a mempool-specific gfp flag.  I'm suggesting a
> transient-allocation gfp flag, which would be quite useful for mempool.
> 
> Can you give more details on why using a gfp flag isn't your first choice
> for guiding what happens when the system is trying to get a free page
> :-?

If we get rid of throttle_vm_writeout then I guess it might turn out to
be unnecessary. There are other places which will still throttle but I
believe those should be kept regardless of who is doing the allocation
because they help keep the LRU scanning sane. I might be wrong here
and bailing out from the reclaim rather than waiting would turn out
better for some users but I would like to see whether the first approach
works reasonably well.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

2016-07-27 Thread Michal Hocko
On Wed 27-07-16 10:28:40, Mikulas Patocka wrote:
> 
> 
> On Wed, 27 Jul 2016, NeilBrown wrote:
> 
> > On Tue, Jul 26 2016, Mikulas Patocka wrote:
> > 
> > > On Sat, 23 Jul 2016, NeilBrown wrote:
> > >
> > >> "dirtying ... from the reclaim context" ??? What does that mean?
> > >> According to
> > >>   Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
> > >> From the history tree, the purpose of throttle_vm_writeout() is to
> > >> limit the amount of memory that is concurrently under I/O.
> > >> That seems strange to me because I thought it was the responsibility of
> > >> each backing device to impose a limit - a maximum queue size of some
> > >> sort.
> > >
> > > Device mapper doesn't impose any limit for in-flight bios.
> > 
> > I would suggest that it probably should. At least it should
> > "set_wb_congested()" when the number of in-flight bios reaches some
> > arbitrary threshold.
> 
> If we set the device mapper device as congested, it can again trigger that 
> mempool alloc throttling bug.
> 
> I.e. suppose that we swap to a dm-crypt device. The dm-crypt device 
> becomes clogged and sets its state as congested. The underlying block 
> device is not congested.
> 
> The mempool_alloc function in the dm-crypt workqueue sets the 
> PF_LESS_THROTTLE flag, and tries to allocate memory, but according to 
> Michal's patches, processes with PF_LESS_THROTTLE may still get throttled.
> 
> So if we set the dm-crypt device as congested, it can incorrectly throttle 
> the dm-crypt workqueue that does allocations of temporary pages and 
> encryption.
> 
> I think that approach with PF_LESS_THROTTLE in mempool_alloc is incorrect 
> and that mempool allocations should never be throttled.

I'm not really sure this is the right approach. If a particular mempool
user cannot ever be throttled by the page allocator then it should
perform GFP_NOWAIT. Even mempool allocations shouldn't allow reclaim to
scan pages too quickly even when LRU lists are full of dirty pages. But
as I've said that would restrict the success rates even under light page
cache load. Throttling on the wait_iff_congested should be quite rare.

Anyway do you see an excessive throttling with the patch posted
http://lkml.kernel.org/r/20160725192344.gd2...@dhcp22.suse.cz ? Or from
another side. Do you see an excessive number of dirty/writeback pages
wrt. the dirty threshold or any other undesirable side effects?
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

2016-07-27 Thread Michal Hocko
On Wed 27-07-16 13:43:35, NeilBrown wrote:
> On Mon, Jul 25 2016, Michal Hocko wrote:
> 
> > On Sat 23-07-16 10:12:24, NeilBrown wrote:
[...]
> >> > My thinking was that throttle_vm_writeout is there to prevent
> >> > dirtying too many pages from the reclaim context.  PF_LESS_THROTTLE
> >> > is part of the writeout so throttling it on too many dirty pages is
> >> > questionable (well we get some bias but that is not really reliable). It
> >> > still makes sense to throttle when the backing device is congested
> >> > because the writeout path wouldn't make much progress anyway and we also
> >> > do not want to cycle through LRU lists too quickly in that case.
> >> 
> >> "dirtying ... from the reclaim context" ??? What does that mean?
> >
> > Say you would cause a swapout from the reclaim context. You would
> > effectively dirty that anon page until it gets written down to the
> > storage.
> 
> I should probably figure out how swap really works.  I have vague ideas
> which are probably missing important details...
> Isn't the first step that the page gets moved into the swap-cache - and
> marked dirty I guess.  Then it gets written out and the page is marked
> 'clean'.
> Then further memory pressure might push it out of the cache, or an early
> re-use would pull it back from the cache.
> If so, then "dirtying in reclaim context" could also be described as
> "moving into the swap cache" - yes?

Yes, that is basically correct.

> So should there be a limit on dirty
> pages in the swap cache just like there is for dirty pages in any
> filesystem (the max_dirty_ratio thing) ??
> Maybe there is?

There is no limit AFAIK. We are relying on the reclaim being throttled
when necessary.
 
> >> The use of PF_LESS_THROTTLE in current_may_throttle() in vmscan.c is to
> >> avoid a live-lock.  A key premise is that nfsd only allocates unbounded
> >> memory when it is writing to the page cache.  So it only needs to be
> >> throttled when the backing device it is writing to is congested.  It is
> >> particularly important that it *doesn't* get throttled just because an
> >> NFS backing device is congested, because nfsd might be trying to clear
> >> that congestion.
> >
> > Thanks for the clarification. IIUC then removing throttle_vm_writeout
> > for the nfsd writeout should be harmless as well, right?
> 
> Certainly shouldn't hurt from the perspective of nfsd.
> 
> >> >> The purpose of that flag is to allow a thread to dirty a page-cache page
> >> >> as part of cleaning another page-cache page.
> >> >> So it makes sense for loop and sometimes for nfsd.  It would make sense
> >> >> for dm-crypt if it was putting the encrypted version in the page cache.
> >> >> But if dm-crypt is just allocating a transient page (which I think it
> >> >> is), then a mempool should be sufficient (and we should make sure it is
> >> >> sufficient) and access to an extra 10% (or whatever) of the page cache
> >> >> isn't justified.
> >> >
> >> > If you think that PF_LESS_THROTTLE (ab)use in mempool_alloc is not
> >> > appropriate then would a PF_MEMPOOL be any better?
> >> 
> >> Why a PF rather than a GFP flag?
> >
> > Well, short answer is that gfp masks are almost depleted.
> 
> Really?  We have 26.
> 
> pagemap has a cute hack to store both GFP flags and other flag bits in
> the one 32 it number per address_space.  'struct address_space' could
> afford an extra 32 number I think.
> 
> radix_tree_root adds 3 'tag' flags to the gfp_mask.
> There is 16bits of free space in radix_tree_node (between 'offset' and
> 'count').  That space on the root node could store a record of which tags
> are set anywhere.  Or would that extra memory de-ref be a killer?

Yes these are reasons why adding new gfp flags is more complicated.

> I think we'd end up with cleaner code if we removed the cute-hacks.  And
> we'd be able to use 6 more GFP flags!!  (though I do wonder if we really
> need all those 26).

Well, maybe we are able to remove those hacks; I definitely wouldn't
be opposed.  But right now I am not even convinced that a
mempool-specific gfp flag is the right way to go.

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

2016-07-26 Thread Michal Hocko
On Mon 25-07-16 17:52:17, Mikulas Patocka wrote:
> 
> 
> On Sat, 23 Jul 2016, NeilBrown wrote:
> 
> > "dirtying ... from the reclaim context" ??? What does that mean?
> > According to
> >   Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
> > From the history tree, the purpose of throttle_vm_writeout() is to
> > limit the amount of memory that is concurrently under I/O.
> > That seems strange to me because I thought it was the responsibility of
> > each backing device to impose a limit - a maximum queue size of some
> > sort.
> 
> Device mapper doesn't impose any limit for in-flight bios.
> 
> Some simple device mapper targets (such as linear or stripe) pass bio 
> directly to the underlying device with generic_make_request, so if the 
> underlying device's request limit is reached, the target's request routine 
> waits.
> 
> However, complex dm targets (such as dm-crypt, dm-mirror, dm-thin) pass 
> bios to a workqueue that processes them. And since there is no limit on 
> the number of workqueue entries, there is no limit on the number of 
> in-flight bios.
> 
> I've seen a case when I had a HPFS filesystem on dm-crypt. I wrote to the 
> filesystem, there was about 2GB dirty data. The HPFS filesystem used 
> 512-byte bios. dm-crypt allocates one temporary page for each incoming 
> bio. So, there were 4M bios in flight, each bio allocated 4k temporary 
> page - that is attempted 16GB allocation. It didn't trigger OOM condition 
> (because mempool allocations don't ever trigger it), but it temporarily 
> exhausted all computer's memory.

OK, that is certainly not good and something that throttle_vm_writeout
aimed to protect from. It is rather poor protection, though, because
it might fire much earlier than necessary. Shouldn't those workers
simply back off when the underlying bdi is congested? It wouldn't help
to queue more IO when the bdi is hammered already.
 
> I've made some patches that limit in-flight bios for device mapper in the 
> past, but there were not integrated into upstream.

Care to revive them? I am not an expert in dm but an unbounded amount of
in-flight IO doesn't really sound good.

[...]
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

2016-07-26 Thread Michal Hocko
On Mon 25-07-16 21:23:44, Michal Hocko wrote:
> [CC Marcelo who might remember other details for the loads which made
>  him add this code - see the patch changelog for more context]
> 
> On Mon 25-07-16 10:32:47, Michal Hocko wrote:
[...]
> From 0d950d64e3c59061f7cca71fe5877d4e430499c9 Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Mon, 25 Jul 2016 14:18:54 +0200
> Subject: [PATCH] mm, vmscan: get rid of throttle_vm_writeout
> 
> throttle_vm_writeout has been introduced back in 2005 to fix OOMs caused
> by excessive pageout activity during the reclaim. Too many pages could
> be put under writeback therefore LRUs would be full of unreclaimable pages
> until the IO completes and in turn the OOM killer could be invoked.
> 
> There have been some important changes introduced since then in the
> reclaim path though. Writers are throttled by balance_dirty_pages
> when initiating the buffered IO and later during the memory pressure,
> the direct reclaim is throttled by wait_iff_congested if the node is
> considered congested by dirty pages on LRUs and the underlying bdi
> is congested by the queued IO. The kswapd is throttled as well if it
> encounters pages marked for immediate reclaim or under writeback which
> signals that there are too many pages under writeback already.
> Another important aspect is that we do not issue any IO from the direct
> reclaim context anymore. In a heavy parallel load this could queue a lot
> of IO which would be very scattered and thus unefficient which would
> just make the problem worse.

And I forgot another throttling point. should_reclaim_retry which is the
main logic to decide whether we go OOM or not has a congestion_wait if
there are too many dirty/writeback pages. That should give the IO
subsystem some time to finish the IO.

> These three mechanisms should throttle and keep the amount of IO in a
> steady state even under heavy IO and memory pressure so yet another
> throttling point doesn't really seem helpful. Quite contrary, Mikulas
> Patocka has reported that swap backed by dm-crypt doesn't work properly
> because the swapout IO cannot make sufficient progress as the writeout
> path depends on dm_crypt worker which has to allocate memory to perform
> the encryption. In order to guarantee a forward progress it relies
> on the mempool allocator. mempool_alloc(), however, prefers to use
> the underlying (usually page) allocator before it grabs objects from
> the pool. Such an allocation can dive into the memory reclaim and
> consequently into throttle_vm_writeout. If there are too many dirty or
> writeback pages it will get throttled even though it is in fact a
> flusher to clear pending pages.
> 
> [  345.352536] kworker/u4:0D 88003df7f438 10488 6  2  
> 0x
> [  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
> [  345.352536]  88003df7f438 88003e5d0380 88003e5d0380 
> 88003e5d8e80
> [  345.352536]  88003dfb3240 88003df73240 88003df8 
> 88003df7f470
> [  345.352536]  88003e5d0380 88003e5d0380 88003df7f828 
> 88003df7f450
> [  345.352536] Call Trace:
> [  345.352536]  [] schedule+0x3c/0x90
> [  345.352536]  [] schedule_timeout+0x1d8/0x360
> [  345.352536]  [] ? detach_if_pending+0x1c0/0x1c0
> [  345.352536]  [] ? ktime_get+0xb3/0x150
> [  345.352536]  [] ? __delayacct_blkio_start+0x1f/0x30
> [  345.352536]  [] io_schedule_timeout+0xa4/0x110
> [  345.352536]  [] congestion_wait+0x86/0x1f0
> [  345.352536]  [] ? prepare_to_wait_event+0xf0/0xf0
> [  345.352536]  [] throttle_vm_writeout+0x44/0xd0
> [  345.352536]  [] shrink_zone_memcg+0x613/0x720
> [  345.352536]  [] shrink_zone+0xe0/0x300
> [  345.352536]  [] do_try_to_free_pages+0x1ad/0x450
> [  345.352536]  [] try_to_free_pages+0xef/0x300
> [  345.352536]  [] __alloc_pages_nodemask+0x879/0x1210
> [  345.352536]  [] ? sched_clock_cpu+0x90/0xc0
> [  345.352536]  [] alloc_pages_current+0xa1/0x1f0
> [  345.352536]  [] ? new_slab+0x3f5/0x6a0
> [  345.352536]  [] new_slab+0x2d7/0x6a0
> [  345.352536]  [] ? sched_clock_local+0x17/0x80
> [  345.352536]  [] ___slab_alloc+0x3fb/0x5c0
> [  345.352536]  [] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [] ? sched_clock_local+0x17/0x80
> [  345.352536]  [] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [] __slab_alloc+0x51/0x90
> [  345.352536]  [] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [] kmem_cache_alloc+0x27b/0x310
> [  345.352536]  [] mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [] mempool_alloc+0x91/0x230
> [  345.352536]  [] bio_alloc_bioset+0xbd/0x260
> [  345.352536]  [] kcryptd_crypt+0x114/0x3b0 [dm_crypt]
> 
> Let's just drop throttle_vm_writeout altogether. It is not very much
> helpful anymore.

Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

2016-07-25 Thread Michal Hocko
[CC Marcelo who might remember other details for the loads which made
 him add this code - see the patch changelog for more context]

On Mon 25-07-16 10:32:47, Michal Hocko wrote:
> On Sat 23-07-16 10:12:24, NeilBrown wrote:
[...]
> > So I wonder what throttle_vm_writeout() really achieves these days.  Is
> > it just a bandaid that no-one is brave enough to remove?
> 
> Maybe yes. It is sitting there quietly and you do not know about it
> until it bites. Like in this particular case.

So I was playing with this today and tried to provoke throttle_vm_writeout
and couldn't hit that path with my pretty much default IO stack. I
probably need a more complex IO setup like dm-crypt or something that
basically has to double-buffer every page in the writeout for some
time.

Anyway I believe that throttle_vm_writeout is just a relic from the
past which just survived after many other changes in the reclaim path. I
fully realize my testing is quite poor and I would really appreciate if
Mikulas could try to retest with his more complex IO setups but let me
post a patch with the changelog so that we can at least reason about the
justification. In principle the reclaim path should have sufficient
throttling already and if that is not the case then we should
consolidate the remaining throttling points rather than add yet another one.

Thoughts?
---
>From 0d950d64e3c59061f7cca71fe5877d4e430499c9 Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Mon, 25 Jul 2016 14:18:54 +0200
Subject: [PATCH] mm, vmscan: get rid of throttle_vm_writeout

throttle_vm_writeout has been introduced back in 2005 to fix OOMs caused
by excessive pageout activity during the reclaim. Too many pages could
be put under writeback therefore LRUs would be full of unreclaimable pages
until the IO completes and in turn the OOM killer could be invoked.

There have been some important changes introduced since then in the
reclaim path though. Writers are throttled by balance_dirty_pages
when initiating the buffered IO and later during the memory pressure,
the direct reclaim is throttled by wait_iff_congested if the node is
considered congested by dirty pages on LRUs and the underlying bdi
is congested by the queued IO. The kswapd is throttled as well if it
encounters pages marked for immediate reclaim or under writeback which
signals that there are too many pages under writeback already.
Another important aspect is that we do not issue any IO from the direct
reclaim context anymore. In a heavy parallel load this could queue a lot
of IO which would be very scattered and thus inefficient, which would
just make the problem worse.

These three mechanisms should throttle and keep the amount of IO in a
steady state even under heavy IO and memory pressure so yet another
throttling point doesn't really seem helpful. Quite contrary, Mikulas
Patocka has reported that swap backed by dm-crypt doesn't work properly
because the swapout IO cannot make sufficient progress as the writeout
path depends on dm_crypt worker which has to allocate memory to perform
the encryption. In order to guarantee a forward progress it relies
on the mempool allocator. mempool_alloc(), however, prefers to use
the underlying (usually page) allocator before it grabs objects from
the pool. Such an allocation can dive into the memory reclaim and
consequently into throttle_vm_writeout. If there are too many dirty or
writeback pages it will get throttled even though it is in fact a
flusher to clear pending pages.

[  345.352536] kworker/u4:0D 88003df7f438 10488 6  2
0x
[  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
[  345.352536]  88003df7f438 88003e5d0380 88003e5d0380 
88003e5d8e80
[  345.352536]  88003dfb3240 88003df73240 88003df8 
88003df7f470
[  345.352536]  88003e5d0380 88003e5d0380 88003df7f828 
88003df7f450
[  345.352536] Call Trace:
[  345.352536]  [] schedule+0x3c/0x90
[  345.352536]  [] schedule_timeout+0x1d8/0x360
[  345.352536]  [] ? detach_if_pending+0x1c0/0x1c0
[  345.352536]  [] ? ktime_get+0xb3/0x150
[  345.352536]  [] ? __delayacct_blkio_start+0x1f/0x30
[  345.352536]  [] io_schedule_timeout+0xa4/0x110
[  345.352536]  [] congestion_wait+0x86/0x1f0
[  345.352536]  [] ? prepare_to_wait_event+0xf0/0xf0
[  345.352536]  [] throttle_vm_writeout+0x44/0xd0
[  345.352536]  [] shrink_zone_memcg+0x613/0x720
[  345.352536]  [] shrink_zone+0xe0/0x300
[  345.352536]  [] do_try_to_free_pages+0x1ad/0x450
[  345.352536]  [] try_to_free_pages+0xef/0x300
[  345.352536]  [] __alloc_pages_nodemask+0x879/0x1210
[  345.352536]  [] ? sched_clock_cpu+0x90/0xc0
[  345.352536]  [] alloc_pages_current+0xa1/0x1f0
[  345.352536]  [] ? new_slab+0x3f5/0x6a0
[  345.352536]  [] new_slab+0x2d7/0x6a0
[  345.352536]  [] ? sched_clock_local+0x17/0x80
[  345.352536]  [] ___slab_alloc+0x3fb/0x5c0
[  345.352536]  [] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [] ?

Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

2016-07-25 Thread Michal Hocko
On Sat 23-07-16 10:12:24, NeilBrown wrote:
> On Fri, Jul 22 2016, Michal Hocko wrote:
[...]
> >  If we just back off and rely on kswapd which
> > might get stuck on the writeout then the IO throughput can be reduced
> 
> If I were king of MM, I would make a decree to be proclaimed throughout
> the land
> kswapd must never sleep except when it explicitly chooses to
> 
> Maybe that is impractical, but having firm rules like that would go a
> long way to make it possible to actually understand and reason about how
> MM works.  As it is, there seems to be a tendency to put bandaids over
> bandaids.

Ohh, I would definitely wish for this to be more clear but as it turned
out over time there are quite some interdependencies between MM/FS/IO
layers which make the picture really blurry. If there is a brave soul to
make that more clear without breaking any of that it would be really
cool ;)

> > I believe which would make the whole memory pressure just worse. So I am
> > not sure this is a good idea in general. I completely agree with you
> > that the mempool request shouldn't be throttled unless there is a strong
> > reason for that. More on that below.
> >
> >> If I'm following the code properly, the stack trace below can only
> >> happen if the first pool->alloc() attempt, with direct-reclaim disabled,
> >> fails and the pool is empty, so mempool_alloc() calls prepare_to_wait()
> >> and io_schedule_timeout().
> >
> > mempool_alloc retries immediately without any sleep after the first
> > no-reclaim attempt.
> 
> I missed that ... I see it now... I wonder if anyone has contemplated
> using some modern programming techniques like, maybe, a "while" loop in
> there..
> Something like the below...

Heh, why not, the code could definitely see some more love. Care to send
a proper patch so that we are not mixing two different things here.
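As a rough illustration of the while-loop restructuring being joked about
here (this is not the real mempool_alloc; the pool model, field names, and
flag value are all invented for the sketch, and a real implementation
would sleep on pool->wait instead of a bounded loop):

```c
#include <assert.h>

#define __GFP_DIRECT_RECLAIM 0x1u	/* illustrative value */

/* Toy pool model: a count of preallocated elements plus a flag saying
 * whether the underlying (page/slab) allocator would currently succeed. */
struct toy_pool {
	int reserved;
	int sys_alloc_ok;
};

/*
 * Sketch of the flow under discussion: first try the underlying
 * allocator with direct reclaim masked out, then fall back to the
 * preallocated reserves, looping instead of the goto-based flow.
 */
static int mempool_alloc_sketch(struct toy_pool *pool, unsigned int gfp)
{
	unsigned int gfp_temp = gfp & ~__GFP_DIRECT_RECLAIM;

	(void)gfp_temp;	/* a real allocator call would take this mask */
	for (int tries = 0; tries < 2; tries++) {
		if (pool->sys_alloc_ok)
			return 1;	/* fast path: system allocator */
		if (pool->reserved > 0) {
			pool->reserved--;	/* take a reserved element */
			return 1;
		}
		/* would wait for a mempool_free() wakeup here */
	}
	return 0;
}
```

The point of the restructuring is only readability: the reserve is
consulted before re-attempting a potentially blocking allocation.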

> >> I suspect the timeout *doesn't* fire (5 seconds is a long time) so it
> >> gets woken up when there is something in the pool.  It then loops around
> >> and tries pool->alloc() again, even though there is something in the
> >> pool.  This might be justified if that ->alloc would never block, but
> >> obviously it does.
> >> 
> >> I would very strongly recommend just changing mempool_alloc() to
> >> permanently mask out __GFP_DIRECT_RECLAIM.
> >> 
> >> Quite separately I don't think PF_LESS_THROTTLE is at all appropriate.
> >> It is "LESS" throttle, not "NO" throttle, but you have made
> >> throttle_vm_writeout never throttle PF_LESS_THROTTLE threads.
> >
> > Yes that is correct. But it still allows throttling on congestion:
> > shrink_inactive_list:
> > 	/*
> > 	 * Stall direct reclaim for IO completions if underlying BDIs or zone
> > 	 * is congested. Allow kswapd to continue until it starts encountering
> > 	 * unqueued dirty pages or cycling through the LRU too quickly.
> > 	 */
> > 	if (!sc->hibernation_mode && !current_is_kswapd() &&
> > 	    current_may_throttle())
> > 		wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);
> >
> > My thinking was that throttle_vm_writeout is there to prevent
> > dirtying too many pages from the reclaim context.  PF_LESS_THROTTLE
> > is part of the writeout so throttling it on too many dirty pages is
> > questionable (well we get some bias but that is not really reliable). It
> > still makes sense to throttle when the backing device is congested
> > because the writeout path wouldn't make much progress anyway and we also
> > do not want to cycle through LRU lists too quickly in that case.
> 
> "dirtying ... from the reclaim context" ??? What does that mean?

Say you would cause a swapout from the reclaim context. You would
effectively dirty that anon page until it gets written down to the
storage.

> According to
>   Commit: 26eecbf3543b ("[PATCH] vm: pageout throttling")
> From the history tree, the purpose of throttle_vm_writeout() is to
> limit the amount of memory that is concurrently under I/O.
> That seems strange to me because I thought it was the responsibility of
> each backing device to impose a limit - a maximum queue size of some
> sort.

We do throttle on the congestion during the reclaim so in some
sense this is already implemented but I am not really sure that is
sufficient. Maybe this is something to re-evaluate because
wait_iff_congested came in much later after throttle_vm_writeout. Let me
think about it some more.

> I remember when NFS didn't impose a limit a

Re: [dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

2016-07-22 Thread Michal Hocko
On Fri 22-07-16 18:46:57, Neil Brown wrote:
> On Mon, Jul 18 2016, Michal Hocko wrote:
> 
> > From: Michal Hocko 
> >
> > Mikulas has reported that a swap backed by dm-crypt doesn't work
> > properly because the swapout cannot make a sufficient forward progress
> > as the writeout path depends on dm_crypt worker which has to allocate
> > memory to perform the encryption. In order to guarantee a forward
> > progress it relies on the mempool allocator. mempool_alloc(), however,
> > prefers to use the underlying (usually page) allocator before it grabs
> > objects from the pool. Such an allocation can dive into the memory
> > reclaim and consequently to throttle_vm_writeout.
> 
> That's just broken.
> I used to think mempool should always use the pre-allocated reserves
> first.  That is surely the most logical course of action.  Otherwise
> that memory is just sitting there doing nothing useful.
> 
> I spoke to Nick Piggin about this some years ago and he pointed out that
> the kmalloc allocation paths are much better optimized for low overhead
> when there is plenty of memory.  They can just pluck a free block of a
> per-CPU list without taking any locks.   By contrast, accessing the
> preallocated pool always requires a spinlock.
> 
> So it makes lots of sense to prefer the underlying allocator if it can
> provide a quick response.  If it cannot, the sensible thing is to use
> the pool, or wait for the pool to be replenished.
> 
> So the allocator should never wait at all, never enter reclaim, never
> throttle.
> 
> Looking at the current code, __GFP_DIRECT_RECLAIM is disabled the first
> time through, but if the pool is empty, direct-reclaim is allowed on the
> next attempt.  Presumably this is where the throttling comes in ??

Yes that is correct.

> I suspect that it really shouldn't do that. It should leave kswapd to
> do reclaim (so __GFP_KSWAPD_RECLAIM is appropriate) and only wait in
> mempool_alloc where pool->wait can wake it up.

Mikulas was already suggesting that and my concern was that this would
give up prematurely even under mild page cache load when there are many
clean page cache pages. If we just back off and rely on kswapd which
might get stuck on the writeout then the IO throughput can be reduced,
I believe, which would make the whole memory pressure just worse. So I am
not sure this is a good idea in general. I completely agree with you
that the mempool request shouldn't be throttled unless there is a strong
reason for that. More on that below.

> If I'm following the code properly, the stack trace below can only
> happen if the first pool->alloc() attempt, with direct-reclaim disabled,
> fails and the pool is empty, so mempool_alloc() calls prepare_to_wait()
> and io_schedule_timeout().

mempool_alloc retries immediately, without any sleep, after the first
no-reclaim attempt.
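
The masking behind this retry can be modeled in a small user-space sketch; the flag values and helper names below are illustrative stand-ins, not the kernel's real GFP definitions.

```c
#include <assert.h>

/* Illustrative stand-ins for the kernel's GFP bits. */
#define GFP_DIRECT_RECLAIM  0x1u
#define GFP_IO              0x2u
#define GFP_NOMEMALLOC      0x4u

/* First pass: mempool_alloc() strips direct reclaim and IO so the
 * initial attempt cannot block in reclaim. */
unsigned int mempool_first_pass(unsigned int gfp_mask)
{
	return gfp_mask & ~(GFP_DIRECT_RECLAIM | GFP_IO);
}

/* If the trimmed attempt failed and the pool is empty, the retry uses
 * the caller's full mask immediately, with no sleep in between.  Only
 * this second attempt can enter direct reclaim and get throttled. */
int retry_with_full_mask(unsigned int gfp_temp, unsigned int gfp_mask)
{
	return gfp_temp != gfp_mask;
}
```

This mirrors the gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO) dance visible in the patches quoted later in the thread.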

> I suspect the timeout *doesn't* fire (5 seconds is a long time) so it
> gets woken up when there is something in the pool.  It then loops around
> and tries pool->alloc() again, even though there is something in the
> pool.  This might be justified if that ->alloc would never block, but
> obviously it does.
> 
> I would very strongly recommend just changing mempool_alloc() to
> permanently mask out __GFP_DIRECT_RECLAIM.
> 
> Quite separately I don't think PF_LESS_THROTTLE is at all appropriate.
> It is "LESS" throttle, not "NO" throttle, but you have made
> throttle_vm_writeout never throttle PF_LESS_THROTTLE threads.

Yes that is correct. But it still allows throttling on congestion, in
shrink_inactive_list:

	/*
	 * Stall direct reclaim for IO completions if underlying BDIs or zone
	 * is congested. Allow kswapd to continue until it starts encountering
	 * unqueued dirty pages or cycling through the LRU too quickly.
	 */
	if (!sc->hibernation_mode && !current_is_kswapd() &&
	    current_may_throttle())
		wait_iff_congested(pgdat, BLK_RW_ASYNC, HZ/10);

My thinking was that throttle_vm_writeout is there to prevent dirtying
too many pages from the reclaim context. PF_LESS_THROTTLE is part of the
writeout path, so throttling it on too many dirty pages is questionable
(well, we get some bias but that is not really reliable). It still makes
sense to throttle when the backing device is congested because the
writeout path wouldn't make much progress anyway and we also do not want
to cycle through the LRU lists too quickly in that case.

Or is this assumption wrong for nfsd_vfs_write? Can it cause unbounded
dirtying of memory?
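
The two throttling gates being weighed here can be modeled as simple predicates; this is a user-space sketch with illustrative flag values, and the hibernation_mode check is omitted for brevity.

```c
#include <assert.h>

#define PF_LESS_THROTTLE  0x1u	/* illustrative stand-in */
#define PF_KSWAPD         0x2u	/* stand-in for current_is_kswapd() */

/* throttle_vm_writeout: with the patch, PF_LESS_THROTTLE tasks are
 * never stalled just because there are too many dirty/writeback pages. */
int throttled_on_dirty(unsigned int pflags, int too_many_dirty)
{
	if (pflags & PF_LESS_THROTTLE)
		return 0;
	return too_many_dirty;
}

/* shrink_inactive_list: direct reclaimers, including PF_LESS_THROTTLE
 * ones, still stall on a congested backing device; kswapd does not. */
int throttled_on_congestion(unsigned int pflags, int bdi_congested)
{
	if (pflags & PF_KSWAPD)
		return 0;
	return bdi_congested;
}
```

The last predicate is the point of Michal's reply: PF_LESS_THROTTLE skips the dirty-page stall but keeps the congestion stall.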

> The purpose of that flag is to allow a thread to dirty a page-cache page
> as part of cleaning another page-cache page.
> So it makes sense for loop 

Re: [dm-devel] [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path

2016-07-21 Thread Michal Hocko
On Thu 21-07-16 16:53:09, Michal Hocko wrote:
> From d64815758c212643cc1750774e2751721685059a Mon Sep 17 00:00:00 2001
> From: Michal Hocko 
> Date: Thu, 21 Jul 2016 16:40:59 +0200
> Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
>  free elements"
> 
> This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.

I've noticed that Andrew has already picked this one up. Is anybody
against marking it for stable?
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path

2016-07-21 Thread Michal Hocko
On Thu 21-07-16 08:13:00, Johannes Weiner wrote:
> On Thu, Jul 21, 2016 at 10:52:03AM +0200, Michal Hocko wrote:
> > Look, there are
> > $ git grep mempool_alloc | wc -l
> > 304
> > 
> > many users of this API and we do not want to flip the default behavior
> > which has been there for more than 10 years. So far you have been arguing
> > about potential deadlocks but haven't shown any particular path with a
> > direct or indirect dependency between the mempool and the normal
> > allocator that wouldn't be a bug. As a matter of fact the change
> > we are discussing here causes a regression. If you want to change the
> > semantics of the mempool allocator then you are absolutely free to do so. In
> > a separate patch which would be discussed with IO people and other
> > users, though. But we _absolutely_ want to fix the regression first
> > and have a simple fix for 4.6 and 4.7 backports. At this moment there
> > are the revert and patch 1 on the table. The latter should make your
> > backtrace happy and should only be a temporary fix until we find out
> > what is actually misbehaving on your systems. If you are not interested
> > in pursuing that way I will simply go with the revert.
> 
> +1
> 
> It's very unlikely that decade-old mempool semantics are suddenly a
> fundamental livelock problem, when all the evidence we have is one
> hang and vague speculation. Given that the patch causes regressions,
> and that the bug is most likely elsewhere anyway, a full revert rather
> than merely-less-invasive mempool changes makes the most sense to me.

OK, fair enough. What do you think about the following then? Mikulas, I
have dropped your Tested-by and Reviewed-by because the patch is
different, but unless you have hit the OOM killer the testing
results should be the same.
---
>From d64815758c212643cc1750774e2751721685059a Mon Sep 17 00:00:00 2001
From: Michal Hocko 
Date: Thu, 21 Jul 2016 16:40:59 +0200
Subject: [PATCH] Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are
 free elements"

This reverts commit f9054c70d28bc214b2857cf8db8269f4f45a5e23.

There has been a report about the OOM killer being invoked when swapping
out to a dm-crypt device. The primary reason seems to be that the swapout
IO managed to completely deplete memory reserves. Ondrej was
able to bisect and explained the issue by pointing to f9054c70d28b
("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").

The reason is that the swapout path is not throttled properly because
the md-raid layer needs to allocate from the generic_make_request path,
which means it allocates from PF_MEMALLOC context. The dm layer uses
mempool_alloc in order to guarantee forward progress, which used to
inhibit access to memory reserves when using the page allocator. This
changed with f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
there are free elements"), which dropped the __GFP_NOMEMALLOC
protection when the memory pool is depleted.

If we are running out of memory and the only way to free memory
is to perform swapout, we just keep consuming memory reserves rather than
throttling the mempool allocations and allowing the pending IO to
complete, up to the moment when memory is depleted completely and there
is no way forward but to invoke the OOM killer. This is less than
optimal.

The original intention of f9054c70d28b was to help with the OOM
situations where the oom victim depends on mempool allocation to make a
forward progress. David has mentioned the following backtrace:

schedule
schedule_timeout
io_schedule_timeout
mempool_alloc
__split_and_process_bio
dm_request
generic_make_request
submit_bio
mpage_readpages
ext4_readpages
__do_page_cache_readahead
ra_submit
filemap_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault

We do not know more about why the mempool is depleted without being
replenished in time, though. In any case the dm layer shouldn't depend
on any allocations outside of the dedicated pools, so forward progress
should be guaranteed. If this is not the case then dm should be
fixed rather than papering over the problem and postponing it by
accessing more memory reserves.

mempools are a mechanism to maintain dedicated memory reserves to guarantee
forward progress. Allowing them unbounded access to the page allocator's
memory reserves goes against the whole purpose of this mechanism.

Bisected-by: Ondrej Kozina 
Signed-off-by: Michal Hocko 
---
 mm/mempool.c | 20 
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index 8f65464da5de..5ba6c8b3b814 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -306,36 +306,25 @@ EXPORT_SYMBOL(mempool_resize);
  * returns NULL. Note that due to preallocation, this function
  * *never* fails when called from p

Re: [dm-devel] [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path

2016-07-21 Thread Michal Hocko
On Wed 20-07-16 14:06:26, David Rientjes wrote:
> On Wed, 20 Jul 2016, Michal Hocko wrote:
> 
> > > Any mempool_alloc() user that then takes a contended mutex can do this.  
> > > An example:
> > > 
> > >   taskA   taskB   taskC
> > >   -   -   -
> > >   mempool_alloc(a)
> > >   mutex_lock(b)
> > >   mutex_lock(b)
> > >   mempool_alloc(a)
> > > 
> > > Imagine the mempool_alloc() done by taskA depleting all free elements so 
> > > we rely on it to do mempool_free() before any other mempool allocator can 
> > > be guaranteed.
> > > 
> > > If taskC is oom killed, or has PF_MEMALLOC set, it cannot access memory 
> > > reserves from the page allocator if __GFP_NOMEMALLOC is automatic in 
> > > mempool_alloc().  This livelocks the page allocator for all processes.
> > > 
> > > taskB in this case need only stall after taking mutex_lock() 
> > > successfully; 
> > > that could be because of the oom livelock, it is contended on another 
> > > mutex held by an allocator, etc.
> > 
> > But that falls down to the deadlock described by Johannes above because
> > then the mempool user would _depend_ on an "unguarded page allocation"
> > via that particular lock and that is a bug.
> >  
> 
> It becomes a deadlock because of mempool_alloc(a) forcing 
> __GFP_NOMEMALLOC, I agree.
> 
> For that not to be the case, it must be required that between 
> mempool_alloc() and mempool_free() that we take no mutex that may be held 
> by any other thread on the system, in any context, that is allocating 
> memory.  If that's a caller's bug as you describe it, and only enabled by 
> mempool_alloc() forcing __GFP_NOMEMALLOC, then please add the relevant 
> lockdep detection, which would be trivial to add, so we can determine if 
> any users are unsafe and prevent this issue in the future.

I am sorry, but I am neither familiar with the lockdep internals nor do
I have the time to add this support.

> The 
> overwhelming goal here should be to prevent possible problems in the 
> future especially if an API does not allow you to opt-out of the behavior.

The __GFP_NOMEMALLOC enforcement has been there since b84a35be0285
("[PATCH] mempool: NOMEMALLOC and NORETRY"), so for more than 10 years.
I think it is quite reasonable to expect that users are familiar with
this fact and handle it properly in the vast majority of cases. In fact,
mempool deadlocks are really rare.

[...]

> > Or it would get stuck because even page allocator memory reserves got
> > depleted. Without any way to throttle there is no guarantee to make
> > further progress. In fact this is not a theoretical situation. It has
> > been observed with swap over dm-crypt and there shouldn't be any of the
> > lock dependencies you are describing above AFAIU.
> > 
> 
> They should do mempool_alloc(__GFP_NOMEMALLOC), no argument.

How would that be any different from any other mempool user which can be
invoked from the swap out path, aka any other IO path?

> What is the objection to allowing __GFP_NOMEMALLOC from the caller with 
> clear documentation on how to use it?  It can be described to not allow 
> depletion of memory reserves with the caveat that the caller must ensure 
> mempool_free() cannot be blocked in lowmem situations.

Look, there are
$ git grep mempool_alloc | wc -l
304

many users of this API and we do not want to flip the default behavior
which has been there for more than 10 years. So far you have been arguing
about potential deadlocks but haven't shown any particular path with a
direct or indirect dependency between the mempool and the normal
allocator that wouldn't be a bug. As a matter of fact the change
we are discussing here causes a regression. If you want to change the
semantics of the mempool allocator then you are absolutely free to do so. In
a separate patch which would be discussed with IO people and other
users, though. But we _absolutely_ want to fix the regression first
and have a simple fix for 4.6 and 4.7 backports. At this moment there
are the revert and patch 1 on the table. The latter should make your
backtrace happy and should only be a temporary fix until we find out
what is actually misbehaving on your systems. If you are not interested
in pursuing that way I will simply go with the revert.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path

2016-07-20 Thread Michal Hocko
That being said, f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC
if there are free elements") should be either reverted or
http://lkml.kernel.org/r/1468831285-27242-1-git-send-email-mho...@kernel.org
should be applied as a temporary workaround because it would make a
lockup less likely for now until we find out more about your issue.

Does that sound like a way forward?
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path

2016-07-19 Thread Michal Hocko
On Tue 19-07-16 17:50:29, Mikulas Patocka wrote:
> 
> 
> On Mon, 18 Jul 2016, Michal Hocko wrote:
> 
> > From: Michal Hocko 
> > 
> > There has been a report about OOM killer invoked when swapping out to
> > a dm-crypt device. The primary reason seems to be that the swapout
> > IO managed to completely deplete memory reserves. Mikulas was
> > able to bisect and explained the issue by pointing to f9054c70d28b
> > ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").
> > 
> > The reason is that the swapout path is not throttled properly because
> > the md-raid layer needs to allocate from the generic_make_request path
> > which means it allocates from the PF_MEMALLOC context. dm layer uses
> > mempool_alloc in order to guarantee a forward progress which used to
> > inhibit access to memory reserves when using page allocator. This has
> > changed by f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
> > there are free elements") which has dropped the __GFP_NOMEMALLOC
> > protection when the memory pool is depleted.
> > 
> > If we are running out of memory and the only way forward to free memory
> > is to perform swapout we just keep consuming memory reserves rather than
> > throttling the mempool allocations and allowing the pending IO to
> > complete up to a moment when the memory is depleted completely and there
> > is no way forward but invoking the OOM killer. This is less than
> > optimal.
> > 
> > The original intention of f9054c70d28b was to help with the OOM
> > situations where the oom victim depends on mempool allocation to make a
> > forward progress. We can handle that case in a different way, though. We
> > can check whether the current task has access to memory reserves as an
> > OOM victim (TIF_MEMDIE) and drop __GFP_NOMEMALLOC protection if the pool
> > is empty.
> > 
> > David Rientjes was objecting that such an approach wouldn't help if the
> > oom victim was blocked on a lock held by process doing mempool_alloc. This
> > is very similar to other oom deadlock situations and we have oom_reaper
> > to deal with them so it is reasonable to rely on the same mechanism
> > rather than inventing a different one which has negative side effects.
> > 
> > Fixes: f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements")
> > Bisected-by: Mikulas Patocka 
> 
> Bisect was done by Ondrej Kozina.

OK, fixed

> > Signed-off-by: Michal Hocko 
> 
> Reviewed-by: Mikulas Patocka 
> Tested-by: Mikulas Patocka 

Let's see whether we decide to go with this patch or a plain revert. In
any case I will mark the patch for stable so it will end up in both 4.6
and 4.7.

Anyway, thanks for your and Ondrej's help here!
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path

2016-07-19 Thread Michal Hocko
On Tue 19-07-16 09:54:26, Johannes Weiner wrote:
> On Mon, Jul 18, 2016 at 10:41:24AM +0200, Michal Hocko wrote:
> > The original intention of f9054c70d28b was to help with the OOM
> > situations where the oom victim depends on mempool allocation to make a
> > forward progress. We can handle that case in a different way, though. We
> > can check whether the current task has access to memory reserves as an
> > OOM victim (TIF_MEMDIE) and drop __GFP_NOMEMALLOC protection if the pool
> > is empty.
> > 
> > David Rientjes was objecting that such an approach wouldn't help if the
> > oom victim was blocked on a lock held by process doing mempool_alloc. This
> > is very similar to other oom deadlock situations and we have oom_reaper
> > to deal with them so it is reasonable to rely on the same mechanism
> > rather than inventing a different one which has negative side effects.
> 
> I don't understand how this scenario wouldn't be a flat-out bug.
> 
> Mempool guarantees forward progress by having all necessary memory
> objects for the guaranteed operation in reserve. Think about it this
> way: you should be able to delete the pool->alloc() call entirely and
> still make reliable forward progress. It would kill concurrency and be
> super slow, but how could it be affected by a system OOM situation?

Yes, this is my understanding of the mempool usage as well. It is much
harder to check whether mempool users are really behaving and do not
request more than the preallocated pool allows them, though. That would
be a bug in the consumer, not in the mempool as such, of course.
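
Johannes's thought experiment (delete pool->alloc() entirely and still make reliable forward progress) can be illustrated with a minimal user-space pool. This sketch is not the kernel implementation, just the invariant it relies on.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_MIN 2

struct pool {
	void *elem[POOL_MIN];
	int curr_nr;
};

void pool_init(struct pool *p, void *backing[POOL_MIN])
{
	int i;

	for (i = 0; i < POOL_MIN; i++)
		p->elem[i] = backing[i];
	p->curr_nr = POOL_MIN;
}

/* No call into the page allocator at all: reserves only.  The real
 * mempool_alloc() would sleep waiting for a free until an element is
 * returned; this model simply fails instead. */
void *pool_alloc(struct pool *p)
{
	return p->curr_nr ? p->elem[--p->curr_nr] : NULL;
}

void pool_free(struct pool *p, void *e)
{
	p->elem[p->curr_nr++] = e;
}
```

A well-behaved consumer that frees each element before demanding more can cycle alloc/free forever, independent of any system OOM state; a consumer that needs more than POOL_MIN elements in flight at once is exactly the misbehaving user Michal describes.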

My original understanding of f9054c70d28b was that it acts as a
safeguard for issues where the OOM victim loops inside mempool_alloc
without making reasonable progress because those who should refill the
pool are stuck for some reason (i.e. assume that not all mempool users
are behaving, or that they have unexpected dependencies like a WQ
without WQ_MEM_RECLAIM and similar).

My thinking was that the victim has access to memory reserves by
default, so it sounds reasonable to preserve this access also when it is
in mempool_alloc. Therefore I wanted to preserve that particular logic
and came up with this patch, which should be safer than f9054c70d28b.
But the more I think about it the more it sounds like papering over a
bug somewhere else.

So I guess we should just go and revert f9054c70d28b and get back to
David's lockup and investigate what exactly went wrong and why. The
current form of f9054c70d28b is simply too dangerous.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path

2016-07-19 Thread Michal Hocko
On Mon 18-07-16 19:00:57, David Rientjes wrote:
> On Mon, 18 Jul 2016, Michal Hocko wrote:
> 
> > David Rientjes was objecting that such an approach wouldn't help if the
> > oom victim was blocked on a lock held by process doing mempool_alloc. This
> > is very similar to other oom deadlock situations and we have oom_reaper
> > to deal with them so it is reasonable to rely on the same mechanism
> > rather than inventing a different one which has negative side effects.
> > 
> 
> Right, this causes oom livelock as described in the aforementioned thread: 
> the oom victim is waiting on a mutex that is held by a thread doing 
> mempool_alloc().

The backtrace you have provided:
schedule
schedule_timeout
io_schedule_timeout
mempool_alloc
__split_and_process_bio
dm_request
generic_make_request
submit_bio
mpage_readpages
ext4_readpages
__do_page_cache_readahead
ra_submit
filemap_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault

is not PF_MEMALLOC context AFAICS, so clearing __GFP_NOMEMALLOC for such
a task will not help unless that task has TIF_MEMDIE. Could you provide
a trace where a PF_MEMALLOC context holding a lock cannot make forward
progress?

> The oom reaper is not guaranteed to free any memory, so 
> nothing on the system can allocate memory from the page allocator.

Sure, there is no guarantee, but as I've said earlier, 1) oom_reaper
will allow selecting another victim in many cases and 2) such a deadlock
is no different from any other where the victim cannot continue because
another context holds a lock while waiting for memory. Tweaking the
mempool allocator to potentially catch such a case in a different way
doesn't sound right in principle, not to mention it has other dangerous
side effects.
 
> I think the better solution here is to allow mempool_alloc() users to set 
> __GFP_NOMEMALLOC if they are in a context which allows them to deplete 
> memory reserves.

I am not really sure about that. I agree with Johannes [1] that this
is bending the mempool allocator in an undesirable direction, because
the point of the mempool is to have its own reliably reusable memory
reserves. Now I am not even sure whether the TIF_MEMDIE exception is a
good way forward and whether a plain revert is more appropriate. Let's
CC Johannes. The patch is [2].

[1] http://lkml.kernel.org/r/20160718151445.gb14...@cmpxchg.org
[2] http://lkml.kernel.org/r/1468831285-27242-1-git-send-email-mho...@kernel.org
-- 
Michal Hocko
SUSE Labs



[dm-devel] [RFC PATCH 1/2] mempool: do not consume memory reserves from the reclaim path

2016-07-18 Thread Michal Hocko
From: Michal Hocko 

There has been a report about OOM killer invoked when swapping out to
a dm-crypt device. The primary reason seems to be that the swapout
IO managed to completely deplete memory reserves. Mikulas was
able to bisect and explained the issue by pointing to f9054c70d28b
("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").

The reason is that the swapout path is not throttled properly because
the md-raid layer needs to allocate from the generic_make_request path,
which means it allocates from PF_MEMALLOC context. The dm layer uses
mempool_alloc in order to guarantee forward progress, which used to
inhibit access to memory reserves when using the page allocator. This
changed with f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if
there are free elements"), which dropped the __GFP_NOMEMALLOC
protection when the memory pool is depleted.

If we are running out of memory and the only way to free memory
is to perform swapout, we just keep consuming memory reserves rather than
throttling the mempool allocations and allowing the pending IO to
complete, up to the moment when memory is depleted completely and there
is no way forward but to invoke the OOM killer. This is less than
optimal.

The original intention of f9054c70d28b was to help with the OOM
situations where the oom victim depends on mempool allocation to make a
forward progress. We can handle that case in a different way, though. We
can check whether the current task has access to memory reserves as an
OOM victim (TIF_MEMDIE) and drop __GFP_NOMEMALLOC protection if the pool
is empty.

David Rientjes was objecting that such an approach wouldn't help if the
oom victim was blocked on a lock held by process doing mempool_alloc. This
is very similar to other oom deadlock situations and we have oom_reaper
to deal with them so it is reasonable to rely on the same mechanism
rather than inventing a different one which has negative side effects.

Fixes: f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements")
Bisected-by: Mikulas Patocka 
Signed-off-by: Michal Hocko 
---
 mm/mempool.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index 8f65464da5de..ea26d75c8adf 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -322,20 +322,20 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
+   gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
gfp_mask |= __GFP_NORETRY;  /* don't loop in __alloc_pages */
gfp_mask |= __GFP_NOWARN;   /* failures are OK */
 
gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
 repeat_alloc:
-   if (likely(pool->curr_nr)) {
-   /*
-* Don't allocate from emergency reserves if there are
-* elements available.  This check is racy, but it will
-* be rechecked each loop.
-*/
-   gfp_temp |= __GFP_NOMEMALLOC;
-   }
+   /*
+* Make sure that the OOM victim will get access to memory reserves
+* properly if there are no objects in the pool to prevent from
+* livelocks.
+*/
+   if (!likely(pool->curr_nr) && test_thread_flag(TIF_MEMDIE))
+   gfp_temp &= ~__GFP_NOMEMALLOC;
 
element = pool->alloc(gfp_temp, pool->pool_data);
if (likely(element != NULL))
@@ -359,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 * We use gfp mask w/o direct reclaim or IO for the first round.  If
 * alloc failed with that and @pool was empty, retry immediately.
 */
-   if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
+   if ((gfp_temp & __GFP_DIRECT_RECLAIM) != (gfp_mask & __GFP_DIRECT_RECLAIM)) {
spin_unlock_irqrestore(&pool->lock, flags);
gfp_temp = gfp_mask;
goto repeat_alloc;
-- 
2.8.1
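
The gfp decision the patch above makes can be condensed into one pure function. This is a user-space sketch with an illustrative flag value, not the kernel's real GFP bit.

```c
#include <assert.h>

#define GFP_NOMEMALLOC 0x1u	/* illustrative stand-in */

/* __GFP_NOMEMALLOC is always set on entry; it is dropped only when the
 * pool is empty AND the task is an OOM victim (TIF_MEMDIE), so only an
 * OOM victim with no reserves left can dip into memory reserves. */
unsigned int mempool_gfp(unsigned int gfp_mask, int curr_nr, int tif_memdie)
{
	unsigned int gfp = gfp_mask | GFP_NOMEMALLOC;

	if (curr_nr == 0 && tif_memdie)
		gfp &= ~GFP_NOMEMALLOC;
	return gfp;
}
```

Compare with the pre-revert f9054c70d28b behavior, which dropped the protection for every task whenever the pool was empty.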



[dm-devel] [RFC PATCH 2/2] mm, mempool: do not throttle PF_LESS_THROTTLE tasks

2016-07-18 Thread Michal Hocko
From: Michal Hocko 

Mikulas has reported that swap backed by dm-crypt doesn't work
properly because the swapout cannot make sufficient forward progress:
the writeout path depends on the dm_crypt worker, which has to allocate
memory to perform the encryption. In order to guarantee forward
progress it relies on the mempool allocator. mempool_alloc(), however,
prefers to use the underlying (usually page) allocator before it grabs
objects from the pool. Such an allocation can dive into memory reclaim
and consequently into throttle_vm_writeout. If there are too many dirty
pages or pages under writeback it will get throttled even though it is
in fact a flusher trying to clear pending pages.

[  345.352536] kworker/u4:0D 88003df7f438 10488 6  2 0x
[  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
[  345.352536]  88003df7f438 88003e5d0380 88003e5d0380 
88003e5d8e80
[  345.352536]  88003dfb3240 88003df73240 88003df8 
88003df7f470
[  345.352536]  88003e5d0380 88003e5d0380 88003df7f828 
88003df7f450
[  345.352536] Call Trace:
[  345.352536]  [] schedule+0x3c/0x90
[  345.352536]  [] schedule_timeout+0x1d8/0x360
[  345.352536]  [] ? detach_if_pending+0x1c0/0x1c0
[  345.352536]  [] ? ktime_get+0xb3/0x150
[  345.352536]  [] ? __delayacct_blkio_start+0x1f/0x30
[  345.352536]  [] io_schedule_timeout+0xa4/0x110
[  345.352536]  [] congestion_wait+0x86/0x1f0
[  345.352536]  [] ? prepare_to_wait_event+0xf0/0xf0
[  345.352536]  [] throttle_vm_writeout+0x44/0xd0
[  345.352536]  [] shrink_zone_memcg+0x613/0x720
[  345.352536]  [] shrink_zone+0xe0/0x300
[  345.352536]  [] do_try_to_free_pages+0x1ad/0x450
[  345.352536]  [] try_to_free_pages+0xef/0x300
[  345.352536]  [] __alloc_pages_nodemask+0x879/0x1210
[  345.352536]  [] ? sched_clock_cpu+0x90/0xc0
[  345.352536]  [] alloc_pages_current+0xa1/0x1f0
[  345.352536]  [] ? new_slab+0x3f5/0x6a0
[  345.352536]  [] new_slab+0x2d7/0x6a0
[  345.352536]  [] ? sched_clock_local+0x17/0x80
[  345.352536]  [] ___slab_alloc+0x3fb/0x5c0
[  345.352536]  [] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [] ? sched_clock_local+0x17/0x80
[  345.352536]  [] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [] __slab_alloc+0x51/0x90
[  345.352536]  [] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [] kmem_cache_alloc+0x27b/0x310
[  345.352536]  [] mempool_alloc_slab+0x1d/0x30
[  345.352536]  [] mempool_alloc+0x91/0x230
[  345.352536]  [] bio_alloc_bioset+0xbd/0x260
[  345.352536]  [] kcryptd_crypt+0x114/0x3b0 [dm_crypt]

Memory pools are usually used in the writeback paths and it doesn't
really make much sense to throttle them just because there are too many
dirty/writeback pages. The main purpose of throttle_vm_writeout is to
make sure that the pageout path doesn't generate too much dirty data.
Considering that we are in the mempool path, which performs __GFP_NORETRY
requests, the risk shouldn't really be high.

Fix this by ensuring that mempool users will get PF_LESS_THROTTLE and
that such processes are not throttled in throttle_vm_writeout. They can
still get throttled due to current_may_throttle() sleeps but that should
happen when the backing device itself is congested which sounds like a
proper reaction.

Please note that the bonus given by domain_dirty_limits() alone is not
sufficient because at least dm-crypt has to double-buffer each page
under writeback, so that alone won't prevent it from being throttled.

There are other users of the flag but they are in the writeout path so
this looks like a proper thing for them as well.

Reported-by: Mikulas Patocka 
Signed-off-by: Michal Hocko 
---
 mm/mempool.c| 19 +++
 mm/page-writeback.c |  3 +++
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index ea26d75c8adf..916e95c4192c 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -310,7 +310,8 @@ EXPORT_SYMBOL(mempool_resize);
  */
 void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 {
-   void *element;
+   unsigned int pflags = current->flags;
+   void *element = NULL;
unsigned long flags;
wait_queue_t wait;
gfp_t gfp_temp;
@@ -328,6 +329,12 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
+   /*
+* Make sure that the allocation doesn't get throttled during the
+* reclaim
+*/
+   if (gfpflags_allow_blocking(gfp_mask))
+   current->flags |= PF_LESS_THROTTLE;
 repeat_alloc:
/*
 * Make sure that the OOM victim will get access to memory reserves
@@ -339,7 +346,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
element = pool->alloc(gfp_temp, pool->pool_data);
if (likely(element != NULL))
-   return element;
+   goto out;
 
spin_lock_irqsave(&pool->lock, flags);
if (likely(pool
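
The hunk above is cut off in the archive, but the save/restore pattern it introduces can be sketched in user space: set PF_LESS_THROTTLE only for blocking allocations, and on exit restore exactly the bit the caller had, so nesting stays correct. The flag value and names here are illustrative.

```c
#include <assert.h>

#define PF_LESS_THROTTLE 0x1u	/* illustrative stand-in */

unsigned int task_flags;	/* stands in for current->flags */

/* Returns the caller's flags so they can be restored afterwards. */
unsigned int less_throttle_enter(int can_block)
{
	unsigned int pflags = task_flags;

	if (can_block)
		task_flags |= PF_LESS_THROTTLE;
	return pflags;
}

/* Restore only the PF_LESS_THROTTLE bit the caller had on entry,
 * leaving any other flag changes made in between intact. */
void less_throttle_exit(unsigned int pflags)
{
	task_flags = (task_flags & ~PF_LESS_THROTTLE) |
		     (pflags & PF_LESS_THROTTLE);
}
```

Restoring only the saved bit (rather than assigning the whole saved word back) is what keeps a nested caller that already had PF_LESS_THROTTLE set from losing it.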

[dm-devel] [RFC PATCH 0/2] mempool vs. page allocator interaction

2016-07-18 Thread Michal Hocko
Hi,
there have been two issues identified when investigating dm-crypt
backed swap recently [1]. The first one looks like a regression from
f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free
elements") because the swapout path can now deplete all the available
memory reserves. The first patch tries to address that issue by dropping
__GFP_NOMEMALLOC only for TIF_MEMDIE tasks.

The second issue is that the dm writeout path, which relies on the
mempool allocator, gets throttled by direct reclaim in
throttle_vm_writeout, which just makes the whole memory pressure problem
even worse. Patch 2 makes sure that mempool users are annotated with the
PF_LESS_THROTTLE flag and are exempt from throttle_vm_writeout on that
path. mempool users are usually in the IO path, so throttling them less
sounds like a reasonable way to go.

I do not have any more complicated dm setup available so I would
appreciate if dm people (CCed) could give these two a try.

Also it would be great to iron out David's concerns. He has posted a
deadlock stack trace [2] which led to f9054c70d28b: a bio allocation
lockup because the TIF_MEMDIE process cannot make forward progress
without access to memory reserves. This case should be fixed by patch 1
AFAICS. There are other potential cases when the stuck mempool user is
called from PF_MEMALLOC context and blocks the oom victim indirectly
(over a lock), but I believe those are much less likely and we have the
oom reaper to make forward progress.

Sorry for pulling the discussion outside of the original email thread,
but there were multiple lines of discussion there and I felt that
discussing a particular solution with its justification has a greater
chance of moving towards a resolution. I am sending this as an RFC
because it needs a deep review, as there might be other side effects I
do not see (especially for patch 2).

Any comments, suggestions are welcome.

---
[1] 
http://lkml.kernel.org/r/alpine.lrh.2.02.1607111027080.14...@file01.intranet.prod.int.rdu2.redhat.com
[2] 
http://lkml.kernel.org/r/alpine.deb.2.10.1607131644590.92...@chino.kir.corp.google.com


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] System freezes after OOM

2016-07-18 Thread Michal Hocko
On Fri 15-07-16 13:02:17, Mikulas Patocka wrote:
> 
> 
> On Fri, 15 Jul 2016, Michal Hocko wrote:
> 
> > On Fri 15-07-16 08:11:22, Mikulas Patocka wrote:
> > > 
> > > The stacktraces showed that the kcryptd process was throttled when it 
> > > tried to do mempool allocation. Mempool adds the __GFP_NORETRY flag to the 
> > > allocation, but unfortunately, this flag doesn't prevent the allocator 
> > > from throttling.
> > 
> > Yes and in fact it shouldn't prevent any throttling. The flag merely
> > says that the allocation should give up rather than retry
> > reclaim/compaction again and again.
> > 
> > > I say that the process doing mempool allocation shouldn't ever be 
> > > throttled. Maybe add __GFP_NOTHROTTLE?
> > 
> > A specific gfp flag would be an option but we are slowly running out of
> > bit space there and I am not yet convinced PF_LESS_THROTTLE is
> > unsuitable.
> 
> PF_LESS_THROTTLE will make it throttle less, but it doesn't eliminate 
> throttling entirely. So, maybe add PF_NO_THROTTLE? But PF_* flags are also 
> almost exhausted.

I am not really sure we can make anybody so special as to not throttle at all.
Throttling on a congested backing device sounds like a reasonable compromise.
Besides that, it seems that we do not really need to eliminate
wait_iff_congested for dm to work properly again AFAIU. I plan to repost
both patches today after some more internal review. If we need to do more
changes I would suggest making them in separate patches.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] System freezes after OOM

2016-07-15 Thread Michal Hocko
On Fri 15-07-16 08:11:22, Mikulas Patocka wrote:
> 
> 
> On Fri, 15 Jul 2016, Michal Hocko wrote:
> 
> > On Thu 14-07-16 13:35:35, Mikulas Patocka wrote:
> > > On Thu, 14 Jul 2016, Michal Hocko wrote:
> > > > On Thu 14-07-16 10:00:16, Mikulas Patocka wrote:
> > > > > But it needs other changes to honor the PF_LESS_THROTTLE flag:
> > > > > 
> > > > > static int current_may_throttle(void)
> > > > > {
> > > > > return !(current->flags & PF_LESS_THROTTLE) ||
> > > > > current->backing_dev_info == NULL ||
> > > > > bdi_write_congested(current->backing_dev_info);
> > > > > }
> > > > > --- if you set PF_LESS_THROTTLE, current_may_throttle may still 
> > > > > return 
> > > > > true if one of the other conditions is met.
> > > > 
> > > > That is true but doesn't that mean that the device is congested and
> > > > waiting a bit is the right thing to do?
> > > 
> > > You shouldn't really throttle mempool allocations at all. It's better to 
> > > fail the allocation quickly and allocate from a mempool reserve than to 
> > > wait 0.1 seconds in the reclaim path.
> > 
> > Well, but we do that already, no? The first allocation request is NOWAIT
> 
> The stacktraces showed that the kcryptd process was throttled when it 
> tried to do mempool allocation. Mempool adds the __GFP_NORETRY flag to the 
> allocation, but unfortunately, this flag doesn't prevent the allocator 
> from throttling.

Yes and in fact it shouldn't prevent any throttling. The flag merely
says that the allocation should give up rather than retry
reclaim/compaction again and again.

> I say that the process doing mempool allocation shouldn't ever be 
> throttled. Maybe add __GFP_NOTHROTTLE?

A specific gfp flag would be an option but we are slowly running out of
bit space there and I am not yet convinced PF_LESS_THROTTLE is
unsuitable.

> > and then we try to consume an object from the pool. We are re-adding
> > __GFP_DIRECT_RECLAIM in case both fail. The point of throttling is to
> > prevent scanning through the LRUs too quickly while we know that the
> > bdi is congested.
> 
> > > dm-crypt can do approximately 100MB/s. That means that it processes 25k 
> > > swap pages per second. If you wait in mempool_alloc, the allocation would 
> > > be satisfied in 0.4s. If you wait in the allocator's throttle 
> > > function, you waste 0.1s.
> > > 
> > > 
> > > It is also questionable if those 0.1 second sleeps are reasonable at all. 
> > > SSDs with 100k IOPS are common - they can drain the request queue in much 
> > > less time than 0.1 second. I think those hardcoded 0.1 second sleeps 
> > > should be replaced with sleeps until the device stops being congested.
> > 
> > Well if we do not do throttle_vm_writeout then the only remaining
> > writeout throttling for PF_LESS_THROTTLE is wait_iff_congested for
> > the direct reclaim and that should wake up if the device stops being
> > congested AFAIU.
> 
> I mean - a proper thing is to use active wakeup for the throttling, rather 
> than retrying every 0.1 second. Polling for some condition is generally 
> a bad idea.
> 
> If there are too many pages under writeback, you should sleep on a wait 
> queue. When the number of pages under writeback drops, wake up the wait 
> queue.

I might be missing something, but isn't exactly this what happens in
wait_iff_congested? If the bdi doesn't see the congestion it wakes up
the reclaim context even before the timeout. Or are we talking past each
other?

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] System freezes after OOM

2016-07-15 Thread Michal Hocko
On Thu 14-07-16 13:35:35, Mikulas Patocka wrote:
> On Thu, 14 Jul 2016, Michal Hocko wrote:
> > On Thu 14-07-16 10:00:16, Mikulas Patocka wrote:
> > > But it needs other changes to honor the PF_LESS_THROTTLE flag:
> > > 
> > > static int current_may_throttle(void)
> > > {
> > > return !(current->flags & PF_LESS_THROTTLE) ||
> > > current->backing_dev_info == NULL ||
> > > bdi_write_congested(current->backing_dev_info);
> > > }
> > > --- if you set PF_LESS_THROTTLE, current_may_throttle may still return 
> > > true if one of the other conditions is met.
> > 
> > That is true but doesn't that mean that the device is congested and
> > waiting a bit is the right thing to do?
> 
> You shouldn't really throttle mempool allocations at all. It's better to 
> fail the allocation quickly and allocate from a mempool reserve than to 
> wait 0.1 seconds in the reclaim path.

Well, but we do that already, no? The first allocation request is NOWAIT
and then we try to consume an object from the pool. We are re-adding
__GFP_DIRECT_RECLAIM in case both fail. The point of throttling is to
prevent scanning through the LRUs too quickly while we know that the
bdi is congested.
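The fallback order described above can be sketched as a userspace toy model. Every name here is invented (a real mempool holds more state and masks __GFP_IO as well); only the ordering of the three steps reflects what mempool_alloc() does.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the mempool allocation order: (1) optimistic allocator
 * attempt without direct reclaim, (2) a preallocated pool element,
 * (3) retry the allocator with direct reclaim allowed again. */
#define TOY_GFP_DIRECT_RECLAIM 0x1u

struct toy_pool {
	void *elements[4];
	int curr_nr;
	void *(*alloc)(unsigned int gfp);	/* backing allocator */
};

static void *toy_mempool_alloc(struct toy_pool *pool, unsigned int gfp)
{
	/* Step 1: optimistic allocation with direct reclaim masked off. */
	void *p = pool->alloc(gfp & ~TOY_GFP_DIRECT_RECLAIM);

	if (p)
		return p;

	/* Step 2: consume a preallocated element from the pool. */
	if (pool->curr_nr > 0)
		return pool->elements[--pool->curr_nr];

	/* Step 3: retry with the original mask, reclaim allowed again. */
	if (gfp & TOY_GFP_DIRECT_RECLAIM)
		return pool->alloc(gfp);
	return NULL;
}

/* Backing allocator stub that always fails, to exercise the pool path. */
static void *toy_failing_alloc(unsigned int gfp)
{
	(void)gfp;
	return NULL;
}
```

With a failing backing allocator, the model hands out pool elements until the pool is drained and only then retries with reclaim enabled.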

> dm-crypt can do approximately 100MB/s. That means that it processes 25k 
> swap pages per second. If you wait in mempool_alloc, the allocation would 
> be satisfied in 0.4s. If you wait in the allocator's throttle 
> function, you waste 0.1s.
> 
> 
> It is also questionable if those 0.1 second sleeps are reasonable at all. 
> SSDs with 100k IOPS are common - they can drain the request queue in much 
> less time than 0.1 second. I think those hardcoded 0.1 second sleeps 
> should be replaced with sleeps until the device stops being congested.

Well if we do not do throttle_vm_writeout then the only remaining
writeout throttling for PF_LESS_THROTTLE is wait_iff_congested for
the direct reclaim and that should wake up if the device stops being
congested AFAIU.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] System freezes after OOM

2016-07-14 Thread Michal Hocko
On Thu 14-07-16 19:36:59, Michal Hocko wrote:
> On Thu 14-07-16 19:07:52, Ondrej Kozina wrote:
> > On 07/14/2016 05:31 PM, Michal Hocko wrote:
> > > On Thu 14-07-16 16:08:28, Ondrej Kozina wrote:
> > > [...]
> > > > As Mikulas pointed out, this doesn't work. The system froze as well 
> > > > with the
> > > > patch above. Will try to tweak the patch with Mikulas's suggestion...
> > > 
> > > Thank you for testing! Do you happen to have traces of the frozen
> > > processes? Does the flusher still get throttled because the bias it
> > > gets is not sufficient? Or does it get throttled at a different place?
> > > 
> > 
> > Sure. Here it is (including sysrq+t and sysrq+w output): 
> > https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/4.7.0-rc7+/1/4.7.0-rc7+.log
> 
> Thanks a lot! This is helpful.
> [  162.716376] active_anon:107874 inactive_anon:108176 isolated_anon:64
> [  162.716376]  active_file:1086 inactive_file:1103 isolated_file:0
> [  162.716376]  unevictable:0 dirty:0 writeback:69824 unstable:0
> [  162.716376]  slab_reclaimable:3119 slab_unreclaimable:24124
> [  162.716376]  mapped:2165 shmem:57 pagetables:1509 bounce:0
> [  162.716376]  free:701 free_pcp:0 free_cma:0
> 
> No surprise that PF_LESS_THROTTLE didn't help. It gives some bias but
> considering how many pages are under writeback it cannot possibly
> prevent sleeping in throttle_vm_writeout. I suppose adding
> the following on top of the memalloc patch helps, right?
> It is an alternative to what you were suggesting in the other email but
> it doesn't affect the current_may_throttle paths, which I would rather not
> touch.

I just read the other email and your patch properly; this is basically
an equivalent thing. I will think more about potential issues this might
cause and send a proper patch for wider review.

> ---
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 7fbb2d008078..a37661f1a11b 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1971,6 +1971,9 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>   unsigned long background_thresh;
>   unsigned long dirty_thresh;
>  
> + if (current->flags & PF_LESS_THROTTLE)
> +     return;
> +
>  for ( ; ; ) {
>   global_dirty_limits(&background_thresh, &dirty_thresh);
>   dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] System freezes after OOM

2016-07-14 Thread Michal Hocko
On Thu 14-07-16 19:07:52, Ondrej Kozina wrote:
> On 07/14/2016 05:31 PM, Michal Hocko wrote:
> > On Thu 14-07-16 16:08:28, Ondrej Kozina wrote:
> > [...]
> > > As Mikulas pointed out, this doesn't work. The system froze as well with 
> > > the
> > > patch above. Will try to tweak the patch with Mikulas's suggestion...
> > 
> > Thank you for testing! Do you happen to have traces of the frozen
> > processes? Does the flusher still get throttled because the bias it
> > gets is not sufficient? Or does it get throttled at a different place?
> > 
> 
> Sure. Here it is (including sysrq+t and sysrq+w output): 
> https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/4.7.0-rc7+/1/4.7.0-rc7+.log

Thanks a lot! This is helpful.
[  162.716376] active_anon:107874 inactive_anon:108176 isolated_anon:64
[  162.716376]  active_file:1086 inactive_file:1103 isolated_file:0
[  162.716376]  unevictable:0 dirty:0 writeback:69824 unstable:0
[  162.716376]  slab_reclaimable:3119 slab_unreclaimable:24124
[  162.716376]  mapped:2165 shmem:57 pagetables:1509 bounce:0
[  162.716376]  free:701 free_pcp:0 free_cma:0

No surprise that PF_LESS_THROTTLE didn't help. It gives some bias but
considering how many pages are under writeback it cannot possibly
prevent sleeping in throttle_vm_writeout. I suppose adding
the following on top of the memalloc patch helps, right?
It is an alternative to what you were suggesting in the other email but
it doesn't affect the current_may_throttle paths, which I would rather not
touch.
---
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7fbb2d008078..a37661f1a11b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1971,6 +1971,9 @@ void throttle_vm_writeout(gfp_t gfp_mask)
unsigned long background_thresh;
unsigned long dirty_thresh;
 
+   if (current->flags & PF_LESS_THROTTLE)
+   return;
+
 for ( ; ; ) {
	global_dirty_limits(&background_thresh, &dirty_thresh);
	dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] System freezes after OOM

2016-07-14 Thread Michal Hocko
On Thu 14-07-16 16:08:28, Ondrej Kozina wrote:
[...]
> As Mikulas pointed out, this doesn't work. The system froze as well with the
> patch above. Will try to tweak the patch with Mikulas's suggestion...

Thank you for testing! Do you happen to have traces of the frozen
processes? Does the flusher still get throttled because the bias it
gets is not sufficient? Or does it get throttled at a different place?
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] System freezes after OOM

2016-07-14 Thread Michal Hocko
On Thu 14-07-16 10:00:16, Mikulas Patocka wrote:
> 
> 
> On Thu, 14 Jul 2016, Michal Hocko wrote:
> 
> > On Wed 13-07-16 11:02:15, Mikulas Patocka wrote:
> 
> > > > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> > > > index 4f3cb3554944..0b806810efab 100644
> > > > --- a/drivers/md/dm-crypt.c
> > > > +++ b/drivers/md/dm-crypt.c
> > > > @@ -1392,11 +1392,14 @@ static void kcryptd_async_done(struct 
> > > > crypto_async_request *async_req,
> > > >  static void kcryptd_crypt(struct work_struct *work)
> > > >  {
> > > > struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, 
> > > > work);
> > > > +   unsigned int pflags = current->flags;
> > > >  
> > > > +   current->flags |= PF_LESS_THROTTLE;
> > > > if (bio_data_dir(io->base_bio) == READ)
> > > > kcryptd_crypt_read_convert(io);
> > > > else
> > > > kcryptd_crypt_write_convert(io);
> > > > +   tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
> > > >  }
> > > >  
> > > >  static void kcryptd_queue_crypt(struct dm_crypt_io *io)
> > > 
> > > ^^^ That fixes just one specific case - but there may be other threads 
> > > doing mempool allocations in the device mapper subsystem - and you would 
> > > need to mark all of them.
> > 
> > Now that I am thinking about it some more. Are there any mempool users
> > which would actually want to be throttled? I would expect mempool users
> > are necessary to push IO through and throttling them sounds like a bad
> > decision in the first place but there might be other mempool users which
> > could cause issues. Anyway how about setting PF_LESS_THROTTLE
> > unconditionally inside mempool_alloc? Something like the following:
> >
> > diff --git a/mm/mempool.c b/mm/mempool.c
> > index 8f65464da5de..e21fb632983f 100644
> > --- a/mm/mempool.c
> > +++ b/mm/mempool.c
> > @@ -310,7 +310,8 @@ EXPORT_SYMBOL(mempool_resize);
> >   */
> >  void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  {
> > -   void *element;
> > +   unsigned int pflags = current->flags;
> > +   void *element = NULL;
> > unsigned long flags;
> > wait_queue_t wait;
> > gfp_t gfp_temp;
> > @@ -327,6 +328,12 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  
> > gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
> >  
> > +   /*
> > +* Make sure that the allocation doesn't get throttled during the
> > +* reclaim
> > +*/
> > +   if (gfpflags_allow_blocking(gfp_mask))
> > +   current->flags |= PF_LESS_THROTTLE;
> >  repeat_alloc:
> > if (likely(pool->curr_nr)) {
> > /*
> > @@ -339,7 +346,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  
> > element = pool->alloc(gfp_temp, pool->pool_data);
> > if (likely(element != NULL))
> > -   return element;
> > +   goto out;
> >  
> > spin_lock_irqsave(&pool->lock, flags);
> > if (likely(pool->curr_nr)) {
> > @@ -352,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  * for debugging.
> >  */
> > kmemleak_update_trace(element);
> > -   return element;
> > +   goto out;
> > }
> >  
> > /*
> > @@ -369,7 +376,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> > /* We must not sleep if !__GFP_DIRECT_RECLAIM */
> > if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
> > spin_unlock_irqrestore(&pool->lock, flags);
> > -   return NULL;
> > +   goto out;
> > }
> >  
> > /* Let's wait for someone else to return an element to @pool */
> > @@ -386,6 +393,10 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  
> > finish_wait(&pool->wait, &wait);
> > goto repeat_alloc;
> > +out:
> > +   if (gfpflags_allow_blocking(gfp_mask))
> > +   tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
> > +   return element;
> >  }
> >  EXPORT_SYMBOL(mempool_alloc);
> >  
> 
> But it needs other changes to honor the PF_LESS_THROTTLE flag:
> 
> static int current_may_throttle(void)
> {
> return !(current->flags & PF_LESS_THROTTLE) ||
> current->backing_dev_info == NULL ||
> bdi_write_congested(current->backing_dev_info);
> }
> --- if you set PF_LESS_THROTTLE, current_may_throttle may still return 
> true if one of the other conditions is met.

That is true but doesn't that mean that the device is congested and
waiting a bit is the right thing to do?

> shrink_zone_memcg calls throttle_vm_writeout without checking 
> PF_LESS_THROTTLE at all.

Yes it doesn't call it because it relies on
global_dirty_limits()->domain_dirty_limits() to DTRT. It will give the
caller with PF_LESS_THROTTLE some boost wrt. all other writers.
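As a rough sketch of that boost: assuming the historical "thresh += thresh / 4" bias that global_dirty_limits() applies to PF_LESS_THROTTLE (and real-time) tasks, the effect can be modeled as below. Treat the exact factor and function name as assumptions of this sketch, not the current kernel source.

```c
#include <assert.h>

/* Toy model of the PF_LESS_THROTTLE bias on the computed dirty
 * thresholds; the roughly-25% boost assumes the historical
 * "thresh += thresh / 4" in global_dirty_limits(). */
static unsigned long biased_dirty_thresh(unsigned long thresh,
					 int less_throttle)
{
	if (less_throttle)
		thresh += thresh / 4;	/* writeout-ish tasks get headroom */
	return thresh;
}
```

So a PF_LESS_THROTTLE task is allowed to dirty/write noticeably more before being throttled, but it is a bias, not an exemption.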
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] System freezes after OOM

2016-07-14 Thread Michal Hocko
On Wed 13-07-16 11:21:41, Mikulas Patocka wrote:
> 
> 
> On Wed, 13 Jul 2016, Milan Broz wrote:
> 
> > On 07/13/2016 02:50 PM, Michal Hocko wrote:
> > > On Wed 13-07-16 13:10:06, Michal Hocko wrote:
> > >> On Tue 12-07-16 19:44:11, Mikulas Patocka wrote:
> > > [...]
> > >>> As long as swapping is in progress, the free memory is below the limit 
> > >>> (because the swapping activity itself consumes any memory over the 
> > >>> limit). 
> > >>> And that triggered the OOM killer prematurely.
> > >>
> > >> I am not sure I understand the last part. Are you saying that we trigger
> > >> OOM because the initiated swapout will not be able to finish the IO and
> > >> thus release the page in time?
> > >>
> > >> The oom detection code waits for an ongoing writeout if there is no
> > >> reclaim progress and at least half of the reclaimable memory is either
> > >> dirty or under writeback. Pages under swapout are marked as under
> > >> writeback AFAIR. The writeout path (the dm-crypt worker in this case) should
> > >> be able to allocate memory from the mempool, hand it over to the crypt
> > >> layer and finish the IO. Is it possible this might take a lot of time?
> > > 
> > > I am not familiar with the crypto API but from what I understood from
> > > crypt_convert the encryption is done asynchronously. Then I got lost in
> > > the indirection. Who is completing the request and from what kind of
> > > context? Is it possible it wouldn't be runable for a long time?
> > 
> > If you mean crypt_convert in dm-crypt, then it can do asynchronous 
> > completion
> > but usually (with AES-NI ans sw implementations) it run the operation 
> > completely
> > synchronously.
> > Asynchronous processing is quite rare, usually only on some specific 
> > hardware
> > crypto accelerators.
> > 
> > Once the encryption is finished, the cloned bio is sent to the block
> > layer for processing.
> > (There is also some magic with sorting writes but Mikulas knows this 
> > better.)
> 
> dm-crypt receives requests in crypt_map, then it distributes write 
> requests to multiple encryption threads. Encryption is done usually 
> synchronously; asynchronous completion is used only when using some PCI 
> cards that accelerate encryption. When encryption finishes, the encrypted 
> pages are submitted to a thread dmcrypt_write that sorts the requests 
> using rbtree and submits them.

OK. I was worried that the async context would depend on WQ and a lack
of workers could lead to long stalls. Dedicated kernel threads seem
sufficient.
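The write ordering Mikulas describes above can be modeled with a small userspace sketch: encrypted bios arrive out of order from the crypt threads and get submitted in sector order. The kernel's dmcrypt_write uses an rbtree keyed by sector; qsort() and the type names below are purely for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model of sector-ordered submission; the kernel sorts with an
 * rbtree, this sketch just sorts the pending queue before "submitting". */
struct toy_bio {
	unsigned long sector;
};

static int by_sector(const void *a, const void *b)
{
	unsigned long sa = ((const struct toy_bio *)a)->sector;
	unsigned long sb = ((const struct toy_bio *)b)->sector;

	return (sa > sb) - (sa < sb);
}

static void submit_sorted(struct toy_bio *queue, size_t n)
{
	qsort(queue, n, sizeof(*queue), by_sector);
	/* submission to the block layer would happen here, in order */
}
```

The design point is the same either way: sorting before submission turns scattered writes from many encryption threads into a mostly sequential stream for the underlying device.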

Thanks for the clarification.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] System freezes after OOM

2016-07-13 Thread Michal Hocko
On Wed 13-07-16 10:18:35, Mikulas Patocka wrote:
> 
> 
> On Wed, 13 Jul 2016, Michal Hocko wrote:
> 
> > [CC David]
> > 
> > > > It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. 
> > > > Prior to this commit, mempool allocations set __GFP_NOMEMALLOC, so 
> > > > they never exhausted reserved memory. With this commit, mempool 
> > > > allocations drop __GFP_NOMEMALLOC, so they can dig deeper (if the 
> > > > process has PF_MEMALLOC, they can bypass all limits).
> > > 
> > > I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set 
> > > __GFP_NOMEMALLOC if there are free elements") is doing correct thing. 
> > > It says
> > > 
> > > If an oom killed thread calls mempool_alloc(), it is possible that 
> > > it'll
> > > loop forever if there are no elements on the freelist since
> > > __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
> > > oom conditions.
> > 
> > I haven't studied the patch very deeply so I might be missing something
> > but from a quick look the patch does exactly what the above says.
> > 
> > mempool_alloc used to inhibit ALLOC_NO_WATERMARKS by default. David has
> > only changed that to allow ALLOC_NO_WATERMARKS if there are no objects
> > in the pool and so we have no fallback for the default __GFP_NORETRY
> > request.
> 
> The swapper core sets the flag PF_MEMALLOC and calls generic_make_request 
> to submit the swapping bio to the block driver. The device mapper driver 
> uses mempools for all its I/O processing.

OK, this is the part I have missed. I didn't realize that the swapout
path, which is indeed PF_MEMALLOC, can get down to blk code which uses
mempools. A quick code traversal shows that at least
make_request_fn = blk_queue_bio
blk_queue_bio
  get_request
    __get_request

might do that. And in that case I agree that the above mentioned patch
has unintentional side effects and should be re-evaluated. David, what
do you think? An obvious fixup would be considering TIF_MEMDIE in
mempool_alloc explicitly.
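A userspace sketch of that suggested fixup: keep __GFP_NOMEMALLOC on mempool allocations unless the task was OOM-killed, so only the victim may dip into the memory reserves. The flag value and the tif_memdie variable (standing in for test_thread_flag(TIF_MEMDIE)) are inventions of this sketch, not the actual patch.

```c
#include <assert.h>

/* Sketch of the idea: non-victims never consume reserves, the OOM
 * victim (TIF_MEMDIE) is allowed to, so it can exit and free memory. */
#define TOY_GFP_NOMEMALLOC 0x2u

static int tif_memdie;	/* stands in for test_thread_flag(TIF_MEMDIE) */

static unsigned int toy_mempool_adjust_gfp(unsigned int gfp)
{
	if (tif_memdie)
		gfp &= ~TOY_GFP_NOMEMALLOC;	/* victim may use reserves */
	else
		gfp |= TOY_GFP_NOMEMALLOC;	/* never consume reserves */
	return gfp;
}
```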

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] Page Allocation Failures/OOM with dm-crypt on software RAID10 (Intel Rapid Storage)

2016-07-13 Thread Michal Hocko
On Wed 13-07-16 15:18:11, Matthias Dahl wrote:
> Hello Michal,
> 
> many thanks for all your time and help on this issue. It is very much
> appreciated and I hope we can track this down somehow.
> 
> On 2016-07-13 14:18, Michal Hocko wrote:
> 
> > So it seems we are accumulating bios and 256B objects. Buffer heads as
> > well, but not so much. Having over 4G worth of bios sounds really suspicious.
> > Note that they pin pages to be written so this might be consuming the
> > rest of the unaccounted memory! So the main question is why those bios
> > do not get dispatched or finished.
> 
> Ok. It is the block IOs that do not get completed. Do I get it right
> that those bio-3 are already the encrypted data that should be written
> out but are not for some reason?

Hard to tell. Maybe they are just allocated and waiting for encryption.
But this is just wild guessing.


> I tried to figure this out myself but
> couldn't find anything -- what does the number "-3" mean? Is it the
> position in some chain or does it have a different meaning?

$ git grep "kmem_cache_create.*bio"
block/bio-integrity.c:  bip_slab = kmem_cache_create("bio_integrity_payload",

so there doesn't seem to be any cache like that in the vanilla kernel.

> Do you think a trace like you mentioned would help shed some more light
> on this? Or would you recommend something else?

Dunno. Seeing who is allocating those bios might be helpful but it won't
tell much about what has happened to them after allocation. The tracing
would be more helpful for a mem leak situation which doesn't seem to be
the case here.

This is getting out of my area of expertise so I am not sure I can help
you much more, I am afraid.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] System freezes after OOM

2016-07-13 Thread Michal Hocko
On Wed 13-07-16 10:35:01, Jerome Marchand wrote:
> On 07/13/2016 01:44 AM, Mikulas Patocka wrote:
> > The problem of swapping to dm-crypt is this.
> > 
> > The free memory goes low, kswapd decides that some page should be swapped 
> > out. However, when you swap to an encrypted device, writeback of each page 
> > requires another page to hold the encrypted data. dm-crypt uses mempools 
> > for all its structures and pages, so that it can make forward progress 
> > even if there is no memory free. However, the mempool code first allocates 
> > from general memory allocator and resorts to the mempool only if the 
> > memory is below limit.
> > 
> > So every attempt to swap out some page allocates another page.
> > 
> > As long as swapping is in progress, the free memory is below the limit 
> > (because the swapping activity itself consumes any memory over the limit). 
> > And that triggered the OOM killer prematurely.
> 
> There is a quite recent sysctl vm knob that I believe can help in this
> case: watermark_scale_factor. If you increase this value, kswapd will
> start paging out earlier, when there might still be enough free memory.
> 
> Ondrej, have you tried to increase /proc/sys/vm/watermark_scale_factor?

I suspect this would just change the timing, or merely hide the real
problem.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] System freezes after OOM

2016-07-13 Thread Michal Hocko
&& unlikely(test_thread_flag(TIF_MEMDIE)))
> + alloc_flags |= ALLOC_NO_WATERMARKS;
> + else if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
>   if (gfp_mask & __GFP_MEMALLOC)
>   alloc_flags |= ALLOC_NO_WATERMARKS;
>   else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
>   alloc_flags |= ALLOC_NO_WATERMARKS;
> - else if (!in_interrupt() &&
> - ((current->flags & PF_MEMALLOC) ||
> -  unlikely(test_thread_flag(TIF_MEMDIE
> + else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
>   alloc_flags |= ALLOC_NO_WATERMARKS;
>   }
>  #ifdef CONFIG_CMA
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majord...@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email:  em...@kvack.org 

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] System freezes after OOM

2016-07-13 Thread Michal Hocko
On Tue 12-07-16 19:44:11, Mikulas Patocka wrote:
> The problem of swapping to dm-crypt is this.
> 
> The free memory goes low, kswapd decides that some page should be swapped 
> out. However, when you swap to an encrypted device, writeback of each page 
> requires another page to hold the encrypted data. dm-crypt uses mempools 
> for all its structures and pages, so that it can make forward progress 
> even if there is no memory free. However, the mempool code first allocates 
> from the general memory allocator and resorts to the mempool only if the 
> memory is below limit.

OK, thanks for the clarification. I guess the core part happens in
crypt_alloc_buffer, right?

> So every attempt to swap out some page allocates another page.
> 
> As long as swapping is in progress, the free memory is below the limit 
> (because the swapping activity itself consumes any memory over the limit). 
> And that triggered the OOM killer prematurely.

I am not sure I understand the last part. Are you saying that we trigger
OOM because the initiated swapout will not be able to finish the IO and
thus release the page in time?

The oom detection code waits for an ongoing writeout if there is no
reclaim progress and at least half of the reclaimable memory is either
dirty or under writeback. Pages under swapout are marked as under
writeback AFAIR. The writeout path (the dm-crypt worker in this case) should
be able to allocate memory from the mempool, hand it over to the crypt
layer and finish the IO. Is it possible this might take a lot of time?
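The back-off condition described above can be sketched as follows. The one-half threshold mirrors the description in the text; the function name and exact accounting are illustrative, not the kernel's actual OOM-detection code.

```c
#include <assert.h>

/* Sketch of the wait-for-writeout heuristic: with no reclaim progress,
 * prefer waiting over declaring OOM while at least half of the
 * reclaimable memory is dirty or under writeback. */
static int should_wait_for_writeback(unsigned long reclaimable,
				     unsigned long dirty_writeback,
				     int made_progress)
{
	if (made_progress)
		return 0;	/* keep reclaiming instead */
	return dirty_writeback * 2 >= reclaimable;
}
```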

> On Tue, 12 Jul 2016, Michal Hocko wrote:
> 
> > On Mon 11-07-16 11:43:02, Mikulas Patocka wrote:
> > [...]
> > > The general problem is that the memory allocator does 16 retries to 
> > > allocate a page and then triggers the OOM killer (and it doesn't take 
> > > into 
> > > account how much swap space is free or how many dirty pages were really 
> > > swapped out while it waited).
> > 
> > Well, that is not how it works exactly. We retry as long as there is
> > reclaim progress (at least one page freed) and back off only once even
> > the reclaimable memory cannot get us over the watermarks, the estimate
> > being scaled down over 16 retries. The overall size of free swap is not
> > really that important if we cannot swap out, like here, due to complete
> > memory reserves depletion:
> > https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-0/sample-00011/dmesg:
> > [   90.491276] Node 0 DMA free:0kB min:60kB low:72kB high:84kB 
> > active_anon:4096kB inactive_anon:4636kB active_file:212kB 
> > inactive_file:280kB unevictable:488kB isolated(anon):0kB isolated(file):0kB 
> > present:15992kB managed:15908kB mlocked:488kB dirty:276kB writeback:4636kB 
> > mapped:476kB shmem:12kB slab_reclaimable:204kB slab_unreclaimable:4700kB 
> > kernel_stack:48kB pagetables:120kB unstable:0kB bounce:0kB free_pcp:0kB 
> > local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:61132 
> > all_unreclaimable? yes
> > [   90.491283] lowmem_reserve[]: 0 977 977 977
> > [   90.491286] Node 0 DMA32 free:0kB min:3828kB low:4824kB high:5820kB 
> > active_anon:423820kB inactive_anon:424916kB active_file:17996kB 
> > inactive_file:21800kB unevictable:20724kB isolated(anon):384kB 
> > isolated(file):0kB present:1032184kB managed:1001260kB mlocked:20724kB 
> > dirty:25236kB writeback:49972kB mapped:23076kB shmem:1364kB 
> > slab_reclaimable:13796kB slab_unreclaimable:43008kB kernel_stack:2816kB 
> > pagetables:7320kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB 
> > free_cma:0kB writeback_tmp:0kB pages_scanned:5635400 all_unreclaimable? yes
> > 
> > Look at the amount of free memory. It is completely depleted. So it
> > smells like a process which has access to memory reserves has consumed
> > all of it. I suspect a __GFP_MEMALLOC or PF_MEMALLOC user from softirq
> > context which went off the leash.
> 
> It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. Prior 
> to this commit, mempool allocations set __GFP_NOMEMALLOC, so they never 
> exhausted reserved memory. With this commit, mempool allocations drop 
> __GFP_NOMEMALLOC, so they can dig deeper (if the process has PF_MEMALLOC, 
> they can bypass all limits).

Hmm, but the patch allows access to the memory reserves only when the
pool is empty. And even then the caller would have to request access to
the reserves explicitly, either by __GFP_MEMALLOC or PF_MEMALLOC. That
doesn't seem to be the case for dm-crypt, though. Or do you suspect
that some other mempool user might be doing so? 

> But swapping should proceed even if there is no memory free. There is a 
> comment "TODO: this could cause a theoretical memory reclaim deadlock in 
> the sw

Re: [dm-devel] System freezes after OOM

2016-07-13 Thread Michal Hocko
On Mon 11-07-16 11:43:02, Mikulas Patocka wrote:
[...]
> The general problem is that the memory allocator does 16 retries to 
> allocate a page and then triggers the OOM killer (and it doesn't take into 
> account how much swap space is free or how many dirty pages were really 
> swapped out while it waited).

Well, that is not exactly how it works. We retry as long as there is
reclaim progress (at least one page freed) and back off only once the
reclaimable memory can no longer exceed the watermarks, an estimate
which is scaled down over 16 retries. The overall amount of free swap is
not really that important if we cannot swap out, like here, due to
complete memory reserves depletion:
https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-0/sample-00011/dmesg:
[   90.491276] Node 0 DMA free:0kB min:60kB low:72kB high:84kB 
active_anon:4096kB inactive_anon:4636kB active_file:212kB inactive_file:280kB 
unevictable:488kB isolated(anon):0kB isolated(file):0kB present:15992kB 
managed:15908kB mlocked:488kB dirty:276kB writeback:4636kB mapped:476kB 
shmem:12kB slab_reclaimable:204kB slab_unreclaimable:4700kB kernel_stack:48kB 
pagetables:120kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB 
free_cma:0kB writeback_tmp:0kB pages_scanned:61132 all_unreclaimable? yes
[   90.491283] lowmem_reserve[]: 0 977 977 977
[   90.491286] Node 0 DMA32 free:0kB min:3828kB low:4824kB high:5820kB 
active_anon:423820kB inactive_anon:424916kB active_file:17996kB 
inactive_file:21800kB unevictable:20724kB isolated(anon):384kB 
isolated(file):0kB present:1032184kB managed:1001260kB mlocked:20724kB 
dirty:25236kB writeback:49972kB mapped:23076kB shmem:1364kB 
slab_reclaimable:13796kB slab_unreclaimable:43008kB kernel_stack:2816kB 
pagetables:7320kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB 
free_cma:0kB writeback_tmp:0kB pages_scanned:5635400 all_unreclaimable? yes

Look at the amount of free memory. It is completely depleted. So it
smells like a process which has access to memory reserves has consumed
all of it. I suspect a __GFP_MEMALLOC resp. PF_MEMALLOC from softirq
context user which went off the leash.

> So, it could prematurely trigger OOM killer on any slow swapping device 
> (including dm-crypt). Michal Hocko reworked the OOM killer in the patch 
> 0a0337e0d1d134465778a16f5cbea95086e8e9e0, but it still has the flaw that 
> it triggers OOM if there is plenty of free swap space free.
> 
> Michal, would you accept a change to the OOM killer, to prevent it from 
> triggerring when there is free swap space?

No, this doesn't sound like a proper solution. The current decision
logic, as explained above, relies on the feedback from the reclaim.
Free swap space doesn't really mean we can make forward progress.
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] System freezes after OOM

2016-07-13 Thread Michal Hocko
On Wed 13-07-16 13:10:06, Michal Hocko wrote:
> On Tue 12-07-16 19:44:11, Mikulas Patocka wrote:
[...]
> > As long as swapping is in progress, the free memory is below the limit 
> > (because the swapping activity itself consumes any memory over the limit). 
> > And that triggered the OOM killer prematurely.
> 
> I am not sure I understand the last part. Are you saying that we trigger
> OOM because the initiated swapout will not be able to finish the IO thus
> release the page in time?
> 
> The oom detection waits for an ongoing writeout if there is no
> reclaim progress and at least half of the reclaimable memory is either
> dirty or under writeback. Pages under swapout are marked as under
> writeback AFAIR. The writeout path (dm-crypt worker in this case) should
> be able to allocate a memory from the mempool, hand over to the crypt
> layer and finish the IO. Is it possible this might take a lot of time?

I am not familiar with the crypto API but from what I understood from
crypt_convert the encryption is done asynchronously. Then I got lost in
the indirection. Who is completing the request and from what kind of
context? Is it possible it wouldn't be runable for a long time?
-- 
Michal Hocko
SUSE Labs

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel


Re: [dm-devel] Page Allocation Failures/OOM with dm-crypt on software RAID10 (Intel Rapid Storage)

2016-07-13 Thread Michal Hocko
On Wed 13-07-16 13:21:26, Michal Hocko wrote:
> On Tue 12-07-16 16:56:32, Matthias Dahl wrote:
[...]
> > If that support is baked into the Fedora provided kernel that is. If
> > you could give me a few hints or pointers, how to properly do a allocator
> > trace point and get some decent data out of it, that would be nice.
> 
> You need to have a kernel with CONFIG_TRACEPOINTS and then enable them
> via debugfs. You are interested in kmalloc tracepoint and specify a size
> as a filter to only see those that are really interesting. I haven't
> checked your slabinfo yet - hope to get to it later today.

The largest contributors seem to be
$ zcat slabinfo.txt.gz | awk '{printf "%s %d\n" , $1, $6*$15}' | 
head_and_tail.sh 133 | paste-with-diff.sh | sort -n -k3
                     initial     diff [#pages]
radix_tree_node     34442592
debug_objects_cache      388        46159
file_lock_ctx            114       138570
buffer_head             5616       238704
kmalloc-256              328       573164
bio-3                     24      1118984

So it seems we are accumulating bios and 256B objects. Buffer heads as
well, but not as much. Having over 4G worth of bios sounds really suspicious.
Note that they pin pages to be written so this might be consuming the
rest of the unaccounted memory! So the main question is why those bios
do not get dispatched or finished.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] Page Allocation Failures/OOM with dm-crypt on software RAID10 (Intel Rapid Storage)

2016-07-13 Thread Michal Hocko
On Tue 12-07-16 16:56:32, Matthias Dahl wrote:
> Hello Michal...
> 
> On 2016-07-12 16:07, Michal Hocko wrote:
> 
> > /proc/slabinfo could at least point on who is eating that memory.
> 
> Thanks. I have made another test (and thus again put the RAID10 out of
> sync for the 100th time, sigh) and made regular snapshots of slabinfo
> which I have attached to this mail.
> 
> > Direct IO doesn't get throttled like buffered IO.
> 
> Is buffered i/o not used in both cases if I don't explicitly request
> direct i/o?
> 
> dd if=/dev/zero /dev/md126p5 bs=512K
> and dd if=/dev/zero /dev/mapper/test-device bs=512K

OK, I misunderstood your question though. You were mentioning the direct
IO earlier so I thought you were referring to it here as well.
 
> Given that the test-device is dm-crypt on md125p5. Aren't both using
> buffered i/o?

Yes they are.

> > the number of pages under writeback was more or less same throughout
> > the time but there are some local fluctuations when some pages do get
> > completed.
> 
> The pages under writeback are those directly destined for the disk, so
> after dm-crypt had done its encryption?

Those are submitted for the IO. dm-crypt will allocate a "shadow" page
for each of them to perform the encryption and only then submit the IO
to the storage underneath; see
http://lkml.kernel.org/r/alpine.lrh.2.02.1607121907160.24...@file01.intranet.prod.int.rdu2.redhat.com

> > If not you can enable allocator trace point for a particular object
> > size (or range of sizes) and see who is requesting them.
> 
> If that support is baked into the Fedora provided kernel that is. If
> you could give me a few hints or pointers, how to properly do a allocator
> trace point and get some decent data out of it, that would be nice.

You need to have a kernel with CONFIG_TRACEPOINTS and then enable them
via debugfs. You are interested in kmalloc tracepoint and specify a size
as a filter to only see those that are really interesting. I haven't
checked your slabinfo yet - hope to get to it later today.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] Page Allocation Failures/OOM with dm-crypt on software RAID10 (Intel Rapid Storage)

2016-07-12 Thread Michal Hocko
On Tue 12-07-16 14:42:12, Matthias Dahl wrote:
> Hello Michal...
> 
> On 2016-07-12 13:49, Michal Hocko wrote:
> 
> > I am not a storage expert (not even mention dm-crypt). But what those
> > counters say is that the IO completion doesn't trigger so the
> > PageWriteback flag is still set. Such a page is not reclaimable
> > obviously. So I would check the IO delivery path and focus on the
> > potential dm-crypt involvement if you suspect this is a contributing
> > factor.
> 
> Sounds reasonable... except that I have no clue how to trace that with
> the limited means I have at my disposal right now and with the limited
> knowledge I have of the kernel internals. ;-)

I guess dm resp. block layer experts would be much better to advise
here.
 
> > Who is consuming those objects? Where is the rest 70% of memory hiding?
> 
> Is there any way to get a more detailed listing of where the memory is
> spent while dd is running? Something I could pipe every 500ms or so for
> later analysis or so?

/proc/slabinfo could at least point on who is eating that memory.

> > Writer will get throttled but the concurrent memory consumer will not
> > normally. So you can end up in this situation.
> 
> Hm, okay. I am still confused though: If I, for example, let dd do the
> exact same thing on a raw partition on the RAID10, nothing like that
> happens. Wouldn't we have the same race and problem then too...?

Direct IO doesn't get throttled like buffered IO.

> It is
> only with dm-crypt in-between that all of this shows itself. But I do
> somehow suspect the RAID10 Intel Rapid Storage to be the cause or at
> least partially.

Well, there are many allocation failures for GFP_ATOMIC requests
from scsi_request_fn path. AFAIU the code, the request is deferred
and retried later.  I cannot find any path which would do regular
__GFP_DIRECT_RECLAIM fallback allocation. So while GFP_ATOMIC is
___GFP_KSWAPD_RECLAIM so it would kick kswapd which should reclaim some
memory there is no guarantee for a forward progress. Anyway, I believe
even __GFP_DIRECT_RECLAIM allocation wouldn't help much in this
particular case. There is not much of a reclaimable memory left - most
of it is dirty/writeback - so we are close to a deadlock state.

But we do not seem to be stuck completely:
unevictable:12341 dirty:90458 writeback:651272 unstable:0
unevictable:12341 dirty:90458 writeback:651272 unstable:0
unevictable:12341 dirty:90458 writeback:651272 unstable:0
unevictable:12341 dirty:90222 writeback:651252 unstable:0
unevictable:12341 dirty:90222 writeback:651231 unstable:0
unevictable:12341 dirty:89321 writeback:651905 unstable:0
unevictable:12341 dirty:89212 writeback:652014 unstable:0
unevictable:12341 dirty:89212 writeback:651993 unstable:0
unevictable:12341 dirty:89212 writeback:651993 unstable:0
unevictable:12488 dirty:42892 writeback:656597 unstable:0
unevictable:12488 dirty:42783 writeback:656597 unstable:0
unevictable:12488 dirty:42125 writeback:657125 unstable:0
unevictable:12488 dirty:42125 writeback:657125 unstable:0
unevictable:12488 dirty:42125 writeback:657125 unstable:0
unevictable:12488 dirty:42125 writeback:657125 unstable:0
unevictable:12556 dirty:54778 writeback:648616 unstable:0
unevictable:12556 dirty:54168 writeback:648919 unstable:0
unevictable:12556 dirty:54168 writeback:648919 unstable:0
unevictable:12556 dirty:53237 writeback:649506 unstable:0
unevictable:12556 dirty:53237 writeback:649506 unstable:0
unevictable:12556 dirty:53128 writeback:649615 unstable:0
unevictable:12556 dirty:53128 writeback:649615 unstable:0
unevictable:12556 dirty:52256 writeback:650159 unstable:0
unevictable:12556 dirty:52256 writeback:650159 unstable:0
unevictable:12556 dirty:52256 writeback:650138 unstable:0
unevictable:12635 dirty:49929 writeback:650724 unstable:0
unevictable:12635 dirty:49820 writeback:650833 unstable:0
unevictable:12635 dirty:49820 writeback:650833 unstable:0
unevictable:12635 dirty:49820 writeback:650833 unstable:0
unevictable:13001 dirty:167859 writeback:651864 unstable:0
unevictable:13001 dirty:167672 writeback:652038 unstable:0

the number of pages under writeback was more or less same throughout
the time but there are some local fluctuations when some pages do get
completed.

That being said, I believe that IO is stuck due to lack of memory which
is caused by some memory leak or excessive memory consumption. 

Re: [dm-devel] Page Allocation Failures/OOM with dm-crypt on software RAID10 (Intel Rapid Storage)

2016-07-12 Thread Michal Hocko
On Tue 12-07-16 13:49:20, Michal Hocko wrote:
> On Tue 12-07-16 13:28:12, Matthias Dahl wrote:
> > Hello Michal...
> > 
> > On 2016-07-12 11:50, Michal Hocko wrote:
> > 
> > > This smells like file pages are stuck in the writeback somewhere and the
> > > anon memory is not reclaimable because you do not have any swap device.
> > 
> > Not having a swap device shouldn't be a problem -- and in this case, it
> > would cause even more trouble as in disk i/o.
> > 
> > What could cause the file pages to get stuck or stopped from being written
> > to the disk? And more importantly, what is so unique/special about the
> > Intel Rapid Storage that it happens (seemingly) exclusively with that
> > and not the normal Linux s/w raid support?
> 
> I am not a storage expert (not even mention dm-crypt). But what those
> counters say is that the IO completion doesn't trigger so the
> PageWriteback flag is still set. Such a page is not reclaimable
> obviously. So I would check the IO delivery path and focus on the
> potential dm-crypt involvement if you suspect this is a contributing
> factor.
>  
> > Also, if the pages are not written to disk, shouldn't something error
> > out or slow dd down?
> 
> Writers are normally throttled when we hit the dirty limit. You seem to have
> dirty_ratio set to 20% which is quite a lot considering how much memory
> you have.

And just to clarify. dirty_ratio refers to dirtyable memory which is
free_pages+file_lru pages. In your case you have only 9% of the total
memory size dirty/writeback but that is 90% of dirtyable memory. This is
quite possible if somebody consumes free_pages racing with the writer.
Writer will get throttled but the concurrent memory consumer will not
normally. So you can end up in this situation.

> If you get back to the memory info from the OOM killer report:
> [18907.592209] active_anon:110314 inactive_anon:295 isolated_anon:0
> active_file:27534 inactive_file:819673 isolated_file:160
> unevictable:13001 dirty:167859 writeback:651864 unstable:0
> slab_reclaimable:177477 slab_unreclaimable:1817501
> mapped:934 shmem:588 pagetables:7109 bounce:0
> free:49928 free_pcp:45 free_cma:0
> 
> The dirty+writeback is ~9%. What is more interesting, though, LRU
> pages are negligible compared to the memory size (~11%). Note the number of
> unreclaimable slab pages (~20%). Who is consuming those objects?
> Where is the rest 70% of memory hiding?


-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] Page Allocation Failures/OOM with dm-crypt on software RAID10 (Intel Rapid Storage)

2016-07-12 Thread Michal Hocko
On Tue 12-07-16 13:28:12, Matthias Dahl wrote:
> Hello Michal...
> 
> On 2016-07-12 11:50, Michal Hocko wrote:
> 
> > This smells like file pages are stuck in the writeback somewhere and the
> > anon memory is not reclaimable because you do not have any swap device.
> 
> Not having a swap device shouldn't be a problem -- and in this case, it
> would cause even more trouble as in disk i/o.
> 
> What could cause the file pages to get stuck or stopped from being written
> to the disk? And more importantly, what is so unique/special about the
> Intel Rapid Storage that it happens (seemingly) exclusively with that
> and not the normal Linux s/w raid support?

I am not a storage expert (not even mention dm-crypt). But what those
counters say is that the IO completion doesn't trigger so the
PageWriteback flag is still set. Such a page is not reclaimable
obviously. So I would check the IO delivery path and focus on the
potential dm-crypt involvement if you suspect this is a contributing
factor.
 
> Also, if the pages are not written to disk, shouldn't something error
> out or slow dd down?

Writers are normally throttled when we hit the dirty limit. You seem to have
dirty_ratio set to 20% which is quite a lot considering how much memory
you have. If you get back to the memory info from the OOM killer report:
[18907.592209] active_anon:110314 inactive_anon:295 isolated_anon:0
active_file:27534 inactive_file:819673 isolated_file:160
unevictable:13001 dirty:167859 writeback:651864 unstable:0
slab_reclaimable:177477 slab_unreclaimable:1817501
mapped:934 shmem:588 pagetables:7109 bounce:0
free:49928 free_pcp:45 free_cma:0

The dirty+writeback is ~9%. What is more interesting, though, LRU
pages are negligible compared to the memory size (~11%). Note the number of
unreclaimable slab pages (~20%). Who is consuming those objects?
Where is the rest 70% of memory hiding?
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] Page Allocation Failures/OOM with dm-crypt on software RAID10 (Intel Rapid Storage)

2016-07-12 Thread Michal Hocko
On Tue 12-07-16 10:27:37, Matthias Dahl wrote:
> Hello,
> 
> I posted this issue already on linux-mm, linux-kernel and dm-devel a
> few days ago and after further investigation it seems like that this
> issue is somehow related to the fact that I am using an Intel Rapid
> Storage RAID10, so I am summarizing everything again in this mail
> and include linux-raid in my post. Sorry for the noise... :(
> 
> I am currently setting up a new machine (since my old one broke down)
> and I ran into a lot of " Unable to allocate memory on node -1" warnings
> while using dm-crypt. I have attached as much of the full log as I could
> recover.
> 
> The encrypted device is sitting on a RAID10 (software raid, Intel Rapid
> Storage). I am currently limited to testing via Linux live images since
> the machine is not yet properly setup but I did my tests across several
> of those.
> 
> Steps to reproduce are:
> 
> 1)
> cryptsetup -s 512 -d /dev/urandom -c aes-xts-plain64 open --type plain
> /dev/md126p5 test-device
> 
> 2)
> dd if=/dev/zero of=/dev/mapper/test-device status=progress bs=512K
> 
> While running and monitoring the memory usage with free, it can be seen
> that the used memory increases rapidly and after just a few seconds, the
> system is out of memory and page allocation failures start to be issued
> as well as the OOM killer gets involved.

Here are two instances of the oom killer Mem-Info:

[18907.592206] Mem-Info:
[18907.592209] active_anon:110314 inactive_anon:295 isolated_anon:0
active_file:27534 inactive_file:819673 isolated_file:160
unevictable:13001 dirty:167859 writeback:651864 unstable:0
slab_reclaimable:177477 slab_unreclaimable:1817501
mapped:934 shmem:588 pagetables:7109 bounce:0
free:49928 free_pcp:45 free_cma:0

[18908.976349] Mem-Info:
[18908.976352] active_anon:109647 inactive_anon:295 isolated_anon:0
active_file:27535 inactive_file:819602 isolated_file:128
unevictable:13001 dirty:167672 writeback:652038 unstable:0
slab_reclaimable:177477 slab_unreclaimable:1817828
mapped:934 shmem:588 pagetables:7109 bounce:0
free:50252 free_pcp:91 free_cma:0

This smells like file pages are stuck in the writeback somewhere and the
anon memory is not reclaimable because you do not have any swap device.
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH 17/20] dm: get rid of superfluous gfp flags

2016-05-02 Thread Michal Hocko
On Fri 29-04-16 14:54:52, Mike Snitzer wrote:
> On Thu, Apr 28 2016 at  9:24am -0400,
> Michal Hocko  wrote:
> 
> > From: Michal Hocko 
> > 
> > copy_params seems to be little bit confused about which allocation flags
> > to use. It enforces GFP_NOIO even though it uses
> > memalloc_noio_{save,restore} which enforces GFP_NOIO at the page
> > allocator level automatically (via memalloc_noio_flags). It also
> > uses __GFP_REPEAT for the __vmalloc request which doesn't make much
> > sense either because vmalloc doesn't rely on costly high order
> > allocations. Let's just drop the __GFP_REPEAT and leave the further
> > cleanup to later changes.
> > 
> > Cc: Shaohua Li 
> > Cc: Mikulas Patocka 
> > Cc: dm-devel@redhat.com
> > Signed-off-by: Michal Hocko 
> 
> I've taken this patch for 4.7 but editted the header, see:
> https://git.kernel.org/cgit/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.7&id=0222c76e96163355620224625c1cd80991086dc7

Thanks!

-- 
Michal Hocko
SUSE Labs



[dm-devel] [PATCH] md: simplify free_params for kmalloc vs vmalloc fallback

2016-04-29 Thread Michal Hocko
From: Michal Hocko 

Use kvfree rather than DM_PARAMS_[KV]MALLOC specific param flags.

Cc: Shaohua Li 
Cc: Mikulas Patocka 
Cc: dm-devel@redhat.com
Signed-off-by: Michal Hocko 
---
Hi,
this is a rebase on top of dropped "dm: clean up GFP_NIO usage" which
should be dropped as per the feedback from Mikulas.

 drivers/md/dm-ioctl.c | 16 +++-
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 2c7ca258c4e4..e66e5b43bc18 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1670,19 +1670,14 @@ static int check_version(unsigned int cmd, struct 
dm_ioctl __user *user)
return r;
 }
 
-#define DM_PARAMS_KMALLOC  0x0001  /* Params alloced with kmalloc */
-#define DM_PARAMS_VMALLOC  0x0002  /* Params alloced with vmalloc */
-#define DM_WIPE_BUFFER 0x0010  /* Wipe input buffer before returning 
from ioctl */
+#define DM_WIPE_BUFFER 0x0001  /* Wipe input buffer before returning 
from ioctl */
 
 static void free_params(struct dm_ioctl *param, size_t param_size, int 
param_flags)
 {
if (param_flags & DM_WIPE_BUFFER)
memset(param, 0, param_size);
 
-   if (param_flags & DM_PARAMS_KMALLOC)
-   kfree(param);
-   if (param_flags & DM_PARAMS_VMALLOC)
-   vfree(param);
+   kvfree(param);
 }
 
 static int copy_params(struct dm_ioctl __user *user, struct dm_ioctl 
*param_kernel,
@@ -1714,19 +1709,14 @@ static int copy_params(struct dm_ioctl __user *user, 
struct dm_ioctl *param_kern
 * Use kmalloc() rather than vmalloc() when we can.
 */
dmi = NULL;
-   if (param_kernel->data_size <= KMALLOC_MAX_SIZE) {
+   if (param_kernel->data_size <= KMALLOC_MAX_SIZE)
dmi = kmalloc(param_kernel->data_size, GFP_NOIO | __GFP_NORETRY 
| __GFP_NOMEMALLOC | __GFP_NOWARN);
-   if (dmi)
-   *param_flags |= DM_PARAMS_KMALLOC;
-   }
 
if (!dmi) {
unsigned noio_flag;
noio_flag = memalloc_noio_save();
dmi = __vmalloc(param_kernel->data_size, GFP_NOIO | __GFP_HIGH 
| __GFP_HIGHMEM, PAGE_KERNEL);
memalloc_noio_restore(noio_flag);
-   if (dmi)
-   *param_flags |= DM_PARAMS_VMALLOC;
}
 
if (!dmi) {
-- 
2.8.0.rc3



Re: [dm-devel] [PATCH] md: simplify free_params for kmalloc vs vmalloc fallback

2016-04-29 Thread Michal Hocko
On Thu 28-04-16 11:40:59, Mikulas Patocka wrote:
[...]
> There are many users that use one of these patterns:
> 
>   if (size <= some_threshold)
>   p = kmalloc(size);
>   else
>   p = vmalloc(size);
> 
> or
> 
>   p = kmalloc(size);
>   if (!p)
>   p = vmalloc(size);
> 
> 
> For example: alloc_fdmem, seq_buf_alloc, setxattr, getxattr, ipc_alloc, 
> pidlist_allocate, get_pages_array, alloc_bucket_locks, 
> frame_vector_create. If you grep the kernel for vmalloc, you'll find this 
> pattern over and over again.

It is certainly good to address a common pattern by a helper if it makes
the code easier to follow IMHO.

> 
> In alloc_large_system_hash, there is
>   table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
> - that is clearly wrong because __vmalloc doesn't respect GFP_ATOMIC

I have seen this code some time already. I guess it was Al complaining
about it but then I just forgot about it. I have no idea why GFP_ATOMIC
was used there. This predates git times but it should be
https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.10/2.6.10-mm1/broken-out/alloc_large_system_hash-numa-interleaving.patch
The changelog is quite verbose but no mention about this ugliness.

So I do agree that the above should be fixed and a common helper might
be interesting but I am afraid we are getting off topic here.

Thanks!
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH 19/20] md: simplify free_params for kmalloc vs vmalloc fallback

2016-04-29 Thread Michal Hocko
On Thu 28-04-16 11:37:31, Mike Snitzer wrote:
> On Thu, Apr 28 2016 at  9:24am -0400,
> Michal Hocko  wrote:
> 
> > From: Michal Hocko 
> > 
> > Use kvfree rather than DM_PARAMS_[KV]MALLOC specific param flags.
> > 
> > Cc: Shaohua Li 
> > Cc: Mikulas Patocka 
> > Cc: dm-devel@redhat.com
> > Signed-off-by: Michal Hocko 
> 
> Nack, seriously, this is the 3rd time this patch has been attempted.
> Did you actually test the change?  It'll crash very quickly, see:
> 
> https://www.redhat.com/archives/dm-devel/2016-April/msg00103.html

You are right! My bad, I should have checked the other callers!

-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH 18/20] dm: clean up GFP_NIO usage

2016-04-29 Thread Michal Hocko
On Thu 28-04-16 10:20:09, Mikulas Patocka wrote:
> 
> 
> On Thu, 28 Apr 2016, Michal Hocko wrote:
> 
> > From: Michal Hocko 
> > 
> > copy_params uses GFP_NOIO for explicit allocation requests because this
> > might be called from the suspend path. To quote Mikulas:
> > : The LVM tool calls suspend and resume ioctls on device mapper block
> > : devices.
> > :
> > : When a device is suspended, any bio sent to the device is held. If the
> > : resume ioctl did GFP_KERNEL allocation, the allocation could get stuck
> > : trying to write some dirty cached pages to the suspended device.
> > :
> > : The LVM tool and the dmeventd daemon use mlock to lock its address space,
> > : so the copy_from_user/copy_to_user call cannot trigger a page fault.
> > 
> > Relying on the mlock is quite fragile and we have a better way in kernel
> > to enforce NOIO which is already used for the vmalloc fallback. Just use
> > memalloc_noio_{save,restore} around the whole copy_params function which
> > will force the same also to the page fault paths via copy_{from,to}_user.
> 
> The userspace memory is locked, so we don't need to use memalloc_noio_save 
> around copy_from_user. If the memory weren't locked, memalloc_noio_save 
> wouldn't help us to prevent the IO.

OK, you are right. Got your point. You would have to read from disk to
fault memory in so this is not just about not performing IO during the
reclaim.

So scratch this patch then.

Thanks!
-- 
Michal Hocko
SUSE Labs



Re: [dm-devel] [PATCH] md: simplify free_params for kmalloc vs vmalloc fallback

2016-04-29 Thread Michal Hocko
On Thu 28-04-16 11:04:05, Mikulas Patocka wrote:
> Acked-by: Mikulas Patocka 

Thanks!

> BTW. we could also use kvmalloc to complement kvfree, proposed here: 
> https://www.redhat.com/archives/dm-devel/2015-July/msg00046.html

If there are sufficient users (I haven't checked other than quick git
grep on KMALLOC_MAX_SIZE and there do not seem that many) who are
sharing the same fallback strategy then why not. But I suspect that some
would rather fallback earlier and even do not attempt larger than e.g.
order-1 requests.
-- 
Michal Hocko
SUSE Labs


